Failed CCA175 on 15 Sep 2018. This is why

I failed the exam due to this error: "UnicodeEncodeError: 'ascii' codec can't encode character u'\xxxx' in position xxxx: ordinal not in range(128)" in PySpark, when I was trying to save my output to the target HDFS location.

I performed all the operations with DataFrames/Spark SQL and then converted back to an RDD, but when I try to run saveAsTextFile on the RDD, this error shows up. Any idea how to solve this? Thanks…

Hi, did you face the same issue using df.write.text("folder_path")?

Hi,

I have not tried that.

They asked for specific column delimiters for the text file output, so I guess I need to convert the DF back to an RDD and map it again?

Let's say your DF has 3 columns (cust_id, cust_fname, cust_lname) and we are asked to use | as the delimiter; then you can do the following.

df.as("c").selectExpr("concat_ws('|', c.*) as result").write.text("folder_path")
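If you are working in PySpark rather than Scala, note that as is a reserved word in Python, so the same idea would use alias or the concat_ws column function instead. A minimal sketch, assuming a hypothetical customers DataFrame with those three columns:

from pyspark.sql.functions import concat_ws, col

# hypothetical DataFrame with cust_id, cust_fname, cust_lname;
# cast the numeric id to string before joining the fields with "|"
out = customers.select(concat_ws("|", col("cust_id").cast("string"), col("cust_fname"), col("cust_lname")).alias("result"))
out.write.text("folder_path")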

I got the same error. I am sure there was some malformed value that would not save to the text file.

Got the same error in the exam; not able to save to a text file.

Was the source file you had to read from in CSV format?

I got the same error. Any idea how to solve this error while using PySpark?

Did anyone solve this issue? I took the exam today, and out of 7 Spark questions I got the same error on 4 of them while saving the data as a text file. I was using PySpark and saving an RDD to a text file using saveAsTextFile().

This issue arises in Python when we try to convert a value containing non-ASCII characters to a string and then save it to a file. For example:

final2 = final1.map(lambda x:str(x[0]+"\t"+x[1]+str(123)))

If x[0] or x[1] contains a non-ASCII character and we try to convert the result with str(), it throws a UnicodeEncodeError.
To solve this:
final2 = final1.map(lambda x: x[0].encode("utf-8") + "\t" + x[1] + str(123))
final2.saveAsTextFile("path")

Hope this helps.


Thank you for the information. This Unicode error occurs mostly when using PySpark, not Scala. Moreover, when saving bulk data as a text file, we do not know in advance which column (x[0], x[1], or x[2]) contains a non-ASCII character, so encoding the right columns becomes trial and error. Execution is already slow and time-consuming during the exam, so I think one should be cautious when using PySpark.
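One way around the trial and error is to encode every field instead of guessing which one is affected. A minimal sketch, assuming Python 2 and that final1 is an RDD of tuples/Rows of unicode strings (the names and the tab delimiter are just placeholders):

def to_utf8_line(row):
    # encode unicode fields to UTF-8 bytes, stringify everything else
    fields = [f.encode("utf-8") if isinstance(f, unicode) else str(f) for f in row]
    return "\t".join(fields)

final2 = final1.map(to_utf8_line)
final2.saveAsTextFile("path")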

In PySpark 2.4.6 (lab), the only workaround I could find for this error is to bring the data to the driver node and perform the encoding to UTF-8 there.

I ran into this error when studying Arun's problems.

pyspark2 --master yarn \
  --conf spark.ui.port=12765 \
  --num-executors 6 \
  --executor-cores 2 \
  --executor-memory 2G \
  --packages com.databricks:spark-avro_2.11:4.0.0 \
  --jars /usr/share/java/mysql-connector-java.jar \
  --driver-class-path /usr/share/java/mysql-connector-java.jar

# LOAD MYSQL WITH PRODUCT DATA
spark.conf.set("spark.sql.shuffle.partitions", 12)
spark.read.json("/public/retail_db_json/products").write. \
    jdbc("jdbc:mysql://mysql.XXXX.com.br", table="XXXX.products",
         properties={"user": "XXXX", "password": "XXXX", "port": "3306"}, mode="overwrite")

# DEFINE DRIVER NODE ENCODING FOR THE SESSION
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

# GET DATA FROM MYSQL AND SAVE IT TO A VARIABLE LOCAL TO THE DRIVER NODE
local = spark.read.jdbc("jdbc:mysql://mysql.XXXX.com.br", table="XXXX.products",
                        properties={"user": "XXXX", "password": "XXXX", "port": "3306"}).collect()
local2 = map(list, local)
local3 = map(lambda r: ["|".join(map(str, r))], local2)
sc.parallelize(local3, 1).map(lambda x: x[0]).saveAsTextFile("/user/marceltoledo/arun/cloudera/products")

Are PySpark users still facing the same problem? Any update from recent exam-takers regarding this issue would be appreciated. Thanks!

I took the CCA 175 exam on 25/09/2020 and I did not face this issue.

Hi,

After a couple of hours of searching for an answer, executing the two simple commands below before launching pyspark finally worked for me. Most of the answers I saw suggested only the 2nd command, but I realized the default Python version in the ITVersity labs is Python 2, so I thought changing the default to Python 3 would make sense. Hope this works for anyone who experiences this issue.

export PYSPARK_PYTHON=python3
export PYTHONIOENCODING=utf-8
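If you want to confirm the settings actually took effect, a minimal check from inside the pyspark shell (assuming the variables were exported in the same terminal session that launches pyspark):

import sys, os
# should report a Python 3.x version if PYSPARK_PYTHON=python3 took effect
print(sys.version)
# should print "utf-8" when PYTHONIOENCODING is set
print(os.environ.get("PYTHONIOENCODING"))
# on Python 3 the default encoding is already utf-8, so str() no longer trips over non-ASCII
print(sys.getdefaultencoding())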