Problem description
Hi, while running pyspark on the company's production cluster I used the jieba word-segmentation library. The import itself succeeds, but when I call the segmentation function inside an RDD operation, it reports that there is no module named jieba. None of this happens in my local virtual machine.
Environment background and what I have tried
I tried reinstalling jieba as root.
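To narrow it down, here is a minimal diagnostic sketch that can be run in the pyspark shell to see which Python binary each executor uses and whether jieba is importable there (the probe helper, the partition count, and the input range are all illustrative; sc is the usual SparkContext):

import sys

def probe(_):
    # Runs on the executors: report the Python binary in use and
    # whether jieba is importable in that interpreter.
    try:
        import jieba  # noqa: F401
        status = "jieba OK"
    except ImportError as e:
        status = "jieba MISSING: %s" % e
    return [(sys.executable, status)]

# One record per partition, de-duplicated across executors.
print(sc.parallelize(range(100), 20).mapPartitions(probe).distinct().collect())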
Related code
# Please paste the code text below (do not replace the code with pictures)
import jieba
[x for x in jieba.cut("这是一个测试文本")]
Building prefix dict from the default dictionary.
Loading model from cache /tmp/jieba.cache
Loading model cost 0.448 seconds.
Prefix dict has been built succesfully.
[u'\u8fd9\u662f', u'\u4e00\u4e2a', u'\u6d4b\u8bd5', u'\u6587\u672c']
# The plain call to jieba above segments the text successfully.
cut = name.map(lambda x: [y for y in jieba.cut(x)])
cut.count()
# The code above runs without error in the local virtual machine, but an error is raised when the job runs on the production cluster (accessed through the bastion host).
What result do you expect? What error message do you actually see?
18-07-13 10:16:17 WARN scheduler.TaskSetManager: Lost task 6.0 in stage 1.0 (TID 16, hadoop13, executor 17): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/ opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/ opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/serializers.py", line 164, in _ read_with_length
return self.loads(obj)
File "/ opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
File "/ opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/spark/python/pyspark/cloudpickle.py", line 664, in subimport
__import__(name)
ImportError: ("No module named jieba", < function subimport at 0x27a9488 >, ("jieba",))
) at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
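From the traceback, the import fails inside cloudpickle's subimport on the executor, so I suspect jieba is installed only on the driver node and missing from the worker nodes' Python environment. The clean fix would be to install jieba on every node, or to point PYSPARK_PYTHON at an environment that has it. As a stopgap, a workaround I have seen suggested is to zip the package out of the driver's site-packages and ship it with addPyFile; a sketch, assuming the pyspark shell (the paths are illustrative, and jieba's bundled dictionary may not load cleanly from a zip):

import os
import shutil
import jieba

# Zip the jieba package directory next to the driver's installed copy.
pkg_dir = os.path.dirname(jieba.__file__)
zip_path = shutil.make_archive("/tmp/jieba", "zip",
                               root_dir=os.path.dirname(pkg_dir),
                               base_dir="jieba")

# Distribute the zip to every executor; it gets added to their sys.path.
sc.addPyFile(zip_path)

cut = name.map(lambda x: [y for y in jieba.cut(x)])
cut.count()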