I have just started using Spark Streaming and have a few questions about checkpointing. There are two types of checkpoint: metadata checkpointing for the driver, and data checkpointing. The manual says that data checkpoints are written only if you use a stateful transformation. So...
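For context, a minimal sketch of the stateful case that triggers data checkpointing, assuming a socket source and a hypothetical HDFS checkpoint path:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("CheckpointDemo")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Data checkpointing becomes mandatory once a stateful transformation is used
    ssc.checkpoint("hdfs:///tmp/ckpt")  // hypothetical path

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      // updateStateByKey is stateful, so its RDDs are checkpointed periodically
      .updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) =>
        Some(values.sum + state.getOrElse(0)))

    counts.print()
    ssc.start()
    ssc.awaitTermination()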
Hello, everyone. Please excuse a problem that has been bothering us for a long time. I have installed Hadoop 2.6 and Spark 2.2 on Ubuntu and can write Hadoop code in Eclipse. That part is done, but I don't know how to write the Spark code. Do you want to do...
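For a starting point, a minimal Spark application skeleton in Scala; the input and output paths are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount")
        val sc = new SparkContext(conf)
        // Placeholder paths; replace with real input/output locations
        sc.textFile("hdfs:///input/data.txt")
          .flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile("hdfs:///output/wordcount")
        sc.stop()
      }
    }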
Executing two programs in spark-shell. The first snippet uses sortBy: val list1: List[(String, Int)] = List(("the", 12), ("they", 2), ("do", 4), ("wild", 1), ("and", 5), ("into", 4)) val listRDD1: RDD...
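Completing the truncated snippet under the assumption that the goal is to sort the pairs by their counts, a runnable spark-shell version:

    val list1: List[(String, Int)] =
      List(("the", 12), ("they", 2), ("do", 4), ("wild", 1), ("and", 5), ("into", 4))
    val listRDD1 = sc.parallelize(list1)
    // sortBy takes a key function; the second argument selects descending order
    val sorted = listRDD1.sortBy(_._2, ascending = false)
    sorted.collect().foreach(println)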
Querying data with Spark in pages. The ordinary sql() method does not support paging SQL statements. It is said that you can add a sequence column to achieve this, which in Scala basically means adding an "id" field to the original schema information. val schema: S...
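One common way to add such a sequence column is zipWithIndex; a sketch assuming an existing DataFrame df and SparkSession spark, with a hypothetical page size and page number:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Attach a stable row id with zipWithIndex, then filter an id range per page
    val withId = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    val pagedSchema = StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
    val pagedDf = spark.createDataFrame(withId, pagedSchema)

    val pageSize = 100L
    val page = 3L  // hypothetical page number
    pagedDf.where(s"id >= ${(page - 1) * pageSize} and id < ${page * pageSize}").show()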
The data source is Kafka, and one field is a timestamp. We want to calculate the difference between the timestamps of two consecutive records, then add a new field to store this value and send it out. I checked around: should we use reduceByKeyAndWindow? Wit...
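reduceByKeyAndWindow aggregates within a window; for a running "difference from the previous record" per key, updateStateByKey is one alternative. A sketch assuming a hypothetical DStream[(String, Long)] named events of (key, timestamp) pairs; note it only diffs against the last timestamp of each batch, and requires ssc.checkpoint to be set:

    // Keep the previous timestamp per key in state and emit the delta
    val deltas = events.updateStateByKey[(Long, Long)] {
      (newTs: Seq[Long], state: Option[(Long, Long)]) =>
        newTs.lastOption match {
          case Some(ts) =>
            val prev = state.map(_._1).getOrElse(ts)
            Some((ts, ts - prev))  // (latest timestamp, diff from previous)
          case None => state
        }
    }
    deltas.map { case (key, (ts, diff)) => (key, ts, diff) }.print()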
Why does Spark evaluate RDDs lazily? Why is an RDD actually computed only the first time it is used in an action operation? ...
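Briefly: transformations only record lineage, so Spark sees the whole plan before running anything, can pipeline narrow stages together, and never materializes intermediate results that nothing consumes. A small illustration:

    val rdd = sc.textFile("data.txt")  // nothing is read yet
      .map(_.toUpperCase)              // transformations only record lineage
      .filter(_.nonEmpty)
    // The file is read and the map/filter actually run only when an action fires:
    val n = rdd.count()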
I want to use the multi-column join of DataFrame in Java Spark SQL. Take a look at this interface: if you want to join on multiple columns, you need to pass in usingColumns. public org.apache.spark.sql.DataFrame join(org.apache.spark.sql.Da...
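For reference, the usingColumns overload takes a Scala Seq; from Java it is usually built from a java.util.List via JavaConverters. A sketch in Scala, where df1, df2 and the column names "userid" and "day" are hypothetical:

    // Equi-join on two columns at once
    val joined = df1.join(df2, Seq("userid", "day"), "inner")
    joined.show()

From Java, something like scala.collection.JavaConverters.asScalaBuffer(Arrays.asList("userid", "day")) can be passed as usingColumns, since the resulting Buffer is a Seq.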
How do I serialize a collection of objects into an RDD in PySpark? For example, the simplest case: class test: data = 1; def __init__(self): self.property = 0; def test2(self): print("hello"); and then if __name__ == "__main__": p1 = test(); p2 = test()...
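In Scala terms the same idea is just parallelizing a collection of serializable instances (PySpark's sc.parallelize likewise pickles each object); a sketch with a hypothetical Test class:

    // The class must be serializable so instances can be shipped to executors;
    // case classes are serializable by default
    case class Test(property: Int = 0, data: Int = 1)

    val objects = Seq(Test(), Test(1))
    val rdd = sc.parallelize(objects)
    rdd.map(_.property).collect().foreach(println)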
Recently I have basically mastered the fundamentals of Spark through self-study. I would like to practice, so may I ask: are there any data sets to play with? Or, if there are any good open-source learning materials, I would like to further improve my unde...
Problem description: in a Spark SQL project, the SQL scripts are placed under the resources sql directory (there are many scripts for different businesses). Locally, the code loads a SQL script using the this.getClass.getResource().getPath method to get the pa...
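getPath only resolves while the resource is a plain file on disk; once the scripts are packaged into a jar there is no filesystem path, so reading the resource as a stream is the usual fix. A sketch with a hypothetical script name, assuming a SparkSession spark:

    import scala.io.Source

    // Works both from the IDE and from inside a packaged jar
    val in = this.getClass.getResourceAsStream("/sql/report.sql")
    val sqlText = Source.fromInputStream(in, "UTF-8").mkString
    spark.sql(sqlText)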
Error log: line 341 of com.slhan.service.BusinessService fetches the value of a broadcast variable. 18/09/08 13:50:02 ERROR scheduler.JobScheduler: Error running job streaming job 1536385800000 ms.1 java.io.IOException: com.esotericsoftware.kr...
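Kryo IOExceptions when reading a broadcast variable in a streaming job often appear after recovery from a checkpoint, because the broadcast handle captured in the closure no longer exists. The pattern recommended in the Spark Streaming docs is a lazily instantiated singleton, sketched here with hypothetical lookup data and a hypothetical DStream named stream:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Lazily instantiated singleton so the broadcast is recreated on restart
    object BroadcastHolder {
      @volatile private var instance: Broadcast[Map[String, String]] = _
      def getInstance(sc: SparkContext): Broadcast[Map[String, String]] = {
        if (instance == null) synchronized {
          if (instance == null) {
            instance = sc.broadcast(Map("k" -> "v"))  // hypothetical lookup data
          }
        }
        instance
      }
    }

    stream.foreachRDD { rdd =>
      val lookup = BroadcastHolder.getInstance(rdd.sparkContext)
      rdd.foreach(r => lookup.value.get(r.toString))
    }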
I wrote a word count program for Spark, which I can debug in Eclipse in local mode, or run through the java -jar command after packaging with Maven: SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount"); sparkConf.setMaster("loc...
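Note that hard-coding setMaster pins the job to that master; the usual practice is to leave it out of the code and supply it at submit time. A sketch:

    import org.apache.spark.SparkConf

    // Leave the master out of the code so it can be chosen at submit time:
    //   spark-submit --master spark://host:7077 --class WordCount app.jar
    val conf = new SparkConf().setAppName("JavaWordCount")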
I have a batch of data (10 billion rows) as follows:

    ID FROM TO
    1  A    B
    2  A    C
    3  B    A
    4  C    A

I want to delete the duplicate two-way relations, leaving:

    ID FROM TO
    1  A    B
    2  A    C

1. Because the amount of data is too large, bloomfilter is no...
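One approach that needs no Bloom filter: normalize each pair so the smaller endpoint comes first, then deduplicate. A sketch assuming a SparkSession spark, shown on the sample rows above:

    import org.apache.spark.sql.functions.{col, greatest, least}

    val edges = spark.createDataFrame(Seq(
      (1L, "A", "B"), (2L, "A", "C"), (3L, "B", "A"), (4L, "C", "A")
    )).toDF("id", "from", "to")

    // (B, A) and (A, B) normalize to the same (u, v) key and collapse together
    val deduped = edges
      .withColumn("u", least(col("from"), col("to")))
      .withColumn("v", greatest(col("from"), col("to")))
      .dropDuplicates("u", "v")
    deduped.show()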
Purpose: there are two large data sets in Spark that require a join. Both inputs contain the field userid, and they need to be associated on userid. I hope to avoid the shuffle. Completed so far: I pre-processed the two data sets into 1w (10,000) f...
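For reference, Spark's built-in way to get a shuffle-free equi-join from pre-partitioned data is bucketing: write both sides bucketed and sorted on userid with the same bucket count, then join the saved tables. A sketch with hypothetical DataFrames dfA and dfB:

    // Co-bucketed, co-sorted tables let Spark plan a sort-merge join
    // without shuffling either side at join time
    dfA.write.bucketBy(10000, "userid").sortBy("userid").saveAsTable("a_bucketed")
    dfB.write.bucketBy(10000, "userid").sortBy("userid").saveAsTable("b_bucketed")

    val joined = spark.table("a_bucketed").join(spark.table("b_bucketed"), "userid")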
1. JSON data is now available as follows: {"id": 11, "data": [{"package": "com.browser1", "activetime": 60000}, {"package": "com.browser6", "activetime": 1205000}, {"package": "com.browser7", "activetime": 1205000}]} {"id": 12...
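Assuming the goal is one row per (id, package, activetime), explode flattens the nested array; a sketch with a hypothetical input path:

    import org.apache.spark.sql.functions.{col, explode}

    // Each input line is one JSON record; explode flattens the data array
    val df = spark.read.json("apps.json")  // hypothetical path
    val flat = df
      .select(col("id"), explode(col("data")).as("d"))
      .select(col("id"), col("d.package"), col("d.activetime"))
    flat.show()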
Problem description: Hi, when running PySpark in the company's production environment I used the jieba word segmenter and found that the import succeeds, but when I call the segmentation function inside an RDD operation it reports that there is no module named jieba, without th...
Business scenario: a large number of JSON files need to be read, re-parsed, and imported into Elasticsearch. The JSON files are saved in different date folders; the size of a single folder is about 80 GB. The number of JSON files under the ...
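A sketch of the read-and-index pipeline, assuming the elasticsearch-spark (ES-Hadoop) connector is on the classpath; the paths, node address, and index name are hypothetical:

    // Glob over the date folders, then write to Elasticsearch via ES-Hadoop
    val df = spark.read.json("/data/json/2018-06-*/")
    df.write
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host:9200")
      .mode("append")
      .save("myindex/doc")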
Slaves have registered, but the master cannot pass work to them (standalone mode). If I open all inbound TCP ports it works, but I cannot do that, for security reasons. 2018-06-04 13:22:44 INFO DAGScheduler:54 - Submitting 100 missing tasks from...
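Spark picks random ephemeral ports for the driver and block manager by default, which is usually why only "open everything" works. Pinning them to fixed values means only those ports need firewall rules; the port numbers below are just examples:

    import org.apache.spark.SparkConf

    // Fixed ports so the firewall only needs a few known openings
    val conf = new SparkConf()
      .setAppName("FixedPorts")
      .set("spark.driver.port", "40000")
      .set("spark.blockManager.port", "40001")
      .set("spark.driver.blockManager.port", "40002")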
I configured Spark's environment according to this link: https://blog.csdn.net/w417950..., but it reports an error when starting. I searched and didn't find a way to solve my problem. I'm a beginner, please forgive me ...
The figure is as follows: def update_model(rdd), mixture_model: it's fine to declare mixture_model directly inside update_model, but then every foreachRDD re-declares the mixture model, which makes it impossible to update the model in real time ...
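In Scala terms (the same pattern works in PySpark with a module-level variable), one way to keep the model across batches is to hold it in driver-side state defined outside foreachRDD; a sketch where a running mean stands in for the mixture model:

    import org.apache.spark.streaming.dstream.DStream

    // Model state lives on the driver, outside foreachRDD, so it persists
    // across batches instead of being re-declared every time
    var count = 0L
    var mean = 0.0

    def attachModelUpdates(stream: DStream[Double]): Unit =
      stream.foreachRDD { rdd =>
        val n = rdd.count()
        if (n > 0) {
          mean = (mean * count + rdd.sum()) / (count + n)
          count += n
        }
        println(s"model mean after batch: $mean")
      }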