Why does the sortBy function of Spark generate 4 MapPartitionsRDD?

I ran two programs in spark-shell.
The first one uses sortBy:

import org.apache.spark.rdd.RDD

// Sample (word, count) pairs
val list1: List[(String, Int)] = List(("the", 12), ("they", 2), ("do", 4), ("wild", 1), ("and", 5), ("into", 4))
val listRDD1: RDD[(String, Int)] = sc.parallelize(list1)
// Sort by count, descending
val result1: RDD[(String, Int)] = listRDD1.sortBy(_._2, ascending = false)
result1.collect()

Looking at the program's DAG in the web UI, it results in three stages:

4 MapPartitionsRDD, 3 ShuffledRDD, shuffledRDD
sortBy:

keyBy, shuffle, MapPartitionsRDD; values, shuffled, MapPartitionsRDD, MapPartitionsRDD, sortByKey
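
For reference, in the Spark source `RDD.sortBy(f, ascending, numPartitions)` is implemented as `keyBy(f).sortByKey(ascending, numPartitions).values`. The sketch below spells those three steps out so the lineage can be compared with `sortBy` directly. It is meant for spark-shell and assumes `sc` is the shell's SparkContext; the names `pairs`, `keyed`, `shuffled`, `manual` and `viaSortBy` are illustrative only.

```scala
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Int)] = sc.parallelize(
  List(("the", 12), ("they", 2), ("do", 4), ("wild", 1), ("and", 5), ("into", 4)))

// Step 1: keyBy is a map, so it yields a MapPartitionsRDD of (key, record)
val keyed: RDD[(Int, (String, Int))] = pairs.keyBy(_._2)
// Step 2: sortByKey range-partitions and shuffles, yielding a ShuffledRDD
val shuffled: RDD[(Int, (String, Int))] = keyed.sortByKey(ascending = false)
// Step 3: values is another map, so it yields a second MapPartitionsRDD
val manual: RDD[(String, Int)] = shuffled.values

// The same sort through the convenience method
val viaSortBy: RDD[(String, Int)] = pairs.sortBy(_._2, ascending = false)

// Print both lineages; the RDD chains should have the same shape
println(manual.toDebugString)
println(viaSortBy.toDebugString)
```

This accounts for two of the MapPartitionsRDD (one from `keyBy`, one from `values`) and for the ShuffledRDD between them.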

DAG stages:

The DAG does show two MapPartitionsRDD being generated, but how is each of them generated? And why is there another parallelize phase in the middle? I would appreciate an explanation from anyone who knows.
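
A likely explanation for the extra parallelize phase: `sortByKey` builds a `RangePartitioner`, which has to sample the keys to pick partition boundaries. As far as I understand the Spark source, when there is more than one partition that sampling runs as its own job over the parent RDD (a `map(_._1)` followed by `mapPartitionsWithIndex` and a `collect`), so the parallelize step is read once for sampling and once for the actual sort. That would also add two more MapPartitionsRDD. The sketch below is one way to observe this in spark-shell; it assumes `sc` is the shell's SparkContext, and the names `keyed` and `sorted` are illustrative.

```scala
import org.apache.spark.rdd.RDD

// Two partitions, so the RangePartitioner actually needs to sample
val keyed: RDD[(Int, (String, Int))] =
  sc.parallelize(Seq(("a", 3), ("b", 1), ("c", 2)), 2).keyBy(_._2)

// No action is called here, yet a one-stage sampling job
// should already appear in the Jobs tab of the web UI
val sorted: RDD[(Int, (String, Int))] = keyed.sortByKey()

// The action triggers a second job with two stages:
// parallelize + keyBy (shuffle write), then shuffle read + values
val result: Array[(String, Int)] = sorted.values.collect()
```

One sampling stage plus the two stages of the sort job would match the three stages seen in the web UI.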

Scala spark

Jul.11,2022
