Why does the sortBy function of Spark generate 4 MapPartitionsRDD?

execute two programs in spark-shell:
first paragraph sortBy:

val list1: List[(String, Int)] = List(("the", 12), ("they", 2), ("do", 4), ("wild", 1), ("and", 5), ("into", 4))
val listRDD1: RDD[(String, Int)] = sc.parallelize(list1)
val result1: RDD[(String, Int)] = listRDD1.sortBy(_._2, false)
result1.collect()

look at the DAG of the program in webui, resulting in three Stage:

clipboard.png

clipboard.png

clipboard.png

4MapPartitionsRDD3ShuffledRDDshuffledRDD
sortBy:


keyByshuffleMapPartitionsRDD, valuesshuffledMapPartitionsRDDMapPartitionsRDDsortByKey

:


DAGStage:

clipboard.png

clipboard.png

clipboard.png

look at the fact that DAG does generate two MapPartitionsRDD, but how are both MapPartitionsRDD generated? And why is there another parallelize phase in the middle? Ask the boss for an answer.

Jul.11,2022
MySQL Query : SELECT * FROM `codeshelper`.`v9_news` WHERE status=99 AND catid='6' ORDER BY rand() LIMIT 5
MySQL Error : Disk full (/tmp/#sql-temptable-64f5-1bed714-31b66.MAI); waiting for someone to free some space... (errno: 28 "No space left on device")
MySQL Errno : 1021
Message : Disk full (/tmp/#sql-temptable-64f5-1bed714-31b66.MAI); waiting for someone to free some space... (errno: 28 "No space left on device")
Need Help?