Use Spark or Hadoop to delete duplicate two-way relational data

I have a batch of data (about 10 billion rows) that looks like this:

ID FROM TO
1   A    B
2   A    C
3   B    A
4   C    A

I want to delete the duplicate two-way (bidirectional) relations, keeping only one direction of each pair:

ID FROM TO
1   A    B
2   A    C

1. Because the data volume is so large, a Bloom filter is not suitable.
2. Deduplicating with database queries is far too slow.
3. Would Spark or Hadoop be more appropriate for processing this much data? All the deduplication solutions I found online boil down to a GROUP BY on a single field, which doesn't fit my data.


You can sort the FROM and TO fields within each row with Spark, so every pair is stored in a canonical order. The data above then becomes:

ID FROM TO
1   A    B
2   A    C
3   A    B
4   A    C

and then deduplicate on the (FROM, TO) pair, e.g. with distinct/dropDuplicates or a reduce keyed on the normalized pair.
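A minimal sketch of this canonical-ordering trick in plain Python (the `dedupe_bidirectional` function and tuple layout are illustrative, not from the question; at Spark scale you would express the same logic with DataFrame operations instead):

```python
def dedupe_bidirectional(rows):
    """Keep one row per undirected (FROM, TO) pair, preferring the first ID seen."""
    seen = set()
    result = []
    for row_id, frm, to in rows:
        # Normalize each pair into a canonical order so (A, B) and (B, A)
        # produce the same deduplication key.
        key = (frm, to) if frm <= to else (to, frm)
        if key not in seen:
            seen.add(key)
            result.append((row_id, frm, to))
    return result

rows = [(1, "A", "B"), (2, "A", "C"), (3, "B", "A"), (4, "C", "A")]
print(dedupe_bidirectional(rows))  # [(1, 'A', 'B'), (2, 'A', 'C')]
```

In Spark itself, the equivalent is to build the key with `least(FROM, TO)` and `greatest(FROM, TO)` and then call `dropDuplicates` on those two columns; that way the dedup runs as a distributed shuffle rather than an in-memory set.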
