I have a batch of data (10 billion rows) as follows:
ID  FROM  TO
1   A     B
2   A     C
3   B     A
4   C     A
I want to delete duplicate two-way relations, where (A, B) and (B, A) count as the same relation, so the result looks like:
ID  FROM  TO
1   A     B
2   A     C
1. Because the amount of data is so large, a Bloom filter is not suitable.
2. Deduplicating with database queries is too slow.
3. Would Spark or Hadoop be more appropriate for processing data at this scale? All the deduplication solutions I have found online group by a single field, which doesn't help with my data, since the duplicates here span two columns in either order. A sketch of what I have in mind follows.
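Here is a minimal sketch of the approach I am considering in PySpark: normalize each row so that both directions of a pair map to the same key, then drop duplicates on that key. The input/output paths, CSV format, and column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-bidirectional").getOrCreate()

# Hypothetical input; schema is ID, FROM, TO as in the sample above.
df = spark.read.csv("hdfs:///data/relations.csv", header=True)

# Normalize each pair so (A, B) and (B, A) collapse to the same key:
# key1 = min(FROM, TO), key2 = max(FROM, TO).
normalized = (df
    .withColumn("key1", F.least("FROM", "TO"))
    .withColumn("key2", F.greatest("FROM", "TO")))

# Keep one row per unordered pair; Spark shuffles by (key1, key2),
# so the work is distributed across the cluster.
deduped = normalized.dropDuplicates(["key1", "key2"]).drop("key1", "key2")

deduped.write.csv("hdfs:///data/relations_deduped", header=True)
```

Note that `dropDuplicates` keeps an arbitrary row per pair; if the row with the smallest ID must be kept (as in the desired output above), a `groupBy("key1", "key2")` with `F.min("ID")` would be needed instead. Is this the right direction, or is there a more efficient way at this scale?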