I want to join two DataFrames on multiple columns using the Java Spark SQL API. Looking at the interface, a multi-column join requires passing a `usingColumns` parameter:

```java
public org.apache.spark.sql.DataFrame join(org.apache.spark.sql.DataFrame right, scala.collection.Seq<java.lang.String> usingColumns, java.lang.String joinType)
```
---
So I converted the Java `List` to a `scala.collection.Seq` myself in Java; the code is as follows:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.DataFrame;
import scala.collection.JavaConverters;

List<String> tmp = Arrays.asList(
        ColumnUtil.PRODUCT_COLUMN,
        ColumnUtil.EVENT_ID_COLUMN
);
scala.collection.Seq<String> usingColumns =
        JavaConverters.asScalaIteratorConverter(tmp.iterator()).asScala().toSeq();
DataFrame unionDf = uvDataframe.join(deviceUvDataframe, usingColumns, "inner");
```
---
Executing the join then fails with the following error:
```
Caused by: java.io.NotSerializableException: java.util.AbstractList$Itr
Serialization stack:
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 49 more
```
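Digging into this, I believe the root cause is that on Scala 2.x, `Iterator.toSeq` delegates to `toStream`, so the resulting lazy `Stream` keeps a reference to the underlying Java iterator (`java.util.AbstractList$Itr`), and that iterator is not `Serializable`. The following plain-Java snippet (no Spark needed) reproduces the same `NotSerializableException` by serializing a `List` iterator directly:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

public class IteratorSerializationDemo {
    public static void main(String[] args) throws IOException {
        List<String> tmp = Arrays.asList("product", "event_id");

        // A java.util.List iterator does not implement Serializable,
        // so anything that captures it fails Java serialization.
        Object iterator = tmp.iterator();
        System.out.println("iterator instanceof Serializable: "
                + (iterator instanceof Serializable));

        try (ObjectOutputStream out =
                new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(iterator);
            System.out.println("serialized OK (unexpected)");
        } catch (NotSerializableException e) {
            // Same failure mode Spark's ClosureCleaner reports.
            System.out.println("NotSerializableException: " + e.getMessage());
        }
    }
}
```

This matches the stack trace above: Spark's `ClosureCleaner.ensureSerializable` tries to Java-serialize the `Seq`, and fails on the captured iterator.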
I have verified that the following two `join` overloads work correctly; only the multi-column variant hits this serialization problem. Does anyone have a solution?

```java
public org.apache.spark.sql.DataFrame join(org.apache.spark.sql.DataFrame right)
public org.apache.spark.sql.DataFrame join(org.apache.spark.sql.DataFrame right, java.lang.String usingColumn)
```
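For reference, here is an untested sketch of a workaround I would expect to avoid the problem: convert the `List` itself rather than its iterator, and force a strict Scala collection with `toList()` so no Java iterator is captured (`ColumnUtil` and the DataFrame variables are from my code above):

```java
List<String> tmp = Arrays.asList(
        ColumnUtil.PRODUCT_COLUMN,
        ColumnUtil.EVENT_ID_COLUMN
);
// asScalaBufferConverter wraps the list directly; toList() copies it into a
// strict, serializable scala.collection.immutable.List instead of a lazy Stream.
scala.collection.Seq<String> usingColumns =
        JavaConverters.asScalaBufferConverter(tmp).asScala().toList();
DataFrame unionDf = uvDataframe.join(deviceUvDataframe, usingColumns, "inner");
```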