How do I convert a DataFrame back to a normal RDD in PySpark?
I need to use the rdd.partitionBy(npartitions, custom_partitioner) method, which is not available on the DataFrame. All of the DataFrame methods return only DataFrame results. So how do I create an RDD from the DataFrame's data?
To convert a PySpark DataFrame to an RDD, simply use the .rdd property:
rdd = df.rdd
The caveat is that this does not give an RDD of plain values: it returns an RDD of Row objects. To get ordinary tuples (or lists) instead, map over the rows:
rdd = df.rdd.map(tuple)
or
rdd = df.rdd.map(list)