SoFunction
Updated on 2024-11-19

Spark: how to convert between RDDs and DataFrames

A DataFrame is a dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be built from a variety of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.

The DataFrame API is available in Scala, Java, Python, and R.

In Scala and Java, a DataFrame is represented by a Dataset of Rows.

In the Scala API, DataFrame is simply a type alias for Dataset[Row], whereas in the Java API users need to use Dataset<Row> to represent a DataFrame.

Throughout this document, we often refer to Scala/Java Datasets of Rows as DataFrames.

So how do we convert between a DataFrame and Spark's core data structure, the RDD?

The code is as follows:

# -*- coding: utf-8 -*-
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.sql import Row

if __name__ == "__main__":
    # Initialize SparkSession
    spark = SparkSession \
        .builder \
        .appName("RDD_and_DataFrame") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    sc = spark.sparkContext

    # Load a text file with one "name,salary" record per line (example path)
    lines = sc.textFile("employee.txt")
    parts = lines.map(lambda l: l.split(","))
    employee = parts.map(lambda p: Row(name=p[0], salary=int(p[1])))

    # RDD to DataFrame
    employee_temp = spark.createDataFrame(employee)

    # Display DataFrame data
    employee_temp.show()

    # Create a temporary view
    employee_temp.createOrReplaceTempView("employee")
    # Filter data with SQL
    employee_result = spark.sql(
        "SELECT name,salary FROM employee "
        "WHERE salary >= 14000 AND salary <= 20000")

    # DataFrame to RDD conversion
    result = employee_result.rdd \
        .map(lambda p: "name: " + p.name + " salary: " + str(p.salary)) \
        .collect()

    # Print RDD data
    for n in result:
        print(n)

That covers the methods for converting between RDDs and DataFrames in Spark. I hope it serves as a useful reference, and I hope you will continue to support me.