A DataFrame is a dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be built from a wide variety of sources, such as structured data files, Hive tables, external databases, or existing RDDs.
The DataFrame API is available in Scala, Java, Python, and R.
In Scala and Java, a DataFrame is represented by a Dataset of Row objects. In the Scala API, DataFrame is simply a type alias for Dataset[Row], while in the Java API users work with Dataset&lt;Row&gt; to represent a DataFrame.
Throughout this document, we refer to Scala/Java Datasets of Rows as DataFrames.
So how do we convert between a DataFrame and Spark's core data structure, the RDD?
The code is as follows:
# -*- coding: utf-8 -*-
from __future__ import print_function

from pyspark.sql import SparkSession
from pyspark.sql import Row

if __name__ == "__main__":
    # Initialize SparkSession
    spark = SparkSession \
        .builder \
        .appName("RDD_and_DataFrame") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    sc = spark.sparkContext

    # Read the source file into an RDD of lines
    # (the input path was omitted in the original; "employee.txt" is a placeholder)
    lines = sc.textFile("employee.txt")
    parts = lines.map(lambda l: l.split(","))
    employee = parts.map(lambda p: Row(name=p[0], salary=int(p[1])))

    # RDD to DataFrame
    employee_temp = spark.createDataFrame(employee)

    # Display DataFrame data
    employee_temp.show()

    # Create a temporary view
    employee_temp.createOrReplaceTempView("employee")

    # Filter the data with SQL
    employee_result = spark.sql(
        "SELECT name,salary FROM employee "
        "WHERE salary >= 14000 AND salary <= 20000")

    # DataFrame to RDD conversion
    result = employee_result.rdd \
        .map(lambda p: "name: " + p.name + " salary: " + str(p.salary)) \
        .collect()

    # Print RDD data
    for n in result:
        print(n)
That covers converting between RDDs and DataFrames in Spark. I hope it serves as a useful reference, and thank you for your support.