
Introduction to PySpark SQL

1 Introduction to Big Data

Big data is one of the hottest topics of our time. But what is big data? It describes a huge set of data that is growing at an alarming rate. Apart from volume and velocity, variety and veracity are the key features of Big Data. Let us discuss volume, velocity, variety, and veracity in detail. These are also known as the 4V characteristics of Big Data.

1.1 Volume

The volume of data (Volume) specifies the amount of data to be processed. Large amounts of data require large machines or distributed systems, and the computation time increases with the volume of data, so if we can parallelize the computation it is better to use a distributed system. Data can be structured, unstructured, or something in between; with unstructured data the situation becomes more complex and computationally intensive. You may be wondering: how big is big data? This is a controversial question, but in general we can say that data we cannot process using traditional systems is defined as big data. Now let's discuss the velocity of data.

1.2 Velocity

More and more organizations are taking data seriously. Large amounts of data are being collected every minute of every day. This means that the velocity of data is increasing. How does a system handle this velocity? The problem becomes complex when the huge inflow of data has to be analyzed in real time. Many systems are being developed to handle this huge inflow of data. Another factor that differentiates traditional data from big data is the diversity of data.

1.3 Variety

The variety of the data makes it so complex that traditional data analytics systems cannot analyze it properly. What variety are we talking about? Isn't data just data? Image data differs from tabular data because it is organized and stored differently. Data can also arrive in any number of file formats, and each format requires a different way of handling it: reading and writing a JSON file is different from dealing with a CSV file. Today, a data scientist must deal with combinations of data types; the data you work with may be a mix of images, videos, text, and more. The variety of big data makes analyzing it more complex.

1.4 Veracity

Can you imagine a computer program with faulty logic producing correct output? Likewise, inaccurate data will produce misleading results. Veracity, or data correctness, is therefore an important concern, and with big data we must also account for anomalies in the data.

2 Introduction to Hadoop

Hadoop is a distributed, scalable framework for solving big data problems. Hadoop was developed by Doug Cutting and Mike Cafarella and is written in Java. It can be installed on a cluster of commodity hardware and scales horizontally on distributed systems.

Running on commodity hardware makes it very cost-effective. When working on commodity hardware, failures are an inevitable problem, but Hadoop provides a fault-tolerant system for data storage and computation. This fault tolerance makes Hadoop very popular.

Hadoop has two components: the first component is HDFS (Hadoop Distributed File System), which is a distributed file system, and the second component is MapReduce. HDFS is used for distributed data storage, and MapReduce is used to perform computations on the data stored in HDFS.

2.1 Introduction to HDFS

HDFS is used to store large amounts of data in a distributed and fault-tolerant manner. HDFS is written in Java and runs on commodity hardware. It was inspired by Google's research paper on the Google File System (GFS). It is a write-once-read-many system that is efficient for large amounts of data. HDFS has two components: NameNode and DataNode.

These two components are Java daemons. The NameNode is responsible for maintaining metadata about the files distributed across the cluster; it is the master node for many DataNodes. HDFS breaks large files into smaller chunks and stores these chunks on different DataNodes, where the actual file data resides. HDFS provides a set of Unix-shell-like commands, but we can also use the Java filesystem API provided by HDFS to work with large files at a finer level. Fault tolerance is achieved by replicating data blocks.

Ordinary HDFS file access happens through a single process. HDFS also provides a very useful utility called distcp, which is commonly used to transfer data from one HDFS system to another in parallel; it uses parallel map tasks to copy the data.

2.2 Introduction to MapReduce

The MapReduce model of computing first appeared in a Google research paper. Hadoop's MapReduce is the computational engine of the Hadoop framework, performing computations on the distributed data in HDFS. MapReduce has been shown to scale horizontally on distributed systems built from commodity hardware, which makes it suitable for very large problems. In MapReduce, problem solving is divided into a Map phase and a Reduce phase: in the Map phase, blocks of data are processed, and in the Reduce phase, aggregation or reduction operations are run on the results of the Map phase. Hadoop's MapReduce framework is also written in Java.

MapReduce follows a master-slave model. In Hadoop 1, this MapReduce computation is managed by two daemons, JobTracker and TaskTracker. JobTracker is the master process that handles many TaskTrackers, and TaskTracker is the slave of the JobTracker. In Hadoop 2, JobTracker and TaskTracker are replaced by YARN.

We can write MapReduce code using the APIs provided by the framework and Java. The Hadoop Streaming module also enables programmers who know Python or Ruby to write MapReduce programs.
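
To make the Map and Reduce phases concrete, here is a minimal word-count sketch in the Hadoop Streaming style, with the mapper and reducer written as plain Python scripts that read from standard input and write to standard output. The file names (mapper.py, reducer.py) and the sample logic are hypothetical and meant only as an illustration.

# mapper.py: a hypothetical Hadoop Streaming mapper that emits (word, 1) pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py: a hypothetical Hadoop Streaming reducer that sums the counts per word.
# Hadoop Streaming delivers the mapper output sorted by key, so equal words arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")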

MapReduce algorithms have many uses. For example, many machine learning algorithms are implemented by Apache Mahout, which can run on top of Hadoop.

But MapReduce is not suitable for iterative algorithms. At the end of each Hadoop job, MapReduce saves the data to HDFS and reads it back for the next job, and as we know, reading and writing files are costly activities. Apache Spark mitigates these drawbacks of MapReduce by providing in-memory data persistence and computation.

More about MapReduce and Mahout can be found on the following pages:

/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean_html/
/users/basics/

3 Introduction to Apache Hive

Computer science is an abstract world. Everyone knows that data is information in the form of bits. Programming languages like C provide abstractions over machine and assembly language, and other high-level languages provide even more abstraction. Structured Query Language (SQL) is one such abstraction, and many data modeling experts around the world use it. Hadoop is well suited for big data analytics, so how can the wide range of users who understand SQL take advantage of the computational power of Hadoop on big data? To write MapReduce programs for Hadoop, users must know a programming language in which such programs can be written.

Everyday problems in the real world follow certain patterns. Some tasks are common in daily work, such as data manipulation, handling missing values, data transformation, and data summarization. Writing MapReduce code for these everyday tasks can be mind-numbing for non-programmers, and writing code just to solve them over and over is not a smart use of time; what does pay off is writing efficient code that is performant, scalable, and extensible. With this in mind, Apache Hive was developed at Facebook so that everyday problems can be solved without writing MapReduce code for common tasks.

In the words of the Hive wiki, Hive is a data warehouse infrastructure built on top of Apache Hadoop. Hive has its own SQL dialect, the Hive Query Language, called HiveQL and sometimes referred to as HQL. Using HiveQL, Hive queries the data in HDFS. Hive runs not only on MapReduce but also on other execution engines, such as Apache Tez and Apache Spark.

Hive provides users with an abstraction over structured data in HDFS that resembles a relational database management system: you can create tables and run SQL-like queries on them. Hive stores table schemas in an RDBMS; Apache Derby is the default RDBMS that ships with the Apache Hive distribution. Apache Derby is written entirely in Java and is an open source RDBMS.

HiveQL commands are converted to Hadoop's MapReduce code, which is then run on the Hadoop cluster.

People who know SQL can easily learn Apache Hive and HiveQL and can use the storage and computational power of Hadoop in their day-to-day big data analytics work. HiveQL is also supported in PySpark SQL, so you can run HiveQL commands there. In addition to executing HiveQL queries, you can read data directly from Hive into PySpark SQL and write results back to Hive.
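
A rough sketch of what this looks like in practice, assuming a SparkSession built with Hive support and an existing Hive table named sales (both assumptions for illustration, not part of the original text):

from pyspark.sql import SparkSession

# Build a SparkSession with Hive support so HiveQL queries and Hive tables are available.
spark = SparkSession.builder \
    .appName("HiveExample") \
    .enableHiveSupport() \
    .getOrCreate()

# Run a HiveQL query against the hypothetical Hive table and get the result as a DataFrame.
df = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
df.show()

# Write the result back to Hive as a new table.
df.write.mode("overwrite").saveAsTable("sales_summary")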

RELATED:

/confluence/display/Hive/Tutorial
/derby/

4 Introduction to Apache Pig

Apache Pig is a data flow framework for performing data analysis on large amounts of data. It was developed by Yahoo and open sourced to the Apache Software Foundation, and it is now available under Apache License version 2.0. Pig's programming language is a scripting language called Pig Latin. Pig is loosely coupled to Hadoop, which means we can connect it to Hadoop and perform many kinds of analysis, but Pig can also be used with other tools such as Apache Tez and Apache Spark.

Apache Hive is used as a reporting tool, whereas Apache Pig is used for Extract, Transform, and Load (ETL). We can extend the functionality of Pig using user-defined functions (UDFs), which can be written in a variety of languages, including Java, Python, Ruby, JavaScript, Groovy, and Jython.

Apache Pig uses HDFS to read and store data and Hadoop's MapReduce to execute the algorithms, and it is similar to Apache Hive in how it uses a Hadoop cluster: on Hadoop, Pig commands are first converted to MapReduce code, which then runs on the Hadoop cluster.

The best part of Pig is that the code for handling everyday problems has already been optimized and tested, so users can just install Pig and start using it. Pig provides a Grunt shell for running interactive Pig commands, so anyone who knows Pig Latin can enjoy the benefits of HDFS and MapReduce without needing to know a high-level programming language like Java or Python.

RELATED:

/docs/
/wiki/Pig_(programming_tool)
/confluence/display/PIG/Index

5 Introduction to Apache Kafka

Apache Kafka is a publish-subscribe distributed messaging platform. It was developed by LinkedIn and later open sourced to the Apache Software Foundation. It is fault-tolerant, scalable, and fast. In Kafka terminology, messages (the smallest unit of data) flow from producers to consumers through Kafka servers and can be persisted and consumed at a later time.

Kafka provides a built-in API that developers can use to build their applications. Next we discuss the three main components of Apache Kafka.

5.1 Producer

A Kafka producer publishes messages to Kafka topics, and a single producer can publish data to multiple topics.

5.2 Broker

The broker is a Kafka server running on a dedicated machine; messages are pushed from the producer to the broker. The broker keeps topics in different partitions, which are replicated to other brokers for fault tolerance. The broker is essentially stateless with respect to consumption, so the consumer must keep track of the messages it has consumed.

5.3 Consumer

The consumer fetches messages from the Kafka broker. Remember, it pulls the messages: the Kafka broker does not push messages to the consumer; instead, the consumer pulls data from the broker. The consumer subscribes to one or more topics on the broker and reads the messages, and it also keeps track of all the messages it has already consumed. Data is retained on the broker for a configured amount of time, so if a consumer fails, it can fetch the data again after a restart.
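
To make the producer and consumer roles concrete, here is a minimal sketch using the third-party kafka-python package (an assumption; any Kafka client library would do), with a hypothetical broker at localhost:9092 and a hypothetical topic named events:

from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes a few messages to the "events" topic on the broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("events", value=f"message-{i}".encode("utf-8"))
producer.flush()  # make sure all buffered messages reach the broker

# Consumer: subscribes to the topic and pulls messages from the broker.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the oldest retained message
    consumer_timeout_ms=5000,      # stop iterating if no message arrives for 5 seconds
)
for message in consumer:
    print(message.topic, message.offset, message.value.decode("utf-8"))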

RELATED:

/quickstart
/documentation/

6 Introduction to Apache Spark

Apache Spark is a general-purpose distributed programming framework that is considered well suited to both iterative and batch data processing. It was developed at the AMP Lab and provides an in-memory computing framework. It is open source software. It works well for batch processing on the one hand and is very effective for real-time or near-real-time data on the other. Machine learning and graph algorithms are iterative by nature, and that is where the magic of Spark comes in: according to its research papers, it is much faster than its counterpart, Hadoop. Data can be cached in memory, and caching intermediate data in iterative algorithms provides amazingly fast processing. Spark can be programmed using Java, Scala, Python, and R.

Thinking of Spark as a modified Hadoop is true only to some extent: we can implement MapReduce algorithms in Spark, and Spark uses the benefits of HDFS, meaning it can read data from and store data to HDFS, and it handles iterative computations efficiently because data can be kept in memory. Besides in-memory computation, it is also well suited to interactive data analysis.
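
A minimal sketch of why keeping data in memory helps iterative work, using cache() on a PySpark RDD (the toy data and the number of iterations are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingSketch").getOrCreate()
sc = spark.sparkContext

# cache() keeps the dataset in memory after the first computation,
# so later iterations do not recompute or re-read it.
numbers = sc.parallelize(range(1_000_000)).cache()

total = 0
for step in range(5):  # an iterative algorithm would loop over the cached data like this
    total += numbers.map(lambda x: x * step).sum()

print(total)
spark.stop()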

There are many other libraries that sit on top of the PySpark core and make PySpark easier to use. We'll discuss some of them below:

  • MLlib: MLlib is a wrapper around the PySpark core that handles machine learning algorithms. The MLlib library provides a machine learning API that is very easy to use and supports a wide range of algorithms, including classification, clustering, text analysis, and more.
  • ML: ML is another machine learning library that ships with PySpark. The ML machine learning API works on DataFrames (a short sketch of it follows this list).
  • GraphFrames: The GraphFrames library provides a set of APIs for efficient graph analysis using PySpark core and PySpark SQL.
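
A minimal sketch of the DataFrame-based ML API mentioned above, fitting a logistic regression on a tiny, made-up dataset (the data and column names are hypothetical):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MLSketch").getOrCreate()

# A tiny, made-up training set: each row is (label, features).
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([1.9, 0.8]))],
    ["label", "features"],
)

# Fit a logistic regression model and inspect its predictions on the training data.
model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("label", "prediction").show()

spark.stop()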

7 Introduction to PySpark SQL

Most of the data that data scientists deal with is either structured or semi-structured in nature. For working with structured and semi-structured datasets, the PySpark SQL module provides a higher-level abstraction on top of the PySpark core. We'll learn about PySpark SQL throughout the book. It is built into PySpark, which means it requires no additional installation.

With PySpark SQL, you can read data from many sources. PySpark SQL supports reading from many file formats, including text files, CSV, ORC, Parquet, JSON, and more. You can read data from relational database management systems (RDBMSs) such as MySQL and PostgreSQL, and you can save analysis results to many systems and file formats.
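
For example, reading a CSV file and a JSON file into DataFrames and saving a result might look like this (the file paths are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadersSketch").getOrCreate()

# Read a CSV file, using the first row as the header and letting Spark infer column types.
csv_df = spark.read.csv("/tmp/people.csv", header=True, inferSchema=True)

# Read a JSON file (one JSON object per line by default).
json_df = spark.read.json("/tmp/events.json")

csv_df.printSchema()
json_df.show(5)

# Save an analysis result, for example in Parquet format.
csv_df.write.mode("overwrite").parquet("/tmp/people_parquet")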

7.1 DataFrames

DataFrames are an abstraction similar to tables in a relational database system. A DataFrame consists of named columns and is a collection of Row objects, which are defined in PySpark SQL. Because users know the schema of this tabular form, it is easy to operate on the data.

Elements in a DataFrame column all have the same data type, while a row in a DataFrame may contain elements of different data types. The underlying data structure is called a Resilient Distributed Dataset (RDD); DataFrames are wrappers over RDDs, specifically RDDs of Row objects.
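
A small sketch to make the relationship concrete: a DataFrame is built from Row objects, and the underlying RDD of Rows can be inspected directly (the sample data is invented):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("DataFrameSketch").getOrCreate()

# Build a DataFrame from Row objects; every value in a column shares one data type.
rows = [Row(name="Alice", age=34), Row(name="Bob", age=29)]
df = spark.createDataFrame(rows)

df.printSchema()  # shows the named columns and their types
df.show()

# The DataFrame is backed by an RDD of Row objects.
print(df.rdd.take(2))

spark.stop()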

RELATED:

/docs/latest/

7.2 SparkSession

The SparkSession object is the entry point that replaces SQLContext and HiveContext. To keep PySpark SQL code compatible with previous versions, SQLContext and HiveContext continue to work in PySpark. In the PySpark console, a SparkSession object is provided for us, and we can also create one ourselves using the following code.

In order to create a SparkSession object, we first have to import SparkSession, as shown below.

from pyspark.sql import SparkSession

After importing SparkSession, we can create one and use it to perform operations:

spark = ("PythonSQLAPP") .getOrCreate()

The appName function sets the name of the application, and getOrCreate() returns an existing SparkSession object if one exists; if no SparkSession object exists, getOrCreate() creates a new one and returns it.

7.3 Structured Streaming

We can perform analysis on streaming data using the Structured Streaming framework, a wrapper around PySpark SQL. With Structured Streaming we can analyze streaming data in much the same way that we use PySpark SQL to perform batch analysis on static data. Just as the Spark Streaming module processes streams as small batches, the Structured Streaming engine also processes streams as small batches. The best part of Structured Streaming is that it uses an API similar to PySpark SQL, so the learning curve is gentle, and the optimizations applied to PySpark SQL in a performance context apply to the Structured Streaming API in a similar way.
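
A minimal Structured Streaming sketch that counts words arriving on a local socket. The host, port, and console sink are assumptions for illustration; something like nc -lk 9999 would have to feed the socket for it to produce output.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredStreamingSketch").getOrCreate()

# Read a stream of text lines from a local socket source.
lines = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# The streaming DataFrame is queried with the ordinary PySpark SQL API.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the running counts to the console in small batches.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()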

7.4 Catalyst Optimizer

SQL is a declarative language: with SQL, we tell the SQL engine what to do, not how to do it. Similarly, PySpark SQL commands do not tell the engine how to perform a task, only what to perform. As a result, PySpark SQL queries need to be optimized before they are executed, and the Catalyst optimizer performs this query optimization in PySpark SQL. PySpark SQL queries are converted into low-level resilient distributed dataset (RDD) operations: the Catalyst optimizer first converts a PySpark SQL query into a logical plan, then converts this logical plan into an optimized logical plan. From the optimized logical plan it creates multiple physical plans, uses a cost model to select the best one, and finally generates the low-level RDD operation code.
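
You can see the plans that Catalyst produces by calling explain() on a DataFrame query; with extended=True it prints the parsed and analyzed logical plans, the optimized logical plan, and the chosen physical plan (the sample DataFrame below is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CatalystSketch").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "key"])

# A simple query: filter, then aggregate.
result = df.filter(df.id > 1).groupBy("key").count()

# Print the logical and physical plans produced by the Catalyst optimizer.
result.explain(extended=True)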

8 Cluster Managers

In a distributed system, jobs or applications are divided into different tasks that can run in parallel on different machines in a cluster. If a machine fails, you must reschedule the tasks on another machine.

Distributed systems often face scalability problems due to poor resource management. Consider a job that is already running on a cluster while another person wants to run a second job: the second job has to wait until the first finishes, and in this way resources are not used optimally. Resource management is easy to explain but difficult to implement on distributed systems, and cluster managers were developed to optimize the management of cluster resources. There are three cluster managers available for Spark: Standalone, Apache Mesos, and YARN. The best part of these cluster managers is that they provide a layer of abstraction between the user and the cluster; thanks to this abstraction, the user experience is like working on a single machine even though the work runs on a cluster. Cluster managers dispatch cluster resources to running applications.

8.1 Standalone Cluster Manager (SCM)

Apache Spark ships with the Standalone Cluster Manager, which provides a master-slave architecture for managing the cluster. It is a Spark-only cluster manager: you can only run Spark applications with it. Its components are a master and workers; workers are slaves of the master process. It is the simplest cluster manager, and you can configure it using the scripts in Spark's sbin directory.

8.2 Apache Mesos Cluster Manager

Apache Mesos is a general-purpose cluster manager. It was developed at the AMP Lab at UC Berkeley, and it helps distributed solutions scale efficiently. With Mesos you can run applications from different frameworks on the same cluster. What does it mean to have applications from different frameworks? It means you can run Hadoop applications and Spark applications on Mesos at the same time, and when multiple applications run on Mesos they share the cluster's resources. Apache Mesos has two important components, a master and slaves, in a master-slave architecture similar to the Spark Standalone Cluster Manager. Applications running on Mesos are called frameworks. The slaves tell the master about the resources available to them in the form of resource offers at regular intervals, and the master's allocation module decides which framework gets the resources.

8.3 YARN Cluster Manager

YARN stands for Yet Another Resource Negotiator. YARN was introduced in Hadoop 2 to extend Hadoop: resource management was separated from job management, and separating these two components makes Hadoop scale better. The main components of YARN are the ResourceManager, the ApplicationMaster, and the NodeManager. There is one global ResourceManager, while each cluster runs many NodeManagers, which are slaves of the ResourceManager. The scheduler is the component of the ResourceManager that allocates resources to the different applications on the cluster. The best part is that you can run Spark applications and other applications, such as Hadoop or MPI, simultaneously on a YARN-managed cluster. Each application has an ApplicationMaster, which handles the tasks that run in parallel on the distributed system; Hadoop and Spark each have their own ApplicationMaster.

RELATED:

/docs/2.0.0/
/docs/2.0.0/
/docs/2.0.0/

9 Introduction to PostgreSQL

Relational database management systems are still very common in many organizations. What does relational mean here? Relational tables. PostgreSQL is a relational database management system. It runs on all major operating systems, such as Microsoft Windows, Unix-based operating systems, and macOS. It is open source, and the code is available under the PostgreSQL License, so you are free to use it and modify it according to your needs.

PostgreSQL databases can be accessed from other programming languages (such as Java, Perl, Python, C, and C++) and many others through different programming interfaces. It can also be programmed using PL/pgSQL (Procedural Language/PostgreSQL), a procedural programming language similar to PL/SQL. You can add custom functions to the database, written in C/C++ and other programming languages. You can also read data from PostgreSQL in PySpark SQL using the JDBC connector.
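
A minimal sketch of reading a PostgreSQL table through JDBC. The PostgreSQL JDBC driver jar must be available to Spark (for example via the --jars option of spark-submit), and the connection details and table name below are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PostgresSketch").getOrCreate()

# The URL, credentials, and table are placeholders for illustration only.
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/mydb") \
    .option("dbtable", "public.customers") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.show(5)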

PostgreSQL complies with the ACID (Atomicity, Consistency, Isolation, and Durability) principles. It has many features, some of which are unique to PostgreSQL. It supports updatable views, transactional integrity, complex queries, triggers, and more. PostgreSQL uses a multi-version concurrency control model for concurrency management.

PostgreSQL is widely supported by the community, and it is designed and developed to be extensible.

RELATED:

/wiki/Main_Page
/wiki/PostgreSQL
/wiki/Multiversion_concurrency_control
/

10 Introduction to MongoDB

MongoDB is a document-based NoSQL database. It is an open source distributed database developed by MongoDB Inc. MongoDB is written in C++ and is horizontally scalable. Many organizations use it as a backend database and for many other purposes.

MongoDB comes with the mongo shell, a JavaScript interface to the MongoDB server. The mongo shell can be used to run queries and perform administrative tasks, and we can also run JavaScript code in it.

Using PySpark SQL, we can read data from MongoDB and perform analysis on it, and we can also write the results back.
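
A minimal sketch of reading from and writing to MongoDB with PySpark SQL. It assumes the MongoDB Spark Connector package is on Spark's classpath (for example added with --packages); the URI, database, and collection names are hypothetical, and the option names follow newer connector versions, so they may differ in your setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MongoSketch").getOrCreate()

# Read a hypothetical "orders" collection into a DataFrame.
orders = spark.read.format("mongodb") \
    .option("connection.uri", "mongodb://localhost:27017") \
    .option("database", "shop") \
    .option("collection", "orders") \
    .load()

orders.show(5)

# Write an analysis result back to another collection.
orders.groupBy("status").count().write.format("mongodb") \
    .option("connection.uri", "mongodb://localhost:27017") \
    .option("database", "shop") \
    .option("collection", "order_counts") \
    .mode("append") \
    .save()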

RELATED:

/

11 Introduction to Cassandra

Cassandra is an open source distributed database that comes with an Apache license. It is a NoSQL database originally developed at Facebook. It is horizontally scalable and best suited to handling structured data. It provides high availability with tunable consistency and has no single point of failure. It uses a peer-to-peer distributed architecture to replicate data across different nodes, and nodes exchange information using the gossip protocol.

RELATED:

/resources/tutorials
/doc/latest/
