SoFunction
Updated on 2024-11-13

Python third-party library Pandas data analysis tutorials

Pandas import

Pandas is a third-party library for Python that provides high-performance, easy-to-use data types and analysis tools. It is built on NumPy and is often used together with NumPy and Matplotlib. It provides two main data types: Series and DataFrame.

import pandas as pd

Pandas vs numpy

The Series type of Pandas

A Series consists of a set of data and the index associated with that data.

Creation of the Series type in Pandas

The Series type can be created from the following types:

Python lists: the index has the same length as the list

Scalar values: the index determines the size of the Series

Python dictionaries: the keys of the key-value pairs become the index, and an explicit index selects entries from the dictionary

ndarray: both the index and the data can be created from ndarrays

Other functions, such as range()
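The creation routes above can be sketched as follows. This is a minimal illustration; the variable names and sample values are made up for the example.

```python
import numpy as np
import pandas as pd

# From a Python list: a default integer index 0..n-1 is generated.
s_list = pd.Series([9, 8, 7, 6])

# From a scalar: the index decides how many times the value is repeated.
s_scalar = pd.Series(25, index=["a", "b", "c"])

# From a dictionary: keys become the index. Passing index selects and
# orders entries; labels absent from the dictionary become NaN.
s_dict = pd.Series({"a": 9, "b": 8, "c": 7}, index=["c", "a", "d"])

# From an ndarray, with a second ndarray supplying the index.
s_array = pd.Series(np.arange(5), index=np.arange(9, 4, -1))

# From range().
s_range = pd.Series(range(3))
```

Note that s_dict becomes a floating-point Series, because the missing label "d" is stored as NaN.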

Basic operations on the Series type of Pandas

The Series type contains both index and values:

.index: gets the index

.values: gets the data

A Series supports operations similar to those of an ndarray and of a Python dictionary.
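A short sketch of these basic operations, using a made-up Series, might look like this:

```python
import numpy as np
import pandas as pd

s = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"])

labels = s.index    # the index labels
data = s.values     # the underlying NumPy array

# ndarray-like behaviour: boolean filtering and NumPy functions.
above_median = s[s > s.median()]
roots = np.sqrt(s)

# Dictionary-like behaviour: membership test on labels and get() with a default.
has_c = "c" in s
fallback = s.get("f", 100)
```

Unlike a plain ndarray, the results of filtering and of NumPy functions keep their index labels.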

DataFrame type for pandas

The DataFrame type consists of a set of columns that share the same index.

DataFrame is a tabular data type where each column value type can be different

DataFrame has both row and column indexes.

DataFrames are often used to express two-dimensional data, but can express multidimensional data.

DataFrame is a two-dimensional "labeled" array.

The basic operation of DataFrame is similar to Series, based on the index of rows and columns.

DataFrame type creation for pandas

The DataFrame type can be created from the following types:

Two-dimensional ndarray objects

Dictionaries whose values are one-dimensional ndarrays, lists, dictionaries, tuples, or Series

Other DataFrame objects
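As a rough illustration of these creation routes (the column names and values are invented for the example):

```python
import numpy as np
import pandas as pd

# From a two-dimensional ndarray: default integer row and column labels.
df_array = pd.DataFrame(np.arange(10).reshape(2, 5))

# From a dictionary of Series: the indexes are aligned, and labels
# missing from one Series are filled with NaN.
df_series = pd.DataFrame({
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"]),
})

# From a dictionary of lists, with an explicit row index.
df_lists = pd.DataFrame(
    {"one": [1, 2, 3, 4], "two": [9, 8, 7, 6]},
    index=["a", "b", "c", "d"],
)

# From another DataFrame.
df_copy = pd.DataFrame(df_lists)
```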

Basic operations on the DataFrame type of Pandas

pandas index operations

pandas reindexing

reindex() can change or reorder Series and DataFrame indexes.

Arguments for reindex(index=None, columns=None, ...)
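A minimal sketch of reindex() on a small, made-up DataFrame. fill_value is one of the further optional arguments and sets the value used for labels that did not exist before:

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [10, 20, 30], "volume": [1, 2, 3]},
    index=["c1", "c2", "c3"],
)

# Reorder the rows by passing the index labels in a new order.
reordered = df.reindex(index=["c3", "c1", "c2"])

# Reorder the columns.
swapped = df.reindex(columns=["volume", "price"])

# Add a new row label; fill_value replaces the default NaN.
extended = df.reindex(index=["c1", "c2", "c3", "c4"], fill_value=0)
```

reindex() returns a new object and leaves df unchanged.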

pandas delete index

drop() deletes the specified row or column labels from a Series or DataFrame.
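For example, with illustrative data:

```python
import pandas as pd

s = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"])
s_dropped = s.drop(["b", "c"])          # remove two labels from a Series

df = pd.DataFrame(
    {"one": [1, 2], "two": [3, 4], "three": [5, 6]},
    index=["r1", "r2"],
)
rows_dropped = df.drop("r1")            # rows are dropped by default (axis=0)
cols_dropped = df.drop("two", axis=1)   # axis=1 drops a column instead
```

drop() also returns a new object, so s and df keep all of their labels.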

pandas data operations

Arithmetic operations align the operands on their row and column indexes, complete the missing labels, and then compute; the result is floating-point by default

Entries that are missing after alignment are filled with NaN (null)

Operations between two-dimensional and one-dimensional data, or between one-dimensional and zero-dimensional data, are broadcast

Binary operations written with + - * / produce a new object

arithmetic operation

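Operations between objects of different dimensions are broadcast. A one-dimensional Series takes part along axis 1 by default, that is, it is matched against the column labels. To make a one-dimensional Series take part along axis 0, use the method form of the operation (for example sub()) with axis=0.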
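The alignment and broadcasting rules can be sketched like this, using small made-up frames:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))

# Operator form: labels are aligned first; positions present in only
# one operand become NaN in the result.
total = a + b

# Method form: fill_value substitutes for a value missing on one side.
filled = a.add(b, fill_value=0)

# DataFrame with Series: by default the Series is matched against the
# column labels (axis 1), so it is subtracted from every row.
row = pd.Series(np.arange(4))
by_columns = a - row

# With axis=0 the Series is matched against the row labels instead.
col = pd.Series(np.arange(3))
by_rows = a.sub(col, axis=0)
```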

Pandas Data Analysis

pandas import and export data

Import data

pd.read_csv(filename): import data from CSV file

pd.read_table(filename): Import data from a delimited text file.

pd.read_excel(filename): import data from Excel file

pd.read_sql(query, connection_object): import data from a SQL table or database

pd.read_json(json_string): import data from JSON format string

pd.read_html(url): parses a URL, string or HTML file and extracts the tables it contains

pd.read_clipboard(): get the content from your clipboard and pass it to read_table()

pd.DataFrame(dict): import data from a dictionary object, where each key is a column name and each value is that column's data

Export data

df.to_csv(filename): export data to CSV file

df.to_excel(filename): export data to Excel file

df.to_sql(table_name, connection_object): export data to SQL table

df.to_json(filename): export data to a text file in JSON format
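A quick round trip shows how the import and export functions pair up. This sketch writes to in-memory buffers instead of files so that it runs without touching the disk; the data is made up.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})

# Export to CSV text, then read it back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False leaves out the row labels
buffer.seek(0)
restored = pd.read_csv(buffer)

# to_json() without a path returns the JSON text as a string.
json_text = df.to_json()
from_json = pd.read_json(io.StringIO(json_text))
```

With real files, the buffer is simply replaced by a filename such as "data.csv".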

Pandas to view, examine data

df.head(n): view the first n rows of the DataFrame object

df.tail(n): view the last n rows of the DataFrame object

df.shape: view the number of rows and columns

df.info(): view index, data type and memory information

df.describe(): view summary statistics for numeric columns

s.value_counts(dropna=False): view unique values and counts for a Series object

df.apply(pd.Series.value_counts): view unique values and counts for each column in the DataFrame object
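A brief sketch of these inspection calls on a small, invented DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "A", None],
    "temp": [20.5, 18.0, 22.1, 19.4],
})

first_two = df.head(2)        # first 2 rows
last_one = df.tail(1)         # last row
n_rows, n_cols = df.shape     # shape is an attribute, not a method
df.info()                     # prints index, dtypes and memory usage
summary = df.describe()       # summary statistics for the numeric column

# dropna=False also counts the missing value.
counts = df["city"].value_counts(dropna=False)
```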

Pandas data selection

df[col]: based on the column name and return the column as a Series

df[[col1, col2]]: return multiple columns as DataFrame

df.iloc[0]: select data by position

df.loc['index_one']: select data by index label

df.iloc[0,:]: return the first row

df.iloc[0,0]: return the first element of the first column
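The difference between label-based and position-based selection can be illustrated as follows (the labels and values are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["index_one", "index_two", "index_three"],
)

col_a = df["a"]                       # one column, returned as a Series
cols_ab = df[["a", "b"]]              # several columns, returned as a DataFrame

first_by_pos = df.iloc[0]             # row at position 0
first_by_label = df.loc["index_one"]  # row with the label 'index_one'
first_row = df.iloc[0, :]             # same row, with an explicit column slice
first_cell = df.iloc[0, 0]            # first element of the first column
```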

pandas data cleanup

df.columns = ['a','b','c']: rename columns

df.isnull(): checks for null values in the DataFrame object and returns a DataFrame of Booleans

df.notnull(): checks for non-null values in the DataFrame object and returns a DataFrame of Booleans

df.dropna(): remove all rows containing null values

df.dropna(axis=1): remove all columns containing null values

df.dropna(axis=1,thresh=n): remove all columns with fewer than n non-null values

df.fillna(x): replace all null values in the DataFrame object with x

s.astype(float): change the data type of the Series to float

s.replace(1,'one'): replace all values equal to 1 with 'one'

s.replace([1,3],['one','three']): replace 1 with 'one' and 3 with 'three'

df.rename(columns=lambda x: x + 1): batch change column names

df.rename(columns={'old_name': 'new_name'}): selectively change column names

df.set_index('column_one'): use the column column_one as the index

df.rename(index=lambda x: x + 1): batch rename index labels
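A minimal cleanup sketch, assuming a small DataFrame with a few missing values (all names and values here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 4, 7],
    "y": [np.nan, 5, 8],
    "z": [3, np.nan, 9],
})
df.columns = ["a", "b", "c"]          # rename all columns at once

null_mask = df.isnull()               # True where a value is missing
complete_rows = df.dropna()           # keep only rows without nulls
complete_cols = df.dropna(axis=1)     # keep only columns without nulls
filled = df.fillna(0)                 # replace nulls with 0

as_float = df["a"].astype(float)      # convert a column to float
replaced = df["a"].replace([1, 4], ["one", "four"])

renamed = df.rename(columns={"a": "alpha"})
indexed = df.set_index("a")           # use column 'a' as the index
```

Each of these calls returns a new object; only the assignment to df.columns changes df itself.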

Pandas Data Processing

df[df[col] > 0.5]: selects rows where the value of the col column is greater than 0.5

df.sort_values(col1): sort data by column col1, default ascending order

df.sort_values(col2, ascending=False): sort data in descending order by column col2

df.sort_values([col1,col2], ascending=[True,False]): sort data by column col1 in ascending order, then by column col2 in descending order

df.groupby(col): return a GroupBy object grouped by column col

df.groupby([col1,col2]): return a GroupBy object grouped by multiple columns

df.groupby(col1)[col2].mean(): return the mean value of column col2 after grouping by column col1

df.pivot_table(index=col1, values=[col2,col3], aggfunc=max): create a pivot table grouped by column col1 and calculate the maximum values of col2 and col3

df.groupby(col1).agg('mean'): return the average of all columns, grouped by column col1

df.apply(func): apply the function func to each column in the DataFrame

df.apply(func, axis=1): apply the function func to each row in the DataFrame
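The filtering, sorting, grouping and apply calls can be tried on a small made-up table. The column names and the lambda functions are assumptions for the sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["x", "x", "y", "y"],
    "year": [2023, 2024, 2023, 2024],
    "score": [0.4, 0.9, 0.7, 0.2],
    "cost": [10, 30, 20, 40],
})

high = df[df["score"] > 0.5]                        # boolean filter on rows
by_score = df.sort_values("score", ascending=False)
by_team_cost = df.sort_values(["team", "cost"], ascending=[True, False])

mean_score = df.groupby("team")["score"].mean()     # one value per team
team_means = df.groupby("team").agg("mean")         # mean of every other column
pivot = df.pivot_table(index="team", values=["score", "cost"], aggfunc="max")

numeric = df[["score", "cost"]]
col_range = numeric.apply(lambda column: column.max() - column.min())
row_product = numeric.apply(lambda row: row["score"] * row["cost"], axis=1)
```

Passing the aggregation as a string such as "max" or "mean" is an alternative to passing a function object.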

Pandas Data Merge

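pd.concat([df1, df2]): add the rows of df2 to the end of df1 (the older df1.append(df2) was removed in pandas 2.0)

pd.concat([df1, df2],axis=1): add the columns of df2 to the end of df1

df1.join(df2,on=col1,how='inner'): perform a SQL-style inner join of df1 and df2, matching column col1 of df1 against the index of df2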
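A small sketch of the three ways of combining data, with made-up frames. It uses pd.concat for appending rows, which also works on current pandas versions where DataFrame.append no longer exists:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})

# Add rows: stack a second frame with the same columns underneath.
more_rows = pd.DataFrame({"key": ["d"], "value": [4]})
stacked = pd.concat([df1, more_rows], ignore_index=True)

# Add columns: place a second frame next to the first, aligned on the index.
extra_cols = pd.DataFrame({"flag": [True, False, True]})
side_by_side = pd.concat([df1, extra_cols], axis=1)

# SQL-style inner join: df1's 'key' column is matched against the index
# of lookup, and rows without a match are dropped.
lookup = pd.DataFrame({"label": ["alpha", "beta"]}, index=["a", "b"])
joined = df1.join(lookup, on="key", how="inner")
```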

Pandas statistics

df.describe(): view summary statistics for numeric columns

df.mean(): return the average value of all columns

df.corr(): return the correlation coefficient between columns

df.count(): return the number of non-null values in each column

df.max(): return the maximum value of each column

df.min(): return the minimum value of each column

df.median(): return the median of each column

df.std(): return the standard deviation of each column
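These statistics can be tried on a small numeric DataFrame. The data below is invented, with one missing value to show that nulls are skipped:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, np.nan],
})

summary = df.describe()   # count, mean, std, min, quartiles, max
means = df.mean()         # NaN values are skipped by default
corr = df.corr()          # pairwise correlation between columns
counts = df.count()       # number of non-null values per column
maxima = df.max()
minima = df.min()
medians = df.median()
stds = df.std()           # sample standard deviation (ddof=1)
```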
