Getting the number of rows and columns of a DataFrame with Python pandas
import pandas as pd
df = pd.DataFrame({'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
                   'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
                   'Number': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})
Method 1:
df.shape  # returns the number of rows and columns of df
Output:
(8, 3)
df.shape[0]  # returns the number of rows of df
Output:
8
df.shape[1]  # returns the number of columns of df
Output:
3
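As a quick check of Method 1, the sketch below rebuilds the sample data and unpacks shape into two variables. The names n_rows and n_cols are only illustrative; len(df) and len(df.columns) are equivalent ways of getting the same counts.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
    'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
    'Number': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() on the frame counts rows; len() on df.columns counts columns
rows_via_len = len(df)
cols_via_len = len(df.columns)
```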
Method 2:
df.info()
Output:
Getting data and a data overview with pandas
1. Data acquisition
First, import the necessary libraries:
import pandas as pd
import numpy as np
1.1 Reading data
Usage: pandas.read_csv()
Parameters:
(1) Path where the file is located
(2) header: by setting the parameter header=None, pandas will not automatically use the first row of the dataset as the table header (column names)
other_path = "/cf-courses-data/CognitiveClass/DA0101EN/"
df = pd.read_csv(other_path, header=None)
- View the first n rows of the dataset with the function df.head(n)
- View the last n rows of the dataset with the function df.tail(n)
df.head(5)
Output:
df.tail(10)
Output:
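To make the effect of header=None concrete without access to the original file, here is a minimal sketch that reads a few made-up rows from an in-memory string. The sample rows and the variable csv_text are assumptions for illustration only, not the real dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file that has no header row
csv_text = (
    "3,?,alfa-romero,gas\n"
    "1,?,alfa-romero,gas\n"
    "2,164,audi,gas\n"
    "2,164,audi,diesel\n"
)

# header=None keeps the first line as data instead of using it as column names
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.columns.tolist())  # integer labels assigned by pandas

first_two = df.head(2)  # first 2 rows
last_two = df.tail(2)   # last 2 rows
print(first_two)
print(last_two)
```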
1.2 Adding column names (table headers) to data sets
In the data read above, pandas has automatically set the column names (table headers) to integer labels starting at 0.
We need to add column names manually so that the data is easier to understand:
First create a list, headers, that contains the name of each column, then assign it to the columns attribute (df.columns = headers) to replace the automatically generated labels.
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location", "wheel-base",
           "length", "width", "height", "curb-weight", "engine-type",
           "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
           "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
df.columns = headers
df.head(10)
Output:
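The renaming step can be tried on a small frame first. The sketch below uses a hypothetical three-column frame rather than the real 26-column dataset, and also shows that pandas rejects a list whose length does not match the number of columns.

```python
import pandas as pd

# Hypothetical three-column frame with the default integer column labels
df = pd.DataFrame([[3, "alfa-romero", "gas"], [2, "audi", "gas"]])

headers = ["symboling", "make", "fuel-type"]
df.columns = headers  # one name per existing column, in order
renamed = df.columns.tolist()
print(renamed)

# A list of the wrong length raises ValueError instead of being applied partially
try:
    df.columns = ["symboling", "make"]
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```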
1.3 Removing "dirty data" with null values
Looking at the data above, some rows contain the value "?", which stands for a missing value. First replace every "?" with NaN, then use the method dropna() to remove the dirty rows.
df1 = df.replace('?', np.nan)
The method dropna() is then used to remove the dirty rows.
About the method dropna():
Parameters:
(1) axis: the default 0 drops rows, 1 drops columns
(2) subset: only look for missing values in the given columns
(3) how: {'any', 'all'}, default 'any'. 'any' drops a row if it contains any missing value; 'all' drops a row only if all of its values are missing.
(4) thresh: int, keep only the rows that contain at least that many non-null values.
(5) inplace: True means the change is made directly on the original data.
df = df1.dropna(subset=["price"], axis=0)
df.head(20)
The above call removes the rows where the "price" column is null.
Output:
As you can see, the "price" value in row 9 was null, so row 9 was deleted.
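The difference between subset, how and thresh is easier to see on a tiny example. The following sketch uses a made-up four-row frame (not the real dataset) and applies each option to the same data.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which "?" marks a missing value
raw = pd.DataFrame({
    "make": ["audi", "bmw", "?", "volvo"],
    "price": ["13950", "?", "?", "16500"],
})

# Step 1: turn the "?" markers into real missing values
df1 = raw.replace("?", np.nan)

# Step 2: drop rows under different rules (each call returns a new frame)
by_price = df1.dropna(subset=["price"], axis=0)  # missing price only
any_missing = df1.dropna(how="any")              # any value missing
all_missing = df1.dropna(how="all")              # every value missing
at_least_one = df1.dropna(thresh=1)              # keep rows with >= 1 non-null value

print(by_price.index.tolist())
print(all_missing.index.tolist())
```

Note that the row labels are kept after dropping, so the index of the result can have gaps.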
1.4 Viewing the list of column names
df.columns
Output:
1.5 Saving a dataset
We can save the processed dataframe (df) as a file in some format (e.g., .csv) for easy reading later.
Usage: df.to_csv("path where the file should be saved", index=False)
df.to_csv("", index=False)
Note: The parameter index controls whether the row index is written to the file; the default is True.
Of course, we can read data in other formats, and after manipulating the data, we can also save the data in a different format. The following figure shows how to read a file in another format and save the data set in another format:
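As a rough illustration of the read/save pairs, the sketch below round-trips a small made-up frame through CSV and JSON in memory, so no file path is needed. pandas offers similar pairs for other formats, for example read_excel()/to_excel() and read_sql()/to_sql(), which are not shown here.

```python
import io
import pandas as pd

# Hypothetical processed data
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# With no path given, to_csv() and to_json() return the text instead of writing a file
csv_text = df.to_csv(index=False)         # row index left out
csv_with_index = df.to_csv()              # default index=True adds an unnamed first column
json_text = df.to_json(orient="records")  # one JSON object per row

# Read the same data back from each format
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(csv_text.splitlines()[0])
print(csv_with_index.splitlines()[0])
```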
2 Overview of data
2.1 Viewing the type of data in each column
The dtypes attribute of a dataframe returns a Series giving the name and data type of each column:
print(df.dtypes)
Output:
The first column is the name of the column and the second column is the type of data
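For instance, on a small made-up frame the dtypes attribute can be inspected per column, and select_dtypes() can pick out the numeric columns. The column names here are chosen for illustration only.

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns
df = pd.DataFrame({
    "make": ["audi", "bmw"],
    "length": [176.6, 189.0],
    "doors": [4, 2],
})

print(df.dtypes)  # one entry per column: name on the left, type on the right

length_type = str(df.dtypes["length"])
doors_type = str(df.dtypes["doors"])

# Keep only the columns whose type is numeric
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```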
2.2 Obtaining summary statistics for each column (e.g. count, mean, standard deviation)
Use dataframe.describe() to see, for each column:
(1) Number of non-null values count
(2) Mean mean
(3) Standard deviation std
(4) Minimum min
(5) 25th percentile "25%"
(6) 50th percentile (median) "50%"
(7) 75th percentile "75%"
(8) Maximum max
df.describe()
Output:
Note: Called without arguments, describe() only summarizes columns of numeric type (e.g. int, float) and automatically skips NaN values.
To see the statistics of all columns (i.e. including non-numeric columns, such as those of type object), add the parameter include="all" to the describe() call:
df.describe(include="all")
Output:
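The difference between the two calls can be seen on a small made-up frame with one text column and one numeric column that contains a NaN. This is only a sketch; the real dataset has many more columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one text column, one numeric column with a missing value
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo"],
    "price": [10.0, 20.0, np.nan, 30.0],
})

numeric_summary = df.describe()           # numeric columns only, NaN skipped
full_summary = df.describe(include="all")  # adds unique/top/freq for text columns

print(numeric_summary)
print(full_summary)

price_count = numeric_summary.loc["count", "price"]
price_mean = numeric_summary.loc["mean", "price"]
most_common_make = full_summary.loc["top", "make"]
```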
2.3 Getting the statistical characteristics of a given column
Use the following statement:
dataframe[['column1', 'column2', 'column3']].describe()
df[['length', 'compression-ratio']].describe()
Output:
2.4 Using the method info() to view a concise summary of a dataframe
Use the following statement:
dataframe.info()
This method prints information about the dataframe, including the index dtype, the columns and their dtypes, the non-null counts, and memory usage.
df.info()
Output:
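Because info() prints its report rather than returning it, one way to inspect the text is to pass a buffer through the buf parameter. The sketch below does this on a small made-up frame with one missing row.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame in which the second row is entirely missing
df = pd.DataFrame({
    "make": ["audi", None, "volvo"],
    "price": [13950.0, np.nan, 16500.0],
})

# info() writes to the given buffer instead of standard output
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The printed report lists the number of entries, the non-null count and dtype of each column, and the memory usage.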
2.5 Viewing the number of rows and columns of data
Use the attribute shape to get the number of rows and columns of a dataset:
ratings_df.shape
Output:
(463, 19)
Summary
The above is based on my personal experience. I hope it serves as a useful reference.