SoFunction
Updated on 2024-11-17

How to get the number of rows and columns of data with Python pandas

Getting the number of rows and columns of data with pandas

import pandas as pd

# Build a small sample DataFrame with 8 rows and 3 columns
df = pd.DataFrame({'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
                   'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
                   'Number': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})

Method 1:

df.shape  # returns the number of rows and columns of df

Output:

(8, 3)

df.shape[0]  # returns the number of rows of df

Output:

8

df.shape[1]  # returns the number of columns of df

Output:

3
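
As a minimal sketch of Method 1, the snippet below rebuilds the sample data and unpacks shape into two variables. The names n_rows and n_cols are illustrative, and the len() calls are simply an equivalent way of getting the same counts.

```python
import pandas as pd

# Same sample data as above, wrapped in pd.DataFrame.
df = pd.DataFrame({
    'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
    'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
    'Number': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 8 3

# len() gives the same counts without going through shape.
print(len(df), len(df.columns))  # 8 3
```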

Method 2:

df.info()

Output:

Data acquisition and data overview with pandas

1. Data acquisition

First, import the necessary libraries:

import pandas as pd
import numpy as np

1.1 Reading data

Usage: pandas.read_csv()

Parameters:

(1) Path where the file is located

(2) header: setting header=None tells pandas not to treat the first row of the dataset as the table header (column names)

other_path = "/cf-courses-data/CognitiveClass/DA0101EN/"
df = pd.read_csv(other_path, header=None)
  • To view the first n rows of the dataset, use df.head(n)
  • To view the last n rows of the dataset, use df.tail(n)

df.head(5)

Output:

df.tail(10)

Output:
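
A minimal sketch of reading headerless data and inspecting both ends of it. It assumes a small inline CSV (hypothetical values held in csv_text) in place of the course file, so it runs without any download.

```python
import io
import pandas as pd

# Hypothetical headerless CSV standing in for the real data file.
csv_text = "3,alfa,13495\n1,audi,13950\n2,bmw,16430\n0,dodge,5572\n"

# header=None stops pandas from treating the first row as column names;
# the columns are labelled 0, 1, 2 instead.
sample = pd.read_csv(io.StringIO(csv_text), header=None)

print(sample.head(2))        # first 2 rows
print(sample.tail(2))        # last 2 rows
print(list(sample.columns))  # [0, 1, 2]
```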

1.2 Adding column names (table headers) to data sets

In the data read above, pandas has automatically set the column names (table headers) to integer labels starting at 0.

We need to add column names manually so that the data is easier to understand:

First create a list named headers that contains the name of each column, then assign it with df.columns = headers to replace the automatically generated column names.

headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
df.columns = headers
df.head(10)

Output:
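
A minimal sketch of the same renaming step on a hypothetical three-column frame (the frame and its three names are illustrative, not the full dataset):

```python
import pandas as pd

# Hypothetical three-column frame with default integer column labels.
sample = pd.DataFrame([[3, 'alfa', 13495], [1, 'audi', 13950]])
print(list(sample.columns))  # [0, 1, 2]

# The list needs exactly one name per column; a length mismatch raises ValueError.
sample.columns = ['symboling', 'make', 'price']
print(list(sample.columns))  # ['symboling', 'make', 'price']
```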

1.3 Deletion of certain "dirty data" with null values

Looking at the data above, some rows contain the value "?", which stands for a missing value. First replace each "?" with NaN, then use the dropna() method to remove these dirty rows.

df1 = df.replace('?', np.nan)

The dropna() method is then used to remove the dirty rows.

About the dropna() method:

Parameters:

(1) axis: default 0, which drops rows; 1 drops columns

(2) subset: the list of columns in which to look for missing values

(3) how: {'any', 'all'}, default 'any'. 'any' drops every row that contains at least one missing value; 'all' drops only rows in which all values are missing.

(4) thresh: int; keep only rows that have at least that many non-null values.

(5) inplace: True means the original data is modified directly.
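
To make the parameters more concrete, here is a small sketch on a hypothetical four-row frame. The data and the variable names are assumptions chosen so that each option keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1 and 3 have one NaN, row 2 is all NaN.
sample = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, 2.0, np.nan, 4.0],
    'c': [1.0, 2.0, np.nan, np.nan],
})

kept_any = sample.dropna()                # how='any' (default): only row 0 survives
kept_all = sample.dropna(how='all')       # only the all-NaN row 2 is dropped
kept_thresh = sample.dropna(thresh=2)     # keep rows with at least 2 non-null values
kept_subset = sample.dropna(subset=['b']) # only look at column 'b'

print(len(kept_any), len(kept_all), len(kept_thresh), len(kept_subset))  # 1 3 3 3
```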

df = df1.dropna(subset=["price"], axis=0)
df.head(20)

The above call removes the rows where the "price" column is null.

Output:

As you can see, the "price" value in row 9 was null, so row 9 was deleted.
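
The two cleaning steps can be sketched end to end on a hypothetical two-column frame. The data is illustrative, and the final reset_index call is an optional extra, not part of the steps above; it only renumbers the rows after the deletion.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which "?" marks a missing price.
sample = pd.DataFrame({
    'make': ['alfa', 'audi', 'bmw'],
    'price': ['13495', '?', '16430'],
})

# Step 1: turn "?" into NaN. Step 2: drop rows whose price is NaN.
cleaned = sample.replace('?', np.nan).dropna(subset=['price'], axis=0)
print(cleaned.index.tolist())  # [0, 2]

# Optional: renumber the remaining rows from 0.
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```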

1.4 Viewing the list of column names
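
A minimal sketch, assuming a hypothetical two-column frame: the columns attribute holds the column names as an Index object, and tolist() converts it to a plain Python list.

```python
import pandas as pd

# Hypothetical frame used only to illustrate the attribute.
sample = pd.DataFrame({'make': ['alfa', 'audi'], 'price': [13495, 13950]})

print(sample.columns)           # an Index object holding the column names
print(sample.columns.tolist())  # ['make', 'price']
```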

Output:

1.5 Preservation of a data set

We can save the processed dataframe (df) as a file in some format (e.g., .csv) for easy reading later.

Usage: df.to_csv("path where the file should be saved", index=False)

df.to_csv("", index=False)

Note: The index parameter controls whether the row index is written to the file; the default is True.
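
To see what the index parameter changes, the sketch below writes a hypothetical frame to a string instead of a file (to_csv returns the CSV text when no path is given) and compares the header lines.

```python
import io
import pandas as pd

# Hypothetical frame used only for illustration.
sample = pd.DataFrame({'make': ['alfa', 'audi'], 'price': [13495, 13950]})

# With no path argument, to_csv returns the CSV content as a string.
with_index = sample.to_csv()
without_index = sample.to_csv(index=False)

print(with_index.splitlines()[0])     # ,make,price  (extra unnamed index column)
print(without_index.splitlines()[0])  # make,price

# Reading the index-free text back gives the same two columns.
restored = pd.read_csv(io.StringIO(without_index))
print(restored.shape)  # (2, 2)
```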

Of course, we can read data in other formats, and after manipulating the data, we can also save the data in a different format. The following figure shows how to read a file in another format and save the data set in another format:

2 Overview of data

2.1 Viewing the type of data in each column

The dtypes attribute of a dataframe returns a Series that gives the data type of each column, indexed by column name:

print(df.dtypes)

Output:

In the output, the first column is the column name and the second column is its data type.
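
A minimal sketch on a hypothetical frame with one text column and two numeric columns; the column names are illustrative.

```python
import pandas as pd

# Hypothetical frame mixing text, float, and integer columns.
sample = pd.DataFrame({
    'make': ['alfa', 'audi'],
    'length': [168.8, 176.6],
    'doors': [2, 4],
})

print(sample.dtypes)            # one line per column: name, then dtype
print(sample.dtypes['length'])  # float64
```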

2.2 Obtaining statistical characteristics for each column of data (e.g. count, mean, standard deviation)

Use df.describe() to see the following for each column of data:

(1) Number of non-null values count

(2) Mean mean

(3) Standard deviation std

(4) Minimum min

(5) 25th percentile (first quartile) "25%"

(6) 50th percentile (median) "50%"

(7) 75th percentile (third quartile) "75%"

(8) Maximum max

df.describe()

Output:

Note: Called without arguments, describe() only summarizes columns of numeric types (e.g. int, float) and automatically skips NaN values.

To see the statistical characteristics of all columns (including non-numeric columns, such as those of type object), pass include="all" to describe():

df.describe(include="all")

Output:
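
The difference between the two calls can be sketched on a hypothetical frame with one text column and one numeric column (data and names are illustrative):

```python
import pandas as pd

# Hypothetical frame: 'make' is text, 'length' is numeric.
sample = pd.DataFrame({
    'make': ['alfa', 'audi', 'audi', 'bmw'],
    'length': [168.8, 176.6, 176.6, 189.0],
})

# Default call: only the numeric column is summarized.
numeric = sample.describe()
print(list(numeric.columns))  # ['length']
print(list(numeric.index))    # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# include="all": text columns are summarized too (unique, top, freq).
everything = sample.describe(include='all')
print(list(everything.columns))       # ['make', 'length']
print(everything.loc['top', 'make'])  # audi
```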

2.3 Getting the statistical characteristics of a given column

Use the following statement:

dataframe[['column1', 'column2', 'column3']].describe()

df[['length', 'compression-ratio']].describe()

Output:

2.4 Using the method info() to view the profile description of a dataframe

Use the following statement:

dataframe.info()

This method prints information about the dataframe, including index dtype and columns, non-null values, and memory usage.

df.info()

Output:
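
A minimal sketch of what info() reports, assuming a hypothetical frame with a few missing values. The buf argument is used here only to capture the report as a string; calling info() without it prints the report directly.

```python
import io
import pandas as pd

# Hypothetical frame with one missing value in each column.
sample = pd.DataFrame({
    'make': ['alfa', None, 'bmw'],
    'price': [13495.0, 13950.0, None],
})

# info() writes its report to the given buffer instead of the screen.
buffer = io.StringIO()
sample.info(buf=buffer)
report = buffer.getvalue()

# The report lists the index range, each column with its non-null count
# and dtype, and the memory usage.
print(report)
```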

2.5 Viewing the number of rows and columns of data

To get the number of rows and columns of the dataset, use the shape attribute.

ratings_df.shape

Output:

(463, 19)

Summary

The above is based on my personal experience. I hope it serves as a useful reference, and I appreciate your continued support.