I. Reading files with Pandas
- When using Pandas for data analysis, the first step is to read the prepared dataset. Pandas provides several methods for reading data, one for each common file format:
- (1) read_csv() reads a delimited text (CSV) file.
- (2) read_excel() reads an Excel file.
- (3) read_json() reads a JSON file.
- (4) read_sql_query() runs a SQL query and reads the result.
- The generalized process is as follows:
- (1) Import the library: import pandas as pd.
- (2) Find the location of the file (an absolute path is the full path; a relative path is the short path relative to the folder the program runs in).
- (3) variable name = pd.read_... method(file path, further options, ...). A hedged sketch of the four readers follows this list.
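As a quick illustration of this general pattern, the calls below show the four readers side by side. This is only a sketch: the file names and the SQLite database are placeholders invented for the example, and read_excel additionally needs an Excel engine such as openpyxl installed.

import sqlite3
import pandas as pd

# Placeholder file names -- substitute your own data files.
df_csv = pd.read_csv("data/example.csv")      # delimited text (CSV) file
df_xlsx = pd.read_excel("data/example.xlsx")  # Excel workbook
df_json = pd.read_json("data/example.json")   # JSON file
conn = sqlite3.connect("data/example.db")     # any DB-API or SQLAlchemy connection works here
df_sql = pd.read_sql_query("SELECT * FROM students", conn)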
CSV file reading
- CSV (Comma-Separated Values) is a simple file format that stores tabular data as plain text in a fixed structure. It is a common exchange format for tabular data such as spreadsheets and database exports, and when opened in Excel it appears as ordinary rows and columns.
- Converting the data in a CSV file into a DataFrame object is very convenient. Unlike ordinary file handling, you do not have to open the file, read it, and close it yourself; a single line of code performs all of these steps and stores the data in a DataFrame.
- The following examples are demonstrated with this source data:
First, we read the CSV file, either via a relative path or via an absolute path obtained dynamically with os.getcwd().
import pandas as pd

df = pd.read_csv("./data/my_csv.csv")
print(df, type(df))
#    col1 col2  col3    col4      col5
# 0     2    a   1.4   apple  2022/1/1
# 1     3    b   3.4  banana  2022/1/2
# 2     6    c   2.5  orange  2022/1/5
# 3     5    d   3.2   grape  2022/1/7
# <class 'pandas.core.frame.DataFrame'>
We can get the directory a file is stored in (the folder relative paths are resolved from) with os.getcwd().
import os

os.getcwd()
# 'C:\\Users\\CQB\\Desktop\\Inner Mongolia Agricultural University Data Analysis Lesson Plans and Codes\\Day 16'
Its syntax template is as follows:
read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, usecols=None, squeeze=None, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, error_bad_lines=None, warn_bad_lines=None, on_bad_lines=None, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None)
1. Basic parameters
- (1) filepath_or_buffer (the data source): this can be a file path, a URL, or any object that implements a read method. It is the first argument we pass in.
- We can simply call read_csv with the path of the file we want to read.
import pandas as pd

pd.read_csv(r"data\")
#    id  name            address         gender  birthday
# 0   1  Zhu Mengxue     Global Village  Female  2004/11/2
# 1   2  Xu Wenbo        Moon Star       Female  2003/8/7
# 2   3  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 3   4  Fu Yanxu        Kehasing        Male    2003/10/11
# 4   5  Wang Jie        Charsing        Male    2002/6/12
# 5   6  Tung Chak Yue   Tasanis         Male    2002/2/12
- The path can also be a URL: if requesting that URL returns a file, pandas' read_csv reads it automatically. For example, if we put the data on our own server, it returns the same file we just looked at.
- Note, however, that this requires a network request, so reading the file is slower.
pd.read_csv("/static/data/")
The argument can also be an _io.TextIOWrapper (an already opened file object); pandas reads it as utf-8 by default, for example (an in-memory buffer works too, as sketched after this example):
f = open(r"data\", encoding="utf-8")
pd.read_csv(f)
#    id  name            address         gender  birthday
# 0   1  Zhu Mengxue     Global Village  Female  2004/11/2
# 1   2  Xu Wenbo        Moon Star       Female  2003/8/7
# 2   3  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 3   4  Fu Yanxu        Kehasing        Male    2003/10/11
# 4   5  Wang Jie        Charsing        Male    2002/6/12
# 5   6  Tung Chak Yue   Tasanis         Male    2002/2/12
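Since any object with a read method is accepted, an in-memory io.StringIO buffer works as well. A minimal sketch with a made-up two-row table (not the students file used elsewhere in this article):

import io
import pandas as pd

buf = io.StringIO("id,name\n1,Zhu Mengxue\n2,Xu Wenbo")  # behaves like an opened text file
print(pd.read_csv(buf))
#    id         name
# 0   1  Zhu Mengxue
# 1   2     Xu Wenbo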
(2) sep: the delimiter used when reading a csv file; the default is a comma. Note: the delimiter we specify must match the delimiter actually used in the csv file.
import pandas as pd

pd.read_csv(r"data\students_step.csv")
#   id|name|address|gender|birthday
# 0 1|Zhu Mengxue|Global Village|Female|2004/11/2
# 1 2|Xu Wenbo|Moon Star|Female|2003/8/7
# 2 3|Zhang Zhaoyuan|Elsing|Female|2004/11/2
# 3 4|Fu Yanxu|Kehasing|Male|2003/10/11
# 4 5|Wang Jie|Charsing|Male|2002/6/12
# 5 6|Tung Chak Yue|Tasanis|Male|2002/2/12
Since the delimiter we specified (the default comma) is not the delimiter used in the csv file, the columns are not separated but stay joined together. We therefore need to set the separator to | to make it work.
df = pd.read_csv(r"data\students_step.csv", sep="|")
df
#    id  name            address         gender  birthday
# 0   1  Zhu Mengxue     Global Village  Female  2004/11/2
# 1   2  Xu Wenbo        Moon Star       Female  2003/8/7
# 2   3  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 3   4  Fu Yanxu        Kehasing        Male    2003/10/11
# 4   5  Wang Jie        Charsing        Male    2002/6/12
# 5   6  Tung Chak Yue   Tasanis         Male    2002/2/12
(3) delim_whitespace: default False. When set to True, the separator is any whitespace character; it can be a space, \t, and so on. Whatever the whitespace is, delim_whitespace=True will split on it. Below we leave delim_whitespace unset (so it stays False), and the file is read slightly wrong.
df = pd.read_csv(r"data\students_whitespace.txt", sep=" ")
df
#    id  name                    address         gender     birthday
# 0   1  Zhu Mengxue             Global Village  Female     2004/11/2
# 1   2  Xu Wenbo\t              Moon Star       Female     2003/8/7    NaN
# 2   3  Zhang Zhaoyuan          Elsing          Female     2004/11/2
# 3   4  Fu Yanxu                Kehasing        Male       2003/10/11
# 4   5  Wang Jie\t              Charsing        Male       2002/6/12   NaN
# 5   6  Tung Chak Yue\tTasanis  Male            2002/2/12  NaN
To fix this, we set delim_whitespace to True and get the result we want.
df = pd.read_csv(r"data\students_whitespace.txt", delim_whitespace=True)
df
#    id  name            address         gender  birthday
# 0   1  Zhu Mengxue     Global Village  Female  2004/11/2
# 1   2  Xu Wenbo        Moon Star       Female  2003/8/7
# 2   3  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 3   4  Fu Yanxu        Kehasing        Male    2003/10/11
# 4   5  Wang Jie        Charsing        Male    2002/6/12
# 5   6  Tung Chak Yue   Tasanis         Male    2002/2/12
- (4) header: the line number used as the column name, and the beginning of the data.
- The default behavior is to infer the column names: if no names are passed, it behaves like header=0 and the column names are inferred from the first line of the file; if column names are passed explicitly, it behaves like header=None.
- Explicitly passing header=0 replaces any existing names. header can also be a list of integers specifying the row positions of a multi-level column index, e.g. [0, 1, 3]; unspecified rows in between are skipped (row 2 in that example). A sketch of the list form follows these notes.
- Note that if skip_blank_lines=True, this parameter ignores comment lines and blank lines, so header=0 means the first line of data, not the first physical line of the file.
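For the list form of header mentioned above, here is a minimal sketch; the two-row-header CSV is made up for illustration, and passing header=[0, 1] turns the first two lines into a MultiIndex over the columns.

import io
import pandas as pd

data = io.StringIO(
    "info,info,score,score\n"   # first header row
    "id,name,math,english\n"    # second header row
    "1,Zhu Mengxue,90,85\n"
    "2,Xu Wenbo,88,92\n"
)
df = pd.read_csv(data, header=[0, 1])
print(df.columns.tolist())
# [('info', 'id'), ('info', 'name'), ('score', 'math'), ('score', 'english')]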
- (5) names: when names is not given, header defaults to 0, i.e. the first row of the file is used for the column names; when names is given and header is not, header becomes None; if both are given, the two parameters take effect in combination.
- (a) names is not assigned and header is not assigned:
- In this case, header is 0, which means that the first line of the file is selected as the header.
pd.read_csv(r"data\")
#    id  name            address         gender  birthday
# 0   1  Zhu Mengxue     Global Village  Female  2004/11/2
# 1   2  Xu Wenbo        Moon Star       Female  2003/8/7
# 2   3  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 3   4  Fu Yanxu        Kehasing        Male    2003/10/11
# 4   5  Wang Jie        Charsing        Male    2002/6/12
# 5   6  Tung Chak Yue   Tasanis         Male    2002/2/12
- (b) names is not assigned, header is assigned:
- If you specify header as 1 without names, the second row is selected as the table header and the data is below the second row.
pd.read_csv(r"data\", header=1)
#    1  Zhu Mengxue     Global Village  Female  2004/11/2
# 0  2  Xu Wenbo        Moon Star       Female  2003/8/7
# 1  3  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 2  4  Fu Yanxu        Kehasing        Male    2003/10/11
# 3  5  Wang Jie        Charsing        Male    2002/6/12
# 4  6  Tung Chak Yue   Tasanis         Male    2002/2/12
- (c) names is assigned, header is not:
pd.read_csv(r"data\", names=["No.", "Name", "Address", "Gender", "Date of birth"])
#    No.  Name            Address         Gender  Date of birth
# 0  id   name            address         gender  birthday
# 1  1    Zhu Mengxue     Global Village  Female  2004/11/2
# 2  2    Xu Wenbo        Moon Star       Female  2003/8/7
# 3  3    Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 4  4    Fu Yanxu        Kehasing        Male    2003/10/11
# 5  5    Wang Jie        Charsing        Male    2002/6/12
# 6  6    Tung Chak Yue   Tasanis         Male    2002/2/12
- As you can see, names applies when the file has no header. If you specify names without header, header is treated as None.
- In general, a file being read has a header, usually on the first line, but some files have no header at all. In that case you can manually supply one through names, and everything inside the file is treated as data.
- So here the row id, name, address, gender, birthday is also treated as a record; it was originally the header, but because we specified names it becomes data, and the header is what we put inside names.
- (d) Both names and header are assigned values:
pd.read_csv(r"data\", names=["No.", "Name", "Address", "Gender", "Date of birth"], header=1)
#    No.  Name            Address         Gender  Date of birth
# 0  2    Xu Wenbo        Moon Star       Female  2003/8/7
# 1  3    Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 2  4    Fu Yanxu        Kehasing        Male    2003/10/11
# 3  5    Wang Jie        Charsing        Male    2002/6/12
# 4  6    Tung Chak Yue   Tasanis         Male    2002/2/12
- In this case, ignore names at first and look only at header: header=1 means the second line is taken as the header and everything below it as data; then that header is replaced with names.
- The main scenarios for names and header are as follows (a compact sketch follows this list):
- (1) The csv file has a header and it is the first line: neither names nor header needs to be specified;
- (2) The csv file has a header, but it is not the first line (the real header and data start a few lines down): specify header;
- (3) The csv file has no header and contains pure data: generate a header manually through names;
- (4) The csv file has a header but you do not want to use it: specify both names and header; header first selects the header row and the data, and then names replaces that header, which is the same as reading the data in and then renaming the columns.
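A compact sketch of the four scenarios; the path "data/students.csv" is a placeholder standing in for the students file used throughout this article.

import pandas as pd

cols = ["No.", "Name", "Address", "Gender", "Date of birth"]

pd.read_csv("data/students.csv")                        # (1) header on the first line: defaults are enough
pd.read_csv("data/students.csv", header=1)              # (2) header is not the first line: point header at it
pd.read_csv("data/students.csv", names=cols)            # (3) no header at all: supply one through names
pd.read_csv("data/students.csv", names=cols, header=0)  # (4) ignore the file's header: names replaces it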
- (6) index_col: by default the index of the DataFrame we get after reading is 0, 1, 2, ...; we can set an index afterwards with set_index, but we can also designate a column as the index while reading.
df = pd.read_csv(r"data\", index_col="birthday")
df
#             id  name            address         gender
# birthday
# 2004/11/2    1  Zhu Mengxue     Global Village  Female
# 2003/8/7     2  Xu Wenbo        Moon Star       Female
# 2004/11/2    3  Zhang Zhaoyuan  Elsing          Female
# 2003/10/11   4  Fu Yanxu        Kehasing        Male
# 2002/6/12    5  Wang Jie        Charsing        Male
# 2002/2/12    6  Tung Chak Yue   Tasanis         Male
The same result can be obtained after reading by copying the column into the index and then deleting that column.

df.index = df['birthday']
del df['birthday']
df
#             id  name            address         gender
# birthday
# 2004/11/2    1  Zhu Mengxue     Global Village  Female
# 2003/8/7     2  Xu Wenbo        Moon Star       Female
# 2004/11/2    3  Zhang Zhaoyuan  Elsing          Female
# 2003/10/11   4  Fu Yanxu        Kehasing        Male
# 2002/6/12    5  Wang Jie        Charsing        Male
# 2002/2/12    6  Tung Chak Yue   Tasanis         Male
Above we specified the birthday column as the index when reading. Besides a single column, we can also specify several columns as the index, such as ["gender", "birthday"]. We can also pass column positions instead of names: id, name, address, gender, birthday correspond to positions 0, 1, 2, 3, 4.
df2 = pd.read_csv(r"data\", index_col=["gender", "birthday"])
df2
#                     id  name            address
# gender birthday
# Female 2004/11/2     1  Zhu Mengxue     Global Village
#        2003/8/7      2  Xu Wenbo        Moon Star
#        2004/11/2     3  Zhang Zhaoyuan  Elsing
# Male   2003/10/11    4  Fu Yanxu        Kehasing
#        2002/6/12     5  Wang Jie        Charsing
#        2002/2/12     6  Tung Chak Yue   Tasanis
Selecting on this index with loc works as usual.

df2.loc["Female"]
#             id  name            address
# birthday
# 2004/11/2    1  Zhu Mengxue     Global Village
# 2003/8/7     2  Xu Wenbo        Moon Star
# 2004/11/2    3  Zhang Zhaoyuan  Elsing
- (7) usecols: returns a subset of the columns.
- If list-like, all elements must either be positional (i.e. integer indexes into the file's columns) or strings that correspond to column names provided by the user in names or inferred from the header row. If names are given, the header row is not considered. A positional sketch follows the example below.
pd.read_csv(r"data\", usecols=["name", "birthday"])
#    name            birthday
# 0  Zhu Mengxue     2004/11/2
# 1  Xu Wenbo        2003/8/7
# 2  Zhang Zhaoyuan  2004/11/2
# 3  Fu Yanxu        2003/10/11
# 4  Wang Jie        2002/6/12
# 5  Tung Chak Yue   2002/2/12
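As noted above, usecols also accepts column positions instead of names; a sketch (again with the placeholder path "data/students.csv"):

# 1 -> name, 4 -> birthday; returns the same two columns as the example above
pd.read_csv("data/students.csv", usecols=[1, 4])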
2. Generic parsing parameters
(1) encoding: specifies the encoding of the file, e.g. utf-8 or gbk.
pd.read_csv(r"data\students_gbk.csv")
# UnicodeDecodeError
- If the error raised is a UnicodeDecodeError, think of encoding problems first.
- pandas reads in utf-8 format by default.
pd.read_csv(r"data\students_gbk.csv", encoding="gbk")
#    id  name            address         gender  birthday
# 0   1  Zhu Mengxue     Global Village  Female  2004/11/2
# 1   2  Xu Wenbo        Moon Star       Female  2003/8/7
# 2   3  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 3   4  Fu Yanxu        Kehasing        Male    2003/10/11
# 4   5  Wang Jie        Charsing        Male    2002/6/12
# 5   6  Tung Chak Yue   Tasanis         Male    2002/2/12
- (2) dtype: sets the type of a column while reading the data.
- For example, employee ids are often stored like 00001234; read with the defaults this is displayed as 1234, so we have to convert the column to a string type for it to be displayed as 00001234.
df = pd.read_csv(r"data\students_step_001.csv", sep="|")
df
#    id  name            address         gender  birthday
# 0   1  Zhu Mengxue     Global Village  Female  2004/11/2
# 1   2  Xu Wenbo        Moon Star       Female  2003/8/7
# 2   3  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 3   4  Fu Yanxu        Kehasing        Male    2003/10/11
# 4   5  Wang Jie        Charsing        Male    2002/6/12
# 5   6  Tung Chak Yue   Tasanis         Male    2002/2/12
If we set the data type of the id to a string, we can display it as something like 001.
df = pd.read_csv(r"data\students_step_001.csv", sep="|", dtype={"id": str})
df
#    id   name            address         gender  birthday
# 0  001  Zhu Mengxue     Global Village  Female  2004/11/2
# 1  002  Xu Wenbo        Moon Star       Female  2003/8/7
# 2  003  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 3  004  Fu Yanxu        Kehasing        Male    2003/10/11
# 4  005  Wang Jie        Charsing        Male    2002/6/12
# 5  006  Tung Chak Yue   Tasanis         Male    2002/2/12
- (3) converters:Transforms the column data while reading it.
- For example, increase id by 10. Note the int(x): when the converters argument is used, the parser hands every value to the converter as a string, so a type conversion is required.
pd.read_csv('data\', converters={"id": lambda x: int(x) + 10})
#    id  name            address         gender  birthday
# 0  11  Zhu Mengxue     Global Village  Female  2004/11/2
# 1  12  Xu Wenbo        Moon Star       Female  2003/8/7
# 2  13  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 3  14  Fu Yanxu        Kehasing        Male    2003/10/11
# 4  15  Wang Jie        Charsing        Male    2002/6/12
# 5  16  Tung Chak Yue   Tasanis         Male    2002/2/12
- (4) true_values and false_values: specify which values should be converted to True and which to False.
- Let's take gender as an example, with Male set to True and Female set to False.
pd.read_csv('data\', true_values=['Male'], false_values=['Female'])
#    id  name            address         gender  birthday
# 0   1  Zhu Mengxue     Global Village  False   2004/11/2
# 1   2  Xu Wenbo        Moon Star       False   2003/8/7
# 2   3  Zhang Zhaoyuan  Elsing          False   2004/11/2
# 3   4  Fu Yanxu        Kehasing        True    2003/10/11
# 4   5  Wang Jie        Charsing        True    2002/6/12
# 5   6  Tung Chak Yue   Tasanis         True    2002/2/12
- The replacement rule here is that a column is only converted if every value in it appears in true_values + false_values. See the sketch below.
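A sketch of that rule with a throwaway in-memory table (not the students data): every value in column a appears in true_values + false_values, so it is converted to bool; column b contains "maybe", which is not listed, so it is left unchanged.

import io
import pandas as pd

data = io.StringIO("a,b\nyes,yes\nno,maybe\n")
print(pd.read_csv(data, true_values=["yes"], false_values=["no"]))
#        a      b
# 0   True    yes
# 1  False  maybe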
- (5) skiprows: the rows to skip; pass the rows you want to drop to skiprows as a list. Note that the rows are skipped first, and only then is the header determined. For example:
pd.read_csv('data\', skiprows=[0, 3])
#    1  Zhu Mengxue    Global Village  Female  2004/11/2
# 0  2  Xu Wenbo       Moon Star       Female  2003/8/7
# 1  4  Fu Yanxu       Kehasing        Male    2003/10/11
# 2  5  Wang Jie       Charsing        Male    2002/6/12
# 3  6  Tung Chak Yue  Tasanis         Male    2002/2/12
Here the first row, which was the header, is skipped, so after skipping, the second row becomes the header. Besides passing concrete row numbers, you can also pass a function.
pd.read_csv('data\', skiprows=lambda x: x > 0 and x % 2 == 0)
#    id  name            address         gender  birthday
# 0   1  Zhu Mengxue     Global Village  Female  2004/11/2
# 1   3  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 2   5  Wang Jie        Charsing        Male    2002/6/12
- Since the row numbering starts at 0, every row whose number is greater than 0 and divisible by 2 is skipped. The "greater than 0" condition ensures the header row is not skipped.
- (6) skipfooter: skips the given number of lines at the end of the file.
pd.read_csv('data\', skipfooter=1)
Running the code above triggers a warning/error and the data in the table comes out garbled; the reasons are explained below.
pd.read_csv('data\', skipfooter=1, engine="python", encoding="utf-8")
#    id  name            address         gender  birthday
# 0   1  Zhu Mengxue     Global Village  Female  2004/11/2
# 1   2  Xu Wenbo        Moon Star       Female  2003/8/7
# 2   3  Zhang Zhaoyuan  Elsing          Female  2004/11/2
# 3   4  Fu Yanxu        Kehasing        Male    2003/10/11
# 4   5  Wang Jie        Charsing        Male    2002/6/12
- pandas parses the data with an engine; there are two parsing engines, c and python. c is the default because it parses faster, but its features are not as complete as the python engine's.
- skipfooter receives an integer and drops that many lines counting up from the end of the file. Because this feature forces the engine to fall back to python, you need to specify engine="python" yourself, otherwise you get a warning. You may also need to specify encoding="utf-8": when the csv has encoding issues and the engine falls back to python, reading on Windows otherwise produces garbled text.
- (7) nrows: the number of rows of the file to read in at once, which is useful for large files, for example files of hundreds of gigabytes that a PC with 16 GB of RAM cannot hold.
pd.read_csv('data\', nrows=3)
#    id  name            address         gender  birthday
# 0   1  Zhu Mengxue     Global Village  Female  2004/11/2
# 1   2  Xu Wenbo        Moon Star       Female  2003/8/7
# 2   3  Zhang Zhaoyuan  Elsing          Female  2004/11/2
3. Parameters related to null processing
na_values: This parameter configures which values need to be processed as NaN.
pd.read_csv('data\', na_values=["Female", "Zhu Mengxue"])
#    id  name            address         gender  birthday
# 0   1  NaN             Global Village  NaN     2004/11/2
# 1   2  Xu Wenbo        Moon Star       NaN     2003/8/7
# 2   3  Zhang Zhaoyuan  Elsing          NaN     2004/11/2
# 3   4  Fu Yanxu        Kehasing        Male    2003/10/11
# 4   5  Wang Jie        Charsing        Male    2002/6/12
# 5   6  Tung Chak Yue   Tasanis         Male    2002/2/12
You can see that both Female and Zhu Mengxue are replaced by NaN, even though they appear in different columns: the listed values are matched in every column. A per-column form is sketched below.
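If a value should only become NaN in one particular column, na_values also accepts a dict mapping column names to values; a sketch with the usual placeholder path "data/students.csv":

# Only "Female" in the gender column is treated as NaN; other columns are untouched.
pd.read_csv("data/students.csv", na_values={"gender": ["Female"]})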
4. Time processing-related parameters
- parse_dates: specifies that certain columns should be parsed as datetimes; this parameter is often used together with date_parser.
- date_parser: used together with parse_dates; some columns hold dates that cannot be converted directly, so we need to specify a parsing format. A hedged sketch follows the next example.
df = pd.read_csv('data\')
df.dtypes
# id           int64
# name        object
# address     object
# gender      object
# birthday    object
# dtype: object
We set birthday to a time type via parse_dates.
df = pd.read_csv('data\', parse_dates=["birthday"])
df.dtypes
# id                   int64
# name                object
# address             object
# gender              object
# birthday    datetime64[ns]
# dtype: object
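For dates that cannot be inferred automatically, date_parser (it appears in the signature quoted earlier; newer pandas versions prefer the date_format argument instead) can supply an explicit format. A hedged sketch, again with a placeholder path and assuming date strings such as 2004/11/2:

from datetime import datetime
import pandas as pd

df = pd.read_csv(
    "data/students.csv",                                      # placeholder path
    parse_dates=["birthday"],
    date_parser=lambda s: datetime.strptime(s, "%Y/%m/%d"),   # applied per value as a fallback
)
df.dtypes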
5. Parameters related to reading in chunks
- (1) iterator: of type bool, default False.
- If True, a TextFileReader object is returned for processing the file chunk by chunk. This allows a large file to be read and processed in batches when memory cannot hold all of the data at once.
chunk = pd.read_csv('data\', iterator=True)
chunk
# <pandas.io.parsers.TextFileReader object at 0x1b27f00ef88>
We have already chunked the file and can start by extracting the first two lines.
print(chunk.get_chunk(2))
#    id  name         address         gender  birthday
# 0   1  Zhu Mengxue  Global Village  Female  2004/11/2
# 1   2  Xu Wenbo     Moon Star       Female  2003/8/7
Four rows are left in the file, but we ask for 100; this does not raise an error: when fewer rows remain than requested, it simply returns however many are left.
print(chunk.get_chunk(100))
#    id  name            address   gender  birthday
# 2   3  Zhang Zhaoyuan  Elsing    Female  2004/11/2
# 3   4  Fu Yanxu        Kehasing  Male    2003/10/11
# 4   5  Wang Jie        Charsing  Male    2002/6/12
# 5   6  Tung Chak Yue   Tasanis   Male    2002/2/12
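Once the reader is exhausted, another get_chunk call raises StopIteration, so a minimal guard looks like this (sketch, reusing the chunk object from above):

try:
    print(chunk.get_chunk(2))
except StopIteration:
    print("no more data to read")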
It is important to note that once everything has been read, reading again raises an error, as guarded against in the sketch above.
- (2) chunksize: an integer, default None, which sets the size of each chunk. With chunksize set, read_csv still returns an iterator-like TextFileReader object, and chunksize serves as the default number of rows when get_chunk is called without an argument.
chunk = pd.read_csv('data\', chunksize=2)
print(chunk)
print(chunk.get_chunk())
# <pandas.io.parsers.TextFileReader object at 0x000001B27F05C5C8>
#    id  name         address         gender  birthday
# 0   1  Zhu Mengxue  Global Village  Female  2004/11/2
# 1   2  Xu Wenbo     Moon Star       Female  2003/8/7
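Instead of calling get_chunk repeatedly, the reader returned with chunksize can also be consumed with a plain for loop, which is the more common pattern for large files; a sketch with the usual placeholder path:

import pandas as pd

for piece in pd.read_csv("data/students.csv", chunksize=2):
    print(piece.shape)  # each piece is a DataFrame with at most 2 rows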
- Calling print(chunk.get_chunk()) twice more reads out the remaining data step by step, since chunksize is set to 2 here.
- We can also pass an explicit row count to chunk.get_chunk().
- These are most of the parameters of pandas' read_csv function, and some of them also apply to reading other types of files.
- In practice, only a handful of parameters are needed for reading csv files, and many of these we would rarely use; still, it is worth knowing them, because in certain scenarios they can be very handy for solving specific problems.
- Personally, I feel that reading files in chunks with these parameters has improved efficiency a lot in my recent work.
That is all for this detailed introduction to Pandas file operations and the parameters of read_csv.