SoFunction
Updated on 2024-11-10

Examples of using pandas to process CSV files in Python

I. Introduction to pandas

pandas is a third-party data analysis library that integrates a large number of data analysis tools, making it easy to process and analyze all kinds of data. Being a third-party library, pandas must be installed first, which can be done with the following command:

pip install pandas

Processing CSV files with pandas is divided into 3 main steps:

  • Using the read_csv() function, load the data into a pandas DataFrame object, a two-dimensional data structure that integrates many data processing methods.
  • Manipulate the DataFrame object, accomplishing the various data processing tasks through its built-in methods.
  • Write the data back to a CSV file via the DataFrame object's to_csv() method.
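The three steps can be sketched end to end. In the sketch below an in-memory buffer (io.StringIO) stands in for the CSV file so the example is self-contained; with a real file you would pass its path to read_csv() and to_csv() instead:

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file; a real file path works the same way
csv_data = io.StringIO(
    "emp_no,birth_date,first_name,last_name,gender,salary\n"
    "1,1953-09-02,Georgi,Facello,M,1000\n"
    "2,1964-06-02,Bezalel,Simmel,F,2000\n"
)

# Step 1: read the CSV into a DataFrame
df = pd.read_csv(csv_data)

# Step 2: process the data (here: give everyone a 500 raise)
df["salary"] = df["salary"] + 500

# Step 3: write the result back out as CSV
out = io.StringIO()
df.to_csv(out, index=False)
```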

II. Examples of usage

Let's create a sample file, copy the following data into the file and save it:

emp_no,birth_date,first_name,last_name,gender,salary
1,1953-09-02,Georgi,Facello,M,1000
2,1964-06-02,Bezalel,Simmel,F,2000
3,1959-12-03,Parto,Bamford,M,3000
4,1954-05-01,Chirstian,Koblick,M,4000
5,1955-01-21,Kyoichi,Maliniak,M,5000
6,1953-04-20,Anneke,Preusig,F,6000
7,1957-05-23,Tzvetan,Zielinski,F,7000
8,1958-02-19,Saniya,Kalloufi,M,8000
9,1952-04-19,Sumant,Peac,F,9000

For reference, the data corresponds to a spreadsheet (Excel) table with these six columns and one row per employee.

2.1 Reading CSV files

Ensure that the file is in the current directory (or supply its absolute path instead). For example, if the sample file is saved in the d:\dir1 directory, switch to that directory first:

import os
os.chdir(r'd:\dir1')

pandas' read_csv() function reads a CSV file and returns a DataFrame object. First import the pandas module, then use the read_csv() function to read the CSV file and assign the returned DataFrame object to the variable df (the sample file is assumed here to be saved as employees.csv):

import pandas as pd
df = pd.read_csv('employees.csv')  # file name assumed; use whatever you named the sample file
df

2.1.1 read_csv parameters

read_csv() offers many customization options for the reading process. Only the file name was provided in the example above, and default values were used for the other parameters. Depending on the data format, you may need to adjust some of them. The common read_csv() parameters are as follows:

  • sep/delimiter: the delimiter, a comma by default (equivalent to sep=','). If the file contents are not comma-separated, you need to specify this parameter explicitly, or use sep=None to let pandas detect the delimiter itself.
  • delim_whitespace: whether to use whitespace as the separator, equivalent to sep='\s+'. When this parameter is set to True, the sep parameter cannot be used. (Recent pandas versions deprecate it in favour of sep='\s+'.)
  • header: which row holds the column names, and therefore where the data starts. The default is header=0, i.e. the first row (numbered from 0) holds the column names and the data starts from the second row. If the data has no column names, specify header=None so that everything from the first row on is treated as data.
  • names: used to customize column names
  • index_col: the column to use as the row index (the default is an integer index starting at 0)
  • usecols: a subset of columns to select, i.e. only the specified columns are read
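A quick sketch of several of these parameters working together (the in-memory data below is hypothetical): a headerless file needs header=None, names supplies the column names, and index_col picks the row index:

```python
import io
import pandas as pd

# Headerless data: the first line is already data
raw = io.StringIO("1,Georgi,1000\n2,Bezalel,2000\n")

df_custom = pd.read_csv(
    raw,
    header=None,                         # the file has no column-name row
    names=["emp_no", "name", "salary"],  # supply our own column names
    index_col="emp_no",                  # use emp_no as the row index
)
```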

Example: select only some of the columns, reading just emp_no, first_name and salary, with usecols specifying these 3 columns:

df1 = pd.read_csv('employees.csv', usecols=['emp_no', 'first_name', 'salary'])

Custom column names: read the data and name the columns col1~col6. Since the first line of the original data holds the column names, use header=0 so that the first line is not read as data, then use the names parameter to re-specify the column names:

name_list = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
df2 = pd.read_csv('employees.csv', header=0, names=name_list)

2.1.2 Handling of "bad lines"

The data we get is often not well-formed: some rows may be missing fields and others may have extra ones. When read_csv() encounters a row with missing fields, it automatically fills the missing columns with NaN (which represents a null value in pandas). If we delete the salary on line 7 of the file and re-read it, we can see the gap filled with NaN:

df2 = pd.read_csv('employees.csv')
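The NaN-filling behaviour can also be reproduced without editing the file, using a small in-memory sample where one row is missing its salary:

```python
import io
import pandas as pd

raw = io.StringIO(
    "emp_no,salary\n"
    "6,6000\n"
    "7,\n"        # salary missing on this row
    "8,8000\n"
)
df_nan = pd.read_csv(raw)

# The missing field is read as NaN; the column becomes float to hold it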

Rows with extra columns, however, raise an error by default. After adding an extra field to line 8 of the file, reading it triggers a parsing error: 6 columns were expected, but 7 were found:

df3 = pd.read_csv('employees.csv')

For this type of error, we can pass on_bad_lines='skip' to skip the offending lines without affecting the rest of the data. The result shows that the row with emp_no 8 is ignored:

df3 = pd.read_csv('employees.csv', on_bad_lines='skip')
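The same skipping behaviour, demonstrated on a small in-memory sample where one row has an extra field:

```python
import io
import pandas as pd

raw = io.StringIO(
    "emp_no,salary\n"
    "6,6000\n"
    "8,8000,99\n"  # bad line: 3 fields where 2 are expected
    "9,9000\n"
)
df_ok = pd.read_csv(raw, on_bad_lines="skip")  # the bad line is dropped
```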

2.2 Referencing data

After reading the file we have a DataFrame object, whose built-in methods let us preprocess the data quickly; compared with writing the equivalent Python code by hand, this saves a great deal of time.

The first step in processing data is referencing it. The data-referencing methods commonly used in pandas are:

  • Using [] to reference data
  • Using the .loc attribute to reference data via labels
  • Using the .iloc attribute to reference data by position

2.2.1 Position Indexes and Label Indexes

Before referencing data, be clear about the difference between position indexes and label indexes:

  • Position index: the position numbering of rows/columns, starting from 0 and increasing by 1 (0, 1, 2, 3, 4, ...); always numeric.
  • Label index: a customizable "alias" for a row/column. Row label indexes are also called "index labels"; column label indexes are also called "column labels".
    If not explicitly specified, the label index defaults to the same values as the position index.

For example, in the df read above, the column labels are emp_no, birth_date, and so on. Since the index labels were not explicitly specified, they hold the same values as the position indexes, 0, 1, 2, 3, 4, ... — but they are still labels, not position indexes.
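The distinction can be demonstrated on a DataFrame with a default index, where labels and positions hold the same values yet behave differently (a sketch on hypothetical in-memory data; df_demo[0:2] slices by position and excludes the endpoint, while df_demo.loc[0:2], introduced below, slices by label and includes it):

```python
import io
import pandas as pd

raw = io.StringIO("emp_no,salary\n1,1000\n2,2000\n3,3000\n4,4000\n")
df_demo = pd.read_csv(raw)

# Default index: the labels 0,1,2,3 have the same values as the positions
pos = df_demo[0:2]      # positional slice: rows at positions 0 and 1 (stop excluded)
lab = df_demo.loc[0:2]  # label slice: rows labelled 0, 1 AND 2 (stop included)
```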

The index labels and column labels can be viewed separately through the index and columns attributes:

df.index
df.columns

The labels can be changed by assigning to the corresponding attribute. The example below makes this easy to see: afterwards, 0, 1, 2, 3 are index labels, not positions:

df.index = [0, 1, 2, 3, 'a', 'b', 'c', 'd', 'e']

2.2.2 Referencing data using []

Using the df['column label'] format, data columns can be referenced through their column labels, for example selecting the first_name column:

df['first_name']

If you want to reference multiple columns, pass the column labels as a list, e.g. to select emp_no, first_name and last_name:

df[['emp_no', 'first_name', 'last_name']]

Using the df[start:stop:step] format, you can reference rows by position index; this is the same as the standard Python slicing syntax (not detailed here):

df[0:4]  # [0:4] includes the start but not the stop, i.e. the rows with position indexes 0, 1, 2, 3

df[::2]  # select every other row; start and stop omitted mean all rows, 2 is the step

df[::-1]  # a negative step selects from the end; -1 is equivalent to reversing the order

2.2.3 Using the .loc Attribute to Reference Data via Labels

Referencing with [] can be a bit confusing: it uses label indexes when referencing columns but position indexes when referencing rows.

pandas provides more intuitive .loc and .iloc properties:

  • .loc references data exclusively by label indexes (its slices include their endpoints)
  • .iloc references data exclusively by position indexes (its slices exclude their endpoints)

Data can be referenced using df.loc['index labels', 'column labels']. The row and column parts are separated by a comma, slices within each part use colons, and omitting a part means all of it. Note that slices in the .loc attribute include their endpoints, which differs from standard Python slice syntax.

Reference row a (this returns a Series, pandas' one-dimensional data type):

df.loc['a']  # column labels omitted, meaning all columns; equivalent to df.loc['a', :]

Reference the birth_date column:

df.loc[:, 'birth_date']  # row labels represented by the : placeholder, meaning all rows

Reference rows a to e and the emp_no to last_name columns, noting that both row e and the last_name column are included in the sliced result:

df.loc['a':'e', 'emp_no':'last_name']

Reference the single element at row a, column birth_date (no slicing):

df.loc['a', 'birth_date']

2.2.4 Referencing Data by Position Using the .iloc Attribute

.iloc is used much like .loc, except that label indexes are replaced with position indexes. The syntax is df.iloc[row position indexes, column position indexes], and note that .iloc's slices do not include their endpoints (the same as standard Python slicing).

Reference rows 1 and 2:

df.iloc[0:2]  # rows at position indexes 0 and 1; 2 is not included in the result; equivalent to df.iloc[0:2, :]

Reference columns 1 and 2:

df.iloc[:, 0:2]  # the : in the row position means all rows

Reference the data in columns 3 and 4 of rows 1 and 2:

df.iloc[0:2, 2:4]

Combining a data reference with the assignment operator '=' modifies values in the DataFrame; for example, change the salary of emp_no 9 (row label 'e' after the relabeling above) to 9999:

df.loc['e', 'salary'] = 9999

Data can also be referenced through attribute access, but this approach is flawed, not recommended, and not covered here. Just focus on the .loc and .iloc methods.
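As a brief illustration of why attribute access is fragile: it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute, whereas [] and .loc handle any column name. (The DataFrame below is hypothetical.)

```python
import pandas as pd

df_attr = pd.DataFrame({"salary": [1000, 2000], "first name": ["Georgi", "Bezalel"]})

s1 = df_attr.salary            # works, but only because 'salary' is a valid identifier
s2 = df_attr["first name"]     # [] handles any name; df_attr.first name is a syntax error
```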

2.3 Data filtering

Filtering DataFrame data is very convenient. For example, to select the rows with salary greater than 5000, the following expression tests the salary column, producing boolean data:

df['salary'] > 5000

Simply substitute it back into df[] to filter out the rows that meet the condition:

df[df['salary'] > 5000]

Or use the query method to filter the results directly, with the condition given as a string:

df.query('salary > 5000')
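Conditions can also be combined. With boolean masks, wrap each condition in parentheses and join them with & (and) or | (or); query() accepts and/or directly. A sketch on a hypothetical in-memory subset of the sample data:

```python
import io
import pandas as pd

raw = io.StringIO(
    "emp_no,gender,salary\n"
    "1,M,1000\n"
    "2,F,2000\n"
    "6,F,6000\n"
    "7,F,7000\n"
    "8,M,8000\n"
)
df_sub = pd.read_csv(raw)

# Boolean-mask form: parentheses around each condition are required
high_f = df_sub[(df_sub["salary"] > 5000) & (df_sub["gender"] == "F")]

# Equivalent query() form
high_f_q = df_sub.query("salary > 5000 and gender == 'F'")
```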

2.4 Write back to csv file

After the data processing is complete, use the DataFrame object's built-in to_csv method to write the data back to a file; its main parameters are similar to those of read_csv:

  • sep: the separator, a comma by default
  • columns: specifies which columns are written to the file
  • header: whether to write the column names to the file, True by default
  • index: whether to write the row index to the file, True by default
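A sketch of these parameters in action, writing to an in-memory buffer (a file path would work identically); only two columns are written, with a semicolon separator and no row index:

```python
import io
import pandas as pd

df_small = pd.DataFrame(
    {"emp_no": [1, 2], "first_name": ["Georgi", "Bezalel"], "salary": [1000, 2000]}
)

buf = io.StringIO()
df_small.to_csv(
    buf,
    columns=["emp_no", "salary"],  # write only these two columns
    sep=";",                       # semicolon instead of comma
    index=False,                   # omit the row index
)
```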

Suppose that after filtering out the salary > 5000 rows you want to write them to a new CSV file; call the DataFrame's to_csv method directly (the output file name result.csv is assumed here):

df_result = df.query('salary > 5000')
df_result.to_csv('result.csv', index=False)  # output file name assumed

The examples in this article show only the simplest and most commonly used DataFrame data processing methods; pandas' actual data processing capabilities go far beyond these, and interested readers can explore them in depth on their own.

That concludes this detailed walkthrough of using pandas to process CSV files in Python. For more on processing CSV files with pandas, please see my other related articles!