SoFunction
Updated on 2024-11-12

Python data cleaning: merging, transforming, filtering, and sorting data

Earlier we covered some basic pandas operations; next we take a closer look at manipulating data.
Data cleaning has always been an extremely important part of data analysis.

Data consolidation

You can merge data in pandas with the merge function.

import numpy as np
import pandas as pd
data1 = pd.DataFrame({'level':['a','b','c','d'],
                      'number':[1,3,5,7]})

data2 = pd.DataFrame({'level':['a','b','c','e'],
                      'number':[2,3,6,10]})
print(data1)

Results for:

print(data2) 

Results for:

print(pd.merge(data1,data2))

Results for:


By default, merge joins on the columns that data1 and data2 have in common and keeps only the rows whose key values appear in both; all other rows are discarded. This is equivalent to an inner join in SQL.
Other join types, such as outer, right, and left, are selected with the how keyword.
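As a rough sketch of how the how keyword changes the result, the snippet below merges two small frames with each join type. The frame names and the left_val and right_val columns are made up for illustration, and on='level' is passed explicitly so that only that column is used as the key.

```python
import pandas as pd

# Hypothetical frames: 'd' exists only on the left, 'e' only on the right.
left = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'left_val': [1, 3, 5, 7]})
right = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'right_val': [2, 3, 6, 10]})

# Run the same merge once per join type and keep the results by name.
joined = {how: pd.merge(left, right, on='level', how=how)
          for how in ('inner', 'left', 'right', 'outer')}

for how, frame in joined.items():
    # Print which key values survive each kind of join.
    print(how, sorted(frame['level']))
```

The inner join keeps only the shared keys, left and right each keep every key from their own side, and outer keeps the union, with NaN wherever one side has no match.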

data3 = pd.DataFrame({'level1':['a','b','c','d'],
                      'number1':[1,3,5,7]})
data4 = pd.DataFrame({'level2':['a','b','c','e'],
                      'number2':[2,3,6,10]})
print(pd.merge(data3,data4,left_on='level1',right_on='level2'))

Results for:


If the key columns in the two DataFrames have different names, we can still join the data by specifying the left_on and right_on parameters.
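One detail worth seeing in a small example: when left_on and right_on are used, both key columns appear in the result. The sketch below uses made-up column names (value1, value2) and shows one possible way to drop the redundant key afterwards.

```python
import pandas as pd

# Hypothetical frames whose key columns are named differently.
a = pd.DataFrame({'level1': ['a', 'b', 'c', 'd'], 'value1': [1, 3, 5, 7]})
b = pd.DataFrame({'level2': ['a', 'b', 'c', 'e'], 'value2': [2, 3, 6, 10]})

# Inner join on level1 == level2; both key columns are kept.
merged = pd.merge(a, b, left_on='level1', right_on='level2')
print(merged.columns.tolist())

# After an inner join the two keys hold the same values, so one can be dropped.
tidy = merged.drop(columns='level2')
print(tidy)
```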

print(pd.merge(data3,data4,left_on='level1',right_on='level2',how='left'))

Results for:


Overlapping data merging

Sometimes overlapping data needs to be merged; for this you can use the combine_first function.

data3 = pd.DataFrame({'level':['a','b','c','d'],
                      'number1':[1,3,5,np.nan]})
data4 = pd.DataFrame({'level':['a','b','c','e'],
                      'number2':[2,np.nan,6,10]})
print(data3.combine_first(data4))

Results for:


Where both frames have a non-missing value at the same label, the value from data3 takes priority; where an element of data3 is missing, the corresponding element of data4 fills the gap.

The usage here is similar to np.where(pd.isnull(a),b,a).
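To make that comparison concrete, here is a small sketch with two hypothetical Series that share an index. It patches the holes in a with values from b both ways and checks that the results agree.

```python
import numpy as np
import pandas as pd

# Hypothetical Series sharing one index; each has gaps in different places.
a = pd.Series([1.0, np.nan, 5.0, np.nan], index=['w', 'x', 'y', 'z'])
b = pd.Series([2.0, 4.0, np.nan, 8.0], index=['w', 'x', 'y', 'z'])

# combine_first: take a where it has a value, otherwise fall back to b.
patched = a.combine_first(b)

# The same rule spelled out element by element with np.where.
by_hand = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)

print(patched)
print(patched.equals(by_hand))
```

The np.where version only matches when the two objects are already aligned; combine_first additionally aligns on the index labels first.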

Data reshaping and axial rotation

We covered this in the previous pandas article. Reshaping mainly uses the reshape function, and axis rotation mainly uses the stack and unstack functions.
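As a minimal sketch of the two rotation calls, the snippet below stacks a small frame into a Series with a two-level index and then unstacks it again. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(3, 4),
                     columns=['a', 'b', 'c', 'd'],
                     index=['wang', 'li', 'zhang'])

# stack: rotate the column labels into an inner index level,
# giving a Series keyed by (row label, column label).
stacked = frame.stack()
print(stacked[('li', 'c')])

# unstack: rotate the inner index level back out into columns.
restored = stacked.unstack()
print(restored)
```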

data=pd.DataFrame(np.arange(12).reshape(3,4),
       columns=['a','b','c','d'],
       index=['wang','li','zhang'])
print(data)

Results for:

print(data.unstack())

Results for:

Data conversion

Remove duplicate rows of data

data=pd.DataFrame({'a':[1,3,3,4],
       'b':[1,3,3,5]})
print(data)

Results for:

print(data.duplicated())

Results for:


The third row repeats the data of the second row, so the result for that row is True.
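The duplicated check can also be limited to certain columns, or told to flag every copy rather than only the later ones. The sketch below assumes an extra column c (not in the original data) to show the difference.

```python
import pandas as pd

# Same a/b values as above, plus a hypothetical column c that differs per row.
df = pd.DataFrame({'a': [1, 3, 3, 4],
                   'b': [1, 3, 3, 5],
                   'c': [10, 20, 30, 40]})

# Whole rows differ because of c, so nothing is flagged.
whole_row = df.duplicated()

# Compare only a and b: the second occurrence is flagged.
by_key = df.duplicated(subset=['a', 'b'])

# keep=False flags every member of a duplicate group.
every_copy = df.duplicated(subset=['a', 'b'], keep=False)

print(whole_row.tolist(), by_key.tolist(), every_copy.tolist())
```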

Alternatively, you can remove duplicate rows with the drop_duplicates method

print(data.drop_duplicates()) 

Results for:

Replacing values

In addition to the fillna method mentioned in our previous post, you can also use the replace method, which is simpler to use.

data=pd.DataFrame({'a':[1,3,3,4],
       'b':[1,3,3,5]})
print(data.replace(1,2))

Results for:


Several values can be replaced at once:

print(data.replace([1,4],np.nan))
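As a short sketch of the multi-value forms of replace: a list maps several values to one replacement, while a dict gives each value its own replacement. The replacement values 100 and 400 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})

# List form: every 1 and every 4 becomes NaN.
to_nan = df.replace([1, 4], np.nan)

# Dict form: 1 becomes 100 and 4 becomes 400; other values are untouched.
mapped = df.replace({1: 100, 4: 400})

print(to_nan)
print(mapped)
```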

Data segmentation

data=[11,15,18,20,25,26,27,24]
bins=[15,20,25]
print(data)
print(pd.cut(data,bins))

Results for:
[11, 15, 18, 20, 25, 26, 27, 24]
[NaN, NaN, (15, 20], (15, 20], (20, 25], NaN, NaN, (20, 25]]
Categories (2, object): [(15, 20] < (20, 25]]

After segmentation, values that fall outside every segment are shown as NaN, and the others are shown as the segment they belong to. Each segment is open on the left and closed on the right, which is why 15 is not in (15, 20].
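Two options of cut are worth a quick sketch: labels gives the segments readable names, and right=False closes the segments on the left instead. The bin edges and the label names below are assumptions chosen so that every value lands in a segment.

```python
import pandas as pd

values = [11, 15, 18, 20, 25, 26, 27, 24]
edges = [10, 20, 30]  # hypothetical edges covering all the values

# Named segments: (10, 20] is 'low', (20, 30] is 'high'.
named = pd.cut(values, edges, labels=['low', 'high'])
print(list(named))

# right=False makes the segments [10, 20) and [20, 30),
# so the value 20 moves into the second segment.
left_closed = pd.cut(values, edges, right=False)
print(left_closed.codes.tolist())
```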

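print(pd.cut(data,bins).codes)  # .codes was called .labels in older pandas releases

Results for:

[-1 -1 0 0 1 -1 -1 1]

This shows the integer code of the segment each value falls in; -1 marks a value outside every segment.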

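print(pd.cut(data,bins).categories)  # .categories was called .levels in older pandas releases

Results for:

Index(['(15, 20]', '(20, 25]'], dtype='object')

This shows all the segment labels. Current pandas releases return an IntervalIndex holding the same two intervals rather than an Index of strings.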

print(pd.cut(data,bins).value_counts())

Results for:


This displays the number of values in each segment.

There is also a qcut function, which cuts the data into quantile-based segments (for example, passing 4 cuts it into quartiles); it is used in much the same way as cut.
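A minimal sketch of qcut on the same eight values: because the edges are taken from the data's own quantiles, each of the four segments ends up with the same number of values.

```python
import pandas as pd

values = [11, 15, 18, 20, 25, 26, 27, 24]

# Ask for 4 quantile-based segments (quartiles).
quartiles = pd.qcut(values, 4)

# Count how many values landed in each segment.
counts = quartiles.value_counts()
print(counts)
```

With cut the segment widths are fixed and the counts vary; with qcut the counts are (as near as possible) equal and the widths vary.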

Permutation and sampling

We already know several ways to order data, such as sort and order (sort_values and sort_index in current pandas releases) and rank.
What we cover now is putting the data into random order (permutation).

data=np.random.permutation(5)
print(data)

Results for:

[1 0 4 2 3]

Here the permutation function has put the integers 0 to 4 into random order.
Data can also be sampled:

df=pd.DataFrame(np.arange(12).reshape(4,3))
samp=np.random.permutation(3)
print(df)

Results for:

print(samp)

Results for:
[1 0 2]

print(df.take(samp))

Results for:


Here take extracts the rows of df at the positions listed in samp, in that order.
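The same idea can be sketched as a shuffle of the whole frame, or as a random sample without replacement by taking only the first few permuted positions. The seeded generator and the sample size of 2 are choices made here so the run is repeatable; they are not part of the original example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so repeated runs give the same order
df = pd.DataFrame(np.arange(12).reshape(4, 3))

# A random ordering of all row positions 0..3.
order = rng.permutation(len(df))

# Taking every position shuffles the frame.
shuffled = df.take(order)

# Taking the first two positions samples 2 rows without replacement.
subset = df.take(order[:2])

print(shuffled)
print(subset)
```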