Earlier we covered some basic pandas operations. Next we take a closer look at manipulating data.
Data cleaning has always been an extremely important part of data analysis.
Data consolidation
In pandas you can merge data with the merge function.
import numpy as np
import pandas as pd
data1 = pd.DataFrame({'level':['a','b','c','d'], 'numeber':[1,3,5,7]})
data2 = pd.DataFrame({'level':['a','b','c','e'], 'numeber':[2,3,6,10]})
print(data1)
Results for:
print(data2)
Results for:
print(pd.merge(data1,data2))
Results for:
By default merge joins on the columns that data1 and data2 have in common and keeps only the rows whose key values appear in both frames; all other rows are discarded. This is equivalent to an inner join in SQL.
Other join types such as outer, right, and left are also available and are selected with the how keyword.
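To make the effect of the how keyword concrete, here is a minimal sketch that joins two small frames with each join type. The names left, right, left_val, right_val, and keys_by_how, as well as the explicit on='level' argument, are illustrative choices and do not come from the example above.

```python
import pandas as pd

# Two small frames sharing the key column 'level' (illustrative data)
left = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'left_val': [1, 3, 5, 7]})
right = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'right_val': [2, 3, 6, 10]})

# Join on 'level' with each join type and record which keys survive
keys_by_how = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left, right, on='level', how=how)
    keys_by_how[how] = sorted(merged['level'])
    print(how, keys_by_how[how])
```

An inner join keeps only the keys found in both frames, an outer join keeps every key from either frame, and left and right keep every key from the corresponding frame, filling the missing side with NaN.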
data3 = pd.DataFrame({'level1':['a','b','c','d'], 'numeber1':[1,3,5,7]})
data4 = pd.DataFrame({'level2':['a','b','c','e'], 'numeber2':[2,3,6,10]})
print(pd.merge(data3,data4,left_on='level1',right_on='level2'))
Results for:
If the key columns in the two data frames have different names, we can still join them by specifying the left_on and right_on parameters.
print(pd.merge(data3,data4,left_on='level1',right_on='level2',how='left'))
Results for:
Other detailed parameter descriptions
Overlapping data merging
Sometimes two data sets overlap and need to be merged so that one fills the gaps in the other. In that case you can use the combine_first function.
data3 = pd.DataFrame({'level':['a','b','c','d'], 'numeber1':[1,3,5,np.nan]})
data4 = pd.DataFrame({'level':['a','b','c','e'], 'numeber2':[2,np.nan,6,10]})
print(data3.combine_first(data4))
Results for:
Where both frames have a value under the same label, the value from data3 takes priority. Where data3 is missing a value, the corresponding value from data4 fills it in.
The behavior is similar to np.where(pd.isnull(a),b,a).
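The comparison with np.where can be checked directly. The sketch below uses two Series that share the same index; the names a, b, patched, and patched_where and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two Series with the same index; 'a' has gaps that 'b' can fill
a = pd.Series([1.0, np.nan, 3.0, np.nan], index=['w', 'x', 'y', 'z'])
b = pd.Series([10.0, 20.0, 30.0, 40.0], index=['w', 'x', 'y', 'z'])

# combine_first keeps a's values and falls back to b where a is missing
patched = a.combine_first(b)

# The same selection written with np.where (returns a plain NumPy array)
patched_where = np.where(pd.isnull(a), b, a)

print(patched)
print(patched_where)
```

One difference worth keeping in mind: combine_first aligns the two objects on their index labels, while np.where works purely by position, so the two only agree when the indexes already line up.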
Data reshaping and axial rotation
We touched on this in the previous pandas article. Reshaping mainly uses the reshape function, and pivoting (axis rotation) mainly uses the stack and unstack functions.
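As a rough sketch of how stack and unstack relate to each other, the following builds a small frame and pivots it in both directions. The variable names frame, stacked, restored, and by_column are illustrative only.

```python
import numpy as np
import pandas as pd

# A small 3x4 frame built by reshaping a flat range of 12 integers
frame = pd.DataFrame(np.arange(12).reshape(3, 4),
                     columns=['a', 'b', 'c', 'd'],
                     index=['wang', 'li', 'zhang'])

# stack pivots the column labels into an inner index level (result is a Series)
stacked = frame.stack()

# unstack pivots that inner level back out into columns
restored = stacked.unstack()

# Calling unstack directly on a frame with a plain index also gives a Series,
# this time keyed by (column, row)
by_column = frame.unstack()

print(stacked)
print(restored)
print(by_column)
```

Both stacked and by_column hold the same twelve values; they differ only in which label forms the outer level of the index.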
data = pd.DataFrame(np.arange(12).reshape(3,4), columns=['a','b','c','d'], index=['wang','li','zhang'])
print(data)
Results for:
print(data.unstack())
Results for:
Data conversion
Remove duplicate rows of data
data = pd.DataFrame({'a':[1,3,3,4], 'b':[1,3,3,5]})
print(data)
Results for:
print(data.duplicated())
Results for:
The third row repeats the data of the second row, so the result for that row is True.
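Which of the repeated rows gets flagged can be controlled. The sketch below uses the keep parameter of duplicated, which the example above does not use; the variable names flag_later, flag_earlier, and flag_all are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})

# Default: the first occurrence is kept, later repeats are flagged
flag_later = data.duplicated().tolist()

# keep='last' flags the earlier occurrence instead
flag_earlier = data.duplicated(keep='last').tolist()

# keep=False flags every row that has a duplicate anywhere
flag_all = data.duplicated(keep=False).tolist()

print(flag_later)
print(flag_earlier)
print(flag_all)
```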
To remove the duplicate rows, use the drop_duplicates method
print(data.drop_duplicates())
Results for:
Replacing values
Besides the fillna method mentioned in our previous post, you can also use the replace method, which is often simpler and more convenient
data = pd.DataFrame({'a':[1,3,3,4], 'b':[1,3,3,5]})
print(data.replace(1,2))
Results for:
Several values can be replaced at once
print(data.replace([1,4],np.nan))
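The different ways of calling replace can be compared side by side. This is a small sketch under the assumption that the replacement values 2, NaN, 100, and 400 are arbitrary; the names single, several, and mapped are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 3, 3, 4], 'b': [1, 3, 3, 5]})

# Single value: every 1 becomes 2
single = data.replace(1, 2)

# Several values mapped to one replacement (here NaN)
several = data.replace([1, 4], np.nan)

# A dict maps each old value to its own new value
mapped = data.replace({1: 100, 4: 400})

print(single)
print(several)
print(mapped)
```

None of these calls modifies data itself; each returns a new frame.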
Data segmentation
data = [11,15,18,20,25,26,27,24]
bins = [15,20,25]
print(data)
print(pd.cut(data,bins))
Results for:
[11, 15, 18, 20, 25, 26, 27, 24]
[NaN, NaN, (15, 20], (15, 20], (20, 25], NaN, NaN, (20, 25]]
Categories (2, object): [(15, 20] < (20, 25]]
After cutting, each value is shown with the segment it falls into, and values that fall outside every segment are shown as NaN. Note that 15 is NaN because each segment excludes its left edge by default.
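Which values end up outside every segment depends on how the edges are treated. The following sketch contrasts the default behavior with right=False and also shows the labels parameter; neither parameter appears in the example above, and the label strings '15-20' and '20-25' are made up for illustration.

```python
import pandas as pd

data = [11, 15, 18, 20, 25, 26, 27, 24]
bins = [15, 20, 25]

# Default: segments are (15, 20] and (20, 25], open on the left edge
default_cut = pd.cut(data, bins)
outside_default = pd.isna(default_cut).tolist()

# right=False flips the edges: segments become [15, 20) and [20, 25)
left_closed = pd.cut(data, bins, right=False)
outside_left_closed = pd.isna(left_closed).tolist()

# labels replaces the interval notation with readable names
named = pd.cut(data, bins, labels=['15-20', '20-25'])

print(outside_default)
print(outside_left_closed)
print(list(named))
```

With the default edges 15 is left out and 25 is included; with right=False it is the other way around.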
print(pd.cut(data,bins).codes)
Results for:
[-1 -1 0 0 1 -1 -1 1]
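This shows the integer code of the segment each value falls into; -1 marks values outside every segment. In older pandas releases this attribute was named labels.
print(pd.cut(data,bins).categories)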
Results for:
Index(['(15, 20]', '(20, 25]'], dtype='object')
This shows the segments themselves. In older pandas releases this attribute was named levels.
print(pd.cut(data,bins).value_counts())
Results for:
This shows the number of values in each segment.
There is also a qcut function, which cuts the data by quantiles rather than by fixed bin edges; for example, passing 4 splits the data into quartiles. It is used in much the same way as cut.
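As a minimal sketch of qcut, the following splits the same eight values into quartiles. Wrapping the list in a Series and the names quartiles and counts are choices made for this illustration.

```python
import pandas as pd

data = pd.Series([11, 15, 18, 20, 25, 26, 27, 24])

# Split into 4 quantile-based segments (quartiles)
quartiles = pd.qcut(data, 4)

# Count how many values landed in each segment
counts = quartiles.value_counts()

print(quartiles.cat.categories)
print(counts)
```

Because the edges are chosen from the data's own quantiles, every value falls into some segment and the segments hold roughly equal numbers of values, unlike cut with fixed edges.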
Permutation and sampling
There are several ways to sort data, such as the sort, order, and rank functions (in current pandas releases, sort and order have been replaced by sort_values).
Here we look at randomly reordering data instead (permutation)
data = np.random.permutation(5)
print(data)
Results for:
[1 0 4 2 3]
Here the permutation function has randomly reordered the integers 0 through 4.
Data can also be sampled
df = pd.DataFrame(np.arange(12).reshape(4,3))
samp = np.random.permutation(3)
print(df)
Results for:
print(samp)
Results for:
[1 0 2]
print(df.take(samp))
Results for:
Using take here extracts the rows of df in the order given by samp.
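The same idea can be made reproducible and extended to drawing a smaller sample. The sketch below assumes a seeded NumPy generator (the seed 0 is arbitrary) and uses illustrative names rng, order, shuffled, and subset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

# A seeded generator makes the shuffle reproducible (the seed value is arbitrary)
rng = np.random.default_rng(0)

# Random ordering of all row positions 0..3
order = rng.permutation(len(df))

# take picks rows by position, in the order given
shuffled = df.take(order)

# Using only the first two positions draws a sample without replacement
subset = df.take(order[:2])

print(shuffled)
print(subset)
```

Permuting len(df) positions means every row can be selected. pandas also provides a DataFrame.sample method for the same purpose.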