Pandas, like the R language, provides a data frame (DataFrame) structure. Pandas is built on NumPy, but it handles tabular data far more conveniently than NumPy does.
1. Basic data structures and use of Pandas
Pandas has two main data structures: Series and DataFrame. A Series is similar to a one-dimensional NumPy array, while a DataFrame is a two-dimensional tabular data structure.
Creation of Series
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3, np.nan, 44, 1])  # np.nan creates a missing value
>>> s  # If no index is given, Series creates one automatically, here 0-5
0     1.0
1     2.0
2     3.0
3     NaN
4    44.0
5     1.0
dtype: float64
DataFrame Creation
>>> dates = pd.date_range('20170101', periods=6)
>>> dates
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])
>>> df
                   a         b         c         d
2017-01-01 -1.993447  1.272175 -1.578337 -1.972526
2017-01-02  0.092701 -0.503654 -0.540655 -0.126386
2017-01-03  0.191769 -0.578872 -1.693449  0.457891
2017-01-04  2.121120  0.521884 -0.419368 -1.916585
2017-01-05  1.642063  0.222134  0.108531 -1.858906
2017-01-06  0.636639  0.487491  0.617841 -1.597920
Like NumPy arrays, a DataFrame lets you retrieve data by index, but its indexing is more flexible: a DataFrame can be indexed not only by the default row and column numbers but also by its row and column labels.
DataFrames can also be created using a dictionary approach:
>>> df2 = pd.DataFrame({'a': 1, 'b': 'hello kitty', 'c': np.arange(2), 'd': ['o', 'k']})
>>> df2
   a            b  c  d
0  1  hello kitty  0  o
1  1  hello kitty  1  k
Some properties of a DataFrame can be inspected with the corresponding attributes and methods:
dtypes       # View the data type of each column
index        # View the row index
columns      # View the column labels
values       # View the data in the frame, without the index and header
describe     # Summary statistics (count, mean, std, min, quartiles, max); computed only for numeric columns
transpose    # Transpose; can also be written as T
sort_index   # Sort by row or column index
sort_values  # Sort by data values
Some examples
>>> df2.dtypes
a     int64
b    object
c     int64
d    object
dtype: object
>>> df2.index
RangeIndex(start=0, stop=2, step=1)
>>> df2.columns
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df2.values
array([[1, 'hello kitty', 0, 'o'],
       [1, 'hello kitty', 1, 'k']], dtype=object)
>>> df2.describe()  # Statistics are only computed for numeric columns
         a         c
count  2.0  2.000000
mean   1.0  0.500000
std    0.0  0.707107
min    1.0  0.000000
25%    1.0  0.250000
50%    1.0  0.500000
75%    1.0  0.750000
max    1.0  1.000000
>>> df2.T
             0            1
a            1            1
b  hello kitty  hello kitty
c            0            1
d            o            k
>>> df2.sort_index(axis=1, ascending=False)  # axis=1: sort columns by label, descending
   d  c            b  a
0  o  0  hello kitty  1
1  k  1  hello kitty  1
>>> df2.sort_index(axis=0, ascending=False)  # Sort rows by label, descending
   a            b  c  d
1  1  hello kitty  1  k
0  1  hello kitty  0  o
>>> df2.sort_values(by='c', ascending=False)  # Sort by the values in column c, descending
   a            b  c  d
1  1  hello kitty  1  k
0  1  hello kitty  0  o
2. Filtering out destination data from DataFrame
There are various ways to extract target data from a DataFrame; the most common are:
- Direct selection by index
- Selection by explicit label: loc
- Selection by implicit position (sequence number): iloc
- Mixed selection by both labels and positions: ix
- Selection by logical (boolean) conditions
Simple selection
>>> import numpy as np
>>> import pandas as pd
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['a', 'b', 'c', 'd'])
>>> df
             a   b   c   d
2017-01-01   0   1   2   3
2017-01-02   4   5   6   7
2017-01-03   8   9  10  11
2017-01-04  12  13  14  15
2017-01-05  16  17  18  19
2017-01-06  20  21  22  23
>>> df['a']  # Select column a by its label; df.a gives the same result
2017-01-01     0
2017-01-02     4
2017-01-03     8
2017-01-04    12
2017-01-05    16
2017-01-06    20
Freq: D, Name: a, dtype: int64
>>> df[0:3]  # Select the first 3 rows; df['2017-01-01':'2017-01-03'] gives the same result, but this syntax cannot select multiple columns
            a  b   c   d
2017-01-01  0  1   2   3
2017-01-02  4  5   6   7
2017-01-03  8  9  10  11
loc uses explicit row labels to pick data
DataFrame rows can be referenced in two ways: by explicit row labels or by the default implicit row numbers. The loc method selects rows by their labels and can be combined with column labels to pick data at specific positions.
>>> df.loc['2017-01-01':'2017-01-03']
            a  b   c   d
2017-01-01  0  1   2   3
2017-01-02  4  5   6   7
2017-01-03  8  9  10  11
>>> df.loc['2017-01-01', ['a', 'b']]  # Select columns a and b of a particular row
a    0
b    1
Name: 2017-01-01 00:00:00, dtype: int64
iloc uses implicit row sequence numbers to pick data
iloc selects rows by their sequence numbers and can be combined with column numbers to pick data at specific positions.
>>> df.iloc[3, 1]
13
>>> df.iloc[1:3, 2:4]
             c   d
2017-01-02   6   7
2017-01-03  10  11
ix mixes explicit labels with implicit sequence numbers
While loc can only pick data using explicit labels and iloc can only use implicit sequence numbers, ix allows both together. (Note: ix was deprecated in pandas 0.20 and removed in pandas 1.0; newer versions use loc or iloc instead.)
>>> df.ix[3:5, ['a', 'b']]
             a   b
2017-01-04  12  13
2017-01-05  16  17
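Since `ix` is gone from current pandas, the same mixed selection can be expressed with `loc` alone by first translating the positional slice into row labels. A minimal sketch on the same frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['a', 'b', 'c', 'd'])

# Equivalent of df.ix[3:5, ['a', 'b']] in modern pandas:
# convert positions 3:5 into row labels, then select by label with loc.
res = df.loc[df.index[3:5], ['a', 'b']]
print(res)
```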
Using logical judgment to select data
>>> df
             a   b   c   d
2017-01-01   0   1   2   3
2017-01-02   4   5   6   7
2017-01-03   8   9  10  11
2017-01-04  12  13  14  15
2017-01-05  16  17  18  19
2017-01-06  20  21  22  23
>>> df[df['a'] > 5]  # Equivalent to df[df.a > 5]
             a   b   c   d
2017-01-03   8   9  10  11
2017-01-04  12  13  14  15
2017-01-05  16  17  18  19
2017-01-06  20  21  22  23
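Logical conditions can also be combined. A minimal sketch (same `df` as above) showing that multiple conditions are joined with `&`/`|` rather than `and`/`or`, and that each condition needs its own parentheses:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['a', 'b', 'c', 'd'])

# & (and) / | (or) work elementwise; parentheses are required because
# & binds more tightly than the comparison operators.
subset = df[(df['a'] > 5) & (df['d'] < 20)]
print(subset)
```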
3. Setting values at specific positions in Pandas
>>> import numpy as np
>>> import pandas as pd
>>> dates = pd.date_range('20170101', periods=6)
>>> datas = np.arange(24).reshape((6, 4))
>>> columns = ['a', 'b', 'c', 'd']
>>> df = pd.DataFrame(data=datas, index=dates, columns=columns)
>>> df.iloc[2, 2:4] = 111  # Replace row 2, columns 2 and 3 with 111
>>> df
             a   b    c    d
2017-01-01   0   1    2    3
2017-01-02   4   5    6    7
2017-01-03   8   9  111  111
2017-01-04  12  13   14   15
2017-01-05  16  17   18   19
2017-01-06  20  21   22   23
>>> df.b[df['a'] > 10] = 0  # Equivalent to df.b[df.a > 10]
>>> # Set column b to 0 in the rows where column a is greater than 10
>>> df
             a   b    c    d
2017-01-01   0   1    2    3
2017-01-02   4   5    6    7
2017-01-03   8   9  111  111
2017-01-04  12   0   14   15
2017-01-05  16   0   18   19
2017-01-06  20   0   22   23
>>> df['f'] = np.nan  # Create a new column f filled with NaN
>>> df
             a   b    c    d   f
2017-01-01   0   1    2    3 NaN
2017-01-02   4   5    6    7 NaN
2017-01-03   8   9  111  111 NaN
2017-01-04  12   0   14   15 NaN
2017-01-05  16   0   18   19 NaN
2017-01-06  20   0   22   23 NaN
>>> # A Series can be added the same way, but it must be the same length as the index
>>> df['e'] = pd.Series(np.arange(6), index=dates)
>>> df
             a   b    c    d   f  e
2017-01-01   0   1    2    3 NaN  0
2017-01-02   4   5    6    7 NaN  1
2017-01-03   8   9  111  111 NaN  2
2017-01-04  12   0   14   15 NaN  3
2017-01-05  16   0   18   19 NaN  4
2017-01-06  20   0   22   23 NaN  5
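Assigning through a column attribute followed by a boolean mask (as in `df.b[df['a'] > 10] = 0`) is chained indexing, which can trigger pandas' SettingWithCopyWarning because the write may land on a copy. A sketch of the safer form, one `.loc` call that names the rows and the column together, on the same frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['a', 'b', 'c', 'd'])

# Select rows (a > 10) and column 'b' in a single .loc call, so the
# assignment writes into df itself rather than a possible copy.
df.loc[df['a'] > 10, 'b'] = 0
print(df['b'])
```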
4. Addressing missing data
Sometimes data contains empty or missing (NaN) values. dropna selectively removes rows or columns containing NaN; drop removes specified rows or columns; drop_duplicates removes duplicate rows; fillna replaces NaN values with another value. None of these operations modify the original frame; to keep the changes, reassign the result.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.arange(24).reshape(6, 4), index=pd.date_range('20170101', periods=6), columns=['a', 'b', 'c', 'd'])
>>> df
             a   b   c   d
2017-01-01   0   1   2   3
2017-01-02   4   5   6   7
2017-01-03   8   9  10  11
2017-01-04  12  13  14  15
2017-01-05  16  17  18  19
2017-01-06  20  21  22  23
>>> df.iloc[1, 3] = np.nan
>>> df.iloc[3, 2] = np.nan
>>> df
             a   b     c     d
2017-01-01   0   1   2.0   3.0
2017-01-02   4   5   6.0   NaN
2017-01-03   8   9  10.0  11.0
2017-01-04  12  13   NaN  15.0
2017-01-05  16  17  18.0  19.0
2017-01-06  20  21  22.0  23.0
>>> df.dropna(axis=0, how='any')
# axis=0 (or 1) means rows (or columns) containing NaN are dropped
# how='any': drop the row (or column, depending on axis) if it contains any NaN
# how='all': drop the row (or column) only if it is entirely NaN
             a   b     c     d
2017-01-01   0   1   2.0   3.0
2017-01-03   8   9  10.0  11.0
2017-01-05  16  17  18.0  19.0
2017-01-06  20  21  22.0  23.0
>>> df.fillna(value=55)
             a   b     c     d
2017-01-01   0   1   2.0   3.0
2017-01-02   4   5   6.0  55.0
2017-01-03   8   9  10.0  11.0
2017-01-04  12  13  55.0  15.0
2017-01-05  16  17  18.0  19.0
2017-01-06  20  21  22.0  23.0
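The section also mentions `drop` and `drop_duplicates` without showing them; a minimal sketch with a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4]})

# drop removes rows by index label (axis=0) or columns by name (axis=1);
# like dropna/fillna it returns a new frame and leaves df unchanged.
no_b = df.drop('b', axis=1)

# drop_duplicates removes rows whose values repeat an earlier row
# (here row 1 duplicates row 0).
unique_rows = df.drop_duplicates()
print(no_b.columns.tolist(), len(unique_rows))
```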
np.any and np.all can also be used to check whether any, or all, of the data is NaN:
>>> np.any(df.isnull()) == True
True
>>> np.all(df.isnull()) == True
False
5. Import and export of data
Tabular (Excel-style) data is generally read in as CSV with pd.read_csv(file), and a DataFrame is saved with filedata.to_csv(file).
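A minimal round trip, assuming a writable working directory; the filename `data.csv` is just an example:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# to_csv writes the frame to disk; index=False skips the row labels
# so the file round-trips cleanly through read_csv.
df.to_csv('data.csv', index=False)

df_back = pd.read_csv('data.csv')
print(df_back)
```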
6. Data addition and consolidation
This section covers Pandas' simple, basic methods for appending and merging data: concat and append.
The concat method is similar to NumPy's concatenate: it can merge DataFrames horizontally or vertically.
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
>>> df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
>>> df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])
>>> res = pd.concat([df1, df2, df3], axis=0)  # axis=0 stacks by rows; axis=1 merges left-right by columns
>>> res
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
0  2.0  2.0  2.0  2.0
1  2.0  2.0  2.0  2.0
2  2.0  2.0  2.0  2.0
>>> # Use the ignore_index=True parameter to reset the row labels
>>> res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
>>> res
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
5  1.0  1.0  1.0  1.0
6  2.0  2.0  2.0  2.0
7  2.0  2.0  2.0  2.0
8  2.0  2.0  2.0  2.0
The join parameter provides more varied merges. join='outer' is the default: all data is kept, columns with the same label are merged into one, columns that appear on only one side become separate columns, and positions with no original value are filled with NaN. join='inner' keeps only the columns (or rows, depending on axis) whose labels appear on both sides and discards the rest. In short, outer takes the union and inner takes the intersection.
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3], columns=['a', 'b', 'c', 'd'])
>>> df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[1, 2, 3], columns=['b', 'c', 'd', 'e'])
>>> res = pd.concat([df1, df2], axis=0, join='outer')
>>> res
     a    b    c    d    e
1  1.0  1.0  1.0  1.0  NaN
2  1.0  1.0  1.0  1.0  NaN
3  1.0  1.0  1.0  1.0  NaN
1  NaN  2.0  2.0  2.0  2.0
2  NaN  2.0  2.0  2.0  2.0
3  NaN  2.0  2.0  2.0  2.0
>>> res1 = pd.concat([df1, df2], axis=1, join='outer')  # axis=1: rows with the same label are merged side by side; the rest each become one row, with NaN filling the gaps
>>> res1
     a    b    c    d    b    c    d    e
1  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
2  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
3  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
>>> res2 = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True)  # Stack only the columns whose labels appear in both frames
>>> res2
     b    c    d
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0
3  2.0  2.0  2.0
4  2.0  2.0  2.0
5  2.0  2.0  2.0
The join_axes parameter sets a reference axis: the merge keeps only the labels in the given reference and discards everything else. (Note: join_axes was removed in pandas 1.0; newer versions use reindex instead.)
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3], columns=['a', 'b', 'c', 'd'])
>>> df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[2, 3, 4], columns=['b', 'c', 'd', 'e'])
>>> res3 = pd.concat([df1, df2], axis=0, join_axes=[df1.columns])  # Stack rows, keeping only df1's column labels as the reference
>>> res3
     a    b    c    d
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0
2  NaN  2.0  2.0  2.0
3  NaN  2.0  2.0  2.0
4  NaN  2.0  2.0  2.0
>>> res4 = pd.concat([df1, df2], axis=1, join_axes=[df1.index])  # Merge side by side, keeping only df1's row labels as the reference
     a    b    c    d    b    c    d    e
1  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN
2  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
3  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
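Because `join_axes` is gone from current pandas, the calls above fail on recent versions. A sketch of the same axis=0 result using `reindex` after an outer concat:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3],
                   columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[2, 3, 4],
                   columns=['b', 'c', 'd', 'e'])

# Equivalent of join_axes=[df1.columns]: outer-concat the frames,
# then keep only df1's columns; positions df2 never had become NaN.
res3 = pd.concat([df1, df2], axis=0).reindex(columns=df1.columns)
print(res3)
```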
append only merges vertically (up and down), not horizontally
>>> df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3], columns=['a', 'b', 'c', 'd'])
>>> df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[2, 3, 4], columns=['b', 'c', 'd', 'e'])
>>> res5 = df1.append(df2, ignore_index=True)
>>> res5
     a    b    c    d    e
0  1.0  1.0  1.0  1.0  NaN
1  1.0  1.0  1.0  1.0  NaN
2  1.0  1.0  1.0  1.0  NaN
3  NaN  2.0  2.0  2.0  2.0
4  NaN  2.0  2.0  2.0  2.0
5  NaN  2.0  2.0  2.0  2.0
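`DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; `pd.concat` produces the same result. A minimal sketch:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3],
                   columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[2, 3, 4],
                   columns=['b', 'c', 'd', 'e'])

# Modern replacement for df1.append(df2, ignore_index=True):
res5 = pd.concat([df1, df2], ignore_index=True)
print(res5)
```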
7. Pandas advanced merging: merge
The merge method is similar to concat, except that merge can join the rows of two datasets on one or more keys.
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
Parameter Description:
- left, right: the two DataFrames to be merged
- how: the type of join: inner, left (left outer), right (right outer), or outer (full outer); the default is inner
- on: the name of the column to join on; it must exist in both the left and right DataFrames. If it is not specified and no other key parameters are given, the intersection of the two DataFrames' column names is used as the join key.
- left_on: the column in the left DataFrame to use as the join key; useful when the left and right column names differ but carry the same meaning
- right_on: the column in the right DataFrame to use as the join key
- left_index: use the row index of the left DataFrame as its join key
- right_index: use the row index of the right DataFrame as its join key
- sort: sort the merged data by the join keys; the default is False, since sorting can hurt performance on large merges
- suffixes: a tuple of strings appended to column names that appear in both DataFrames; defaults to ('_x', '_y')
- copy: defaults to True, which always copies data into the result; setting it to False can improve performance in some cases
- indicator: add a _merge column showing the source of each row: left_only, right_only, or both
>>> import pandas as pd
>>> df1 = pd.DataFrame({'key': ['k0', 'k1', 'k2', 'k3'], 'A': ['a0', 'a1', 'a2', 'a3'], 'B': ['b0', 'b1', 'b2', 'b3']})
>>> df2 = pd.DataFrame({'key': ['k0', 'k1', 'k2', 'k3'], 'C': ['c0', 'c1', 'c2', 'c3'], 'D': ['d0', 'd1', 'd2', 'd3']})
>>> res = pd.merge(df1, df2, on='key', indicator=True)
>>> res
    A   B key   C   D _merge
0  a0  b0  k0  c0  d0   both
1  a1  b1  k1  c1  d1   both
2  a2  b2  k2  c2  d2   both
3  a3  b3  k3  c3  d3   both
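When the key sets do not fully overlap, the `how` parameter determines which rows survive, and the `_merge` indicator records where each came from. A minimal sketch with hypothetical frames `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({'key': ['k0', 'k1', 'k2'], 'A': ['a0', 'a1', 'a2']})
right = pd.DataFrame({'key': ['k1', 'k2', 'k3'], 'C': ['c1', 'c2', 'c3']})

# how='outer' keeps unmatched keys from both sides; indicator=True adds
# a _merge column with left_only / right_only / both for each row.
res = pd.merge(left, right, on='key', how='outer', indicator=True)
print(res)
```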
Merging by row index is similar to merging by column key.
>>> res2 = pd.merge(df1, df2, left_index=True, right_index=True, indicator=True)
>>> res2
    A   B key_x   C   D key_y _merge
0  a0  b0    k0  c0  d0    k0   both
1  a1  b1    k1  c1  d1    k1   both
2  a2  b2    k2  c2  d2    k2   both
3  a3  b3    k3  c3  d3    k3   both
That concludes this article.