Pandas, like the R language, provides a data frame (DataFrame) structure. Pandas is built on NumPy, but it handles tabular data far more conveniently than NumPy does.
1. Basic data structures and use of Pandas
Pandas has two main data structures: Series and DataFrame. A Series is similar to a one-dimensional NumPy array, while a DataFrame is a two-dimensional tabular data structure.
Creation of Series
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3, np.nan, 44, 1])  # np.nan creates a missing value
>>> s  # If no index is given, Series creates one automatically, here 0-5
0     1.0
1     2.0
2     3.0
3     NaN
4    44.0
5     1.0
dtype: float64
DataFrame Creation
>>> dates = pd.date_range('20170101', periods=6)
>>> dates
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])
>>> df
                   a         b         c         d
2017-01-01 -1.993447  1.272175 -1.578337 -1.972526
2017-01-02  0.092701 -0.503654 -0.540655 -0.126386
2017-01-03  0.191769 -0.578872 -1.693449  0.457891
2017-01-04  2.121120  0.521884 -0.419368 -1.916585
2017-01-05  1.642063  0.222134  0.108531 -1.858906
2017-01-06  0.636639  0.487491  0.617841 -1.597920
Like NumPy arrays, a DataFrame lets you retrieve data by index, but its indexing is more flexible: a DataFrame can be indexed not only by the default row and column numbers but also by its row and column labels.
DataFrames can also be created using a dictionary approach:
>>> df2 = pd.DataFrame({'a': 1, 'b': 'hello kitty', 'c': np.arange(2), 'd': ['o', 'k']})
>>> df2
   a            b  c  d
0  1  hello kitty  0  o
1  1  hello kitty  1  k
Some properties of a DataFrame can be inspected with the corresponding attributes and methods:
dtypes       # View the data type of each column
index        # View the row index
columns      # View the column labels
values       # View the data in the frame, without the index and header
describe     # Summary statistics (count, mean, std, min, quartiles, max); computed only for numeric columns
transpose    # Transpose; can also be written as T
sort_index   # Sort by row or column index
sort_values  # Sort by data values
Some examples
>>> df2.dtypes
a     int64
b    object
c     int64
d    object
dtype: object
>>> df2.index
RangeIndex(start=0, stop=2, step=1)
>>> df2.columns
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df2.values
array([[1, 'hello kitty', 0, 'o'],
       [1, 'hello kitty', 1, 'k']], dtype=object)
>>> df2.describe()  # Statistics are only computed for numeric columns
         a         c
count  2.0  2.000000
mean   1.0  0.500000
std    0.0  0.707107
min    1.0  0.000000
25%    1.0  0.250000
50%    1.0  0.500000
75%    1.0  0.750000
max    1.0  1.000000
>>> df2.T
             0            1
a            1            1
b  hello kitty  hello kitty
c            0            1
d            o            k
>>> df2.sort_index(axis=1, ascending=False)  # axis=1: sort columns by label, descending
   d  c            b  a
0  o  0  hello kitty  1
1  k  1  hello kitty  1
>>> df2.sort_index(axis=0, ascending=False)  # Sort rows by label, descending
   a            b  c  d
1  1  hello kitty  1  k
0  1  hello kitty  0  o
>>> df2.sort_values(by='c', ascending=False)  # Sort by the values in column c, descending
   a            b  c  d
1  1  hello kitty  1  k
0  1  hello kitty  0  o
2. Filtering out destination data from DataFrame
There are various ways to extract target data from a DataFrame; the most common are:
- Direct selection by index
- Selection by explicit label: loc
- Selection by implicit position (sequence number): iloc
- Mixed selection by both labels and positions: ix
- Selection by logical (boolean) conditions
Simple selection
>>> import numpy as np
>>> import pandas as pd
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['a', 'b', 'c', 'd'])
>>> df
             a   b   c   d
2017-01-01   0   1   2   3
2017-01-02   4   5   6   7
2017-01-03   8   9  10  11
2017-01-04  12  13  14  15
2017-01-05  16  17  18  19
2017-01-06  20  21  22  23
>>> df['a']  # Select column a by its label; df.a gives the same result
2017-01-01     0
2017-01-02     4
2017-01-03     8
2017-01-04    12
2017-01-05    16
2017-01-06    20
Freq: D, Name: a, dtype: int64
>>> df[0:3]  # Select the first 3 rows; df['2017-01-01':'2017-01-03'] gives the same result, but this syntax cannot select multiple columns
            a  b   c   d
2017-01-01  0  1   2   3
2017-01-02  4  5   6   7
2017-01-03  8  9  10  11
loc uses explicit row labels to pick data
DataFrame rows can be referenced in two ways: by explicit row labels or by the default implicit row numbers. The loc method selects rows by their labels and can be combined with column labels to pick data at specific positions.
>>> df.loc['2017-01-01':'2017-01-03']
            a  b   c   d
2017-01-01  0  1   2   3
2017-01-02  4  5   6   7
2017-01-03  8  9  10  11
>>> df.loc['2017-01-01', ['a', 'b']]  # Select columns a and b of a particular row
a    0
b    1
Name: 2017-01-01 00:00:00, dtype: int64
iloc uses implicit row sequence numbers to pick data
iloc selects rows by their sequence numbers and can be combined with column numbers to pick data at specific positions.
>>> df.iloc[3, 1]
13
>>> df.iloc[1:3, 2:4]
             c   d
2017-01-02   6   7
2017-01-03  10  11
ix mixes explicit labels with implicit sequence numbers
While loc can only pick data using explicit labels and iloc can only use implicit sequence numbers, ix allows both together. (Note: ix was deprecated in pandas 0.20 and removed in pandas 1.0; newer versions use loc or iloc instead.)
>>> df.ix[3:5, ['a', 'b']]
             a   b
2017-01-04  12  13
2017-01-05  16  17
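Since `ix` is gone from current pandas, the same mixed selection can be expressed with `loc` alone by first translating the positional slice into row labels. A minimal sketch on the same frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['a', 'b', 'c', 'd'])

# Equivalent of df.ix[3:5, ['a', 'b']] in modern pandas:
# convert positions 3:5 into row labels, then select by label with loc.
res = df.loc[df.index[3:5], ['a', 'b']]
print(res)
```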
Using logical judgment to select data
>>> df
             a   b   c   d
2017-01-01   0   1   2   3
2017-01-02   4   5   6   7
2017-01-03   8   9  10  11
2017-01-04  12  13  14  15
2017-01-05  16  17  18  19
2017-01-06  20  21  22  23
>>> df[df['a'] > 5]  # Equivalent to df[df.a > 5]
             a   b   c   d
2017-01-03   8   9  10  11
2017-01-04  12  13  14  15
2017-01-05  16  17  18  19
2017-01-06  20  21  22  23
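Logical conditions can also be combined. A minimal sketch (same `df` as above) showing that multiple conditions are joined with `&`/`|` rather than `and`/`or`, and that each condition needs its own parentheses:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['a', 'b', 'c', 'd'])

# & (and) / | (or) work elementwise; parentheses are required because
# & binds more tightly than the comparison operators.
subset = df[(df['a'] > 5) & (df['d'] < 20)]
print(subset)
```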
3. Setting values at specific positions in Pandas
>>> import numpy as np
>>> import pandas as pd
>>> dates = pd.date_range('20170101', periods=6)
>>> datas = np.arange(24).reshape((6, 4))
>>> columns = ['a', 'b', 'c', 'd']
>>> df = pd.DataFrame(data=datas, index=dates, columns=columns)
>>> df.iloc[2, 2:4] = 111  # Replace row 2, columns 2 and 3 with 111
>>> df
             a   b    c    d
2017-01-01   0   1    2    3
2017-01-02   4   5    6    7
2017-01-03   8   9  111  111
2017-01-04  12  13   14   15
2017-01-05  16  17   18   19
2017-01-06  20  21   22   23
>>> df.b[df['a'] > 10] = 0  # Equivalent to df.b[df.a > 10]
>>> # Set column b to 0 in the rows where column a is greater than 10
>>> df
             a   b    c    d
2017-01-01   0   1    2    3
2017-01-02   4   5    6    7
2017-01-03   8   9  111  111
2017-01-04  12   0   14   15
2017-01-05  16   0   18   19
2017-01-06  20   0   22   23
>>> df['f'] = np.nan  # Create a new column f filled with NaN
>>> df
             a   b    c    d   f
2017-01-01   0   1    2    3 NaN
2017-01-02   4   5    6    7 NaN
2017-01-03   8   9  111  111 NaN
2017-01-04  12   0   14   15 NaN
2017-01-05  16   0   18   19 NaN
2017-01-06  20   0   22   23 NaN
>>> # A Series can be added the same way, but it must be the same length as the index
>>> df['e'] = pd.Series(np.arange(6), index=dates)
>>> df
             a   b    c    d   f  e
2017-01-01   0   1    2    3 NaN  0
2017-01-02   4   5    6    7 NaN  1
2017-01-03   8   9  111  111 NaN  2
2017-01-04  12   0   14   15 NaN  3
2017-01-05  16   0   18   19 NaN  4
2017-01-06  20   0   22   23 NaN  5
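Assigning through a column attribute followed by a boolean mask (as in `df.b[df['a'] > 10] = 0`) is chained indexing, which can trigger pandas' SettingWithCopyWarning because the write may land on a copy. A sketch of the safer form, one `.loc` call that names the rows and the column together, on the same frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['a', 'b', 'c', 'd'])

# Select rows (a > 10) and column 'b' in a single .loc call, so the
# assignment writes into df itself rather than a possible copy.
df.loc[df['a'] > 10, 'b'] = 0
print(df['b'])
```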
4. Addressing missing data
Sometimes data contains empty or missing (NaN) values. dropna selectively removes rows or columns containing NaN; drop removes specified rows or columns; drop_duplicates removes duplicate rows; fillna replaces NaN values with another value. None of these operations modify the original frame; to keep the changes, reassign the result.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.arange(24).reshape(6, 4), index=pd.date_range('20170101', periods=6), columns=['a', 'b', 'c', 'd'])
>>> df
             a   b   c   d
2017-01-01   0   1   2   3
2017-01-02   4   5   6   7
2017-01-03   8   9  10  11
2017-01-04  12  13  14  15
2017-01-05  16  17  18  19
2017-01-06  20  21  22  23
>>> df.iloc[1, 3] = np.nan
>>> df.iloc[3, 2] = np.nan
>>> df
             a   b     c     d
2017-01-01   0   1   2.0   3.0
2017-01-02   4   5   6.0   NaN
2017-01-03   8   9  10.0  11.0
2017-01-04  12  13   NaN  15.0
2017-01-05  16  17  18.0  19.0
2017-01-06  20  21  22.0  23.0
>>> df.dropna(axis=0, how='any')
# axis=0 (or 1) means rows (or columns) containing NaN are dropped
# how='any': drop the row (or column, depending on axis) if it contains any NaN
# how='all': drop the row (or column) only if it is entirely NaN
             a   b     c     d
2017-01-01   0   1   2.0   3.0
2017-01-03   8   9  10.0  11.0
2017-01-05  16  17  18.0  19.0
2017-01-06  20  21  22.0  23.0
>>> df.fillna(value=55)
             a   b     c     d
2017-01-01   0   1   2.0   3.0
2017-01-02   4   5   6.0  55.0
2017-01-03   8   9  10.0  11.0
2017-01-04  12  13  55.0  15.0
2017-01-05  16  17  18.0  19.0
2017-01-06  20  21  22.0  23.0
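The section also mentions `drop` and `drop_duplicates` without showing them; a minimal sketch with a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4]})

# drop removes rows by index label (axis=0) or columns by name (axis=1);
# like dropna/fillna it returns a new frame and leaves df unchanged.
no_b = df.drop('b', axis=1)

# drop_duplicates removes rows whose values repeat an earlier row
# (here row 1 duplicates row 0).
unique_rows = df.drop_duplicates()
print(no_b.columns.tolist(), len(unique_rows))
```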
np.any and np.all can also be used to check whether any, or all, of the data is NaN:
>>> np.any(df.isnull()) == True
True
>>> np.all(df.isnull()) == True
False
5. Import and export of data
Tabular (Excel-style) data is generally read in as CSV with pd.read_csv(file), and a DataFrame is saved with filedata.to_csv(file).
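A minimal round trip, assuming a writable working directory; the filename `data.csv` is just an example:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# to_csv writes the frame to disk; index=False skips the row labels
# so the file round-trips cleanly through read_csv.
df.to_csv('data.csv', index=False)

df_back = pd.read_csv('data.csv')
print(df_back)
```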
6. Data addition and consolidation
This section covers Pandas' simple, basic methods for appending and merging data: concat and append.
The concat method is similar to NumPy's concatenate: it can merge DataFrames horizontally or vertically.
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
>>> df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
>>> df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])
>>> res = pd.concat([df1, df2, df3], axis=0)  # axis=0 stacks by rows; axis=1 merges left-right by columns
>>> res
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
0  2.0  2.0  2.0  2.0
1  2.0  2.0  2.0  2.0
2  2.0  2.0  2.0  2.0
>>> # Use the ignore_index=True parameter to reset the row labels
>>> res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
>>> res
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
5  1.0  1.0  1.0  1.0
6  2.0  2.0  2.0  2.0
7  2.0  2.0  2.0  2.0
8  2.0  2.0  2.0  2.0
The join parameter provides more varied merges. join='outer' is the default: all data is kept, columns with the same label are merged into one, columns that appear on only one side become separate columns, and positions with no original value are filled with NaN. join='inner' keeps only the columns (or rows, depending on axis) whose labels appear on both sides and discards the rest. In short, outer takes the union and inner takes the intersection.
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3], columns=['a', 'b', 'c', 'd'])
>>> df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[1, 2, 3], columns=['b', 'c', 'd', 'e'])
>>> res = pd.concat([df1, df2], axis=0, join='outer')
>>> res
     a    b    c    d    e
1  1.0  1.0  1.0  1.0  NaN
2  1.0  1.0  1.0  1.0  NaN
3  1.0  1.0  1.0  1.0  NaN
1  NaN  2.0  2.0  2.0  2.0
2  NaN  2.0  2.0  2.0  2.0
3  NaN  2.0  2.0  2.0  2.0
>>> res1 = pd.concat([df1, df2], axis=1, join='outer')  # axis=1: rows with the same label are merged side by side; the rest each become one row, with NaN filling the gaps
>>> res1
     a    b    c    d    b    c    d    e
1  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
2  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
3  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
>>> res2 = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True)  # Stack only the columns whose labels appear in both frames
>>> res2
     b    c    d
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0
3  2.0  2.0  2.0
4  2.0  2.0  2.0
5  2.0  2.0  2.0
The join_axes parameter sets a reference axis: the merge keeps only the labels in the given reference and discards everything else. (Note: join_axes was removed in pandas 1.0; newer versions use reindex instead.)
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3], columns=['a', 'b', 'c', 'd'])
>>> df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[2, 3, 4], columns=['b', 'c', 'd', 'e'])
>>> res3 = pd.concat([df1, df2], axis=0, join_axes=[df1.columns])  # Stack rows, keeping only df1's column labels as the reference
>>> res3
     a    b    c    d
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0
2  NaN  2.0  2.0  2.0
3  NaN  2.0  2.0  2.0
4  NaN  2.0  2.0  2.0
>>> res4 = pd.concat([df1, df2], axis=1, join_axes=[df1.index])  # Merge side by side, keeping only df1's row labels as the reference
     a    b    c    d    b    c    d    e
1  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN
2  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
3  1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0
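Because `join_axes` is gone from current pandas, the calls above fail on recent versions. A sketch of the same axis=0 result using `reindex` after an outer concat:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3],
                   columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[2, 3, 4],
                   columns=['b', 'c', 'd', 'e'])

# Equivalent of join_axes=[df1.columns]: outer-concat the frames,
# then keep only df1's columns; positions df2 never had become NaN.
res3 = pd.concat([df1, df2], axis=0).reindex(columns=df1.columns)
print(res3)
```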
append only merges vertically (up and down), not horizontally
>>> df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3], columns=['a', 'b', 'c', 'd'])
>>> df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[2, 3, 4], columns=['b', 'c', 'd', 'e'])
>>> res5 = df1.append(df2, ignore_index=True)
>>> res5
     a    b    c    d    e
0  1.0  1.0  1.0  1.0  NaN
1  1.0  1.0  1.0  1.0  NaN
2  1.0  1.0  1.0  1.0  NaN
3  NaN  2.0  2.0  2.0  2.0
4  NaN  2.0  2.0  2.0  2.0
5  NaN  2.0  2.0  2.0  2.0
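`DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; `pd.concat` produces the same result. A minimal sketch:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3],
                   columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[2, 3, 4],
                   columns=['b', 'c', 'd', 'e'])

# Modern replacement for df1.append(df2, ignore_index=True):
res5 = pd.concat([df1, df2], ignore_index=True)
print(res5)
```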
7. Pandas advanced merging: merge
The merge method is similar to concat, except that merge can join the rows of two datasets on one or more keys.
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
Parameter Description:
- left, right: the two DataFrames to be merged
- how: the type of join: inner, left (left outer), right (right outer), or outer (full outer); the default is inner
- on: the name of the column to join on; it must exist in both the left and right DataFrames. If it is not specified and no other key parameters are given, the intersection of the two DataFrames' column names is used as the join key.
- left_on: the column in the left DataFrame to use as the join key; useful when the left and right column names differ but carry the same meaning
- right_on: the column in the right DataFrame to use as the join key
- left_index: use the row index of the left DataFrame as its join key
- right_index: use the row index of the right DataFrame as its join key
- sort: sort the merged data by the join keys; the default is False, since sorting can hurt performance on large merges
- suffixes: a tuple of strings appended to column names that appear in both DataFrames; defaults to ('_x', '_y')
- copy: defaults to True, which always copies data into the result; setting it to False can improve performance in some cases
- indicator: add a _merge column showing the source of each row: left_only, right_only, or both
>>> import pandas as pd
>>> df1 = pd.DataFrame({'key': ['k0', 'k1', 'k2', 'k3'], 'A': ['a0', 'a1', 'a2', 'a3'], 'B': ['b0', 'b1', 'b2', 'b3']})
>>> df2 = pd.DataFrame({'key': ['k0', 'k1', 'k2', 'k3'], 'C': ['c0', 'c1', 'c2', 'c3'], 'D': ['d0', 'd1', 'd2', 'd3']})
>>> res = pd.merge(df1, df2, on='key', indicator=True)
>>> res
    A   B key   C   D _merge
0  a0  b0  k0  c0  d0   both
1  a1  b1  k1  c1  d1   both
2  a2  b2  k2  c2  d2   both
3  a3  b3  k3  c3  d3   both
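When the key sets do not fully overlap, the `how` parameter determines which rows survive, and the `_merge` indicator records where each came from. A minimal sketch with hypothetical frames `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({'key': ['k0', 'k1', 'k2'], 'A': ['a0', 'a1', 'a2']})
right = pd.DataFrame({'key': ['k1', 'k2', 'k3'], 'C': ['c1', 'c2', 'c3']})

# how='outer' keeps unmatched keys from both sides; indicator=True adds
# a _merge column with left_only / right_only / both for each row.
res = pd.merge(left, right, on='key', how='outer', indicator=True)
print(res)
```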
Merging by row index is similar to merging by column key.
>>> res2 = pd.merge(df1, df2, left_index=True, right_index=True, indicator=True)
>>> res2
    A   B key_x   C   D key_y _merge
0  a0  b0    k0  c0  d0    k0   both
1  a1  b1    k1  c1  d1    k1   both
2  a2  b2    k2  c2  d2    k2   both
3  a3  b3    k3  c3  d3    k3   both
That concludes this article.