This article walks through basic usage of pandas, the Python data analysis module, with examples. It is shared here for your reference.
I. Introduction
pandas (Python Data Analysis Library) is a NumPy-based data analysis module that provides a large number of standard data models and the tools needed to manipulate large datasets efficiently. It is one of the main reasons Python has become an efficient and powerful environment for data analysis.
pandas provides three main data structures:
(1) Series, a one-dimensional labeled array.
(2) DataFrame, a two-dimensional, size-mutable labeled table.
(3) Panel, a three-dimensional, size-mutable labeled array. Panel was deprecated in pandas 0.20 and removed in 0.25; on current versions, a DataFrame with a MultiIndex covers the same use case.
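As a minimal sketch of these structures (all labels and values below are made up for illustration), a Series pairs one-dimensional values with an index, and a DataFrame holds several columns that share one row index. On pandas versions without Panel, a DataFrame with a two-level MultiIndex is one common stand-in for three-dimensional labeled data:

```python
import numpy as np
import pandas as pd

# A Series: one-dimensional values plus labels (illustrative data).
s = pd.Series([1.0, 3.0, 5.0], index=["a", "b", "c"])

# A DataFrame: a two-dimensional table whose columns share the row index.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]}, index=["a", "b", "c"])

# A stand-in for three-dimensional data: two index levels plus columns.
idx = pd.MultiIndex.from_product(
    [["item1", "item2"], ["a", "b", "c"]], names=["item", "row"]
)
cube = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=["x", "y"])

print(s["b"])                          # label-based access on a Series
print(df.loc["b", "y"])                # row label, then column label
print(cube.loc[("item2", "a"), "x"])   # (item, row) label pair, then column
```

The names `item1`, `item2`, `x`, and `y` are hypothetical; only the construction pattern matters here.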
II. Code
1. Generate a one-dimensional array
>>> import pandas as pd
>>> import numpy as np
>>> x = pd.Series([1, 3, 5, np.nan])  # np.nan marks a missing value
>>> print(x)
0    1.0
1    3.0
2    5.0
3    NaN
dtype: float64
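The output above shows `float64` even though the other entries are integers. A short sketch of why, and of how the missing value affects simple reductions, assuming the same four-element Series:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 3, 5, np.nan])

print(x.dtype)              # float64, because NaN is a floating-point value
print(x.count())            # number of non-missing entries
print(x.sum())              # missing values are skipped by default
print(x.mean())             # mean of the three values that are present
print(x.sum(skipna=False))  # NaN propagates when skipna=False
```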
2. Generate two-dimensional arrays
>>> dates = pd.date_range(start='20170101', end='20171231', freq='D')  # daily interval
>>> print(dates)
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
               '2017-01-09', '2017-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')
>>> dates = pd.date_range(start='20170101', end='20171231', freq='M')  # month-end interval (spelled 'ME' in pandas 2.2 and later)
>>> print(dates)
DatetimeIndex(['2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
               '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
               '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31'],
              dtype='datetime64[ns]', freq='M')
>>> df = pd.DataFrame(np.random.randn(12,4), index=dates, columns=list('ABCD'))
>>> print(df)
                   A         B         C         D
2017-01-31 -0.682556  0.244102  0.450855  0.236475
2017-02-28 -0.630060  0.590667  0.482438  0.225697
2017-03-31  1.066989  0.319339  1.094953  1.716053
2017-04-30  0.334944 -0.053049 -1.009493 -1.039470
2017-05-31 -0.380778 -0.044429  0.075647  0.931243
2017-06-30  0.867540  0.872197 -0.738974 -1.114596
2017-07-31  0.423371 -1.086386  0.183820 -0.438921
2017-08-31  1.285163  0.634134 -0.472973  1.281057
2017-09-30 -1.002832 -0.888122 -1.316014 -0.070637
2017-10-31  1.735617 -0.253815  0.554403  1.536211
2017-11-30  2.030384  0.667556  1.012698  0.239479
2017-12-31  2.059718 -0.089050  1.420517  0.224578
>>> df = pd.DataFrame([[np.random.randint(1,100) for j in range(4)] for i in range(12)], index=dates, columns=list('ABCD'))
>>> print(df)
             A   B   C   D
2017-01-31  ...
2017-02-28  70  99  70  98
2017-03-31  99  47  75  67
2017-04-30  33  70  17  49
2017-05-31  62  88  68  91
2017-06-30  19  75  18  44
2017-07-31  50  85  65  82
2017-08-31  ...
2017-09-30  ...
2017-10-31  ...
2017-11-30  ...
2017-12-31  79  58  69  33
>>> df = pd.DataFrame({'A':[np.random.randint(1,100) for i in range(4)],
...                    'B':pd.date_range(start='20130101', periods=4, freq='D'),
...                    'C':pd.Series([1,2,3,4], index=list(range(4)), dtype='float32'),
...                    'D':np.array([3]*4, dtype='int32'),
...                    'E':pd.Categorical(["test","train","test","train"]),
...                    'F':'foo'})
>>> print(df)
    A          B    C  D      E    F
0  15 2013-01-01  1.0  3   test  foo
1  11 2013-01-02  2.0  3  train  foo
2  91 2013-01-03  3.0  3   test  foo
3  91 2013-01-04  4.0  3  train  foo
>>> df = pd.DataFrame({'A':[np.random.randint(1,100) for i in range(4)],
...                    'B':pd.date_range(start='20130101', periods=4, freq='D'),
...                    'C':pd.Series([1,2,3,4], index=['zhang','li','zhou','wang'], dtype='float32'),
...                    'D':np.array([3]*4, dtype='int32'),
...                    'E':pd.Categorical(["test","train","test","train"]),
...                    'F':'foo'})
>>> print(df)
        A          B    C  D      E    F
zhang  36 2013-01-01  1.0  3   test  foo
li     86 2013-01-02  2.0  3  train  foo
zhou   10 2013-01-03  3.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
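Because the tables above are filled with random numbers, every run prints different values. The following is a minimal sketch of one way to make such a table reproducible, assuming NumPy's seeded `default_rng` generator is acceptable in place of the calls used in this article; the helper name `make_int_frame` and the seed value are hypothetical:

```python
import numpy as np
import pandas as pd


def make_int_frame(seed):
    """Build a 12x4 frame of integers in [1, 100) from a seeded generator."""
    rng = np.random.default_rng(seed)
    # 'MS' (month start) is used only because it is spelled the same across
    # pandas versions; the examples in this article use month-end dates.
    dates = pd.date_range(start="2017-01-01", periods=12, freq="MS")
    data = rng.integers(1, 100, size=(12, 4))  # upper bound is exclusive
    return pd.DataFrame(data, index=dates, columns=list("ABCD"))


first = make_int_frame(seed=42)   # the seed value is arbitrary
second = make_int_frame(seed=42)
print(first.equals(second))       # same seed, same table
print(first.head(3))
```

Generating the whole array in one call also avoids the nested list comprehension, which matters more as the table grows.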
3. Viewing two-dimensional data
>>> df.head()  # Shows the first 5 rows by default
        A          B    C  D      E    F
zhang  36 2013-01-01  1.0  3   test  foo
li     86 2013-01-02  2.0  3  train  foo
zhou   10 2013-01-03  3.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
>>> df.head(3)  # View the first 3 rows
        A          B    C  D      E    F
zhang  36 2013-01-01  1.0  3   test  foo
li     86 2013-01-02  2.0  3  train  foo
zhou   10 2013-01-03  3.0  3   test  foo
>>> df.tail(2)  # View the last 2 rows
       A          B    C  D      E    F
zhou  10 2013-01-03  3.0  3   test  foo
wang  79 2013-01-04  4.0  3  train  foo
4. View the index, column names, and data of two-dimensional data
>>> df.index
Index(['zhang', 'li', 'zhou', 'wang'], dtype='object')
>>> df.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
>>> df.values
array([[36, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [86, Timestamp('2013-01-02 00:00:00'), 2.0, 3, 'train', 'foo'],
       [10, Timestamp('2013-01-03 00:00:00'), 3.0, 3, 'test', 'foo'],
       [79, Timestamp('2013-01-04 00:00:00'), 4.0, 3, 'train', 'foo']],
      dtype=object)
5. View summary statistics of the data
>>> df.describe()  # Count, mean, standard deviation, minimum, quartiles, and maximum of the numeric columns
               A         C    D
count   4.000000  4.000000  4.0
mean   52.750000  2.500000  3.0
std    36.068222  1.290994  0.0
min    10.000000  1.000000  3.0
25%    29.500000  1.750000  3.0
50%    57.500000  2.500000  3.0
75%    80.750000  3.250000  3.0
max    86.000000  4.000000  3.0
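To make the summary less of a black box, the sketch below recomputes a few entries for column A by hand, using the four values shown in the output above. It illustrates that the reported `std` is the sample standard deviation (`ddof=1`) and that the quartiles use linear interpolation, which is the pandas default:

```python
import pandas as pd

# Column A as printed in the summary above.
a = pd.Series([36, 86, 10, 79], index=["zhang", "li", "zhou", "wang"], name="A")

stats = a.describe()

manual_mean = a.sum() / a.count()
manual_std = a.std(ddof=1)                      # sample standard deviation
q25, q50, q75 = a.quantile([0.25, 0.5, 0.75])   # linear interpolation by default

print(stats)
print(manual_mean, round(manual_std, 6), q25, q50, q75)
```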
6. Two-dimensional data transposition
>>> df.T
                 zhang                   li                 zhou  \
A                   36                   86                   10
B  2013-01-01 00:00:00  2013-01-02 00:00:00  2013-01-03 00:00:00
C                    1                    2                    3
D                    3                    3                    3
E                 test                train                 test
F                  foo                  foo                  foo

                  wang
A                   79
B  2013-01-04 00:00:00
C                    4
D                    3
E                train
F                  foo
7. Sorting
>>> df.sort_index(axis=0, ascending=False)  # Sort by row labels, descending
        A          B    C  D      E    F
zhou   10 2013-01-03  3.0  3   test  foo
zhang  36 2013-01-01  1.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
li     86 2013-01-02  2.0  3  train  foo
>>> df.sort_index(axis=1, ascending=False)  # Sort by column labels, descending
         F      E  D    C          B   A
zhang  foo   test  3  1.0 2013-01-01  36
li     foo  train  3  2.0 2013-01-02  86
zhou   foo   test  3  3.0 2013-01-03  10
wang   foo  train  3  4.0 2013-01-04  79
>>> df.sort_index(axis=0, ascending=True)  # Sort by row labels, ascending
        A          B    C  D      E    F
li     86 2013-01-02  2.0  3  train  foo
wang   79 2013-01-04  4.0  3  train  foo
zhang  36 2013-01-01  1.0  3   test  foo
zhou   10 2013-01-03  3.0  3   test  foo
>>> df.sort_values(by='A')  # Sort by the values in column A
        A          B    C  D      E    F
zhou   10 2013-01-03  3.0  3   test  foo
zhang  36 2013-01-01  1.0  3   test  foo
wang   79 2013-01-04  4.0  3  train  foo
li     86 2013-01-02  2.0  3  train  foo
>>> df.sort_values(by='A', ascending=False)  # Descending order
        A          B    C  D      E    F
li     86 2013-01-02  2.0  3  train  foo
wang   79 2013-01-04  4.0  3  train  foo
zhang  36 2013-01-01  1.0  3   test  foo
zhou   10 2013-01-03  3.0  3   test  foo
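The sorting calls above each use a single key. As a small sketch of sorting on several columns at once, with a separate direction per key, the example below rebuilds only columns A and E from the table above. It also shows that `sort_values` returns a new DataFrame and leaves the original order untouched:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [36, 86, 10, 79], "E": ["test", "train", "test", "train"]},
    index=["zhang", "li", "zhou", "wang"],
)

# Sort by E ascending, then by A descending within each value of E.
by_e_then_a = df.sort_values(by=["E", "A"], ascending=[True, False])

print(by_e_then_a)
print(list(df.index))  # the original frame keeps its order
```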
8. Data selection
From this section on, column A holds different values than in the earlier output; the DataFrame appears to have been regenerated with new random numbers.
>>> df['A']  # Select a column
zhang     1
li        1
zhou     60
wang     58
Name: A, dtype: int64
>>> df[0:2]  # Use slicing to select multiple rows
       A          B    C  D      E    F
zhang  1 2013-01-01  1.0  3   test  foo
li     1 2013-01-02  2.0  3  train  foo
>>> df.loc[:, ['A','C']]  # Select multiple columns
        A    C
zhang   1  1.0
li      1  2.0
zhou   60  3.0
wang   58  4.0
>>> df.loc[['zhang','zhou'], ['A','D','E']]  # Select specific rows and columns at the same time
        A  D     E
zhang   1  3  test
zhou   60  3  test
>>> df.loc['zhang', ['A','D','E']]  # Select specific columns of a single row
A       1
D       3
E    test
Name: zhang, dtype: object
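The selections above are label-based. The sketch below contrasts label-based `loc` with position-based `iloc` and adds a boolean mask, using a reduced copy of the table (columns A, C, and E only). One detail worth noticing is that label slices include their end point while positional slices do not:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 1, 60, 58],
        "C": [1.0, 2.0, 3.0, 4.0],
        "E": ["test", "train", "test", "train"],
    },
    index=["zhang", "li", "zhou", "wang"],
)

by_label = df.loc[["zhang", "zhou"], ["A", "E"]]   # select by labels
by_position = df.iloc[[0, 2], [0, 2]]              # same cells, by integer position

label_slice = df.loc["zhang":"zhou"]   # label slice: end label is included
pos_slice = df.iloc[0:2]               # positional slice: end position is excluded

large_a = df[df["A"] > 50]             # boolean mask keeps rows where A > 50

print(by_label.equals(by_position))
print(len(label_slice), len(pos_slice))
print(large_a)
```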
9. Data modification and setting
>>> df.iat[0,2] = 3  # Modify the value at the given row and column positions
>>> print(df)
        A          B    C  D      E    F
zhang   1 2013-01-01  3.0  3   test  foo
li      1 2013-01-02  2.0  3  train  foo
zhou   60 2013-01-03  3.0  3   test  foo
wang   58 2013-01-04  4.0  3  train  foo
>>> df.loc[:,'D'] = [np.random.randint(50,60) for i in range(4)]  # Replace the values of a column
>>> print(df)
        A          B    C   D      E    F
zhang   1 2013-01-01  3.0  57   test  foo
li      1 2013-01-02  2.0  52  train  foo
zhou   60 2013-01-03  3.0  57   test  foo
wang   58 2013-01-04  4.0  56  train  foo
>>> df['C'] = -df['C']  # Negate the data in the specified column
>>> print(df)
        A          B    C   D      E    F
zhang   1 2013-01-01 -3.0  57   test  foo
li      1 2013-01-02 -2.0  52  train  foo
zhou   60 2013-01-03 -3.0  57   test  foo
wang   58 2013-01-04 -4.0  56  train  foo
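As a compact sketch of the assignment styles available for this kind of edit, the example below sets one cell by position, one cell by label, and a group of cells through a condition. It uses a reduced copy of the table, and the replacement values are arbitrary:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 1, 60, 58], "C": [1.0, 2.0, 3.0, 4.0]},
    index=["zhang", "li", "zhou", "wang"],
)

df.iat[0, 1] = 3.0                 # single cell by integer position (row 0, column 1)
df.at["li", "A"] = 7               # single cell by row and column label
df.loc[df["A"] > 50, "C"] = 0.0    # every row where A > 50, column C

print(df)
```

Writing through a single `loc`, `at`, or `iat` call, rather than chaining two indexing steps, keeps the assignment on the original DataFrame instead of a temporary copy.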
10. Missing value processing
>>> df1 = df.reindex(index=['zhang','li','zhou','wang'], columns=list(df.columns)+['G'])
>>> print(df1)
        A          B    C   D      E    F   G
zhang   1 2013-01-01 -3.0  57   test  foo NaN
li      1 2013-01-02 -2.0  52  train  foo NaN
zhou   60 2013-01-03 -3.0  57   test  foo NaN
wang   58 2013-01-04 -4.0  56  train  foo NaN
>>> df1.iat[0,6] = 3  # Set one element; the other elements of the column remain missing (NaN)
>>> print(df1)
        A          B    C   D      E    F    G
zhang   1 2013-01-01 -3.0  57   test  foo  3.0
li      1 2013-01-02 -2.0  52  train  foo  NaN
zhou   60 2013-01-03 -3.0  57   test  foo  NaN
wang   58 2013-01-04 -4.0  56  train  foo  NaN
>>> pd.isnull(df1)  # Test for missing values; returns a table of True/False
           A      B      C      D      E      F      G
zhang  False  False  False  False  False  False  False
li     False  False  False  False  False  False   True
zhou   False  False  False  False  False  False   True
wang   False  False  False  False  False  False   True
>>> df1.dropna()  # Return only the rows that contain no missing values
       A          B    C   D     E    F    G
zhang  1 2013-01-01 -3.0  57  test  foo  3.0
>>> df1['G'] = df1['G'].fillna(5)  # Fill missing values with a given value; assigning the result back works on all pandas versions, unlike a chained fillna(5, inplace=True)
>>> print(df1)
        A          B    C   D      E    F    G
zhang   1 2013-01-01 -3.0  57   test  foo  3.0
li      1 2013-01-02 -2.0  52  train  foo  5.0
zhou   60 2013-01-03 -3.0  57   test  foo  5.0
wang   58 2013-01-04 -4.0  56  train  foo  5.0
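The steps above detect, drop, and fill missing values one column at a time. A minimal sketch of a few related options follows, using a small made-up table: counting missing values per column, dropping only rows that are entirely empty, and filling each column with its own replacement:

```python
import numpy as np
import pandas as pd

# Illustrative data with gaps in both columns.
df = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0, np.nan], "G": [3.0, np.nan, np.nan, 5.0]}
)

n_missing = df.isna().sum()            # missing values per column
complete_rows = df.dropna()            # rows with no missing value at all
not_all_empty = df.dropna(how="all")   # drop a row only if every value is missing

# A different fill value per column: the column mean for A, zero for G.
filled = df.fillna({"A": df["A"].mean(), "G": 0.0})

print(n_missing)
print(len(complete_rows), len(not_all_empty))
print(filled)
```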
11. Data manipulation
>>> df1.mean()  # Column means; missing values are ignored automatically (pandas 2.0 and later need numeric_only=True to skip the non-numeric columns)
A    30.0
C    -3.0
D    55.5
G     4.5
dtype: float64
>>> df1.mean(1)  # Row means (the values shown average columns A, C and D only)
zhang    18.333333
li       17.000000
zhou     38.000000
wang     36.666667
dtype: float64
>>> df1.shift(1)  # Shift the data down by one row
          A          B    C     D      E    F    G
zhang   NaN        NaT  NaN   NaN    NaN  NaN  NaN
li      1.0 2013-01-01 -3.0  57.0   test  foo  3.0
zhou    1.0 2013-01-02 -2.0  52.0  train  foo  5.0
wang   60.0 2013-01-03 -3.0  57.0   test  foo  5.0
>>> df1['D'].value_counts()  # Count how often each value occurs
57    2
56    1
52    1
Name: D, dtype: int64
>>> print(df1)
        A          B    C   D      E    F    G
zhang   1 2013-01-01 -3.0  57   test  foo  3.0
li      1 2013-01-02 -2.0  52  train  foo  5.0
zhou   60 2013-01-03 -3.0  57   test  foo  5.0
wang   58 2013-01-04 -4.0  56  train  foo  5.0
>>> df2 = pd.DataFrame(np.random.randn(10,4))
>>> print(df2)
          0         1         2         3
0 -0.939904 -1.856658 -0.281965  0.203624
1  0.350162  0.060674 -0.914808  0.135735
2 -1.031384 -1.611274  0.341546 -0.363671
3  0.139464 -0.050959 -0.810610 -0.772648
4 -1.146810 -0.791608  1.488790 -0.490004
5 -0.100707 -0.763545 -0.071274 -0.298142
6 -0.212014  0.809709  0.693196  0.980568
7 -0.812985 -0.000325 -0.675101 -0.217394
8  0.066969 -0.084609 -0.433099  0.535616
9 -0.319120 -0.532854  1.321712 -1.751913
>>> p1 = df2[:3]
>>> print(p1)
          0         1         2         3
0 -0.939904 -1.856658 -0.281965  0.203624
1  0.350162  0.060674 -0.914808  0.135735
2 -1.031384 -1.611274  0.341546 -0.363671
>>> p2 = df2[3:7]
>>> print(p2)
          0         1         2         3
3  0.139464 -0.050959 -0.810610 -0.772648
4 -1.146810 -0.791608  1.488790 -0.490004
5 -0.100707 -0.763545 -0.071274 -0.298142
6 -0.212014  0.809709  0.693196  0.980568
>>> p3 = df2[7:]
>>> print(p3)
          0         1         2         3
7 -0.812985 -0.000325 -0.675101 -0.217394
8  0.066969 -0.084609 -0.433099  0.535616
9 -0.319120 -0.532854  1.321712 -1.751913
>>> df3 = pd.concat([p1, p2, p3])  # Merge the pieces back together by rows
>>> print(df3)
          0         1         2         3
0 -0.939904 -1.856658 -0.281965  0.203624
1  0.350162  0.060674 -0.914808  0.135735
2 -1.031384 -1.611274  0.341546 -0.363671
3  0.139464 -0.050959 -0.810610 -0.772648
4 -1.146810 -0.791608  1.488790 -0.490004
5 -0.100707 -0.763545 -0.071274 -0.298142
6 -0.212014  0.809709  0.693196  0.980568
7 -0.812985 -0.000325 -0.675101 -0.217394
8  0.066969 -0.084609 -0.433099  0.535616
9 -0.319120 -0.532854  1.321712 -1.751913
>>> df2 == df3
      0     1     2     3
0  True  True  True  True
1  True  True  True  True
2  True  True  True  True
3  True  True  True  True
4  True  True  True  True
5  True  True  True  True
6  True  True  True  True
7  True  True  True  True
8  True  True  True  True
9  True  True  True  True
>>> df4 = pd.DataFrame({'A':[np.random.randint(1,5) for i in range(8)],
...                     'B':[np.random.randint(10,15) for i in range(8)],
...                     'C':[np.random.randint(20,30) for i in range(8)],
...                     'D':[np.random.randint(80,100) for i in range(8)]})
>>> print(df4)
   A   B   C   D
0  4  11  24  91
1  1  13  28  95
2  2  12  27  91
3  1  12  20  87
4  3  11  24  96
5  1  13  21  99
6  3  11  22  95
7  2  13  26  98
>>> df4.groupby('A').sum()  # Group the data by A and sum each group
    B   C    D
A
1  38  69  281
2  25  53  189
3  22  46  191
4  11  24   91
>>> df4.groupby(['A','B']).mean()  # Group by A and B and average each group
         C     D
A B
1 12  20.0  87.0
  13  24.5  97.0
2 12  27.0  91.0
  13  26.0  98.0
3 11  23.0  95.5
4 11  24.0  91.0
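The grouping calls above apply one function to every column. A minimal sketch of applying a different function per column follows, using named aggregation (available since pandas 0.25) on the same values as `df4` above. The output column names `B_sum`, `C_mean`, and `D_max` are made up for this example. The last lines check a split-and-concatenate round trip with `equals`, which returns a single boolean instead of a table of `True` values:

```python
import pandas as pd

# Same values as df4 in the output above.
df4 = pd.DataFrame(
    {
        "A": [4, 1, 2, 1, 3, 1, 3, 2],
        "B": [11, 13, 12, 12, 11, 13, 11, 13],
        "C": [24, 28, 27, 20, 24, 21, 22, 26],
        "D": [91, 95, 91, 87, 96, 99, 95, 98],
    }
)

# One aggregation per column; the keyword names become the output columns.
summary = df4.groupby("A").agg(
    B_sum=("B", "sum"),
    C_mean=("C", "mean"),
    D_max=("D", "max"),
)
print(summary)

# Split into two pieces, concatenate, and compare with the original.
rebuilt = pd.concat([df4[:3], df4[3:]])
print(rebuilt.equals(df4))
```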
12. Plotting with matplotlib
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> df = pd.DataFrame(np.random.randn(1000,2), columns=['B','C']).cumsum()
>>> print(df)
              B           C
0      0.089886    0.511081
1      1.323766    1.584758
2      1.489479   -0.438671
3      0.831331   -0.398021
4     -0.248233    0.494418
5     -0.013085    0.684518
6      0.666951   -1.422161
7      1.768838   -0.658786
8      2.661080    0.648505
9      1.951751    0.836261
10     3.538785    1.657475
11     3.254034    2.052609
12     4.248620    1.568401
13     4.077173    0.055622
14     3.452590   -0.200314
15     2.627620   -0.408829
16     3.690537   -0.210440
17     3.184924    0.365447
18     3.646556   -0.150044
19     4.164563   -0.023405
20     2.391447    0.517872
21     2.865153    0.686649
22     3.623183    0.663927
23     1.545117    0.151044
24     3.595924    0.903619
25     3.013804    1.855083
26     4.438801    1.014572
27     5.155216    0.882628
28     4.431457    0.741509
29     2.841949    0.709991
..          ...         ...
970   -7.910646  -13.738689
971   -7.318091  -14.811335
972   -9.144376  -15.466873
973   -9.538658  -15.367167
974   -9.061114  -16.822726
975   -9.803798  -17.368350
976  -10.180575  -17.270180
977  -10.601352  -17.671543
978  -10.804909  -19.535919
979  -10.397964  -20.361419
980  -10.979640  -20.300267
981   -8.738223  -20.202669
982   -9.339929  -21.528973
983   -9.780686  -20.902152
984  -11.072655  -21.235735
985  -10.849717  -20.439201
986  -10.953247  -19.708973
987  -13.032707  -18.687553
988  -12.984567  -19.557132
989  -13.508836  -18.747584
990  -13.420713  -19.883180
991  -11.718125  -20.474092
992  -11.936512  -21.360752
993  -14.225655  -22.006776
994  -13.524940  -20.844519
995  -14.088767  -20.492952
996  -14.169056  -20.666777
997  -14.798708  -19.960555
998  -15.766568  -19.395622
999  -17.281143  -19.089793

[1000 rows x 2 columns]
>>> df['A'] = pd.Series(list(range(len(df))))
>>> print(df)
              B           C    A
0      0.089886    0.511081    0
1      1.323766    1.584758    1
2      1.489479   -0.438671    2
3      0.831331   -0.398021    3
4     -0.248233    0.494418    4
5     -0.013085    0.684518    5
6      0.666951   -1.422161    6
7      1.768838   -0.658786    7
8      2.661080    0.648505    8
9      1.951751    0.836261    9
10     3.538785    1.657475   10
11     3.254034    2.052609   11
12     4.248620    1.568401   12
13     4.077173    0.055622   13
14     3.452590   -0.200314   14
15     2.627620   -0.408829   15
16     3.690537   -0.210440   16
17     3.184924    0.365447   17
18     3.646556   -0.150044   18
19     4.164563   -0.023405   19
20     2.391447    0.517872   20
21     2.865153    0.686649   21
22     3.623183    0.663927   22
23     1.545117    0.151044   23
24     3.595924    0.903619   24
25     3.013804    1.855083   25
26     4.438801    1.014572   26
27     5.155216    0.882628   27
28     4.431457    0.741509   28
29     2.841949    0.709991   29
..          ...         ...  ...
970   -7.910646  -13.738689  970
971   -7.318091  -14.811335  971
972   -9.144376  -15.466873  972
973   -9.538658  -15.367167  973
974   -9.061114  -16.822726  974
975   -9.803798  -17.368350  975
976  -10.180575  -17.270180  976
977  -10.601352  -17.671543  977
978  -10.804909  -19.535919  978
979  -10.397964  -20.361419  979
980  -10.979640  -20.300267  980
981   -8.738223  -20.202669  981
982   -9.339929  -21.528973  982
983   -9.780686  -20.902152  983
984  -11.072655  -21.235735  984
985  -10.849717  -20.439201  985
986  -10.953247  -19.708973  986
987  -13.032707  -18.687553  987
988  -12.984567  -19.557132  988
989  -13.508836  -18.747584  989
990  -13.420713  -19.883180  990
991  -11.718125  -20.474092  991
992  -11.936512  -21.360752  992
993  -14.225655  -22.006776  993
994  -13.524940  -20.844519  994
995  -14.088767  -20.492952  995
996  -14.169056  -20.666777  996
997  -14.798708  -19.960555  997
998  -15.766568  -19.395622  998
999  -17.281143  -19.089793  999

[1000 rows x 3 columns]
>>> plt.figure()
<matplotlib.figure.Figure object at 0x000002A2A0B10F28>
>>> df.plot(x='A')
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A12FE7F0>
>>> plt.show()
Running result: a line chart of columns B and C plotted against A (figure not included).
>>> df = pd.DataFrame(np.random.rand(10,4), columns=['a','b','c','d'])
>>> print(df)
          a         b         c         d
0  0.504434  0.190875  0.001687  0.327372
1  0.406844  0.602029  0.912075  0.815889
2  0.828534  0.985910  0.094662  0.552089
3  0.198843  0.818785  0.750649  0.967054
4  0.498494  0.151378  0.417506  0.264438
5  0.655288  0.672788  0.088616  0.433270
6  0.493127  0.009254  0.179479  0.396655
7  0.419386  0.910986  0.020004  0.229063
8  0.671469  0.612189  0.374920  0.407093
9  0.414978  0.033499  0.756025  0.717849
>>> df.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A17BD7B8>
>>> plt.show()
Running result: a bar chart with one group of bars (a, b, c, d) per row (figure not included).
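>>> df = pd.DataFrame(np.random.rand(10,4), columns=['a','b','c','d'])
>>> df.plot(kind='barh', stacked=True)  # Horizontal bars, with the four columns stacked
<matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A3784390>
>>> plt.show()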
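The plotting sessions above rely on an interactive window opened by `plt.show()`. The following is a minimal sketch of producing the same kind of line chart in a script without a display, assuming matplotlib's non-interactive `Agg` backend is acceptable. The seed, the axis label text, and the in-memory buffer are choices made for this example only:

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; no window is opened
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatable output
df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=["B", "C"]).cumsum()
df["A"] = np.arange(len(df))

# Pass an explicit Axes so the plot lands on a figure we control.
fig, ax = plt.subplots()
df.plot(x="A", ax=ax)             # draws one line each for B and C
ax.set_ylabel("cumulative sum")

# Write the PNG to memory; pass a file name instead to save it to disk.
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)

print(len(ax.get_lines()), buf.getbuffer().nbytes > 0)
```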
I hope this article helps you with your Python programming.