indexing
Series and DataFrame objects are indexed. Indexing gives fast lookups, and when an operation involves two Series or DataFrames they are automatically aligned on the index; for example, dates are aligned automatically, which saves a lot of work.
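A minimal sketch of this alignment behaviour (the dates and values here are invented):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['2015-01-01', '2015-01-02', '2015-01-03'])
s2 = pd.Series([10, 20], index=['2015-01-02', '2015-01-03'])

# Addition aligns on the index labels; dates present in only one
# Series come out as NaN rather than being silently mismatched.
print(s1 + s2)
```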
missing value
pd.isnull(obj)  # detect missing values, returns booleans
pd.notnull(obj)  # the inverse of isnull
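A tiny runnable check of both helpers, assuming a Series with one missing entry:

```python
import numpy as np
import pandas as pd

obj = pd.Series([1.0, np.nan, 3.5])
print(pd.isnull(obj))   # True where the value is missing
print(pd.notnull(obj))  # element-wise negation of isnull
```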
Convert a dictionary into a DataFrame with column names and an index.
DataFrame(data, columns=['col1','col2','col3'...], index = ['i1','i2','i3'...])
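For instance, a minimal concrete case (the column names and index labels are made up):

```python
import pandas as pd

data = {'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]}
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'], index=['i1', 'i2'])
print(df)          # rows labelled i1, i2; columns in the order given
print(df.columns)  # view the column names
print(df.index)    # view the index
```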
View column names
DataFrame.columns
View the index
DataFrame.index
Rebuild Index
obj.reindex(['a','b','c','d','e'...], fill_value=0)  # Reorder the existing index in the given order (rather than replacing it); where the index has no value, fill with 0
# To modify the index in place, assign to it directly, e.g. obj.index = [...]
Reordering columns (also done via reindex)
DataFrame.reindex(columns=['col1','col2','col3'...])
# It is also possible to rebuild the index and the columns at the same time
DataFrame.reindex(index=['a','b','c'...], columns=['col1','col2','col3'...])
A shortcut for rebuilding the index
DataFrame.ix[['a','b','c'...], ['col1','col2','col3'...]]
Rename Axis Index
DataFrame.rename(index={'old_index':'new_index'}, columns={'old_col':'new_col'})  # Rename individual index entries and column names by passing in dictionaries
Viewing a column
DataFrame['state'] or DataFrame.state
View a row
Use the ix indexer with the row label:
DataFrame.ix['index_name']
Adding or deleting a column
DataFrame['new_col_name'] = 'char_or_number'
# Delete rows
DataFrame.drop(['index1','index2'...])
# Delete columns
DataFrame.drop(['col1','col2'...], axis=1)  # or del DataFrame['col1']
Selecting a subset of a DataFrame
Method | Description |
---|---|
obj[val] | Select one or more columns |
obj.ix[val] | Select one or more rows |
obj.ix[:,val] | Select one or more columns |
obj.ix[val1,val2] | Select rows and columns at the same time |
reindex | Reindex rows and columns |
icol,irow | Select a single column or row by integer position |
get_value,set_value | Select a single value by row and column labels |
For a Series
obj[['a','b','c'...]]
obj['b':'e'] = 5
For a DataFrame
# Select multiple columns
dataframe[['col1','col2'...]]
# Select multiple rows
dataframe[m:n]
# Conditional filtering
dataframe[dataframe['col3'] > 5]
# Select a subset of rows and columns
dataframe.ix[0:3, 0:5]
Operations on dataframes and series
Arithmetic automatically aligns on both the index and the columns before computing, which is very convenient.
Method | Description |
---|---|
add | addition |
sub | subtraction |
div | division |
mul | multiplication |
# Fill with 0 where one of the two frames has no data
df1.add(df2, fill_value=0)
# Arithmetic between a DataFrame and a Series
dataframe - series
# Rule: the Series index is matched to the DataFrame columns, and the operation is broadcast down over the rows.
# To match on the rows instead, specify the axis
dataframe.sub(series, axis=0)
# Rule: the Series index is matched to the DataFrame rows, and the operation is broadcast across the columns.
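A minimal runnable sketch of these alignment rules (the frame here is invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=['b', 'd', 'e'])
row = df.iloc[0]   # a Series whose index is df's columns
col = df['d']      # a Series whose index is df's rows

print(df - row)             # broadcast down over the rows (matches on columns)
print(df.sub(col, axis=0))  # broadcast across the columns (matches on rows)
```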
apply function
f = lambda x: x.max() - x.min()
# By default the function is applied to each column
dataframe.apply(f)
# To apply it to each row instead
dataframe.apply(f, axis=1)
Sorting and ranking
# Sort by index; axis=1 sorts by the column labels instead
dataframe.sort_index(axis=0, ascending=False)
# Sort by value (older pandas spelling; newer versions use dataframe.sort_values(by=...))
dataframe.sort_index(by=['col1','col2'...])
# Rank: returns the rank of each value
obj.rank(ascending=False)
# Duplicate values are assigned the average of their rank order
# Rank over rows or columns
dataframe.rank(axis=0)
descriptive statistics
Method | Description |
---|---|
count | Number of non-NA values |
describe | Summary statistics for each column |
min,max | Minimum and maximum values |
argmin,argmax | Integer index positions of the minimum and maximum values |
idxmin,idxmax | Index labels of the minimum and maximum values |
quantile | Sample quantile |
sum,mean | Column sums and means |
median | Median (the 50% quantile) |
mad | Mean absolute deviation from the mean |
var,std | Variance and standard deviation |
skew | Skewness (third moment) |
kurt | Kurtosis (fourth moment) |
cumsum | Cumulative sum |
cummin,cummax | Cumulative minimum and cumulative maximum |
cumprod | Cumulative product |
diff | First-order difference |
pct_change | Percentage change |
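A short sketch of a few of these on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.4, 7.1, 0.75], 'two': [-4.5, 0.5, -1.3]})
print(df.describe())  # common summary statistics per column
print(df.idxmax())    # index label of each column's maximum
print(df.cumsum())    # running totals down each column
```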
Unique values, value counts, memberships
obj.unique()
obj.value_counts()
obj.isin(['b','c'])
Handling of missing values
# Filtering out missing values
# Drop a row whenever it contains any missing value
data.dropna()
# Drop a row only when all of its values are missing
data.dropna(how='all')
# The same judged by column
data.dropna(how='all', axis=1)
# Filling in missing values
# 1. Fill with zeros
data.fillna(0)
# 2. Fill different columns with different values
data.fillna({1: 0.5, 3: -1})
# 3. Fill with the mean
data.fillna(data.mean())
# The axis parameter here works the same as above
Groupby
pandas provides a flexible and efficient groupby facility that lets you slice, dice, and summarize a dataset in a natural way:
Split pandas objects into pieces based on one or more keys (which can be functions, arrays, or DataFrame column names).
Compute grouped summary statistics such as count, mean, or standard deviation, or a user-defined function, and apply a wide variety of functions to the columns of a DataFrame.
Apply within-group transformations or other operations, such as normalization, linear regression, ranking, or selecting subsets. Compute pivot tables and cross-tabulations. Perform quantile analysis and other grouped analyses.
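A minimal sketch of this split-apply-combine pattern (the key and data columns are invented):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'data': [1, 2, 3, 4]})
grouped = df.groupby('key')['data']
print(grouped.mean())                         # mean of 'data' within each group
print(grouped.agg(['count', 'mean', 'std']))  # several statistics at once
```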
1) View DataFrame data and properties
df_obj = DataFrame()  # Create a DataFrame object
df_obj.dtypes  # View the data type of each column
df_obj['Column name'].astype(int)  # Convert the data type of a column
df_obj.head()  # View the first few rows of data, defaults to the first 5
df_obj.tail()  # View the last few rows of data, defaults to the last 5
df_obj.index  # View the index
df_obj.columns  # View the column names
df_obj.values  # View the data values
df_obj.describe()  # Descriptive statistics
df_obj.T  # Transpose
df_obj.sort_values(by=['',''])  # Sort by value, as above
2) Selecting data with a DataFrame
df_obj.ix[1:3]  # Get rows 1 to 3; this kind of slice operation selects rows
df_obj.ix[columns_index]  # Get the data for a column
df_obj.ix[1:3, [1,3]]  # Get rows 1 to 3 of columns 1 and 3
df_obj[columns].drop_duplicates()  # Remove duplicate rows of data
3) Resetting data with a DataFrame
df_obj.ix[1:3, [1,3]] = 1  # The data at the selected positions is replaced with 1
4) Use DataFrame to filter data (similar to WHERE in SQL).
alist = ['023-18996609823']
df_obj['User number'].isin(alist)  # Put the values to filter on into a list and use isin; returns True for each row that matches
df_obj[df_obj['User number'].isin(alist)]  # Get the rows whose match is True
5) Use DataFrame to fuzzy filter data (similar to LIKE in SQL).
df_obj[df_obj['Packages'].str.contains(r'.*?Voice CDMA.*')]  # Fuzzy matching with a regular expression: * matches 0 or more occurrences, ? matches 0 or 1
6) Converting data with a DataFrame (more on this below)
df_obj['Branch_maintenance_line'] = df_obj['Branch_maintenance_line'].str.replace('Wuxi Branch(.{2,})sub-bureau', '\\1')  # Regular expressions can be used
In drop_duplicates you can set take_last=True to keep the last occurrence rather than the first.
Additional note: take_last=True is deprecated; use keep='last' instead.
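A small sketch of the two spellings (the frame is made up):

```python
import pandas as pd

df = pd.DataFrame({'k': ['a', 'a', 'b'], 'v': [1, 2, 3]})
# Older pandas: df.drop_duplicates('k', take_last=True)
print(df.drop_duplicates('k', keep='last'))  # keep the last duplicate row
```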
7) Reading and writing data with pandas
pd.read_csv('D:\', sep=';', nrows=2)  # First give the csv file path, then choose the separator and other options
df.to_excel('', sheet_name='Sheet1')
pd.read_excel('', 'Sheet1', index_col=None, na_values=['NA'])  # Write and read Excel data; pd.read_excel returns the data as a DataFrame
df.to_hdf('foo.h5', 'df')
pd.read_hdf('foo.h5', 'df')  # Write and read HDF5 data
8) Use pandas to aggregate data (similar to GROUP BY or HAVING in SQL).
data_obj['User ID'].groupby(data_obj['Branch_maintenance_line'])
data_obj.groupby('Branch_maintenance_line')['User ID']  # Shorthand for the line above
adsl_obj.groupby('Branch_maintenance_line')['User ID'].agg([('ADSL', 'count')])  # Count the user IDs for each branch office and name the count column ADSL
9) Merge datasets using pandas (similar to JOIN in SQL).
pd.merge(mxj_obj2, mxj_obj1, on='User ID', how='inner')  # Merge mxj_obj1 and mxj_obj2 using 'User ID' as the key of their overlapping column; 'inner' takes the intersection of the two datasets
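A minimal runnable sketch of an inner merge on a shared key (the frames are invented):

```python
import pandas as pd

left = pd.DataFrame({'User ID': [1, 2, 3], 'plan': ['A', 'B', 'C']})
right = pd.DataFrame({'User ID': [2, 3, 4], 'branch': ['X', 'Y', 'Z']})
# how='inner' keeps only the User IDs present in both frames (here 2 and 3)
print(pd.merge(left, right, on='User ID', how='inner'))
```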
10) Cleaning up the data
df[df.isnull()]
df[df.notnull()]
df.dropna()  # Remove all rows that contain a NaN
df.dropna(axis=1, thresh=3)  # Along the columns, keep only those with at least 3 non-NaN values
df.dropna(how='all')  # Remove only rows in which every value is NaN
df.fillna(0)  # Fill NaNs with 0
df.fillna({1: 0, 2: 0.5})  # Fill NaNs in column 1 with 0 and in column 2 with 0.5
df.fillna(method='ffill')  # Fill each NaN with the previous value along the column
A worked example
1. Reading excel data
The code is as follows
import pandas as pd

# Read the blast furnace data; note that the file name cannot be Chinese
data = pd.read_excel('gaolushuju_201501', '201501', index_col=None, na_values=['NA'])
print data
The test results are as follows
    fuel ratio  top temp southwest  top temp northwest  top temp southeast  top temp northeast
0       531.46                 185                 176                 176                 174
1       510.35                 184                 173                 184                 188
2       533.49                 180                 165                 182                 177
3       511.51                 190                 172                 179                 188
4       531.02                 180                 167                 173                 180
5       511.24                 174                 164                 178                 176
6       532.62                 173                 170                 168                 179
7       583.00                 182                 175                 176                 173
8       530.70                 158                 149                 159                 156
9       530.32                 168                 156                 169                 171
10      528.62                 164                 150                 171                 169
2. Slicing, selecting rows or columns, modifying data
The code is as follows:
data_1row = data.ix[1]
data_5row_2col = data.ix[0:5, [u'fuel ratio', u'top temp southwest']]
print data_1row, data_5row_2col
data_5row_2col.ix[0:1, 0:2] = 3
The test results are as follows:
fuel ratio            510.35
top temp southwest    184.00
top temp northwest    173.00
top temp southeast    184.00
top temp northeast    188.00
Name: 1, dtype: float64

    fuel ratio  top temp southwest
0       531.46                 185
1       510.35                 184
2       533.49                 180
3       511.51                 190
4       531.02                 180
5       511.24                 174

    fuel ratio  top temp southwest
0         3.00                   3
1         3.00                   3
2       533.49                 180
3       511.51                 190
4       531.02                 180
5       511.24                 174
A note on the format: data_5row_2col.ix[0:1, 0:2] selects a contiguous range of columns, while data_5row_2col.ix[0:1, [0,2]] selects particular columns; to pick out individual rows or columns you need to add the "[]".
3. Sorting
The code is as follows:
print data_1row.sort_values()
print data_5row_2col.sort_values(by=u'fuel ratio')
The test results are as follows:
top temp northwest    173.00
top temp southwest    184.00
top temp southeast    184.00
top temp northeast    188.00
fuel ratio            510.35
Name: 1, dtype: float64

    fuel ratio  top temp southwest
1       510.35                 184
5       511.24                 174
3       511.51                 190
4       531.02                 180
0       531.46                 185
2       533.49                 180
4. Deletion of duplicate rows
The code is as follows:
print data_5row_2col[u'top temp southwest'].drop_duplicates()  # Remove duplicate values
The test results are as follows:
0    185
1    184
2    180
3    190
5    174
Name: top temp southwest, dtype: int64
Note: test result 3 shows that the top temp southwest value at index=2 duplicates the value at index=4, and test result 4 shows that the row at index=4 has been dropped (only the first occurrence is kept).
Summary
The above is my personal experience; I hope it gives you a useful reference, and I hope you will continue to support me.