
Python data analysis: pandas DataFrame groupby and index usage

indexing

Series and DataFrame objects are indexed. An index allows fast lookups, and when an operation involves two Series or DataFrames they are automatically aligned on the index (dates, for example, line up automatically), which saves a lot of work.
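For instance, adding two Series with partly overlapping indexes aligns them by label automatically; this is a minimal sketch with made-up data:

import pandas as pd
 
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
print(s1 + s2)  # 'b' and 'c' are added label by label; 'a' and 'd' exist on one side only and become NaN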

missing value

pd.isnull(obj)   # element-wise check for missing values
pd.notnull(obj)  # the inverse check

Convert a dictionary into a DataFrame with column names and an index.

DataFrame(data, columns=['col1','col2','col3'...],
            index = ['i1','i2','i3'...])
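For instance, a minimal sketch with made-up data and labels:

import pandas as pd
 
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'], index=['i1', 'i2', 'i3'])
print(df)  # the dictionary keys become the columns, laid out in the given order with the given index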

View Column Name

df.columns

View Index

df.index

Rebuild Index

obj.reindex(['a','b','c','d','e'...], fill_value=0)
# Reorder the index in the given order rather than replacing it. Where the new index has no value, fill with 0
 
# Modify the index in place
obj.index = ['a','b','c','d','e'...]

Reordering columns (also done with reindex)

df.reindex(columns=['col1','col2','col3'...])
 
# It is also possible to rebuild the index and the columns at the same time
 
df.reindex(index=['a','b','c'...], columns=['col1','col2','col3'...])

A shortcut for reindexing

df.ix[['a','b','c'...], ['col1','col2','col3'...]]
# Note: .ix has been removed in newer pandas; use df.loc[...] instead

Rename Axis Index

df.rename(index=..., columns=...)  # a function or a dict mapping old labels to new ones can be passed
 
# Modify individual index and column names by passing in dictionaries
df.rename(index={'old_index':'new_index'},
          columns={'old_col':'new_col'})

Viewing a column

DataFrame['state'] or DataFrame.state

View a row

A row is selected through its index

df.ix['index_name']

Adding or deleting rows and columns

# Add a column
DataFrame['new_col_name'] = 'char_or_number'
# Delete rows
df.drop(['index1','index2'...])
# Delete columns
df.drop(['col1','col2'...], axis=1)
# or
del DataFrame['col1']

Selecting a subset of a DataFrame

Syntax                Description
obj[val]              Select one or more columns
obj.ix[val]           Select one or more rows
obj.ix[:,val]         Select one or more columns
obj.ix[val1,val2]     Select both rows and columns
reindex               Reindex rows and columns
icol,irow             Select a single column or row by integer position
get_value,set_value   Select a single value by row and column label

For series

obj[['a','b','c'...]]
obj['b':'e']=5

For dataframes

#Select multiple columns
dataframe[['col1','col2'...]]
 
# Select multiple rows
dataframe[m:n]
 
# Conditional filtering
dataframe[dataframe['col3'] > 5]
 
# Select a subset of rows and columns
dataframe.ix[0:3, 0:5]

Operations on dataframes and series

Operations automatically align on both the index and the columns before doing the math, which is very convenient.

Method   Description
add      addition
sub      subtraction
div      division
mul      multiplication
# Where one side has no data, fill with 0 instead of producing NaN
df1.add(df2, fill_value=0)
 
# DataFrame operations with a Series
dataframe - series
# Rule: by default the index of the series is matched against the columns of the dataframe,
# and the operation is broadcast down the rows
 
# Specify the axis direction
dataframe.sub(series, axis=0)
# Rule: with axis=0 the index of the series is matched against the rows of the dataframe,
# and the operation is broadcast across the columns
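A small sketch of both broadcasting directions, using an invented DataFrame (.iloc is used here because .ix has been removed from newer pandas):

import numpy as np
import pandas as pd
 
df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=['col1', 'col2', 'col3'],
                  index=['a', 'b', 'c', 'd'])
row = df.iloc[0]
print(df - row)             # the row's index matches the columns and is broadcast down the rows
col = df['col1']
print(df.sub(col, axis=0))  # axis=0: the series index matches the rows and is broadcast across the columns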

apply function

f = lambda x: x.max() - x.min()
 
# By default the function is applied to each column
dataframe.apply(f)
 
# To apply the function to each row instead
dataframe.apply(f, axis=1)
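As a quick check of both directions, here is a sketch with a made-up DataFrame:

import numpy as np
import pandas as pd
 
df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
f = lambda x: x.max() - x.min()
print(df.apply(f))          # range (max - min) of each column
print(df.apply(f, axis=1))  # range of each row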

Sorting and ranking

# Sort by index by default; axis=1 sorts by the column labels
dataframe.sort_index(axis=0, ascending=False)
 
# Sort by value (older pandas spelled this sort_index(by=...))
dataframe.sort_values(by=['col1','col2'...])
 
# Rank: return the rank of each value
obj.rank(ascending=False)
# If there are duplicate values, the average rank is assigned by default
 
# Rank down the rows or across the columns
dataframe.rank(axis=0)
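For example, how ties are averaged by default (the values are made up):

import pandas as pd
 
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
print(obj.rank())                 # the two 7s share rank 6.5, the two 4s share rank 4.5
print(obj.rank(ascending=False))  # the same ranking from largest to smallest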

descriptive statistics

Method          Description
count           Number of non-NA values
describe        Summary statistics for each column
min, max        Minimum and maximum values
argmin, argmax  Integer index positions of the minimum and maximum values
idxmin, idxmax  Index labels of the minimum and maximum values
quantile        Sample quantile
sum, mean       Sum and mean of the columns
median          Median (50% quantile)
mad             Mean absolute deviation from the mean
var, std        Variance and standard deviation
skew            Skewness (third moment)
kurt            Kurtosis (fourth moment)
cumsum          Cumulative sum
cummin, cummax  Cumulative minimum and cumulative maximum
cumprod         Cumulative product
diff            First-order difference
pct_change      Percentage change

Unique values, value counts, memberships

obj.unique()
obj.value_counts()
obj.isin(['b','c'])
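A concrete sketch of the three calls (the Series is invented):

import pandas as pd
 
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print(obj.unique())          # array of distinct values
print(obj.value_counts())    # frequency of each value, sorted in descending order
print(obj.isin(['b', 'c']))  # boolean mask marking membership in the list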

Handling of missing values

# Filtering out missing values
 
# Drop a row whenever it contains a missing value
data.dropna()
# Drop a row only when all of its values are missing
data.dropna(how='all')
# Do the same column-wise
data.dropna(how='all', axis=1)
 
# Filling in missing values
 
# 1. Fill with zeros
df.fillna(0)
 
# 2. Fill different columns with different values
df.fillna({1: 0.5, 3: -1})
 
# 3. Fill with the column means
df.fillna(df.mean())
 
# The axis parameter works the same way here as above

Groupby

pandas provides a flexible and efficient groupby facility that lets you slice, dice, and summarize a dataset in a natural way.

It splits a pandas object on one or more keys (which can be functions, arrays, or DataFrame column names).

It computes grouped summary statistics such as count, mean, standard deviation, or a user-defined function, and can apply a wide variety of functions to the columns of a DataFrame.

It can apply within-group transformations or other operations such as normalization, linear regression, ranking, or subset selection, compute pivot tables and crosstabs, and perform quantile and other group analyses.
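As a minimal illustration of this split-apply-combine idea, here is a sketch with an invented 'key' / 'value' DataFrame:

import pandas as pd
 
df = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'a'],
                   'value': [1, 2, 3, 4, 5]})
print(df.groupby('key')['value'].mean())                 # mean of 'value' within each group
print(df.groupby('key')['value'].agg(['count', 'sum']))  # several statistics at once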

1) View DataFrame data and properties

df_obj = pd.DataFrame(data)  # Create a DataFrame object
df_obj.dtypes  # View the data type of each column
df_obj['Column name'].astype(int)  # Convert the data type of a column
df_obj.head() # View the first few rows of data, defaults to the first 5 rows.
df_obj.tail() #View the last few rows of data, defaults to the last 5 rows.
df_obj.index #View Index
df_obj.columns #View Column Names
df_obj.values #View data values
df_obj.describe() # Descriptive statistics
df_obj.T #Transpose
df_obj.sort_values(by=['',''])  # Sort by the given columns

2) Selecting data using DataFrame

df_obj.ix[1:3]  # Get rows 1 to 3; this slice operation selects rows of data
df_obj.ix[columns_index]  # Get the data of a column
df_obj.ix[1:3, [1,3]]  # Get rows 1 to 3 of columns 1 and 3
df_obj[columns].drop_duplicates()  # Remove duplicate rows of data

3) Resetting data using DataFrame

df_obj.ix[1:3, [1,3]] = 1  # The data at the selected positions is replaced with 1

4) Use DataFrame to filter data (similar to WHERE in SQL).

alist = ['023-18996609823']
df_obj['User number'].isin(alist)  # Put the values to filter on in a list and use isin; it returns a boolean for each row, True where the value matches
df_obj[df_obj['User number'].isin(alist)]  # Keep only the rows where the match is True

5) Use DataFrame to fuzzy filter data (similar to LIKE in SQL).

df_obj[df_obj['Packages'].str.contains(r'.*?Voice CDMA.*')]  # Fuzzy matching with a regular expression: * matches 0 or more times, ? matches 0 or 1 times

6) Data conversion using DataFrame (additional instructions at a later stage)

df_obj['Branch_maintenance_line'] = df_obj['Branch_maintenance_line'].str.replace('Wuxi Branch(.{2,})sub-bureau', '\\1')  # Regular expressions can be used

When dropping duplicates you can set take_last=True to keep the last occurrence; by default the first occurrence is kept.

Additional note: take_last=True is obsolete, please use keep='last' instead.

7) Reading and writing data with pandas.

pd.read_csv('D:\', sep=';', nrows=2)  # Give the path of the csv file first, then the separator and other options
df.to_excel('', sheet_name='Sheet1'); pd.read_excel('', 'Sheet1', index_col=None, na_values=['NA'])  # Write and read Excel data; pd.read_excel returns the data as a DataFrame
df.to_hdf('foo.h5', 'df'); pd.read_hdf('foo.h5', 'df')  # Write and read HDF5 data

8) Use pandas to aggregate data (similar to GROUP BY or HAVING in SQL).

data_obj['User ID'].groupby(data_obj['Branch_maintenance_line'])
data_obj.groupby('Branch_maintenance_line')['User ID']  # A shorter way of writing the line above
adsl_obj.groupby('Branch_maintenance_line')['User ID'].agg([('ADSL', 'count')])  # Count user IDs grouped by branch office and name the count column ADSL

9) Merge datasets using pandas (similar to JOIN in SQL).

pd.merge(mxj_obj2, mxj_obj1, on='User ID', how='inner')  # Merge mxj_obj1 and mxj_obj2 using 'User ID' as the key of the overlapping column; inner means take the intersection of the two datasets
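To make the behaviour concrete, here is a sketch in which mxj_obj1 and mxj_obj2 are made-up stand-ins for the real datasets:

import pandas as pd
 
mxj_obj1 = pd.DataFrame({'User ID': [1, 2, 3], 'plan': ['A', 'B', 'C']})
mxj_obj2 = pd.DataFrame({'User ID': [2, 3, 4], 'usage': [10, 20, 30]})
print(pd.merge(mxj_obj2, mxj_obj1, on='User ID', how='inner'))  # only User IDs 2 and 3 appear in both frames and survive the inner join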

10) Cleaning up the data

df[df.isnull()]
df[df.notnull()]
df.dropna()  # Remove all rows containing NaN values
df.dropna(axis=1, thresh=3)  # Keep only columns that have at least 3 non-NaN values
df.dropna(how='all')  # Remove only rows where every value is NaN
df.fillna(0)
df.fillna({1: 0, 2: 0.5})  # Fill NaN in column 1 with 0 and in column 2 with 0.5
df.fillna(method='ffill')  # Propagate the previous value forward along the column to fill NaN

a worked example

1. Reading excel data 

The code is as follows

import pandas as pd  # Read the blast furnace data; note that the file name cannot be Chinese
data = pd.read_excel('gaolushuju_201501', '201501', index_col=None, na_values=['NA'])
print(data)

The test results are as follows

    fuel ratio  top temp SW  top temp NW  top temp SE  top temp NE
0   531.46   185   176   176   174
1   510.35   184   173   184   188
2   533.49   180   165   182   177
3   511.51   190   172   179   188
4   531.02   180   167   173   180
5   511.24   174   164   178   176
6   532.62   173   170   168   179
7   583.00   182   175   176   173
8   530.70   158   149   159   156
9   530.32   168   156   169   171
10  528.62   164   150   171   169

2. Slicing, selecting rows or columns, modifying data 

The code is as follows:

data_1row = data.ix[1]
data_5row_2col = data.ix[0:5, [u'fuel ratio', u'top temp SW']]
print(data_1row, data_5row_2col)
data_5row_2col.ix[0:1, 0:2] = 3

The test results are as follows:

fuel ratio     510.35
top temp SW    184.00
top temp NW    173.00
top temp SE    184.00
top temp NE    188.00
Name: 1, dtype: float64
   fuel ratio  top temp SW
0  531.46   185
1  510.35   184
2  533.49   180
3  511.51   190
4  531.02   180
5  511.24   174
      fuel ratio  top temp SW
0    3.00     3
1    3.00     3
2  533.49   180
3  511.51   190
4  531.02   180
5  511.24   174

Format note: data_5row_2col.ix[0:1, 0:2] and data_5row_2col.ix[0:1, [0,2]] are both valid; when picking out specific rows and columns rather than a continuous slice, the labels need to be wrapped in "[]".

3. Sorting 

The code is as follows:

print(data_1row.sort_values())
print(data_5row_2col.sort_values(by=u'fuel ratio'))

The test results are as follows:

top temp NW    173.00
top temp SW    184.00
top temp SE    184.00
top temp NE    188.00
fuel ratio     510.35
Name: 1, dtype: float64
   fuel ratio  top temp SW
1  510.35   184
5  511.24   174
3  511.51   190
4  531.02   180
0  531.46   185
2  533.49   180

4. Deletion of duplicate rows 

The code is as follows:

print(data_5row_2col[u'top temp SW'].drop_duplicates())  # Remove duplicate values from the column

The test results are as follows:

0    185
1    184
2    180
3    190
5    174
Name: top temp SW, dtype: int64

Note: test result 3 shows that the top temp SW value at index=2 duplicates the value at index=4, and test result 4 shows that the duplicate at index=4 has been dropped.

Summary

The above is based on my personal experience; I hope it gives you a useful reference, and I appreciate your support.