《Python for Data Analysis》
GroupBy
Grouping operations: split-apply-combine
A DataFrame can be grouped along its rows (axis=0) or its columns (axis=1). The data is first split into groups by one or more keys, a function is then applied to each group to produce a new value, and finally the results of all those function calls are combined into a single result object.
The size method of a GroupBy object returns a Series containing the size of each group.
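A minimal sketch of the three steps and of size, assuming a small made-up DataFrame (the column names key1, key2, data1 and data2 are illustrative only):

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Split on key1, apply mean to data1, combine into a Series indexed by key1.
means = df.groupby("key1")["data1"].mean()

# size() counts the rows in each (key1, key2) group.
sizes = df.groupby(["key1", "key2"]).size()

print(means)
print(sizes)
```

With two keys, the result of size is indexed by a hierarchical index made of the unique key pairs.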
Iterating over groups
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)
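A runnable sketch of the iteration, assuming a made-up DataFrame named df with key1 and key2 columns. It also shows one common follow-up, collecting the groups into a dict:

```python
import pandas as pd

# Hypothetical toy data; names are assumptions for illustration.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# With two keys, each group name is a tuple (key1 value, key2 value).
for (k1, k2), group in df.groupby(["key1", "key2"]):
    print(k1, k2)
    print(group)

# Grouping on a single column name yields scalar group names,
# so the pieces can be stored in a dict keyed by group.
pieces = {name: group for name, group in df.groupby("key1")}
print(pieces["b"])
```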
Selecting a column or a subset of columns
df.groupby(['key1', 'key2'])[['data2']].mean()
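The choice between a list of column names and a single name decides the type of the result. A small sketch, assuming a made-up DataFrame named df:

```python
import pandas as pd

# Hypothetical toy data; names are assumptions for illustration.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

grouped = df.groupby(["key1", "key2"])

# A list of names keeps the result a DataFrame.
as_frame = grouped[["data2"]].mean()

# A single name gives a Series instead.
as_series = grouped["data2"].mean()

print(as_frame)
print(as_series)
```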
Grouping by Dictionary or Series
Simply pass the dictionary or Series to groupby. Its values are used as the group keys for the matching index labels.
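As a sketch, assuming a made-up DataFrame named people whose row labels are mapped to hypothetical colour groups:

```python
import pandas as pd

# Hypothetical toy data; the names and the mapping are assumptions.
people = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]},
    index=["Joe", "Steve", "Wes", "Jim"],
)
mapping = {"Joe": "red", "Steve": "blue", "Wes": "red", "Jim": "blue"}

# The dict maps each row label to its group key.
by_dict = people.groupby(mapping).sum()

# A Series works the same way; it is aligned on the index first.
by_series = people.groupby(pd.Series(mapping)).sum()

print(by_dict)
```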
Grouping by function
people.groupby(len).sum()  # group people by the length of their names
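A function passed as the key is called once per index label, and its return values become the group names. A sketch with a made-up people DataFrame:

```python
import pandas as pd

# Hypothetical toy data; names are assumptions for illustration.
people = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# len is applied to each row label, so rows are grouped by name length.
by_len = people.groupby(len).sum()

print(by_len)
```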
Grouping by Index Level
For hierarchically indexed data, you can aggregate by index level: pass the level number or name through the level keyword.
hier_df.groupby(level='cty', axis=1).count()
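A sketch of grouping by level, assuming a made-up frame with a two-level row index. It groups on a row level rather than with axis=1, because newer pandas releases discourage axis=1 in groupby; transposing first is one way to group column levels.

```python
import pandas as pd

# Hypothetical hierarchical index; level names are assumptions.
index = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)
hier = pd.DataFrame({"x": [1.0, 2.0, float("nan"), 4.0, 5.0]}, index=index)

# Group by the level name; count() ignores missing values.
by_name = hier.groupby(level="cty").count()

# The level number gives the same grouping.
by_number = hier.groupby(level=0).count()

print(by_name)
```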
Data aggregation
Optimized groupby methods
Function name | Description |
---|---|
count | Number of non-NA values in the group |
sum | Sum of non-NA values |
mean | Mean of non-NA values |
median | Arithmetic median of non-NA values |
std, var | Unbiased (n-1 denominator) standard deviation and variance |
min, max | Minimum and maximum of non-NA values |
prod | Product of non-NA values |
first, last | First and last non-NA values |
For the descriptive statistics above, the method name can be passed to agg as a string. For example: grouped.agg(['mean', 'std'])
To use your own aggregation function, pass it to the aggregate or agg method:
def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)
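A runnable sketch of both forms, assuming made-up data and a grouped object built from it:

```python
import pandas as pd

# Hypothetical toy data; names are assumptions for illustration.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})
grouped = df.groupby("key1")


def peak_to_peak(arr):
    """Range of the values in one group."""
    return arr.max() - arr.min()


# Built-in statistics can be requested by name.
by_name = grouped["data1"].agg(["mean", "std"])

# A custom function is called once per group.
custom = grouped["data1"].agg(peak_to_peak)

print(by_name)
print(custom)
```

Custom functions are generally slower than the optimized methods in the table, since they are called group by group.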
Column-wise and multiple function application. You can use different aggregation functions for different columns, or apply several functions at once.
If you pass a list of functions or function names, the columns of the resulting DataFrame are named after the corresponding functions (# 1).
If you pass a list of (name, function) tuples, the first element of each tuple is used as the column name in the resulting DataFrame (# 2).
The same list of functions can be applied to several columns at once, which gives the result hierarchical columns (# 3).
Different aggregation functions for different columns can also be passed to agg as a dictionary that maps column names to functions (# 4).
grouped.agg(['mean', 'std', peak_to_peak])  # 1
grouped.agg([('foo', 'mean'), ('bar', np.std)])  # 2
functions = ['count', 'mean', 'max']
result = grouped[['tip', 'bill']].agg(functions)  # 3
grouped.agg({'tip': np.max, 'bill': 'sum'})  # 4
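The four forms can be tried on a small frame. This is a sketch with made-up tips-like data; it uses string function names throughout, which is an assumption of the sketch rather than a requirement:

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
tips = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "tip": [1.0, 3.0, 2.0, 6.0],
    "bill": [10.0, 20.0, 20.0, 30.0],
})
grouped = tips.groupby("smoker")

# 1: a list of names gives columns named after the functions.
multi = grouped["tip"].agg(["mean", "max"])

# 2: (name, function) tuples set the column names explicitly.
named = grouped["tip"].agg([("foo", "mean"), ("bar", "max")])

# 3: the same functions on several columns give hierarchical columns.
both = grouped[["tip", "bill"]].agg(["count", "mean", "max"])

# 4: a dict maps each column to its own function.
per_col = grouped.agg({"tip": "max", "bill": "sum"})

print(both)
print(per_col)
```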
Group-wise operations and transformations
transform
transform applies a function to each group and then places the results in the appropriate locations, so the output has the same index as the input. If a group produces a scalar value, that value is broadcast across the group.
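A sketch of the broadcasting behaviour, assuming made-up data. The group mean is a scalar per group, so transform repeats it for every row of that group, which makes demeaning within groups a one-liner:

```python
import pandas as pd

# Hypothetical toy data; names are assumptions for illustration.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "value": [1.0, 2.0, 3.0, 6.0],
})

# One mean per group, broadcast back to the original row positions.
group_mean = df.groupby("key")["value"].transform("mean")

# Subtracting it centres each group on zero.
demeaned = df["value"] - group_mean

print(group_mean)
print(demeaned)
```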
apply
The general "split-apply-combine" method.
tips.groupby('smoker').apply(top) calls the top function on each piece of the DataFrame and then assembles the results, labelling each piece with its group name. The final result therefore has a hierarchical index whose inner level holds the index values of the original DataFrame.
Disabling group keys: by default, the group keys are combined with the index of the original object to form a hierarchical index in the result. Pass group_keys=False to groupby to disable this: tips.groupby('smoker', group_keys=False).apply(top)
Calling describe on a GroupBy object is equivalent to f = lambda x: x.describe(); grouped.apply(f).
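A sketch of apply with and without group keys, assuming made-up tips-like data and a hypothetical top helper that returns the n rows with the largest tip. The columns are selected explicitly so that the helper sees only the data columns:

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
tips = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip": [1.0, 3.0, 2.0, 6.0, 4.0, 5.0],
    "bill": [10.0, 20.0, 15.0, 30.0, 25.0, 28.0],
})


def top(frame, n=2, column="tip"):
    """Hypothetical helper: the n rows with the largest values in column."""
    return frame.sort_values(column, ascending=False).head(n)


cols = ["tip", "bill"]

# Default: group names become the outer level of a hierarchical index.
with_keys = tips.groupby("smoker")[cols].apply(top)

# group_keys=False keeps only the original row labels.
without_keys = tips.groupby("smoker", group_keys=False)[cols].apply(top)

print(with_keys)
print(without_keys)
```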
Data aggregation tools
Pivot tables: pivot_table
A pivot table aggregates data by one or more keys and arranges the result in a rectangle, with some of the group keys along the rows and some along the columns.
tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'], columns='day', aggfunc='mean', fill_value=0)
Parameter name | Description |
---|---|
values | Name(s) of the column(s) to aggregate. By default, all numeric columns are aggregated |
index | Column names or other group keys to group on along the rows of the resulting pivot table |
columns | Column names or other group keys to group on along the columns of the resulting pivot table |
aggfunc | Aggregation function or list of functions, "mean" by default. Can be any function that is valid for groupby |
fill_value | Value used to replace missing values in the result table |
margins | Add row/column subtotals and a grand total, False by default |
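A sketch of the fill_value and margins parameters, assuming made-up tips-like data in which one (smoker, day) combination is missing:

```python
import pandas as pd

# Hypothetical toy data; there is no row for smoker "Yes" on "Thur".
tips = pd.DataFrame({
    "day": ["Thur", "Fri", "Fri", "Fri"],
    "smoker": ["No", "No", "No", "Yes"],
    "tip_pct": [0.10, 0.15, 0.25, 0.30],
})

# The empty (Yes, Thur) cell is filled with 0 instead of NaN.
table = tips.pivot_table(
    values="tip_pct", index="smoker", columns="day",
    aggfunc="mean", fill_value=0,
)

# margins=True adds an "All" row and column with the overall means.
with_totals = tips.pivot_table(
    values="tip_pct", index="smoker", columns="day",
    aggfunc="mean", margins=True,
)

print(table)
print(with_totals)
```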
Cross-tabulation: crosstab
A cross-tabulation is a special case of a pivot table that computes group frequencies.
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
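A sketch of counting group frequencies with crosstab, assuming made-up tips-like data with day and smoker columns:

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
tips = pd.DataFrame({
    "day": ["Thur", "Fri", "Fri", "Fri"],
    "smoker": ["No", "No", "No", "Yes"],
})

# Rows are days, columns are smoker values, cells are row counts.
# margins=True adds "All" totals for rows and columns.
counts = pd.crosstab(tips["day"], tips["smoker"], margins=True)

print(counts)
```

Combinations that never occur show up as 0 rather than as missing values.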