1. Introduction
Pandas is a professional data analysis tool built on top of NumPy. It can handle a wide variety of data sets flexibly and efficiently, and it will be invaluable for the analysis examples later on. It provides two main data structures, DataFrame and Series: you can roughly think of a DataFrame as a table in Excel, and of a Series as a single column of that table.
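To make that relationship concrete, here is a minimal sketch; the column names score and rank are invented for illustration and do not come from the examples below.

import pandas

# A Series is a single labelled column of values
scores = pandas.Series([66, 77, 88], name='score')
print(scores)

# A DataFrame is a table; each of its columns is a Series
table = pandas.DataFrame({'score': [66, 77, 88], 'rank': [3, 2, 1]})
print(table)
print(type(table['score']))  # <class 'pandas.core.series.Series'>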
2. Create a DataFrame
# -*- encoding=utf-8 -*-
import pandas

if __name__ == '__main__':
    test_stu = pandas.DataFrame(
        {'Higher Mathematics': [66, 77, 88, 99, 85],
         'Big stuff': [88, 77, 85, 78, 65],
         'English': [99, 84, 87, 56, 75]},
    )
    print(test_stu)

    stu = pandas.DataFrame(
        {'Higher Mathematics': [66, 77, 88, 99, 85],
         'Big stuff': [88, 77, 85, 78, 65],
         'English': [99, 84, 87, 56, 75]},
        index=['Little Red', 'Little Lee', 'White', 'Blackie', 'Little Green']  # Specify the index
    )
    print(stu)
Output:

   Higher Mathematics  Big stuff  English
0                  66         88       99
1                  77         77       84
2                  88         85       87
3                  99         78       56
4                  85         65       75
              Higher Mathematics  Big stuff  English
Little Red                    66         88       99
Little Lee                    77         77       84
White                         88         85       87
Blackie                       99         78       56
Little Green                  85         65       75
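Since each column of a DataFrame is a Series, the stu table built above can also be taken apart again. A small follow-up sketch, reusing only names defined in the code above:

print(stu['English'])  # One column, returned as a Series
print(stu.loc['Little Red'])  # One row, selected by its index label
print(stu.loc['Little Red', 'English'])  # A single value: 99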
3. Read CSV or Excel (.xlsx) files and perform simple operations (add, delete, modify, query)
# -*- encoding=utf-8 -*-
import pandas

if __name__ == '__main__':
    data = pandas.read_csv('', engine='python')  # Read the csv file with the python parsing engine

    print(data.head(5))  # Show the first 5 rows
    print(data.tail(5))  # Show the last 5 rows
    print(data)  # Show all data
    print(data['height'])  # Show the height column
    print(data[['height', 'weight']])  # Show the height and weight columns

    data.to_csv('')  # Save to a csv file
    data.to_excel('')  # Save to an xlsx file

    data.info()  # View data information (total number of rows, missing values, data types)
    print(data.describe())  # count of non-null values, mean, std (standard deviation), min, max, 25%/50%/75% quantiles

    data['New Columns'] = range(0, len(data))  # Add a column, just like assigning a new key in a dictionary
    print(data)

    new_data = data.drop('New Columns', axis=1, inplace=False)  # Drop a column; if inplace is True the source data is changed and None is returned, otherwise a new DataFrame is returned and the source data is untouched
    print(new_data)

    data['Weight + Height'] = data['height'] + data['weight']
    print(data)

    data['remark'] = data['remark'].str.replace('to', '')  # Manipulate strings
    print(data['remark'])

    data['birth'] = pandas.to_datetime(data['birth'])  # Convert to a datetime type
    print(data['birth'])
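The snippet above expects a CSV file on disk, but the file path has been left out, so it will not run as written. As a self-contained sketch of the same operations, you can build a small DataFrame in memory instead; the column names height, weight, remark and birth mirror the ones used above, while the values are invented for illustration. Reading and writing Excel (.xlsx) works the same way through pandas.read_excel / DataFrame.to_excel, which usually need an extra engine such as openpyxl installed.

import pandas

# Invented sample data with the same column names as in the article
data = pandas.DataFrame({
    'height': [166, 175, 180],
    'weight': [55, 70, 80],
    'remark': ['to be confirmed', 'ok', 'to check'],
    'birth': ['1995-01-01', '1996-02-02', '1997-03-03'],
})

data.info()  # Row count, non-null counts, data types
print(data.describe())  # Summary statistics for the numeric columns
data['New Columns'] = range(0, len(data))  # Add a column
new_data = data.drop('New Columns', axis=1)  # Return a copy without that column
data['remark'] = data['remark'].str.replace('to', '')  # String manipulation on a whole column
data['birth'] = pandas.to_datetime(data['birth'])  # Convert strings to datetimes
print(data)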
4. Filter and slice data by condition
# -*- encoding=utf-8 -*-
import pandas

if __name__ == '__main__':
    data = pandas.read_csv('', engine='python')  # Read the csv file with the python parsing engine

    a = data.loc[:12, :]  # Take rows 0-12, all columns
    # print(a)
    b = data.iloc[:, [1, 3]]  # All rows, only columns 1 and 3
    # print(b)
    c = data.iloc[0:12, 0:4]  # Rows 0-12, columns 0-4
    # print(c)

    d = data['sex'] == 1  # Boolean mask: sex equals 1 (male)
    # print(d)
    f = data.loc[data['sex'] == 1, :]  # Rows where sex is 1 (male)
    # print(f)
    g = data.loc[:, ['weight', 'height']]  # Select the weight and height columns
    # print(g)
    h = data.loc[data['height'].isin([166, 175]), :]  # Rows where height is 166 or 175
    # print(h)
    h1 = data.loc[data['height'].isin([166, 175]), ['weight', 'height']]  # Same rows, only the weight and height columns
    # print(h1)

    i = data['height'].mean()  # Mean
    j = data['height'].std()  # Standard deviation
    k = data['height'].median()  # Median
    l = data['height'].min()  # Minimum
    m = data['height'].max()  # Maximum
    # print(i)
    # print(j)
    # print(k)
    # print(l)
    # print(m)

    n = data.loc[
        (data['height'] > data['height'].mean()) & (data['weight'] > data['weight'].mean()), :]
    # Height greater than the mean height AND weight greater than the mean weight;
    # note that "and" does not work here, use & instead (and | for "or")
    print(n)
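To experiment with the selection syntax without a CSV file, the following self-contained sketch applies the same .loc / .iloc patterns to a small invented DataFrame (the sex, height and weight columns mirror the ones above; the values are made up):

import pandas

data = pandas.DataFrame({
    'sex': [1, 0, 1, 0, 1],
    'height': [166, 158, 175, 160, 182],
    'weight': [60, 48, 72, 50, 80],
})

males = data.loc[data['sex'] == 1, :]  # Boolean filtering on rows
tall_and_heavy = data.loc[
    (data['height'] > data['height'].mean()) &
    (data['weight'] > data['weight'].mean()), :]  # Combine conditions with & (and |)
by_position = data.iloc[0:3, 0:2]  # Slice by integer position
print(males)
print(tall_and_heavy)
print(by_position)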
5. Clean NaN data, deduplicate, group, and merge
# -*- encoding=utf-8 -*-
import pandas

if __name__ == '__main__':
    sheet1 = pandas.read_excel('', sheet_name='Sheet1')  # Read Sheet1
    # print(sheet1)
    sheet2 = pandas.read_excel('', sheet_name='Sheet2')  # Read Sheet2
    # print(sheet2)

    a = pandas.concat([sheet1, sheet2])  # Merge the two sheets
    # print(a)

    b = a.dropna()  # Drop rows that contain NaN values
    # print(b)
    b1 = a.dropna(subset=['weight'])  # Drop rows where the specified column is NaN
    # print(b1)

    c = b.drop_duplicates()  # Drop duplicate rows
    # print(c)
    d = b.drop_duplicates(subset=['weight'])  # Drop duplicates in the specified column
    # print(d)
    e = b.drop_duplicates(subset=['weight'], keep='last')  # Drop duplicates in the specified column, keeping the last occurrence
    # print(e)

    f = a.sort_values(['weight'], ascending=False)  # Sort by weight from largest to smallest
    # print(f)

    g = a.groupby(['sex']).sum()  # Group by sex, then sum
    # print(g)
    g1 = a.groupby(['sex'], as_index=False).sum()  # Group by sex, then sum, but keep sex as a column instead of the index
    # print(g1)
    g2 = a.groupby(['sex', 'weight']).sum()  # Group by sex, then by weight, then sum
    # print(g2)

    h = pandas.cut(c['weight'], bins=[80, 90, 100, 150, 200])  # Bin weight into intervals
    print(h)
    c['Split according to weight'] = h  # This triggers a warning, unresolved here, but it does not affect the result
    print(c)
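Here is a self-contained sketch of the same pipeline that needs no Excel file; the two small DataFrames stand in for Sheet1 and Sheet2, the column names are reused from above, and the values are invented:

import pandas

sheet1 = pandas.DataFrame({'sex': [1, 0, 1], 'weight': [85, None, 95]})
sheet2 = pandas.DataFrame({'sex': [0, 1, 1], 'weight': [92, 120, 85]})

a = pandas.concat([sheet1, sheet2])  # Stack the two tables
b = a.dropna(subset=['weight'])  # Drop rows whose weight is NaN
c = b.drop_duplicates(subset=['weight'])  # Keep the first row for each weight
g = c.groupby(['sex'], as_index=False).sum()  # Total weight per sex
bins = pandas.cut(c['weight'], bins=[80, 90, 100, 150, 200])  # Label each weight with its interval
c = c.copy()  # Work on an explicit copy before adding a column
c['weight band'] = bins
print(a)
print(g)
print(c)

The warning mentioned in the comment above is most likely pandas' SettingWithCopyWarning, which fires when a column is assigned to a DataFrame that may be a view of another one; taking an explicit .copy() of the filtered frame first, as in this sketch, is one common way to avoid it.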
That is all for this article; I hope it helps with your learning.