Preface
Recently I needed to pull some statistics out of CDN logs: traffic, status code counts, top IPs, URLs, UAs, Referers, and so on. I used to do this with bash shell scripts, but when the log volume is large — files of several GB, tens of millions of lines — the shell becomes overwhelmed and processing takes far too long. So I looked into Python's pandas data-processing library. With it, ten million lines of logs are processed in about 40 seconds.
Code
#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
__author__ = 'Loya Chen'

import sys
import pandas as pd
from collections import OrderedDict

"""
Description: This script is used to analyse qiniu cdn log.
================================================================================
Log format
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
Sample logs
[0]            [1][2]  [3]                    [4]     [5]
101.226.66.179 -  68   [16/Nov/2016:04:36:40 +0800] "GET / -"
[6] [7] [8] [9]
200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""

if len(sys.argv) != 2:
    print('Usage:', sys.argv[0], 'file_of_log')
    sys.exit()
else:
    log_file = sys.argv[1]

# Positions of the fields to be counted in the log
ip          = 0
url         = 5
status_code = 6
size        = 7
referer     = 8
ua          = 9

# Read the log into a DataFrame, in chunks to bound memory usage
reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)

loop = True
chunkSize = 10000000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        # Iteration is stopped
        loop = False
df = pd.concat(chunks, ignore_index=True)

byte_sum = df[size].sum()                          # Traffic statistics
status_counts = df[status_code].value_counts()     # Status code statistics
top_status_code = pd.DataFrame({'times': status_counts,
                                'percent': (status_counts / status_counts.sum() * 100).round(2)})
top_ip      = df[ip].value_counts().head(10)       # TOP IP
top_referer = df[referer].value_counts().head(10)  # TOP Referer
top_ua      = df[ua].value_counts().head(10)       # TOP User-Agent
top_url     = df[url].value_counts().head(10)      # TOP URL
top_url_byte = df[[url, size]].groupby(url).sum().apply(lambda x: x.astype(float)/1024/1024) \
    .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)  # URLs with the highest request traffic
top_ip_byte = df[[ip, size]].groupby(ip).sum().apply(lambda x: x.astype(float)/1024/1024) \
    .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)  # IPs with the highest request traffic

# Order the results into a dictionary
result = OrderedDict([
    ("Total traffic [in GB]:",                                byte_sum/1024/1024/1024),
    ("Status code statistics [times|percentage]:",            top_status_code),
    ("IP TOP 10:",                                            top_ip),
    ("Referer TOP 10:",                                       top_referer),
    ("UA TOP 10:",                                            top_ua),
    ("URL TOP 10:",                                           top_url),
    ("Top 10 URLs with the highest request traffic [in MB]:", top_url_byte),
    ("Top 10 IPs with the highest request traffic [in MB]:",  top_ip_byte),
])

# Output the results
for k, v in result.items():
    print(k)
    print(v)
    print('=' * 80)
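Before pointing the script at a multi-gigabyte log, the field positions can be sanity-checked by feeding the sample line from the docstring through the same read_table call. A minimal sketch (the log line is the one shown above; nothing else is assumed):

import io
import pandas as pd

# the sample line from the docstring; quoted fields survive the space split
sample = ('101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] "GET / -" '
          '200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"')

df = pd.read_table(io.StringIO(sample), sep=' ', names=[i for i in range(10)])
print(df[6][0], df[7][0])  # status code and size -> 200 502

Reading with iterator=True and concatenating the chunks keeps peak memory bounded while still ending up with a single DataFrame for the statistics.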
pandas study notes
There are two basic data structures in pandas: Series and DataFrame. A Series is a one-dimensional array-like object consisting of a set of data and an associated index. A DataFrame is a tabular data structure with both row and column indexes.
from pandas import Series, DataFrame
import pandas as pd
Series
In [1]: obj = Series([4, 7, -5, 3])

In [2]: obj
Out[2]:
0    4
1    7
2   -5
3    3
The string representation of a Series shows the index on the left and the values on the right. When no index is specified, an integer index from 0 to N-1 (where N is the length of the data) is created automatically. The array representation and the index object of a Series can be obtained through its values and index attributes, respectively.
In [3]: obj.values
Out[3]: array([ 4,  7, -5,  3])

In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)
Usually, a Series is created with an index that identifies each data point:
In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [6]: obj2
Out[6]:
d    4
b    7
a   -5
c    3
Get a single value or a set of values from a Series by index:
In [7]: obj2['a']
Out[7]: -5

In [8]: obj2[['c', 'd']]
Out[8]:
c    3
d    4
Sorting
In [9]: obj2.sort_index()
Out[9]:
a   -5
b    7
c    3
d    4

In [10]: obj2.sort_values()
Out[10]:
a   -5
c    3
d    4
b    7
Filtering
In [11]: obj2[obj2 > 0]
Out[11]:
d    4
b    7
c    3

In [12]: obj2 * 2
Out[12]:
d     8
b    14
a   -10
c     6
Membership
In [13]: 'b' in obj2
Out[13]: True

In [14]: 'e' in obj2
Out[14]: False
Creating a Series from a Dictionary
In [15]: sdata = {'Shanghai':35000, 'Beijing':40000, 'Nanjing':26000, 'Hangzhou':30000}

In [16]: obj3 = Series(sdata)

In [17]: obj3
Out[17]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000
If only a dictionary is passed in, the index of the resulting Series is the dictionary's keys (in sorted order).
In [18]: states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']

In [19]: obj4 = Series(sdata, index=states)

In [20]: obj4
Out[20]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN
When an index is specified, the three values in sdata that match the entries in states are found and placed in the corresponding positions. Since no value for 'Suzhou' exists in sdata, the result is NaN (not a number), which pandas uses to indicate missing or NA values.
The isnull and notnull functions of pandas can be used to detect missing data.
In [21]: pd.isnull(obj4)
Out[21]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True

In [22]: pd.notnull(obj4)
Out[22]:
Beijing      True
Hangzhou     True
Shanghai     True
Suzhou      False
Series also has similar instance methods:
In [23]: obj4.isnull()
Out[23]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True
An important feature of Series is that data are automatically aligned by index in arithmetic operations:
In [24]: obj3
Out[24]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000

In [25]: obj4
Out[25]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN

In [26]: obj3 + obj4
Out[26]:
Beijing     80000.0
Hangzhou    60000.0
Nanjing         NaN
Shanghai    70000.0
Suzhou          NaN
The index of a Series can be modified in place by assignment:
In [27]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [28]: obj
Out[28]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
DataFrame
Reading a file with pandas
In [29]: df = pd.read_table('pandas_test.txt', sep=' ', names=['name', 'age'])

In [30]: df
Out[30]:
    name  age
0    Bob   26
1   Loya   22
2  Denny   20
3   Mars   25
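Here pandas_test.txt is presumably a space-separated file along these lines (reconstructed from the output above):

Bob 26
Loya 22
Denny 20
Mars 25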
DataFrame column selection
df['name']
In [31]: df['name']
Out[31]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object
DataFrame row selection
df.iloc[0,:]  # The first argument is the row, the second the column; here, all columns of row 0
df.iloc[:,0]  # All rows of column 0
In [32]: df.iloc[0,:]
Out[32]:
name    Bob
age      26
Name: 0, dtype: object

In [33]: df.iloc[:,0]
Out[33]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object
A single element can be fetched with iloc; iat is faster:
In [34]: df.iloc[1,1]
Out[34]: 22

In [35]: df.iat[1,1]
Out[35]: 22
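The speed difference is easy to verify with timeit. A small sketch, rebuilding the same table so it runs stand-alone (timings vary by machine):

import timeit
import pandas as pd

# the same small table as above
df = pd.DataFrame({'name': ['Bob', 'Loya', 'Denny', 'Mars'],
                   'age': [26, 22, 20, 25]})

# iat is a scalar-only accessor, so it skips the general indexing machinery
print(timeit.timeit(lambda: df.iloc[1, 1], number=100000))
print(timeit.timeit(lambda: df.iat[1, 1], number=100000))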
DataFrame block selection
In [36]: df.loc[1:2, ['name', 'age']]
Out[36]:
    name  age
1   Loya   22
2  Denny   20
Filter rows by condition
Rows are filtered by placing a condition in square brackets; the condition must evaluate to True or False for each row.
In [37]: df[(df.index >= 1) & (df.index <= 3)]
Out[37]:
    name  age      city
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing

In [38]: df[df['age'] > 22]
Out[38]:
   name  age     city
0   Bob   26  Beijing
3  Mars   25  Nanjing
Add column
In [39]: df['city'] = ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']

In [40]: df
Out[40]:
    name  age      city
0    Bob   26   Beijing
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing
Sorting
Sort by a specified column:
In [41]: df.sort_values(by='age')
Out[41]:
    name  age      city
2  Denny   20  Hangzhou
1   Loya   22  Shanghai
3   Mars   25   Nanjing
0    Bob   26   Beijing
# Introduce numpy to build a DataFrame
import numpy as np
In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])

In [43]: df
Out[43]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
# Sort by index
In [44]: df.sort_index()
Out[44]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [45]: df.sort_index(axis=1)
Out[45]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

# Descending
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
Viewing data
# View the first 5 rows of the table
df.head(5)

# View the last 5 rows of the table
df.tail(5)

# View the column names
In [47]: df.columns
Out[47]: Index(['name', 'age', 'city'], dtype='object')

# View the current values of the table
In [48]: df.values
Out[48]:
array([['Bob', 26, 'Beijing'],
       ['Loya', 22, 'Shanghai'],
       ['Denny', 20, 'Hangzhou'],
       ['Mars', 25, 'Nanjing']], dtype=object)
Transpose
In [49]: df.T
Out[49]:
            0         1         2        3
name      Bob      Loya     Denny     Mars
age        26        22        20       25
city  Beijing  Shanghai  Hangzhou  Nanjing
Using isin
In [50]: df2 = df.copy()

In [51]: df2[df2['city'].isin(['Shanghai','Nanjing'])]
Out[51]:
    name  age      city
1   Loya   22  Shanghai
3   Mars   25   Nanjing
Arithmetic operations:
In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    ...:                   index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

In [54]: df
Out[54]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
# Sum down each column
In [55]: df.sum()
Out[55]:
one    9.25
two   -5.80

# Sum across each row
In [56]: df.sum(axis=1)
Out[56]:
a    1.40
b    2.60
c     NaN
d   -0.55
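Note that sum() skips NaN by default, which is why row a still gets a total of 1.40. Passing skipna=False propagates missing values instead; a small self-contained sketch on the same frame:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# with skipna=False, only row b (no missing values) keeps a numeric total
print(df.sum(axis=1, skipna=False))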
Grouping
Grouping refers to the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
See the Grouping section of the pandas documentation for more.
In [57]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ....:                           'foo', 'bar', 'foo', 'foo'],
   ....:                    'B' : ['one', 'one', 'two', 'three',
   ....:                           'two', 'two', 'one', 'three'],
   ....:                    'C' : np.random.randn(8),
   ....:                    'D' : np.random.randn(8)})
   ....:

In [58]: df
Out[58]:
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033
Group it and then apply the sum function:
In [59]: df.groupby('A').sum()
Out[59]:
            C        D
A
bar -2.802588  2.42611
foo  3.146492 -0.63958

In [60]: df.groupby(['A','B']).sum()
Out[60]:
                  C         D
A   B
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434
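This split-apply-combine pattern is exactly what the log-analysis script at the top uses for top_url_byte and top_ip_byte. A toy sketch with made-up numbers (columns 5 and 7 stand in for the URL and size fields, mirroring the script's field map):

import pandas as pd

# made-up log fields: column 5 = URL, column 7 = response size in bytes
df = pd.DataFrame({5: ['/a', '/b', '/a', '/c', '/b', '/a'],
                   7: [120, 512, 330, 64, 256, 900]})

# per-URL traffic, largest first: split by URL, sum the sizes, combine
print(df.groupby(5)[7].sum().sort_values(ascending=False))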
Summary
That is all there is to using Python's pandas library for CDN log analysis. I hope this article can be of some help in your study or work. If you have any questions, feel free to leave a comment. Thank you for your support.