This time the goal is to crawl the reviews of a pair of shoes on Jingdong (JD.com): the data is saved to Excel and then visualized to show the corresponding information.
The main Python code is below:
Document 1 (read the Excel data and draw the pie chart):
# Read and analyze the data in excel
import openpyxl
import matplotlib.pyplot as plt  # for plotting the statistics

wk = openpyxl.load_workbook('Sales Data.xlsx')
sheet = wk.active  # get the active sheet
# Get the maximum number of rows and columns
rows = sheet.max_row
cols = sheet.max_column

lst = []  # used to store shoe sizes
for i in range(2, rows + 1):  # row 1 is the header, so start at row 2
    size = sheet.cell(row=i, column=3).value  # column 3 holds the size
    lst.append(size)
# The data has now been read from excel

# Next, count how many of each size were sold.
'''Python has a data structure called a dictionary:
use the shoe size as the key and the number of sales as the value'''
dic_size = {}
for item in lst:
    dic_size[item] = 0
for item in lst:
    for size in dic_size:  # iterate over the dictionary
        if item == size:
            dic_size[size] += 1
            break
for item in dic_size:
    print(item, dic_size[item])

# Turn the counts into percentages (160 is the total number of comments crawled)
lst_total = []
for item in dic_size:
    lst_total.append([item, dic_size[item], dic_size[item] / 160 * 1.0])

# Next visualize the data (draw a pie chart)
labels = [str(item[0]) + ' size' for item in lst_total]  # list comprehension builds the pie-chart labels
fraces = [item[2] for item in lst_total]  # the data series for the pie chart
plt.rcParams['font.sans-serif'] = ['SimHei']  # Chinese-capable font so labels are not garbled
plt.pie(x=fraces, labels=labels, autopct='%1.1f%%')
# plt.show()  # would display the chart in a window instead
plt.savefig('Figure.jpg')  # save the resulting image to disk
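A side note on the counting step: the nested loop above is quadratic in the number of rows. The standard library's collections.Counter produces the same size-to-count dictionary in one pass, and the total can be derived from the data instead of hard-coding 160. A minimal alternative sketch, assuming the same lst of sizes read from the workbook:

from collections import Counter

# One-pass tally: Counter is a dict subclass, so the later
# dic_size[size] lookups keep working unchanged.
dic_size = Counter(lst)

# Derive the total from the data instead of hard-coding 160.
total = sum(dic_size.values())
lst_total = [[size, count, count / total] for size, count in dic_size.items()]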
Document 2 (crawl the JD reviews and save them to Excel):
# Everything involved: requests for fetching, openpyxl for data storage, data
# cleaning and statistics, then matplotlib for data visualization.
# Static data: content you can find in the Elements panel is already in the
# html; the server rendered it and sent it straight to the browser, which
# just interprets and displays it.
# Dynamic data: if you click "next page" (or pages 2 and 3) and the address
# bar does not change (or only a suffix is added), the data is rendered into
# the page later; it is not in the html the server first sent at all.
# For dynamic data, open the browser's Network panel: the url to request and
# the headers to send are taken from the captured request there.
# Install third-party modules from cmd with pip install plus a name, e.g. pip install requests.
import requests
import re
import time
import json
import openpyxl  # for manipulating excel files

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}  # create header information

def get_comments(productId, page):
    # NOTE: the scheme and host of this url were lost in the original text;
    # JD's public comment endpoint is assumed here.
    url = "https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId={0}&score=0&sortType=5&page={1}&pageSize=10&isShadowSku=0&fold=1".format(productId, page)
    resp = requests.get(url, headers=headers)
    s = resp.text.replace('fetchJSON_comment98(', '')  # strip the JSONP wrapper: remove the useless stuff before and after the json
    s = s.replace(');', '')
    json_data = json.loads(s)  # convert the json text to a dictionary
    return json_data

# Get the maximum number of pages
def get_max_page(productId):
    dis_data = get_comments(productId, 0)  # call the function above to request the server and get the dictionary data
    return dis_data['maxPage']  # every response reports the maximum number of pages

# Perform the data extraction
def get_info(productId):
    max_page = get_max_page(productId)
    lst = []  # used to store the extracted product data
    for page in range(1, max_page + 1):
        # Get the product reviews on one page
        comments = get_comments(productId, page)
        comm_list = comments['comments']  # the list of comments (10 comments per page)
        # Iterate through the list of comments and pull out the fields we need
        for item in comm_list:
            # Each comment is itself a dictionary; read the values by key
            content = item['content']
            color = item['productColor']
            size = item['productSize']
            lst.append([content, color, size])  # add each comment to the list
        time.sleep(3)  # delay so JD does not block our ip for visiting too often
    save(lst)

def save(lst):
    # Store the crawled data and save it to excel
    wk = openpyxl.Workbook()  # create a workbook object
    sheet = wk.active  # get the active sheet (a workbook can hold several sheets)
    # Each entry of the list becomes one row of the table
    biaotou = 'Comments', 'Color', 'Size'
    sheet.append(biaotou)
    for item in lst:
        sheet.append(item)
    # Save the workbook to disk
    wk.save('Sales Data.xlsx')

if __name__ == '__main__':
    productId = '66749071789'
    get_info(productId)
    print("ok")
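A side note on the JSONP cleanup: the two replace() calls in get_comments assume the wrapper is exactly fetchJSON_comment98(...);. A regular expression makes the stripping tolerant of a different callback name or trailing whitespace. The helper below, strip_jsonp, is a name introduced here for illustration and is not part of the original script:

import json
import re

def strip_jsonp(text):
    # Match callbackName( ... ) with an optional trailing semicolon
    # and keep only the JSON body between the parentheses.
    match = re.match(r'^\s*\w+\((.*)\)\s*;?\s*$', text, re.DOTALL)
    body = match.group(1) if match else text
    return json.loads(body)

Inside get_comments, the two replace() lines could then be collapsed to json_data = strip_jsonp(resp.text).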
The realized effect is as follows: a pie chart of the share of each shoe size, saved as Figure.jpg, plus the generated Sales Data.xlsx.
That is the whole content of this article.