
Python data crawling and visualization: code analysis

This time we crawl the reviews of a pair of shoes on Jingdong (JD.com): the data is saved to Excel and then visualized to show the corresponding information.

The main Python code is below:

Document 1

#Read and analyze the data in excel
import openpyxl
import matplotlib.pyplot as plt # For plotting the statistics
wk=openpyxl.load_workbook('Sales Data.xlsx')
sheet=wk.active # Get the active sheet
# Get the maximum number of rows and columns
rows=sheet.max_row
cols=sheet.max_column
lst=[] # Used to store shoe sizes
for i in range(2,rows+1):
  size=sheet.cell(i,3).value # Sizes are in the third column; row 1 is the header
  lst.append(size)
# Above has read the data from excel
# Next, count how many times each different size appears in the list
'''There is a data structure in python called a dictionary: use the shoe size as the key and the number of sales as the value'''
dic_size={}
for item in lst:
  dic_size[item]=0

for item in lst:
  for size in dic_size:
    # Iterate over the dictionary
    if item==size:
      dic_size[size]+=1
      break
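# For example, if lst==['38','39','38'], dic_size is now {'38': 2, '39': 1}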
for item in dic_size:
  print(item,dic_size[item])
# Turn each count into a fraction of the total
lst_total=[]
for item in dic_size:
  lst_total.append([item,dic_size[item],dic_size[item]/len(lst)]) # Divide by the total number of rows rather than a hard-coded count

# Next visualize the data (draw a pie chart)
labels=[str(item[0])+'Code' for item in lst_total] # Use a list comprehension to build the pie chart labels
fraces=[item[2] for item in lst_total] # Data source of the pie chart
plt.rcParams['font.sans-serif']=['SimHei'] # Set a font that can render Chinese, to avoid garbled labels
plt.pie(x=fraces,labels=labels,autopct='%1.1f%%')
# plt.show() would display the resulting image on screen
plt.savefig('Figure.jpg')
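For reference, the manual counting loop and the percentage step above can be collapsed into a few lines with collections.Counter from the standard library. A minimal alternative sketch, assuming lst already holds the sizes read from 'Sales Data.xlsx':

from collections import Counter

dic_size=Counter(lst) # Counts every size in one pass, e.g. Counter({'38': 2, '39': 1})
total=sum(dic_size.values()) # Total number of comments, instead of a hard-coded value
lst_total=[[size,count,count/total] for size,count in dic_size.items()]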

Document 2

#This script only needs requests for fetching and openpyxl for data storage, plus some data cleaning and statistics; matplotlib then handles the visualization.
#Static data: if the content you click on in the Elements panel is already in the HTML, the server rendered it and sent it straight to the browser, which simply interprets and displays it.
#Dynamic data: if you click to the next page (page 2, page 3, and so on) and the address bar does not change, the data is dynamic; it is rendered into the HTML afterwards and is not in the HTML source at all.
#For dynamic data, open the browser's Network tab; the URL to request is the one shown there, together with its headers.
#Install third-party modules by opening cmd and typing pip install followed by the package name, e.g. pip install requests.
import requests
import re
import time
import json
import openpyxl # for manipulating excel files
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}#Create header information
def get_comments(productId,page):
  # JD's comment API (JSONP); the domain club.jd.com is assumed, as it was missing here
  url="https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId={0}&score=0&sortType=5&page={1}&pageSize=10&isShadowSku=0&fold=1".format(productId,page)
  resp=requests.get(url,headers=headers)
  s=resp.text.replace('fetchJSON_comment98(','') # Strip the JSONP wrapper: remove the useless text before the JSON...
  s=s.replace(');','') # ...and after it
  json_data=json.loads(s) # Convert the JSON string into a dictionary
  return json_data

#Get the maximum number of pages
def get_max_page(productId):
  dis_data=get_comments(productId,0) # Call the function above to request page 0 from the server and get the dictionary data
  return dis_data['maxPage'] # Every response carries maxPage, the total number of comment pages

# Perform data extraction
def get_info(productId):
  max_page=get_max_page(productId)
  lst=[] # Used to store extracted product data
  for page in range(1,max_page+1):
    # Get the product reviews on one page
    comments=get_comments(productId,page)
    comm_list=comments['comments'] # Get the list of comments (10 comments per page)
    # Iterate through the list of comments and pull out the fields we need
    for item in comm_list:
      # Each comment is itself a dictionary; fetch each value through its key
      content=item['content']
      color=item['productColor']
      size=item['productSize']
      lst.append([content,color,size]) # Add each comment to the list
    time.sleep(3) # Delay so that JD does not block our IP for visiting too often
  save(lst)

def save(lst):
  # Store the crawled data and save it to excel
  wk=openpyxl.Workbook() # Create a new workbook object
  sheet=wk.active # Get the active sheet of the new workbook
  # Traverse the list and append the data to excel: one item in the list becomes one row in the table
  biaotou='Comments','Color','Size' # Header row
  sheet.append(biaotou)
  for item in lst:
    sheet.append(item)
  # Save the excel file to disk
  wk.save('Sales Data.xlsx')


if __name__=='__main__':
  productId='66749071789'
  get_info(productId)
  print("ok")

The realized effect: Document 2 writes the comments to Sales Data.xlsx, and Document 1 turns the size counts into a pie chart saved as Figure.jpg.

This is the whole content of this article.