This case study is an example of crawling the Qidian ("Starting Point") novel site.
Case purpose:
Shows how to crack the font-encryption anti-crawling measure and convert the encrypted data into plaintext, by crawling the novel names and monthly vote counts from the Qidian monthly vote list.
Program function:
Enter the number of pages to crawl, and the program fetches the novel name and monthly vote count for every entry on each page.
Analysis: finding the target url:
(Right-click → Inspect) Locate where the novel name sits in the page:
From the position of that node, work out the xpath for the novel name:
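A minimal sketch of that extraction with lxml (the sample HTML below is a made-up stand-in; the xpath is the one used in the full program):

from lxml import etree

# Made-up fragment mimicking one entry of the list page
str_data = '<div><h4><a target="_blank" href="#">Some Novel</a></h4></div>'
py_data = etree.HTML(str_data)
# The novel names sit in <h4><a target="_blank">...</a></h4> nodes
title_list = py_data.xpath('//h4/a[@target="_blank"]/text()')
print(title_list)  # ['Some Novel']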
(Right-click → Inspect) Locate where the monthly vote count sits:
As the inspection shows, the text of the monthly vote data is a string of encrypted data.
Debugging with the XPath Helper plugin shows that no xpath syntax can reach the encrypted data, so it has to be extracted with a regular expression instead.
The regular expression (the same one used in the full program below) is:
</style><span class=".*?">(.*?)</span></span>
The encrypted data obtained is not plaintext digits but a string of HTML character references of the form &#NNNNNN;.
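A minimal sketch of the regex extraction (the class name and entity values below are invented for illustration):

import re

# Made-up fragment mimicking the page source: the vote count is a run of
# HTML character references inside a styled <span>
str_data = '</style><span class="kzhRvLNA">&#100305;&#100310;&#100303;&#100308;</span></span>'
# The same pattern used in the full program below
mon_list = re.findall('</style><span class=".*?">(.*?)</span></span>', str_data)
print(mon_list)  # ['&#100305;&#100310;&#100303;&#100308;']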
Cracking the encrypted data is the key to this case:
Since the data is encrypted, there must be a font file carrying the encryption rules that correspond to it.
Find the url of that font file in the page's inline @font-face style, send a request, and the response is a woff file containing the encrypted glyphs.
Note: we need the woff file whose name matches the class attribute on the span that wraps the encrypted monthly vote count.
As shown below, download the woff file:
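A minimal sketch of locating and downloading the woff file (the font url below is a placeholder; the regex and the Referer header are the ones used in the full program):

import re
import requests

# Made-up fragment of the inline <style> that declares the @font-face rule
style_text = "format('eot'); src: url('https://example.com/font/kzhRvLNA.woff') format('woff')"
# Pull the woff url out of the style text
font_url = re.findall(r"format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", style_text)[0]

# Download the font; the Referer header helps avoid being refused by the server
headers_ = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.qidian.com/'
}
font_response = requests.get(font_url, headers=headers_)
# Save the raw bytes locally
with open('Encryptedfontfile.woff', 'wb') as f:
    f.write(font_response.content)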
First, the xml dump of the font shows which English number word each hexadecimal code corresponds to.
Next, we use the third-party library fontTools (its TTFont class) to convert the hexadecimal codes in the file to decimal and the English number words to Arabic numerals, as shown below:
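A minimal sketch of that conversion with fontTools, assuming the woff file saved in the previous step (the code-point values in the comments are invented):

from fontTools.ttLib import TTFont

# Load the downloaded woff file (fontTools reads woff directly)
font_obj = TTFont('Encryptedfontfile.woff')
# Optional: dump a human-readable xml copy for inspection
font_obj.saveXML('Encryptedfontfile.xml')

# getBestCmap() returns {code point as a decimal int: glyph name},
# e.g. {100305: 'seven', 100310: 'four', ...}
cmap_list = font_obj.getBestCmap()

# Map the English glyph names to Arabic numerals
dict_e_a = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
            'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9'}
digit_map = {code: dict_e_a[name] for code, name in cmap_list.items() if name in dict_e_a}
print(digit_map)  # e.g. {100305: '7', 100310: '4', ...}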
Each piece of encrypted data then parses to its corresponding monthly vote digits as follows:
Note:
The encrypted data captured by the regex above carries special symbols (the &# and ; around each character reference).
So after parsing out the digits of the monthly vote data, we must remove those symbols and then splice the individual digits together to get the final vote count; see the sketch below.
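A minimal self-contained sketch of that cleanup and splicing step (the mapping and the encrypted string are invented for illustration):

import re

# Hypothetical mapping after conversion: code point -> plaintext digit
digit_map = {100305: '7', 100310: '4', 100303: '0', 100308: '2'}
# One encrypted vote count as captured by the regex
mon = '&#100305;&#100310;&#100303;&#100308;'

# Strip the '&#' and ';' wrappers: keep only the runs of digits (the code points)
codes = re.findall(r'\d+', mon)  # ['100305', '100310', '100303', '100308']
# Replace each code point with its digit, then splice the digits together
votes = ''.join(digit_map[int(c)] for c in codes)
print(votes)  # '7402'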
Finally, find the page-turning pattern by comparing the urls of different pages:
Comparing three urls from different pages reveals that the pattern lies in the page parameter.
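For example (the domain is reconstructed from the site being crawled; only the page value changes):

https://www.qidian.com/rank/yuepiao?page=1
https://www.qidian.com/rank/yuepiao?page=2
https://www.qidian.com/rank/yuepiao?page=3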
With the analysis complete, here is the full code:
import requests
from lxml import etree
import re
from fontTools.ttLib import TTFont
import json

if __name__ == '__main__':
    # Enter the number of pages to crawl
    pages = int(input('Please enter the number of pages to crawl:'))  # e.g. pages = 1 or 2
    for i in range(pages):  # i = 0, or (0, 1)
        page = i + 1        # page = 1, or (1, 2)
        # Confirm the target url
        url_ = f'https://www.qidian.com/rank/yuepiao?page={page}'
        # Construct the request headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
        }
        # Send the request and get the response as html text
        response_ = requests.get(url_, headers=headers)
        str_data = response_.text
        # Convert the html text into an lxml element object
        py_data = etree.HTML(str_data)
        # Extract the novel names
        title_list = py_data.xpath('//h4/a[@target="_blank"]/text()')
        # Extract the monthly vote counts; xpath cannot reach the encrypted text,
        # so use a regular expression over response_.text instead
        mon_list = re.findall('</style><span class=".*?">(.*?)</span></span>', str_data)
        print(mon_list)
        # Get the url of the anti-crawl woff font file (xpath combined with a regex)
        fonturl_str = py_data.xpath('//p/span/style/text()')
        font_url = re.findall(r"format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", str_data)[0]
        print(font_url)
        # With the url in hand, construct the request headers and get the response
        headers_ = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
            'Referer': 'https://www.qidian.com/'
        }
        # Send the request and get the response
        font_response = requests.get(font_url, headers=headers_)
        # The file is binary, so take the raw bytes (content)
        font_data = font_response.content
        # Save it locally
        with open('Encryptedfontfile.woff', 'wb') as f:
            f.write(font_data)
        # Parse the encrypted font file
        font_obj = TTFont('Encryptedfontfile.woff')
        # Dump the font to a plaintext xml file for inspection
        font_obj.saveXML('Encryptedfontfile.xml')
        # Get the font's mapping table; keys are decimal code points,
        # values are English number words
        cmap_list = font_obj.getBestCmap()
        print('Font encryption mapping table:', cmap_list)
        # English-word-to-Arabic-numeral dictionary
        dict_e_a = {'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5',
                    'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'zero': '0'}
        # Convert the English words to digits
        for i in cmap_list:
            for j in dict_e_a:
                if j == cmap_list[i]:
                    cmap_list[i] = dict_e_a[j]
        print('The mapping table converted to Arabic numerals is:', cmap_list)
        # Strip the symbols from the encrypted monthly vote data,
        # keeping only the decimal code points
        new_mon_list = []
        for i in mon_list:
            list_ = re.findall(r'\d+', i)
            new_mon_list.append(list_)
        print('The list of monthly vote data after removing the symbols is:', new_mon_list)
        # Final parsing: replace each code point with its plaintext digit
        for i in new_mon_list:
            for j in enumerate(i):
                for k in cmap_list:
                    if j[1] == str(k):
                        i[j[0]] = cmap_list[k]
        print('The monthly vote data after parsing is:', new_mon_list)
        # Splice the digits of each vote count together
        new_list = []
        for i in new_mon_list:
            j = ''.join(i)
            new_list.append(j)
        print('The parsed plaintext data is:', new_list)
        # Put each name and its monthly vote count into a dictionary,
        # convert it to json and save it
        for i in range(len(title_list)):
            dict_ = {}
            dict_[title_list[i]] = new_list[i]
            # Convert the dictionary to json format
            json_data = json.dumps(dict_, ensure_ascii=False) + ',\n'
            # Save the data locally
            with open('qidian_monthly_vote_list.json', 'a', encoding='utf-8') as f:
                f.write(json_data)
Two pages of data were crawled, each containing 20 entries.
The execution results are as follows:
This concludes this detailed walkthrough of a Python crawler cracking font encryption. For more on cracking font encryption with Python crawlers, please search my previous articles or continue browsing the related articles below, and I hope you will support me in the future!