This case study is an example of crawling the Qidian ("Starting Point") novel site.
Case purpose:
Shows how to crack the font-encryption anti-crawling measure and convert the encrypted data into plaintext, by crawling the novel names and monthly vote counts from the Qidian monthly vote list.
Program function:
Enter the number of pages to crawl, and the program fetches the novel name and monthly vote count for every entry on each page.
Analysis: finding the target url:
(Right-click → Inspect) Locate where the novel name sits in the page:
From the position of that node, work out the xpath for the novel name:
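A minimal sketch of that extraction with lxml (the sample HTML below is a made-up stand-in; the xpath is the one used in the full program):

from lxml import etree

# Made-up fragment mimicking one entry of the list page
str_data = '<div><h4><a target="_blank" href="#">Some Novel</a></h4></div>'
py_data = etree.HTML(str_data)
# The novel names sit in <h4><a target="_blank">...</a></h4> nodes
title_list = py_data.xpath('//h4/a[@target="_blank"]/text()')
print(title_list)  # ['Some Novel']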
(Right-click → Inspect) Locate where the monthly vote count sits:
As the inspection shows, the text of the monthly vote data is a string of encrypted data.
Debugging with the XPath Helper plugin shows that no xpath syntax can reach the encrypted data, so it has to be extracted with a regular expression instead.
The regular expression (the same one used in the full program below) is:
</style><span class=".*?">(.*?)</span></span>
The encrypted data obtained is not plaintext digits but a string of HTML character references of the form &#NNNNNN;.
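A minimal sketch of the regex extraction (the class name and entity values below are invented for illustration):

import re

# Made-up fragment mimicking the page source: the vote count is a run of
# HTML character references inside a styled <span>
str_data = '</style><span class="kzhRvLNA">&#100305;&#100310;&#100303;&#100308;</span></span>'
# The same pattern used in the full program below
mon_list = re.findall('</style><span class=".*?">(.*?)</span></span>', str_data)
print(mon_list)  # ['&#100305;&#100310;&#100303;&#100308;']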
Cracking the encrypted data is the key to this case:
Since the data is encrypted, there must be a font file carrying the encryption rules that correspond to it.
Find the url of that font file in the page's inline @font-face style, send a request, and the response is a woff file containing the encrypted glyphs.
Note: we need the woff file whose name matches the class attribute on the span that wraps the encrypted monthly vote count.
As shown below, download the woff file:
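A minimal sketch of locating and downloading the woff file (the font url below is a placeholder; the regex and the Referer header are the ones used in the full program):

import re
import requests

# Made-up fragment of the inline <style> that declares the @font-face rule
style_text = "format('eot'); src: url('https://example.com/font/kzhRvLNA.woff') format('woff')"
# Pull the woff url out of the style text
font_url = re.findall(r"format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", style_text)[0]

# Download the font; the Referer header helps avoid being refused by the server
headers_ = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.qidian.com/'
}
font_response = requests.get(font_url, headers=headers_)
# Save the raw bytes locally
with open('Encryptedfontfile.woff', 'wb') as f:
    f.write(font_response.content)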
First, the xml dump of the font shows which English number word each hexadecimal code corresponds to.
Next, we use the third-party library fontTools (its TTFont class) to convert the hexadecimal codes in the file to decimal and the English number words to Arabic numerals, as shown below:
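A minimal sketch of that conversion with fontTools, assuming the woff file saved in the previous step (the code-point values in the comments are invented):

from fontTools.ttLib import TTFont

# Load the downloaded woff file (fontTools reads woff directly)
font_obj = TTFont('Encryptedfontfile.woff')
# Optional: dump a human-readable xml copy for inspection
font_obj.saveXML('Encryptedfontfile.xml')

# getBestCmap() returns {code point as a decimal int: glyph name},
# e.g. {100305: 'seven', 100310: 'four', ...}
cmap_list = font_obj.getBestCmap()

# Map the English glyph names to Arabic numerals
dict_e_a = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
            'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9'}
digit_map = {code: dict_e_a[name] for code, name in cmap_list.items() if name in dict_e_a}
print(digit_map)  # e.g. {100305: '7', 100310: '4', ...}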
Each piece of encrypted data then parses to its corresponding monthly vote digits as follows:
Note:
The encrypted data captured by the regex above carries special symbols (the &# and ; around each character reference).
So after parsing out the digits of the monthly vote data, we must remove those symbols and then splice the individual digits together to get the final vote count; see the sketch below.
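A minimal self-contained sketch of that cleanup and splicing step (the mapping and the encrypted string are invented for illustration):

import re

# Hypothetical mapping after conversion: code point -> plaintext digit
digit_map = {100305: '7', 100310: '4', 100303: '0', 100308: '2'}
# One encrypted vote count as captured by the regex
mon = '&#100305;&#100310;&#100303;&#100308;'

# Strip the '&#' and ';' wrappers: keep only the runs of digits (the code points)
codes = re.findall(r'\d+', mon)  # ['100305', '100310', '100303', '100308']
# Replace each code point with its digit, then splice the digits together
votes = ''.join(digit_map[int(c)] for c in codes)
print(votes)  # '7402'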
Finally, find the page-turning pattern by comparing the urls of different pages:
Comparing three urls from different pages reveals that the pattern lies in the page parameter.
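For example (the domain is reconstructed from the site being crawled; only the page value changes):

https://www.qidian.com/rank/yuepiao?page=1
https://www.qidian.com/rank/yuepiao?page=2
https://www.qidian.com/rank/yuepiao?page=3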
With the analysis complete, here is the full code:
import requests
from lxml import etree
import re
from fontTools.ttLib import TTFont
import json

if __name__ == '__main__':
    # Enter the number of pages to crawl
    pages = int(input('Please enter the number of pages to crawl:'))  # e.g. pages = 1 or 2
    for i in range(pages):  # i = 0, or (0, 1)
        page = i + 1        # page = 1, or (1, 2)
        # Confirm the target url
        url_ = f'https://www.qidian.com/rank/yuepiao?page={page}'
        # Construct the request headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
        }
        # Send the request and get the response as html text
        response_ = requests.get(url_, headers=headers)
        str_data = response_.text
        # Convert the html text into an lxml element object
        py_data = etree.HTML(str_data)
        # Extract the novel names
        title_list = py_data.xpath('//h4/a[@target="_blank"]/text()')
        # Extract the monthly vote counts; xpath cannot reach the encrypted text,
        # so use a regular expression over response_.text instead
        mon_list = re.findall('</style><span class=".*?">(.*?)</span></span>', str_data)
        print(mon_list)
        # Get the url of the anti-crawl woff font file (xpath combined with a regex)
        fonturl_str = py_data.xpath('//p/span/style/text()')
        font_url = re.findall(r"format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", str_data)[0]
        print(font_url)
        # With the url in hand, construct the request headers and get the response
        headers_ = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
            'Referer': 'https://www.qidian.com/'
        }
        # Send the request and get the response
        font_response = requests.get(font_url, headers=headers_)
        # The file is binary, so take the raw bytes (content)
        font_data = font_response.content
        # Save it locally
        with open('Encryptedfontfile.woff', 'wb') as f:
            f.write(font_data)
        # Parse the encrypted font file
        font_obj = TTFont('Encryptedfontfile.woff')
        # Dump the font to a plaintext xml file for inspection
        font_obj.saveXML('Encryptedfontfile.xml')
        # Get the font's mapping table; keys are decimal code points,
        # values are English number words
        cmap_list = font_obj.getBestCmap()
        print('Font encryption mapping table:', cmap_list)
        # English-word-to-Arabic-numeral dictionary
        dict_e_a = {'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5',
                    'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'zero': '0'}
        # Convert the English words to digits
        for i in cmap_list:
            for j in dict_e_a:
                if j == cmap_list[i]:
                    cmap_list[i] = dict_e_a[j]
        print('The mapping table converted to Arabic numerals is:', cmap_list)
        # Strip the symbols from the encrypted monthly vote data,
        # keeping only the decimal code points
        new_mon_list = []
        for i in mon_list:
            list_ = re.findall(r'\d+', i)
            new_mon_list.append(list_)
        print('The list of monthly vote data after removing the symbols is:', new_mon_list)
        # Final parsing: replace each code point with its plaintext digit
        for i in new_mon_list:
            for j in enumerate(i):
                for k in cmap_list:
                    if j[1] == str(k):
                        i[j[0]] = cmap_list[k]
        print('The monthly vote data after parsing is:', new_mon_list)
        # Splice the digits of each vote count together
        new_list = []
        for i in new_mon_list:
            j = ''.join(i)
            new_list.append(j)
        print('The parsed plaintext data is:', new_list)
        # Put each name and its monthly vote count into a dictionary,
        # convert it to json and save it
        for i in range(len(title_list)):
            dict_ = {}
            dict_[title_list[i]] = new_list[i]
            # Convert the dictionary to json format
            json_data = json.dumps(dict_, ensure_ascii=False) + ',\n'
            # Save the data locally
            with open('qidian_monthly_vote_list.json', 'a', encoding='utf-8') as f:
                f.write(json_data)
Two pages of data were crawled, each containing 20 entries.
The execution results are as follows:
This concludes this detailed walkthrough of a Python crawler cracking font encryption. For more on cracking font encryption with Python crawlers, please search my previous articles or continue browsing the related articles below, and I hope you will support me in the future!