I. Scraping Target
Target URL: Mephoto video
II. Tools Used
Development environment: Windows 10, Python 3.7
Development tools: PyCharm, Chrome
Libraries: requests, lxml (XPath), base64
III. Key Learning Points
Parsing the data collected by the crawler
Debugging JS code
Reverse-engineering the JS decoding code
Converting the JS code to Python
IV. Project Approach
Go to the home page of the website
Pick the category that interests you
From the home page, extract the hyperlink (href) of each detail page
On the detail page, find the encoded video playback address data
This data is part of the static page source and is decoded by JS code
Find the corresponding parsing code
First find the address the video actually plays from
Find the JS file that decodes the encoded video address
The file is triggered when you click play
From its appearance, you can roughly tell that this is base64-encoded data
Search for keywords in the corresponding JS file
Find the JS decoding function
Usage of some of the JS functions it relies on:
# The replace() method replaces some characters in a string with others
# parseInt() parses data into the corresponding integer
# Decodes base64-encoded strings
# The substring method extracts a specified number of characters from a string, starting at the start index
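As a rough guide for the conversion, the four JS operations noted above map onto Python built-ins roughly as follows. This is a minimal sketch: the helper names (js_replace_first, js_parse_int, js_atob, js_substr) are hypothetical, and it assumes the base64 step in the site's script is the browser's atob(), which the notes above do not name.

```python
import base64


def js_replace_first(text, old, new):
    # JS str.replace(old, new) with a string pattern changes only the first match,
    # while Python's str.replace changes every match unless count=1 is passed.
    return text.replace(old, new, 1)


def js_parse_int(text, radix=10):
    # parseInt("121e", 16) corresponds to int("121e", 16) for well-formed input.
    return int(text, radix)


def js_atob(text):
    # Assumed equivalent of the browser's atob(): decode base64 into bytes.
    return base64.b64decode(text)


def js_substr(text, start, length):
    # substr(start, length) takes a length; substring(start, end) takes an end index.
    return text[start:start + length]


print(js_replace_first("a-b-a", "a", ""))   # -b-a
print(js_parse_int("121e", 16))             # 4638
print(js_atob("aGVsbG8="))                  # b'hello'
print(js_substr("abcdef", 2, 3))            # cde
```

The first-match behaviour of replace is the one that is easiest to get wrong when porting, since the Python default removes every occurrence.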
Convert the JS code to Python code
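import base64


def decode(data):
    def getHex(a):
        # The first 4 characters, reversed, form a hex header; the rest is the payload
        return {
            'str': a[4:],
            'hex': ''.join(list(a[:4])[::-1]),
        }

    def getDec(a):
        # Convert the hex header to a decimal string and split it into two digit pairs
        b = str(int(a, 16))
        return {
            'pre': list(b[:2]),
            'tail': list(b[2:]),
        }

    def substr(a, b):
        # Remove b[1] junk characters starting at index b[0]
        c = a[0: int(b[0])]
        d = a[int(b[0]): int(b[0]) + int(b[1])]
        # count=1 mirrors JS replace() with a string pattern, which replaces only the first match
        return c + a[int(b[0]):].replace(d, "", 1)

    def getPos(a, b):
        # Turn the offset counted from the end into an index counted from the start
        b[0] = len(a) - int(b[0]) - int(b[1])
        return b

    b = getHex(data)
    c = getDec(b['hex'])
    d = substr(b['str'], c['pre'])
    return base64.b64decode(substr(d, getPos(d, c['tail'])))


print(decode("e121Ly9tBrI84RdnZpZGVvMTAubWVpdHVkYXRhLmNvbS82MGJjZDcwNTE3NGZieXBueG5udnRwMTA5N19IMjY0XzFfNWY3YThmM2U0MTEwNy5tc2JVjAu3EDQ="))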
Running this yields the final video playback address
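To make the decoding steps concrete, the sketch below traces the sample string from the code above stage by stage. The function name trace_decode and the keys of the returned dictionary are hypothetical, and it assumes the 4-character header always decodes to a 4-digit decimal number, as it does for this sample.

```python
import base64


def trace_decode(data):
    header, body = data[:4], data[4:]
    # The header is stored reversed; read it as hexadecimal.
    digits = str(int(header[::-1], 16))
    pre_pos, pre_len, tail_off, tail_len = (int(ch) for ch in digits[:4])

    # First junk run: pre_len characters starting at index pre_pos.
    junk1 = body[pre_pos:pre_pos + pre_len]
    body = body[:pre_pos] + body[pre_pos + pre_len:]

    # Second junk run: tail_len characters, ending tail_off characters before the end.
    start = len(body) - tail_off - tail_len
    junk2 = body[start:start + tail_len]
    body = body[:start] + body[start + tail_len:]

    return {
        "digits": digits,
        "junk": (junk1, junk2),
        "url": base64.b64decode(body).decode("utf-8"),
    }


sample = "e121Ly9tBrI84RdnZpZGVvMTAubWVpdHVkYXRhLmNvbS82MGJjZDcwNTE3NGZieXBueG5udnRwMTA5N19IMjY0XzFfNWY3YThmM2U0MTEwNy5tc2JVjAu3EDQ="
for key, value in trace_decode(sample).items():
    print(key, value)
```

For this sample the header "e121" reverses to "121e", which is 4638 in decimal, so 6 characters are removed at index 4 and 8 characters are removed ending 3 characters before the end. What remains is ordinary base64 that decodes to a protocol-relative address beginning with //.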
V. Complete Source Code
import requests
from lxml import etree
import base64


def decode_mp4(data):
    def getHex(a):
        return {
            'str': a[4:],
            'hex': ''.join(list(a[:4])[::-1]),
        }

    def getDec(a):
        b = str(int(a, 16))
        return {
            'pre': list(b[:2]),
            'tail': list(b[2:]),
        }

    def substr(a, b):
        c = a[0: int(b[0])]
        d = a[int(b[0]): int(b[0]) + int(b[1])]
        # count=1 mirrors JS replace() with a string pattern (first match only)
        return c + a[int(b[0]):].replace(d, "", 1)

    def getPos(a, b):
        b[0] = len(a) - int(b[0]) - int(b[1])
        return b

    b = getHex(data)
    c = getDec(b['hex'])
    d = substr(b['str'], c['pre'])
    return base64.b64decode(substr(d, getPos(d, c['tail'])))


# Run the main function
def main():
    url = ''  # listing page URL (left empty in the source)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    }
    response = requests.get(url=url, headers=headers)
    html_data = etree.HTML(response.text)
    href_list = html_data.xpath('//div/a/@href')
    # print(href_list)
    for href in href_list:
        # The site root in front of href was left empty in the source
        res = requests.get('' + href, headers=headers)
        html = etree.HTML(res.text)
        # The attribute predicates inside [@] were lost in the source;
        # fill them in from the detail page's HTML before running
        name = html.xpath('//div[@]/img/@alt')[0]
        mp4_data = html.xpath('//div[@]/@data-video')[0]
        # print(name, mp4_data)
        mp4_url = decode_mp4(mp4_data).decode('utf-8')
        print(mp4_url)
        result = requests.get("http:" + mp4_url)
        with open(name + ".mp4", 'wb') as f:
            f.write(result.content)


if __name__ == '__main__':
    main()
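Two details of the script above are easy to trip over: the decoded address is protocol-relative (it starts with //), and the title taken from the alt attribute is used directly as a file name, which fails on Windows if it contains characters such as ? or :. The helpers below are a small sketch of one way to handle both; absolute_url and safe_filename are hypothetical names, not part of the original script.

```python
import re


def absolute_url(url, scheme="http"):
    # Prefix a scheme only when the address is protocol-relative.
    if url.startswith("//"):
        return f"{scheme}:{url}"
    return url


def safe_filename(title, ext=".mp4", fallback="video"):
    # Replace characters that Windows does not allow in file names,
    # then trim leftover spaces, dots and underscores at both ends.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title).strip(" ._")
    return (cleaned or fallback) + ext


print(absolute_url("//mvvideo10.meitudata.com/a.mp4"))
print(safe_filename("My: video?"))
```

In main(), these would replace the "http:" + mp4_url concatenation and the name + ".mp4" expression respectively.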