Using Python to crawl short videos from a video site

I. Crawl target

Target URL: Mephoto video

II. Tools used

Development environment: Windows 10, Python 3.7
Development tools: PyCharm, Chrome
Libraries: requests, lxml (XPath), base64

III. Key learning points

How to parse the data a crawler collects
JavaScript debugging techniques
Reverse engineering the JavaScript decoding routine
Porting the JavaScript code to Python

IV. Project approach

Open the site's home page.
Pick a category that interests you.
From the home page, collect the hyperlinks that lead to each video's detail page.

On the detail page, find the encoded video playback address.

This data is part of the static page source and is decoded by js code in the browser.
Find the code that does the decoding:
First find where the page gets the address the video plays from.
Then find the js file that decodes the video address.
That file is triggered when you click play.
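
Because the encoded string is already present in the static markup, it can be read without running any JavaScript. The following is a minimal sketch of that idea. It uses the standard library's html.parser instead of lxml so that it runs without extra packages, and the sample markup and the "player" class name are invented for illustration. Only the data-video attribute name comes from the source code in section V.

```python
from html.parser import HTMLParser


class DataVideoParser(HTMLParser):
    """Collect the value of every data-video attribute in a page."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name == "data-video" and value:
                self.values.append(value)


# Hypothetical markup standing in for a real detail page.
SAMPLE_HTML = '<div class="player" data-video="e121Ly9t..."><img alt="demo"></div>'

parser = DataVideoParser()
parser.feed(SAMPLE_HTML)
print(parser.values)
```

With lxml the same lookup is a single XPath query on the attribute, which is what the full script does.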

The data looks like base64, with some extra obfuscation on top (base64 is an encoding, not encryption).
Search the corresponding js file for keywords.
Find the js routine that decodes the data.
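
One quick way to back up the "looks like base64" guess is to check that the string uses only the base64 alphabet. The sketch below does that. The helper name looks_like_base64 is made up for this example, and passing the check only shows that a string could be base64, not that it decodes to anything meaningful.

```python
import re

# Base64 text uses only A-Z, a-z, 0-9, '+' and '/', with up to two '=' at the end.
BASE64_PATTERN = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_like_base64(text):
    """Return True if text contains only base64 alphabet characters."""
    return BASE64_PATTERN.fullmatch(text) is not None


print(looks_like_base64("Ly9tdnZpZGVv"))   # base64 alphabet only
print(looks_like_base64("//not base64!"))  # contains a space and '!'
```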

Usage of the js functions involved

# The replace() method replaces some characters in a string with others
# parseInt() converts data to the corresponding integer
# Decode the base64-encoded string
# The substr() method extracts a given number of characters from a string, starting at the start index
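
Each of these js calls has a close Python counterpart, which makes the port mostly mechanical. The sketch below lines them up on toy values. It assumes the page's script decodes base64 with the browser's atob(), which the original does not name, and the variable names are illustrative only.

```python
import base64

text = "abcabc"

# JS: text.replace("b", "") replaces only the first match.
# Python's str.replace() replaces every match unless a count is given.
first_only = text.replace("b", "", 1)

# JS: parseInt("121e", 16)
number = int("121e", 16)

# JS: atob("aGVsbG8=") (Python returns bytes, so decode to get a str)
decoded = base64.b64decode("aGVsbG8=").decode("utf-8")

# JS: text.substr(1, 3) takes 3 characters starting at index 1.
piece = text[1:1 + 3]

print(first_only, number, decoded, piece)
```

The replace() difference matters most here: without the count argument, Python would also strip later repeats of the removed characters from the payload.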

Port the js code to Python

import base64

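def decode(data):
    """Strip the inserted junk characters and base64-decode what remains."""

    def getHex(a):
        # The first four characters, reversed, form a hex number;
        # the rest is the base64 payload with junk characters inserted.
        return {
            'str': a[4:],
            'hex': ''.join(list(a[:4])[::-1]),
        }

    def getDec(a):
        # Read the hex number as decimal digits: 'pre' is [index, length] of
        # the first junk run, 'tail' is [offset from the end, length] of the second.
        b = str(int(a, 16))
        return {
            'pre': list(b[:2]),
            'tail': list(b[2:]),
        }

    def substr(a, b):
        # Remove b[1] characters starting at index b[0].
        c = a[0: int(b[0])]
        d = a[int(b[0]): int(b[0]) + int(b[1])]
        # count=1 mirrors js replace(), which only replaces the first match.
        return c + a[int(b[0]):].replace(d, "", 1)

    def getPos(a, b):
        # Turn the offset from the end into an index from the start.
        b[0] = len(a) - int(b[0]) - int(b[1])
        return b

    b = getHex(data)
    c = getDec(b['hex'])
    d = substr(b['str'], c['pre'])
    return base64.b64decode(substr(d, getPos(d, c['tail'])))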

print(decode("e121Ly9tBrI84RdnZpZGVvMTAubWVpdHVkYXRhLmNvbS82MGJjZDcwNTE3NGZieXBueG5udnRwMTA5N19IMjY0XzFfNWY3YThmM2U0MTEwNy5tc2JVjAu3EDQ="))

This yields the final video playback address.
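
To make the routine easier to follow, the sketch below traces the same steps on the sample string from the example call above, one intermediate value at a time. It uses slicing in place of replace() and assumes the decimal value has exactly four digits, as it does for this sample. The variable names are illustrative and are not taken from the site's script.

```python
import base64

# Sample value copied from the example call above.
DATA = "e121Ly9tBrI84RdnZpZGVvMTAubWVpdHVkYXRhLmNvbS82MGJjZDcwNTE3NGZieXBueG5udnRwMTA5N19IMjY0XzFfNWY3YThmM2U0MTEwNy5tc2JVjAu3EDQ="

# Step 1: reverse the first four characters and read them as hex.
hex_part = DATA[:4][::-1]
digits = str(int(hex_part, 16))

# Step 2: split the decimal digits into two (position, length) pairs.
pre_pos, pre_len = int(digits[0]), int(digits[1])
tail_off, tail_len = int(digits[2]), int(digits[3])

# Step 3: drop the first junk run from the payload.
payload = DATA[4:]
payload = payload[:pre_pos] + payload[pre_pos + pre_len:]

# Step 4: drop the second junk run, which is located relative to the end.
tail_pos = len(payload) - tail_off - tail_len
payload = payload[:tail_pos] + payload[tail_pos + tail_len:]

# Step 5: what remains is ordinary base64.
url = base64.b64decode(payload).decode("utf-8")
print(hex_part, digits, url)
```

For this sample the hex part is 121e, which is 4638 in decimal: six junk characters start at index 4, and eight more end three characters before the end of the string.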

V. Complete source code

import requests
from lxml import etree
import base64

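def decode_mp4(data):
    """Same routine as decode() in section IV: strip the junk, then base64-decode."""

    def getHex(a):
        # First four characters reversed = hex number; the rest is the payload.
        return {
            'str': a[4:],
            'hex': ''.join(list(a[:4])[::-1]),
        }

    def getDec(a):
        # Decimal digits of the hex number describe the two junk runs.
        b = str(int(a, 16))
        return {
            'pre': list(b[:2]),
            'tail': list(b[2:]),
        }

    def substr(a, b):
        # Remove b[1] characters starting at index b[0].
        c = a[0: int(b[0])]
        d = a[int(b[0]): int(b[0]) + int(b[1])]
        # count=1 mirrors js replace(), which only replaces the first match.
        return c + a[int(b[0]):].replace(d, "", 1)

    def getPos(a, b):
        # Turn the offset from the end into an index from the start.
        b[0] = len(a) - int(b[0]) - int(b[1])
        return b

    b = getHex(data)
    c = getDec(b['hex'])
    d = substr(b['str'], c['pre'])
    return base64.b64decode(substr(d, getPos(d, c['tail'])))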
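

# Main function: crawl the listing page, then download each video
def main():
    url = ''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    }
    # Fetch the listing page and collect the links to the detail pages.
    response = requests.get(url=url, headers=headers)
    html_data = etree.HTML(response.text)
    href_list = html_data.xpath('//div/a/@href')
    # print(href_list)
    for href in href_list:
        res = requests.get('' + href, headers=headers)
        html = etree.HTML(res.text)
        # NOTE: the attribute tests inside [@] are missing here; fill them in
        # to match the detail page's markup before running.
        name = html.xpath('//div[@]/img/@alt')[0]
        mp4_data = html.xpath('//div[@]/@data-video')[0]
        # print(name, mp4_data)
        mp4_url = decode_mp4(mp4_data).decode('utf-8')
        print(mp4_url)
        # The decoded address is protocol-relative, so prepend the scheme.
        result = requests.get("http:" + mp4_url, headers=headers)
        # The with block closes the file automatically.
        with open(name + ".mp4", 'wb') as f:
            f.write(result.content)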


if __name__ == '__main__':
    main()
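
Two details of the script are worth hardening. The decoded address starts with "//", and the file name comes straight from the image's alt text, which may contain characters that Windows does not allow in file names. The sketch below shows one possible way to handle both. The helper names absolute_url and safe_filename and the example.com addresses are hypothetical and are not part of the original script.

```python
import re
from urllib.parse import urljoin


def absolute_url(page_url, video_url):
    """Resolve a protocol-relative URL such as //host/file.mp4 against the page URL."""
    return urljoin(page_url, video_url)


def safe_filename(title, extension=".mp4"):
    """Replace characters that Windows forbids in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    # Fall back to a generic name if nothing usable is left.
    return (cleaned or "video") + extension


print(absolute_url("http://example.com/detail/1", "//cdn.example.com/a.mp4"))
print(safe_filename('what? a "clip"'))
```

Using urljoin means the download keeps whatever scheme the detail page was fetched with, so "http:" does not need to be hard-coded.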
