SoFunction
Updated on 2024-11-07

Explain how to write a crawler for listening to novels in Python

On the road found so many people like to use headphones to listen to novels, colleagues can actually listen to novels all day with one headset. I was very shocked. Today, I will use Python to download and listen to novels.The audio.

Book title and chapter list

Randomly clicking on a book, this page can use BeautifulSoup to get the book title and a list of all the individual chapter audios. Copy the browser address, e.g. /yousheng/disp_31086.htm.

from bs4 import BeautifulSoup
import requests
import re
import random
import os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

def get_detail_urls(url):
    url_list = []
    response = (url, headers=headers)
     = 'gbk'
    soup = BeautifulSoup(, 'lxml')
    name = ('.red12')[0].
    if not (name):
        (name)
    div_list = (' a')
    for item in div_list:
        url_list.append({'name': , 'url': '/yousheng/{}'.format(item['href'])})
    return name, url_list

audio address

Opening the links to the individual chapters, and using the chapter name as a search term in the Elements panel, I found a script at the bottom, and this part is the address of the sound source.

As you can see in the Network panel, the url domain of the sound source is different from the domain of the chapter list. You need to be aware of this when getting the download link.

def get_mp3_path(url):
    response = (url, headers=headers)
     = 'gbk'
    soup = BeautifulSoup(, 'lxml')
    script_text = ('script')[-1].string
    fileUrl_search = ('fileUrl= "(.*?)";', script_text, )
    if fileUrl_search:
        return '' + fileUrl_search.group(1)

downloading

Surprise, surprise, surprise. Putting this /xxxx.mp3 into a browser and running it gives me a 404.

The key parameter is definitely missing. Going back to the above Network and looking closely at the url of the mp3, I see that there is a key key after the url. As you can see below, this key is the return value from /play/h5_jsonp.asp?0.5078556568562795, so you can use a regular expression to fetch the key.

def get_key(url):
    url = '/play/h5_jsonp.asp?{}'.format(str(()))
    headers['referer'] = url
    response = (url, headers=headers)
    matched = ('(key=.*?)";', , )
    if matched:
        temp = (1)
        return temp[len(temp)-42:]

The last of the last of the last of the last of the last of the last of the last of the last of the last in__main__ in which the above code is strung together.

if __name__ == "__main__":
    url = input("Please enter the address of the browser book page:")
    dir,url_list = get_detail_urls()

    for item in url_list:
        audio_url = get_mp3_path(item['url'])
        key = get_key(item['url'])
        audio_url = audio_url + '?key=' + key
        headers['referer'] = item['url']
        r = (audio_url, headers=headers,stream=True)
        with open((dir, item['name']),'ab') as f:
            ()
            ()

Full Code

from bs4 import BeautifulSoup
import requests
import re
import random
import os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

def get_detail_urls(url):
    url_list = []
    response = (url, headers=headers)
     = 'gbk'
    soup = BeautifulSoup(, 'lxml')
    name = ('.red12')[0].
    if not (name):
        (name)
    div_list = (' a')
    for item in div_list:
        url_list.append({'name': , 'url': '/yousheng/{}'.format(item['href'])})
    return name, url_list
    
def get_mp3_path(url):
    response = (url, headers=headers)
     = 'gbk'
    soup = BeautifulSoup(, 'lxml')
    script_text = ('script')[-1].string
    fileUrl_search = ('fileUrl= "(.*?)";', script_text, )
    if fileUrl_search:
        return '' + fileUrl_search.group(1)
        
def get_key(url):
    url = '/play/h5_jsonp.asp?{}'.format(str(()))
    headers['referer'] = url
    response = (url, headers=headers)
    matched = ('(key=.*?)";', , )
    if matched:
        temp = (1)
        return temp[len(temp)-42:]

if __name__ == "__main__":
    url = input("Please enter the address of the browser book page:")
    dir,url_list = get_detail_urls()

    for item in url_list:
        audio_url = get_mp3_path(item['url'])
        key = get_key(item['url'])
        audio_url = audio_url + '?key=' + key
        headers['referer'] = item['url']
        r = (audio_url, headers=headers,stream=True)
        with open((dir, item['name']),'ab') as f:
            ()
            ()

summarize

This Python crawler is relatively simple, I'm not enough to use the monthly flow of 30 yuan, with this small program on the subway can not flow to listen to novels.

Above is a detailed explanation of how to use Python to write a crawler to listen to novels in detail, more information about Python crawler to listen to novels please pay attention to my other related articles!