On my commute I noticed how many people listen to novels through headphones. One colleague even listens to novels all day with a single earbud in, which surprised me. Today I will use Python to download the audio of these novels so they can be listened to offline.
Book title and chapter list
Pick any book on the site. Its page can be parsed with BeautifulSoup to get the book title and the list of links to the individual chapter audios. Copy the address from the browser, e.g. /yousheng/disp_31086.htm.
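from bs4 import BeautifulSoup
import requests
import re
import random
import os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}


def get_detail_urls(url):
    """Return the book title and a list of {'name', 'url'} dicts, one per chapter."""
    url_list = []
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'  # decode the page as GBK
    soup = BeautifulSoup(response.text, 'lxml')
    name = soup.select('.red12')[0].text
    # Create a folder named after the book to hold the audio files
    if not os.path.exists(name):
        os.makedirs(name)
    # NOTE: the selector is incomplete in this listing (only ' a' survives)
    div_list = soup.select(' a')
    for item in div_list:
        # The site domain is omitted here, as everywhere in this article
        url_list.append({'name': item.text, 'url': '/yousheng/{}'.format(item['href'])})
    return name, url_list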
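The chapter links are relative, so they have to be combined with the site address before they can be requested. A minimal sketch of that step using `urljoin` from the standard library; the domain `www.example.com` and the chapter file name `play_31086_1.htm` are placeholders for illustration, not the real site:

```python
from urllib.parse import urljoin

# Placeholder book page address; the real domain is not shown in this article.
BOOK_PAGE = 'https://www.example.com/yousheng/disp_31086.htm'


def absolute_chapter_url(book_page, href):
    """Resolve a chapter link found on the book page into a full URL."""
    return urljoin(book_page, href)


# A relative href is resolved against the directory of the book page.
print(absolute_chapter_url(BOOK_PAGE, 'play_31086_1.htm'))
# A root-relative href replaces the whole path but keeps the domain.
print(absolute_chapter_url(BOOK_PAGE, '/yousheng/play_31086_1.htm'))
```

Both calls give the same address here, which is why `urljoin` can be a little more forgiving than string formatting when the site mixes the two link styles.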
Audio address
Opening the page of an individual chapter and searching the Elements panel for the chapter name, I found a script at the bottom of the page. Part of that script is the address of the audio file.
As the Network panel shows, the domain of the audio url is different from the domain of the chapter list. Keep this in mind when building the download link.
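def get_mp3_path(url):
    """Return the audio address found in the last script tag of a chapter page."""
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'
    soup = BeautifulSoup(response.text, 'lxml')
    # The audio address sits in the last <script> on the page
    script_text = soup.select('script')[-1].string
    fileUrl_search = re.search('fileUrl= "(.*?)";', script_text, re.S)
    if fileUrl_search:
        # The audio domain is omitted here; it differs from the chapter-list domain
        return '' + fileUrl_search.group(1)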
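To see the regular expression in isolation, here is a small sketch that runs it against a made-up script body. The sample text and the helper name `extract_file_url` are assumptions for illustration; the real page may format its script differently.

```python
import re

# Made-up script content, shaped like what the pattern expects.
SAMPLE_SCRIPT = '''
var title = "Chapter 1";
var fileUrl= "/xxxx.mp3";
var next = "";
'''


def extract_file_url(script_text):
    """Return the path assigned to fileUrl, or None if it is not found."""
    if not script_text:  # a tag's .string can be None
        return None
    match = re.search(r'fileUrl= "(.*?)";', script_text, re.S)
    return match.group(1) if match else None


print(extract_file_url(SAMPLE_SCRIPT))
```

Returning `None` for an empty or unmatched script lets the caller skip a chapter instead of crashing on it.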
Downloading
Surprise: opening this /xxxx.mp3 address directly in a browser returns a 404.
A required parameter must be missing. Going back to the Network panel and looking closely at the url of the mp3, I saw that a key parameter is appended to it. This key comes from the response of /play/h5_jsonp.asp?0.5078556568562795 (the number after the question mark is just a random value), so a regular expression can be used to extract it.
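def get_key(url):
    """Fetch the key that must be appended to the audio address."""
    # The random number mimics the query string seen in the browser.
    # Note: this reassigns url, so the referer below is the key endpoint itself.
    url = '/play/h5_jsonp.asp?{}'.format(str(random.random()))
    headers['referer'] = url
    response = requests.get(url, headers=headers)
    matched = re.search('(key=.*?)";', response.text, re.S)
    if matched:
        temp = matched.group(1)
        return temp[-42:]  # the key is the last 42 characters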
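The key extraction can be tried offline as well. The response text below is invented, since the real body of the request is not reproduced in this article. It only assumes that the body contains a `key=` value followed by `";` and that the key is the last 42 characters of the match, as in `get_key`. The helper name `extract_key` is hypothetical.

```python
import re

# Invented response body: 'k' * 42 stands in for a real 42-character key.
FAKE_KEY = 'k' * 42
SAMPLE_RESPONSE = 'var u = "/xxxx.mp3?key=' + FAKE_KEY + '";'


def extract_key(text, key_length=42):
    """Return the last key_length characters of the key=... match, or None."""
    match = re.search(r'(key=.*?)";', text, re.S)
    if not match:
        return None
    return match.group(1)[-key_length:]


print(extract_key(SAMPLE_RESPONSE) == FAKE_KEY)
```

Slicing with `[-key_length:]` is equivalent to the `len(temp)-42` arithmetic and drops the leading `key=` as long as the key really is 42 characters long.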
Finally, __main__
This is where the functions above are strung together.
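if __name__ == "__main__":
    url = input("Please enter the address of the browser book page:")
    book_dir, url_list = get_detail_urls(url)
    for item in url_list:
        audio_url = get_mp3_path(item['url'])
        key = get_key(item['url'])
        if not audio_url or not key:
            continue  # skip chapters whose address or key could not be found
        audio_url = audio_url + '?key=' + key
        headers['referer'] = item['url']
        r = requests.get(audio_url, headers=headers, stream=True)
        # Save the chapter into the folder named after the book
        with open(os.path.join(book_dir, item['name']), 'ab') as f:
            f.write(r.content)
            f.flush()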
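Writing `r.content` loads the whole file into memory before saving it. For long chapters, one option is to write the response in chunks; with `stream=True`, `requests` exposes them through `r.iter_content(chunk_size=...)`. The helper below is a sketch that accepts any iterable of byte chunks, so it can be tried without a network connection. `save_chunks` and `safe_filename` are hypothetical names, not part of the original script.

```python
import os
import re
import tempfile


def safe_filename(name):
    """Replace characters that are not allowed in file names on common systems."""
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()


def save_chunks(chunks, directory, name):
    """Write an iterable of byte chunks to directory/name and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, safe_filename(name))
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
    return path


# Offline demonstration with fake data instead of a real response.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chunks([b'ID3', b'', b'audio-bytes'], tmp, 'Chapter 1?.mp3')
    print(os.path.basename(saved), os.path.getsize(saved))
```

In the download loop, the chunks would come from something like `r.iter_content(chunk_size=8192)`. The sketch opens the file with `'wb'` rather than `'ab'`, so running the script twice overwrites a chapter instead of appending a second copy to it.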
Full Code
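from bs4 import BeautifulSoup
import requests
import re
import random
import os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}


def get_detail_urls(url):
    """Return the book title and a list of {'name', 'url'} dicts, one per chapter."""
    url_list = []
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'  # decode the page as GBK
    soup = BeautifulSoup(response.text, 'lxml')
    name = soup.select('.red12')[0].text
    # Create a folder named after the book to hold the audio files
    if not os.path.exists(name):
        os.makedirs(name)
    # NOTE: the selector is incomplete in this listing (only ' a' survives)
    div_list = soup.select(' a')
    for item in div_list:
        # The site domain is omitted here, as everywhere in this article
        url_list.append({'name': item.text, 'url': '/yousheng/{}'.format(item['href'])})
    return name, url_list


def get_mp3_path(url):
    """Return the audio address found in the last script tag of a chapter page."""
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'
    soup = BeautifulSoup(response.text, 'lxml')
    # The audio address sits in the last <script> on the page
    script_text = soup.select('script')[-1].string
    fileUrl_search = re.search('fileUrl= "(.*?)";', script_text, re.S)
    if fileUrl_search:
        # The audio domain is omitted here; it differs from the chapter-list domain
        return '' + fileUrl_search.group(1)


def get_key(url):
    """Fetch the key that must be appended to the audio address."""
    # The random number mimics the query string seen in the browser.
    # Note: this reassigns url, so the referer below is the key endpoint itself.
    url = '/play/h5_jsonp.asp?{}'.format(str(random.random()))
    headers['referer'] = url
    response = requests.get(url, headers=headers)
    matched = re.search('(key=.*?)";', response.text, re.S)
    if matched:
        temp = matched.group(1)
        return temp[-42:]  # the key is the last 42 characters


if __name__ == "__main__":
    url = input("Please enter the address of the browser book page:")
    book_dir, url_list = get_detail_urls(url)
    for item in url_list:
        audio_url = get_mp3_path(item['url'])
        key = get_key(item['url'])
        if not audio_url or not key:
            continue  # skip chapters whose address or key could not be found
        audio_url = audio_url + '?key=' + key
        headers['referer'] = item['url']
        r = requests.get(audio_url, headers=headers, stream=True)
        # Save the chapter into the folder named after the book
        with open(os.path.join(book_dir, item['name']), 'ab') as f:
            f.write(r.content)
            f.flush()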
Summary
This Python crawler is relatively simple. My 30-yuan monthly data plan is never enough, and with this small program I can listen to novels on the subway without using any mobile data.
That covers how to write a Python crawler for downloading novels to listen to. For more on Python crawlers, see my other related articles.