Basic development environment
· Python 3.6
· Pycharm
Libraries to be imported
Target Page Analysis
The site is a static site, not encrypted and can be crawled directly
Overall Thoughts:
1, first in the list page to get each wallpaper details page address
2, in the wallpaper details page to get the wallpaper real HD url address
3. Save address
code implementation
Simulate the browser to request a web page and get the web page data
Here only the first 10 pages of data are selected to be crawled
The code is as follows
import threading import parsel import requests def get_html(html_url): ''' Get the source code of a web page :param html_url: page url :return. ''' response = (url=html_url, headers=headers) return response def get_par(html_data): ''' Convert to selector object, parse and extract data :param html_data. :return: selector object ''' selector = (html_data) return selector def download(img_url, title): ''' Save data :param img_url: image address :param title: title of the image :return. ''' content = get_html(img_url).content path = 'Wallpaper\\\' + title + '.jpg' with open(path, mode='wb') as f: (content) print('Saving', title) def main(url): ''' Main function :param url: list page url :return. ''' html_data = get_html(url).text selector = get_par(html_data) lis = ('.wb_listbox div dl dd a::attr(href)').getall() for li in lis: img_data = get_html(li).text img_selector = get_par(img_data) img_url = img_selector.css('.wb_showpic_main img::attr(src)').get() title = img_selector.css('.wb_pictitle::text').get().strip() download(img_url, title) end_time = () - s_time print(end_time) if __name__ == '__main__': for page in range(1, 11): url = '/min/list-{}.html'.format(page) main_thread = (target=main, args=(url,)) main_thread.start()
The above is python multi-threaded crawl wallpaper website example of the details, more information about python crawl wallpaper website please pay attention to my other related articles!