Brief outline
This may turn into a longer series. It mainly summarizes the content of the first seven chapters of "Python3 Web Crawler Development Practice" and gets familiar with the related frameworks and technologies.
Mission objective
Crawl a movie data site, assumed here to be the book's demo site https://ssr1.scrape.center/ (the URL was lost from the source text). The site has no anti-crawling measures and its data is rendered server-side; what we want to crawl are the movie detail records reachable from the list pages.
Analysis of mission objectives
- Crawl /page/{page}, the list pages of the site, and extract the desired detail-page URLs from their content
- Crawl /detail/{id}, the detail pages, and capture the following fields:
  - movie title
  - movie cover image URL
  - movie release date
  - movie categories
  - movie rating
  - movie synopsis
- Store the crawled content in the storage of choice
Technology Selection and Crawling
How to crawl
The Playwright library is open-sourced by Microsoft and is remarkably powerful. Besides synchronous operation it also supports asynchronous operation, and it would not be an exaggeration to call it one of the most capable libraries of its kind: it supports element selection with both XPath and CSS selectors, and it can even record your mouse clicks and generate automation code from them. Overall, its functionality is very comprehensive.
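As a taste of the asynchronous side mentioned above, here is a minimal sketch using Playwright's async API (the target URL is the demo site assumed throughout this article):

```python
import asyncio
from playwright.async_api import async_playwright

# Minimal sketch of Playwright's asynchronous API, for contrast with the
# synchronous API used in the rest of this article.
async def fetch_title():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://ssr1.scrape.center/')
        print(await page.title())
        await browser.close()

asyncio.run(fetch_title())
```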
Building basic crawler functions
```python
# Crawl web content
def scrape_page(page, url):
    logging.info('scraping %s ...', url)
    try:
        page.goto(url)
        # wait until network requests have gone quiet before returning
        page.wait_for_load_state('networkidle')
    except Exception:
        logging.error('error occurred while scraping %s', url, exc_info=True)
```
For fetching web content we only need to operate on the page object: passing in the page and a URL performs the request we want. Within this request we wait for network activity to go idle, which serves as our signal that the page has fully loaded.
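If waiting for 'networkidle' ever proves unreliable on a given site, waiting for a concrete element is a common alternative. A minimal sketch follows; the default '.item' selector is a placeholder of my own, not from the original article:

```python
import logging

# Sketch: wait for a specific element instead of network idle.
# The '.item' default is a placeholder selector, not from the original.
def scrape_page_by_selector(page, url, selector='.item'):
    logging.info('scraping %s ...', url)
    try:
        page.goto(url)
        # blocks until an element matching the selector becomes visible
        page.wait_for_selector(selector)
    except Exception:
        logging.error('error occurred while scraping %s', url, exc_info=True)
```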
Building the list-page crawl function
For this part we only need to work out the page-number pattern of the URLs to crawl the list content. After a quick analysis we find that the list URLs follow /page/{page}, where only the {page} part changes, so the crawl function is constructed as follows:
```python
def scrape_index(page, page_index):
    index_url = f'{BASE_URL}/page/{page_index}'
    return scrape_page(page, index_url)
```
Building the detail-page crawl function
Detail-page crawling builds on the URLs obtained by parsing the list pages, so the detail crawl function only needs to know a URL and can call the basic crawl function directly; the URLs themselves come from the list-page parsing described below. The overall construction is:
```python
def scrape_detail(page, url):
    return scrape_page(page, url)
```
How to parse
Parsing the list page to get the detail-page URLs
Playwright's quick and easy selectors let us grab the tags and attributes we need in a single line of code, which makes it very convenient to collect the detail-page URLs:
```python
# Parse the list page
def parse_index(page):
    # grab every detail link on the list page
    # ('a.name' is assumed here; the selector was lost from the source text)
    elements = page.query_selector_all('a.name')
    # read each link's href and join it with the base URL
    for element in elements:
        part_of_url = element.get_attribute('href')
        detail_url = urljoin(BASE_URL, part_of_url)
        logging.info('get url: %s', detail_url)
        yield detail_url
```
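One design note: parse_index is a generator, and main() below drains it with list(...) before visiting any detail page. This matters because a single page object is reused for every navigation, so the elements found by query_selector_all become stale as soon as the next page.goto replaces the document.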
Parsing the detail page to get the required data
Once a detail page has been loaded, we parse the information within it to obtain the movie name, categories, cover image URL, synopsis, and rating:
```python
# NOTE: several selectors below were garbled in the source text; 'h2.m-b-sm',
# 'img.cover', '.categories > button > span', 'p.score' and '.drama > p' are
# reconstructions based on the demo site's markup.
def parse_detail(page):
    # title
    name = None
    name_tag = page.query_selector('h2.m-b-sm')
    if name_tag:
        name = name_tag.text_content()
    # cover image
    cover = None
    cover_tag = page.query_selector('img.cover')
    if cover_tag:
        cover = cover_tag.get_attribute('src')
    # categories
    categories = []
    category_tags = page.query_selector_all('.categories > button > span')
    if category_tags:
        categories = [category.text_content() for category in category_tags]
    # rating
    score = None
    score_tag = page.query_selector('p.score')
    if score_tag:
        score = score_tag.text_content().strip()
    # synopsis
    drama = None
    drama_tag = page.query_selector('.drama > p')
    if drama_tag:
        drama = drama_tag.text_content().strip()
    return {
        'name': name,              # title
        'cover': cover,            # cover image URL
        'categories': categories,  # categories
        'drama': drama,            # synopsis
        'score': score             # rating
    }
```
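The task list at the top also mentions the release date, which the function above does not extract. A hedged sketch of how it could be added follows; the 'div.info span' selector and the date pattern are my own assumptions about the demo site's markup, not taken from the original:

```python
import re

# Sketch: pull the release date out of the info line, e.g. "2020-07-21 ...".
# 'div.info span' and the date pattern are assumptions about the page markup.
def parse_published_at(page):
    published_at = None
    info_tags = page.query_selector_all('div.info span')
    for tag in info_tags:
        text = tag.text_content().strip()
        match = re.search(r'\d{4}-\d{2}-\d{2}', text)
        if match:
            published_at = match.group()
            break
    return published_at
```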
How to store
For storage we simply use a txt file: the crawled content is written directly into a text file.
```python
# Store the crawled data
def save_data(data):
    # where the file is stored ('movies.txt' is assumed; the original
    # filename was lost from the source text)
    data_path = '{0}/movies.txt'.format(RESULT_DIR)
    # perform the file writes; str() guards the fields that may be None
    with open(data_path, 'a+', encoding='utf-8') as file:
        name = data.get('name', None)
        cover = data.get('cover', None)
        categories = data.get('categories', None)
        drama = data.get('drama', None)
        score = data.get('score', None)
        file.write('name:' + str(name) + '\n')
        file.write('cover:' + str(cover) + '\n')
        file.write('categories:' + str(categories) + '\n')
        file.write('drama:' + str(drama) + '\n')
        file.write('score:' + str(score) + '\n')
        file.write('=' * 50 + '\n')
```
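If machine-readable output is preferable, a JSON-lines variant is a small change. A minimal sketch, assuming the same data dict and RESULT_DIR; the 'movies.jsonl' filename is my own choice:

```python
import json

# Sketch: append each movie record as one JSON object per line.
# 'movies.jsonl' is an assumed filename, not from the original article.
def save_data_jsonl(data):
    data_path = '{0}/movies.jsonl'.format(RESULT_DIR)
    with open(data_path, 'a', encoding='utf-8') as file:
        file.write(json.dumps(data, ensure_ascii=False) + '\n')
```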
Source code
```python
import logging
from os import makedirs
from os.path import exists
from urllib.parse import urljoin
from playwright.sync_api import sync_playwright

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')

# base URL of the target site (assumed: the URL was lost from the source text)
BASE_URL = 'https://ssr1.scrape.center'
TOTAL_PAGE = 10
RESULT_DIR = 'results'
exists(RESULT_DIR) or makedirs(RESULT_DIR)

# Crawl web content
def scrape_page(page, url):
    logging.info('scraping %s ...', url)
    try:
        page.goto(url)
        page.wait_for_load_state('networkidle')
    except Exception:
        logging.error('error occurred while scraping %s', url, exc_info=True)

def scrape_index(page, page_index):
    index_url = f'{BASE_URL}/page/{page_index}'
    return scrape_page(page, index_url)

def scrape_detail(page, url):
    return scrape_page(page, url)

# Parse the list page
def parse_index(page):
    # grab every detail link on the list page ('a.name' is assumed)
    elements = page.query_selector_all('a.name')
    for element in elements:
        part_of_url = element.get_attribute('href')
        detail_url = urljoin(BASE_URL, part_of_url)
        logging.info('get url: %s', detail_url)
        yield detail_url

# Parse the detail page (selectors reconstructed from the demo site's markup)
def parse_detail(page):
    # title
    name = None
    name_tag = page.query_selector('h2.m-b-sm')
    if name_tag:
        name = name_tag.text_content()
    # cover image
    cover = None
    cover_tag = page.query_selector('img.cover')
    if cover_tag:
        cover = cover_tag.get_attribute('src')
    # categories
    categories = []
    category_tags = page.query_selector_all('.categories > button > span')
    if category_tags:
        categories = [category.text_content() for category in category_tags]
    # rating
    score = None
    score_tag = page.query_selector('p.score')
    if score_tag:
        score = score_tag.text_content().strip()
    # synopsis
    drama = None
    drama_tag = page.query_selector('.drama > p')
    if drama_tag:
        drama = drama_tag.text_content().strip()
    return {
        'name': name,
        'cover': cover,
        'categories': categories,
        'drama': drama,
        'score': score
    }

# Store the crawled data ('movies.txt' is an assumed filename)
def save_data(data):
    data_path = '{0}/movies.txt'.format(RESULT_DIR)
    with open(data_path, 'a+', encoding='utf-8') as file:
        name = data.get('name', None)
        cover = data.get('cover', None)
        categories = data.get('categories', None)
        drama = data.get('drama', None)
        score = data.get('score', None)
        file.write('name:' + str(name) + '\n')
        file.write('cover:' + str(cover) + '\n')
        file.write('categories:' + str(categories) + '\n')
        file.write('drama:' + str(drama) + '\n')
        file.write('score:' + str(score) + '\n')
        file.write('=' * 50 + '\n')

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        for page_index in range(1, TOTAL_PAGE + 1):
            scrape_index(page, page_index)
            detail_urls = list(parse_index(page))
            for detail_url in detail_urls:
                scrape_detail(page, detail_url)
                data = parse_detail(page)
                logging.info('get data: %s', data)
                save_data(data)
        browser.close()

if __name__ == '__main__':
    main()
```
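Note that before the script can run, Playwright itself and a browser build need to be installed (pip install playwright, then playwright install chromium). With headless=False the Chromium window stays visible, so you can watch the crawl as it runs.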
Copyright Information
This article was compiled and written by PorterZhang.
My GitHub: PorterZhang2021
My blog: PorterZhang
This concludes the introduction to using the Playwright library to scrape data from the movie site. For more content on scraping with Playwright, please search my previous articles, and I hope you will continue to support me in the future!