Brief outline
This may turn into a longer series. It mainly summarizes the content of the first seven chapters of "Python3 Web Crawler Development Practice" and gets familiar with the related frameworks and technologies.
Mission objective
Crawl a movie data site, assumed here to be the book's demo site https://ssr1.scrape.center/ (the URL was lost from the source text). The site has no anti-crawling measures and its data is rendered server-side; what we want to crawl are the movie detail records reachable from the list pages.
Analysis of mission objectives
- Crawl /page/{page}, the list pages of the site, and extract the desired detail-page URLs from their content
- Crawl /detail/{id}, the detail pages, and capture the following fields:
  - movie title
  - movie cover image URL
  - movie release date
  - movie categories
  - movie rating
  - movie synopsis
- Store the crawled content in the storage of choice
Technology Selection and Crawling
How to crawl
The Playwright library is open-sourced by Microsoft and is remarkably powerful. Besides synchronous operation it also supports asynchronous operation, and it would not be an exaggeration to call it one of the most capable libraries of its kind: it supports element selection with both XPath and CSS selectors, and it can even record your mouse clicks and generate automation code from them. Overall, its functionality is very comprehensive.
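As a taste of the asynchronous side mentioned above, here is a minimal sketch using Playwright's async API (the target URL is the demo site assumed throughout this article):

```python
import asyncio
from playwright.async_api import async_playwright

# Minimal sketch of Playwright's asynchronous API, for contrast with the
# synchronous API used in the rest of this article.
async def fetch_title():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://ssr1.scrape.center/')
        print(await page.title())
        await browser.close()

asyncio.run(fetch_title())
```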
Building basic crawler functions
```python
# Crawl web content
def scrape_page(page, url):
    logging.info('scraping %s ...', url)
    try:
        page.goto(url)
        # wait until network requests have gone quiet before returning
        page.wait_for_load_state('networkidle')
    except Exception:
        logging.error('error occurred while scraping %s', url, exc_info=True)
```
For fetching web content we only need to operate on the page object: passing in the page and a URL performs the request we want. Within this request we wait for network activity to go idle, which serves as our signal that the page has fully loaded.
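If waiting for 'networkidle' ever proves unreliable on a given site, waiting for a concrete element is a common alternative. A minimal sketch follows; the default '.item' selector is a placeholder of my own, not from the original article:

```python
import logging

# Sketch: wait for a specific element instead of network idle.
# The '.item' default is a placeholder selector, not from the original.
def scrape_page_by_selector(page, url, selector='.item'):
    logging.info('scraping %s ...', url)
    try:
        page.goto(url)
        # blocks until an element matching the selector becomes visible
        page.wait_for_selector(selector)
    except Exception:
        logging.error('error occurred while scraping %s', url, exc_info=True)
```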
Building the list-page crawl function
For this part we only need to work out the page-number pattern of the URLs to crawl the list content. After a quick analysis we find that the list URLs follow /page/{page}, where only the {page} part changes, so the crawl function is constructed as follows:
```python
def scrape_index(page, page_index):
    index_url = f'{BASE_URL}/page/{page_index}'
    return scrape_page(page, index_url)
```
Building the detail-page crawl function
Detail-page crawling builds on the URLs obtained by parsing the list pages, so the detail crawl function only needs to know a URL and can call the basic crawl function directly; the URLs themselves come from the list-page parsing described below. The overall construction is:
```python
def scrape_detail(page, url):
    return scrape_page(page, url)
```
How to parse
Parsing the list page to get the detail-page URLs
Playwright's quick and easy selectors let us grab the tags and attributes we need in a single line of code, which makes it very convenient to collect the detail-page URLs:
```python
# Parse the list page
def parse_index(page):
    # grab every detail link on the list page
    # ('a.name' is assumed here; the selector was lost from the source text)
    elements = page.query_selector_all('a.name')
    # read each link's href and join it with the base URL
    for element in elements:
        part_of_url = element.get_attribute('href')
        detail_url = urljoin(BASE_URL, part_of_url)
        logging.info('get url: %s', detail_url)
        yield detail_url
```
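One design note: parse_index is a generator, and main() below drains it with list(...) before visiting any detail page. This matters because a single page object is reused for every navigation, so the elements found by query_selector_all become stale as soon as the next page.goto replaces the document.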
Parsing the detail page to get the required data
Once a detail page has been loaded, we parse the information within it to obtain the movie name, categories, cover image URL, synopsis, and rating:
```python
# NOTE: several selectors below were garbled in the source text; 'h2.m-b-sm',
# 'img.cover', '.categories > button > span', 'p.score' and '.drama > p' are
# reconstructions based on the demo site's markup.
def parse_detail(page):
    # title
    name = None
    name_tag = page.query_selector('h2.m-b-sm')
    if name_tag:
        name = name_tag.text_content()
    # cover image
    cover = None
    cover_tag = page.query_selector('img.cover')
    if cover_tag:
        cover = cover_tag.get_attribute('src')
    # categories
    categories = []
    category_tags = page.query_selector_all('.categories > button > span')
    if category_tags:
        categories = [category.text_content() for category in category_tags]
    # rating
    score = None
    score_tag = page.query_selector('p.score')
    if score_tag:
        score = score_tag.text_content().strip()
    # synopsis
    drama = None
    drama_tag = page.query_selector('.drama > p')
    if drama_tag:
        drama = drama_tag.text_content().strip()
    return {
        'name': name,              # title
        'cover': cover,            # cover image URL
        'categories': categories,  # categories
        'drama': drama,            # synopsis
        'score': score             # rating
    }
```
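The task list at the top also mentions the release date, which the function above does not extract. A hedged sketch of how it could be added follows; the 'div.info span' selector and the date pattern are my own assumptions about the demo site's markup, not taken from the original:

```python
import re

# Sketch: pull the release date out of the info line, e.g. "2020-07-21 ...".
# 'div.info span' and the date pattern are assumptions about the page markup.
def parse_published_at(page):
    published_at = None
    info_tags = page.query_selector_all('div.info span')
    for tag in info_tags:
        text = tag.text_content().strip()
        match = re.search(r'\d{4}-\d{2}-\d{2}', text)
        if match:
            published_at = match.group()
            break
    return published_at
```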
How to store
For storage we simply use a txt file: the crawled content is written directly into a text file.
```python
# Store the crawled data
def save_data(data):
    # where the file is stored ('movies.txt' is assumed; the original
    # filename was lost from the source text)
    data_path = '{0}/movies.txt'.format(RESULT_DIR)
    # perform the file writes; str() guards the fields that may be None
    with open(data_path, 'a+', encoding='utf-8') as file:
        name = data.get('name', None)
        cover = data.get('cover', None)
        categories = data.get('categories', None)
        drama = data.get('drama', None)
        score = data.get('score', None)
        file.write('name:' + str(name) + '\n')
        file.write('cover:' + str(cover) + '\n')
        file.write('categories:' + str(categories) + '\n')
        file.write('drama:' + str(drama) + '\n')
        file.write('score:' + str(score) + '\n')
        file.write('=' * 50 + '\n')
```
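If machine-readable output is preferable, a JSON-lines variant is a small change. A minimal sketch, assuming the same data dict and RESULT_DIR; the 'movies.jsonl' filename is my own choice:

```python
import json

# Sketch: append each movie record as one JSON object per line.
# 'movies.jsonl' is an assumed filename, not from the original article.
def save_data_jsonl(data):
    data_path = '{0}/movies.jsonl'.format(RESULT_DIR)
    with open(data_path, 'a', encoding='utf-8') as file:
        file.write(json.dumps(data, ensure_ascii=False) + '\n')
```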
Source code
```python
import logging
from os import makedirs
from os.path import exists
from urllib.parse import urljoin
from playwright.sync_api import sync_playwright

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')

# base URL of the target site (assumed: the URL was lost from the source text)
BASE_URL = 'https://ssr1.scrape.center'
TOTAL_PAGE = 10
RESULT_DIR = 'results'
exists(RESULT_DIR) or makedirs(RESULT_DIR)

# Crawl web content
def scrape_page(page, url):
    logging.info('scraping %s ...', url)
    try:
        page.goto(url)
        page.wait_for_load_state('networkidle')
    except Exception:
        logging.error('error occurred while scraping %s', url, exc_info=True)

def scrape_index(page, page_index):
    index_url = f'{BASE_URL}/page/{page_index}'
    return scrape_page(page, index_url)

def scrape_detail(page, url):
    return scrape_page(page, url)

# Parse the list page
def parse_index(page):
    # grab every detail link on the list page ('a.name' is assumed)
    elements = page.query_selector_all('a.name')
    for element in elements:
        part_of_url = element.get_attribute('href')
        detail_url = urljoin(BASE_URL, part_of_url)
        logging.info('get url: %s', detail_url)
        yield detail_url

# Parse the detail page (selectors reconstructed from the demo site's markup)
def parse_detail(page):
    # title
    name = None
    name_tag = page.query_selector('h2.m-b-sm')
    if name_tag:
        name = name_tag.text_content()
    # cover image
    cover = None
    cover_tag = page.query_selector('img.cover')
    if cover_tag:
        cover = cover_tag.get_attribute('src')
    # categories
    categories = []
    category_tags = page.query_selector_all('.categories > button > span')
    if category_tags:
        categories = [category.text_content() for category in category_tags]
    # rating
    score = None
    score_tag = page.query_selector('p.score')
    if score_tag:
        score = score_tag.text_content().strip()
    # synopsis
    drama = None
    drama_tag = page.query_selector('.drama > p')
    if drama_tag:
        drama = drama_tag.text_content().strip()
    return {
        'name': name,
        'cover': cover,
        'categories': categories,
        'drama': drama,
        'score': score
    }

# Store the crawled data ('movies.txt' is an assumed filename)
def save_data(data):
    data_path = '{0}/movies.txt'.format(RESULT_DIR)
    with open(data_path, 'a+', encoding='utf-8') as file:
        name = data.get('name', None)
        cover = data.get('cover', None)
        categories = data.get('categories', None)
        drama = data.get('drama', None)
        score = data.get('score', None)
        file.write('name:' + str(name) + '\n')
        file.write('cover:' + str(cover) + '\n')
        file.write('categories:' + str(categories) + '\n')
        file.write('drama:' + str(drama) + '\n')
        file.write('score:' + str(score) + '\n')
        file.write('=' * 50 + '\n')

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        for page_index in range(1, TOTAL_PAGE + 1):
            scrape_index(page, page_index)
            detail_urls = list(parse_index(page))
            for detail_url in detail_urls:
                scrape_detail(page, detail_url)
                data = parse_detail(page)
                logging.info('get data: %s', data)
                save_data(data)
        browser.close()

if __name__ == '__main__':
    main()
```
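Note that before the script can run, Playwright itself and a browser build need to be installed (pip install playwright, then playwright install chromium). With headless=False the Chromium window stays visible, so you can watch the crawl as it runs.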
Copyright Information
This article was compiled and written by PorterZhang.
My GitHub: PorterZhang2021
My blog: PorterZhang
This concludes the introduction to using the Playwright library to scrape data from the movie site. For more content on scraping with Playwright, please search my previous articles, and I hope you will continue to support me in the future!