Updated on 2024-11-21

Crawling 2022 Chinese New Year Movie Information with Python

Prerequisites

Familiarity with basic HTML
Familiarity with basic XPath

Introduction

Python is a cross-platform, high-level scripting language that combines interpreted, compiled, interactive, and object-oriented features. Originally designed for writing automation scripts, it is used more and more for large stand-alone projects as new versions have added features to the language.
Requests is a widely used Python HTTP client library.
Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time-series data easy and intuitive.
time is a module in the Python standard library, so it needs no separate installation. It is used mainly for working with times and dates.
lxml is an XML and HTML parser whose main job is to parse and extract data from XML and HTML documents. Like Python's regular-expression engine, lxml is implemented in C, which makes it a high-performance parser. It also supports XPath syntax for locating specific elements and node information.
HTML (HyperText Markup Language) is mainly used for displaying data; its focus is on how the data looks. XML (Extensible Markup Language) is mainly used for transmitting and storing data; its focus is on the content of the data.

Experimental objective: crawl 2022 Spring Festival movie information with Python

Experimental environment

Python (object-oriented high-level language)
Requests 2.14.2 (Python third-party library)
Pandas 1.1.0 (Python third-party library)
time (Python standard library)
lxml (Python third-party library)
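
Before running the script, it can help to confirm which versions of the third-party libraries are actually installed. The following is a minimal sketch using the standard library's importlib.metadata (Python 3.8 or later). The helper name installed_version is hypothetical, and the distribution names passed to it are assumed to match the names the packages were installed under.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Report the libraries listed in the experimental environment.
for dist in ("requests", "pandas", "lxml"):
    print(dist, installed_version(dist) or "not installed")
```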

Steps

Target website

/cinema/later/shenzhen/

Analyzing the website

Press F12 to open the browser's developer tools

Press Ctrl+Shift+C to switch to element-selection mode, then select a movie title on the page

Press Ctrl+F; a search box appears in the Elements panel

Copy the XPath of the selected element

The XPath is //*[@id="showing-soon"]/div[1]/div/h3/a

Paste it into the search box to verify the XPath

Compare the HTML of the other movies to find what they have in common

The target elements all sit inside sibling div containers, so remove the index from the XPath to match every movie

The XPath becomes //*[@id="showing-soon"]/div/div/h3/a
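
The effect of dropping the [1] index can be checked offline. The sketch below is only an illustration under assumptions: the markup is hypothetical and heavily simplified, and it uses the standard library's xml.etree.ElementTree, which supports a subset of XPath and needs well-formed markup, as a stand-in for lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mimics the structure described above.
SAMPLE = """
<html><body>
  <div id="showing-soon">
    <div><div><h3><a href="/subject/1/">Movie A</a></h3></div></div>
    <div><div><h3><a href="/subject/2/">Movie B</a></h3></div></div>
    <div><div><h3><a href="/subject/3/">Movie C</a></h3></div></div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# As copied from the browser: div[1] pins the match to the first movie only.
first_only = root.findall(".//*[@id='showing-soon']/div[1]/div/h3/a")

# Generalised: without the index, the title link of every movie matches.
all_titles = root.findall(".//*[@id='showing-soon']/div/div/h3/a")

print([a.text for a in first_only])
print([a.text for a in all_titles])
```

With this sample, the first query returns a single link and the second returns all three.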

Repeat the same steps for the remaining target elements
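
One caveat when each field is collected with its own XPath query: if any movie lacks a field, the resulting lists have different lengths and no longer line up row by row. A possible alternative is to select each movie container first and then read the fields relative to it. The sketch below shows this with hypothetical sample markup and the standard library's xml.etree.ElementTree as a stand-in for lxml. The names extract_movies and text_or_empty and the dictionary keys are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second movie has no "want to see" entry.
SAMPLE = """
<div id="showing-soon">
  <div><div>
    <h3><a href="/subject/1/">Movie A</a></h3>
    <ul><li>02-01</li><li>Comedy</li><li>China</li><li><span>1200</span></li></ul>
  </div></div>
  <div><div>
    <h3><a href="/subject/2/">Movie B</a></h3>
    <ul><li>02-01</li><li>Drama</li><li>China</li></ul>
  </div></div>
</div>
"""


def text_or_empty(node):
    """Return the stripped text of an element, or '' when the element is missing."""
    return (node.text or "").strip() if node is not None else ""


def extract_movies(root):
    """Build one dict per movie so that every field stays attached to its own row."""
    movies = []
    for item in root.findall("./div/div"):
        link = item.find("./h3/a")
        movies.append({
            "title": text_or_empty(link),
            "href": link.get("href", "") if link is not None else "",
            "date": text_or_empty(item.find("./ul/li[1]")),
            "type": text_or_empty(item.find("./ul/li[2]")),
            "area": text_or_empty(item.find("./ul/li[3]")),
            "wish": text_or_empty(item.find("./ul/li[4]/span")),
        })
    return movies


movies = extract_movies(ET.fromstring(SAMPLE.strip()))
for movie in movies:
    print(movie)
```

A missing field becomes an empty string in that movie's row instead of shifting the values of later movies.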

Finally, save the data as a CSV file with Pandas

# Save the data with pandas
df = pd.DataFrame()
df['Release date'] = Ondate
df['Title'] = name
df['Type'] = movie_class
df['Country/region of production'] = area
df['Number of people who want to see it'] = num
df['Hyperlink'] = href
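
Assigning lists to DataFrame columns one by one only works when every list has the same length. The sketch below is not the article's method: it uses the standard library's csv module in place of Pandas to show the same idea, parallel lists zipped into rows under a header, with a length check first. The function name save_rows and the sample values are hypothetical.

```python
import csv
import io


def save_rows(stream, columns):
    """Write parallel lists as CSV rows; `columns` maps header -> list of values."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of values")
    writer = csv.writer(stream)
    writer.writerow(columns.keys())           # header row
    writer.writerows(zip(*columns.values()))  # one row per movie


# Hypothetical sample data standing in for the scraped lists.
columns = {
    "Release date": ["02-01", "02-01"],
    "Title": ["Movie A", "Movie B"],
    "Hyperlink": ["/subject/1/", "/subject/2/"],
}

buffer = io.StringIO()
save_rows(buffer, columns)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, a file opened with open(path, "w", newline="", encoding="gbk") can be passed as the stream.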

Code implementation

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 25 10:07:11 2022
@author: TFX
"""
import time
import requests  # HTTP request library
import pandas as pd
from lxml import etree  # HTML/XML parsing library
# Date stamp for the output file name
today = time.strftime('%Y{y}%m{m}%d{d}', time.localtime()).format(y='Year', m='Month', d='Day')
# Target URL (only the path appears here; the host is omitted in the source)
url = '/cinema/later/shenzhen/'
# Request headers
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
        }
# Send the request
response = requests.get(url=url, headers=headers)
# Parse the data; the XPath expressions can be obtained by inspecting elements in the browser
html = etree.HTML(response.text)  # Convert the response text into an element tree
# Movie detail hyperlinks
href = html.xpath('//*[@id="showing-soon"]/div/div/h3/a/@href')
# Release date
Ondate = html.xpath('//*[@id="showing-soon"]/div/div/ul/li[1]/text()')
# Title
name = html.xpath('//*[@id="showing-soon"]/div/div/h3/a/text()')
# Type
movie_class = html.xpath('//*[@id="showing-soon"]/div/div/ul/li[2]/text()')
# Country/region of production
area = html.xpath('//*[@id="showing-soon"]/div/div/ul/li[3]/text()')
# Number of people who want to see it
num = html.xpath('//*[@id="showing-soon"]/div/div/ul/li[4]/span/text()')
# Save the data with pandas
df = pd.DataFrame()
df['Release date'] = Ondate
df['Title'] = name
df['Type'] = movie_class
df['Country/region of production'] = area
df['Number of people who want to see it'] = num
df['Hyperlink'] = href
df.to_csv('2022 Spring Festival Movie_' + today + '.csv', mode='w', index=False, encoding='gbk')
print('Save complete!')
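
The date stamp in the file name is built in two passes: time.strftime fills in the % fields and copies the {y}, {m}, and {d} placeholders through unchanged, then str.format substitutes the labels. A small sketch with a fixed date makes this easier to follow. The helper name date_stamp is hypothetical, and the labels are passed as parameters here rather than hard-coded.

```python
import time


def date_stamp(t, y="Year", m="Month", d="Day"):
    """Format a struct_time as a labelled date string for use in a file name."""
    # Pass 1: strftime expands %Y, %m, %d and leaves the {y}/{m}/{d} text as it is.
    partial = time.strftime("%Y{y}%m{m}%d{d}", t)
    # Pass 2: str.format replaces the placeholders with the labels.
    return partial.format(y=y, m=m, d=d)


# A fixed date so that the result does not depend on when the code runs.
fixed = time.strptime("2022-01-25", "%Y-%m-%d")
print(date_stamp(fixed))  # 2022Year01Month25Day
print("2022 Spring Festival Movie_" + date_stamp(fixed) + ".csv")
```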

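Output result

Running the script prints "Save complete!" and writes a CSV file named 2022 Spring Festival Movie_<date>.csv containing the six columns defined above.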

Summary

This concludes the walkthrough of crawling 2022 Spring Festival movie information with Python: inspect the page to find the XPath of each field, extract the fields with lxml, and save them to a CSV file with Pandas.