SoFunction
Updated on 2024-11-12

Using XPath and requests in Python 3: a detailed example

A simple use of the lxml etree and requests libraries, shown through a small application that crawls the Douban movie rankings.

etree selects nodes using XPath syntax.
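As a quick illustration of the XPath forms used in this article, here is a minimal sketch that runs against an inline HTML fragment instead of a live page. The fragment, the example.com links, and the variable names are made up for illustration; only `etree.HTML` and `xpath` come from lxml.

```python
from lxml import etree

# Hypothetical HTML fragment, shaped like two entries of a ranking list.
html = """
<ol>
  <li><div class="item">
    <div class="pic"><em>1</em><a href="https://example.com/movie/1">poster</a></div>
    <div class="info"><a><span class="title">First Movie</span></a></div>
  </div></li>
  <li><div class="item">
    <div class="pic"><em>2</em><a href="https://example.com/movie/2">poster</a></div>
    <div class="info"><a><span class="title">Second Movie</span></a></div>
  </div></li>
</ol>
"""

root = etree.HTML(html)

# '//' searches the whole document; [@class="item"] filters on an attribute.
items = root.xpath('//ol/li/div[@class="item"]')

rows = []
for item in items:
    # A leading './' makes the path relative to the current element.
    rank = item.xpath('./div[@class="pic"]/em/text()')[0]   # text() returns text nodes
    link = item.xpath('./div[@class="pic"]/a/@href')[0]     # @href returns the attribute value
    title = item.xpath('.//span[@class="title"]/text()')[0]
    rows.append((rank, title, link))

print(rows)
```

Every `xpath` call returns a list, which is why each result is indexed with `[0]` and why a missing node shows up as an empty list rather than as `None`.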

import requests
import ssl
from lxml import etree


ssl._create_default_https_context = ssl._create_unverified_context

session = requests.Session()
# The list is paged 25 entries at a time: start = 0, 25, ..., 225
for start in range(0, 250, 25):
    # The scheme and host of this URL are missing in the source
    URL = '/top250/?start=' + str(start)
    req = session.get(URL)
    # Set the encoding used to decode the page
    req.encoding = 'utf8'
    # Parse the page text into an Element
    root = etree.HTML(req.text)
    # Select ol/li/div[@class="item"] regardless of their position in the document
    items = root.xpath('//ol/li/div[@class="item"]')
    for item in items:
        # Note: a movie may have only a Chinese name and no English name,
        # and the short quote may be missing
        rank, name, alias, rating_num, quote, url = "", "", "", "", "", ""
        try:
            url = item.xpath('./div[@class="pic"]/a/@href')[0]
            rank = item.xpath('./div[@class="pic"]/em/text()')[0]
            title = item.xpath('./div[@class="info"]//a/span[@class="title"]/text()')
            name = title[0].encode('gb2312', 'ignore').decode('gb2312')
            alias = title[1].encode('gb2312', 'ignore').decode('gb2312') if len(title) == 2 else ""
            rating_num = item.xpath('.//div[@class="bd"]//span[@class="rating_num"]/text()')[0]
            quote_tag = item.xpath('.//div[@class="bd"]//span[@class="inq"]')
            if len(quote_tag) != 0:
                quote = quote_tag[0].text.encode('gb2312', 'ignore').decode('gb2312').replace('\xa0', '')
            # Output the rank, rating, and quote
            print(rank, rating_num, quote)
            # Output the Chinese name and the English name
            print(name, alias.replace('/', ','))
        except Exception as exc:
            print('failed!', exc)
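
The listing passes every string through `.encode('gb2312', 'ignore').decode('gb2312')`. The sketch below shows what that round trip does: it silently drops any character GB2312 cannot represent, presumably so the text can be printed on a console that uses a GB2312-compatible code page (that motive is an assumption, the article does not state it). The helper name `to_gb2312_safe` and the sample string are hypothetical.

```python
def to_gb2312_safe(text):
    """Drop every character that the GB2312 codec cannot represent."""
    return text.encode('gb2312', 'ignore').decode('gb2312')


# The emoji is outside GB2312, so the 'ignore' error handler removes it.
sample = '中文 title \U0001F600'
cleaned = to_gb2312_safe(sample)
print(repr(cleaned))
```

Because the characters are removed without warning, this is only appropriate for display. Data that is stored or processed further is better kept as the original Unicode string.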

Additional knowledge: crawling with requests and parsing with XPath

Code:

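# Crawl with requests
import requests

# The URL of a story from Sina News (the scheme, host, and page name are missing in the source)
url = '/s/2018-05-09/'

res = requests.get(url)
# Check which encodings the page declares
encoding = requests.utils.get_encodings_from_content(res.text)
#print(encoding)


# Decode the page and print the first 500 characters
html_doc = res.content.decode("utf-8")
print(html_doc[:500])

# Save the page content (the output file name is missing in the source)
with open('', 'w', encoding='utf-8') as f:
    f.write(html_doc)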

Run results:

<!DOCTYPE html>
<!-- [ published at 2018-05-09 18:23:13 ] -->
<!-- LLTJ_MT:name ="Pescadores." -->
 
<html>
<head>
<meta charset="utf-8"/>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="sudameta" content="urlpath:s/; allCIDs:51924,257,51895,200856,56264,258,38790">
<title>Elementary school teacher punished students by making them run barefoot on the playground; officials: will be dealt with as required|barefoot|students|Hualong Network_Sina News</title>
<meta name="keywords" content="barefoot, students, Hualong Network" />
<meta name="tags" content="barefoot, students, Hualong Network" />
<meta name="description" content="Original title: A Tongnan elementary school PE teacher punished students by making them run barefoot on the playground, follow-up: the district education commission sent Hualong Network a statement
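
The page above declares its encoding in a `<meta>` tag, and that declaration is what the crawl code inspects before decoding. The following is a minimal standard-library sketch of the same idea: look for a declared charset in the raw bytes, then decode with it. `sniff_charset` is a hypothetical helper and a simplified stand-in, not the implementation that requests uses, and the sample page is made up.

```python
import re


def sniff_charset(raw, default='utf-8'):
    """Return the first charset named in a <meta> tag, or the default."""
    # Charset declarations are plain ASCII, so only the ASCII-compatible
    # part of the first 2048 bytes is inspected.
    head = raw[:2048].decode('ascii', 'ignore')
    match = re.search(r'<meta[^>]*charset=["\']?([\w-]+)', head, flags=re.I)
    return match.group(1).lower() if match else default


# Hypothetical page body, as bytes, the way res.content would deliver it.
raw = '<html><head><meta charset="utf-8"/><title>标题</title></head></html>'.encode('utf-8')

encoding = sniff_charset(raw)
html_doc = raw.decode(encoding)
print(encoding, html_doc[:40])
```

Decoding with the declared charset avoids hard-coding "utf-8", which would produce mojibake or raise an error on a page served in another encoding such as GBK.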

Code:

# Parse with XPath
from lxml import etree

# Build an element tree from the HTML
tree = etree.HTML(html_doc)

# Set the path of the title
path_title = '/html/body//h1[@class="main-title"]//text()'

# Extract the title text nodes
node_title = tree.xpath(path_title)
print("===" * 20)
print(node_title[0])

# Set the path of the article body
# (the attribute test inside div[@] is missing in the source and must be filled in)
path_content = '//div[@class="article-content-left"]//div[@]//text()'

# Extract the body text nodes
node_content = tree.xpath(path_content)
print("===" * 20)
print("。".join(node_content))

Run results:

============================================================
Elementary school teacher punished students by making them run barefoot on the playground; officials: will be dealt with as required
============================================================
 
 。Original title: A Tongnan elementary school PE teacher punished students by making them run barefoot on the playground, follow-up: the district education commission sent Hualong Network a statement。
。Chongqing Client-Hualong Network, May 9: Over the past two days, many parents of students in Class 6, Grade 2 at Chaoyang Primary School in Tongnan District, Chongqing, have been upset because many of the children have blisters on the soles of their feet. When asked, the children said that some students had not worn sneakers to PE class and had been told by the PE teacher to run barefoot on the playground. After receiving this complaint through the Chongqing Network Question and Answer Platform, Hualong Network reporters immediately investigated. Today (the 9th), Hualong Network published。《PE teacher at a Tongnan, Chongqing elementary school punished students by making them run barefoot on the playground until their soles blistered; local education commission steps in》。After the report, the Tongnan education commission took the matter very seriously and sent Hualong Network an official statement.。
。 。 [Full text of the statement]。
。Statement on the complaint made by parents on Hualong Network about a teacher physically punishing students during PE class。
。On the morning of May 7, 2018, while teaching a PE class, Mr. Zou, a PE teacher at Chaoyang Primary School in Tongnan District, found that a small number of students were not wearing sneakers as required for PE class. The teacher believed that running in sandals was a safety hazard for the students and for others, and that the plastic track would not interfere with barefoot exercise, so the students without sneakers were told to take off their sandals and join the class for the warm-up run. At the time, Mr. Zou did not notice anything unusual about the students, and no student reported any problem. The parents later informed the school that a very small number of the students who ran barefoot had been affected. The school immediately communicated with some of the parents and promptly investigated the matter. The teacher was criticized for this inappropriate teaching method and will be dealt with in accordance with the relevant regulations.。
。Chongqing Tongnan District Education Committee。
。May 9, 2018。
。  Source: Hualong Network。
 
。Editor in charge: Zhang Yiling 。
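
The joined output above contains empty segments (for example "。 。") because `//text()` also returns whitespace-only text nodes. One way to tidy this, sketched here under the assumption that blank nodes carry no content, is to strip and filter the nodes before joining. `join_text_nodes` and the sample list are hypothetical.

```python
def join_text_nodes(nodes, sep='\n'):
    """Strip each text node, drop the empty ones, and join the rest."""
    # Non-breaking spaces inside a node become ordinary spaces;
    # strip() then removes leading and trailing whitespace.
    cleaned = (node.replace('\xa0', ' ').strip() for node in nodes)
    return sep.join(text for text in cleaned if text)


# Hypothetical result of tree.xpath('...//text()'), including blank nodes.
node_content = ['\n ', 'first paragraph', ' \u3000', 'second\xa0paragraph ', '\n']
print(join_text_nodes(node_content))
```

Joining with a newline instead of "。" also avoids doubling the sentence-ending punctuation that the article text already contains.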

That is everything I have to share about using XPath and requests in Python 3. I hope it serves as a useful reference, and thank you for your support.