Case one:
A photo-gallery website shows each gallery on the page as a cover image. To reach the content you have to click each gallery in turn, then click the link to the advertisement disk, and finally arrive at the Baidu disk display page.
This process is automated with a crawler that collects the Baidu disk addresses and extraction codes, using xpath.
1. First, analyze the gallery list page. The page lists the gallery covers in update order; inspect the HTML structure. Each "li" node corresponds to one gallery, and the href attribute behind it is the address of the inner page (i.e. the advertisement disk link page). So the first task is to collect the addresses of all the inner pages (i.e. the advertisement disk link pages) from the list page.
The code is as follows:
import requests  # import the requests library
from lxml import etree  # import the lxml library (if it is missing, install it with pip install lxml)

url = "/gc/"  # the request address
response = requests.get(url=url)  # the returned result
wb_data = response.text  # the returned result as text
html = etree.HTML(wb_data)  # convert the page into a document tree
b = html.xpath('//ul[@class = "clearfix"]//@href')  # assign every href attribute under class "clearfix" to b, because the target content sits under class "clearfix", behind the href attribute
print(b)  # print b; here b is a list
print(b[0])  # print the first item of b
Result: all the inner page addresses are returned successfully.
2. Open the inner page (i.e. the advertisement disk link page) and get the address of the advertisement disk. The element marked by the red arrow in the picture below is not the real Baidu disk page; you have to click it to reach the Baidu disk address. So in this step you only need to grab the address behind the red arrow.
The code is as follows:
url = "/gc/toutiao/" response = (url= url) wb_data = # Convert pages to document trees html = (wb_data) b = ('//div[@class = "pictext"]//@href') c=b[1] # Need to pay attention to the place, class = "pictext" under the two href, we only need the value of the first href, so the return value and then assigned to the c and take the second item of data print(c)
Result: the advertisement disk address is returned successfully.
3. With the advertisement disk address obtained, the next step is to open that address and grab the real Baidu disk address. The link and the extraction code sit in two different elements, so two sets of data are returned at the end.
The code is as follows:
url = "/xam9I6" response = (url= url) wb_data = # Convert pages to document trees html = (wb_data) b = ('//tr/td/text()') c=b[6]#Extract code d = ('//tr//@href')#Baidu Address print(c) print(d)
Note that the xpath here is written a little differently from the ones above: the target element's ancestors carry no class attribute, so the match can only be made loosely.
For example, the HTML structure around the extraction code is shown below and corresponds to //tr/td/: a single / selects a child node directly under the parent node, while a double // selects any descendant node of the parent node. The extraction code is a child node of tr. Many groups of data share this structure, so the xpath finally outputs a list b (see code b above). We then find the position of the extraction code within that list and assign it to c (see code c above); the real Baidu disk address is obtained the same way.
The web address itself is easy to crawl because it carries the href attribute, so you only need to pay attention to the number of / levels.
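As a quick illustration of the difference between a single / and a double //, here is a minimal, self-contained sketch; the table fragment and the pan.baidu.com link in it are made up for demonstration and are not the actual page:

from lxml import etree

# a made-up table fragment, only to show / (child) versus // (descendant)
snippet = '''
<table>
  <tr><td>Extraction code: abcd</td></tr>
  <tr><td><p><a href="https://pan.baidu.com/s/xxxx">link</a></p></td></tr>
</table>
'''
html = etree.HTML(snippet)

print(html.xpath('//tr/td/text()'))  # single /: td must be a direct child of tr -> ['Extraction code: abcd']
print(html.xpath('//tr//@href'))     # double //: href may sit on any descendant of tr -> ['https://pan.baidu.com/s/xxxx']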
4. Put the above steps together into one script. This involves passing parameters between functions and handling the loops. The code is posted directly below.
# -*-coding:utf8-*-
# encoding:utf-8
import requests
from lxml import etree

firstlink = "/gc/qt/"
AA = ["/gc/",
      "/gc/index_2.html",
      "/gc/index_3.html",
      "/gq/",
      "/gq/index_2.html",
      "/gq/index_3.html",
      "/gq/index_4.html"]

# Step 1: get all the gallery addresses on the list pages
def stepa(AA):
    lit = []
    for url in AA:
        response = requests.get(url=url)
        wb_data = response.text
        html = etree.HTML(wb_data)  # convert the page into a document tree
        a = html.xpath('//ul[@class = "clearfix"]//@href')
        lit.append(a)
    return(lit)

alllink = stepa(AA)

# Step 2: open each collected address in turn and extract the Baidu disk information
def stepb(alllink, firstlink):
    for list in alllink:
        for url in list:
            if url in firstlink:
                continue
            elif "www" in url:
                url2 = url
            else:
                url2 = "" + url  # prepend the site domain (the domain string is omitted in the original)
            response = requests.get(url=url2)
            wb_data = response.text
            html = etree.HTML(wb_data)  # convert the page into a document tree
            b = html.xpath('//div[@class = "pictext"]//@href')
            c = b[1]
            #print(c)
            # get the address of the advertisement page
            url3 = c
            response = requests.get(url=url3)
            wb_data = response.text
            html = etree.HTML(wb_data)  # convert the page into a document tree
            d = html.xpath('//tr/td/text()')
            #print(d)
            e = d[6]  # get the extraction code
            f = html.xpath('//tr//@href')  # get the Baidu disk address
            test = e[-5:]  # keep only the extraction code itself (4 characters)
            test2 = f[-1]  # keep only the link itself, dropping the surrounding ['']
            test3 = test2 + test  # splice the link and the extraction code into one record
            print(test3)
            with open('C:/Users/Beckham/Desktop/python/', 'a', encoding='utf-8') as w:
                w.write('\n' + test3)
                w.close()

stepb(alllink, firstlink)

# Step 3: prompt that the crawl is finished
def over():
    print("ok")

over()
What to look for:
1. The use of return: if a value produced inside a function is to be passed on to a later function, it has to be returned. For example, in def stepa the gallery cover addresses crawled into a (which are opened in the next step) must be handed back with a return statement (return(lit) in the script above); otherwise no data is available after the function has run.
2. The use of continue: the content behind the first gallery address is not the target content, so the element cannot be found there and an error is raised. When reading the gallery addresses the first address therefore has to be skipped: an if judgment is added, and when the address equals the pre-defined non-standard address that pass of the loop is skipped (both this and the return point above are illustrated in the small sketch after these notes).
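Both points can be seen in a minimal, stand-alone sketch; the function name collect and the sample addresses below are made up for illustration and are not part of the crawler:

# a made-up illustration of return and continue, independent of the script above
def collect(urls):
    found = []
    for url in urls:
        if url == "/gc/qt/":  # the pre-defined non-standard address: skip this pass of the loop
            continue
        found.append(url)
    return found              # without this return the caller would receive nothing (None)

result = collect(["/gc/qt/", "/gc/a.html", "/gc/b.html"])
print(result)                 # ['/gc/a.html', '/gc/b.html']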
Print results:
Case two:
Crawling the comments for a book on Douban Reading.
Analyzing the HTML, the comment text sits in the element marked in red, and looking at the structure, every other comment sits in the same position inside its own li node.
So the xpath to parse is //*[@]//div[2]/p/span
As in the previous example, "//" selects descendant nodes of the current node, so here the li node can be skipped entirely and the div[2]/p/span content under li selected directly.
The code is as follows:
# -*-coding:utf8-*-
# encoding:utf-8
import requests
from lxml import etree

firstlink = "/subject/30172069/comments/hot?p=6"

def stepa(firstlink):
    response = requests.get(url=firstlink)
    wb_data = response.text
    html = etree.HTML(wb_data)  # convert the page into a document tree
    a = html.xpath('//*[@]//div[2]/p/span')
    print(a)

stepa(firstlink)
Run the code and the printout is as follows: it does not give the desired comment content.
It then became clear that to get the content it has to be output as text, i.e. the xpath should be parsed as //*[@]//div[2]/p/span/text()
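The reason is easy to see in a minimal, self-contained sketch (using an inline HTML snippet made up for illustration rather than the Douban page): selecting the nodes returns lxml Element objects, while appending /text() returns the text inside them.

from lxml import etree

# a made-up comment fragment, only to show node selection versus text() selection
snippet = '<ul><li><div></div><div><p><span>Nice book</span></p></div></li></ul>'
html = etree.HTML(snippet)

print(html.xpath('//li/div[2]/p/span'))         # Element objects, e.g. [<Element span at 0x...>]
print(html.xpath('//li/div[2]/p/span/text()'))  # the text inside the nodes: ['Nice book']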
The modified code:
# -*-coding:utf8-*-
# encoding:utf-8
import requests
from lxml import etree

firstlink = "/subject/30172069/comments/hot?p=6"

def stepa(firstlink):
    response = requests.get(url=firstlink)
    wb_data = response.text
    html = etree.HTML(wb_data)  # convert the page into a document tree
    a = html.xpath('//*[@]//div[2]/p/span/text()')
    print(a)

stepa(firstlink)
Execute it, and this time the comment content comes through.
Reference address: /
Summary
The above is a short introduction to these python xpath crawler examples. I hope it helps you; if you have any questions, please leave me a message and I will reply in a timely manner. Thank you very much for your support of my website!
If you find this article helpful, please feel free to reprint it, but please credit the source. Thank you!