
Using Selenium and pyquery to crawl the Jingdong product list: a process analysis

In this article we will learn how to use Selenium and pyquery to crawl Jingdong's product list. All of the code in this article was written in the PyCharm IDE, on Windows 10.

1. Preparation

Install the pyquery and selenium libraries. In PyCharm, click File -> Settings, then click Project -> Project Interpreter -> "+". In the dialog that pops up, type "selenium", select "selenium" in the result list, and click the "Install Package" button to install the selenium library. pyquery is installed in the same way.
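If you prefer the command line over the PyCharm UI, the same two libraries can also be installed with pip:

pip install selenium pyquery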

Install Chrome and the chromedriver. chromedriver can be downloaded at /mirrors/chromedriver/. Remember to use matching versions of Chrome and chromedriver: my Chrome version is 70, and the matching chromedriver versions are 2.44, 2.43 and 2.42.

Download and unzip the chromedriver archive, then copy the exe file into the Scripts folder of the Python environment that PyCharm uses, so that it is on the PATH.
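Before moving on, a quick sanity check (a minimal sketch that just opens and closes Chrome) confirms that Selenium can find chromedriver:

from selenium import webdriver

# Raises a WebDriverException if chromedriver cannot be found on the PATH
browser = webdriver.Chrome()
print(browser.name)  # expected output: "chrome"
browser.quit()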

2. Analyze the page to crawl

This time we will crawl the information of the books in the computer books category of Jingdong Books.

Open Chrome, go to the Jingdong site, open the developer tools, and inspect the CSS of the query input box and the query button:

The analysis shows that the search box has id="key" and the query button has class="button". The following code uses Selenium to drive the Chrome browser, enter the keyword "Computer books" into the search box, and click the query button to start the query request:

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from pyquery import PyQuery as pq

# Open the chrome browser via the Chrome() method
browser = webdriver.Chrome()
# Visit the Jingdong website
browser.get("https://www.jd.com")
# Wait up to 50 seconds for expected conditions
wait = WebDriverWait(browser, 50)
# Get the input box via the id attribute of the css selector
input = browser.find_element_by_id('key')
# Write the information to be queried in the input box
input.send_keys('Computer books')
# Get the query button via its class name
submit_button = browser.find_element_by_class_name('button')
# Click the query button
submit_button.click()

The above code launches the Chrome browser and automatically completes the actions of entering the keyword in the search box and clicking the query button.
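One hedged side note for readers on a newer environment: Selenium 4 removed the find_element_by_* helpers, so on a current install the equivalent lookups would be:

from selenium.webdriver.common.by import By

# Selenium 4 style lookups, equivalent to the calls above
input = browser.find_element(By.ID, 'key')
submit_button = browser.find_element(By.CLASS_NAME, 'button')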

After clicking on the query button, the eligible books will be loaded as shown below:

When you scroll down to the bottom of the page, you will see the paging interface:

The next step is to analyze the CSS of the product list and of the pagination.

We want to crawl the book's title, picture, price, publisher and rating count. The following picture shows the product list interface:

From the developer tools you can see that each li node with class="gl-item" holds one product's information (the red box in the picture above).

  • The green box is the image information of the product. The corresponding div node is class="p-img".
  • The blue box is the price information of the product, which corresponds to the div node with class="p-price".
  • The black box is the name information of the product, which corresponds to the div node with class="p-name".
  • The purple box is the product's rating information, which corresponds to the div node with class="p-commit".
  • The brown box is the publisher information for the product, corresponding to the div node with class="p-shopnum".

We use pyquery to parse the product information. When a page has been opened with Selenium, its source code can be read from the browser's page_source attribute.
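As a toy illustration of how pyquery walks these nodes (the markup below is a simplified stand-in, not Jingdong's real HTML):

from pyquery import PyQuery as pq

# Simplified stand-in for one product li node
html = '''
<ul>
  <li class="gl-item">
    <div class="p-name"><em>Some Book Title</em></div>
    <div class="p-price"><em>¥</em><i>59.00</i></div>
  </li>
</ul>
'''
doc = pq(html)
for item in doc('.gl-item').items():  # items() yields each matched li node
  print(item('.p-name').find('em').text())  # Some Book Title
  print(item('.p-price em').text() + item('.p-price i').text())  # ¥59.00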

Here is a pit to watch out for: Jingdong's product list page displays a fixed number of goods per page, but when a new page loads, the goods are not all rendered at once; new goods are loaded dynamically as the mouse scrolls down. Therefore, when using Selenium we must make the page scroll to the bottom of the product list, so that all products are rendered and the crawled data is complete; otherwise items will be lost.
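The code below simply scrolls a fixed number of times with a sleep in between. As a sturdier alternative (my own sketch, not part of the original code), you can keep scrolling until the page height stops growing:

import time

def scroll_to_bottom(browser, pause=1.0, max_rounds=10):
  # Keep scrolling until the document height stops changing,
  # so that all lazy-loaded items have rendered
  last_height = browser.execute_script("return document.body.scrollHeight")
  for _ in range(max_rounds):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
      break
    last_height = new_height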

The code for parsing the goods on one page is given below:

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from pyquery import PyQuery as pq
import time

# Open the chrome browser via the Chrome() method
browser = webdriver.Chrome()
# Visit the Jingdong website
browser.get("https://www.jd.com")
# Wait up to 50 seconds for expected conditions
wait = WebDriverWait(browser, 50)
# Get the input box via the id attribute of the css selector
input = browser.find_element_by_id('key')
# Write the information to be queried in the input box
input.send_keys('Computer books')
# Get the query button
submit_button = browser.find_element_by_class_name('button')
# Click the query button
submit_button.click()

# Simulate sliding to the bottom so the lazy-loaded goods appear
for i in range(1, 5):
  browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
  time.sleep(1)

# Total number of pages of the product listing
total = wait.until(
  EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > em:nth-child(1) > b')
  )
)

# Replace the xmlns attribute, otherwise pyquery cannot parse out the data (see the summary)
html = browser.page_source.replace('xmlns', 'another_attr')

doc = pq(html)
# One product's information is stored inside a li node with class="gl-item";
# the items() method returns all the matched products.
li_list = doc('.gl-item').items()
# Loop over and parse each product's information
for item in li_list:
  image_html = item('.gl-i-wrap .p-img')
  book_img_url = image_html.find('img').attr('data-lazy-img')
  # When the lazy-load attribute reads "done", the real address is in src
  if book_img_url == "done":
    book_img_url = image_html.find('img').attr('src')
  print('Image address:' + book_img_url)
  # Remove the <font> tags that highlight the keyword before reading the title
  item('.p-name').find('font').remove()
  book_name = item('.p-name').find('em').text()
  print('Book title:' + book_name)
  price = item('.p-price').find('em').text() + str(item('.p-price').find('i').text())
  print('Price:' + price)
  commit = item('.p-commit').find('strong').text()
  print('Number of evaluations:' + commit)
  shopnum = item('.p-shopnum').find('a').text()
  print('Publisher:' + shopnum)
  print('++++++++++++++++++++++++++++++++++++++++++++')

For paging, we need to analyze the goods page by page. We can use Selenium to click the "next page" button and then read the source code of the next page. Let's analyze the CSS of the next-page control: scroll the mouse to the bottom of the page, and you will see the pagination:

As you can see from the figure above, we need to get the "Next" button and then call its click method. The corresponding code is:

  next_page_button = wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.pn-next > em'))
  )
  next_page_button.click()

  # Slide to the bottom of the page to trigger loading of the data
  for i in range(0, 3):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)

  # One page shows 60 items; waiting for "#J_goodsList > ul > li:nth-child(60)" ensures all 60 items have loaded.
  wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)"))
  )
  # Confirm that the page turn succeeded: the new page number is shown as the current page in the pagination bar at the bottom.
  wait.until(
    EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page_num))
  )

The complete code is given below:

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from pyquery import PyQuery as pq
import time

# Open different browser instances
def openBrowser(browser_type):
  if browser_type == 'chrome':
    return webdriver.Chrome()
  elif browser_type == 'firefox':
    return webdriver.Firefox()
  elif browser_type == 'safari':
    return webdriver.Safari()
  elif browser_type == 'PhantomJS':
    return webdriver.PhantomJS()
  else:
    return webdriver.Chrome()

def parse_website():
  # Open the chrome browser via the openBrowser() helper
  browser = openBrowser('chrome')
  # Visit the Jingdong website
  browser.get("https://www.jd.com")
  # Wait up to 50 seconds for expected conditions
  wait = WebDriverWait(browser, 50)
  # Get the input box via the css selector. The until() method returns only after the browser
  # has fully loaded the corresponding node; presence_of_all_elements_located locates the
  # nodes via the css selector.
  input = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#key'))
  )

  # input = browser.find_element_by_id('key')
  # Write the information to be queried in the input box
  input[0].send_keys('Computer books')
  # Wait until the query button is fully loaded and clickable, then return the button object
  submit_button = wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '.button'))
  )
  # Click on the query button
  submit_button.click()

  # Simulate sliding to the bottom so the lazy-loaded goods appear
  for i in range(0, 3):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)

  # Total number of pages of the product listing
  total = wait.until(
    EC.presence_of_all_elements_located(
      (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > em:nth-child(1) > b')
    )
  )
  html = browser.page_source.replace('xmlns', 'another_attr')
  parse_book(1, html)

  for page_num in range(2, int(total[0].text) + 1):
    parse_next_page(page_num, browser, wait)

# Parse the next page
def parse_next_page(page_num, browser, wait):

  next_page_button = wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.pn-next > em'))
  )
  next_page_button.click()

  # Slide to the bottom of the page to trigger loading of the data
  for i in range(0, 3):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)

  # One page shows 60 items; waiting for "#J_goodsList > ul > li:nth-child(60)" ensures all 60 items have loaded.
  wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)"))
  )
  # Confirm that the page turn succeeded: the new page number is shown as the current page in the pagination bar.
  wait.until(
    EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page_num))
  )

  html = browser.page_source.replace('xmlns', 'another_attr')
  parse_book(page_num, html)

def parse_book(page, html):
  doc = pq(html)
  li_list = doc('.gl-item').items()
  print('------------------- Page ' + str(page) + ' book information ---------------------')
  for item in li_list:
    image_html = item('.gl-i-wrap .p-img')
    book_img_url = image_html.find('img').attr('data-lazy-img')
    if book_img_url == "done":
      book_img_url = image_html.find('img').attr('src')
    print('Image address:' + book_img_url)
    item('.p-name').find('font').remove()
    book_name = item('.p-name').find('em').text()
    print('Book title:' + book_name)
    price = item('.p-price').find('em').text() + str(item('.p-price').find('i').text())
    print('Price:' + price)
    commit = item('.p-commit').find('strong').text()
    print('Number of evaluations:' + commit)
    shopnum = item('.p-shopnum').find('a').text()
    print('Publisher:' + shopnum)
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

def main():
  parse_website()

if __name__ == "__main__":
  main()

3. Summary

(1) Remember to call the browser.execute_script("window.scrollTo(0, document.body.scrollHeight);") method to simulate scrolling to the bottom so that the lazy-loaded data is loaded; otherwise the data will be incomplete.

(2) When getting the page source through page_source, if it contains an xmlns namespace attribute, you have to replace that attribute with some other string; otherwise pyquery will not parse out any data. pyquery seems to hide some attributes when parsing a document with an xmlns namespace. The exact reason it fails to parse the page is unknown to me; if anyone knows the reason, please advise.
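The symptom can be reproduced with a toy document (my own sketch, not Jingdong's actual page):

from pyquery import PyQuery as pq

html = '<html xmlns="http://www.w3.org/1999/xhtml"><body><div class="p-name">x</div></body></html>'
# The default namespace makes the tag selector match nothing
print(pq(html)('div.p-name').text())  # prints an empty string
# After replacing xmlns, the same selector works
print(pq(html.replace('xmlns', 'another_attr'))('div.p-name').text())  # prints "x"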

(3) Try to use wait.until(EC.presence_of_all_elements_located()) and the other expected conditions wherever possible, so that you avoid reading the page before it has loaded properly and returning incomplete information. This ensures the accuracy of the data.
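As one last hedged sketch (reusing wait, EC and By from the code above), these waits can be wrapped in a try/except so that a page that never finishes loading fails loudly instead of silently yielding partial data:

from selenium.common.exceptions import TimeoutException

try:
  wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, '#J_goodsList > ul > li:nth-child(60)')))
except TimeoutException:
  print('The product list did not fully load within 50 seconds; the data may be incomplete')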

This is the entire content of this article.