If we download a website's pictures, text, audio, video, and so on one by one, we not only waste time but also easily make mistakes. A Python crawler can fetch the data we need quickly and in bulk. In this article, I will walk you through using a Python crawler to get the total number of pages and change the URL accordingly, in order to crawl multiple pages of data from the same site.
I. Purpose of the Crawler
Get the data you need from the web
II. Crawling process
1. Get the URL (web address).
2. Send a request and get a response.
3. Extract data.
4. Save data.
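To make these four steps concrete, here is a minimal end-to-end sketch using requests and BeautifulSoup; the URL and the tags it extracts are placeholders for illustration, not taken from the site crawled later in this article:

import requests
from bs4 import BeautifulSoup

# 1. Get the URL (a placeholder address, for illustration only)
url = "https://example.com/articles/"

# 2. Send a request and get a response
response = requests.get(url, timeout=10)
response.raise_for_status()

# 3. Extract data: here, the text of every <h2> heading on the page
soup = BeautifulSoup(response.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# 4. Save data
with open("titles.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))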
III. Crawler function
You can quickly get the data you want in bulk, without having to manually download them one by one (images, text, audio, video, etc.).
IV. Using a Python crawler to crawl multi-page data from the same site
1. Locate the pagination tag and get the total number of pages
import re  # used by the re.sub calls below

def get_page_size(soup):
    # locate the pagination links inside the article list's <nav> element
    pcxt = soup.find('div', {'class': 'babynames-term-articles'}).find('nav')
    pcxt1 = pcxt.find('div', {'class': 'nav-links'}).findAll('a')
    pagesize = 0
    for i in pcxt1[:-1]:
        link = i.get('href')
        s = str(i)
        # strip the tag markup around the page number, e.g. '<a href="...">12</a>' -> '12'
        page = re.sub('<a href="', '', s)
        page1 = re.sub(link, '', page)
        page2 = re.sub('">', '', page1)
        page3 = re.sub('</a>', '', page2)
        pagesize = int(page3)
    print(pagesize)
    return pagesize
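The chain of re.sub calls simply strips the <a href="..."> and </a> markup so that only the page number is left. As a design note, BeautifulSoup can return the link text directly, which avoids the regular expressions altogether; the following is a sketch of the same lookup, not the article's original code:

def get_page_size(soup):
    # same pagination lookup, but read each link's text instead of
    # stripping the markup with regular expressions
    nav = soup.find('div', {'class': 'babynames-term-articles'}).find('nav')
    links = nav.find('div', {'class': 'nav-links'}).findAll('a')
    pagesize = 0
    for a in links[:-1]:  # skip the last link, as the original code does
        pagesize = int(a.get_text(strip=True))
    return pagesize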
2. Change the URL to visit each page, i.e., write the main function
if __name__ == '__main__':
    url = "/baby-names/browse/a/"   # the site's domain was omitted in the original article
    soup = get_requests(url)        # fetch and parse the first page
    page = get_page_size(soup)      # total number of pages
    for i in range(1, page + 1):
        url1 = url + "page/" + str(i) + "/"   # change the URL for each page
        soup1 = get_requests(url1)
        draw_base_list(soup1)       # extract and save the data; defined elsewhere
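Note that get_requests and draw_base_list are helper functions the article does not show. A plausible minimal get_requests, under the assumption that it downloads a page and returns a BeautifulSoup object (my assumption, not the article's code):

import requests
from bs4 import BeautifulSoup

def get_requests(url):
    # assumed helper: fetch the page and parse it into a BeautifulSoup object
    headers = {"User-Agent": "Mozilla/5.0"}  # present ourselves as a browser
    response = requests.get(url, headers=headers, timeout=10)
    response.encoding = response.apparent_encoding  # guard against mis-detected encodings
    return BeautifulSoup(response.text, "html.parser")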
Extended example:
import requests
from lxml import etree
import re

url = "/top250"  # the site's domain was omitted in the original article
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
allMovieList = []
flag = True
while flag:
    html = requests.get(url, headers=header).text
    selector = etree.HTML(html)
    lis = selector.xpath('//ol[@class="grid_view"]/li')
    for oneSelector in lis:
        name = oneSelector.xpath("div/div[2]/div[1]/a/span[1]/text()")[0]
        score = oneSelector.xpath("div/div[2]/div[2]/div/span[2]/text()")[0]
        people = oneSelector.xpath("div/div[2]/div[2]/div/span[4]/text()")[0]
        # keep only the number in front of the "人评价" ("people rated") label
        people = re.findall("(.*?)人评价", people)[0]
        oneMovieList = [name, score, people]
        allMovieList.append(oneMovieList)
    # get the next page's address
    try:
        next_url = selector.xpath('//span[@class="next"]/a/@href')[0]
        if next_url:
            url = "/top250" + next_url
    except IndexError:
        flag = False
print(allMovieList)
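When the loop ends, allMovieList holds one [name, score, people] row per film. To finish with step 4 of the crawling process (save data), the list can be written to a CSV file with the standard library; a minimal sketch (the file name is arbitrary):

import csv

with open("top250.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "score", "people"])  # header row
    writer.writerows(allMovieList)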
This concludes this article's worked example of using a Python crawler to crawl multi-page data from the same site. For more on how to crawl multi-page data from the same site with a Python crawler, please search my previous posts or continue browsing the related articles below. I hope you will continue to support me!