If we download a website's pictures, text, audio, video, and so on one by one, we not only waste time but also easily make mistakes. A Python crawler can fetch the data we need quickly and in bulk. In this article, I will walk you through using a Python crawler to get the total number of pages and change the URL accordingly, in order to crawl multiple pages of data from the same site.
I. Purpose of the Crawler
Get the data you need from the web
II. Crawling process
1. Get the URL (web address).
2. Send a request and get a response.
3. Extract data.
4. Save data.
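To make these four steps concrete, here is a minimal end-to-end sketch using requests and BeautifulSoup; the URL and the tags it extracts are placeholders for illustration, not taken from the site crawled later in this article:

import requests
from bs4 import BeautifulSoup

# 1. Get the URL (a placeholder address, for illustration only)
url = "https://example.com/articles/"

# 2. Send a request and get a response
response = requests.get(url, timeout=10)
response.raise_for_status()

# 3. Extract data: here, the text of every <h2> heading on the page
soup = BeautifulSoup(response.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# 4. Save data
with open("titles.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))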
III. Crawler function
You can quickly get the data you want in bulk, without having to manually download them one by one (images, text, audio, video, etc.).
IV. Using a Python crawler to crawl multi-page data from the same site
1. Locate the pagination tag and get the total number of pages
import re  # used by the re.sub calls below

def get_page_size(soup):
    # locate the pagination links inside the article list's <nav> element
    pcxt = soup.find('div', {'class': 'babynames-term-articles'}).find('nav')
    pcxt1 = pcxt.find('div', {'class': 'nav-links'}).findAll('a')
    pagesize = 0
    for i in pcxt1[:-1]:
        link = i.get('href')
        s = str(i)
        # strip the tag markup around the page number, e.g. '<a href="...">12</a>' -> '12'
        page = re.sub('<a href="', '', s)
        page1 = re.sub(link, '', page)
        page2 = re.sub('">', '', page1)
        page3 = re.sub('</a>', '', page2)
        pagesize = int(page3)
    print(pagesize)
    return pagesize
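The chain of re.sub calls simply strips the <a href="..."> and </a> markup so that only the page number is left. As a design note, BeautifulSoup can return the link text directly, which avoids the regular expressions altogether; the following is a sketch of the same lookup, not the article's original code:

def get_page_size(soup):
    # same pagination lookup, but read each link's text instead of
    # stripping the markup with regular expressions
    nav = soup.find('div', {'class': 'babynames-term-articles'}).find('nav')
    links = nav.find('div', {'class': 'nav-links'}).findAll('a')
    pagesize = 0
    for a in links[:-1]:  # skip the last link, as the original code does
        pagesize = int(a.get_text(strip=True))
    return pagesize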
2. Change the URL to visit each page, i.e., write the main function
if __name__ == '__main__':
    url = "/baby-names/browse/a/"   # the site's domain was omitted in the original article
    soup = get_requests(url)        # fetch and parse the first page
    page = get_page_size(soup)      # total number of pages
    for i in range(1, page + 1):
        url1 = url + "page/" + str(i) + "/"   # change the URL for each page
        soup1 = get_requests(url1)
        draw_base_list(soup1)       # extract and save the data; defined elsewhere
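Note that get_requests and draw_base_list are helper functions the article does not show. A plausible minimal get_requests, under the assumption that it downloads a page and returns a BeautifulSoup object (my assumption, not the article's code):

import requests
from bs4 import BeautifulSoup

def get_requests(url):
    # assumed helper: fetch the page and parse it into a BeautifulSoup object
    headers = {"User-Agent": "Mozilla/5.0"}  # present ourselves as a browser
    response = requests.get(url, headers=headers, timeout=10)
    response.encoding = response.apparent_encoding  # guard against mis-detected encodings
    return BeautifulSoup(response.text, "html.parser")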
Extended example:
import requests
from lxml import etree
import re

url = "/top250"  # the site's domain was omitted in the original article
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
allMovieList = []
flag = True
while flag:
    html = requests.get(url, headers=header).text
    selector = etree.HTML(html)
    lis = selector.xpath('//ol[@class="grid_view"]/li')
    for oneSelector in lis:
        name = oneSelector.xpath("div/div[2]/div[1]/a/span[1]/text()")[0]
        score = oneSelector.xpath("div/div[2]/div[2]/div/span[2]/text()")[0]
        people = oneSelector.xpath("div/div[2]/div[2]/div/span[4]/text()")[0]
        # keep only the number in front of the "人评价" ("people rated") label
        people = re.findall("(.*?)人评价", people)[0]
        oneMovieList = [name, score, people]
        allMovieList.append(oneMovieList)
    # get the next page's address
    try:
        next_url = selector.xpath('//span[@class="next"]/a/@href')[0]
        if next_url:
            url = "/top250" + next_url
    except IndexError:
        flag = False
print(allMovieList)
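When the loop ends, allMovieList holds one [name, score, people] row per film. To finish with step 4 of the crawling process (save data), the list can be written to a CSV file with the standard library; a minimal sketch (the file name is arbitrary):

import csv

with open("top250.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "score", "people"])  # header row
    writer.writerows(allMovieList)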
This concludes this article's worked example of using a Python crawler to crawl multi-page data from the same site. For more on how to crawl multi-page data from the same site with a Python crawler, please search my previous posts or continue browsing the related articles below. I hope you will continue to support me!