Chat about Python and web crawlers.
1. Definition of crawler
Crawler: a program that automatically crawls the Internet for data.
2. The main framework of the crawler
The main framework of the crawler program is shown in the figure above. The crawler scheduler obtains a URL to be crawled from the URL manager; if such a URL exists, the scheduler calls the web page downloader to download the corresponding page, then calls the web page parser to parse it. Any new URLs found in the page are added back to the URL manager, and the valuable data is output.
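The loop below is a minimal sketch of this framework. The UrlManager class and the download and parse functions are illustrative names of my own, standing in for the components described in sections 4 to 6, not a fixed API:

# Minimal sketch of the crawler scheduler loop (illustrative names, not a fixed API).
def crawl(root_url, max_pages=100):
    url_manager = UrlManager()             # to-crawl / crawled sets (see section 4)
    url_manager.add_new_url(root_url)
    results = []
    while url_manager.has_new_url() and len(results) < max_pages:
        url = url_manager.get_new_url()    # scheduler asks the URL manager for the next URL
        html = download(url)               # web page downloader (see section 5)
        new_urls, data = parse(url, html)  # web page parser (see section 6)
        url_manager.add_new_urls(new_urls) # feed newly found links back into the manager
        results.append(data)               # collect the valuable data
    return results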
3. Timing diagram of the crawler
4. URL Manager
The URL manager manages the collection of URLs to be crawled and the collection of URLs that have been crawled, to prevent duplicate crawling and circular crawling. The main functions of the URL manager are shown in the following figure:
In Python, the URL manager is implemented mainly with in-memory sets or with a relational database (such as MySQL). Small programs usually keep it in memory, since Python's built-in set() type automatically rejects duplicate elements; larger programs usually back it with a database.
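For the in-memory case, a minimal sketch of such a manager might look like the following (the class and method names are my own illustration, not a standard API):

# In-memory URL manager: two set() objects keep to-crawl and crawled URLs apart,
# which prevents both duplicate crawling and circular crawling.
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs that have already been crawled

    def add_new_url(self, url):
        # set membership tests make duplicate detection automatic
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)  # move it to the crawled set
        return url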
5. Web downloader
The web page downloader in Python mainly uses the urllib library, a module that ships with Python (in Python 3 the old urllib2 library has been merged into urllib as its request submodule). The urlopen function in urllib.request opens a URL and retrieves its data. Its parameter can be either a URL string or a Request object: for a simple web page, passing the URL string directly is enough; for complex pages with an anti-crawler mechanism, you need to add an HTTP header when calling urlopen; and for pages with a login mechanism, you also need to set a cookie.
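The sketch below shows these three cases with urllib.request in Python 3; the URL and the User-Agent value are placeholders:

from urllib import request
from http import cookiejar

url = "http://www.example.com"  # placeholder URL

# Case 1: simple page, pass the URL string directly to urlopen.
response = request.urlopen(url)
print(response.read()[:100])

# Case 2: page with an anti-crawler check, add an HTTP header via a Request object.
req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
response = request.urlopen(req)

# Case 3: page that needs a login session, attach a cookie handler to an opener.
cj = cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cj))
response = opener.open(req)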
6. Web page parser
The web page parser extracts valuable data and new URLs from the page data returned by the web page downloader. For data extraction, methods such as regular expressions and BeautifulSoup can be used. Regular expressions do fuzzy, string-based matching, which works well when the target data has distinctive features, but they are not very versatile. BeautifulSoup is a third-party module for structured parsing of web page content: the downloaded page is parsed into a DOM tree. The following figure shows part of the output printed when a Baidu Encyclopedia page was crawled and parsed with BeautifulSoup.
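As a tiny illustration of the difference between the two approaches (the HTML snippet here is made up for the example):

import re
from bs4 import BeautifulSoup

html = '<div class="para"><a href="/view/123.htm">League of Legends</a></div>'  # made-up snippet

# Regular expression: fuzzy matching on the raw string.
print(re.findall(r'href="(/view/\d+\.htm)"', html))  # ['/view/123.htm']

# BeautifulSoup: structured parsing into a DOM tree, then navigation by tag and attribute.
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", href=re.compile(r"^/view/\d+\.htm$"))
print(link["href"], link.get_text())  # /view/123.htm League of Legends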
The specific use of BeautifulSoup will be covered in a future article. The following code uses Python to grab, from the League of Legends entry in Baidu Encyclopedia, the other League of Legends related entries it links to, and saves them to a newly created Excel file. Here is the code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import xlwt

excelFile = xlwt.Workbook()
sheet = excelFile.add_sheet('league of legend')

# Baidu Encyclopedia: League of Legends (URL path as given in the original post; it may be truncated)
html = urlopen("http://baike.baidu.com/subview/3049782/")
bsObj = BeautifulSoup(html.read(), "html.parser")
# print(bsObj.prettify())

row = 0
for node in bsObj.find("div", {"class": "main-content"}).findAll("div", {"class": "para"}):
    links = node.findAll("a", href=re.compile(r"^(/view/)[0-9]+\.htm$"))
    for link in links:
        if 'href' in link.attrs:
            print(link.attrs['href'], link.get_text())
            sheet.write(row, 0, link.attrs['href'])
            sheet.write(row, 1, link.get_text())
            row = row + 1

excelFile.save('E:\\Project\\Python\\')  # the output file name was truncated in the original post
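Note that bs4 (BeautifulSoup) and xlwt are third-party packages and need to be installed first, for example with pip install beautifulsoup4 xlwt; urllib.request comes with Python 3.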
A partial screenshot of the output is shown below:
A screenshot of part of the Excel file is below:
That is all for this article; I hope it helps you learn about Python web crawlers.