
Python Web Crawler Hands-on

I. Overview

A web crawler, also known as a web spider or web robot, is a program or script whose main purpose is to fetch content from target websites.

By function, a web crawler can be divided into three parts:

  1. Data acquisition
  2. Data processing
  3. Data storage

With these three components in place, the basic working flow is shown in the figure below:

[Figure: basic crawler workflow]

II. Principles

Function: download web page data and provide a data source for a search engine system. Components: a controller, a parser, and a repository.

The crawler system first puts the seed URLs into the download queue, then takes a URL from the head of the queue and downloads its corresponding web page. After the page content has been fetched and stored, the links in the page are parsed to obtain new URLs, which are added to the download queue. Another URL is then taken out, its page downloaded and parsed, and so on, until the entire network has been traversed or some stopping condition is met.
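
A minimal sketch of this loop is shown below, assuming hypothetical download() and extract_links() helpers that are not part of the original article (any HTTP client and HTML parser could fill these roles):

from collections import deque

def crawl(seed_urls, download, extract_links, max_pages=100):
    """Minimal crawler loop: pop a URL, download it, enqueue newly found links."""
    queue = deque(seed_urls)   # download queue seeded with the start URLs
    seen = set(seed_urls)      # avoid downloading the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()  # take a URL from the head of the queue
        html = download(url)   # fetch the page, e.g. with urllib.request
        if html is None:
            continue
        pages[url] = html      # store the page content
        for link in extract_links(html, url):  # parse out new URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)             # add unseen URLs to the queue
    return pages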

III. Crawler classification

1. Traditional crawlers

A traditional crawler starts from the URLs of one or more initial web pages, obtains the URLs on those pages, and, while crawling, continuously extracts new URLs from the current page and puts them into the queue until the system's stopping conditions are met.

2. Focused crawlers

A focused crawler has a more complex workflow: it filters out links irrelevant to the topic according to some web analysis algorithm, keeps the useful links, and puts them into the queue of URLs to be crawled. It then selects the next URL from the queue according to a search strategy and repeats the process until a stopping condition is reached. In addition, all pages fetched by the crawler are stored, analyzed, filtered, and indexed by the system for later queries and retrieval; for a focused crawler, the results of this analysis may also feed back into and guide subsequent crawling.

3. General-purpose web crawlers (whole-web crawlers)

A general-purpose web crawler, also known as a whole-web crawler, expands its crawl from a set of seed URLs to the entire Web, and mainly collects data for portal-site search engines and large web service providers. The crawl range and volume are huge, so such crawlers demand high crawl speed and large storage, while the order in which pages are crawled matters relatively little. Because there are far too many pages to refresh, they usually work in parallel, yet it still takes a long time to revisit any given page. Despite these drawbacks, general-purpose crawlers suit search engines that cover a broad range of topics and have strong practical value.

Actual web crawling systems are usually realized as a combination of several crawling techniques.

IV. Web crawling strategy

In a crawler system, the queue of URLs to be crawled is an important part. The order in which the URLs are listed in the queue is also an important issue, as it involves which pages are crawled first and which pages are crawled later.

The method used to decide the order in which these URLs are arranged is called the crawling strategy.

1. Breadth-first (width-first) search:

During the crawling process, the next level of search is performed only after the current level of search is completed.

Advantages: the algorithm is relatively simple to design and implement. Disadvantages: as more pages are crawled, many irrelevant pages are downloaded and filtered out, and the algorithm becomes inefficient.

2. Depth-first search:

Starting from the start page, select one URL to enter, analyze the URLs on that page, and follow the links one by one; only after one route has been fully processed does the crawler move on to the next route.

For example, in the figure below, a depth-first search visits A, B, D, E, C, F (A B D E C F), while a breadth-first search visits A, B, C, D, E, F.

[Figure: example tree for comparing depth-first and breadth-first traversal]
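
To make the two orders concrete, here is a small sketch over the tree implied by the figure, assuming A has children B and C, B has children D and E, and C has the child F (an assumption inferred from the traversal orders stated above):

from collections import deque

# Adjacency list for the example tree: A -> B, C; B -> D, E; C -> F
graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F'], 'D': [], 'E': [], 'F': []}

def dfs(start):
    order, stack, seen = [], [start], {start}
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the leftmost child is visited first
        for child in reversed(graph[node]):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return order

def bfs(start):
    order, queue, seen = [], deque([start]), {start}
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in graph[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

print(''.join(dfs('A')))  # ABDECF (depth-first)
print(''.join(bfs('A')))  # ABCDEF (breadth-first)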

3. Best-first search:

According to some web analysis algorithm, the similarity of candidate URLs to the target page, or their relevance to the topic, is predicted, and the best-rated URL or URLs are selected for crawling.
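
A sketch of this idea with a priority queue, assuming a hypothetical relevance() scoring function (for example, keyword overlap with the topic), since the article does not fix a particular analysis algorithm:

import heapq

def best_first_crawl(seed_urls, relevance, download, extract_links, max_pages=50):
    """Always expand the URL with the highest predicted relevance first."""
    # heapq is a min-heap, so negative scores make the best-rated URL pop first
    frontier = [(-relevance(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        _, url = heapq.heappop(frontier)  # best-rated URL seen so far
        html = download(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return pages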

4. Backlink count strategy:

The backlink count is the number of links from other web pages that point to a given page. It indicates the extent to which a page's content is recommended by others.
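
As a rough sketch, backlink counts can be tallied while links are extracted and then used to order the queue (a simplified illustration rather than the exact bookkeeping a production crawler would use):

from collections import Counter

backlinks = Counter()

def record_links(src_url, extracted_links):
    """Count how many pages link to each URL (one vote per source page)."""
    for link in set(extracted_links):
        if link != src_url:
            backlinks[link] += 1

def order_queue(candidate_urls):
    """Arrange the URLs to be crawled by descending backlink count."""
    return sorted(candidate_urls, key=lambda u: backlinks[u], reverse=True)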

5. Partial PageRank strategy:

The Partial PageRank algorithm borrows the idea of PageRank: the downloaded pages, together with the URLs in the queue to be crawled, form a collection of pages, and a PageRank value is computed for each page. The URLs in the queue are then ordered by their PageRank values and crawled in that order.
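
A toy sketch of the idea: a simple power iteration over the union of downloaded pages and queued URLs, ignoring details such as dangling pages and recomputation schedules that a real implementation would have to handle:

def partial_pagerank(links, queued, damping=0.85, iters=20):
    """links: {url: [outgoing urls]} for downloaded pages; queued: URLs not yet crawled."""
    nodes = set(links) | set(queued) | {u for outs in links.values() for u in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, outs in links.items():
            if outs:
                share = damping * rank[src] / len(outs)
                for dst in outs:
                    new_rank[dst] += share
        rank = new_rank
    # Crawl the queued URLs in descending order of their (partial) PageRank value
    return sorted(queued, key=lambda u: rank.get(u, 0), reverse=True)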

V. Methods of web crawling

1. Distributed crawler

Used to manage the massive number of URLs on today's Internet, a distributed crawler contains multiple crawler programs, each performing tasks similar to a single crawler's: downloading pages from the Internet, saving them to local disk, extracting URLs from them, and continuing to crawl along those URLs. Since parallel crawlers must split the download task, one crawler may send the URLs it extracts to other crawlers.

These crawlers may be located on the same LAN or geographically dispersed.

A currently popular distributed crawler:

Apache Nutch: depends on Hadoop to run, and Hadoop itself consumes a lot of resources. Nutch was designed as a search-engine crawler; if you are not building a search engine, try not to choose Nutch.

2. Java crawlers

Small programs developed in Java to capture network resources; commonly used tools include Crawler4j, WebMagic, WebCollector, and so on.

3. Non-Java crawlers

Scrapy: a lightweight, high-level screen-scraping framework written in Python. Its most attractive feature is that, as a framework, any user can modify it to suit their needs, and it provides a number of high-level functions that simplify the scraping process.

VI. Project practice

1. Crawl a specified web page

Grab the home page of a website.

We use the urllib module, which provides an interface for reading web data: you can read data over WWW and FTP as if reading a local file. urllib is a URL-handling package containing a collection of modules for working with URLs.

urllib.request: used to open and read URLs. urllib.error: contains the exceptions raised by urllib.request, which can be caught with try/except. urllib.parse: contains methods for parsing URLs. urllib.robotparser: used to parse robots.txt files; it provides a RobotFileParser class whose can_fetch() method tests whether a crawler may download a given page.
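
For example, urllib.robotparser can be used as follows (the robots.txt address below is only a placeholder site):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder site
rp.read()  # download and parse robots.txt
# True if the given user agent is allowed to fetch the page
print(rp.can_fetch("*", "https://www.example.com/some/page.html"))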

The following code is for crawling a web page:

import urllib.request

url = "/"  # the target URL was elided in the original; substitute a full address here
# A browser has to be emulated here in order to crawl the page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
data = response.read()
# The bytes need to be decoded in order to display properly
print(str(data, 'utf-8'))

# The following code prints various kinds of information about the crawled page
print(type(response))
print(response.geturl())
print(response.info())
print(response.getcode())
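
Since urllib.error was mentioned above, such a request is usually wrapped in try/except; a minimal sketch reusing the request object built above:

import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen(request, timeout=10)
except urllib.error.HTTPError as e:
    print('HTTP error:', e.code)    # e.g. 403 or 404
except urllib.error.URLError as e:
    print('Failed to reach the server:', e.reason)
else:
    print(response.getcode())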

2. Crawl web pages containing keywords

The code is as follows:

import urllib.parse
import urllib.request

data = {'word': 'King of Thieves'}
url_values = urllib.parse.urlencode(data)
url = "/s?"  # the search URL's host was elided in the original; prepend the full site address
full_url = url + url_values
data = urllib.request.urlopen(full_url).read()
print(str(data, 'utf-8'))
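
For reference, urllib.parse.urlencode simply turns the dictionary into a percent-encoded query string; a quick illustration (the extra page parameter is made up for the example):

import urllib.parse

params = {'word': 'King of Thieves', 'page': 2}
print(urllib.parse.urlencode(params))
# word=King+of+Thieves&page=2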

3. Download the images in a forum post

The code is as follows:

import re
import urllib.request

# Get the source code of a web page
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

# Get the addresses of all the images on the page
def getImg(html):
    reg = r'src="([.*\S]*\.jpg)" pic_ext="jpeg"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist


html = getHtml('/p/3205263090')  # the post URL's host was elided in the original; prepend the full site address
html = html.decode('utf-8')
imgList = getImg(html)
imgName = 0
# Loop to save the images
for imgPath in imgList:
    f = open(str(imgName) + ".jpg", 'wb')
    f.write(urllib.request.urlopen(imgPath).read())
    f.close()
    imgName += 1
    print('Downloading image %s' % imgName)
print('All images on the page have been downloaded')
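
As a side note, urllib.request also provides urlretrieve(), which saves a URL straight to a local file, so the saving loop above could be written more compactly (a sketch reusing the imgList from the block above):

import urllib.request

for imgName, imgPath in enumerate(imgList):
    # Download each image URL directly into a numbered .jpg file
    urllib.request.urlretrieve(imgPath, str(imgName) + ".jpg")
    print('Downloading image %s' % (imgName + 1))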

4. Stock data capture

The code is as follows:

import random
import re
import time
import urllib.request

# Grab the desired content
user_agent = ["Mozilla/5.0 (Windows NT 10.0; WOW64)", 'Mozilla/5.0 (Windows NT 6.3; WOW64)',
              'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
              'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
              'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36',
              'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko)',
              'Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1',
              'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3',
              'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12',
              'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
              'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
              'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)',
              'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)',
              'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E)',
              'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1 ',
              'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; QQBrowser/7.3.9825.400)',
              'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0 ',
              'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER',
              'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser )',
              'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11']

stock_total = []
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'}
for page in range(1, 8):
    url = '/stock/ranklist_a_3_1_' + str(page) + '.html'  # the quote site's host was elided in the original; prepend the full address
    request = urllib.request.Request(url=url, headers={"User-Agent": random.choice(user_agent)})
    response = urllib.request.urlopen(request)
    content = str(response.read(), 'gbk')
    pattern = re.compile(r'<tbody[\s\S]*</tbody')
    body = re.findall(pattern, str(content))
    pattern = re.compile(r'>(.*?)<')
    # Regular-expression match
    stock_page = re.findall(pattern, body[0])
    stock_total.extend(stock_page)
    time.sleep(random.randint(1, 4))
# Remove blank fields
stock_last = [field for field in stock_total if field != '']
print('Code', '\t', 'Abbreviation', '\t', 'Latest price', '\t', 'Change (%)', '\t', 'Change amount', '\t', '5-min change')

for i in range(0, len(stock_last), 13):
    print(stock_last[i], '\t', stock_last[i + 1], '\t', stock_last[i + 2], '   ', '\t', stock_last[i + 3], '   ', '\t',
          stock_last[i + 4], '\t', stock_last[i + 5])
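
Each stock occupies 13 consecutive fields in the flattened list, which is why the loop above steps by 13. The same data could also be regrouped into rows and, for example, written to a CSV file (a sketch using only the standard library; the file name stocks.csv is arbitrary):

import csv

# Regroup the flat field list into one row of 13 fields per stock
rows = [stock_last[i:i + 13] for i in range(0, len(stock_last), 13)]

with open('stocks.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Code', 'Abbreviation', 'Latest price', 'Change (%)', 'Change amount', '5-min change'])
    for row in rows:
        writer.writerow(row[:6])  # keep only the six columns printed above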

VII. Conclusion

The above uses Python version 3.9.

This article references the book Python3 Data Analysis and Machine Learning in Action and was written as study notes.

I'm a bit lazy after writing this.


This concludes this hands-on introduction to Python web crawling. For more related Python crawler content, please search my previous posts, and I hope you will continue to support me!