
Python Scrapy crawler framework: usage examples and analysis

Typical example

The following is a simple example of a Python crawler built with the Scrapy framework. It crawls the Baidu search results page for a specified keyword and extracts information such as the link and title of each result:

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=python']

    def parse(self, response):
        # Each search result title is an <h3> element containing a link
        for link in response.css('h3 a'):
            item = {'title': link.css('::text').get(),
                    'link': link.attrib['href']}
            yield item

Detailed explanation

First of all, we define a spider class named "BaiduSpider" that inherits from scrapy.Spider. The name attribute is the name of the crawler, the allowed_domains attribute restricts the domains that may be crawled, and the start_urls attribute lists the initial URLs of the pages to be crawled.

A method named "parse" is defined in the class to process the content of each crawled page. It uses CSS selector syntax to extract the information we need, such as the link and title under each matched h3 tag.

For each link, an item object of type dict is constructed in the parse method, containing the corresponding title and URL.

Finally, each item is handed back to the Scrapy framework with a yield statement, so that it can be exported as CSV, JSON, XML or another format and saved to disk.
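
For example, assuming the spider above lives in a normal Scrapy project, the built-in feed exports can write the yielded items straight to a file when the crawler is started (the output file name here is arbitrary):

scrapy crawl baidu -o results.json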

This example only shows the tip of the iceberg of the Scrapy framework. In fact, Scrapy provides a large number of modules and tools, such as Item, Pipeline and Downloader components, that help with page parsing, data cleaning, storage and other operations. When using Scrapy for crawler development, you should therefore read the official documentation carefully and familiarize yourself with its APIs and mechanisms.
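
As a small taste of those components, the sketch below shows a hypothetical Item and Pipeline for the search results above (the SearchResultItem and CleanTitlePipeline names are purely illustrative and not part of the original example):

import scrapy

class SearchResultItem(scrapy.Item):
    # Declared fields for the scraped data
    title = scrapy.Field()
    link = scrapy.Field()

class CleanTitlePipeline:
    # Called once for every item the spider yields
    def process_item(self, item, spider):
        if item.get('title'):
            item['title'] = item['title'].strip()
        return item

A pipeline like this would be enabled through the ITEM_PIPELINES setting in settings.py, in much the same way that DOWNLOADER_MIDDLEWARES is configured in the next section.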

Scrapy framework crawler using proxy IPs

To use proxy IPs for web crawling in the Scrapy framework, you first need to define a downloader middleware that adds a proxy to each request. Note that the proxy must support the HTTP protocol, otherwise it will not work properly. Below is a basic example of using proxy IPs with a Scrapy crawler.

Add the following configuration item to the project's settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
    'my_project.middlewares.ProxyMiddleware': 410,
}

where HttpProxyMiddleware is the downloader middleware provided by Scrapy by default, which applies the proxy information attached to each request (in request.meta['proxy']), and my_project.middlewares.ProxyMiddleware is our custom downloader middleware for setting up the proxy.

Create a new Python module in the project directory (for example, middlewares.py) that defines the ProxyMiddleware class:

import random

class ProxyMiddleware(object):
    # Holds the list of proxy server addresses
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            proxy_list=crawler.settings.get('PROXY_LIST')
        )

    # Called for every request; randomly selects a proxy from the pool to send the request through
    def process_request(self, request, spider):
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
        print('Use proxy: ', proxy)

Here proxy_list is the list of proxy server addresses, which needs to be defined as the PROXY_LIST configuration item in settings.py, as shown below:

PROXY_LIST = [
    'http://123.45.67.89:8080',
    'http://123.45.67.90:8080',
    # ...
]

Finally, the proxy pool can also be specified on the command line when the crawler is started, by overriding the setting with the -s option, for example:

scrapy crawl my_spider -s PROXY_LIST='proxy_list.txt'

Here the proxy_list.txt file contains the proxy server addresses, one per line, for example:

http://123.45.67.89:8080
http://123.45.67.90:8080
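
Note that when PROXY_LIST is overridden with a file path like this, the setting arrives as a string rather than a ready-made list, so the middleware has to load the addresses from the file itself. The following is a minimal sketch of how from_crawler could handle both cases (an assumption layered on top of the example above, not part of the original code):

class ProxyMiddleware(object):
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        proxy_setting = crawler.settings.get('PROXY_LIST')
        if isinstance(proxy_setting, str):
            # The setting is a file path: read one proxy address per line
            with open(proxy_setting) as f:
                proxy_list = [line.strip() for line in f if line.strip()]
        else:
            # The setting is already a list, e.g. defined in settings.py
            proxy_list = proxy_setting or []
        return cls(proxy_list=proxy_list)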

In this way, a randomly chosen proxy address is used automatically for every network request, improving the efficiency and reliability of the crawler's data collection.

This concludes this article's analysis of Python Scrapy crawler framework usage examples. For more on Python Scrapy, please search my previous articles or continue to browse the related articles below, and I hope you will continue to support me!