
Building a Powerful Web Crawler with Python's Scrapy Framework: An Example-Driven Exploration

Scrapy is a powerful Python-based web crawler framework specialized in extracting data from web pages. Its features and benefits make it the tool of choice for many data mining engineers and developers.

Scrapy Features

  • Asynchronous framework: Scrapy is based on the Twisted asynchronous network library, which can handle multiple requests at the same time to improve crawling efficiency.

  • Modular design: It includes various modules such as middleware, extensions, pipelines, etc. that can be easily extended and customized.

  • Selector: Supports XPath and CSS selectors for easy extraction of page data.

  • Automatic throttling: Built-in throttling mechanisms prevent the crawler from overloading the target site.

  • Request Filtering: Requests can be filtered based on URL patterns or other rules.

  • Data processing: Extracted data can be stored to files or databases, as well as cleaned and transformed.

The Scrapy Advantage

  • Efficient crawling: Asynchronous processing of requests enables fast and efficient crawling of web content.

  • Flexibility: A wide range of customization options and settings allows it to adapt to different crawling needs.

  • Rich selectors: Support for XPath and CSS selectors makes it easy to locate and extract data.

  • Automation: Built-in mechanisms handle common web crawling tasks automatically, saving time and effort.

Applicable Scenarios

  • Data Mining and Analysis: Used to capture web page data for further analysis.

  • Search Engines: Used to build the crawling component of a search engine.

  • Content Syndication: Used to build content aggregation services and applications.

  • Competitive Analysis: Used to collect and analyze competitor data.

  • Monitor and update: Used to monitor website content changes and updates.

Scrapy's features and benefits make it a powerful web crawler framework for many different domains and needs, from simple data crawling to complex web crawling tasks.

Scrapy vs. other crawler frameworks

Asynchronous Architecture: Scrapy is based on the Twisted asynchronous framework, which allows requests to be processed asynchronously, increasing efficiency. Compared to some synchronous frameworks, it can process multiple requests more quickly.

Flexibility: Compared with some configuration-driven, rule-based crawler frameworks, Scrapy provides more customization options and flexibility. Users can customize requests, data processing, and storage as needed.

Full-featured: Scrapy is a full-featured crawler framework that ships with built-in modules such as middleware, pipelines, and extensions, all of which can be easily extended and customized.

Data processing capabilities: Compared to some frameworks, Scrapy offers more data processing tools, such as XPath and CSS selectors, as well as data cleaning, storage, and other features.

Community and documentation support: Scrapy has huge community support and extensive documentation, making it easier to learn and solve problems.

Learning Curve: While Scrapy offers great features, its learning curve may be steeper for some novices, and some other frameworks may be more accessible in comparison.

Target audience: Scrapy is best suited to users with some programming experience, and it is friendlier to those who already have some familiarity with crawler frameworks.

While Scrapy has a clear advantage in terms of functionality and flexibility, some other frameworks may also be better suited to certain needs in specific situations. For example, for tasks that require only a simple, quick data crawl, some simpler crawler frameworks may be preferred. Choosing the right framework depends on the specific needs and individual skill level.

Install Scrapy and other dependencies

Before you can start using Scrapy, you first need to install Scrapy and its dependencies. Open the command line interface (command prompt for Windows or terminal for macOS and Linux) and execute the following command:

pip install scrapy

This command will install the Scrapy framework using pip (Python's package manager). It will automatically install any other dependencies needed by Scrapy.
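
If the installation succeeds, you can verify it by printing the installed Scrapy version:

scrapy version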

Creating a Virtual Environment

A virtual environment ensures that different projects can use different Python packages and versions, avoiding conflicts between packages. You can use either virtualenv or venv to create a virtual environment.

Using virtualenv to create a virtual environment:

pip install virtualenv  # If virtualenv is not installed
virtualenv myenv  # Create a virtual environment named myenv
source myenv/bin/activate  # Activate the virtual environment (use myenv\Scripts\activate on Windows)

Using venv to create a virtual environment (venv ships with Python 3):

python -m venv myenv  # Create a virtual environment named myenv
source myenv/bin/activate  # Activate the virtual environment (use myenv\Scripts\activate on Windows)

Installing the Scrapy Shell and other useful tools

Scrapy provides the Scrapy Shell for testing and debugging web crawling. The Shell is installed together with Scrapy; you can start it by typing scrapy shell on the command line.

In addition to the Scrapy Shell, Scrapy provides other useful tools and commands, such as scrapy crawl for running crawlers and scrapy startproject for creating new projects. All of these tools are installed with Scrapy and require no additional installation steps.

Scrapy Basics

Scrapy Architecture: Components and How They Work

The architecture of Scrapy is built on the Twisted asynchronous networking framework. It consists of several components, including the engine, the scheduler, the downloader, middleware, and spiders. These components work together to complete the entire process from sending HTTP requests to processing the response data. The engine coordinates the other components: it takes requests from the scheduler and sends them to the downloader, which fetches the response and passes it back to the Spider for processing. Middleware, in turn, lets the user intercept and manipulate requests and responses along the way, enabling custom behavior.

Scrapy's data flow (Request/Response)

In Scrapy, data flows as Requests and Responses. When the Spider generates an initial request, it is sent to the engine, which passes it to the scheduler. The scheduler queues the request and hands it to the downloader, which downloads the page and returns a response; the response is then sent to the Spider for processing. The Spider parses the data in the response and may generate new requests, and the process continues in a loop until all tasks are completed.

Selectors for Scrapy: XPath and CSS

Scrapy provides powerful selector functionality with support for XPath and CSS selectors. These selectors let users extract data from HTML pages in a simple and flexible way. XPath is a language for selecting nodes in an XML or HTML document, while CSS selectors locate elements using familiar CSS syntax. Developers can choose whichever selector is more appropriate for a given extraction task, as in the sketch below.
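
As a small sketch (run inside a Spider callback or the Scrapy Shell, where response is the downloaded page, and assuming the markup shown in the comment), both selector styles can extract the same data:

# Assumed markup: <div class="post"><h2><a href="/a-post">A post title</a></h2></div>
titles_xpath = response.xpath('//div[@class="post"]/h2/a/text()').getall()  # XPath version
titles_css = response.css('div.post h2 a::text').getall()                   # equivalent CSS version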

Spider in Scrapy: Creating a Custom Spider

A Spider is a core concept in Scrapy that defines how to crawl a website(s) for information. Users can write custom Spiders to specify how to crawl a site and how to parse page content to extract data. By inheriting Scrapy's Spider class and implementing custom parsing methods, it is possible to create crawlers that adapt to different site structures.

Customizing Scrapy's Settings

Scrapy provides a set of configurable settings that allow the user to customize the behavior of the crawler. These settings include the number of concurrent requests, download delays, user agents, and more. By adjusting these settings, users can fine-tune the behavior of the crawler to suit a particular website or need.
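
As a sketch, a few commonly adjusted settings in a project's settings.py might look like this (the values are purely illustrative):

# Maximum number of requests Scrapy performs concurrently
CONCURRENT_REQUESTS = 16
# Delay (in seconds) between requests to the same website
DOWNLOAD_DELAY = 1
# How the crawler identifies itself to websites (placeholder value)
USER_AGENT = 'myproject (+https://example.com)'
# Respect robots.txt rules
ROBOTSTXT_OBEY = True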

Website Crawling

The following simple examples illustrate the concepts involved in practical website crawling:

Creating a Crawler Project: New Project, Defining Item and Pipeline

scrapy startproject myproject

Define the Item in items.py:

import scrapy

class MyItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    publish_date = scrapy.Field()

Write the Pipeline in pipelines.py (this example stores items in a JSON Lines file):

import json

class MyPipeline(object):
    def open_spider(self, spider):
        # Open the output file when the spider starts ('items.json' is a placeholder name)
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        # Close the output file when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Serialize each item as one JSON line and write it to the file
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Parsing web data: XPath/CSS selectors, regular expressions

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']  # placeholder start URL

    def parse(self, response):
        # Select each post block and extract its fields with XPath
        for post in response.xpath('//div[@class="post"]'):
            yield {
                'title': post.xpath('h2/a/text()').get(),
                'link': post.xpath('h2/a/@href').get(),
                'publish_date': post.xpath('span/text()').get()
            }
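
Since the heading also mentions CSS selectors and regular expressions, here is a sketch of an alternative parse() method for the same Spider that uses CSS selectors plus re_first() for a regex (the markup and date format are assumed):

    def parse(self, response):
        for post in response.css('div.post'):
            yield {
                'title': post.css('h2 a::text').get(),
                'link': post.css('h2 a::attr(href)').get(),
                # Pull a date such as 2024-01-01 out of the span text with a regex
                'publish_date': post.css('span::text').re_first(r'\d{4}-\d{2}-\d{2}')
            }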

Processing data: cleansing and storing data into databases or files

Enable the Pipeline in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}

Page tracking and following links in Scrapy

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']  # placeholder start URL

    def parse(self, response):
        for post in response.xpath('//div[@class="post"]'):
            yield {
                'title': post.xpath('h2/a/text()').get(),
                'link': post.xpath('h2/a/@href').get(),
                'publish_date': post.xpath('span/text()').get()
            }

        # Follow the "next page" link and parse it with the same callback
        next_page = response.xpath('//a[@class="next_page"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

The sample code covers the basic process from creating a Scrapy project to defining a Spider, extracting data using selectors, processing the data and following the links. In practice, more specific code will need to be written depending on the structure and needs of the site to be crawled.

Advanced Applications

Extending Scrapy with Middleware: Handling User-Agents, IP Proxies, etc.

Handling the User-Agent: A random User-Agent is set through middleware to simulate request headers from different browsers or devices, so the crawler is less likely to be recognized as such by websites.

from scrapy import signals
from fake_useragent import UserAgent

class RandomUserAgentMiddleware:
    def __init__(self):
        self.ua = UserAgent()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        return middleware

    def spider_opened(self, spider):
        pass

    def process_request(self, request, spider):
        # Attach a random User-Agent header to every outgoing request
        request.headers['User-Agent'] = self.ua.random

IP proxy settings: Set up an IP proxy via middleware to hide the real IP address.

class ProxyMiddleware:
    def process_request(self, request, spider):
        # Route the request through a proxy (the address is a placeholder)
        request.meta['proxy'] = 'http://your_proxy_address'
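
For either middleware to take effect, it must also be registered in settings.py. A minimal sketch, assuming both classes live in myproject/middlewares.py (the priority numbers are illustrative):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
}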

Simulated login: how to perform a simulated login to access content that requires authorization

Use FormRequest to simulate a login: send a login POST request in the Spider to obtain the session cookies.

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_spider'
    start_urls = ['https://example.com/login']  # placeholder login URL

    def parse(self, response):
        # Submit the login form found on the page, pre-filled with credentials
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'your_username', 'password': 'your_password'},
            callback=self.after_login
        )

    def after_login(self, response):
        # Check whether login succeeded, then request a protected page
        if "Welcome" in response.text:
            yield scrapy.Request(url='https://example.com/protected_page',
                                 callback=self.parse_protected)

    def parse_protected(self, response):
        # Handle protected pages after login
        pass

Scrapy and Dynamic Web Pages: Handling JavaScript Rendered Pages

Handling JavaScript-rendered pages: use tools such as Selenium or Splash to crawl content that is generated dynamically by JavaScript.

import scrapy
from scrapy_selenium import SeleniumRequest

class MySpider(scrapy.Spider):
    name = 'js_spider'
    start_urls = ['https://example.com']  # placeholder start URL

    def start_requests(self):
        # Issue SeleniumRequests so pages are rendered in a real browser first
        for url in self.start_urls:
            yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        # Handle the JavaScript-rendered page here
        pass
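
For the scrapy-selenium example to run, the package's middleware and driver settings also need to be configured in settings.py. A sketch, assuming a headless Chrome with chromedriver available on the PATH:

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # run the browser without a window

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}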

These code snippets demonstrate advanced applications of Scrapy, including processing request headers, IP proxy settings, simulating logins, and handling dynamic web pages. Depending on the actual needs, developers can further customize and adapt the code to meet specific crawler requirements.

Debugging and Optimization

Debugging with the Scrapy Shell

Scrapy Shell is an interactive Python console that allows you to test and debug code before the crawler runs.

scrapy shell "https://example.com"

In the Shell, you can use the fetch command to download pages, run selectors to extract data, or test a Spider's request and response logic.
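
For example, a short Shell session might look like this (the URL is a placeholder and the selectors assume a generic page):

>>> fetch('https://example.com')          # download the page into the shell
>>> response.status                       # HTTP status code of the response
>>> response.css('title::text').get()     # extract the page title with CSS
>>> response.xpath('//h1/text()').get()   # the same idea with XPath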

Optimizing crawlers: avoiding blocking, reducing the risk of detection

Set a download delay: use the DOWNLOAD_DELAY setting to avoid sending too many requests to the target site.

# Set the download delay (in seconds) in settings.py
DOWNLOAD_DELAY = 2

Random User-Agent and IP Proxy: Using the middleware mentioned earlier, set up a random User-Agent and IP proxy to prevent being recognized as a crawler.

Avoid duplicate requests: use the DUPEFILTER_CLASS setting to filter out duplicate requests.

# Set DUPEFILTER_CLASS in settings.py (this is Scrapy's default request duplicate filter)
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

Managing large-scale data: distributed crawling, Scrapy clustering

Distributed crawling: Using a distributed crawling framework such as Scrapy-Redis allows multiple crawler instances to share the same task queue.
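
As a sketch, a minimal scrapy-redis configuration in settings.py might look like the following (it assumes the scrapy-redis package is installed and a Redis instance is reachable at the placeholder URL):

# Use the scrapy-redis scheduler and dupefilter so all instances share one queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue between runs so crawling can be paused and resumed
SCHEDULER_PERSIST = True
# Placeholder Redis connection URL
REDIS_URL = 'redis://localhost:6379'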

Scrapy clustering: deploy multiple Scrapy crawler instances with coordinated task scheduling and data storage to improve crawling efficiency.

Managing large-scale data involves a more complex architecture and setup, requiring additional code and configuration so that multiple crawler instances can work together efficiently without redundant data or duplicated task execution.

Practical Example

Product Price Comparison Site

Suppose you want to create a product price comparison website. With Scrapy, you can easily scrape product prices from multiple e-commerce sites and display them on your own site so that users can compare prices across sites.

  • Create Spiders: Write Spiders to crawl multiple e-commerce sites for specific product information.

  • Data processing: Processing and cleansing of product price data crawled from different websites.

  • Data storage: Store the processed data in a database.

  • Website Showcase: Use a web framework (such as Django or Flask) to create a web interface that displays comparative product price data.

Code and Analysis

Creating a Spider

import scrapy

class PriceComparisonSpider(scrapy.Spider):
    name = 'price_spider'
    start_urls = ['https://example.com/products']  # placeholder product listing URL

    def parse(self, response):
        # Select each product block and extract its name and price
        products = response.xpath('//div[@class="product"]')
        for product in products:
            yield {
                'name': product.xpath('h2/a/text()').get(),
                'price': product.xpath('span[@class="price"]/text()').get()
            }

Data processing and storage

class CleanDataPipeline:
    def process_item(self, item, spider):
        item['price'] = self.clean_price(item['price'])
        return item

    def clean_price(self, price):
        # Example cleansing: strip whitespace and currency symbols, convert to a float
        cleaned_price = float(price.strip().lstrip('$').replace(',', ''))
        return cleaned_price

Data Display

Create web applications using frameworks such as Django or Flask to present the crawled data to the user.
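
As a minimal sketch of that display layer, assume the cleaned items were stored in a SQLite database named products.db with a products(name, price, site) table (all names here are hypothetical); a Flask view might then look like this:

import sqlite3
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def price_list():
    # Read the crawled and cleaned product data from the database
    conn = sqlite3.connect('products.db')
    rows = conn.execute(
        'SELECT name, price, site FROM products ORDER BY name, price'
    ).fetchall()
    conn.close()
    # Render a template that lists each product's price per site
    return render_template('prices.html', products=rows)

if __name__ == '__main__':
    app.run(debug=True)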

Summary

Scrapy is a powerful and flexible Python web crawler framework for extracting data from web pages. This post covers all aspects of Scrapy, from basic concepts to advanced applications and a real-world case study. It begins with an introduction to Scrapy's architecture, how it works, and its data flow, including the use of selectors, the creation of Spiders, and how to customize Scrapy's settings. This is followed by an in-depth look at how to create a crawler project in practice, parse web page data, process and store the data, and perform page tracking and link following.

The advanced section describes several more advanced applications, including extending Scrapy with middleware, simulating logins to access authorized content, and handling dynamic web pages. It also discusses debugging and optimization methods, such as using the Scrapy Shell for debugging, tuning the crawler to avoid being banned, and approaches for managing large-scale data, including distributed crawling and Scrapy clustering.

Finally, a real-world example shows how to create a product price comparison website: using Scrapy to crawl product price data from multiple e-commerce websites, cleaning and storing the data, and finally displaying it to users through a web framework. Scrapy's power, flexibility, and rich ecosystem make it an ideal choice for web crawling and data extraction across a wide range of crawler needs.

This concludes our example-driven exploration of building a powerful web crawler with Python's Scrapy framework. For more information about the Scrapy web crawler framework, please see my other related articles!