
Installing and Getting Started with Python's Scrapy Crawler Framework

I have long heard of the reputation of Python crawler frameworks. In recent days I have been studying the Scrapy crawler framework, and I would like to share my understanding with you. If anything is poorly expressed, I look forward to corrections from more experienced readers.

1. A First Glimpse of Scrapy

Scrapy is an application framework written to crawl website data and extract structured data. It can be applied in a range of programs including data mining, information processing or storing historical data.

It was originally designed for page scraping (more precisely, web crawling), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.

This document will give you an understanding of how Scrapy works by introducing the concepts behind it, and help you determine whether Scrapy is what you need.

When you are ready to start your project, you can refer to the getting started tutorial.

2. Installing Scrapy

Platform and auxiliary tools required to run the Scrapy framework:

  1. Python 2.7 (the latest Python release is 3.5; version 2.7 is used here).
  2. Python packages: pip and setuptools. pip now depends on setuptools, which is installed automatically if it is missing.
  3. lxml. Most Linux distributions ship with lxml; if yours does not, refer to the lxml installation instructions.
  4. OpenSSL. Provided by default on all systems except Windows (see the platform installation guide).

You can use pip to install Scrapy (pip is recommended for installing Python packages).

pip install Scrapy

Installation process under Windows:

1. After installing Python 2.7, you need to modify the PATH environment variable so that the Python executable and its scripts directory are on the system path. Add the following paths to PATH:

C:\Python27\;C:\Python27\Scripts\;

Alternatively, PATH can be set by running the following script from cmd:

c:\python27\python.exe c:\python27\tools\scripts\win_add2path.py

After installation and configuration are complete, you can execute python --version to check the installed Python version.
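For example (the exact output depends on which 2.7 patch release you installed):

python --version
Python 2.7.11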

2. Install pywin32 from the project's download page (/projects/pywin32/).

Please make sure to download the version that matches your system (win32 or amd64).

Install pip by following the installation instructions at /en/latest/.

3. Open a command-line window and confirm that pip is installed correctly:

pip --version

4. At this point Python 2.7 and pip are both working correctly. Next, install Scrapy:

pip install Scrapy

At this point, the installation of Scrapy on Windows is complete.
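To double-check the installation, you can also run Scrapy's built-in version command, which prints the installed Scrapy version:

scrapy version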

3. Scrapy Getting Started Tutorial

1. Create a Scrapy project in cmd:

scrapy startproject tutorial

H:\python\scrapyDemo>scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory 'f:\\python27\\lib\\site-packages\\scrapy\\templates\\project', created in:
  H:\python\scrapyDemo\tutorial

You can start your first spider with:
  cd tutorial
  scrapy genspider example example.com

2. The file directory structure is as follows:
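The project is generated from Scrapy's default project template, so a freshly created project should look roughly like this:

tutorial/
  scrapy.cfg
  tutorial/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
      __init__.py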

The role of each part of the scrapy project structure:

  1. scrapy.cfg: the project's configuration file.
  2. tutorial/: the project's Python module. You will add your code here later.
  3. tutorial/items.py: the item definitions for the project.
  4. tutorial/pipelines.py: the pipelines file for the project.
  5. tutorial/settings.py: the project's settings file.
  6. tutorial/spiders/: the directory where the spider code is placed.

3. Writing a simple crawler

1. Define the fields to be collected from the page in tutorial/items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# /en/latest/topics/
import scrapy
from scrapy.item import Item, Field
class TutorialItem(Item):
  title = Field()
  author = Field()
  releasedate = Field()

2. In tutorial/spiders/, write the spider that defines which site to crawl and how each field is collected:

# -*-coding:utf-8-*-
import sys
# These import paths are for the older Scrapy releases this Python 2.7 tutorial targets
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem
reload(sys)
sys.setdefaultencoding("utf-8")
class ListSpider(CrawlSpider):
  # Crawler name
  name = "tutorial"
  # Set the download delay
  download_delay = 1
  # Allowed domain names
  allowed_domains = [""]
  # Start URL
  start_urls = [
    ""
  ]
  # Crawl rules: a Rule without a callback means matched links are followed recursively
  rules = (
    Rule(SgmlLinkExtractor(allow=(r'/n/page/\d',))),
    Rule(SgmlLinkExtractor(allow=(r'/n/\d+',)), callback='parse_content'),
  )

  # Parse the content of each matched article page
  def parse_content(self, response):
    item = TutorialItem()

    # Article title
    # NOTE: the attribute values in the XPath expressions below were lost in the
    # original article; they are site-specific and shown here only as placeholders.
    title = response.xpath('//div[@id="news_title"]')[0].extract().decode('utf-8')
    item['title'] = title

    author = response.xpath('//div[@id="news_info"]/span/a/text()')[0].extract().decode('utf-8')
    item['author'] = author

    releasedate = response.xpath('//div[@id="news_info"]/span[@class="time"]/text()')[0].extract().decode('utf-8')
    item['releasedate'] = releasedate

    yield item
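Before running the full crawl, the XPath expressions can be checked interactively with Scrapy's shell; the URL and selector below are only placeholders for a real article page and element on the target site:

scrapy shell "http://example.com/n/12345"
>>> response.xpath('//div[@id="news_title"]').extract()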

3. Save the data in tutorial/pipelines.py:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: /en/latest/topics/
import json
import codecs

class TutorialPipeline(object):
  def __init__(self):
    # Data is stored to a JSON file; the file name was omitted in the original,
    # so 'data.json' is used here as a placeholder
    self.file = codecs.open('data.json', mode='wb', encoding='utf-8')

  def process_item(self, item, spider):
    line = json.dumps(dict(item)) + "\n"
    self.file.write(line.decode("unicode_escape"))
    return item
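As an optional refinement that is not part of the original article, Scrapy also calls a close_spider hook on item pipelines when the crawl finishes; adding a method like the following to TutorialPipeline closes the output file cleanly:

  def close_spider(self, spider):
    # Called once when the spider finishes; close the output file
    self.file.close()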

4. Configure the execution environment in tutorial/settings.py:

# -*- coding: utf-8 -*-
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Disable cookies to reduce the chance of being banned
COOKIES_ENABLED = False

# Enable the item pipeline; this is what writes the data to the file
ITEM_PIPELINES = {
  'tutorial.pipelines.TutorialPipeline': 300
}

# Set the maximum depth the crawler will follow links
DEPTH_LIMIT = 100
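As a side note (not part of the original settings), the per-spider download_delay attribute used in the spider above can also be set project-wide here via Scrapy's DOWNLOAD_DELAY setting:

# Wait 1 second between consecutive requests for every spider in the project
DOWNLOAD_DELAY = 1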

5. Create a new main.py file to execute the crawler code:

from scrapy import cmdline
cmdline.execute("scrapy crawl tutorial".split())

Finally, after the crawler runs, the collected results are saved as JSON data in the output file configured in the pipeline.
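As an aside, if you only need JSON output, Scrapy's built-in feed export can achieve the same result without a custom pipeline; the output file name below is just an example:

scrapy crawl tutorial -o items.json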

This is the whole content of this article.