Title: I have long heard of the great reputation of Python crawler frameworks. Over the past few days I have been studying the Scrapy crawler framework, and here I share my understanding with you. If anything is expressed poorly, I look forward to corrections from the experts.
I. A First Glimpse of Scrapy
Scrapy is an application framework written to crawl website data and extract structured data. It can be applied in a range of programs including data mining, information processing or storing historical data.
It was originally designed for page scraping (more precisely, web crawling), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or to build a general-purpose web crawler.
This document introduces the concepts behind Scrapy to give you an understanding of how it works and to help you decide whether Scrapy is what you need.
When you are ready to start a project, you can refer to the getting started tutorial.
II. Installing Scrapy
Platforms and auxiliary tools required to run the Scrapy framework:
- Python 2.7 (the latest version of Python is 3.5; version 2.7 is used here)
- Python packages: pip and setuptools. pip now depends on setuptools, which is installed automatically if it is missing.
- lxml. Most Linux distributions ship with lxml; if yours does not, refer to the lxml installation instructions.
- OpenSSL. Provided on all systems except Windows (see the platform installation guide).
You can use pip to install Scrapy (pip is recommended for installing Python packages).
pip install Scrapy
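Once the install finishes, a quick way to confirm that Scrapy is importable is to print its version from Python (a minimal sketch; the exact version string depends on what pip installed):

import scrapy
print scrapy.__version__   # e.g. something in the 1.x series; the exact value depends on your install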
Installation process under Windows:
1. After installing Python 2.7, you need to modify the PATH environment variable to add the Python executable and its additional scripts to the system path. Add the following paths to PATH:
C:\Python27\;C:\Python27\Scripts\;
Alternatively, the PATH can be set with the following cmd command:
c:\python27\python.exe c:\python27\tools\scripts\win_add2path.py
After installation and configuration are complete, you can run the command python --version to check the installed Python version.
2. Install pywin32 from the pywin32 project page (/projects/pywin32/). Please make sure to download the version that matches your system (win32 or amd64).
Then install pip by following the official pip installation guide.
3. Open a command-line window and confirm that pip is installed correctly:
pip --version
4. At this point, Python 2.7 and pip are working correctly. Next, install Scrapy:
pip install Scrapy
At this point, the installation of Scrapy on Windows is complete.
III. Scrapy Getting Started Tutorial
1. Create a Scrapy project in cmd:
scrapy startproject tutorial
H:\python\scrapyDemo>scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory 'f:\\python27\\lib\\site-packages\\scrapy\\templates\\project', created in:
    H:\python\scrapyDemo\tutorial
You can start your first spider with:
    cd tutorial
    scrapy genspider example
2. The file directory structure is as follows:
Parsing the Scrapy project structure:
- scrapy.cfg: the project's configuration file.
- tutorial/: the project's Python module. You will add your code here later.
- tutorial/items.py: the item definitions for the project.
- tutorial/pipelines.py: the pipelines file for the project.
- tutorial/settings.py: the project's settings file.
- tutorial/spiders/: the directory where the spider code is placed.
3. Writing a simple crawler
1. In tutorial/items.py, define the fields to be collected from the pages:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy.item import Item, Field
class TutorialItem(Item):
title = Field()
author = Field()
releasedate = Field()
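An Item behaves much like a Python dictionary, which is how the spider below fills it in. A minimal sketch of that behaviour, assuming the TutorialItem defined above:

from tutorial.items import TutorialItem

item = TutorialItem()
item['title'] = u'some title'        # fields are assigned like dict keys
item['author'] = u'some author'
print item['title']                  # and read back the same way
print dict(item)                     # an Item converts cleanly to a dict, which the pipeline later relies on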
2. In tutorial/spiders/, write the spider that specifies which site to crawl and how each field is collected:
# -*- coding: utf-8 -*-
import sys

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from tutorial.items import TutorialItem

reload(sys)
sys.setdefaultencoding("utf-8")


class ListSpider(CrawlSpider):
    # Crawler name
    name = "tutorial"
    # Set the download delay
    download_delay = 1
    # Allowed domain names (the original domain was lost from the text; fill in the site you are crawling)
    allowed_domains = [""]
    # Start URLs (likewise, fill in the entry page of the target site)
    start_urls = [
        ""
    ]
    # Crawl rules; a Rule without a callback means matched URLs are followed recursively
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/n/page/\d',))),
        Rule(SgmlLinkExtractor(allow=(r'/n/\d+',)), callback='parse_content'),
    )

    # Parse the content of a matched page
    def parse_content(self, response):
        item = TutorialItem()
        # The XPath class names below were lost from the original text and are placeholders;
        # adjust them to match the structure of the target page
        title = response.xpath('//div[@class="TITLE_CLASS"]')[0].extract().decode('utf-8')
        item['title'] = title
        author = response.xpath('//div[@class="TITLE_CLASS"]/span/a/text()')[0].extract().decode('utf-8')
        item['author'] = author
        releasedate = response.xpath('//div[@class="TITLE_CLASS"]/span[@class="time"]/text()')[0].extract().decode('utf-8')
        item['releasedate'] = releasedate
        yield item
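Note that SgmlLinkExtractor lives under scrapy.contrib and is specific to the older, Python 2 era releases of Scrapy. On newer versions it has been removed in favour of LinkExtractor; a hedged sketch of the same crawl rules written against that API (the rest of the spider stays unchanged) would look like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ListSpider(CrawlSpider):
    name = "tutorial"
    # Same two rules as above: follow paging links recursively, parse article links with parse_content
    rules = (
        Rule(LinkExtractor(allow=(r'/n/page/\d',))),
        Rule(LinkExtractor(allow=(r'/n/\d+',)), callback='parse_content'),
    )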
3. Save the scraped data in tutorial/pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import codecs


class TutorialPipeline(object):
    def __init__(self):
        # The original output filename was lost from the text; 'items.json' is a placeholder
        self.file = codecs.open('items.json', mode='wb', encoding='utf-8')  # data is stored in this file

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line.decode("unicode_escape"))
        return item
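The pipeline above opens the output file in __init__ but never closes it. Scrapy also calls close_spider() on a pipeline when the crawl ends, so a hedged variant that tidies up the file handle could look like this (same behaviour otherwise; the output filename is still a placeholder):

import json
import codecs


class TutorialPipeline(object):
    def __init__(self):
        self.file = codecs.open('items.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line.decode("unicode_escape"))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; flush and close the output file
        self.file.close()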
4. Configure the execution environment in tutorial/settings.py:
# -*- coding: utf-8 -*-

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Disable cookies to reduce the chance of being banned
COOKIES_ENABLED = False

# Register the pipeline; this is what writes the scraped data to the file
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300
}

# Set the maximum crawl depth
DEPTH_LIMIT = 100
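As a side note, the download delay that the spider sets on itself could also be configured project-wide here, and Scrapy's built-in feed export offers an alternative to the hand-written pipeline. A hedged sketch of both options (the output path is a placeholder; FEED_URI and FEED_FORMAT are the setting names used by Scrapy versions of this generation):

# Throttle all requests project-wide instead of per spider
DOWNLOAD_DELAY = 1

# Optional alternative to the custom pipeline: let Scrapy's feed export write the items
FEED_URI = 'items.json'      # placeholder output path
FEED_FORMAT = 'jsonlines'    # one JSON object per line, like the pipeline above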
5. Create a new main.py file to execute the crawler code:
from scrapy import cmdline

cmdline.execute("scrapy crawl tutorial".split())
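cmdline.execute() simply re-invokes the scrapy command from Python. An alternative, hedged sketch runs the spider in-process through CrawlerProcess, which is also part of Scrapy's public API (the import path of ListSpider is an assumption; adjust it to wherever your spider file actually lives):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from tutorial.spiders.spider import ListSpider   # hypothetical module path for the spider above

process = CrawlerProcess(get_project_settings())
process.crawl(ListSpider)
process.start()   # blocks until the crawl is finished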
Finally, after running main.py, the collected results are written as JSON data to the output file configured in the pipeline.
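Since the pipeline writes one JSON object per line, the result can be read back with a few lines of Python (a sketch; 'items.json' again stands in for whatever filename the pipeline writes to):

import json
import codecs

with codecs.open('items.json', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)
        print item['title'], item['author'], item['releasedate']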
This is the whole content of this article.