When crawling data with Scrapy, you sometimes need to decide which URL to crawl, or which page to crawl, based on parameters passed to the Spider.
For example, the address of a Baidu Tieba forum looks like the following, where the kw parameter specifies the name of the forum and the pn parameter controls paging through the posts.
/f?kw=Placing Wizards&ie=utf-8&pn=250
Suppose we want to pass parameters such as the forum name and page number to the Spider, so that the parameters control which forum and which pages get crawled. There are two ways to pass parameters to a Spider in this case.
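As a quick illustration of how the kw value ends up percent-encoded in the final URL, a Tieba-style URL can be built with Python's standard urllib. The host below is assumed (the article only shows the path), and the forum name is the same example used throughout this article.

import urllib.parse

# Illustrative only: build a Tieba-style URL and percent-encode the forum name.
base = 'https://tieba.baidu.com/f'  # assumed host; only the path appears in this article
params = {'kw': '放置奇兵', 'ie': 'utf-8', 'pn': 250}
url = base + '?' + urllib.parse.urlencode(params)
print(url)  # .../f?kw=%E6%94%BE%E7%BD%AE%E5%A5%87%E5%85%B5&ie=utf-8&pn=250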
Method 1
Pass parameters to the spider via the -a argument to the scrapy crawl command.
# -*- coding: utf-8 -*-
import scrapy


class TiebaSpider(scrapy.Spider):
    name = 'tieba'                          # Tieba crawler
    allowed_domains = ['tieba.baidu.com']   # domains allowed to be crawled
    start_urls = []                         # crawl start addresses

    # Command format: scrapy crawl tieba -a "tiebaName=Placing Wizards" -a pn=250
    def __init__(self, tiebaName=None, pn=None, *args, **kwargs):
        print('<tieba name>: ' + tiebaName)
        super(TiebaSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['https://tieba.baidu.com/f?kw=%s&ie=utf-8&pn=%s' % (tiebaName, pn)]

    def parse(self, response):
        print(response.url)  # in the end: /f?kw=%E6%94%BE%E7%BD%AE%E5%A5%87%E5%85%B5&ie=utf-8&pn=250
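Note that Scrapy also sets each -a argument as an attribute on the spider instance, so overriding __init__ is optional. A minimal sketch of that variant, reusing the same tiebaName/pn names and the assumed tieba.baidu.com host, could look like this:

# -*- coding: utf-8 -*-
import scrapy


class TiebaSpider(scrapy.Spider):
    name = 'tieba'
    allowed_domains = ['tieba.baidu.com']

    def start_requests(self):
        # Scrapy exposes every -a option as a spider attribute,
        # so the values can simply be read back with getattr().
        tieba_name = getattr(self, 'tiebaName', '')
        pn = getattr(self, 'pn', 0)
        url = 'https://tieba.baidu.com/f?kw=%s&ie=utf-8&pn=%s' % (tieba_name, pn)
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        print(response.url)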
Method 2
Write a custom command, modeled on the source code of Scrapy's built-in crawl command.
First, add the following configuration to settings.py to specify the module where the custom Scrapy commands are stored.
# Specify the module where custom Scrapy commands are stored
COMMANDS_MODULE = 'baidu_tieba.commands'
Then create a command file in that commands module; here we create run.py, so the command will later be invoked in the format: scrapy run [-option option_value]. The resulting project layout is sketched below.
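With that setting in place, the project layout would look roughly like this; run.py is the command created here, and the rest is the standard Scrapy project scaffolding, shown only for orientation:

baidu_tieba/
    scrapy.cfg
    baidu_tieba/
        __init__.py
        settings.py          # contains COMMANDS_MODULE and ITEM_PIPELINES
        items.py
        pipelines.py         # BaiduTiebaPipeline
        spiders/
            __init__.py
            tieba.py         # TiebaSpider
        commands/
            __init__.py      # must exist so that commands/ is an importable package
            run.py           # the custom run command shown below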
from scrapy.commands import ScrapyCommand
from scrapy.exceptions import UsageError


class Command(ScrapyCommand):

    def add_options(self, parser):
        # Add options to the command
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-k", "--keyword", type="string", dest="keyword", default="",
                          help="set the tieba's name you want to crawl")
        parser.add_option("-p", "--pageNum", type="int", action="store", dest="pageNum", default=0,
                          help="set the page number you want to crawl")

    def process_options(self, args, opts):
        # Handle the option arguments passed in from the command line
        ScrapyCommand.process_options(self, args, opts)
        tiebaName = opts.keyword.strip() if opts.keyword else ''
        if tiebaName != '':
            self.settings.set('TIEBA_NAME', tiebaName, priority='cmdline')
        else:
            raise UsageError("U must specify the tieba's name to crawl, use -k TIEBA_NAME!")
        self.settings.set('PAGE_NUM', opts.pageNum, priority='cmdline')

    def run(self, args, opts):
        # Start the crawler
        self.crawler_process.crawl('tieba')
        self.crawler_process.start()
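If you prefer not to shuttle the values through the settings object, crawler_process.crawl() also accepts keyword arguments that are forwarded to the spider's constructor. As a sketch, reusing the Method 1 spider's tiebaName/pn arguments, run() could pass the parsed options directly:

def run(self, args, opts):
    # Forward the parsed options straight to the spider as keyword arguments
    # instead of publishing them through the settings object.
    self.crawler_process.crawl('tieba', tiebaName=opts.keyword, pn=opts.pageNum)
    self.crawler_process.start()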
Then, in BaiduTiebaPipeline's open_spider() method, initialize the TiebaSpider with the parameters passed in by the run command; in this example that means setting its start_urls.
# -*- coding: utf-8 -*-
import json


class BaiduTiebaPipeline(object):

    @classmethod
    def from_settings(cls, settings):
        return cls(settings)

    def __init__(self, settings):
        self.settings = settings

    def open_spider(self, spider):
        # Spider is being opened: build the start URLs from the settings
        # that the custom run command filled in.
        spider.start_urls = [
            'https://tieba.baidu.com/f?kw=%s&ie=utf-8&pn=%s'
            % (self.settings['TIEBA_NAME'], self.settings['PAGE_NUM'])]

    def close_spider(self, spider):
        # Spider is being closed
        pass

    def process_item(self, item, spider):
        # Save the post content to a file (the file name is assumed; the original omits it)
        with open('tieba.json', 'a', encoding='utf-8') as f:
            json.dump(dict(item), f, ensure_ascii=False, indent=2)
        return item
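Setting start_urls from the pipeline works, but the same settings can also be read inside the spider itself, which keeps URL construction in one place. A minimal sketch, reusing the TIEBA_NAME and PAGE_NUM settings set by the run command and the assumed tieba.baidu.com host, could look like this:

# -*- coding: utf-8 -*-
import scrapy


class TiebaSpider(scrapy.Spider):
    name = 'tieba'
    allowed_domains = ['tieba.baidu.com']

    def start_requests(self):
        # self.settings becomes available once the crawler is bound to the spider,
        # so the values set by the custom command can be read back here.
        tieba_name = self.settings.get('TIEBA_NAME', '')
        page_num = self.settings.getint('PAGE_NUM', 0)
        url = 'https://tieba.baidu.com/f?kw=%s&ie=utf-8&pn=%s' % (tieba_name, page_num)
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        print(response.url)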
Once the setup is complete, don't forget to enable BaiduTiebaPipeline in settings.py.
ITEM_PIPELINES = {
    'baidu_tieba.pipelines.BaiduTiebaPipeline': 50,
}
Startup example
That's it. Start the Tieba crawler with a command in the following format.
scrapy run -k "Placing Wizards" -p 250
This concludes this article on how to pass parameters to a Scrapy Spider. For more on passing parameters to Scrapy Spiders, please search my earlier articles or continue browsing the related articles below, and I hope you will keep supporting me!