
How to pass parameters to a Spider in Scrapy

When crawling data with Scrapy, you sometimes need to decide which URL or which page to crawl based on parameters passed to the Spider.

For example, the address of a Baidu Tieba forum looks like the following, where the kw parameter specifies the forum (tieba) name and the pn parameter selects the page.

/f?kw=Placing the Wizards&ie=utf-8&pn=250
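For reference, the forum name has to be URL-encoded when it is substituted into kw. A minimal sketch (the helper function below is only an illustration, not part of the original example):

from urllib.parse import urlencode

# Illustrative helper: build the crawl URL from the forum name (kw)
# and the page offset (pn).
def build_tieba_url(tieba_name, pn):
  return '/f?' + urlencode({'kw': tieba_name, 'ie': 'utf-8', 'pn': pn})

print(build_tieba_url('放置奇兵', 250))
# -> /f?kw=%E6%94%BE%E7%BD%AE%E5%A5%87%E5%85%B5&ie=utf-8&pn=250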

We would like to pass the forum name and page number to the Spider as parameters, so as to control which forum and which pages are crawled. There are two ways to pass parameters to a Spider in this case.

Method 1

Pass parameters to the spider via the -a argument to the scrapy crawl command.

# -*- coding: utf-8 -*-
import scrapy

class TiebaSpider(scrapy.Spider):
  name = 'tieba' # Tieba crawler
  allowed_domains = [''] # Domains allowed to crawl
  start_urls = [] # Crawl start addresses

  # Command format: scrapy crawl tieba -a tiebaName=Placing Wizards -a pn=250
  def __init__(self, tiebaName=None, pn=None, *args, **kwargs):
    print('< tieba name >: ' + tiebaName)
    super(TiebaSpider, self).__init__(*args, **kwargs)
    self.start_urls = ['/f?kw=%s&ie=utf-8&pn=%s' % (tiebaName, pn)]

  def parse(self, response):
    print(response.url) # in the end: /f?kw=%E6%94%BE%E7%BD%AE%E5%A5%87%E5%85%B5&ie=utf-8&pn=250
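As a side note (not from the original article), the same keyword arguments can also be supplied when the spider is launched from a script instead of the command line. A minimal sketch using Scrapy's CrawlerProcess, assuming TiebaSpider is importable from the project:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from baidu_tieba.spiders.tieba import TiebaSpider  # import path is an assumption

# Extra keyword arguments reach TiebaSpider.__init__, exactly like the
# -a options of "scrapy crawl tieba -a tiebaName=... -a pn=...".
process = CrawlerProcess(get_project_settings())
process.crawl(TiebaSpider, tiebaName='Placing Wizards', pn='250')
process.start()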

Method 2

Write a dedicated command, modeled on the source code of Scrapy's built-in crawl command.

First, add the following configuration to settings.py to specify the directory where the custom Scrapy commands are stored.

# Specify the directory where Scrapy commands are stored
COMMANDS_MODULE = 'baidu_tieba.commands'
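A note on layout (assumed here, not spelled out in the original): COMMANDS_MODULE must point at a Python package inside the project, so the commands directory needs an __init__.py, and the filename of each module in it becomes a command name. One possible layout:

baidu_tieba/
  commands/
    __init__.py     # makes the directory a package
    run.py          # module name "run" -> "scrapy run" command
  pipelines.py
  settings.py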

Then create a command file in that directory; here we create run.py (the module name becomes the command name), so the command will later be executed in this format:

scrapy run [-option option_value]

import scrapy.commands.crawl as crawl
from scrapy.exceptions import UsageError
from scrapy.commands import ScrapyCommand


class Command(crawl.Command):

  def add_options(self, parser):
    # Add options to commands
    ScrapyCommand.add_options(self, parser)
    parser.add_option("-k", "--keyword", type="str", dest="keyword", default="",
             help="set the tieba's name you want to crawl")
    parser.add_option("-p", "--pageNum", type="int", action="store", dest="pageNum", default=0,
             help="set the page number you want to crawl")

  def process_options(self, args, opts):
    # Handle incoming option arguments from the command line
    ScrapyCommand.process_options(self, args, opts)
    if opts.keyword:
      tiebaName = opts.keyword.strip()
      if tiebaName != '':
        self.settings.set('TIEBA_NAME', tiebaName, priority='cmdline')
    else:
      raise UsageError("U must specify the tieba's name to crawl, use -k TIEBA_NAME!")
    self.settings.set('PAGE_NUM', opts.pageNum, priority='cmdline')

  def run(self, args, opts):
    # Start the crawler
    self.crawler_process.crawl('tieba')
    self.crawler_process.start()
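Not part of the original code, but as a quick sanity check: the values stored above with priority='cmdline' can be read back anywhere the spider has access to its settings, for example inside parse():

  # Hypothetical check inside TiebaSpider -- self.settings is available once
  # the spider is bound to a crawler (e.g. in start_requests() or parse()).
  def parse(self, response):
    print(self.settings.get('TIEBA_NAME'), self.settings.getint('PAGE_NUM'))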

In BaiduTiebaPipeline's open_spider() method, initialize the TiebaSpider with the parameters passed in via the run command; in this example, start_urls is set there.

# -*- coding: utf-8 -*-
import json

class BaiduTiebaPipeline(object):

  @classmethod
  def from_settings(cls, settings):
    return cls(settings)

  def __init__(self, settings):
    self.settings = settings

  def open_spider(self, spider):
    # Build start_urls from the settings set on the command line
    spider.start_urls = [
      '/f?kw=%s&ie=utf-8&pn=%s' % (self.settings['TIEBA_NAME'], self.settings['PAGE_NUM'])]

  def close_spider(self, spider):
    # Shut down the crawler
    pass

  def process_item(self, item, spider):
    # Save post content to a file; the filename here is only illustrative
    with open('tieba_posts.json', 'a', encoding='utf-8') as f:
      json.dump(dict(item), f, ensure_ascii=False, indent=2)
    return item

Once the setup is complete, don't forget to enable BaiduTiebaPipeline in settings.py.

ITEM_PIPELINES = {
  'baidu_tieba.pipelines.BaiduTiebaPipeline': 50,
}

Startup example

With everything in place, start the tieba crawler with a command in the following format.

scrapy run -k "casting pearls before swine" -p 250


This concludes this article on how to pass parameters to a Spider in Scrapy. For more on passing parameters to Scrapy Spiders, please search my earlier articles or continue to browse the related articles below. I hope you will continue to support me!