
Python crawler: a Pear Video case study with the Scrapy framework

Previously we used lxml to download videos from the Pear Video site; click through and check it out if you're interested.

Below I use the Scrapy framework to crawl the video titles and the video descriptions from the video pages of the Pear Video website.


Analysis: the content we want to crawl is not all on the same page. To get a video's description we have to click the video and jump to a new URL, so we cannot parse everything we need in a single method.
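If you are following along from scratch, the project skeleton can be generated with Scrapy's standard commands. The project name bossProject and the spider name boss below are assumptions inferred from the class names used in this article:

scrapy startproject bossProject
cd bossProject
scrapy genspider boss www.pearvideo.com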

1. The crawler file

  • Here we can write a new parse method modeled after the parse method in the crawler file, and pass the response object of the new URL to this new method
  • If you need to use the same item object in different parse methods, you can pass the item to the callback function through the meta parameter dictionary
  • The parse method in the crawler file has to yield a Request, and the new parse method yields the item to pass it on to the next parse method or to the pipeline file

import scrapy

# Import the BossprojectItem class from the project's items.py
# (the project name bossProject is assumed from the class names used in this article)
from bossProject.items import BossprojectItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.pearvideo.com']
    start_urls = ['https://www.pearvideo.com/category_5']

    # The callback function receives the response object and the meta parameter passed along with the request
    def content_parse(self, response):
        # The meta dict travels with the response; take the item out by its key 'item'
        item = response.meta['item']

        # Parse the description of the video on the video's detail page
        des = response.xpath('//div[@class="summary"]/text()').extract()
        des = "".join(des)
        item['des'] = des

        # Hand the finished item over to the pipeline
        yield item

    # Parse the title of the video on the home page and the link to the video
    def parse(self, response):
        # NOTE: the container selector depends on the current page markup and may need adjusting
        li_list = response.xpath('//div[@id="listvideoList"]/ul/li')
        for li in li_list:
            href = li.xpath('./div/a/@href').extract()
            href = "https://www.pearvideo.com/" + "".join(href)

            title = li.xpath('./div[1]/a/div[2]/text()').extract()
            title = "".join(title)

            item = BossprojectItem()
            item["title"] = title

            # Manually send the request and pass its response to the callback function
            # Request(..., meta={}) hands the meta dictionary to the request's callback
            yield scrapy.Request(href, callback=self.content_parse, meta={'item': item})

2. The items file

The BossprojectItem class has to be imported into the crawler file so that the item object can be created there.

import scrapy


class BossprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Define the item attributes
    title = scrapy.Field()
    des = scrapy.Field()

3. The pipeline file

open_spider(self, spider) and close_spider(self, spider) override the parent class methods, and each of them is executed only once. It is a good idea to keep the return item at the end of process_item: if there are multiple pipeline classes, returning the item automatically passes the item object on to the pipeline class with the next lower priority (see the sketch of a second pipeline class after the code below).

from itemadapter import ItemAdapter  # imported by the default pipeline template, not used below


class BossprojectPipeline:

    def __init__(self):
        self.fp = None

    # Overrides the parent class method; called only once, when the spider opens
    def open_spider(self, spider):
        print("Crawl on.")
        # The output file name is an assumption; the name in the original article was lost
        self.fp = open('./boss.txt', 'w', encoding='utf-8')

    # Receives the item object yielded by the crawler file and stores its contents persistently
    def process_item(self, item, spider):
        self.fp.write(item['title'] + '\n\t' + item['des'] + '\n')

        # If there is more than one pipeline class, returning the item passes it to the next one
        # The priority of a pipeline class is its value in the ITEM_PIPELINES setting in settings.py,
        # e.g. ITEM_PIPELINES = {'bossProject.pipelines.BossprojectPipeline': 300}
        # The smaller the value, the higher the priority
        return item

    # Overrides the parent class method; called only once, when the spider closes
    def close_spider(self, spider):
        self.fp.close()
        print("Crawler over.")

4. Perform persistent storage
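Once the pipeline class is registered in ITEM_PIPELINES in settings.py (and, as in most tutorials, ROBOTSTXT_OBEY is set to False), persistent storage is triggered simply by running the spider from the project directory; the command below assumes the spider name boss used above:

scrapy crawl boss

open_spider opens the output file once, process_item writes one title/description pair per video, and close_spider closes the file when the crawl finishes.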


This concludes the article on the Pear Video case study with the Python crawler framework Scrapy. For more on Scrapy, please search my earlier articles or continue browsing the related articles below. I hope you will support me in the future!