Previously we used lxml on the Pear Video site to download videos; check that article out if you're interested.

Below I use the scrapy framework to crawl the video title and the video description from the video pages of the Pear Video website.

Analysis: the content we want to crawl is not on the same page. To get the video description we have to click the video and jump to a new URL, so we cannot parse all the content we need inside a single method.
1. Crawler files
- Here we can write a new parse method modeled on the parse method in the crawler file, and pass the response object of the new URL to this new parse method
- If you need to use the same item object in different parse methods, you can pass the item to the callback function through the meta parameter dictionary
- The parse method in the crawler file needs to yield a Request; in the new parse method, yield item passes the item on to the next parse method or to the pipeline file
```python
import scrapy
# Import the BossprojectItem class from the items file
# (the module path assumes the scrapy project is named bossProject)
from bossProject.items import BossprojectItem

class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.pearvideo.com']
    # The bare '/category_5' in the original has been prefixed with the Pear Video base URL
    start_urls = ['https://www.pearvideo.com/category_5']

    # The callback function accepts the response object and the meta parameter passed with the request
    def content_parse(self, response):
        # The meta parameter is carried by the response object; read meta and take out the value by its key: item
        item = response.meta['item']
        # Parse the description of the video on the video's own page
        des = response.xpath('//div[@class="summary"]/text()').extract()
        des = "".join(des)
        item['des'] = des
        yield item

    # Parse the title of each video on the home page and the link to the video
    def parse(self, response):
        # The attribute selector inside div[@] was lost in the original text; fill in the real class/id
        li_list = response.xpath('//div[@]/ul/li')
        for li in li_list:
            href = li.xpath('./div/a/@href').extract()
            href = "https://www.pearvideo.com/" + "".join(href)
            title = li.xpath('./div[1]/a/div[2]/text()').extract()
            title = "".join(title)
            item = BossprojectItem()
            item["title"] = title
            # Manually send the request and hand the response object to the callback function
            # Request passing: meta={} passes the meta dictionary to that request's callback function
            yield scrapy.Request(href, callback=self.content_parse, meta={'item': item})
```
2. Items file

The BossprojectItem class has to be imported into the crawler file so that the item object can be created there.
```python
import scrapy

class BossprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Define the item attributes
    title = scrapy.Field()
    des = scrapy.Field()
```
3. Pipelines file

open_spider(self,spider) and close_spider(self,spider) override the parent class methods, and each of them is executed only once. It is a good idea to keep the return item at the end of the process_item method: if there are multiple pipeline classes, returning the item automatically passes the item object to the pipeline class with the next lower priority.
```python
from itemadapter import ItemAdapter

class BossprojectPipeline:
    def __init__(self):
        self.fp = None

    # Override the parent class method; it is called only once
    def open_spider(self, spider):
        print("Crawl on.")
        # The output filename was lost in the original text; './pearvideo.txt' is only a placeholder
        self.fp = open('./pearvideo.txt', 'w')

    # Accept the item object yielded by the crawler file and store its contents persistently
    def process_item(self, item, spider):
        self.fp.write(item['title'] + '\n\t' + item['des'] + '\n')
        # If there is more than one pipeline class, the returned item is passed to the next pipeline class.
        # A pipeline class's priority depends on its value in the ITEM_PIPELINES setting, e.g.
        # ITEM_PIPELINES = {'bossProject.pipelines.BossprojectPipeline': 300,}  # project name assumed
        # The smaller the value, the higher the priority.
        return item

    # Override the parent class method; it is called only once
    def close_spider(self, spider):
        self.fp.close()
        print("Crawler over.")
```
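To make the "returned item goes to the next pipeline" behaviour concrete, here is a minimal sketch of a second, lower-priority pipeline class. The class name ConsolePipeline and the project name bossProject are assumptions for illustration, not part of the original case.

```python
# pipelines.py (continued) -- hypothetical second pipeline class, for illustration only
class ConsolePipeline:
    # Receives the item returned by BossprojectPipeline.process_item
    def process_item(self, item, spider):
        print(item['title'])
        # Return the item again in case yet another pipeline class follows
        return item

# In settings.py this class would be registered with a larger number than
# BossprojectPipeline (e.g. 301 vs 300), so it runs after it and receives the item it returns.
```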
4. Perform persistent storage
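The original text does not show this step's configuration, but in a standard scrapy project persistence is switched on by enabling the pipeline class in settings.py and then running the spider. A minimal sketch, assuming the project is named bossProject:

```python
# settings.py (project name bossProject assumed)
ITEM_PIPELINES = {
    'bossProject.pipelines.BossprojectPipeline': 300,
}
```

Running `scrapy crawl boss` from the project root then starts the spider named boss, and the titles and descriptions are written to the file opened in open_spider.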
This concludes this article's case analysis of the Pear Video site with the Python crawler framework scrapy. For more content on the Python scrapy crawler framework, please search my previous articles or continue browsing the related articles below. I hope you will support me more in the future!