The previous section covered how to create a Scrapy project and gave a basic introduction to what each file in the project does.
This time, let's walk through the basic process of using it:
(1) First, open a terminal and change to the path where you want to create the Scrapy project. Here I create it on the desktop. Open the terminal and type: cd Desktop
This moves to the desktop storage location.
(2) Create a Scrapy project. In the terminal, type: scrapy startproject image
Then type: cd image
Continue typing: scrapy genspider imageSpider followed by the domain of the site to crawl (the domain argument is omitted here).
(3) Open the project folder on your desktop in PyCharm and set the crawler rules in settings.py. The ROBOTSTXT_OBEY rule can be commented out directly, or changed to False.
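For reference, a minimal sketch of what that change looks like in settings.py, assuming the rule in question is Scrapy's standard ROBOTSTXT_OBEY setting:
# settings.py (excerpt)
# Assumption: the "crawler rule" mentioned above is the robots.txt switch
ROBOTSTXT_OBEY = False   # or simply comment this line out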
(4) Go back to the spider file and change start_urls: just replace the crawler's default first URL with the URL of the website that needs to be crawled.
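As a sketch, the relevant line in the generated spider might look like this (the URL shown is a placeholder, not the site used in this article):
# imageSpider.py (excerpt)
start_urls = ['http://example.com/list/']   # replace with the page you actually want to crawl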
(5) Now the data can be crawled; here we choose to download images.
After crawling the data, it needs to be handed over to the pipeline (defined in the pipelines.py file) so it can be stored.
Next, import the item model into the spider file:
from ..items import ImageItem
The data model created in the items.py file is then used inside the parse function of the spider file:
item = ImageItem()
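The model itself lives in items.py; a minimal sketch of what ImageItem could look like, assuming a single field named src to match how it is used in the spider below:
# items.py -- hypothetical sketch of the ImageItem model
import scrapy

class ImageItem(scrapy.Item):
    src = scrapy.Field()   # will hold the list of image download URLs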
Points of attention:
Sometimes, when printing output in the terminal, if the returned content is a selector object, that object can still be iterated over, and xpath can still be used to search for content inside it.
If the terminal reports this problem:
# ValueError: Missing scheme in request url: h
then you need to use extract() to convert the xpath selector object into a list object. A list object can still be iterated over, but you can no longer use xpath to search for the objects inside it.
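A short illustration of the difference, assuming we are inside parse() with a response object:
img_list = response.xpath('//ul/li/a/img/@src')   # SelectorList: .xpath() can still be called on it
src_list = img_list.extract()                     # plain Python list of strings: .xpath() is no longer available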
Before downloading, you also need to set the download path and storage location for the images in the settings.py file.
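A minimal sketch of those settings, assuming Scrapy's built-in ImagesPipeline is used (the folder name is illustrative):
# settings.py (excerpt)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = './images'    # hypothetical folder where downloaded images are stored
IMAGES_URLS_FIELD = 'src'    # the item field that holds the image URLs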
The code is attached below; this is the spider file only:
# -*- coding: utf-8 -*-
import scrapy
from ..items import ImageItem


class ImagespiderSpider(scrapy.Spider):
    name = 'imageSpider'
    allowed_domains = ['']
    start_urls = ['/4kmeinv/']

    def parse(self, response):
        # A number of src attribute values are found; now iterate through them, using each one separately
        img_list = response.xpath('//ul[@class="clearfix"]/li/a/img/@src')
        for img in img_list:
            # Use the data model item created in items.py
            item = ImageItem()
            print('--------------------')
            # Convert the selector into a plain string
            img = img.extract()
            # Splice the url of the image to get the full download address
            src = '' + img
            # Put the resulting data into the model.
            # Because it's a download address, wrap it in a list or it will report an error.
            item['src'] = [src]
            yield item
        next_url = response.xpath('//div[@class="page"]/a[text()="next page"]/@href').extract()
        print('*****************************************************************')
        if len(next_url) != 0:
            url = '' + next_url[0]
            # Pass the url on; the result will be processed by parse again
            yield scrapy.Request(url=url, callback=self.parse)
Summary
The above is the entire content of this article. I hope it offers some reference value for your study or work. Thank you for your support.