The previous section covered how to create a Scrapy project and gave a basic introduction to what each file in the project does.
This time, let's walk through the basic process of using it:
(1) First, open a terminal and change to the path where you want to create the Scrapy project. Here I create it on the desktop. Open the terminal and type: cd Desktop
This moves to the desktop storage location.
(2) Create a Scrapy project. In the terminal, type: scrapy startproject image
Then type: cd image
Continue typing: scrapy genspider imageSpider followed by the domain of the site to crawl (the domain argument is omitted here).
(3) Open the project folder on your desktop in PyCharm and set the crawler rules in settings.py. The ROBOTSTXT_OBEY rule can be commented out directly, or changed to False.
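For reference, a minimal sketch of what that change looks like in settings.py, assuming the rule in question is Scrapy's standard ROBOTSTXT_OBEY setting:
# settings.py (excerpt)
# Assumption: the "crawler rule" mentioned above is the robots.txt switch
ROBOTSTXT_OBEY = False   # or simply comment this line out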
(4) Go back to the spider file and change start_urls: just replace the crawler's default first URL with the URL of the website that needs to be crawled.
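As a sketch, the relevant line in the generated spider might look like this (the URL shown is a placeholder, not the site used in this article):
# imageSpider.py (excerpt)
start_urls = ['http://example.com/list/']   # replace with the page you actually want to crawl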
(5) Now the data can be crawled; here we choose to download images.
After crawling the data, it needs to be handed over to the pipeline (defined in the pipelines.py file) so it can be stored.
Next, import the item model into the spider file:
from ..items import ImageItem
The data model created in the items.py file is then used inside the parse function of the spider file:
item = ImageItem()
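The model itself lives in items.py; a minimal sketch of what ImageItem could look like, assuming a single field named src to match how it is used in the spider below:
# items.py -- hypothetical sketch of the ImageItem model
import scrapy

class ImageItem(scrapy.Item):
    src = scrapy.Field()   # will hold the list of image download URLs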
Points of attention:
Sometimes, when printing output in the terminal, if the returned content is a selector object, that object can still be iterated over, and xpath can still be used to search for content inside it.
If the terminal reports this problem:
# ValueError: Missing scheme in request url: h
then you need to use extract() to convert the xpath selector object into a list object. A list object can still be iterated over, but you can no longer use xpath to search for the objects inside it.
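A short illustration of the difference, assuming we are inside parse() with a response object:
img_list = response.xpath('//ul/li/a/img/@src')   # SelectorList: .xpath() can still be called on it
src_list = img_list.extract()                     # plain Python list of strings: .xpath() is no longer available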
Before downloading, you also need to set the download path and storage location for the images in the settings.py file.
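A minimal sketch of those settings, assuming Scrapy's built-in ImagesPipeline is used (the folder name is illustrative):
# settings.py (excerpt)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = './images'    # hypothetical folder where downloaded images are stored
IMAGES_URLS_FIELD = 'src'    # the item field that holds the image URLs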
The code is attached below; this is the spider file only:
# -*- coding: utf-8 -*-
import scrapy
from ..items import ImageItem


class ImagespiderSpider(scrapy.Spider):
    name = 'imageSpider'
    allowed_domains = ['']
    start_urls = ['/4kmeinv/']

    def parse(self, response):
        # A number of src attribute values are found; now iterate through them, using each one separately
        img_list = response.xpath('//ul[@class="clearfix"]/li/a/img/@src')
        for img in img_list:
            # Use the data model item created in items.py
            item = ImageItem()
            print('--------------------')
            # Convert the selector into a plain string
            img = img.extract()
            # Splice the url of the image to get the full download address
            src = '' + img
            # Put the resulting data into the model.
            # Because it's a download address, wrap it in a list or it will report an error.
            item['src'] = [src]
            yield item
        next_url = response.xpath('//div[@class="page"]/a[text()="next page"]/@href').extract()
        print('*****************************************************************')
        if len(next_url) != 0:
            url = '' + next_url[0]
            # Pass the url on; the result will be processed by parse again
            yield scrapy.Request(url=url, callback=self.parse)
Summary
The above is the entire content of this article. I hope it offers some reference value for your study or work. Thank you for your support.