
Using Python's Scrapy crawler framework to crawl a whole site's images and save them locally (implementation code)

You can clone the full source code from GitHub.

GitHub: https://github.com/williamzxl/Scrapy_CrawlMeiziTu

Official Scrapy documentation (Chinese translation): /zh_CN/latest/

Work through the documented workflow once and you will basically know how to use Scrapy.

Step 1:

Before you start crawling, you must create a new Scrapy project. Go into the directory where you intend to store the code and run the following command.

scrapy startproject CrawlMeiziTu

This command creates a CrawlMeiziTu directory with the following contents:

CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

Next, enter the project directory and generate a spider for the site's list page:

cd CrawlMeiziTu
scrapy genspider Meizitu /a/list_1_1.html

This command creates the spider file Meizitu.py under CrawlMeiziTu/spiders/, so the project now contains the following:

CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            Meizitu.py
            ...
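For reference, scrapy genspider fills spiders/Meizitu.py with a minimal skeleton. The exact template depends on your Scrapy version, but it looks roughly like this; the domain and start URL come from whatever was passed on the command line above, so the placeholder strings here are not verbatim:

# spiders/Meizitu.py -- approximate genspider template, not the file verbatim
import scrapy

class MeizituSpider(scrapy.Spider):
    name = "Meizitu"
    allowed_domains = ["..."]  # taken from the genspider argument
    start_urls = ["..."]       # taken from the genspider argument

    def parse(self, response):
        pass  # Step 5 replaces this stub with the real parsing logic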

The files we will mainly edit are settings.py, items.py, pipelines.py, and the spider spiders/Meizitu.py.

On top of the generated files, I also added a small entry script containing just two lines:

from scrapy import cmdline
cmdline.execute("scrapy crawl Meizitu".split())

This is purely for convenience: save it as, say, main.py in the project root and you can launch the crawl by running that file from your IDE instead of typing the command yourself.

Step 2: Edit settings.py, as shown below.

BOT_NAME = 'CrawlMeiziTu'

SPIDER_MODULES = ['CrawlMeiziTu.spiders']
NEWSPIDER_MODULE = 'CrawlMeiziTu.spiders'

ITEM_PIPELINES = {
    'CrawlMeiziTu.pipelines.CrawlmeizituPipeline': 300,
}
IMAGES_STORE = 'D://pic2'
DOWNLOAD_DELAY = 0.3

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
ROBOTSTXT_OBEY = True

The main things to configure are the USER_AGENT, the download path (IMAGES_STORE), and the download delay (DOWNLOAD_DELAY).
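As a side note, these values do not have to live in the global settings.py: Scrapy also honors a per-spider custom_settings dict. A minimal sketch using the same values as above:

# Minimal sketch: the same settings attached to a single spider via
# Scrapy's custom_settings class attribute instead of settings.py.
import scrapy

class MeizituSpider(scrapy.Spider):
    name = "Meizitu"
    custom_settings = {
        'DOWNLOAD_DELAY': 0.3,
        'IMAGES_STORE': 'D://pic2',
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/58.0.3029.110 Safari/537.36'),
    }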

Step 3: Edit items.py.

Items are containers that hold the information the spider scrapes. Since we are crawling girl pictures, we need to capture each picture's name, link, tags, and so on.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class CrawlmeizituItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title is the name of the folder the images are saved into
    title = scrapy.Field()
    url = scrapy.Field()
    tags = scrapy.Field()
    # link to the photo
    src = scrapy.Field()
    # alt is the name of the image
    alt = scrapy.Field()
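Since CrawlmeizituItem subclasses scrapy.Item, it is used like a dict keyed by the declared fields. A quick usage sketch (the values are placeholders, not data from the real site):

# Usage sketch, continuing from the class above.
item = CrawlmeizituItem()
item['title'] = ['some album title']        # placeholder value
item['src'] = ['http://example.com/1.jpg']  # placeholder URL
print(item['title'], item['src'])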

Step 4: Edit pipelines.py.

The pipeline processes the information collected in the items: for example, it creates a folder based on the title, names each image, and downloads the image from its link.

# -*- coding: utf-8 -*-
import os
import requests
from CrawlMeiziTu.settings import IMAGES_STORE

class CrawlmeizituPipeline(object):
    def process_item(self, item, spider):
        fold_name = "".join(item['title'])
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
            'Cookie': 'b963ef2d97e050aaf90fd5fab8e78633',
            # The image host checks the cookie; without it the downloaded images cannot be viewed
        }
        images = []
        # All the pictures go into one folder
        dir_path = '{}'.format(IMAGES_STORE)
        if not os.path.exists(dir_path) and len(item['src']) != 0:
            os.makedirs(dir_path)
        if len(item['src']) == 0:
            # Log albums with no image sources (the log filename was lost from
            # the original post; 'check.txt' is an assumed name)
            with open('..//check.txt', 'a+') as fp:
                fp.write("".join(item['title']) + ":" + "".join(item['url']))
                fp.write("\n")
        for jpg_url, name, num in zip(item['src'], item['alt'], range(0, 100)):
            file_name = name + str(num)
            file_path = '{}//{}'.format(dir_path, file_name)
            print(file_path)
            if os.path.exists(file_path) or os.path.exists(file_name):
                continue
            with open('{}//{}.jpg'.format(dir_path, file_name), 'wb') as f:
                req = requests.get(jpg_url, headers=header)
                f.write(req.content)
        return item
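As an aside, Scrapy also ships a built-in ImagesPipeline that already handles storage under IMAGES_STORE, deduplication, and expiration (it requires the Pillow library). A minimal sketch of that alternative, using the pipeline's default item fields rather than the fields of our CrawlmeizituItem:

# Alternative sketch: Scrapy's built-in ImagesPipeline (requires Pillow).
# Enable it in settings.py instead of the hand-rolled pipeline:
#   ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
# By default it downloads every URL in the item's 'image_urls' field and
# records the results (path, checksum, url) in the 'images' field.
import scrapy

class MeizituImagesItem(scrapy.Item):  # hypothetical item for this sketch
    image_urls = scrapy.Field()        # URLs for ImagesPipeline to fetch
    images = scrapy.Field()            # filled in by the pipeline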

Step 5: Edit the main program of the spider, spiders/Meizitu.py.

This is the most important part, the main program:

# -*- coding: utf-8 -*-
import scrapy
from CrawlMeiziTu.items import CrawlmeizituItem
#from CrawlMeiziTu.items import CrawlmeizituItemPage
import time

class MeizituSpider(scrapy.Spider):
    name = "Meizitu"
    #allowed_domains = ["meizitu.com/"]
    start_urls = []
    last_url = []
    # The URL-list filename was lost from the original post; 'url.txt' is an
    # assumed name. The file holds every list-page URL seen so far, and the
    # spider resumes from the last one.
    with open('..//url.txt', 'r') as fp:
        crawl_urls = fp.readlines()
        for start_url in crawl_urls:
            last_url.append(start_url.strip('\n'))
    start_urls.append("".join(last_url[-1]))

    def parse(self, response):
        selector = scrapy.Selector(response)
        #item = CrawlmeizituItemPage()
        # The id values in these XPaths were stripped from the original post;
        # the ids below are assumptions based on the target site's markup.
        next_pages = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        next_pages_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        all_urls = []
        if 'Next Page' in next_pages_text:
            # The domain was elided in the original post; meizitu.com is assumed.
            next_url = "http://www.meizitu.com/a/{}".format(next_pages[-2])
            # Record the next list page so an interrupted crawl can resume there
            with open('..//url.txt', 'a+') as fp:
                fp.write('\n')
                fp.write(next_url)
                fp.write("\n")
            request = scrapy.http.Request(next_url, callback=self.parse)
            time.sleep(2)
            yield request
        all_info = selector.xpath('//h3[@class="tit"]/a')
        # Read the link of each image folder
        links = all_info.xpath('./@href').extract()
        for link in links:
            request = scrapy.http.Request(link, callback=self.parse_item)
            time.sleep(1)
            yield request
        # next_link = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        # next_link_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        # if 'Next Page' in next_link_text:
        #     nextPage = "http://www.meizitu.com/a/{}".format(next_link[-2])
        #     item['page_url'] = nextPage
        #     yield item

    # Grab the information inside each folder
    def parse_item(self, response):
        item = CrawlmeizituItem()
        selector = scrapy.Selector(response)

        image_title = selector.xpath('//h2/a/text()').extract()
        image_url = selector.xpath('//h2/a/@href').extract()
        image_tags = selector.xpath('//div[@class="metaRight"]/p/text()').extract()
        # Two page layouts exist, hence the fallback XPath
        if selector.xpath('//*[@id="picture"]/p/img/@src').extract():
            image_src = selector.xpath('//*[@id="picture"]/p/img/@src').extract()
        else:
            image_src = selector.xpath('//*[@id="maincontent"]/div/p/img/@src').extract()
        if selector.xpath('//*[@id="picture"]/p/img/@alt').extract():
            pic_name = selector.xpath('//*[@id="picture"]/p/img/@alt').extract()
        else:
            pic_name = selector.xpath('//*[@id="maincontent"]/div/p/img/@alt').extract()
        item['title'] = image_title
        item['url'] = image_url
        item['tags'] = image_tags
        item['src'] = image_src
        item['alt'] = pic_name
        print(item)
        time.sleep(1)
        yield item
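Note that the spider bootstraps start_urls from the URL file and appends every 'Next Page' link back to it, so a stopped crawl resumes from the last page recorded. Before the very first run you must seed that file by hand; a sketch, assuming the same url.txt name and meizitu.com domain as in the spider's comments:

# One-time setup sketch: seed the URL file with the first list page.
# Both 'url.txt' and the meizitu.com domain are assumptions carried over
# from the comments in the spider above.
with open('url.txt', 'w') as fp:
    fp.write('http://www.meizitu.com/a/list_1_1.html\n')

After that, run scrapy crawl Meizitu (or the two-line entry script from Step 1) to start crawling.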

Summary

The above is a short introduction to using Python's Scrapy crawler framework to crawl a whole site's images and save them locally. I hope it helps; if you have any questions, leave me a message and I will reply as soon as I can!