The full source code can be cloned from GitHub.
GitHub: https://github.com/williamzxl/Scrapy_CrawlMeiziTu
Official Scrapy documentation: /zh_CN/latest/
Follow the documented workflow once and you will basically know how to use the framework.
Step1:
Before you start crawling, you must create a new Scrapy project. Go into the directory where you intend to store the code and run the following command.
scrapy startproject CrawlMeiziTu
This command creates a CrawlMeiziTu directory with the following contents:

```
CrawlMeiziTu/
    CrawlMeiziTu/
        __init__.py
        spiders/
            __init__.py
            ...
```

Then enter the project directory and generate the spider (the full start URL is assumed to be on www.meizitu.com, matching the project name):

cd CrawlMeiziTu
scrapy genspider Meizitu http://www.meizitu.com/a/list_1_1.html
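For reference, the spider skeleton that genspider produces looks roughly like the following; the exact template varies a little between Scrapy versions, and the domain and URL here repeat the assumption above:

```python
# -*- coding: utf-8 -*-
import scrapy


class MeizituSpider(scrapy.Spider):
    name = 'Meizitu'
    allowed_domains = ['meizitu.com']  # assumed target domain
    start_urls = ['http://www.meizitu.com/a/list_1_1.html']  # assumed start URL

    def parse(self, response):
        pass  # filled in at Step5
```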
The main change we made to the generated project is a small launcher script, added later, consisting of just two lines:

```python
from scrapy import cmdline

cmdline.execute("scrapy crawl Meizitu".split())
```

This is mainly for convenience: running this file starts the crawl without typing scrapy crawl Meizitu in a shell.
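As a side note, cmdline.execute also accepts the argument list directly, so the split() call can be avoided; a minimal equivalent sketch:

```python
from scrapy import cmdline

# Equivalent launcher: pass the argv list instead of splitting a string
cmdline.execute(['scrapy', 'crawl', 'Meizitu'])
```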
Step2: Edit settings.py, as shown below.

```python
BOT_NAME = 'CrawlMeiziTu'

SPIDER_MODULES = ['CrawlMeiziTu.spiders']
NEWSPIDER_MODULE = 'CrawlMeiziTu.spiders'

ITEM_PIPELINES = {
    'CrawlMeiziTu.pipelines.CrawlmeizituPipeline': 300,
}

IMAGES_STORE = 'D://pic2'
DOWNLOAD_DELAY = 0.3

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
ROBOTSTXT_OBEY = True
```

The main settings are the USER_AGENT, the download path (IMAGES_STORE), and the download delay (DOWNLOAD_DELAY).
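To confirm these values are actually picked up, one option is a tiny check script run from inside the project directory; a minimal sketch using Scrapy's get_project_settings helper:

```python
from scrapy.utils.project import get_project_settings

# Loads the active project's settings.py (must be run inside the project directory)
settings = get_project_settings()
print(settings.get('USER_AGENT'))
print(settings.get('DOWNLOAD_DELAY'))
print(settings.get('IMAGES_STORE'))
```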
Step3: Edit items.py.

Items store the information scraped by the spider. Since we are crawling pictures of girls, we need to capture each picture's name, its link, its tags, and so on.

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlmeizituItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title is the name of the folder
    title = scrapy.Field()
    url = scrapy.Field()
    tags = scrapy.Field()
    # picture link
    src = scrapy.Field()
    # alt is the name of the picture
    alt = scrapy.Field()
```
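As a quick illustration (not part of the project code), a declared Item behaves much like a dict, which is exactly how the spider and pipeline below use it:

```python
from CrawlMeiziTu.items import CrawlmeizituItem

item = CrawlmeizituItem()
item['title'] = ['sample album']                 # folder name
item['src'] = ['http://example.com/sample.jpg']  # placeholder picture link
print(item['title'])   # fields are read back like dict keys
print('tags' in item)  # False: declared but unset fields are simply absent
```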
Step4: Edit pipelines.py.

The pipeline processes the information carried by the items: it creates a folder based on the item's title, names each image, and downloads the image from its link.

```python
# -*- coding: utf-8 -*-
import os

import requests

from CrawlMeiziTu.settings import IMAGES_STORE


class CrawlmeizituPipeline(object):
    def process_item(self, item, spider):
        fold_name = "".join(item['title'])
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
            # The cookie is required to view the images; without it the downloaded images cannot be opened
            'Cookie': 'b963ef2d97e050aaf90fd5fab8e78633',
        }
        images = []
        # Put all the pictures in one folder
        dir_path = '{}'.format(IMAGES_STORE)
        if not os.path.exists(dir_path) and len(item['src']) != 0:
            os.mkdir(dir_path)
        if len(item['src']) == 0:
            # Log pages with no pictures (the log filename was lost in this post; url.txt is assumed)
            with open('..//url.txt', 'a+') as fp:
                fp.write("".join(item['title']) + ":" + "".join(item['url']))
                fp.write("\n")
        for jpg_url, name, num in zip(item['src'], item['alt'], range(0, 100)):
            file_name = name + str(num)
            file_path = '{}//{}'.format(dir_path, file_name)
            images.append(file_path)
            if os.path.exists(file_path) or os.path.exists(file_name):
                continue
            with open('{}//{}.jpg'.format(dir_path, file_name), 'wb') as f:
                req = requests.get(jpg_url, headers=header)
                f.write(req.content)
        return item
```
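A minimal, hypothetical smoke test for the pipeline: process_item only reads dict-style keys, so it can be fed a hand-built item. With an empty src list it exercises the logging branch (writing to the assumed ..//url.txt path) rather than the download loop; the spider argument is unused, so None is fine. Run it from a directory where the CrawlMeiziTu package is importable:

```python
from CrawlMeiziTu.pipelines import CrawlmeizituPipeline

fake_item = {
    'title': ['test_folder'],
    'url': ['http://example.com/page.html'],
    'src': [],  # empty: the pipeline logs the page instead of downloading
    'alt': [],
}
CrawlmeizituPipeline().process_item(fake_item, None)
```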
Step5: Edit Meizitu.py, the spider's main program.

The most important part, the main program:

```python
# -*- coding: utf-8 -*-
import scrapy
from CrawlMeiziTu.items import CrawlmeizituItem
# from CrawlMeiziTu.items import CrawlmeizituItemPage
import time


class MeizituSpider(scrapy.Spider):
    name = "Meizitu"
    # The site domain was lost in this post; www.meizitu.com is assumed, matching the project name
    # allowed_domains = ["meizitu.com"]
    start_urls = []
    last_url = []
    # The URL-list filename was lost in this post; url.txt is assumed
    with open('..//url.txt', 'r') as fp:
        crawl_urls = fp.readlines()
        for start_url in crawl_urls:
            last_url.append(start_url.strip('\n'))
    start_urls.append("".join(last_url[-1]))

    def parse(self, response):
        selector = scrapy.Selector(response)
        # item = CrawlmeizituItemPage()
        # The element id in the pagination XPath was lost; "wp_page_numbers" is assumed
        next_pages = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        next_pages_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        all_urls = []
        if 'Next Page' in next_pages_text:
            next_url = "http://www.meizitu.com/a/{}".format(next_pages[-2])
            with open('..//url.txt', 'a+') as fp:
                fp.write('\n')
                fp.write(next_url)
                fp.write("\n")
            request = scrapy.http.Request(next_url, callback=self.parse)
            time.sleep(2)
            yield request
        all_info = selector.xpath('//h3[@class="tit"]/a')
        # Read the link of each picture folder
        for info in all_info:
            links = info.xpath('//h3[@class="tit"]/a/@href').extract()
            for link in links:
                request = scrapy.http.Request(link, callback=self.parse_item)
                time.sleep(1)
                yield request

        # next_link = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        # next_link_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        # if 'Next Page' in next_link_text:
        #     nextPage = "http://www.meizitu.com/a/{}".format(next_link[-2])
        #     item['page_url'] = nextPage
        #     yield item

    # Grab the information of each folder
    def parse_item(self, response):
        item = CrawlmeizituItem()
        selector = scrapy.Selector(response)

        image_title = selector.xpath('//h2/a/text()').extract()
        image_url = selector.xpath('//h2/a/@href').extract()
        image_tags = selector.xpath('//div[@class="metaRight"]/p/text()').extract()
        # The element ids below were lost in this post; "picture" and "maincontent" are assumed
        if selector.xpath('//*[@id="picture"]/p/img/@src').extract():
            image_src = selector.xpath('//*[@id="picture"]/p/img/@src').extract()
        else:
            image_src = selector.xpath('//*[@id="maincontent"]/div/p/img/@src').extract()
        if selector.xpath('//*[@id="picture"]/p/img/@alt').extract():
            pic_name = selector.xpath('//*[@id="picture"]/p/img/@alt').extract()
        else:
            pic_name = selector.xpath('//*[@id="maincontent"]/div/p/img/@alt').extract()
        item['title'] = image_title
        item['url'] = image_url
        item['tags'] = image_tags
        item['src'] = image_src
        item['alt'] = pic_name
        print(item)
        time.sleep(1)
        yield item
```
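Besides the cmdline launcher from Step1, Scrapy's CrawlerProcess offers another way to start the spider from a plain Python script; a minimal sketch, assuming the generated module path CrawlMeiziTu/spiders/Meizitu.py. Note that the spider's class body reads the assumed url.txt at import time, so that file must exist first:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Module path assumed from `scrapy genspider Meizitu ...` in Step1
from CrawlMeiziTu.spiders.Meizitu import MeizituSpider

process = CrawlerProcess(get_project_settings())
process.crawl(MeizituSpider)
process.start()  # blocks until the crawl finishes
```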
Summary
The above is a brief introduction to crawling all of a site's pictures with Python's Scrapy framework and saving them locally. I hope it helps you; if you have any questions, leave me a message and I will reply promptly!