
Python sample code: a Scrapy crawler that crawls data on a daily schedule

1. Preamble.

1.1 Background.

  • The same items are crawled every day, and the data is used for trend analysis.
  • Exactly one copy of the data must be crawled each day, and no more than one copy.
  • However, how long the whole crawl takes is not fixed: it depends on the local network, proxy speed, and the amount of data to crawl. It is generally around 20 hours, and in rare cases more than 24 hours.

1.2 Functionality.

Ensure that the crawler automatically crawls the data once a day by following these three steps (a simplified sketch of the flow follows below):

  • The monitoring script is started at 00:01 every day. It monitors the running status of the crawler and starts the crawler as soon as the crawler becomes idle.

  • Once the crawler has finished executing, the script exits automatically, ending the task for the day.

  • Once more than 24 hours have passed since the script started, the script exits automatically and waits for the next day's monitoring script to take over, repeating these three steps.
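In outline, the daily flow looks roughly like the sketch below (a simplified illustration only; the marker-file name is hypothetical, and the full monitoring script is given in section 5.1):

# Simplified sketch of the daily monitoring flow (see section 5.1 for the real script).
import os
import time

MARKER_FILE = "isRunning.txt"   # hypothetical marker file that signals "crawler is running"
CHECK_INTERVAL = 600            # check the crawler's status every 10 minutes (in seconds)
MAX_WAIT_MINUTES = 24 * 60      # give up after waiting 24 hours

def daily_monitor(start_crawler):
    waited = 0
    while True:
        if not os.path.isfile(MARKER_FILE):   # crawler idle: start today's crawl
            start_crawler()                   # blocks until the crawl finishes
            break                             # the task for the day is done
        if waited >= MAX_WAIT_MINUTES:        # yesterday's crawl is still running after 24 hours
            break                             # exit; tomorrow's monitoring script takes over
        time.sleep(CHECK_INTERVAL)
        waited += CHECK_INTERVAL // 60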

2. Environment.

Python 3.6.1

System: Windows 7

IDE: PyCharm

Scrapy installed

3. Design approach.

3.1 Prerequisites:

Currently the crawler is launched via Scrapy's built-in cmdline module:

from scrapy import cmdline
cmdline.execute('scrapy crawl mySpider'.split())

3.2. Writing an automatic execution script external to the Scrapy crawler

(1) Start the script at 00:01 AM every day (and limit the script's lifetime to 24 hours), and monitor the running status of the crawler (a marker is needed to indicate the crawler's status: running or stopped).

  • If the crawler is in the running state (the previous day's crawl has not yet finished), go to step (2);
  • If the crawler is in a non-running state (the previous day's crawling task has been completed and today's has not yet started), go to step (3);

(2) The script enters a waiting phase and checks the running status of the crawler every 10 minutes, as in (1). Once the script has been waiting for more than 24 hours, it exits automatically, because the next day's monitoring script is already running and has taken over its task.

(3) Do some preparatory work before starting the crawler (delete the files used for resumable crawling, so that stale state does not stop the crawler from running), start the crawler to crawl the data, and when the crawler finishes normally, exit the script, completing the day's crawling task.

4. Preparatory work.

4.1. Mark the running status of the crawler.

Determine whether the crawler is running by checking whether a file exists:

  • Create a file when the crawler starts.
  • Delete this file when the crawl ends.

The file's presence then means the crawler is running; its absence means the crawler is not running.

# In the Scrapy item pipeline (typically pipelines.py)
import os
from pymongo import MongoClient

checkFile = ""  # path of the marker file that signals "crawler is running"

class myPipeline:
    # When the crawler starts
    def open_spider(self, spider):
        self.client = MongoClient('localhost:27017')  # connect to MongoDB
        self.db = self.client['mydata']               # mydata, the database where the data is to be stored
        f = open(checkFile, "w")                      # create the file that marks the crawler as running
        f.close()

    # At the end of a normal crawl
    def close_spider(self, spider):
        self.client.close()
        isFileExsit = os.path.isfile(checkFile)
        if isFileExsit:
            os.remove(checkFile)                      # remove the marker file: the crawler has stopped
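For the pipeline to take effect it also has to be enabled in the project settings. A minimal sketch, assuming the class above lives in the project's pipelines.py and the project package is called mySpider (the dotted path is an assumption; adjust it to your own project layout):

# settings.py -- enable the pipeline (the dotted path below is an assumption about the project layout)
ITEM_PIPELINES = {
    'mySpider.pipelines.myPipeline': 300,  # lower numbers run earlier in the pipeline chain
}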

4.2. The crawler supports resumable crawling and can be paused at any time, which makes debugging easier.

# A separate start-up file added to the Scrapy project for launching the crawler
from scrapy import cmdline

# While the crawler is running, its state is automatically stored in the crawls/storeMyRequest
# directory, which makes it possible to resume the crawl later.
cmdline.execute('scrapy crawl mySpider -s JOBDIR=crawls/storeMyRequest'.split())

# Note: if you want to resume the crawl later, press Ctrl+C only once when terminating the crawler.
# The crawler needs to do clean-up work when it terminates, so do not press Ctrl+C several times in a row.


4.3. Logs are named according to each day's date for easy viewing and debugging

Set the Log level:

# In the spider file
from scrapy.spiders import CrawlSpider

class mySpider(CrawlSpider):
    name = "mySpider"
    allowed_domains = ['/']
    custom_settings = {
        'LOG_LEVEL': 'INFO',  # reduce the amount of log output and keep only the necessary information
        # ...... custom_settings makes this configuration apply only to this spider
    }
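If the setting should apply to every spider in the project rather than just this one, the same key can instead be placed in settings.py (shown only as an alternative to the per-spider custom_settings above):

# settings.py -- project-wide alternative to the per-spider custom_settings
LOG_LEVEL = 'INFO'  # applies to every spider in the project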

Naming Log files by date

# settings.py
import datetime

BOT_NAME = 'mySpider'
ROBOTSTXT_OBEY = False
startDate = datetime.datetime.now().strftime('%Y%m%d')
LOG_FILE = f"mySpiderlog{startDate}.txt"  # log file named after the start date

4.4. Storing data into different tables (collections in mongodb) by date

# pipelines.py
import datetime
from pymongo import MongoClient

GALANCE = f'galance{datetime.datetime.now().strftime("%Y%m%d")}'  # name of the day's table (collection)

class myPipeline:
    def open_spider(self, spider):
        self.client = MongoClient('localhost:27017')  # connect to MongoDB
        self.db = self.client['mydata']               # mydata, the database where the data is to be stored

    def process_item(self, item, spider):
        self.db[GALANCE].insert(dict(item))           # store the item in the day's collection
        return item
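Because each day's data ends up in its own collection named galanceYYYYMMDD, the trend analysis mentioned in section 1.1 can simply walk those collections. A rough sketch with pymongo (the database name and collection-name prefix follow the snippet above; list_collection_names and count_documents require a reasonably recent pymongo):

from pymongo import MongoClient

client = MongoClient('localhost:27017')
db = client['mydata']

# Count how many items were stored on each day, e.g. as input for a simple trend curve.
for name in sorted(db.list_collection_names()):
    if name.startswith('galance'):
        day = name[len('galance'):]              # the YYYYMMDD part of the collection name
        print(day, db[name].count_documents({}))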


4.5 Write a batch file to start the crawler

:: Batch file: change to the Scrapy project directory and run the start-up Python script
cd /d F:/newClawer20170831/mySpider
call python 
pause


5. Code implementation

5.1 Writing the Python monitoring script

# The monitoring script, run once a day by the scheduled task
from scrapy import cmdline
import datetime
import time
import shutil
import os

recoderDir = r"crawls"  # directory created by the crawler for resumable crawling; it stores the data needed to resume
checkFile = ""          # marker file that flags whether the crawler is running

startTime = datetime.datetime.now()
print(f"startTime = {startTime}")

i = 0
miniter = 0
while True:
    isRunning = os.path.isfile(checkFile)
    if not isRunning:                              # the crawler is not running, so start it
        # Clean up before the crawler starts: clear out JOBDIR = crawls
        isExsit = os.path.exists(recoderDir)       # check whether the JOBDIR directory crawls exists
        print(f"mySpider not running, ready to start. isExsit:{isExsit}")
        if isExsit:
            removeRes = shutil.rmtree(recoderDir)  # delete the crawls directory and all files in it
            print(f"At time:{datetime.datetime.now()}, delete res:{removeRes}")
        else:
            print(f"At time:{datetime.datetime.now()}, Dir:{recoderDir} does not exist.")
        time.sleep(20)
        clawerTime = datetime.datetime.now()
        waitTime = clawerTime - startTime
        print(f"At time:{clawerTime}, start crawler: mySpider !!!, waitTime:{waitTime}")
        cmdline.execute('scrapy crawl mySpider -s JOBDIR=crawls/storeMyRequest'.split())
        break  # exit the script after the crawler is finished
    else:
        print(f"At time:{datetime.datetime.now()}, mySpider is running, sleep to wait.")
    i += 1
    time.sleep(600)        # check every 10 minutes
    miniter += 10
    if miniter >= 1440:    # after waiting 24 hours, exit the monitoring script automatically
        break
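To dry-run the monitoring logic without launching a real crawl, the "crawler is running" state can be simulated by creating and removing the marker file by hand (an illustration only; the file name below is hypothetical and should match whatever checkFile is set to):

import os

checkFile = "isRunning.txt"  # hypothetical marker file name for this test

# Simulate "crawler is running": the monitoring script should keep waiting.
open(checkFile, "w").close()

# Simulate "crawler has finished": the monitoring script should start the crawler on its next check.
if os.path.isfile(checkFile):
    os.remove(checkFile)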

5.2 Writing the .bat batch file

:: Batch file: change to the directory containing the monitoring script and run it
cd /d F:/newClawer20170831/mySpider
call python 
pause

6. Deployment.

6.1. Add a scheduled task.

Refer to the blog below for deploying Windows scheduled tasks:

https:///article/

Detailed instructions for the settings related to Windows scheduled tasks can be found here:

/zh-cn/library/

6.2. Notes.

(1) When adding the scheduled task, check the option shown below ("Run only when user is logged on") so that the cmd window pops up, making it easy to observe and debug.


(2) Since the crawler runs for a long time, with the default setting a new instance started in the early morning would fail to launch if the previous run has not yet finished. Therefore change the default setting to "If the task is already running: Run a new instance in parallel". (The protection mechanism is that each monitoring script automatically exits after waiting 24 hours, which ensures that the crawler is not started repeatedly.)


(3) If you want resumable crawling to work, press Ctrl+C only once to stop the crawler. The crawler needs to do some clean-up work when it terminates; if you press Ctrl+C several times in a row, the crawler will not have time to clean up and resuming the crawl will fail.

6.3. Results.

Normal execution is complete:


Execution in progress:


This concludes this article on sample code for a Python Scrapy crawler that crawls data on a daily schedule. For more on scheduled crawling with Python and Scrapy, please search my previous posts or continue to browse the related articles below. I hope you will support me more in the future!