SoFunction
Updated on 2024-11-20

A simple tutorial on how to use Scrapy

In this getting-started tutorial, we assume that you already have Python installed. If you don't, see the Installation Guide first.

Step one: enter the development environment with workon article_spider.

Enter this environment:

Next, install Scrapy. Some errors may come up during installation, usually because one of its dependencies failed to build. Since these errors occur so often, there is a very practical fix: go to /~gohlke/pythonlibs/, download the corresponding prebuilt file, and install it with pip. The exact steps are not covered in detail here.
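As a sketch, assuming the dependency that failed to build is Twisted (the wheel filename below is hypothetical; pick the one matching your Python version and architecture from the download page):

```shell
# Install the prebuilt wheel downloaded from the pythonlibs page,
# then retry the Scrapy install. The filename is a placeholder.
pip install Twisted-17.5.0-cp35-cp35m-win_amd64.whl
pip install scrapy
```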

Then go to the project directory and activate our newly created virtual environment:

Create a new Scrapy project: ArticleSpider
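A minimal sketch of the command, run inside the activated virtualenv (Scrapy's startproject template then generates the layout listed below):

```shell
# Create the project skeleton; Scrapy generates scrapy.cfg
# plus a Python module directory of the same name.
scrapy startproject ArticleSpider
cd ArticleSpider
```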

With the project skeleton created, import it into PyCharm:

 

scrapy.cfg: configuration file for the project.
ArticleSpider/: the project's Python module. You will add your code here later.
ArticleSpider/items.py: the item definitions in the project.
ArticleSpider/pipelines.py: the pipelines file in the project.
ArticleSpider/settings.py: the settings file for the project.
ArticleSpider/spiders/: the directory where the spider code is placed.

Go back to the DOS window and create a spider from the basic template.
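A sketch of the command, assuming the spider is named jobbole (the domain below is an assumption; substitute the real target site):

```shell
# Generate a spider from the built-in "basic" template.
# "jobbole" is the spider name; the domain is a placeholder.
scrapy genspider -t basic jobbole blog.jobbole.com
```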

You can see it already created in the PyCharm screenshot above:

To make later development easier, create a script for debugging:

from scrapy.cmdline import execute

import sys
import os

# Print the project directory, then add it to the import path
# so the crawl command can find the project.
print(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "jobbole"])

A few notes on this code:

import sys is needed so we can add the project directory to the path, which makes the crawl command take effect.

It is better not to hard-code the path here: you can obtain it through os, which is more flexible.
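For example, a minimal sketch of building the path with os at runtime instead of hard-coding it:

```python
import os
import sys

# Resolve this file's directory at runtime instead of hard-coding it,
# so the script keeps working if the project is moved.
project_dir = os.path.dirname(os.path.abspath(__file__))
print(project_dir)

# Put the project directory on the import path so scrapy can find the project.
if project_dir not in sys.path:
    sys.path.append(project_dir)
```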

execute is used to run the target Scrapy command.

The content of the spider file:

import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['']
    start_urls = ['/110287']

    def parse(self, response):
        # Two equivalent ways to select the article <h1> node
        re_selector = response.xpath("/html/body/div[1]/div[3]/div[1]/div[1]/h1")
        re2_selector = response.xpath('//*[@]/div[1]/h1')
        title = response.xpath('//div[@class="entry-header"]/h1/text()')
        create_date = response.xpath("")
        #//*[@]
        dian_zan = int(response.xpath("//span[contains(@class,'vote-post-up ')]/h10/text()").extract()[0])
        pass

We use XPath to extract the article's fields, including the title, date, number of comments, number of likes, and so on. Since this is fairly straightforward, we won't go into detail.
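Scrapy's selectors are lxml-based, but the extraction pattern can be sketched with the standard library on a toy snippet. The markup below is made up to mirror the article page's structure, and xml.etree supports only exact class matches rather than contains():

```python
import xml.etree.ElementTree as ET

# Toy markup mimicking the article page; the real page is fetched by the spider.
html = """
<body>
  <div class="entry-header"><h1>A Scrapy Primer</h1></div>
  <span class="vote-post-up"><h10>18</h10></span>
</body>
"""
root = ET.fromstring(html)

# Title text, matching //div[@class="entry-header"]/h1/text()
title = root.find('.//div[@class="entry-header"]/h1').text

# Number of likes: the page keeps the count inside an <h10> tag.
dian_zan = int(root.find('.//span[@class="vote-post-up"]/h10').text)

print(title, dian_zan)
```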

As you can see, debugging in PyCharm is cumbersome because Scrapy is a fairly large framework, so we can use the Scrapy shell to debug instead.

The marked section is the address of the target page; now we can debug much more comfortably.
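A sketch of how this looks (the URL is an assumption pieced together from the spider's name and start_urls above; use the real article address):

```shell
# Start an interactive shell against the article page.
scrapy shell http://blog.jobbole.com/110287/

# Inside the shell, selectors can then be tried line by line, e.g.:
# >>> response.xpath('//div[@class="entry-header"]/h1/text()').extract()
```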

That's it for today's scrapy primer!