In this Getting Started tutorial, we assume that you already have Python installed. If you don't, see the Installation Guide.
Step one: enter the development virtual environment with workon article_spider.
Enter this environment:
Install Scrapy. Some errors may come up during installation; they are usually caused by a dependency failing to install or build. Because this problem appears so often, the practical fix is to go to /~gohlke/pythonlibs/, download the corresponding precompiled file, and install it with pip (pip install followed by the downloaded file); the details are not covered here.
Then go to the project directory and activate our newly created virtual environment:
Create a new Scrapy project named ArticleSpider: scrapy startproject ArticleSpider
Once the project skeleton has been created, import it into PyCharm:
scrapy.cfg: configuration file for the project.
ArticleSpider/: Python module for this project. You will add your code here later.
ArticleSpider/items.py: the items file of the project.
ArticleSpider/pipelines.py: the pipelines file of the project.
ArticleSpider/settings.py: the settings file of the project.
ArticleSpider/spiders/: the directory where the spider code is placed.
Go back to the DOS window and create a spider named jobbole from the basic template (scrapy genspider jobbole followed by the target domain).
It has already been created, as the PyCharm screenshot above shows.
To make future development easier, create a script for debugging:
from scrapy.cmdline import execute
import sys
import os

print(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "jobbole"])
The code above does the following:
import sys: appends the project directory to the path so that the crawl command takes effect.
It is better not to hard-code the path inside the script; obtaining it through os is more flexible.
execute is used to run the target Scrapy command.
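As a standalone illustration of the flexible-path idea above (the print is only for demonstration):

```python
import os
import sys

# Compute this script's own directory at runtime instead of hard-coding it,
# then append it to sys.path so the Scrapy project can be found no matter
# where the script is launched from.
project_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(project_dir)
print(project_dir)
```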
The content of the jobbole spider:
import scrapy

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['']
    start_urls = ['/110287']

    def parse(self, response):
        re_selector = response.xpath("/html/body/div[1]/div[3]/div[1]/div[1]/h1")
        re2_selector = response.xpath('//*[@]/div[1]/h1')
        title = response.xpath('//div[@class="entry-header"]/h1/text()')
        create_date = response.xpath("")
        # //*[@]
        dian_zan = int(response.xpath("//span[contains(@class, 'vote-post-up ')]/h10/text()").extract()[0])
        pass
Through XPath we obtain the corresponding fields of the article, including the title, time, number of comments, number of likes, and so on; because this is relatively simple, we will not go into detail.
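The same XPath idea can be sketched with only the standard library. Note that Scrapy's response.xpath() is far more capable (it parses messy real-world HTML and supports functions like contains()), while ElementTree only handles an XPath subset; the sample markup and values below are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Minimal, well-formed sample markup mimicking the article page structure.
html = """\
<html><body>
  <div class="entry-header"><h1>Sample Article Title</h1></div>
  <span class="vote-post-up">8</span>
</body></html>"""

root = ET.fromstring(html)
# Locate the title via a class-attribute predicate, like the spider does.
title = root.find(".//div[@class='entry-header']/h1").text
# Extract the like count as text, then convert it to an int.
votes = int(root.find(".//span[@class='vote-post-up']").text)
print(title, votes)
```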
As noted above, debugging in PyCharm is cumbersome because starting Scrapy is quite heavy, so we can use the Scrapy shell to debug instead.
The highlighted part is the address of the target site (scrapy shell followed by the article URL); now we can debug much more pleasantly.
That's it for today's Scrapy primer!