Python crawler Scrapy environment setup
How to build a Scrapy environment
First of all, you need a working Python environment; for how to set one up, see: /alice_tl/article/details/76793590
Next, install Scrapy
1. To install Scrapy, run pip install Scrapy in the terminal (note: this works best on a network with unrestricted access to PyPI's servers abroad).
The progress output is shown below:
alicedeMacBook-Pro:~ alice$ pip install Scrapy
Collecting Scrapy
  Using cached /packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.
Collecting w3lib>=1.17.0 (from Scrapy)
  Using cached /packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.
Collecting six>=1.5.2 (from Scrapy)
xxxxxxxxxxxxxxxxxxxxx
    File "/System/Library/Frameworks//Versions/2.7/Extras/lib/python/setuptools/", line 380, in fetch_build_egg
      return cmd.easy_install(req)
    File "/System/Library/Frameworks//Versions/2.7/Extras/lib/python/setuptools/command/easy_install.py", line 632, in easy_install
      raise DistutilsError(msg)
    : Could not find suitable distribution for ('incremental>=16.10.1')
    ----------------------------------------
Command "python egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/
An error message appears indicating that Twisted failed to install:
Command "python egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/
2. To install Twisted, enter sudo pip install twisted==13.1.0 in the terminal.
alicedeMacBook-Pro:~ alice$ pip install twisted==13.1.0
Collecting twisted==13.1.0
  Downloading /packages/10/38/0d1988d53f140ec99d37ac28c04f341060c2f2d00b0a901bf199ca6ad984/Twisted-13.1..bz2 (2.7MB)
    100% |████████████████████████████████| 2.7MB 398kB/s
Requirement already satisfied: >=3.6.0 in /System/Library/Frameworks//Versions/2.7/Extras/lib/python (from twisted==13.1.0) (4.1.1)
Requirement already satisfied: setuptools in /System/Library/Frameworks//Versions/2.7/Extras/lib/python (from >=3.6.0->twisted==13.1.0) (18.5)
Installing collected packages: twisted
  Running install for twisted ... error
    Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-inJwZ2/twisted/';f=getattr(tokenize, 'open', open)(__file__);code=().replace('\r\n', '\n');();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-record-OmuVWF/ --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/-10.13-intel-2.7
    creating build/-10.13-intel-2.7/twisted
    copying twisted/ -> build/-10.13-intel-2.7/twisted
    copying twisted/_version.py -> build/li
3. Running sudo pip install Scrapy again still produces an error; this time the error says lxml is not installed:
Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)
alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
  Downloading /packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2. (249kB)
    100% |████████████████████████████████| 256kB 210kB/s
Collecting w3lib>=1.17.0 (from Scrapy)
xxxxxxxxxxxx
  Downloading /packages/90/50/4c315ce5d119f67189d1819629cae7908ca0b0a6c572980df5cc6942bc22/Twisted-18.7..bz2 (3.1MB)
    100% |████████████████████████████████| 3.1MB 59kB/s
Collecting lxml (from Scrapy)
  Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)
4. Install lxml by running sudo pip install lxml:
alicedeMacBook-Pro:~ alice$ sudo pip install lxml
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting lxml
  Downloading /packages/a1/2c/6b324d1447640eb1dd240e366610f092da98270c057aeb78aa596cda4dab/lxml-4.2.4-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.7MB)
    100% |████████████████████████████████| 8.7MB 187kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.4
5. Install Scrapy once more with sudo pip install Scrapy; this time the installation succeeds!
alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
  Downloading /packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2. (249kB)
    100% |████████████████████████████████| 256kB 11.5MB/s
Collecting w3lib>=1.17.0 (from Scrapy)
xxxxxxxxx
Requirement already satisfied: lxml in /Library/Python/2.7/site-packages (from Scrapy) (4.2.4)
Collecting functools32; python_version < "3.0" (from parsel>=1.1->Scrapy)
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='', port=443): Read timed out. (read timeout=15)",)': /simple/functools32/
  Downloading /packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746a97fab05a372e4a2c6a6b876165/idna-2.7-py2. (58kB)
    100% |████████████████████████████████| 61kB 66kB/s
Installing collected packages: w3lib, cssselect, functools32, parsel, queuelib, PyDispatcher, attrs, pyasn1-modules, service-identity, , constantly, incremental, Automat, idna, hyperlink, PyHamcrest, Twisted, Scrapy
  Running install for functools32 ... done
  Running install for PyDispatcher ... done
  Found existing installation: 4.1.1
    Uninstalling -4.1.1:
      Successfully uninstalled -4.1.1
  Running install for ... done
  Running install for Twisted ... done
Successfully installed Automat-0.7.0 PyDispatcher-2.0.5 PyHamcrest-1.9.0 Scrapy-1.5.1 Twisted-18.7.0 attrs-18.1.0 constantly-15.1.0 cssselect-1.0.3 functools32-3.2.3.post2 hyperlink-18.0.0 idna-2.7 incremental-17.5.0 parsel-1.5.0 pyasn1-modules-0.2.2 queuelib-1.5.0 service-identity-17.0.0 w3lib-1.19.0 -4.5.0
6. To check whether Scrapy installed successfully, enter scrapy --version.
If Scrapy's version information appears, e.g. Scrapy 1.5.1 - no active project, the installation is fine:
alicedeMacBook-Pro:~ alice$ scrapy --version
Scrapy 1.5.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
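You can also verify the installation from inside Python itself; a small check of my own along these lines confirms that Scrapy and lxml import cleanly:

import scrapy
import lxml.etree

# if both imports succeed and the versions print, the installation is usable
print("Scrapy", scrapy.__version__)
print("lxml", lxml.etree.__version__)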
PS: If you cannot access the external network properly, or do not install with sudo administrator privileges partway through the process, you will get an error message similar to the following:
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip/_internal/", line 141, in main
status = (options, args)
File "/Library/Python/2.7/site-packages/pip/_internal/commands/", line 299, in run
(requirement_set)
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip/_internal/", line 141, in main
    status = (options, args)
  File "/Library/Python/2.7/site-packages/pip/_internal/commands/", line 299, in run
    (requirement_set)
  File "/Library/Python/2.7/site-packages/pip/_internal/", line 102, in resolve
    self._resolve_one(requirement_set, req)
  File "/Library/Python/2.7/site-packages/pip/_internal/", line 256, in _resolve_one
    abstract_dist = self._get_abstract_dist_for(req_to_install)
  File "/Library/Python/2.7/site-packages/pip/_internal/", line 209, in _get_abstract_dist_for
    self.require_hashes
  File "/Library/Python/2.7/site-packages/pip/_internal/operations/", line 283, in prepare_linked_requirement
    progress_bar=self.progress_bar
  File "/Library/Python/2.7/site-packages/pip/_internal/", line 836, in unpack_url
    progress_bar=progress_bar
  File "/Library/Python/2.7/site-packages/pip/_internal/", line 673, in unpack_http_url
    progress_bar)
  File "/Library/Python/2.7/site-packages/pip/_internal/", line 897, in _download_http_url
    _download_url(resp, link, content_file, hashes, progress_bar)
  File "/Library/Python/2.7/site-packages/pip/_internal/", line 617, in _download_url
    hashes.check_against_chunks(downloaded_chunks)
  File "/Library/Python/2.7/site-packages/pip/_internal/utils/", line 48, in check_against_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip/_internal/", line 585, in written_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip/_internal/", line 574, in resp_read
    decode_content=False):
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/", line 465, in stream
    data = (amt=amt, decode_content=decode_content)
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/", line 430, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/System/Library/Frameworks//Versions/2.7/lib/python2.7/", line 35, in __exit__
    (type, value, traceback)
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/", line 345, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: HTTPSConnectionPool(host='', port=443): Read timed out.
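If downloads keep timing out like this, a common workaround is to point pip at a nearby mirror, for example pip install Scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple, or to raise the download timeout with pip --default-timeout=60 install Scrapy. Using sudo -H instead of plain sudo also avoids the cache-permission warnings shown in the logs above.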
The above covers setting up the Scrapy environment.
Common errors when running a Scrapy crawler, and their solutions
Following the first Spider exercise from the tutorial, save the code below as dmoz_spider.py in the tutorial/spiders directory:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # use the second-to-last segment of the URL as the file name
        filename = response.url.split("/")[-2]
        # write the raw page body to that file
        with open(filename, 'wb') as f:
            f.write(response.body)
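As an aside, parse() does not have to dump raw pages; a minimal variant of my own using Scrapy's CSS selectors, dropped in as a replacement for the parse method above, would yield each link's text as an item instead:

    def parse(self, response):
        # extract the text of every link on the page instead of saving the raw HTML
        for text in response.css('a::text').extract():
            yield {'link_text': text.strip()}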
Run scrapy crawl dmoz in the terminal to try to start the crawler.
Error message one:
Scrapy 1.6.0 - no active project
Unknown command: crawl
alicedeMacBook-Pro:~ alice$ scrapy crawl dmoz
Scrapy 1.6.0 - no active project

Unknown command: crawl

Use "scrapy" to see available commands
The reason: the scrapy.cfg file is generated automatically when you create a project with the startproject command. When you start the crawler from the command line, crawl looks for that configuration file in the current working directory, as explained in the official documentation; if the file is not found, Scrapy assumes there is no active project.
Solution: cd into the root directory of the dmoz project, i.e. the directory that contains scrapy.cfg, and run scrapy crawl dmoz from there.
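For reference, the layout that scrapy startproject tutorial generates looks roughly like this (file names can differ slightly between Scrapy versions):

tutorial/
    scrapy.cfg            # project configuration file; run "scrapy crawl" from this level
    tutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dmoz_spider.py    # the spider file created above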
Normally, the output should look something like this:
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET /Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET /Computers/Programming/Languages/Python/Books/> (referer: None)
But that is not what I got.
Error message two:
File "/Library/Frameworks//Versions/3.7/lib/python3.7/site-packages/scrapy/", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: dmoz'
alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
2019-04-19 09:28:23 [] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-04-19 09:28:23 [] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:39:00) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.3.0-x86_64-i386-64bit
Traceback (most recent call last):
  File "/Library/Frameworks//Versions/3.7/lib/python3.7/site-packages/scrapy/", line 69, in load
    return self._spiders[spider_name]
KeyError: 'dmoz'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks//Versions/3.7/lib/python3.7/site-packages/scrapy/", line 71, in load
    raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: dmoz'
Reason: the working directory is wrong again; you need to be in the project directory that actually contains the dmoz spider.
Solution: also simple; double-check the path and cd into the correct project directory before running the command again.
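A quick way to confirm you are in the right place is to run scrapy list from the project root: it should print dmoz. If it instead reports that there is no active project, you are still in the wrong directory.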
Error message three:
File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in <module>
from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util
alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
2018-08-06 22:25:23 [] INFO: Scrapy 1.5.1 started (bot: tutorial)
2018-08-06 22:25:23 [] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.10 (default, Jul 15 2017, 17:16:57) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)], pyOpenSSL 0.13.1 (LibreSSL 2.2.7), cryptography unknown, Platform Darwin-17.3.0-x86_64-i386-64bit
2018-08-06 22:25:23 [] INFO: Overridden settings: {'NEWSPIDER_MODULE': '', 'SPIDER_MODULES': [''], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    (execute())
  File "/Library/Python/2.7/site-packages/scrapy/", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/scrapy/", line 90, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/scrapy/", line 157, in _run_command
t/", line 230, in <module>
    from ._sslverify import (
  File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in <module>
    from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util
I searched online for a long time without finding a solution. Some bloggers said the problem lies in how pyOpenSSL or Scrapy was installed, so I reinstalled pyOpenSSL and Scrapy, but the same error still appeared, and for a while I really did not know how to fix it.
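One check worth doing in this situation is to see which pyOpenSSL the interpreter is actually importing; a small diagnostic sketch of my own:

import OpenSSL

# the old copy bundled with macOS (0.13.x) has no OpenSSL._util,
# which is exactly what triggers the ImportError above
print(OpenSSL.__version__)
print(OpenSSL.__file__)

If the printed path points into the system Python rather than the freshly installed package, the old copy is shadowing the new one.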
Eventually, reinstalling pyOpenSSL and Scrapy once more does seem to have solved it; the crawler now starts and runs through to the end:
2019-04-19 09:46:37 [] INFO: Spider opened
2019-04-19 09:46:37 [] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-19 09:46:39 [] DEBUG: Crawled (403) <GET /> (referer: None)
2019-04-19 09:46:39 [] DEBUG: Crawled (403) <GET /Computers/Programming/Languages/Python/Books/> (referer: None)
2019-04-19 09:46:40 [] INFO: Ignoring response <403 /Computers/Programming/Languages/Python/Books/>: HTTP status code is not handled or not allowed
2019-04-19 09:46:40 [] DEBUG: Crawled (403) <GET /Computers/Programming/Languages/Python/Resources/> (referer: None)
2019-04-19 09:46:40 [] INFO: Ignoring response <403 /Computers/Programming/Languages/Python/Resources/>: HTTP status code is not handled or not allowed
2019-04-19 09:46:40 [] INFO: Closing spider (finished)
2019-04-19 09:46:40 [] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 737,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 2103,
 'downloader/response_count': 3,
 'downloader/response_status_count/403': 3,
 'finish_reason': 'finished',
 'finish_time': (2019, 4, 19, 1, 46, 40, 570939),
 'httperror/response_ignored_count': 2,
 'httperror/response_ignored_status_count/403': 2,
 'log_count/DEBUG': 3,
 'log_count/INFO': 9,
 'log_count/WARNING': 1,
 'memusage/max': 65601536,
 'memusage/startup': 65597440,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/403': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': (2019, 4, 19, 1, 46, 37, 468659)}
2019-04-19 09:46:40 [] INFO: Spider closed (finished)
alicedeMacBook-Pro:tutorial alice$
That concludes this tutorial on setting up a Python Scrapy crawler environment. For more on setting up Scrapy, please search my previous posts or browse the related articles below, and I hope you will continue to support me!