I. What is a crawler?
In big data analysis or data mining, data can be obtained from websites that publish statistics, or from literature and internal records, but these channels sometimes cannot meet our needs, and manually searching the Internet for the data costs far too much effort. In that case you can use crawler technology to automatically fetch the content we are interested in from the Internet and bring it back as our data source, which lets us carry out deeper analysis and extract more valuable information. Before writing a crawler, you first need to understand the libraries it relies on, such as requests or urllib, which are designed for the task of fetching data.
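For a first taste, here is a minimal sketch of fetching a single page with requests; the URL and headers below are illustrative placeholders of my own, not part of the article's crawler (which, as the code later shows, is built on urllib.request).

import requests

# Placeholder URL purely for illustration; swap in the page you actually want
url = "https://example.com/"
# Pretend to be a browser so the request looks like ordinary traffic
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()        # raise an exception for non-2xx responses
print(response.status_code)        # HTTP status code of the reply
print(response.text[:200])         # first 200 characters of the downloaded HTML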
II. Steps for use
1. Import the libraries
The code is as follows (example):
import os
import random
import time
import urllib.request


class BeikeSpider:
    def __init__(self, save_path="./beike"):
        """
        Beike (Shell) crawler constructor
        :param save_path: directory where pages are saved
        """
2. Read data
The code is as follows:
        # URL template for each city and page
        self.url_mode = "http://{}./loupan/pg{}/"
        # Cities to be crawled
        self.cities = ["cd", "sh", "bj"]
        # Number of pages crawled per city
        self.total_pages = 20
        # Let the crawler sleep for a random 5-10 seconds between pages
        self.sleep_interval = (5, 10)
        # Root directory where downloaded pages are saved
        self.save_path = save_path
        # User agent so the crawler masquerades as a browser
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"}
        # Proxy IP information
        self.proxies = [
            {"https": "123.163.67.50:8118"},
            {"https": "58.56.149.198:53281"},
            {"https": "14.115.186.161:8118"}
        ]
        # Create the save directory if it does not exist
        if not os.path.exists(self.save_path):
            os.makedirs(self.save_path)

    def crawl(self):
        """
        Execute the crawl task
        :return: None
        """
These attributes are the data every URL request relies on: the URL template, the list of cities, the number of pages per city, the random sleep range, the request headers, and the pool of proxy IPs.
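To see how the template, city code, and page number combine into a concrete request URL, here is a tiny sketch; the hostname in it is a made-up placeholder rather than the site the article targets.

# Hypothetical template used only to show how the substitution works;
# the real hostname belongs to the site being crawled
url_mode = "http://{}.example.com/loupan/pg{}/"
print(url_mode.format("cd", 1))    # -> http://cd.example.com/loupan/pg1/
print(url_mode.format("sh", 20))   # -> http://sh.example.com/loupan/pg20/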
3. Randomly select an IP address to build a proxy server
        for city in self.cities:
            print("City being crawled:", city)
            # A separate directory for each city's pages
            path = os.path.join(self.save_path, city)
            if not os.path.exists(path):
                os.makedirs(path)
            for page in range(1, self.total_pages + 1):
                # Build the full url
                url = self.url_mode.format(city, page)
                # Build the Request object, putting the url and request headers into it
                req = urllib.request.Request(url, headers=self.headers)
                # Randomly select a proxy IP
                proxy = random.choice(self.proxies)
                # Build the proxy server handler
                proxy_handler = urllib.request.ProxyHandler(proxy)
                # Build the opener
                opener = urllib.request.build_opener(proxy_handler)
                # Open the web page with the opener just built
                response = opener.open(req)
                html = response.read().decode("utf-8")
                # File name (with path) under which the page is saved
                filename = os.path.join(path, str(page) + ".html")
                # Save the page
                self.save(html, filename)
                print("Page %d saved successfully!" % page)
                # Random sleep between pages
                sleep_time = random.randint(self.sleep_interval[0], self.sleep_interval[1])
                time.sleep(sleep_time)
Besides randomly choosing a proxy IP for each request, the crawler also sleeps for a random 5 to 10 seconds after every page, which limits the crawl speed and avoids hammering the site with aggressive requests.
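Taken on its own, that proxy-rotation-plus-throttling pattern boils down to the following sketch; the proxy pool and sleep range mirror the attributes defined earlier, and the actual page fetch is left as a comment.

import random
import time
import urllib.request

# Example proxy pool and sleep range, mirroring the class attributes above
proxies = [
    {"https": "123.163.67.50:8118"},
    {"https": "58.56.149.198:53281"},
]
sleep_interval = (5, 10)

for page in range(1, 4):                   # a handful of pages, just for illustration
    proxy = random.choice(proxies)         # rotate: pick a random proxy for this request
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
    # ... fetch the page with opener.open(...) exactly as crawl() does above ...
    time.sleep(random.randint(*sleep_interval))   # throttle: pause 5-10 seconds between pages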
4. Run the code
    def save(self, html, filename):
        """
        Save the downloaded web page
        :param html: content of the page
        :param filename: name of the file to save it under
        :return: None
        """
        f = open(filename, 'w', encoding="utf-8")
        f.write(html)
        f.close()

    def parse(self):
        """
        Parse the web page data
        :return: None
        """
        pass


if __name__ == "__main__":
    spider = BeikeSpider()
    spider.crawl()
When the script runs, it prints the progress for each city and page, and the downloaded pages are saved under the folder you specified (./beike by default, with one sub-folder per city).
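If you want to check the output programmatically rather than in a file browser, a small sketch like this one walks the save directory (assuming the default "./beike" path) and lists every page that was written:

import os

save_path = "./beike"                      # the default save directory used by BeikeSpider
for root, dirs, files in os.walk(save_path):
    for name in sorted(files):
        print(os.path.join(root, name))    # e.g. ./beike/cd/1.html, ./beike/cd/2.html, ...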
Summary
To summarize: the purpose of walking through this code today is to give you a clear and vivid picture of how a Python crawler works, and to learn it together with you.
That is all for today. This article is only a brief introduction to building a crawler with urllib, which provides a large number of functions that let us fetch the data we need quickly and easily.