SoFunction
Updated on 2024-11-20

A hands-on Python tutorial: crawling Beike (Shell) housing listing data

I. What is a crawler?

In big data analysis and data mining, data can come from websites that publish statistics, or from literature and internal material, but these channels often cannot satisfy our need for data, and searching the Internet for it by hand takes far too much effort. This is where crawler technology comes in: it automatically fetches the content we are interested in from the Internet and brings it back as a data source, so that we can carry out deeper analysis and extract more valuable information. Before writing a crawler you first need to know the libraries it depends on, such as requests or urllib.request, which are built precisely for the task of fetching data.
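As a quick illustration before diving into the full example (a minimal sketch, not part of the tutorial's class; the URL and the shortened User-Agent string are placeholders), fetching a single page with the standard library's urllib.request looks like this:

import urllib.request

# A placeholder URL; replace it with the page you actually want to fetch
url = "https://example.com/"
# Send a User-Agent header so the request looks like it comes from a browser
headers = {"User-Agent": "Mozilla/5.0"}

request = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(request) as response:
    html = response.read().decode("utf-8")
print(html[:200])  # print the first 200 characters of the page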

II. Steps for use

1. Import the libraries

The code is as follows (example):

import os
import urllib.request
import random
import time
class BeikeSpider:
    def __init__(self, save_path="./beike"):
        """
        Shell Crawler Constructor
        :param save_path: directory where pages are saved
        """

2. Read data

The code is as follows:

        # URL template for the listing pages (the fang.ke.com domain is restored here; verify it before use)
        self.url_mode = "http://{}.fang.ke.com/loupan/pg{}/"
        # Cities to be crawled
        self.cities = ["cd", "sh", "bj"]
        # Number of pages crawled per city
        self.total_pages = 20
        # Range (in seconds) for the random sleep between requests
        self.sleep = (5, 10)
        # Root directory where downloaded pages are saved
        self.save_path = save_path
        # User agent so the crawler masquerades as a browser
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"}
        # Proxy IP information
        self.proxies = [
            {"https": "123.163.67.50:8118"},
            {"https": "58.56.149.198:53281"},
            {"https": "14.115.186.161:8118"}
        ]

        # Create the save directory if it does not exist
        if not os.path.exists(self.save_path):
            os.makedirs(self.save_path)

    def crawl(self):
        """
        Execute the crawl task
        :return: None
        """

This is where the URL template and the request headers used for the web requests are defined.
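As a quick check (a standalone sketch reusing the values defined in the constructor; the fang.ke.com domain is restored from the URL template and should be verified), this is how the template and the headers combine into a Request object:

import urllib.request

url_mode = "http://{}.fang.ke.com/loupan/pg{}/"  # assumed Beike new-home listing domain
headers = {"User-Agent": "Mozilla/5.0"}          # shortened browser user agent

# Expand the template for the city "cd" (Chengdu), page 1
url = url_mode.format("cd", 1)
print(url)  # http://cd.fang.ke.com/loupan/pg1/

# Attach the browser-like headers to the request
request = urllib.request.Request(url, headers=headers)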

3. Randomly select an IP address to build the proxy server

        for city in self.cities:
            print("City being crawled:", city)
            # Separate directory for each city's pages
            path = os.path.join(self.save_path, city)
            if not os.path.exists(path):
                os.makedirs(path)

            for page in range(1, self.total_pages + 1):
                # Build the full url
                url = self.url_mode.format(city, page)
                # Build the Request object with the url and the request headers
                request = urllib.request.Request(url, headers=self.headers)

                # Randomly select a proxy IP
                proxy = random.choice(self.proxies)
                # Build the proxy handler
                proxy_handler = urllib.request.ProxyHandler(proxy)
                # Build the opener
                opener = urllib.request.build_opener(proxy_handler)
                # Open the page with the opener just built
                response = opener.open(request)
                html = response.read().decode("utf-8")
                # File name (with path) under which the page is saved
                filename = os.path.join(path, str(page) + ".html")

                # Save the page
                self.save(html, filename)
                print("Page %d saved successfully!" % page)

                # Sleep for a random interval
                sleep_time = random.randint(self.sleep[0], self.sleep[1])
                time.sleep(sleep_time)

In addition to randomly choosing a proxy IP for each request, the crawler sleeps for a random interval between pages, which limits the crawl rate and avoids hammering the site.
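The public proxies listed above may well be unreachable by the time you run the code, in which case opener.open() raises an exception. A minimal sketch of retrying with another randomly chosen proxy (not part of the original code; the function name and parameters are hypothetical) could look like this:

import random
import time
import urllib.request

def fetch_with_random_proxy(url, headers, proxies, retries=3):
    # Try up to `retries` randomly chosen proxies before giving up
    for _ in range(retries):
        proxy = random.choice(proxies)
        opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
        request = urllib.request.Request(url, headers=headers)
        try:
            with opener.open(request, timeout=10) as response:
                return response.read().decode("utf-8")
        except OSError as err:
            print("Proxy failed, trying another one:", proxy, err)
            time.sleep(random.randint(5, 10))  # random back-off, like the crawler's sleep
    return None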

4. Run the code

    def save(self, html, filename):
        """
        Save the downloaded web page
        :param html: content of the page
        :param filename: name of the file to save to
        :return: None
        """

        f = open(filename, 'w', encoding="utf-8")
        f.write(html)
        f.close()

    def parse(self):
        """
        Parsing web page data
        :return: None
        """
        pass

if __name__ == "__main__":
    spider = BeikeSpider()
    spider.crawl()

[Screenshot: console output of the crawler run]

The run produces output like the screenshot above, and the downloaded pages are saved in the folder you specified.
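The parse() method is left as a stub in the class above. As one possible direction (a minimal sketch using only the standard library; it simply collects links and does not analyse the real Beike page structure, and the file path is an example), the saved pages can be parsed like this:

from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # Collect the href attribute of every <a> tag in the page
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Feed it a page saved by the crawler, e.g. ./beike/cd/1.html
with open("./beike/cd/1.html", encoding="utf-8") as f:
    parser = LinkParser()
    parser.feed(f.read())
print(parser.links[:10])  # first ten links found in the page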

Summary

To sum up, the aim of walking through this code is to give you a clear picture of how a Python crawler works, and to learn it together with you. That is all for today; this article only briefly introduces the use of urllib.request for crawling, a library that provides many functions for fetching web data quickly and conveniently.