Python crawler: multi-threaded crawling with an IP proxy pool
When it comes to Python crawlers, the first thing that comes to mind is the language's powerful libraries. Many beginners reach for Scrapy as their framework, but never get past the stage of simply using it. Running into anti-crawling mechanisms during real crawling work is now routine, so to deepen our understanding of how crawling actually works, we will implement a multi-threaded crawl by hand and introduce an IP proxy pool as a basic counter-measure against anti-crawling.
Here we take the daily fund data site as a practical project: it has an anti-crawling mechanism, and the amount of data is large enough that the benefit of multi-threading is clearly visible. The technical pieces we need are:
- IP proxy pool
- Multi-threading
- Crawling and anti-crawling techniques
We start with basic packet-capture analysis of the daily fund site. The analysis shows that /fundcode_search.js contains all the fund data, and that this address has an anti-crawling mechanism: repeated visits fail and can even get the IP blocked. Based on this, we choose to build an IP proxy pool to work around the anti-crawling measures. A proxy pool can be obtained directly from a proxy vendor; with so many vendors on the market, many students don't know how to choose. After years of crawling and proxy experience, the YNU Yun proxy is recommended here: over long-term use, both its proxy quality and its after-sales service have been better than other vendors'.
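As a minimal sketch of what an IP proxy pool can look like on the client side, the snippet below keeps a small list of proxy endpoints and hands out a random one per request. The hosts, ports, and credentials are placeholders for illustration only, not endpoints of any specific vendor.

import random

# A tiny client-side "proxy pool": a list of proxy endpoints supplied by a vendor.
# The hosts, ports, and credentials below are placeholders for illustration only.
PROXY_POOL = [
    "http://user:password@proxy1.example.com:8000",
    "http://user:password@proxy2.example.com:8000",
    "http://user:password@proxy3.example.com:8000",
]

def get_proxy():
    """Pick a random proxy so consecutive requests go out through different exit IPs."""
    return random.choice(PROXY_POOL)

In a real project the pool would also drop proxies that repeatedly fail and refresh itself from the vendor's API, but random rotation is enough to illustrate the idea.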
After building the IP proxy pool, we move on to crawling the data with multiple threads. Once multithreading is involved, a few issues have to be considered, such as how many concurrent requests the target site and the proxies can tolerate and how to handle failed requests; a thread-pool sketch is shown below.
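The following is a minimal multi-threaded sketch using concurrent.futures.ThreadPoolExecutor together with the proxy pool above. The target URL, worker count, and attempt count are assumptions for illustration; the asyncio/aiohttp version used in this article follows right after.

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

# Placeholder target; in practice this is the fundcode_search.js address found during packet capture.
TARGET_URL = "http://example.com/fundcode_search.js"

def fetch_once(url):
    """Fetch the URL once through a randomly chosen proxy; return the body text or None on failure."""
    proxy = get_proxy()  # get_proxy() comes from the proxy-pool sketch above
    try:
        resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        print(f"request via {proxy} failed: {exc}")
        return None

def crawl(url, workers=8, attempts=32):
    """Issue `attempts` requests across `workers` threads and collect the successful bodies."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_once, url) for _ in range(attempts)]
        for future in as_completed(futures):
            body = future.result()
            if body is not None:
                results.append(body)
    print(f"{len(results)} of {attempts} requests succeeded")
    return results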
The following Python program uses aiohttp, with the proxy IP set through aiohttp_socks, to fetch the data:
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector
from bs4 import BeautifulSoup

# Define parameters for the target site and proxy server
url = "/#os_0;isall_0;ft_;pt_1"
proxy = "socks5://16yun:16ip@:11111"

# Asynchronous function: send a GET request through the proxied session
async def fetch(session, url):
    try:
        async with session.get(url) as response:
            # Check if the response status code is 200, otherwise raise an exception
            if response.status != 200:
                raise Exception(f"Bad status code: {response.status}")
            # Return the text of the response
            return await response.text()
    except Exception as e:
        # Print the exception message and return None
        print(e)
        return None

# Asynchronous function: process the response result and parse the HTML content
async def parse(html):
    # If the response result is not None, parse it
    if html is not None:
        # Use the bs4 library to create a BeautifulSoup object and specify the parser
        soup = BeautifulSoup(html, "html.parser")
        # Extract the title tag from the web page and print its text content
        title = soup.find("title")
        print(title.text if title is not None else None)
    else:
        # Otherwise print None for an invalid result
        print(None)

# Asynchronous function: count successes and print results
async def count(results):
    # Initialize the success counter to 0
    success = 0
    # Iterate over all results; if a result is not None, increase the success count
    for result in results:
        if result is not None:
            success += 1
    # Print the total number of requests and the number of successes
    print(f"Total requests: {len(results)}")
    print(f"Success requests: {success}")

# Asynchronous main function: create and run the concurrent tasks, controlling concurrency
async def main():
    # Create an aiohttp_socks.ProxyConnector object with the proxy server's parameters
    connector = ProxyConnector.from_url(proxy)
    # Create a ClientSession that sends HTTP requests through the proxy connector
    async with aiohttp.ClientSession(connector=connector) as session:
        # Create an empty list to store all the concurrent tasks
        tasks = []
        # Loop 10,000 times, each time creating a fetch task and adding it to the list
        for i in range(10000):
            task = asyncio.create_task(fetch(session, url))
            tasks.append(task)
        # Run all the fetch tasks concurrently and collect their results into a list
        results = await asyncio.gather(*tasks)
        # Create an empty list to store all the parsing tasks
        parse_tasks = []
        for result in results:
            parse_task = asyncio.create_task(parse(result))
            parse_tasks.append(parse_task)
        await asyncio.gather(*parse_tasks)
        await count(results)

# Program entry: start the event loop and run the asynchronous main function
if __name__ == "__main__":
    asyncio.run(main())
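Once the fund-list endpoint responds successfully, the body still has to be turned into structured records. The sketch below assumes the response is a JavaScript file that assigns a JSON-style array of fund entries (roughly var r = [["000001", ...], ...]); if the real format differs, only the regular expression needs adjusting.

import json
import re

def parse_fund_list(js_text):
    """Extract fund records from a JavaScript body assumed to contain a JSON-style nested array."""
    match = re.search(r"\[\[.*\]\]", js_text, re.S)
    if match is None:
        return []
    return json.loads(match.group(0))

# Example usage with a body returned by fetch():
# funds = parse_fund_list(html)
# print(f"parsed {len(funds)} fund records")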
The above is a simple example of using aiohttp in Python to crawl fund data through a proxy. For more on crawling fund data with Python and aiohttp, please see my other related articles!