Python crawler: multi-threaded crawling with an IP proxy pool
When it comes to Python crawlers, the first thing that comes to mind is the language's powerful libraries. Many beginners reach for Scrapy as their framework, but never get past the stage of simply using it. Running into anti-crawling mechanisms during real crawling work is now routine, so to deepen our understanding of how crawling actually works, we will implement a multi-threaded crawl by hand and introduce an IP proxy pool as a basic counter-measure against anti-crawling.
Here we take the daily fund data site as a practical project: it has an anti-crawling mechanism, and the amount of data is large enough that the benefit of multi-threading is clearly visible. The technical pieces we need are:
- IP proxy pool
- Multi-threading
- Crawling and anti-crawling techniques
We start with basic packet-capture analysis of the daily fund site. The analysis shows that /fundcode_search.js contains all the fund data, and that this address has an anti-crawling mechanism: repeated visits fail and can even get the IP blocked. Based on this, we choose to build an IP proxy pool to work around the anti-crawling measures. A proxy pool can be obtained directly from a proxy vendor; with so many vendors on the market, many students don't know how to choose. After years of crawling and proxy experience, the YNU Yun proxy is recommended here: over long-term use, both its proxy quality and its after-sales service have been better than other vendors'.
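As a minimal sketch of what an IP proxy pool can look like on the client side, the snippet below keeps a small list of proxy endpoints and hands out a random one per request. The hosts, ports, and credentials are placeholders for illustration only, not endpoints of any specific vendor.

import random

# A tiny client-side "proxy pool": a list of proxy endpoints supplied by a vendor.
# The hosts, ports, and credentials below are placeholders for illustration only.
PROXY_POOL = [
    "http://user:password@proxy1.example.com:8000",
    "http://user:password@proxy2.example.com:8000",
    "http://user:password@proxy3.example.com:8000",
]

def get_proxy():
    """Pick a random proxy so consecutive requests go out through different exit IPs."""
    return random.choice(PROXY_POOL)

In a real project the pool would also drop proxies that repeatedly fail and refresh itself from the vendor's API, but random rotation is enough to illustrate the idea.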
After building the IP proxy pool, we move on to crawling the data with multiple threads. Once multithreading is involved, a few issues have to be considered, such as how many concurrent requests the target site and the proxies can tolerate and how to handle failed requests; a thread-pool sketch is shown below.
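The following is a minimal multi-threaded sketch using concurrent.futures.ThreadPoolExecutor together with the proxy pool above. The target URL, worker count, and attempt count are assumptions for illustration; the asyncio/aiohttp version used in this article follows right after.

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

# Placeholder target; in practice this is the fundcode_search.js address found during packet capture.
TARGET_URL = "http://example.com/fundcode_search.js"

def fetch_once(url):
    """Fetch the URL once through a randomly chosen proxy; return the body text or None on failure."""
    proxy = get_proxy()  # get_proxy() comes from the proxy-pool sketch above
    try:
        resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        print(f"request via {proxy} failed: {exc}")
        return None

def crawl(url, workers=8, attempts=32):
    """Issue `attempts` requests across `workers` threads and collect the successful bodies."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_once, url) for _ in range(attempts)]
        for future in as_completed(futures):
            body = future.result()
            if body is not None:
                results.append(body)
    print(f"{len(results)} of {attempts} requests succeeded")
    return results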
The following Python program uses aiohttp, with the proxy IP set through aiohttp_socks, to fetch the data:
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector
from bs4 import BeautifulSoup

# Define parameters for the target site and proxy server
url = "/#os_0;isall_0;ft_;pt_1"
proxy = "socks5://16yun:16ip@:11111"

# Asynchronous function: send a GET request through the proxied session
async def fetch(session, url):
    try:
        async with session.get(url) as response:
            # Check if the response status code is 200, otherwise raise an exception
            if response.status != 200:
                raise Exception(f"Bad status code: {response.status}")
            # Return the text of the response
            return await response.text()
    except Exception as e:
        # Print the exception message and return None
        print(e)
        return None

# Asynchronous function: process the response result and parse the HTML content
async def parse(html):
    # If the response result is not None, parse it
    if html is not None:
        # Use the bs4 library to create a BeautifulSoup object and specify the parser
        soup = BeautifulSoup(html, "html.parser")
        # Extract the title tag from the web page and print its text content
        title = soup.find("title")
        print(title.text if title is not None else None)
    else:
        # Otherwise print None for an invalid result
        print(None)

# Asynchronous function: count successes and print results
async def count(results):
    # Initialize the success counter to 0
    success = 0
    # Iterate over all results; if a result is not None, increase the success count
    for result in results:
        if result is not None:
            success += 1
    # Print the total number of requests and the number of successes
    print(f"Total requests: {len(results)}")
    print(f"Success requests: {success}")

# Asynchronous main function: create and run the concurrent tasks, controlling concurrency
async def main():
    # Create an aiohttp_socks.ProxyConnector object with the proxy server's parameters
    connector = ProxyConnector.from_url(proxy)
    # Create a ClientSession that sends HTTP requests through the proxy connector
    async with aiohttp.ClientSession(connector=connector) as session:
        # Create an empty list to store all the concurrent tasks
        tasks = []
        # Loop 10,000 times, each time creating a fetch task and adding it to the list
        for i in range(10000):
            task = asyncio.create_task(fetch(session, url))
            tasks.append(task)
        # Run all the fetch tasks concurrently and collect their results into a list
        results = await asyncio.gather(*tasks)
        # Create an empty list to store all the parsing tasks
        parse_tasks = []
        for result in results:
            parse_task = asyncio.create_task(parse(result))
            parse_tasks.append(parse_task)
        await asyncio.gather(*parse_tasks)
        await count(results)

# Program entry: start the event loop and run the asynchronous main function
if __name__ == "__main__":
    asyncio.run(main())
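Once the fund-list endpoint responds successfully, the body still has to be turned into structured records. The sketch below assumes the response is a JavaScript file that assigns a JSON-style array of fund entries (roughly var r = [["000001", ...], ...]); if the real format differs, only the regular expression needs adjusting.

import json
import re

def parse_fund_list(js_text):
    """Extract fund records from a JavaScript body assumed to contain a JSON-style nested array."""
    match = re.search(r"\[\[.*\]\]", js_text, re.S)
    if match is None:
        return []
    return json.loads(match.group(0))

# Example usage with a body returned by fetch():
# funds = parse_fund_list(html)
# print(f"parsed {len(funds)} fund records")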
The above is a simple example of using aiohttp in Python to crawl fund data through a proxy. For more on crawling fund data with Python and aiohttp, please see my other related articles!