
Crawling soft exam questions with Python using automatic IP proxying

Preface

There is a software professional qualification exam coming up (hereinafter referred to as the soft exam), and in order to review and prepare better, I am going to grab the soft exam questions available online.

First, let me tell the story of (and the pits I stepped into while) crawling the soft exam questions. I can now automatically crawl all the questions of a given module, as shown below:

Currently it can capture all 30 sets of exam questions for the Information Systems Supervisor module, with results as shown below:

Screenshots of the captured content:

Although some of the information can be captured, the code quality is not high. Take the Information Systems Supervisor capture as an example: since the goal and the parameters were clear, and I wanted to grab the exam paper information in a short time, there is no exception handling, and I spent a long time last night filling in the resulting holes.

Back to the topic. I am writing this post today because I ran into a new pit. From the title you can roughly guess what happened: too many requests, so my IP was blocked by the site's anti-crawler mechanism.

When a web crawler is gathering information, if its crawling frequency exceeds the threshold set by the website, it gets banned. Usually the site's anti-crawler mechanism identifies crawlers by their IP.

So crawler developers usually take one of two approaches to this problem:

1. Slow down the crawl speed to reduce the pressure on the target site. The downside is a smaller amount crawled per unit of time. (A minimal throttling sketch follows this list.)

2. Break through the anti-crawler mechanism and keep crawling at high frequency by using proxy IPs and similar means. This, however, requires multiple stable proxy IPs.
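As a minimal sketch of the first approach (the delay value and page URLs below are illustrative assumptions of mine, not taken from the original code), throttling can be as simple as sleeping between requests:

import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
# Hypothetical list of pages to crawl -- placeholder URLs
page_urls = ['http://example.com/page/' + str(i) for i in range(1, 6)]

for page_url in page_urls:
 web_data = requests.get(page_url, headers=headers)
 # ... parse web_data.text here ...
 time.sleep(2) # pause to stay under the site's frequency threshold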

Without further ado, let's get straight to the code:

# IP addresses are taken from a domestic high-anonymity proxy site: /nn/
# Crawling just the first page of IPs is enough for general use

from bs4 import BeautifulSoup
import requests
import random

# Get the IPs listed on the given page
def get_ip_list(url, headers):
 web_data = requests.get(url, headers=headers)
 soup = BeautifulSoup(web_data.text, 'html.parser')
 ips = soup.find_all('tr')
 ip_list = []
 for i in range(1, len(ips)):
  ip_info = ips[i]
  tds = ip_info.find_all('td')
  # Column 1 holds the IP address, column 2 the port
  ip_list.append(tds[1].text + ':' + tds[2].text)
 return ip_list

# Pick a random IP from the crawled IPs and wrap it as a requests proxies dict
def get_random_ip(ip_list):
 proxy_list = []
 for ip in ip_list:
  proxy_list.append('http://' + ip)
 proxy_ip = random.choice(proxy_list)
 proxies = {'http': proxy_ip}
 return proxies

# Base address of the domestic high-anonymity proxy site
url = '/nn/'
# Request header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
# Counter; loop over the pages to capture IPs
num = 0
# Flat list that collects every captured 'ip:port' string
ip_array = []
while num < 1537:
 num += 1
 ip_list = get_ip_list(url + str(num), headers=headers)
 ip_array.extend(ip_list) # extend keeps ip_array a flat list
for ip in ip_array:
 print(ip)
# Pick a random IP from the collected list to use as a proxy
# proxies = get_random_ip(ip_array)
# print(proxies)

Screenshots of the run results:
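One practical caveat: free proxies scraped this way are often dead or very slow, so it helps to filter the list before use. The sketch below is an assumption of mine, not part of the original post; httpbin.org/ip serves only as an illustrative test endpoint:

# Keep only the proxies that actually respond within a timeout
def filter_working_ips(ip_list, test_url='http://httpbin.org/ip', timeout=5):
 working = []
 for ip in ip_list:
  proxies = {'http': 'http://' + ip}
  try:
   requests.get(test_url, proxies=proxies, timeout=timeout)
   working.append(ip)
  except requests.RequestException:
   pass # drop proxies that error out or time out
 return working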

In this way, by setting the crawler's request to go through a randomly chosen proxy IP, we can effectively evade the simple fixed-IP blocking used by anti-crawler mechanisms.
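For instance, plugging the rotation into an ordinary request could look like the following sketch (target_url is a placeholder, not the soft exam site's actual address; ip_array is the flat list built by the loop above):

proxies = get_random_ip(ip_array)
target_url = 'http://example.com/questions' # placeholder target

try:
 response = requests.get(target_url, headers=headers, proxies=proxies, timeout=10)
 print(response.status_code)
except requests.RequestException:
 # This proxy failed; pick another one and try again
 proxies = get_random_ip(ip_array)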

-------------------------------------------------------------------------------------------------------------------------------------

To keep the site running stably, we should still control the crawler's speed; after all, the webmaster's job is not easy. For this article's test, only 17 pages of IPs were crawled.

Summary

The above is the entire content of this article. I hope it is of some help to everyone learning or using Python. If you have any questions, feel free to leave a comment; thank you for your support.