This article introduces how a Python3 web crawler can use a User Agent and proxy IPs to hide its identity. The environment is as follows:
- Platform: Windows
- Python version: Python 3
- IDE: Sublime Text 3
I. Why set a User Agent
Some sites don't like being visited by crawlers, so they inspect the connecting client. If it turns out to be a crawler program, i.e. non-human access, the site will not let you continue visiting. So, to keep the program running normally, you need to hide the identity of your crawler. We can do this by setting the User Agent, which identifies the client software making the request and is called UA for short.
The User Agent lives in the request headers; the server checks the User-Agent header to determine who is visiting. In Python, if you don't set a User Agent, the program uses a default value containing the word Python (urllib's default is of the form Python-urllib/3.x), so if the server checks the User Agent, a Python program that hasn't set one won't be able to access the website normally.
Python lets us modify this User Agent to simulate browser access, which is undeniably powerful.
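As a quick look at that default identity, here is a minimal sketch that prints the headers urllib attaches when nothing is set (the exact version suffix depends on your Python build):

```python
from urllib import request

# An opener built with no extra handlers still carries urllib's default headers.
opener = request.build_opener()
print(opener.addheaders)
# Typical output: [('User-agent', 'Python-urllib/3.x')]
```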
II. Common User Agents
Android
- Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
- Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
- Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
Firefox
- Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
- Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0
Chrome
- Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
- Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19
iOS
- Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3
- Mozilla/5.0 (iPod; U; CPU like Mac OS X; en) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/3A101a Safari/419.3
Some User Agents for Android, Firefox, Google Chrome, and iOS are listed above; just copy one and use it. You can also rotate through several of them, as in the sketch below.
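A minimal hedged sketch of rotating User Agents; the pool reuses two strings from the list above, and the make_request helper is just an illustrative name:

```python
import random
from urllib import request

# A pool of User-Agent strings copied from the list above (trimmed to two here).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
]

def make_request(url):
    # Pick a random User Agent so consecutive requests don't all look identical.
    return request.Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})
```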
III. Setting up the User Agent
Let's first look at urllib.request.Request(). As you can see from its signature, a headers parameter can be passed in when creating the Request object.
Therefore, there are two ways to set a User Agent:
1. When creating the Request object, fill in the headers parameter (including the User Agent information); the headers parameter takes a dictionary.
2. Create the Request object without the headers parameter, then use the add_header() method to add the headers after creation.
Method I:
Create the file urllib_test09.py, use the first Android User Agent listed above, pass in the headers parameter when creating the Request object, and write the code as follows:
```python
# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    # Take CSDN as an example; CSDN can't be accessed without changing the User Agent
    url = 'http://www.csdn.net/'
    head = {}
    # Write the User Agent information
    head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'
    # Create the Request object
    req = request.Request(url, headers=head)
    # Pass in the created Request object
    response = request.urlopen(req)
    # Read the response and decode it
    html = response.read().decode('utf-8')
    # Print the information
    print(html)
```
The results of the run are as follows:
Method II:
Create the file urllib_test10.py, use the first Android User Agent listed above, create the Request object without passing the headers parameter, then use the add_header() method to add the headers after creation. Write the code as follows:
```python
# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    # Take CSDN as an example; CSDN can't be accessed without changing the User Agent
    url = 'http://www.csdn.net/'
    # Create the Request object
    req = request.Request(url)
    # Add the headers with add_header()
    req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19')
    # Pass in the created Request object
    response = request.urlopen(req)
    # Read the response and decode it
    html = response.read().decode('utf-8')
    # Print the information
    print(html)
```
The result of the run is the same as with the previous method. To confirm the header was attached, you can also read it back, as sketched below.
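A quick hedged check, assuming the req object from the code above: Request stores header keys in capitalized form, so the value added by add_header() can be read back like this.

```python
# Request.add_header() stores keys capitalized ('User-agent'),
# so that is the name to use when reading the header back.
print(req.get_header('User-agent'))
# Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) ...
```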
IV. Using IP proxies
1. Why use an IP proxy
The User Agent is now set, but there is another problem to consider: a program runs very fast. If we use a crawler to scrape a site, a single fixed IP will access it at a very high frequency, which doesn't match human behavior, since a human can't make such frequent visits within a few milliseconds. So some sites set a threshold on per-IP access frequency; if an IP exceeds it, the site concludes this is not a human visitor but a crawler program.
2. General step-by-step instructions
A very simple solution is to set a delay between requests, but that obviously defeats the crawler's purpose of gathering information quickly, so a better approach is to use an IP proxy. The steps for using a proxy:
(1) Call urllib.request.ProxyHandler(); the proxies argument is a dictionary.
(2) Create an opener with build_opener() (similar to urlopen, except this open method is customized by us).
(3) Install the opener with install_opener().
The install_opener() method replaces the program's default urlopen. That is to say, once you call install_opener(), later calls to urlopen in that file will use the opener you created. If you don't want to replace the default but only want to use the opener temporarily, call opener.open(url) instead; that has no effect on the program's default urlopen. Both styles are sketched below.
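A minimal sketch of the two styles, assuming a placeholder proxy address (substitute a live one before running the commented-out requests):

```python
from urllib import request

# Placeholder proxy address for illustration only; replace with a working proxy.
proxy_support = request.ProxyHandler({'http': '127.0.0.1:8080'})
opener = request.build_opener(proxy_support)

# Style 1: temporary use. Only this request goes through the opener;
# the module-level urlopen is left untouched.
# response = opener.open('http://example.com/')

# Style 2: global installation. Every later urlopen call in this
# program is now routed through the proxy opener.
request.install_opener(opener)
# response = request.urlopen('http://example.com/')
```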
3. Proxy IP selection
Before writing the code, pick an IP address from a proxy-IP website; the Xici proxy (西刺代理) site is the one recommended here.
Note: of course, you can also write a regular expression to crawl IPs directly from that site, but keep in mind not to crawl too often. Add a delay or the like (a sketch follows); crawling too frequently puts pressure on the server, and the server will simply block you from visiting. I was blocked for two days.
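A minimal hedged sketch of polite scraping with a delay; the page URLs below are placeholders, not the proxy site's real paths:

```python
import time
from urllib import request

# Placeholder list pages; substitute the real proxy-list URLs you are scraping.
pages = ['http://example.com/proxy-list/{}'.format(i) for i in range(1, 4)]

for page in pages:
    req = request.Request(page, headers={'User-Agent': 'Mozilla/5.0'})
    # html = request.urlopen(req).read().decode('utf-8')
    # ... extract IP:port pairs from html with a regular expression ...
    time.sleep(3)  # pause a few seconds between pages so the server isn't hammered
```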
From the Xici site, select an IP that responds reliably; my choice is as follows: 106.46.136.112:808.
Then write code to visit an IP-checking site, i.e. a page that returns the visitor's IP address, so we can verify that the proxy is in effect.
4. Code examples
Create the file urllib_test11.py and write the code as follows:
```python
# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    # The IP-checking site to visit (the original URL was elided; any service
    # that echoes the visitor's IP, such as httpbin.org/ip, works)
    url = 'http://httpbin.org/ip'
    # The proxy IP
    proxy = {'http': '106.46.136.112:808'}
    # Create the ProxyHandler
    proxy_support = request.ProxyHandler(proxy)
    # Create the Opener
    opener = request.build_opener(proxy_support)
    # Add a User Agent
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36')]
    # Install the Opener
    request.install_opener(opener)
    # Use the installed Opener
    response = request.urlopen(url)
    # Read the response and decode it
    html = response.read().decode("utf-8")
    # Print the information
    print(html)
```
The results of the run are as follows:
As you can see from the image above, the visiting IP has been disguised as 106.46.136.112.
This is the whole content of this article.