This article introduces how a Python3 web crawler can use a User Agent and proxy IPs to hide its identity. The environment is as follows:
- Platform: Windows
- Python version: Python 3
- IDE: Sublime Text 3
I. Why set a User Agent
Some sites don't like being visited by crawlers, so they inspect the connecting client. If it turns out to be a crawler program, i.e. non-human access, the site will not let you continue visiting. So, to keep the program running normally, you need to hide the identity of your crawler. We can do this by setting the User Agent, which identifies the client software making the request and is called UA for short.
The User Agent lives in the request headers; the server checks the User-Agent header to determine who is visiting. In Python, if you don't set a User Agent, the program uses a default value containing the word Python (urllib's default is of the form Python-urllib/3.x), so if the server checks the User Agent, a Python program that hasn't set one won't be able to access the website normally.
Python lets us modify this User Agent to simulate browser access, which is undeniably powerful.
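As a quick look at that default identity, here is a minimal sketch that prints the headers urllib attaches when nothing is set (the exact version suffix depends on your Python build):

```python
from urllib import request

# An opener built with no extra handlers still carries urllib's default headers.
opener = request.build_opener()
print(opener.addheaders)
# Typical output: [('User-agent', 'Python-urllib/3.x')]
```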
II. Common User Agents
Android
- Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
- Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
- Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
Firefox
- Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
- Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0
Chrome
- Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
- Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19
iOS
- Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3
- Mozilla/5.0 (iPod; U; CPU like Mac OS X; en) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/3A101a Safari/419.3
Some User Agents for Android, Firefox, Google Chrome, and iOS are listed above; just copy one and use it. You can also rotate through several of them, as in the sketch below.
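A minimal hedged sketch of rotating User Agents; the pool reuses two strings from the list above, and the make_request helper is just an illustrative name:

```python
import random
from urllib import request

# A pool of User-Agent strings copied from the list above (trimmed to two here).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
]

def make_request(url):
    # Pick a random User Agent so consecutive requests don't all look identical.
    return request.Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})
```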
III. Setting up the User Agent
Let's first look at urllib.request.Request(). As you can see from its signature, a headers parameter can be passed in when creating the Request object.
Therefore, there are two ways to set a User Agent:
1. When creating the Request object, fill in the headers parameter (including the User Agent information); the headers parameter takes a dictionary.
2. Create the Request object without the headers parameter, then use the add_header() method to add the headers after creation.
Method I:
Create the file urllib_test09.py, use the first Android User Agent listed above, pass in the headers parameter when creating the Request object, and write the code as follows:
```python
# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    # Take CSDN as an example; CSDN can't be accessed without changing the User Agent
    url = 'http://www.csdn.net/'
    head = {}
    # Write the User Agent information
    head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'
    # Create the Request object
    req = request.Request(url, headers=head)
    # Pass in the created Request object
    response = request.urlopen(req)
    # Read the response and decode it
    html = response.read().decode('utf-8')
    # Print the information
    print(html)
```
The results of the run are as follows:
Method II:
Create the file urllib_test10.py, use the first Android User Agent listed above, create the Request object without passing the headers parameter, then use the add_header() method to add the headers after creation. Write the code as follows:
```python
# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    # Take CSDN as an example; CSDN can't be accessed without changing the User Agent
    url = 'http://www.csdn.net/'
    # Create the Request object
    req = request.Request(url)
    # Add the headers with add_header()
    req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19')
    # Pass in the created Request object
    response = request.urlopen(req)
    # Read the response and decode it
    html = response.read().decode('utf-8')
    # Print the information
    print(html)
```
The result of the run is the same as with the previous method. To confirm the header was attached, you can also read it back, as sketched below.
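A quick hedged check, assuming the req object from the code above: Request stores header keys in capitalized form, so the value added by add_header() can be read back like this.

```python
# Request.add_header() stores keys capitalized ('User-agent'),
# so that is the name to use when reading the header back.
print(req.get_header('User-agent'))
# Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) ...
```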
IV. Using IP proxies
1. Why use an IP proxy
The User Agent is now set, but there is another problem to consider: a program runs very fast. If we use a crawler to scrape a site, a single fixed IP will access it at a very high frequency, which doesn't match human behavior, since a human can't make such frequent visits within a few milliseconds. So some sites set a threshold on per-IP access frequency; if an IP exceeds it, the site concludes this is not a human visitor but a crawler program.
2. General step-by-step instructions
A very simple solution is to set a delay between requests, but that obviously defeats the crawler's purpose of gathering information quickly, so a better approach is to use an IP proxy. The steps for using a proxy:
(1) Call urllib.request.ProxyHandler(); the proxies argument is a dictionary.
(2) Create an opener with build_opener() (similar to urlopen, except this open method is customized by us).
(3) Install the opener with install_opener().
The install_opener() method replaces the program's default urlopen. That is to say, once you call install_opener(), later calls to urlopen in that file will use the opener you created. If you don't want to replace the default but only want to use the opener temporarily, call opener.open(url) instead; that has no effect on the program's default urlopen. Both styles are sketched below.
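A minimal sketch of the two styles, assuming a placeholder proxy address (substitute a live one before running the commented-out requests):

```python
from urllib import request

# Placeholder proxy address for illustration only; replace with a working proxy.
proxy_support = request.ProxyHandler({'http': '127.0.0.1:8080'})
opener = request.build_opener(proxy_support)

# Style 1: temporary use. Only this request goes through the opener;
# the module-level urlopen is left untouched.
# response = opener.open('http://example.com/')

# Style 2: global installation. Every later urlopen call in this
# program is now routed through the proxy opener.
request.install_opener(opener)
# response = request.urlopen('http://example.com/')
```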
3. Proxy IP selection
Before writing the code, pick an IP address from a proxy-IP website; the Xici proxy (西刺代理) site is the one recommended here.
Note: of course, you can also write a regular expression to crawl IPs directly from that site, but keep in mind not to crawl too often. Add a delay or the like (a sketch follows); crawling too frequently puts pressure on the server, and the server will simply block you from visiting. I was blocked for two days.
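A minimal hedged sketch of polite scraping with a delay; the page URLs below are placeholders, not the proxy site's real paths:

```python
import time
from urllib import request

# Placeholder list pages; substitute the real proxy-list URLs you are scraping.
pages = ['http://example.com/proxy-list/{}'.format(i) for i in range(1, 4)]

for page in pages:
    req = request.Request(page, headers={'User-Agent': 'Mozilla/5.0'})
    # html = request.urlopen(req).read().decode('utf-8')
    # ... extract IP:port pairs from html with a regular expression ...
    time.sleep(3)  # pause a few seconds between pages so the server isn't hammered
```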
From the Xici site, select an IP that responds reliably; my choice is as follows: 106.46.136.112:808.
Then write code to visit an IP-checking site, i.e. a page that returns the visitor's IP address, so we can verify that the proxy is in effect.
4. Code examples
Create the file urllib_test11.py and write the code as follows:
```python
# -*- coding: UTF-8 -*-
from urllib import request

if __name__ == "__main__":
    # The IP-checking site to visit (the original URL was elided; any service
    # that echoes the visitor's IP, such as httpbin.org/ip, works)
    url = 'http://httpbin.org/ip'
    # The proxy IP
    proxy = {'http': '106.46.136.112:808'}
    # Create the ProxyHandler
    proxy_support = request.ProxyHandler(proxy)
    # Create the Opener
    opener = request.build_opener(proxy_support)
    # Add a User Agent
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36')]
    # Install the Opener
    request.install_opener(opener)
    # Use the installed Opener
    response = request.urlopen(url)
    # Read the response and decode it
    html = response.read().decode("utf-8")
    # Print the information
    print(html)
```
The results of the run are as follows:
As you can see from the image above, the visiting IP has been disguised as 106.46.136.112.
This is the whole content of this article.