urllib
After learning the basics of Python, I felt a bit lost: close my eyes and there was just a blank. What I lacked was practice, so I picked up web crawling to practice on. After working through the Spartan Python crawler course, I've organized my experience below for later review. The notes are divided into the following sections.
- 1. Make a simple crawler program
- 2. A small test: grab Baidu Tieba images
- 3. Summary
1. Make a simple crawler program
First, a description of the environment:
- Device: MacBook Air (2012), OS X Yosemite 10.10.1
- Python: python 2.7.9
- Editor: Sublime Text 3
There's not much to say about this one, so let's get right to the code!
```python
'''
@urllib is a web library that comes with Python
@urlopen is a urllib method that opens a connection and fetches a web page
@read() returns the fetched page content as a string
'''
import urllib

url = ""  # Why lifevc? Mainly because it's been pissing me off lately.
html = urllib.urlopen(url)
content = html.read()
html.close()
# The page content can now be printed with print
print content
```
It's very simple; there's basically nothing to say. That's the beauty of Python: a few lines of code and you're done.
Of course, there's no real value in just crawling the web. So let's get down to the nitty-gritty.
2. A small test: grab Baidu Tieba images
It's actually quite simple: to grab the images, we first need to analyze the source code of the web page!
(Basic HTML knowledge is assumed here; Chrome is used as the example browser.)
Here is a brief description of the steps:
- Open the page, right-click, and select "Inspect Element" (at the bottom of the menu).
- In the panel that pops up below, click the question-mark icon at the far left; it will turn blue.
- Move the mouse and click on the image we want to capture (a cute girl).
As the screenshot shows, we can now locate the image's position in the source code.
Here's a copy of the source code
```html
<img class="BDE_Image" src="/forum/w%3D580/sign=3d5aacaab21c8701d6b6b2ee177e9e6e/17a6d439b6003af329aece2e342ac65c1138b6d8.jpg" height="840" width="560" style="cursor: url(/tb/static-pb/img/cur_zin.cur), pointer;">
```
After some analysis and comparison (details omitted here), you can identify a few characteristics of the images we want to capture; a quick check of these follows the list.
- Under the img tag
- Under the class named BDE_Image
- Image format is jpg
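Before writing the full script, here's a minimal sanity check of these characteristics (my own illustrative addition, not from the course; the sample_html string is just the fragment copied above) showing that a simple regular expression picks out the .jpg link:

```python
# -*- coding: utf-8 -*-
# Minimal sanity check (illustrative only): run the planned regex against the
# source fragment copied above. The real page contains many such <img> tags.
import re

sample_html = ('<img class="BDE_Image" '
               'src="/forum/w%3D580/sign=3d5aacaab21c8701d6b6b2ee177e9e6e/'
               '17a6d439b6003af329aece2e342ac65c1138b6d8.jpg" '
               'height="840" width="560">')

# .jpg links inside img tags whose class is BDE_Image
img_pattern = re.compile(r'class="BDE_Image" src="(.+?\.jpg)"')
print re.findall(img_pattern, sample_html)
# prints a one-element list containing the image URL
```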
I'll be updating the regular expressions later, so stay tuned!
Based on the observations above, straight to the code:
```python
'''
@This program is used to download Baidu Tieba images
@re is the regular expression library
'''
import urllib
import re

# Get the html of the page
url = "/p/2336739808"
html = urllib.urlopen(url)
content = html.read()
html.close()

# Match the image features with a regex and get the image links
img_tag = re.compile(r'class="BDE_Image" src="(.+?\.jpg)"')
img_links = re.findall(img_tag, content)

# Download the images; img_counter numbers the filenames
img_counter = 0
for img_link in img_links:
    img_name = '%s.jpg' % img_counter
    urllib.urlretrieve(img_link, "//Users//Sean//Downloads//tieba//%s" % img_name)
    img_counter += 1
```
As you can see, this just grabs pictures of you-know-what.
3. Summary
As the previous two sections show, we can easily crawl web pages and download images.
As an added tip, if you encounter a library or method that you don't quite understand, you can use the following methods to get an initial idea.
- dir(urllib) #see what methods are in the current library
- help() # see what a given method does and what arguments it takes; this is the official, authoritative reference
Or do related searches in the official Python 2 library reference (/2/library/).
Of course Baidu works too, but it's rather inefficient; I suggest the related searches above (you know which one, absolutely satisfying).
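To illustrate the dir()/help() tip above, a quick interactive session might look like the sketch below (output abridged; the exact listing depends on your Python build):

```python
# Exploring the urllib library interactively (Python 2).
import urllib

print dir(urllib)       # lists names such as 'urlopen', 'urlretrieve', 'urlencode', ...
help(urllib.urlopen)    # shows the docstring and arguments of urlopen
```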
This section explained how to crawl web pages and download images; in the next section we'll explain how to crawl sites that restrict crawling.
urllib2
Above we explained how to crawl web pages and download images; in this section we'll explain how to crawl sites that restrict crawling.
First of all, let's use the method from our last lesson to fetch the site everyone has been using as an example (CSDN). This article is divided into the following parts:
- 1. Crawl restricted pages
- 2. Some optimization of the code
1. Crawl restricted pages
Let's start with a test using what we learned in the previous section:
```python
'''
@This program is used to capture a web page
'''
import urllib

url = "/FansUnion"
html = urllib.urlopen(url)
# The getcode() method returns the HTTP status code
print html.getcode()
html.close()

# Output:
# 403
```
Here we have an output of 403, which means that access is denied; similarly, 200 means that the request was successfully completed; and 404 means that the URL was not found.
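As a small illustrative sketch (my own addition, not from the course), you can branch on getcode() before trying to parse a page; the example URL below is just a placeholder:

```python
# Branch on the HTTP status code before parsing the response (Python 2 / urllib).
import urllib

url = "http://example.com/"   # placeholder; use the page you actually want to check
html = urllib.urlopen(url)
code = html.getcode()

if code == 200:
    print "200: request completed successfully"
elif code == 403:
    print "403: access denied - time to pretend we are a browser"
elif code == 404:
    print "404: URL not found"
else:
    print "got status code %d" % code
html.close()
```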
As you can see, CSDN has put blocking in place, and the method from the first section cannot fetch the page; here we need to bring in a new library: urllib2.
But the browser can still open the page, so can we simulate the browser's behaviour to get the page content?
As usual, let's take a look at how the browser submits a request to the CSDN server. First, a brief description of the method:
- Open the page, right-click, and select "Inspect Element" (at the bottom of the menu).
- Click on the Network tab in the box that pops up below
- Refresh the page, you can see that the Network tab captures a lot of information.
- Find one of these messages and expand it to see the header of the request packet.
Here is the collated Header information
```
Request Method: GET
Host:
Referer: /?ref=toolbar_logo
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36
```
Then, based on the extracted header information, we use urllib2's Request method to simulate the browser submitting a request to the server. The code is as follows:
```python
# coding=utf-8
'''
@This program is used to crawl a restricted web page (CSDN)
@User-Agent: client browser version
@Host: server address
@Referer: referring address
@GET: the request method is GET
'''
import urllib2

url = "/FansUnion"

# Customize a Header to simulate the browser submitting a request to the server
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
req.add_header('Host', '')
req.add_header('Referer', '')
req.add_header('GET', url)

# Download the web page html and print it
html = urllib2.urlopen(req)
content = html.read()
print content
html.close()
```
Oh, you restrict me? Then I'll just step around your restrictions. As they say: if the browser can access it, it can be crawled.
2. Some optimization of the code
Simplified submission of Header methods
I found that writing so many req.add_header calls every time is a kind of self-torture. Is there a way to just copy the headers once and reuse them? The answer is yes.
```python
# input:
help(urllib2.Request)
# output (for space reasons, only the __init__ method is shown):
# __init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False)
```

Looking at the signature, we find headers={}, which means the header text can be submitted as a dictionary. Well, let's try it!

```python
# Only the custom Header part of the code
csdn_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Host": "",
    "Referer": "",
    "GET": url
}
req = urllib2.Request(url, headers=csdn_headers)
```
See how much easier that is? Thanks to Spartan for the selfless advice here.
Provide dynamic header information
If you crawl with the method above, the submitted information is often too uniform, and the server decides it's a machine crawler and rejects the request.
So is there some smarter way to submit dynamic data? The answer is definitely yes, and it's easy. Straight to the code!
```python
# coding=utf-8
'''
@This program is used to dynamically submit Header information
@random is a standard library; see /2/library/ for details
'''
import urllib2
import random

url = '/'

my_headers = [
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; InfoPath.1)',
    'Mozilla/4.0 (compatible; GoogleToolbar 5.0.2124.2070; Windows 6.0; MSIE 8.0.6001.18241)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; Sleipnir/2.9.8)',
    # N more omitted for reasons of space
]

# Pick one of the header strings at random
random_header = random.choice(my_headers)

# You can see the submitted header information by printing random_header
req = urllib2.Request(url)
req.add_header("User-Agent", random_header)
req.add_header('Host', '')
req.add_header('Referer', '')
req.add_header('GET', url)

content = urllib2.urlopen(req).read()
print content
```
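Putting the two optimizations together, here's a small sketch of a reusable helper (the function name fetch_with_random_header and the example URL are my own illustrative choices, not part of the course code) that passes headers as a dict with a randomly chosen User-Agent:

```python
# coding=utf-8
# Combine both optimizations: headers passed as a single dict, with the
# User-Agent chosen at random for every request.
import random
import urllib2

USER_AGENTS = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36',
    # add more real browser strings here
]

def fetch_with_random_header(url, referer=None):
    # Build the headers dict, picking a User-Agent at random
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    if referer:
        headers['Referer'] = referer
    req = urllib2.Request(url, headers=headers)
    return urllib2.urlopen(req).read()

if __name__ == '__main__':
    print fetch_with_random_header('http://example.com/')
```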
It's really simple, so we've done some optimization of the code.