SoFunction
Updated on 2024-11-18

Python 3: urllib usage examples

urllib is a Python standard-library package for fetching URLs (Uniform Resource Locators). You can use it to retrieve remote data and save it. This article collects common urllib techniques: headers, proxies, timeouts, authentication, and exception handling.

1. Basic usage

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

  1. url: the URL to open
  2. data: the data to submit in a POST request
  3. timeout: the timeout, in seconds, for accessing the site
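As a rough illustration of the data parameter, the sketch below builds a POST body without sending anything. The form fields and the example.com URL are placeholders, not part of the original article.

```python
from urllib import parse, request

# Hypothetical form fields, used only for illustration.
form = {"q": "urllib", "page": "1"}

# urlencode() builds the query string; encode() produces the bytes that data= expects.
body = parse.urlencode(form).encode("utf-8")

# example.com is a placeholder URL; nothing is sent until urlopen() is called.
req = request.Request("http://example.com/search", data=body)

print(body)              # b'q=urllib&page=1'
print(req.get_method())  # POST, because data is not None

# Sending it with a 5-second timeout would look like this:
# response = request.urlopen(req, timeout=5)
```

When data is None the request is a GET; supplying data switches it to POST.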

Call urllib.request.urlopen() directly to fetch a page. The data it returns is of type bytes, so it must be decoded with decode() to obtain a str.

from urllib import request
response = request.urlopen('http://example.com/')  # placeholder URL; returns an http.client.HTTPResponse object
page = response.read()       # bytes
page = page.decode('utf-8')  # str

The object returned by urlopen() provides these methods:

  1. read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data
  2. info(): returns an HTTPMessage object representing the headers sent by the remote server
  3. getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully and 404 means the URL was not found
  4. geturl(): returns the URL that was actually retrieved, which can differ from the requested URL after a redirect
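The methods above can be tried without a remote site. The following is a minimal sketch that starts a throwaway local server; HelloHandler and its response body are made up for the demonstration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny local handler so the example does not depend on a remote site."""

    def do_GET(self):
        body = "hello".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# An empty ProxyHandler keeps this local request away from any system proxy.
opener = request.build_opener(request.ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.getcode()
    final_url = response.geturl()
    content_type = response.info()["Content-Type"]
    text = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(status, final_url, content_type, text)
```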

1. Reading a web page

import urllib.request
response = urllib.request.urlopen('http://example.com/')  # placeholder URL
html = response.read()

2. Using Request

urllib.request.Request(url, data=None, headers={}, method=None)

Wrap the request in a Request object, then pass it to urlopen() to fetch the page.

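import urllib.request
req = urllib.request.Request('http://example.com/')  # placeholder URL
response = urllib.request.urlopen(req)
the_page = response.read()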
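The headers and method parameters of Request can be illustrated without sending anything. This is only a sketch; the URL and the User-Agent string are placeholders.

```python
from urllib import request

# Placeholder URL and User-Agent string, chosen only for illustration.
req = request.Request(
    "http://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (compatible; demo)"},
    method="HEAD",
)
req.add_header("Accept-Language", "en-US")

print(req.get_method())    # HEAD
print(req.full_url)
# Request stores header names with only the first letter capitalized,
# so look them up as 'User-agent' and 'Accept-language'.
print(req.header_items())
```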

3. Sending data: logging in to Zhihu as an example

'''
Created on May 31, 2016

@author: gionee
'''
import gzip
import re
import http.cookiejar
import urllib.request
import urllib.parse

def ungzip(data):
  # Decompress gzip data; return it unchanged if it is not compressed
  try:
    print("Trying to decompress...")
    data = gzip.decompress(data)
    print("Decompression complete.")
  except OSError:
    print("Not compressed, no need to decompress")

  return data

def getXSRF(data):
  # Extract the value of the hidden _xsrf form field from the page
  cer = re.compile('name=\"_xsrf\" value=\"(.*)\"', flags = 0)
  strlist = cer.findall(data)
  return strlist[0]

def getOpener(head):
  # Cookie handling
  cj = http.cookiejar.CookieJar()
  pro = urllib.request.HTTPCookieProcessor(cj)
  opener = urllib.request.build_opener(pro)
  header = []
  for key, value in head.items():
    elem = (key, value)
    header.append(elem)
  opener.addheaders = header
  return opener

# The header values can be captured with a browser tool such as Firebug
header = {
  'Connection': 'Keep-Alive',
  'Accept': 'text/html, application/xhtml+xml, */*',
  'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0',
  'Accept-Encoding': 'gzip, deflate',
  'Host': '',
  'DNT': '1'
}

url = 'http://example.com/'  # placeholder for the site's home page URL
opener = getOpener(header)
op = opener.open(url)
data = op.read()
data = ungzip(data)
_xsrf = getXSRF(data.decode())

url += "login/email"
email = "Login account."
password = "Login password"
postDict = {
  '_xsrf': _xsrf,
  'email': email,
  'password': password,
  'rememberme': 'y'
}
postData = urllib.parse.urlencode(postDict).encode()
op = opener.open(url, postData)
data = op.read()
data = ungzip(data)

print(data.decode())
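The decompression and token-extraction steps of the login script can be checked offline. The sketch below is a variant rather than the script's own code: maybe_gunzip and find_xsrf are hypothetical helpers, and the HTML fragment and token value are made up.

```python
import gzip
import re

# Hypothetical page fragment; the token value is made up.
html = '<input type="hidden" name="_xsrf" value="abc123"/>'
compressed = gzip.compress(html.encode("utf-8"))


def maybe_gunzip(data):
    """Return decompressed bytes if data starts with the gzip magic number."""
    if data[:2] == b"\x1f\x8b":
        return gzip.decompress(data)
    return data


def find_xsrf(text):
    """Return the _xsrf value, or None when the field is absent."""
    match = re.search(r'name="_xsrf" value="([^"]*)"', text)
    return match.group(1) if match else None


page = maybe_gunzip(compressed).decode("utf-8")
print(find_xsrf(page))
print(maybe_gunzip(b"plain bytes"))
```

Two choices differ from the script: checking the magic number avoids relying on an exception for the uncompressed case, and the pattern `[^"]*` stops at the first closing quote, whereas a greedy `(.*)` can run on to a later quote on the same line.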

4. HTTP errors

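import urllib.request
import urllib.error
req = urllib.request.Request('http://example.com/nonexistent')  # placeholder for a URL that returns an HTTP error
try:
  urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
  print(e.code)                   # the HTTP status code, e.g. 404
  print(e.read().decode("utf8"))  # the body of the error page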

5. Exception handling

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("http://example.com/")  # placeholder URL
try:
  response = urlopen(req)
except HTTPError as e:
  # HTTPError is a subclass of URLError, so it must be caught first
  print("The server couldn't fulfill the request.")
  print('Error code: ', e.code)
except URLError as e:
  print('We failed to reach a server.')
  print('Reason: ', e.reason)
else:
  print("good!")
  print(response.read().decode("utf8"))
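The three outcomes (success, HTTPError, URLError) can be reproduced locally. This is a minimal sketch under the assumption that a throwaway server on 127.0.0.1 is acceptable; DemoHandler and describe are hypothetical names.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request
from urllib.error import HTTPError, URLError


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ok":
            self.send_response(200)
            self.send_header("Content-Length", "2")
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the console quiet


# An empty ProxyHandler keeps local requests away from any system proxy.
opener = request.build_opener(request.ProxyHandler({}))


def describe(url):
    try:
        with opener.open(url, timeout=5) as response:
            return "success %d" % response.getcode()
    except HTTPError as e:  # must come first: HTTPError subclasses URLError
        return "http error %d" % e.code
    except URLError as e:
        return "unreachable: %s" % type(e.reason).__name__


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_address[1]

results = [describe(base + "/ok"), describe(base + "/missing")]
server.shutdown()
server.server_close()
results.append(describe(base + "/ok"))  # nothing is listening any more
print(results)
```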

6. HTTP authentication

import urllib.request

# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://example.com/"  # placeholder URL
password_mgr.add_password(None, top_level_url, 'rekfan', 'xxxxxx')

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)

# use the opener to fetch a URL
a_url = "https://example.com/"  # placeholder URL
x = opener.open(a_url)
print(x.read())

# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
a = urllib.request.urlopen(a_url).read().decode('utf8')

print(a)
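HTTPBasicAuthHandler sends credentials only after the server answers with a 401 challenge. As an alternative sketch (not the approach used above), the Basic Authorization header can be set on the request up front; the URL is a placeholder and the credentials are the dummy values from the example.

```python
import base64
from urllib import request

# Dummy credentials and a placeholder URL.
username, password = "rekfan", "xxxxxx"

# Basic auth is "username:password" encoded with base64.
pair = ("%s:%s" % (username, password)).encode("utf-8")
token = base64.b64encode(pair).decode("ascii")

req = request.Request("https://example.com/")
req.add_header("Authorization", "Basic " + token)

print(req.get_header("Authorization"))
# The request would then be sent with request.urlopen(req).
```

Base64 is an encoding, not encryption, so this header should only travel over HTTPS.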

7. Use of proxies

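import urllib.request

# ProxyHandler maps the URL scheme of a request ('http', 'https') to a proxy URL.
# urllib has no built-in SOCKS support, so this assumes an HTTP proxy on localhost:1080.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080', 'https': 'http://localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

a = urllib.request.urlopen("http://example.com/").read().decode("utf8")  # placeholder URL
print(a)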
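It can help to see which proxies urllib would use by default and how to turn them off. The sketch below sends nothing; the localhost:1080 proxy address is an assumption carried over from the example above.

```python
from urllib import request

# Proxies urllib would pick up from the environment (for example http_proxy).
env_proxies = request.getproxies()
print(env_proxies)

# Assumed local HTTP proxy; keys are the URL schemes of the requests to proxy.
proxy = request.ProxyHandler({
    "http": "http://localhost:1080",
    "https": "http://localhost:1080",
})
via_proxy = request.build_opener(proxy)

# An empty mapping disables proxying, including proxies from the environment.
direct = request.build_opener(request.ProxyHandler({}))

print(proxy.proxies)
```

Using a separate opener per policy avoids install_opener(), which changes the behavior of every later urlopen() call in the process.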

8. Timeout

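import socket
import urllib.request

# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)

# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('https://example.com/')  # placeholder URL
a = urllib.request.urlopen(req).read()
print(a)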
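Instead of changing the global socket default, the timeout parameter of urlopen() (or of an opener's open()) applies to a single call. The following is a minimal sketch against a deliberately slow local server; SlowHandler and the delay values are made up for the demonstration.

```python
import socket
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request
from urllib.error import URLError


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(1)  # simulate a slow server
        try:
            self.send_response(200)
            self.send_header("Content-Length", "2")
            self.end_headers()
            self.wfile.write(b"ok")
        except OSError:
            pass  # the client may already have given up

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# An empty ProxyHandler keeps this local request away from any system proxy.
opener = request.build_opener(request.ProxyHandler({}))
try:
    opener.open(url, timeout=0.3).read()  # per-call timeout, in seconds
    outcome = "completed"
except socket.timeout:
    outcome = "timed out"
except URLError as e:
    # A timeout while connecting or sending is wrapped in URLError.
    outcome = "timed out" if isinstance(e.reason, socket.timeout) else "failed"

server.shutdown()
server.server_close()
print(outcome)
```

A timeout while waiting for the response surfaces as socket.timeout (an alias of TimeoutError since Python 3.10), so both exception types are handled.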
