The previous Python 3 introductory series covered the basics of getting started with Python. From this chapter onwards I will begin to introduce Python crawler tutorials and share them with you. A crawler, simply put, fetches data from the net for analysis and processing. This chapter is introductory: we will look at a few small crawler tests and the tools crawlers use, such as sets, queues, and regular expressions.
Crawling a specified page in Python:
The code is as follows:
import urllib.request

url = "http://www.example.com"  # placeholder: the original URL was elided
data = urllib.request.urlopen(url).read()
data = data.decode('UTF-8')
print(data)
According to the official documentation, urlopen(url) returns a response object, and this object provides various methods, such as the read() method we use here to fetch the data.
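As a quick sketch of what else the returned object offers (example.com is only a placeholder URL):

import urllib.request

response = urllib.request.urlopen("http://www.example.com")  # placeholder URL
print(response.geturl())   # the URL that was actually retrieved
print(response.getcode())  # the HTTP status code, e.g. 200
print(response.info())     # the response headers
data = response.read()     # the raw page bytes
print(data.decode('UTF-8'))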
Fetching a URL with variables:
import urllib.request
import urllib.parse

data = {}
data['word'] = 'one peace'
url_values = urllib.parse.urlencode(data)
url = "http://www.baidu.com/s?"  # the domain was elided in the original; Baidu's search endpoint is assumed from the surviving "/s?"
full_url = url + url_values

a = urllib.request.urlopen(full_url)
data = a.read()
data = data.decode('UTF-8')
print(data)
## Print out the URL that was fetched:
print(a.geturl())
data is a dictionary, which urllib.parse.urlencode() converts to the string 'word=one+peace'; this is then concatenated with url to form full_url.
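To see just the urlencode() step in isolation, a minimal check (using the same dictionary as above):

import urllib.parse

data = {'word': 'one peace'}
print(urllib.parse.urlencode(data))  # word=one+peace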
Introduction to python regular expressions:
Queue Introduction
The crawler program uses a breadth-first search algorithm, which relies on a queue data structure. You could, of course, implement a queue with a list, but that is inefficient. Here is an introduction to the queue provided by the collections container module:
#Queue simple test:
from collections import deque
queue=deque(["peace","rong","sisi"])
("nick")
("pishi")
print(())
print(())
print(queue)
Set Introduction:
In the crawler program, in order not to crawl the same site twice, we need to put the URLs of crawled pages into a set; before crawling any URL, we first check whether it is already in the set. If it is, we skip that URL; if it is not, we put the URL into the set first and then crawl the page.
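As a minimal sketch of that check-then-crawl logic (crawl_page here is a hypothetical fetch function, not part of the tutorial's code):

visited = set()

def crawl_if_new(url):
    if url in visited:
        # Already crawled: skip this url
        return
    # Put the url into the set first, then crawl the page
    visited.add(url)
    crawl_page(url)  # hypothetical fetch function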
Python also includes a set data type. A set is an unordered collection with no duplicate elements. Basic uses include membership testing and eliminating duplicate elements. Set objects also support mathematical operations such as union, intersection, difference, and symmetric difference.
Curly braces or the set() function can be used to create sets. Note: to create an empty set you must use set(), not {}; {} creates an empty dictionary.
The collection creation demo is shown below:
a={"peace","peace","rong","rong","nick"}
print(a)
"peace" in a
b=set(["peace","peace","rong","rong"])
print(b)
# Demonstrate the union
print(a|b)
# Demonstrate the intersection
print(a&b)
# Demonstrate the difference
print(a-b)
# Demonstrate the symmetric difference
print(a^b)
#Output:
{'peace', 'rong', 'nick'}
{'peace', 'rong'}
{'peace', 'rong', 'nick'}
{'peace', 'rong'}
{'nick'}
{'nick'}
Regular expressions
What a crawler collects is generally a stream of characters; picking the URLs out of it requires simple string-processing ability, and regular expressions make this task easy.
Steps for using regular expressions: 1. compile the regular expression; 2. match strings with the compiled expression; 3. process the results.
(The original post included a figure listing regular expression syntax here.)
Using regular expressions in Python requires the re module; some of the methods in that module are described below.
compile and match
The re module's compile function generates Pattern objects; you then call the Pattern instance's match method to process text and ultimately obtain a Match instance; the Match instance is then used to get the match information.
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'rlovep')

# Use the Pattern to match the text; returns None if there is no match
# (the text was elided in the original; 'rlovep.com' fits the stated output)
m = pattern.match('rlovep.com')
if m:
    # Use the Match object to get the grouping information
    print(m.group())

### Output ###
# rlovep

re.compile(strPattern[, flag]):
This is a factory function of the re module that compiles a regular expression given in string form into a Pattern object. The second argument, flag, is the matching mode; flags can be combined with the bitwise OR operator '|' so that several take effect at once, e.g. re.I | re.M. Alternatively, you can specify the mode inside the regex string: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern'). A short demonstration of combining flags follows the list below.
Optional values are:
I(IGNORECASE): case-insensitive matching (the full name is given in parentheses, same below)
M(MULTILINE): multi-line mode, changes the behavior of '^' and '$'
S(DOTALL): dot-all mode, changes the behavior of '.' so that it matches any character, including newlines
L(LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale
U(UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on Unicode character properties
X(VERBOSE): verbose mode. In this mode a regular expression can span multiple lines, whitespace is ignored, and comments can be added.
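A small sketch showing two flags combined with '|' (the sample text is invented for illustration):

import re

# re.I ignores case; re.M lets '^' match at the start of every line
pattern = re.compile(r'^hello', re.I | re.M)
text = "Hello world\nhello again"
print(pattern.findall(text))  # ['Hello', 'hello']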
Match: the Match object is the result of one match and contains a lot of information about that match, which can be retrieved with the readable properties or methods that Match provides.
Properties:
string: The text to use when matching.
re: Pattern object used for matching.
pos: the index in the text at which the regular expression starts searching. The value is the same as the parameter of the same name in Pattern's match() and search() methods.
endpos: the index in the text at which the regular expression stops searching. The value is the same as the parameter of the same name in Pattern's match() and search() methods.
lastindex: the number of the last captured group in the text. None if no group was captured.
lastgroup: the alias of the last captured group. None if that group has no alias or if no group was captured.
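A quick look at these properties on a concrete Match object (pattern and text are invented for illustration):

import re

m = re.match(r'(\w+) (?P<second>\w+)', 'hello world python')
print(m.string)          # hello world python
print(m.re.pattern)      # (\w+) (?P<second>\w+)
print(m.pos, m.endpos)   # 0 18
print(m.lastindex)       # 2 -- the number of the last captured group
print(m.lastgroup)       # second -- the alias of the last captured group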
Methods:
group([group1, …]):
Gets the substring matched by one or more groups; when more than one argument is given, the result is returned as a tuple. group1 can be a number or an alias; number 0 stands for the whole matched substring; with no arguments, group(0) is returned; a group that matched nothing returns None; a group that matched several times returns the substring of its last match.
groups([default]):
Returns the substrings matched by all groups as a tuple, equivalent to calling group(1, 2, ..., last). default is the value substituted for groups that matched nothing; it defaults to None.
groupdict([default]):
Returns a dictionary whose keys are the aliases of the aliased groups and whose values are the substrings those groups matched; groups without aliases are not included. default has the same meaning as above.
start([group]):
Returns the starting index in string of the substring matched by the specified group (the index of the substring's first character). group defaults to 0.
end([group]):
Returns the ending index in string of the substring matched by the specified group (the index of the substring's last character + 1). group defaults to 0.
span([group]):
Returns (start(group), end(group)).
expand(template):
Substitutes the matched groups into template and returns the result. template can use \id or \g<id>, \g<name> to refer to groups, but not number 0. \id and \g<id> are equivalent; however, \10 is taken to mean the 10th group, so to express group \1 followed by the character '0' you must write \g<1>0.
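The sketch below exercises most of these methods on a single match (pattern and text are invented):

import re

m = re.match(r'(\w+) (?P<word>\w+)', 'one peace!')
print(m.group())              # one peace
print(m.group(1, 2))          # ('one', 'peace')
print(m.groups())             # ('one', 'peace')
print(m.groupdict())          # {'word': 'peace'}
print(m.start(2), m.end(2))   # 4 9
print(m.span(2))              # (4, 9)
print(m.expand(r'\2 \1'))     # peace one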
Pattern: a Pattern object is a compiled regular expression; the series of methods Pattern provides is used to find matches in text.
Pattern cannot be instantiated directly; it must be constructed with re.compile().
Pattern provides several readable properties for obtaining information about an expression:
pattern: The expression string to be used at compile time.
flags: the matching mode used at compile time, in numeric form.
groups: The number of groups in the expression.
groupindex: a dictionary whose keys are the aliases of the aliased groups in the expression and whose values are those groups' numbers; groups without aliases are not included.
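A short look at these properties (pattern invented for illustration):

import re

p = re.compile(r'(\w+) (?P<word>\w+)', re.I)
print(p.pattern)     # (\w+) (?P<word>\w+)
print(p.flags)       # the matching mode in numeric form
print(p.groups)      # 2
print(p.groupindex)  # maps the alias 'word' to group number 2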
Instance methods [ | equivalent re module functions]:
match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):
This method tries to match the pattern starting at string's pos index; if the pattern still matches when its end is reached, a Match object is returned; if the pattern cannot be matched along the way, or if the match reaches endpos before the end of the pattern, None is returned.
The default values of pos and endpos are 0 and len(string) respectively; re.match() cannot specify these two parameters; its flags parameter specifies the matching mode used when compiling the pattern.
Note: this method is not an exact match. If string still has characters left over when the pattern ends, the match is still considered successful. For an exact match, put the boundary matcher '$' at the end of the expression.
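A small illustration of that non-exact-match note (strings invented):

import re

p = re.compile(r'\d+')
print(p.match('123abc'))      # still succeeds: '123' matches even though 'abc' remains

exact = re.compile(r'\d+$')
print(exact.match('123abc'))  # None: '$' requires the match to reach the end
print(exact.match('123'))     # succeeds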
search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):
This method finds a substring of string that the pattern matches. It tries to match the pattern at string's pos index; if the pattern still matches at its end, a Match object is returned; if not, pos is incremented by 1 and the match is tried again; if no match is found by the time pos reaches endpos, None is returned. The defaults for pos and endpos are 0 and len(string); re.search() cannot specify these two parameters; its flags parameter specifies the matching mode used when compiling the pattern.
split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
Splits string at the substrings the pattern matches and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
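For example (pattern and input invented):

import re

p = re.compile(r'\d+')
print(p.split('one1two2three3four'))     # ['one', 'two', 'three', 'four']
print(p.split('one1two2three3four', 2))  # ['one', 'two', 'three3four']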
findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):
Searches string and returns all the substrings the pattern matches, as a list.
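For example (same invented pattern as above):

import re

p = re.compile(r'\d+')
print(p.findall('one1two2three3four'))  # ['1', '2', '3']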
finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
Searches string and returns an iterator that yields each match result (a Match object) in order.
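For example (again with the invented pattern):

import re

p = re.compile(r'\d+')
for m in p.finditer('one1two2three3four'):
    print(m.group(), m.span())  # prints 1 (3, 4), then 2 (7, 8), then 3 (13, 14)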
sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
Replaces every matched substring in string with repl and returns the result. When repl is a string, it can use \id or \g<id>, \g<name> to refer to groups, but not number 0. When repl is a function, it should take a single parameter (a Match object) and return a string to use as the replacement (the returned string may not contain group references). count specifies the maximum number of replacements; if it is not given, all matches are replaced.
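A sketch of both forms of repl, a string and a function (text invented):

import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

# repl as a string: \2 and \1 refer to the second and first groups
print(p.sub(r'\2 \1', s))  # say i, world hello!

# repl as a function: it receives a Match object and returns the replacement
def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.sub(func, s))  # I Say, Hello World!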
subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):
Returns (sub(repl, string[, count]), number of replacements).
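For example (continuing the invented text above):

import re

p = re.compile(r'(\w+) (\w+)')
print(p.subn(r'\2 \1', 'i say, hello world!'))  # ('say i, world hello!', 2)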
The re.match method

re.match tries to match a pattern at the beginning of a string; if the beginning does not match, the function returns None.

Function syntax:

re.match(pattern, string, flags=0)
Description of function parameters:

| parameter | description |
| --- | --- |
| pattern | the regular expression to match |
| string | the string to be matched |
| flags | flag bits, used to control how the regular expression is matched: case sensitivity, multi-line matching, and so on |

We can use the group(num) or groups() methods of the Match object to get the matched expressions.

| Match object method | description |
| --- | --- |
| group(num=0) | returns the whole string matched by the expression; group() can be given several group numbers at once, in which case it returns a tuple containing the values of those groups |
| groups() | returns a tuple containing all the subgroup strings, from group 1 up to the last group |
The demo is as follows:
import re

# (the matched text below was elided in the original; "rlovep.com" fits the stated output)
print(re.match("rlovep", "rlovep.com"))         ## matches rlovep at the beginning
print(re.match("rlovep", "rlovep.com").span())  ## span of the match, starting from 0
print(re.match("com", "rlovep.com"))            ## "com" is not at the beginning, so the match fails

## Output:
# <_sre.SRE_Match object; span=(0, 6), match='rlovep'>
# (0, 6)
# None
Example 2: Using groups
import re

line = "This is my blog"

# Match strings containing "is"
matchObj = re.match(r'(.*) is (.*?) .*', line, re.M | re.I)

# When group() is given no argument, it outputs the whole successful match;
# with argument 1 it outputs the first parenthesized group from the left, and so on
if matchObj:
    print("matchObj.group() : ", matchObj.group())    # the whole match
    print("matchObj.group(1) : ", matchObj.group(1))  # the first group
    print("matchObj.group(2) : ", matchObj.group(2))  # the second group
else:
    print("No match!!")

# Output:
# matchObj.group() :  This is my blog
# matchObj.group(1) :  This
# matchObj.group(2) :  my
The re.search method

Scans the entire string and returns the first successful match.

Function syntax:

re.search(pattern, string, flags=0)
Description of function parameters:

| parameter | description |
| --- | --- |
| pattern | the regular expression to match |
| string | the string to be matched |
| flags | flag bits, used to control how the regular expression is matched: case sensitivity, multi-line matching, and so on |

We can use the group(num) or groups() methods of the Match object to get the matched expressions.

| Match object method | description |
| --- | --- |
| group(num=0) | returns the whole string matched by the expression; group() can be given several group numbers at once, in which case it returns a tuple containing the values of those groups |
| groups() | returns a tuple containing all the subgroup strings, from group 1 up to the last group |
Example one:
import re

print(re.search("rlovep", "rlovep.com").span())
print(re.search("com", "rlovep.com").span())

# Output:
# (0, 6)
# (7, 10)
Example two:
import re

line = "This is my blog"

# Match strings containing "is"
matchObj = re.search(r'(.*) is (.*?) .*', line, re.M | re.I)

# group() without an argument outputs the whole successful match;
# group(1) is the first parenthesized group from the left, and so on
if matchObj:
    print("matchObj.group() : ", matchObj.group())    # the whole match
    print("matchObj.group(1) : ", matchObj.group(1))  # the first group
    print("matchObj.group(2) : ", matchObj.group(2))  # the second group
else:
    print("No match!!")

# Output:
# matchObj.group() :  This is my blog
# matchObj.group(1) :  This
# matchObj.group(2) :  my
The difference between search and match: match only matches at the beginning of the string; if the beginning does not match the regular expression, the match fails and the function returns None. search scans the entire string until it finds a match.
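A minimal side-by-side comparison (using the same invented string as the earlier examples):

import re

print(re.match('com', 'rlovep.com'))   # None: 'com' is not at the beginning
print(re.search('com', 'rlovep.com'))  # a Match object with span=(7, 10)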
A small taste of a Python crawler
Use Python to crawl all HTTP links in a page and recursively crawl the links on the subpages. A set and a queue are used; this crawls my website; the first version has many bugs; the code is below:
import re
import urllib.request
from collections import deque

# Use a queue to store the urls to be crawled
queue = deque()
# Use visited to prevent crawling the same page over and over again
visited = set()

url = 'http://www.rlovep.com'  # Entry page, you can change it to something else (the original value was elided; the author's site is assumed)

# Enqueue the initial page
queue.append(url)
cnt = 0

while queue:
    url = queue.popleft()  # The element at the head of the queue leaves the queue
    visited |= {url}       # Mark it as visited

    print('Already captured: ' + str(cnt) + '   Being captured <--- ' + url)
    cnt += 1

    # Crawl the page
    urlop = urllib.request.urlopen(url)
    # Skip anything that is not an html page
    if 'html' not in urlop.getheader('Content-Type'):
        continue

    # Avoid program abortion: handle exceptions with try...except
    try:
        # Convert to utf-8
        data = urlop.read().decode('utf-8')
    except:
        continue

    # Use a regular expression to extract all the links on the page, check
    # whether they have been visited, then add new ones to the crawl queue
    linkre = re.compile("href=['\"]([^\"'>]*?)['\"].*?")
    for x in linkre.findall(data):  # findall returns a list of all matches
        if 'http' in x and x not in visited:  # an http link that has not been crawled yet
            queue.append(x)
            print('Join the queue ---> ' + x)
The results are as follows: