I. What is a regular expression?
Concept.
A regular expression is a logical formula for operating on strings: a set of special characters defined in advance, together with combinations of them, forms a "rule string" that expresses a filtering logic to apply to strings.
A regular expression is a special sequence of characters that helps you conveniently check whether a string matches a certain pattern.
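As a small illustration of checking whether a string matches a pattern, the sketch below tests for a YYYY-MM-DD date shape. The pattern and the helper name looks_like_date are assumptions made for this example, not part of the project that follows.

```python
import re

# Hypothetical pattern: four digits, a dash, two digits, a dash, two digits.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def looks_like_date(text: str) -> bool:
    """Return True only if the whole string has the YYYY-MM-DD shape."""
    return DATE_PATTERN.fullmatch(text) is not None


print(looks_like_date("2021-04-08"))  # True
print(looks_like_date("April 8th"))   # False
```

re.fullmatch requires the entire string to fit the pattern; re.search would instead accept a match anywhere inside the string.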
Personal understanding.
Simply put, a regular expression lets you write a filter that strips away the clutter of useless information (e.g. web page source code) and keeps only the content you want.
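To make the "filter" idea concrete, here is a minimal sketch that pulls a single piece of content out of a noisy page source. The HTML is made up for illustration.

```python
import re

# Made-up page source, for illustration only.
page_source = """
<html><head><title>Example page</title>
<script>var tracking = 1;</script></head>
<body><p>Hello</p></body></html>
"""

# Keep only the text between <title> and </title>; everything else is clutter.
match = re.search(r"<title>(.*?)</title>", page_source)
page_title = match.group(1) if match else None
print(page_title)  # Example page
```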
II. Practical projects
1. Crawl content
Get the names of all the tertiary hospitals in Shanghai and save them in a .txt file.
2. Access link
Shanghai tertiary hospital website link: https://yyk./sanjia/shanghai/
3. Approach to writing the regular expression
Looking at the source code of this page, I found that the names of the hospitals are all placed inside a
<div class="province-box"> ...... </div>
box, so we only need to filter the data inside this box.
Regular expression.
Method 1.
1. First-pass filter.
<div class="province-box">(.*)<div class="wrap-right">
Starts with: <div class="province-box">, captures (.*), ends with: <div class="wrap-right">
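A minimal sketch of this first pass, run on a made-up stand-in for the page source (the real page is much larger and its exact markup is an assumption here). It also shows why the multi-line flag matters: without re.DOTALL the "." cannot cross a line break, so the pattern above finds nothing.

```python
import re

# Made-up stand-in for the page source.
sample_html = """<div class="header">site navigation</div>
<div class="province-box">
<ul>
<li><a href="/h1/" target="_blank" title="Hospital A">Hospital A</a></li>
<li><a href="/h2/" target="_blank" title="Hospital B">Hospital B</a></li>
</ul>
</div>
<div class="wrap-right">sidebar</div>"""

PRIMARY = r'<div class="province-box">(.*)<div class="wrap-right">'

# Without re.DOTALL, "." stops at newlines, so nothing matches here.
without_flag = re.findall(PRIMARY, sample_html)
# With re.DOTALL, "." also matches newlines and the whole box is captured.
with_flag = re.findall(PRIMARY, sample_html, re.DOTALL)

print(len(without_flag))  # 0
print(len(with_flag))     # 1
```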
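2. Second-pass filter.
title="(.*[院心])" gets the text inside title=" ". The character class holds the characters a name can end with, assumed here to be 院 (as in 医院, hospital) and 心 (as in 中心, center).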
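A minimal sketch of the second pass on a made-up block of list items. The English endings "Hospital" and "Center" are assumed stand-ins for the character class in the pattern above, and the markup is hypothetical.

```python
import re

# Made-up block, standing in for the text captured by the first pass.
block = """
<li><a href="/h1/" target="_blank" title="Example General Hospital">Example General Hospital</a></li>
<li><a href="/h2/" target="_blank" title="Example Eye Center">Example Eye Center</a></li>
<li><a href="/about/" target="_blank" title="About us">About us</a></li>
"""

# Keep only titles that end in "Hospital" or "Center".
names = re.findall(r'title="([^"]*(?:Hospital|Center))"', block)
print(names)  # ['Example General Hospital', 'Example Eye Center']
```

Requiring a specific ending is what keeps unrelated links such as "About us" out of the result.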
Method 2.
Optimized single-pass filter.
<li><a href="/[^/].*/" target="_blank" title="(.*)">
Starts with: <li><a href="/
Ends with: title="(.*)">
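One caveat with a single-pass pattern like this: ".*" is greedy, so it behaves well only when each list item sits on its own line. The sketch below uses made-up markup with two items on one line and compares the greedy form with a stricter variant built from negated character classes. The stricter pattern is a suggested alternative, not the one used in the project.

```python
import re

# Made-up markup: two list items share one line, which is where greedy ".*" over-matches.
sample = ('<li><a href="/h1/" target="_blank" title="Hospital A">Hospital A</a></li>'
          '<li><a href="/h2/" target="_blank" title="Hospital B">Hospital B</a></li>')

# Greedy: ".*" runs as far as it can, swallowing the first item entirely.
greedy = re.findall(r'<li><a href="/[^/].*/" target="_blank" title="(.*)">', sample)
# Stricter: [^/"]+ and [^"]* cannot run past the end of an attribute value.
strict = re.findall(r'<li><a href="/[^/"]+/" target="_blank" title="([^"]*)">', sample)

print(greedy)  # ['Hospital B']
print(strict)  # ['Hospital A', 'Hospital B']
```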
4. Project source code
import re

import requests

url = "https://yyk./sanjia/shanghai/"
# Simulate browser access
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) '
                         'Gecko/20100101 Firefox/87.0'}
res = requests.get(url, headers=headers)
if res.status_code == 200:
    # 1. Get the source code of the web page
    raw_text = res.text
    # 2. Write the regular expressions.
    # 2.2 Note: by default "." does not match a newline, so a pattern cannot
    #     span lines. The page source spans many lines, so pass re.DOTALL.
    # 2.3 Regex 1.
    # re.findall() returns a list; this is the first-pass filter
    re_res = re.findall(r'<div class="province-box">(.*)<div class="wrap-right">',
                        raw_text, re.DOTALL)
    # re_res[0] is the element at index 0; apply the second-pass filter to it.
    # Assumption: the character class is 院 / 心, the last character of
    # names ending in 医院 (hospital) or 中心 (center).
    names = re.findall(r'title="(.*[院心])"', re_res[0])
    # Print the captured names to check them
    print(names)
    # 2.4 Regex 2.
    # (Optimized) No need to filter twice; one pass solves the problem.
    # re_list = re.findall(r'<li><a href="/[^/].*/" target="_blank" title="(.*)">', raw_text)
    # print(re_list)
    # Write to file, one name per line
    with open("List of hospitals in Shanghai.txt", "w", encoding='utf-8') as f:
        for i in names:
            f.write(i)
            f.write("\n")
else:
    print("error")
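The file-writing step at the end of the script can also be factored into a small helper. This is a sketch only: the helper name save_names, the sample data, and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile


def save_names(names, path):
    """Write one name per line, UTF-8 encoded, and return the count written."""
    # The with-block closes the file even if a write fails part-way.
    with open(path, "w", encoding="utf-8") as handle:
        for name in names:
            handle.write(name + "\n")
    return len(names)


# Hypothetical data and a throwaway directory, for illustration only.
sample_names = ["Hospital A", "Hospital B"]
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "hospitals.txt")
    written = save_names(sample_names, out_path)
    with open(out_path, encoding="utf-8") as handle:
        saved_lines = handle.read().splitlines()

print(written, saved_lines)  # 2 ['Hospital A', 'Hospital B']
```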
Python regular expressions: extracting image addresses
import re

import requests


def getpicurl(url):
    # Download the page source
    html = requests.get(url).text
    # re.S lets "." match newlines, so a tag split across lines still matches
    pic_url = re.findall('img src="(.*?)"', html, re.S)
    for key in pic_url:
        print(key + "\r\n")
    # print(pic_url)


# The host part of this URL is missing in the source
getpicurl("/zipai/comment-page-352")
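The pattern img src="(.*?)" only matches when src comes immediately after img. As a sketch of a slightly more tolerant variant, the example below also accepts other attributes before src. The HTML, the example.com hosts, and the helper name extract_image_urls are all made up for illustration.

```python
import re

# Made-up HTML; the hosts and file names are placeholders.
html = '''
<p><img src="https://example.com/mw1024/a.jpg" alt="first"></p>
<p><img class="lazy" src="https://example.com/mw1024/b.jpg"></p>
<p>No image here</p>
'''


def extract_image_urls(page: str) -> list:
    """Return the src value of every <img> tag found in page."""
    # [^>]*? skips any attributes between "<img" and "src" without leaving the tag.
    return re.findall(r'<img\b[^>]*?\bsrc="(.*?)"', page, re.S)


image_urls = extract_image_urls(html)
for image_url in image_urls:
    print(image_url)
```

For anything beyond simple pages, an HTML parser is usually more robust than a regular expression, but the regex form keeps the example close to the script above.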
Output results:
/mw1024/
/mw1024/
/mw1024/
/mw1024/
/mw1024/
/mw1024/
/mw1024/
/mw1024/
/mw1024/
/mw1024/
/mw1024/
/mw1024/
Summary
This concludes the walkthrough of using Python regular expressions to crawl web page information and image addresses.