SoFunction
Updated on 2024-11-16

How to crawl web page information and images with Python regular expressions

I. What is a regular expression?

Concept.

A regular expression is a logical formula for string manipulation: predefined special characters, and combinations of them, form a "pattern string" that expresses a filtering logic to apply to text.

A regular expression is a special sequence of characters that helps you conveniently check whether a string matches a certain pattern.

Personal Understanding.

Simply put, a regular expression works as a filter: you write a pattern that sifts through cluttered, mostly useless input (e.g. web page source code) and extracts just the content you want.
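As a minimal illustration of this filtering idea (the sample text and pattern below are made up for demonstration):

```python
import re

# A regex acts as a filter: it pulls the pieces you want out of noisy text.
text = 'name: Alice, age: 30; name: Bob, age: 25'

# Capture the word that follows each "name: ".
names = re.findall(r'name: (\w+)', text)
print(names)  # ['Alice', 'Bob']
```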

II. Practical projects

1. Crawl content

Get the names of all the tertiary (Grade III) hospitals in Shanghai and save them to a .txt file.

2. Access links

Shanghai tertiary hospital listing: https://yyk./sanjia/shanghai/

3. Inspiration for regular expression writing

Opening the page and viewing its source, I found that the hospital names all sit inside one

<div class="province-box"> ...... </div>

box, so we just need to filter the data inside this box.

Regular expression.

Approach 1.

1. First-pass filter:

   <div class="province-box">(.*)<div class="wrap-right">

Starts with: <div class="province-box">, then (.*) captures everything up to where it ends with: <div class="wrap-right">

2. Second-pass filter:

title="(.*[院中心])" gets the hospital name inside title=" "; the character class [院中心] matches the usual final character of a Chinese hospital name (...院 "hospital", ...中心 "center").

Approach 2.

Optimized single-pass filter:

 <li><a href="/[^/].*/" rel="external nofollow" target="_blank" title="(.*)">

It starts with <li><a href="/[^/].*/" and ends with title="(.*)">, so the hospital name is captured directly from the title attribute in one pass.
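The single-pass filter can be tried out on a small stand-in snippet (the hospital names and hrefs below are invented, and the pattern here uses non-greedy .*? so each match stays separate):

```python
import re

# Invented HTML shaped like the hospital list page.
html = ('<li><a href="/h1/" rel="external nofollow" target="_blank" '
        'title="Shanghai First Hospital"></a></li>'
        '<li><a href="/h2/" rel="external nofollow" target="_blank" '
        'title="Shanghai Second Hospital"></a></li>')

# One pass: capture the title attribute of every matching <li><a ...> entry.
titles = re.findall(r'<li><a href="/[^/].*?/" rel="external nofollow" '
                    r'target="_blank" title="(.*?)">', html)
print(titles)  # ['Shanghai First Hospital', 'Shanghai Second Hospital']
```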

4. Project source code

import requests
import re

url = "https://yyk./sanjia/shanghai/"
# Simulate a browser visit
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) '
                         'Gecko/20100101 Firefox/87.0'}
res = requests.get(url, headers=headers)

if res.status_code == 200:
    # 1. Get the source code of the page
    raw_text = res.text

    # 2. Regular expression matching
    # Note: '.' does not match newlines by default; pass re.S so the
    # pattern can span the multi-line source code.
    # 2.1 First pass: findall() returns a list of the captured groups
    re_res = re.findall(r'<div class="province-box">(.*)<div class="wrap-right">',
                        raw_text, re.S)
    # re_res[0] holds the block to run the second pass on
    # ([院中心] matches the usual final character of a hospital name)
    names = re.findall(r'title="(.*[院中心])"', re_res[0])
    # Check the captures
    print(names)

    # 2.2 Approach 2 (optimized): one pass, no second filter needed
    # re_list = re.findall(r'<li><a href="/[^/].*/" rel="external nofollow" '
    #                      r'target="_blank" title="(.*)">', raw_text)
    # print(re_list)

    # Write the names to a file
    with open("List of hospitals in Shanghai.txt", "w", encoding='utf-8') as f:
        for name in names:
            f.write(name + "\n")
else:
    print("error")
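The note about multi-line matching is worth seeing concretely: by default '.' does not match a newline, so a pattern that spans several lines of source code finds nothing unless you pass re.S (a small made-up example):

```python
import re

# Made-up multi-line "source code" shaped like the crawled page.
text = ('<div class="province-box">\n'
        'Hospital A\n'
        'Hospital B\n'
        '</div><div class="wrap-right">')

pattern = r'<div class="province-box">(.*)<div class="wrap-right">'

# Without re.S, '.' stops at newlines, so the multi-line block never matches.
print(re.findall(pattern, text))        # []

# With re.S, '.' also matches newlines and the whole block is captured.
print(re.findall(pattern, text, re.S))  # ['\nHospital A\nHospital B\n</div>']
```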

Project directory:

Selected results:

Python regular expressions: extracting image addresses

import re
import requests

def getpicurl(url):
    # The argument is overridden here in the original article
    # (the hostname is truncated in the source).
    url = "/zipai/comment-page-352"
    html = requests.get(url).text
    # Non-greedy match: one URL per img tag
    pic_url = re.findall(r'img src="(.*?)"', html, re.S)
    for key in pic_url:
        print(key + "\r\n")

getpicurl("/zipai/-352")

Output results (one image URL per line; the hostnames and file names are truncated in the original article):

/mw1024/
/mw1024/
/mw1024/
...
Summary

This concludes this article on crawling web page information and images with Python regular expressions. For more on crawling with Python regular expressions, please search my previous posts or browse the related articles below, and I hope you will support me in the future!