Go can be combined with regular expressions to build efficient data-crawling tools. Below are several complete examples covering data-crawling needs in different scenarios.
1. Basic web content crawling
1.1 Get all links in the webpage
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

func main() {
    // Send the HTTP request (placeholder URL)
    resp, err := http.Get("https://example.com")
    if err != nil {
        fmt.Println("HTTP request failed:", err)
        return
    }
    defer resp.Body.Close()

    // Read the response body
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("Read response failed:", err)
        return
    }

    // Compile a regular expression that matches the href attribute of every <a> tag
    re := regexp.MustCompile(`<a[^>]+href=["'](.*?)["']`)
    matches := re.FindAllStringSubmatch(string(body), -1)

    // Output all links
    fmt.Println("Links found:")
    for _, match := range matches {
        if len(match) > 1 {
            fmt.Println(match[1])
        }
    }
}
1.2 Extract text matching a specific pattern
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

func main() {
    // Placeholder URL
    resp, err := http.Get("https://example.com")
    if err != nil {
        fmt.Println("HTTP request failed:", err)
        return
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)

    // Match the content of all <h1>-<h6> tags
    re := regexp.MustCompile(`<h[1-6][^>]*>(.*?)</h[1-6]>`)
    titles := re.FindAllStringSubmatch(string(body), -1)

    fmt.Println("Page headings:")
    for _, title := range titles {
        if len(title) > 1 {
            // Strip any nested HTML tags
            cleanTitle := regexp.MustCompile(`<[^>]+>`).ReplaceAllString(title[1], "")
            fmt.Println(cleanTitle)
        }
    }
}
2. Structured data crawling
2.1 Crawl table data
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
    "strings"
)

func main() {
    // Placeholder URL
    resp, err := http.Get("https://example.com/table-page")
    if err != nil {
        fmt.Println("HTTP request failed:", err)
        return
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)
    content := string(body)

    // Match the entire table ((?s) lets . match newlines, since tables usually span multiple lines)
    tableRe := regexp.MustCompile(`(?s)<table[^>]*>(.*?)</table>`)
    tableMatch := tableRe.FindStringSubmatch(content)
    if len(tableMatch) == 0 {
        fmt.Println("Table not found")
        return
    }
    tableContent := tableMatch[1]

    // Match table rows
    rowRe := regexp.MustCompile(`(?s)<tr[^>]*>(.*?)</tr>`)
    rows := rowRe.FindAllStringSubmatch(tableContent, -1)

    // Match cells (both <td> and <th>)
    cellRe := regexp.MustCompile(`(?s)<t[dh][^>]*>(.*?)</t[dh]>`)

    fmt.Println("Table data:")
    for _, row := range rows {
        cells := cellRe.FindAllStringSubmatch(row[1], -1)
        for _, cell := range cells {
            if len(cell) > 1 {
                // Clean the cell content
                cleanCell := strings.TrimSpace(regexp.MustCompile(`<[^>]+>`).ReplaceAllString(cell[1], ""))
                fmt.Printf("%s\t", cleanCell)
            }
        }
        fmt.Println() // Line break after each row
    }
}
2.2 Crawling specific fields in JSON data
package main

import (
    "encoding/json"
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

type Product struct {
    Name  string  `json:"name"`
    Price float64 `json:"price"`
}

func main() {
    // Placeholder URL
    resp, err := http.Get("https://example.com/products")
    if err != nil {
        fmt.Println("HTTP request failed:", err)
        return
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)

    // Method 1: parse the JSON directly
    var products []Product
    if err := json.Unmarshal(body, &products); err == nil {
        fmt.Println("Product list (JSON parsing):")
        for _, p := range products {
            fmt.Printf("%s - $%.2f\n", p.Name, p.Price)
        }
        return
    }

    // Method 2: fall back to regular expressions when the JSON structure is uncertain
    fmt.Println("\nTrying regular expression extraction:")

    // Match the product name and price
    re := regexp.MustCompile(`"name"\s*:\s*"([^"]+)"[^}]+"price"\s*:\s*(\d+\.?\d*)`)
    matches := re.FindAllStringSubmatch(string(body), -1)

    for _, match := range matches {
        if len(match) >= 3 {
            fmt.Printf("%s - $%s\n", match[1], match[2])
        }
    }
}
3. Advanced crawler techniques
3.1 A crawler with concurrency control
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
    "sync"
)

func main() {
    // Placeholder URLs
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    }

    var wg sync.WaitGroup
    semaphore := make(chan struct{}, 3) // Concurrency limit of 3

    titleRe := regexp.MustCompile(`<title[^>]*>(.*?)</title>`)

    for _, url := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            semaphore <- struct{}{} // Acquire a slot

            resp, err := http.Get(u)
            if err != nil {
                fmt.Printf("Failed to fetch %s: %v\n", u, err)
                <-semaphore
                return
            }
            body, _ := ioutil.ReadAll(resp.Body)
            resp.Body.Close()

            title := titleRe.FindStringSubmatch(string(body))
            if len(title) > 1 {
                fmt.Printf("%s title: %s\n", u, title[1])
            }

            <-semaphore // Release the slot
        }(url)
    }

    wg.Wait()
}
3.2 Handling paginated content
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
    "strconv"
)

func main() {
    // Placeholder base URL
    baseURL := "https://example.com/news?page="

    // (?s) lets . match newlines so items spanning several lines are captured
    pageRe := regexp.MustCompile(`(?s)<div class="news-item">(.*?)</div>`)
    titleRe := regexp.MustCompile(`<h2>(.*?)</h2>`)

    // Get the total number of pages first
    totalPages := getTotalPages(baseURL + "1")
    fmt.Printf("Found %d pages in total\n", totalPages)

    // Crawl each page
    for page := 1; page <= totalPages; page++ {
        url := baseURL + strconv.Itoa(page)
        fmt.Printf("\nCrawling page %d: %s\n", page, url)

        resp, err := http.Get(url)
        if err != nil {
            fmt.Printf("Failed to fetch page %d: %v\n", page, err)
            continue
        }
        body, _ := ioutil.ReadAll(resp.Body)
        resp.Body.Close()

        newsItems := pageRe.FindAllStringSubmatch(string(body), -1)
        for _, item := range newsItems {
            if len(item) > 1 {
                title := titleRe.FindStringSubmatch(item[1])
                if len(title) > 1 {
                    fmt.Println("News title:", title[1])
                }
            }
        }
    }
}

func getTotalPages(url string) int {
    resp, err := http.Get(url)
    if err != nil {
        return 1 // Default to a single page
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)

    // Assume the page contains text like "5 pages in total"; adjust the pattern
    // to whatever wording the target site actually uses
    re := regexp.MustCompile(`(\d+)\s*pages in total`)
    match := re.FindStringSubmatch(string(body))
    if len(match) > 1 {
        total, _ := strconv.Atoi(match[1])
        return total
    }
    return 1
}
Practical tips and precautions
1. User-Agent settings:
client := &http.Client{}
req, _ := http.NewRequest("GET", "https://example.com", nil) // placeholder URL
req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; MyBot/1.0)")
resp, _ := client.Do(req)
2. Process relative links:
import "net/url" base, _ := ("") rel, _ := ("/page1") absURL := (rel).String()
3. Regular expression optimization:
- Precompile regular expressions: re := regexp.MustCompile(pattern), created once and reused rather than recompiled on every call
- Use non-greedy matching (.*?) so a capture stops at the first closing delimiter (see the sketch below)
- Avoid overly complex regular expressions
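The sketch below illustrates the first two points with a small, hypothetical HTML string: the patterns are compiled once at package level, and the non-greedy variant captures one element per match while the greedy one overshoots.

package main

import (
    "fmt"
    "regexp"
)

// Precompile patterns once (at package level or before a loop) instead of on every call.
var (
    greedyRe = regexp.MustCompile(`<div[^>]*>(.*)</div>`)  // greedy: .* grabs as much as possible
    lazyRe   = regexp.MustCompile(`<div[^>]*>(.*?)</div>`) // non-greedy: .*? stops at the first </div>
)

func main() {
    html := `<div>first</div><div>second</div>` // hypothetical input

    // Greedy capture runs to the last </div> and swallows both elements.
    fmt.Println(greedyRe.FindStringSubmatch(html)[1]) // first</div><div>second

    // Non-greedy capture yields one element per match.
    for _, m := range lazyRe.FindAllStringSubmatch(html, -1) {
        fmt.Println(m[1]) // first, then second
    }
}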
4. Error handling enhancement:
resp, err := http.Get(url)
if err != nil {
    return fmt.Errorf("request failed: %w", err)
}
defer func() {
    if err := resp.Body.Close(); err != nil {
        log.Printf("closing the response body failed: %v", err)
    }
}()
Dealing with anti-crawler measures
Set a reasonable request interval:
import "time" func crawlWithDelay(urls []string, delay ) { for _, url := range urls { go crawlPage(url) (delay) } }
Using a proxy IP:
proxyUrl, _ := url.Parse("http://proxy-ip:port")
client := &http.Client{
    Transport: &http.Transport{
        Proxy: http.ProxyURL(proxyUrl),
    },
}
resp, _ := client.Get("https://example.com") // placeholder URL
Processing Cookies:
jar, _ := cookiejar.New(nil)
client := &http.Client{Jar: jar}

// The first request obtains the cookies (placeholder URLs)
client.Get("https://example.com/login")

// Subsequent requests through the same client carry the cookies automatically
client.Get("https://example.com/protected-page")
Summary
The examples above demonstrate a variety of ways to crawl data with regular expressions in Go:
- Basic web page crawling: collecting links and extracting specific content
- Structured data extraction: table data, JSON data
- Advanced techniques: concurrency control, pagination handling
- Practical tips: User-Agent settings, relative link resolution
- Anti-crawler countermeasures: request intervals, proxy IPs, cookie handling
For real projects, it is recommended to:
- Prefer APIs over HTML parsing when structured data is available
- Use a dedicated library such as goquery for complex HTML parsing (see the sketch after this list)
- Comply with the website's rules (e.g. robots.txt)
- Crawl at a reasonable frequency to avoid overloading the target site
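As a rough illustration of the goquery recommendation, here is a minimal sketch that extracts the same links as the regexp example in section 1.1, assuming the github.com/PuerkitoBio/goquery package and a placeholder URL:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("https://example.com") // placeholder URL
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Parse the HTML into a queryable document
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Select every <a> tag and print its href attribute
    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        if href, ok := s.Attr("href"); ok {
            fmt.Println(href)
        }
    })
}

Unlike the regular-expression approach, a CSS-selector query is unaffected by attribute order, extra whitespace, or nested tags.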
These examples can serve as basic templates to adjust and extend for specific needs.