Go can be combined with regular expressions to build efficient data-crawling tools. Below are several complete examples covering data-crawling needs in different scenarios.
1. Basic web content crawling
1.1 Get all links in the webpage
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

func main() {
    // Send the HTTP request (placeholder URL)
    resp, err := http.Get("https://example.com")
    if err != nil {
        fmt.Println("HTTP request failed:", err)
        return
    }
    defer resp.Body.Close()

    // Read the response body
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("Read response failed:", err)
        return
    }

    // Compile a regular expression that matches the href attribute of every <a> tag
    re := regexp.MustCompile(`<a[^>]+href=["'](.*?)["']`)
    matches := re.FindAllStringSubmatch(string(body), -1)

    // Output all links
    fmt.Println("Links found:")
    for _, match := range matches {
        if len(match) > 1 {
            fmt.Println(match[1])
        }
    }
}
1.2 Extract text matching a specific pattern
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

func main() {
    // Placeholder URL
    resp, err := http.Get("https://example.com")
    if err != nil {
        fmt.Println("HTTP request failed:", err)
        return
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)

    // Match the content of all <h1>-<h6> tags
    re := regexp.MustCompile(`<h[1-6][^>]*>(.*?)</h[1-6]>`)
    titles := re.FindAllStringSubmatch(string(body), -1)

    fmt.Println("Page headings:")
    for _, title := range titles {
        if len(title) > 1 {
            // Strip any nested HTML tags
            cleanTitle := regexp.MustCompile(`<[^>]+>`).ReplaceAllString(title[1], "")
            fmt.Println(cleanTitle)
        }
    }
}
2. Structured data crawling
2.1 Crawl table data
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
    "strings"
)

func main() {
    // Placeholder URL
    resp, err := http.Get("https://example.com/table-page")
    if err != nil {
        fmt.Println("HTTP request failed:", err)
        return
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)
    content := string(body)

    // Match the entire table ((?s) lets . match newlines, since tables usually span multiple lines)
    tableRe := regexp.MustCompile(`(?s)<table[^>]*>(.*?)</table>`)
    tableMatch := tableRe.FindStringSubmatch(content)
    if len(tableMatch) == 0 {
        fmt.Println("Table not found")
        return
    }
    tableContent := tableMatch[1]

    // Match table rows
    rowRe := regexp.MustCompile(`(?s)<tr[^>]*>(.*?)</tr>`)
    rows := rowRe.FindAllStringSubmatch(tableContent, -1)

    // Match cells (both <td> and <th>)
    cellRe := regexp.MustCompile(`(?s)<t[dh][^>]*>(.*?)</t[dh]>`)

    fmt.Println("Table data:")
    for _, row := range rows {
        cells := cellRe.FindAllStringSubmatch(row[1], -1)
        for _, cell := range cells {
            if len(cell) > 1 {
                // Clean the cell content
                cleanCell := strings.TrimSpace(regexp.MustCompile(`<[^>]+>`).ReplaceAllString(cell[1], ""))
                fmt.Printf("%s\t", cleanCell)
            }
        }
        fmt.Println() // Line break after each row
    }
}
2.2 Crawling specific fields in JSON data
package main

import (
    "encoding/json"
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

type Product struct {
    Name  string  `json:"name"`
    Price float64 `json:"price"`
}

func main() {
    // Placeholder URL
    resp, err := http.Get("https://example.com/products")
    if err != nil {
        fmt.Println("HTTP request failed:", err)
        return
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)

    // Method 1: parse the JSON directly
    var products []Product
    if err := json.Unmarshal(body, &products); err == nil {
        fmt.Println("Product list (JSON parsing):")
        for _, p := range products {
            fmt.Printf("%s - $%.2f\n", p.Name, p.Price)
        }
        return
    }

    // Method 2: fall back to regular expressions when the JSON structure is uncertain
    fmt.Println("\nTrying regular expression extraction:")

    // Match the product name and price
    re := regexp.MustCompile(`"name"\s*:\s*"([^"]+)"[^}]+"price"\s*:\s*(\d+\.?\d*)`)
    matches := re.FindAllStringSubmatch(string(body), -1)

    for _, match := range matches {
        if len(match) >= 3 {
            fmt.Printf("%s - $%s\n", match[1], match[2])
        }
    }
}
3. Advanced crawler techniques
3.1 A crawler with concurrency control
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
    "sync"
)

func main() {
    // Placeholder URLs
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    }

    var wg sync.WaitGroup
    semaphore := make(chan struct{}, 3) // Concurrency limit of 3

    titleRe := regexp.MustCompile(`<title[^>]*>(.*?)</title>`)

    for _, url := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            semaphore <- struct{}{} // Acquire a slot

            resp, err := http.Get(u)
            if err != nil {
                fmt.Printf("Failed to fetch %s: %v\n", u, err)
                <-semaphore
                return
            }
            body, _ := ioutil.ReadAll(resp.Body)
            resp.Body.Close()

            title := titleRe.FindStringSubmatch(string(body))
            if len(title) > 1 {
                fmt.Printf("%s title: %s\n", u, title[1])
            }

            <-semaphore // Release the slot
        }(url)
    }

    wg.Wait()
}
3.2 Handling paginated content
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
    "strconv"
)

func main() {
    // Placeholder base URL
    baseURL := "https://example.com/news?page="

    // (?s) lets . match newlines so items spanning several lines are captured
    pageRe := regexp.MustCompile(`(?s)<div class="news-item">(.*?)</div>`)
    titleRe := regexp.MustCompile(`<h2>(.*?)</h2>`)

    // Get the total number of pages first
    totalPages := getTotalPages(baseURL + "1")
    fmt.Printf("Found %d pages in total\n", totalPages)

    // Crawl each page
    for page := 1; page <= totalPages; page++ {
        url := baseURL + strconv.Itoa(page)
        fmt.Printf("\nCrawling page %d: %s\n", page, url)

        resp, err := http.Get(url)
        if err != nil {
            fmt.Printf("Failed to fetch page %d: %v\n", page, err)
            continue
        }
        body, _ := ioutil.ReadAll(resp.Body)
        resp.Body.Close()

        newsItems := pageRe.FindAllStringSubmatch(string(body), -1)
        for _, item := range newsItems {
            if len(item) > 1 {
                title := titleRe.FindStringSubmatch(item[1])
                if len(title) > 1 {
                    fmt.Println("News title:", title[1])
                }
            }
        }
    }
}

func getTotalPages(url string) int {
    resp, err := http.Get(url)
    if err != nil {
        return 1 // Default to a single page
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)

    // Assume the page contains text like "5 pages in total"; adjust the pattern
    // to whatever wording the target site actually uses
    re := regexp.MustCompile(`(\d+)\s*pages in total`)
    match := re.FindStringSubmatch(string(body))
    if len(match) > 1 {
        total, _ := strconv.Atoi(match[1])
        return total
    }
    return 1
}
Practical tips and precautions
1. User-Agent settings:
client := &http.Client{}
req, _ := http.NewRequest("GET", "https://example.com", nil) // placeholder URL
req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; MyBot/1.0)")
resp, _ := client.Do(req)
2. Process relative links:
import "net/url" base, _ := ("") rel, _ := ("/page1") absURL := (rel).String()
3. Regular expression optimization:
- Precompile regular expressions: re := regexp.MustCompile(pattern), created once and reused rather than recompiled on every call
- Use non-greedy matching (.*?) so a capture stops at the first closing delimiter (see the sketch below)
- Avoid overly complex regular expressions
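The sketch below illustrates the first two points with a small, hypothetical HTML string: the patterns are compiled once at package level, and the non-greedy variant captures one element per match while the greedy one overshoots.

package main

import (
    "fmt"
    "regexp"
)

// Precompile patterns once (at package level or before a loop) instead of on every call.
var (
    greedyRe = regexp.MustCompile(`<div[^>]*>(.*)</div>`)  // greedy: .* grabs as much as possible
    lazyRe   = regexp.MustCompile(`<div[^>]*>(.*?)</div>`) // non-greedy: .*? stops at the first </div>
)

func main() {
    html := `<div>first</div><div>second</div>` // hypothetical input

    // Greedy capture runs to the last </div> and swallows both elements.
    fmt.Println(greedyRe.FindStringSubmatch(html)[1]) // first</div><div>second

    // Non-greedy capture yields one element per match.
    for _, m := range lazyRe.FindAllStringSubmatch(html, -1) {
        fmt.Println(m[1]) // first, then second
    }
}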
4. Error handling enhancement:
resp, err := http.Get(url)
if err != nil {
    return fmt.Errorf("request failed: %w", err)
}
defer func() {
    if err := resp.Body.Close(); err != nil {
        log.Printf("closing the response body failed: %v", err)
    }
}()
Dealing with anti-crawler measures
Set a reasonable request interval:
import "time" func crawlWithDelay(urls []string, delay ) { for _, url := range urls { go crawlPage(url) (delay) } }
Using a proxy IP:
proxyUrl, _ := url.Parse("http://proxy-ip:port")
client := &http.Client{
    Transport: &http.Transport{
        Proxy: http.ProxyURL(proxyUrl),
    },
}
resp, _ := client.Get("https://example.com") // placeholder URL
Processing Cookies:
jar, _ := cookiejar.New(nil)
client := &http.Client{Jar: jar}

// The first request obtains the cookies (placeholder URLs)
client.Get("https://example.com/login")

// Subsequent requests through the same client carry the cookies automatically
client.Get("https://example.com/protected-page")
Summary
The examples above demonstrate a variety of ways to crawl data with regular expressions in Go:
- Basic web page crawling: collecting links and extracting specific content
- Structured data extraction: table data, JSON data
- Advanced techniques: concurrency control, pagination handling
- Practical tips: User-Agent settings, relative link resolution
- Anti-crawler countermeasures: request intervals, proxy IPs, cookie handling
For real projects, it is recommended to:
- Prefer APIs over HTML parsing when structured data is available
- Use a dedicated library such as goquery for complex HTML parsing (see the sketch after this list)
- Comply with the website's rules (e.g. robots.txt)
- Crawl at a reasonable frequency to avoid overloading the target site
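As a rough illustration of the goquery recommendation, here is a minimal sketch that extracts the same links as the regexp example in section 1.1, assuming the github.com/PuerkitoBio/goquery package and a placeholder URL:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("https://example.com") // placeholder URL
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Parse the HTML into a queryable document
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Select every <a> tag and print its href attribute
    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        if href, ok := s.Attr("href"); ok {
            fmt.Println(href)
        }
    })
}

Unlike the regular-expression approach, a CSS-selector query is unaffected by attribute order, extra whitespace, or nested tags.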
These examples can serve as basic templates to adjust and extend for specific needs.