
A simple example of urllib and BeautifulSoup crawling Wikipedia entries.

This article works through a simple example of using urllib and BeautifulSoup to crawl Wikipedia entries, as follows.

Simple code:

# Import the required packages
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Request the URL and decode the response as UTF-8
# (the base URL, stripped during extraction, is reconstructed here as the
# Chinese Wikipedia host that the encoded /wiki/ path points at)
resp = urlopen("https://zh.wikipedia.org/wiki/Wikipedia:%E9%A6%96%E9%A1%B5").read().decode("utf-8")

# Parse the page with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# print(soup)

# Get all <a> tags whose href attribute starts with /wiki/
listUrl = soup.find_all("a", href=re.compile("^/wiki/"))

# Output the name and URL of every entry, skipping links to .jpg images
for link in listUrl:
    if not re.search(r"\.(jpg|JPG)$", link["href"]):
        print(link.get_text(), "<----->", "https://zh.wikipedia.org" + link["href"])

Run results: the script prints one line per matching entry, giving the entry's name and its full URL separated by <----->.
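One practical note: Wikipedia's servers may answer requests carrying urllib's default client string with an HTTP 403, per its User-Agent policy. The sketch below (my addition, not part of the original article, and assuming the same reconstructed zh.wikipedia.org base URL) shows how to send an explicit User-Agent header, and how urllib.parse.urljoin builds absolute URLs more robustly than string concatenation.

# A minimal sketch of two refinements to the request/URL-building steps above.
# Assumes the same (reconstructed) zh.wikipedia.org base URL as the example.
from urllib.request import Request, urlopen
from urllib.parse import urljoin

BASE = "https://zh.wikipedia.org"

# Send an explicit User-Agent, since Wikipedia may reject urllib's
# default client string; the UA value here is purely illustrative.
req = Request(BASE + "/wiki/Wikipedia:%E9%A6%96%E9%A1%B5",
              headers={"User-Agent": "Mozilla/5.0"})
resp = urlopen(req).read().decode("utf-8")

# urljoin resolves relative hrefs against the base URL correctly,
# which is safer than plain string concatenation.
print(urljoin(BASE, "/wiki/Python"))  # -> https://zh.wikipedia.org/wiki/Python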

Summary

Overall, Python is concise yet powerful: with a few library calls it achieves what would take a great deal of code in other languages.

That is the whole of this simple example of crawling Wikipedia entries with urllib and BeautifulSoup; I hope it helps you. Interested readers can refer to other related topics on this site, and if anything is inadequate, please leave a message to point it out. Thank you for your support of this site!