This article works through a simple example of using urllib and BeautifulSoup to crawl Wikipedia entries, as follows.
Simple code:
# Import the development packages
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Request the URL and decode the result as UTF-8
# (the host is assumed to be zh.wikipedia.org, matching the /wiki/ paths used below)
resp = urlopen("https://zh.wikipedia.org/wiki/Wikipedia:%E9%A6%96%E9%A1%B5").read().decode("utf-8")

# Use BeautifulSoup to parse it
soup = BeautifulSoup(resp, "html.parser")
# print(soup)

# Get the href attribute of every <a> tag whose href starts with /wiki/
listUrl = soup.findAll("a", href=re.compile("^/wiki/"))

# Output the name and full URL of each entry, skipping image links
for link in listUrl:
    if not re.search(r"\.(jpg|JPG)$", link["href"]):
        print(link.get_text(), "<----->", "https://zh.wikipedia.org" + link["href"])
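If you want to keep the crawled entries rather than just print them, a natural next step is to deduplicate the links and write them to a file. The following is a minimal sketch building on the code above; the filename entries.csv is an arbitrary choice, and the zh.wikipedia.org base URL is an assumption carried over from the example, not something fixed by the original article.

# A minimal sketch: collect unique entries and save them to a CSV file.
# Assumes the soup object and the re import from the code above;
# "entries.csv" is a hypothetical output name.
import csv

seen = set()
with open("entries.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url"])
    for link in soup.findAll("a", href=re.compile("^/wiki/")):
        href = link["href"]
        # Skip image links and entries we have already recorded
        if re.search(r"\.(jpg|JPG)$", href) or href in seen:
            continue
        seen.add(href)
        writer.writerow([link.get_text(), "https://zh.wikipedia.org" + href])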
Run results: (the original article shows a screenshot of the printed entry names and URLs here)
Summary
Overall, Python is concise yet powerful: by calling a few libraries, it achieves what would take a great deal of code in other languages.
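As a rough illustration of that concision, the same crawl can also be written with the third-party requests library in place of urllib. This is a sketch of an alternative, not the article's original code, and it again assumes the zh.wikipedia.org host.

# A sketch of the same crawl using requests instead of urllib.
# requests is a third-party package (pip install requests).
import re
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://zh.wikipedia.org/wiki/Wikipedia:%E9%A6%96%E9%A1%B5")
soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.findAll("a", href=re.compile("^/wiki/")):
    if not re.search(r"\.(jpg|JPG)$", link["href"]):
        print(link.get_text(), "<----->", "https://zh.wikipedia.org" + link["href"])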
That is the full content of this article's simple example of crawling Wikipedia entries with urllib and BeautifulSoup; I hope it helps. Interested readers can refer to other related topics on this site, and if anything is inadequate, please leave a comment to point it out. Thanks for your support of this site!