A brief introduction to the functions that will be used
1. from bs4 import BeautifulSoup  # import the BeautifulSoup parser from the bs4 package
2. Request headers

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
           'referer': ""}
all_url = '/'

'User-Agent': identifies the client making the request
'referer': tells the server which link you jumped from
3. Establish the connection

start_html = requests.get(all_url, headers=headers)

all_url: the starting address, i.e. the first page visited
headers: the request headers, which tell the server who is visiting
requests.get: a method that fetches all_url and returns the content of the page
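As a sketch of what such a request carries, you can build it with the requests library without actually sending it (the URL http://example.com/ below is just a placeholder, not part of the tutorial's target site):

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
           'referer': ''}

# Prepare (but do not send) a GET request, so the headers can be inspected offline
req = requests.Request('GET', 'http://example.com/', headers=headers).prepare()
print(req.url)                    # the address that would be fetched
print(req.headers['User-Agent'])  # the browser identity the server would see
```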
4. Parse the page obtained

Soup = BeautifulSoup(start_html.text, 'lxml')

BeautifulSoup: parses the page
lxml: the parser backend
start_html.text: the text content of the response
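A minimal, self-contained parsing example (the HTML snippet is made up, and the built-in 'html.parser' backend is used here so nothing beyond bs4 needs to be installed; in the tutorial, start_html.text plays the role of `page`):

```python
from bs4 import BeautifulSoup

page = "<html><body><div class='pic'><img src='/a.jpg'></div></body></html>"

soup = BeautifulSoup(page, 'html.parser')  # 'html.parser' needs no extra install
print(soup.find('img')['src'])             # the src attribute of the img tag
```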
5. Processing the acquired page

all_a = Soup.find('div', class_='pic').find_all('a')[-2]

find(): finds the first match
find_all(): finds all matches and returns a list
.find('img')['src']: gets the src attribute of an img tag
class_: the class name of the target
div / a: the tag-type condition, i.e. match div or a tags
[-2]: indexes from the end of the list; it can be used to skip extra trailing matches, here selecting the second-to-last a tag
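The find / find_all / [-2] combination can be tried on a made-up snippet (the hrefs below are placeholders):

```python
from bs4 import BeautifulSoup

snippet = ("<div class='pic'>"
           "<a href='/1'>one</a><a href='/2'>two</a>"
           "<a href='/3'>three</a><a href='/4'>four</a>"
           "</div>")

soup = BeautifulSoup(snippet, 'html.parser')
links = soup.find('div', class_='pic').find_all('a')  # list of every <a> tag
print(len(links))        # four matches in total
print(links[-2]['href']) # negative index: the second-to-last element
```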
6. Access the target content

<a href =# >content</a>

a[i].get_text(): gets the text inside the i-th a tag
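A quick illustration of pulling the text out of the i-th tag (the snippet is made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a href='#'>first</a><a href='#'>second</a>", 'html.parser')
a = soup.find_all('a')   # a[i] is the i-th <a> tag (0-based)
print(a[1].get_text())   # the text between the opening and closing tags
```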
7. Introduction to other functions that may be used:
1. Folder creation and switching

os.makedirs(os.path.join(r"E:\name", filename))  # create a folder named filename in the directory E:\name (a raw string avoids \n being read as a newline)
os.chdir(os.path.join(r"E:\name", filename))     # switch the working path to E:\name\filename
2. File saving

f = open(name + '.jpg', 'ab')  ## the b parameter is required for writing multimedia (binary) files!
f.write(img.content)           ## for multimedia files, use .content (bytes), not .text!
f.close()
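A self-contained sketch of that write pattern, using fake bytes and a temporary path instead of a real download (data and path below are placeholders; in the tutorial the bytes come from img.content):

```python
import os
import tempfile

data = b'\x89PNG fake image bytes'   # stands in for img.content from requests
path = os.path.join(tempfile.mkdtemp(), 'demo.jpg')

f = open(path, 'ab')  # 'a' appends, 'b' opens binary mode; b is what matters for bytes
f.write(data)         # .write() takes bytes in binary mode
f.close()

with open(path, 'rb') as check:
    saved = check.read()
print(saved == data)  # the bytes round-trip intact
```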
Worked example: crawling the Mzitu photo gallery
import requests
from bs4 import BeautifulSoup
import os  # Import the required modules


class mzitu():

    def all_url(self, url):
        html = self.request(url)  ## call the request function to fetch the index page
        all_a = BeautifulSoup(html.text, 'lxml').find('div', class_='all').find_all('a')
        for a in all_a:
            title = a.get_text()
            print('------ start saving:', title)
            path = str(title).replace("?", '_')  ## replace any '?' the title comes with
            self.mkdir(path)  ## call the mkdir function to create a folder! Here, path stands for the title.
            href = a['href']
            self.html(href)

    def html(self, href):  ## Get the page addresses of the images
        html = self.request(href)
        max_span = BeautifulSoup(html.text, 'lxml').find('div', class_='pagenavi').find_all('span')[-2].get_text()  # the [-2] trick mentioned above
        for page in range(1, int(max_span) + 1):
            page_url = href + '/' + str(page)
            self.img(page_url)  ## call the img function

    def img(self, page_url):  ## Process an image page address to get the actual address of the image
        img_html = self.request(page_url)
        img_url = BeautifulSoup(img_html.text, 'lxml').find('div', class_='main-image').find('img')['src']
        self.save(img_url)

    def save(self, img_url):  ## Save the picture
        name = img_url[-9:-4]
        img = self.request(img_url)
        f = open(name + '.jpg', 'ab')
        f.write(img.content)
        f.close()

    def mkdir(self, path):  ## Create a folder
        path = path.strip()
        isExists = os.path.exists(os.path.join(r"E:\mzitu2", path))
        if not isExists:
            print('Created a folder named', path, '!')
            os.makedirs(os.path.join(r"E:\mzitu2", path))
            os.chdir(os.path.join(r"E:\mzitu2", path))  ## switch to the directory
            return True
        else:
            print(path, 'folder already exists!')
            return False

    def request(self, url):  ## This function gets the response of a web page and returns it
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            'referer':  # Fake an access source
                "/100260/2"
        }
        content = requests.get(url, headers=headers)
        return content


# Setting up the startup function
def main():
    Mzitu = mzitu()  ## Instantiation
    Mzitu.all_url('/all')  ## Pass the starting address to all_url

main()