
A simple crawler example with Python and the BeautifulSoup library

A brief introduction to the functions that will be used

1. from bs4 import BeautifulSoup

# Import the BeautifulSoup class from the bs4 package

2. Request headers

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36', 'referer': ""}
all_url = '/'
'User-Agent': identifies the client (browser) making the request
'referer': tells the server which page the request came from

3. Establish the connection

start_html = requests.get(all_url, headers=headers)
all_url: the starting address, i.e. the first page to visit
headers: the request headers, which tell the server who is visiting
requests.get(): fetches all_url and returns the response

4. Parse the fetched page

Soup = BeautifulSoup(start_html.text, 'lxml')
BeautifulSoup: parses the page
lxml: the parser to use
start_html.text: the text content of the response
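
Putting steps 2 to 4 together, a minimal self-contained sketch might look like the following (the URL http://example.com/ is only a placeholder, not one of the sites used in this article):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36', 'referer': ''}
all_url = 'http://example.com/'  # placeholder starting address
start_html = requests.get(all_url, headers=headers)  # send the GET request
Soup = BeautifulSoup(start_html.text, 'lxml')  # parse the returned HTML
print(Soup.title)  # quick check that parsing worked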

5. Processing the fetched page

all_a = Soup.find('div', class_='pic').find_all('a')[-2]
find(): finds the first matching tag
find_all(): finds all matches and returns a list
.find('img')['src']: gets the src attribute of the <img> tag
class_: the class name of the target element (written with a trailing underscore because class is a Python keyword)
div / a: the tag name to match
[-2]: indexes the list returned by find_all() to skip unwanted trailing matches; here it takes the second-to-last <a> tag

6. Getting the target content

<a href="#">content</a>
a[i].get_text(): gets the text inside the i-th <a> tag
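
For instance, here is a short self-contained snippet; the HTML string is made up purely for illustration:

from bs4 import BeautifulSoup

html = '<div class="pic"><a href="/page/1">first</a><a href="/page/2"><img src="/img/2.jpg"></a><a href="/page/3">next</a></div>'
soup = BeautifulSoup(html, 'lxml')
pic_div = soup.find('div', class_='pic')  # first matching <div>
all_a = pic_div.find_all('a')  # list of all <a> tags inside it
print(all_a[-2].find('img')['src'])  # prints /img/2.jpg
print(all_a[0].get_text())  # prints first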

7. Introduction to other functions that may be used:

1. Folder creation and switching

os.makedirs(os.path.join("E:\\name", filename))
# Create a folder named filename under the directory E:\name
os.chdir("E:\\name\\" + filename)
# Switch the working directory to E:\name\filename
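
A slightly safer sketch creates the folder only when it does not already exist; the drive path and folder name here are placeholders:

import os

filename = 'demo'  # placeholder folder name
target = os.path.join("E:\\name", filename)
if not os.path.exists(target):  # create the folder only if it is missing
    os.makedirs(target)
os.chdir(target)  # switch the working directory there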

2. File saving

f = open(name + '.jpg', 'ab')  ## The b (binary) mode is required when writing multimedia files!
f.write(img.content)  ## Use .content (the raw bytes), not .text, for multimedia files!
f.close()
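
A small end-to-end sketch that downloads one image and saves it to disk; the image URL is a placeholder:

import requests

img_url = 'http://example.com/demo.jpg'  # placeholder image address
img = requests.get(img_url)  # download the image
with open('demo.jpg', 'wb') as f:  # binary mode for image data
    f.write(img.content)  # .content holds the raw bytes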

Worked example: crawling images from Mzitu (a girl-picture gallery site)

  
import requests
from bs4 import BeautifulSoup
import os
# Import the required modules
class mzitu():
  def all_url(self, url):
    html = self.request(url)  ## Call the request function defined below
    all_a = BeautifulSoup(html.text, 'lxml').find('div', class_='all').find_all('a')
    for a in all_a:
      title = a.get_text()
      print('------ start saving:', title) 
      path = str(title).replace("?", '_')  ## Replace any "?" that appears in the title
      self.mkdir(path)  ## Call the mkdir function to create a folder; path here is the title
      href = a['href']
      self.html(href)  ## Call the html function below to process this album

  def html(self, href):  ## Get the page address of the image
    html = self.request(href)
    max_span = BeautifulSoup(html.text, 'lxml').find('div', class_='pagenavi').find_all('span')[-2].get_text()
    #This one's mentioned above
    for page in range(1, int(max_span) + 1):
      page_url = href + '/' + str(page)
      self.img(page_url)  ## Call the img function

  def img(self, page_url): ## Process image page addresses to get the actual address of the image
    img_html = self.request(page_url)
    img_url = BeautifulSoup(img_html.text, 'lxml').find('div', class_='main-image').find('img')['src']
    self.save(img_url)

  def save(self, img_url): ##Save the picture
    name = img_url[-9:-4]
    img = self.request(img_url)
    f = open(name + '.jpg', 'ab')
    f.write(img.content)
    f.close()

  def mkdir(self, path): ##Creating a folder
    path = path.strip()
    isExists = os.path.exists(os.path.join("E:\\mzitu2", path))
    if not isExists:
      print('Created a folder named', path)
      os.makedirs(os.path.join("E:\\mzitu2", path))
      os.chdir(os.path.join("E:\\mzitu2", path))  ## Switch into the new directory
      return True
    else:
      print(path, 'The folder already exists!')
      return False

  def request(self, url): ##This function gets the response of a web page and returns it
    headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
      'referer': ''  # Fake an access source, e.g. "/100260/2"
    }
    content = requests.get(url, headers=headers)
    return content
# Setting up the startup function
def main():
  Mzitu = mzitu() ## Instantiation
  Mzitu.all_url('/all') ## Pass parameters to function all_url

main()