Crawling URL: /allultraman/
Tools used: PyCharm, requests, parsel
Go to the page.
Open Developer Tools.
Click the Network tab.
Refresh the page to capture the request information.
The Request URL shown there is the URL we are crawling.
Scroll down to the bottom of the request headers, find the User-Agent, and copy it.
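The copied User-Agent is what goes into the request headers. A minimal sketch (this is the same UA string used in the code below):

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
}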
Send a request to the server
A status code of 200 means the request was successful.
Get the text data with response.text.
As you can see, some of the text comes back garbled.
Fix it by re-encoding: encode the text as ISO-8859-1, then decode it as GBK.
import requests

url = '/allultraman/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
# The page is GBK-encoded, so re-encode/decode the text to fix the garbled characters
html = response.text.encode('iso-8859-1').decode('gbk')
print(html)
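As a side note, an equivalent and slightly tidier option (an alternative sketch, not the article's original approach) is to tell requests the page's encoding up front and let response.text decode it:

import requests

url = '/allultraman/'  # truncated base path, as in the article
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
response.encoding = 'gbk'   # declare the page's real encoding
print(response.text)        # now decoded correctly, no encode/decode round-trip needed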
Next, start crawling the data you need
Use XPath to get the page links.
We use the parsel package for its XPath selectors, so import it first.
import requests
import parsel

def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response

url = '/allultraman/'
response = get_response(url)
html = response.text.encode('iso-8859-1').decode('gbk')
selector = parsel.Selector(html)
period_hrefs = selector.xpath('//div[@class="btn"]/a/@href')  # Get links to the pages for all three eras
for period_href in period_hrefs:
    print(period_href.get())
As you can see, the link is incomplete (relative), so complete it manually:
period_href = '/allultraman/' + period_href.get()
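A more robust way to build these links (my suggestion, not part of the original code) is urllib.parse.urljoin, which also resolves the './' prefixes that appear in the relative links later on:

from urllib.parse import urljoin

base = '/allultraman/'      # the article's (truncated) base path
href = './example.html'     # hypothetical relative link as returned by the XPath
print(urljoin(base, href))  # -> '/allultraman/example.html'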
Go to one of these era pages.
As before, use XPath to get the link to each Ultraman's own page.
for period_href in period_hrefs:
    period_href = '/allultraman/' + period_href.get()
    # print(period_href)
    period_response = get_response(period_href).text
    period_html = parsel.Selector(period_response)
    lis = period_html.xpath('//div[@class="ultraheros-Contents_Generations"]/div/ul/li/a/@href')
    for li in lis:
        print(li.get())
Running this reveals that these links are incomplete as well, so complete them the same way:
li = '/allultraman/' + li.get().replace('./', '')
After getting each Ultraman's URL, nest one more request into it and you can extract the picture data:
png_url = '/allultraman/' + li_selector.xpath('//div[@class="left"]/figure/img/@src').get().replace('../','')
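Downloading the picture itself is then just a binary GET plus a binary write, exactly as the full code below does. A minimal sketch of that step (the output file name here is a placeholder):

png_content = get_response(png_url).content  # .content returns the raw bytes of the image
with open('example.png', 'wb') as f:         # 'wb' because image data is binary
    f.write(png_content)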
Full Code
import requests
import parsel
import os

dirname = "Ultraman"
if not os.path.exists(dirname):  # Check whether a folder named Ultraman exists; create it if it doesn't
    os.mkdir(dirname)

def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response

url = '/allultraman/'
response = get_response(url)
html = response.text.encode('iso-8859-1').decode('gbk')
selector = parsel.Selector(html)
period_hrefs = selector.xpath('//div[@class="btn"]/a/@href')  # Get links to the pages for all three eras

for period_href in period_hrefs:
    period_href = '/allultraman/' + period_href.get()
    period_html = get_response(period_href).text
    period_selector = parsel.Selector(period_html)
    lis = period_selector.xpath('//div[@class="ultraheros-Contents_Generations"]/div/ul/li/a/@href')
    for li in lis:
        li = '/allultraman/' + li.get().replace('./', '')  # Get the URL of every Ultraman
        # print(li)
        li_html = get_response(li).text
        li_selector = parsel.Selector(li_html)
        url = li_selector.xpath('//div[@class="left"]/figure/img/@src').get()
        # print(url)
        if url:
            png_url = '/allultraman/' + url.replace('../', '')
            png_title = li_selector.xpath('//ul[@class="lists"]/li[3]/text()').get()
            png_title = png_title.encode('iso-8859-1').decode('gbk')
            # print(li, png_title)
            png_content = get_response(png_url).content
            with open(f'{dirname}\\{png_title}.png', 'wb') as f:
                f.write(png_content)
            print(png_title, 'Image download complete')
        else:
            continue
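One small tweak worth considering (my suggestion, not in the original): build the save path with os.path.join instead of a hard-coded Windows backslash, so the script also runs on Linux/macOS:

save_path = os.path.join(dirname, f'{png_title}.png')  # portable path separator
with open(save_path, 'wb') as f:
    f.write(png_content)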
When the crawler reaches Ultraman Nexus, this XPath returns None. I spent half a day adjusting it without figuring out why, so I added the if url: statement to skip Nexus. If anyone knows the reason, please let me know! The line in question is:
url = li_selector.xpath('//div[@class="left"]/figure/img/@src').get()
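For what it's worth, one way to investigate the Nexus page (a debugging sketch, not a confirmed fix; the broader '//figure/img/@src' query is an assumption about the page layout) is to print what the selector actually sees whenever the primary XPath matches nothing:

url = li_selector.xpath('//div[@class="left"]/figure/img/@src').get()
if url is None:
    # Print every candidate image src so the difference in that page's markup becomes visible
    print(li, '-> no match; candidates:', li_selector.xpath('//figure/img/@src').getall())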
That concludes this example of crawling Ultraman pictures with Python. For more related content, please search my previous posts or browse my other articles. I hope you will continue to support me!