SoFunction
Updated on 2024-11-15

Example: Crawling Ultraman Images with Python

Target URL: /allultraman/

Tools used: PyCharm, requests, parsel

Go to the page

Open Developer Tools

Click Network

Refresh the page so that the requests are captured

The Request URL is the URL we will crawl.

Scroll to the bottom of the request headers, find the User-Agent, and copy it.

Send a request to the server

A status code of 200 means the request was successful

Get the text data using response.text

As you can see, some of the text is garbled

Fix it by re-encoding with encode and then decoding as GBK

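import requests

# The site's domain is omitted in this article; use the full URL when running
url = '/allultraman/'

# Browser User-Agent copied from Developer Tools
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
}

response = requests.get(url=url, headers=headers)
html = response.text
# The page is GBK-encoded but was decoded as ISO-8859-1, so reverse that step
Html = html.encode('iso-8859-1').decode('gbk')
print(Html)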
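
The garbling comes from GBK bytes being decoded as ISO-8859-1. The round trip can be reproduced offline with the minimal sketch below; the sample string is an assumption chosen for illustration, not text taken from the site.

```python
# Illustrative only: the sample text is an assumption, not data from the site.
original = '奥特曼'

# Simulate GBK bytes being decoded with the wrong charset.
raw_bytes = original.encode('gbk')
garbled = raw_bytes.decode('iso-8859-1')

# ISO-8859-1 maps every byte to exactly one character, so encoding the
# garbled string back recovers the original bytes, which then decode as GBK.
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(garbled != original)   # the wrongly decoded text differs from the source
print(repaired == original)  # the round trip restores it
```

With requests, setting response.encoding = 'gbk' before reading response.text should give the same result without the manual round trip.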

Next, start crawling the data you need

Use XPath to get the page links

To use XPath, first import the parsel package.

import requests
import parsel

def get_response(html_url):
    """Request a page with a browser User-Agent and return the response."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
    }

    response = requests.get(url=html_url, headers=headers)
    return response

url = '/allultraman/'
response = get_response(url)
# Undo the wrong ISO-8859-1 decoding and decode the bytes as GBK
html = response.text.encode('iso-8859-1').decode('gbk')
selector = parsel.Selector(html)

period_hrefs = selector.xpath('//div[@class="btn"]/a/@href')  # Get links to the pages for all three eras

for period_href in period_hrefs:
    print(period_href.get())
 

As you can see, the links are relative, so prepend the base path manually: period_href = '/allultraman/' + period_href.get()
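
String concatenation works here, but urllib.parse.urljoin from the standard library can resolve ./ and ../ prefixes by itself. The sketch below assumes a placeholder host (example.com) and made-up link values, because the real domain and link targets are not shown in this article.

```python
from urllib.parse import urljoin

# Placeholder host: the real domain is not shown in this article.
BASE_URL = 'https://example.com/allultraman/'

# Hypothetical relative links of the kinds that appear in this tutorial.
era_url = urljoin(BASE_URL, './era1/')          # "./" means "same directory"
plain_url = urljoin(BASE_URL, 'era1/')          # no prefix resolves the same way
img_url = urljoin(BASE_URL + 'hero1/', '../img/hero1.png')  # "../" goes up one level

print(era_url)
print(plain_url)
print(img_url)
```

urljoin resolves a link relative to the URL passed as its first argument, so that argument should be the URL of the page on which the link was found.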

Go to one of these pages

As before, use XPath to get the link to each Ultraman's page

for period_href in period_hrefs:
    period_href = '/allultraman/' + period_href.get()
    # print(period_href)
    period_response = get_response(period_href).text
    period_html = parsel.Selector(period_response)
    lis = period_html.xpath('//div[@class="ultraheros-Contents_Generations"]/div/ul/li/a/@href')
    for li in lis:
        print(li.get())

Running it shows that these links are also relative

li = '/allultraman/' + li.get().replace('./','')

With that URL, repeat the same nested request-and-parse step to get the image URL

png_url = '/allultraman/' + li_selector.xpath('//div[@class="left"]/figure/img/@src').get().replace('../','')

Full Code

import requests
import parsel
import os

dirname = "Ultraman"
if not os.path.exists(dirname):     # Create the Ultraman folder if it does not already exist
    os.mkdir(dirname)


def get_response(html_url):
    """Request a page with a browser User-Agent and return the response."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
    }

    response = requests.get(url=html_url, headers=headers)
    return response

url = '/allultraman/'
response = get_response(url)
html = response.text.encode('iso-8859-1').decode('gbk')
selector = parsel.Selector(html)

period_hrefs = selector.xpath('//div[@class="btn"]/a/@href')  # Get links to the pages for all three eras

for period_href in period_hrefs:
    period_href = '/allultraman/' + period_href.get()

    period_html = get_response(period_href).text
    period_selector = parsel.Selector(period_html)
    lis = period_selector.xpath('//div[@class="ultraheros-Contents_Generations"]/div/ul/li/a/@href')
    for li in lis:
        li = '/allultraman/' + li.get().replace('./', '')     # Get the URL of every Ultraman
        # print(li)
        li_html = get_response(li).text
        li_selector = parsel.Selector(li_html)
        url = li_selector.xpath('//div[@class="left"]/figure/img/@src').get()
        # print(url)

        if url:
            png_url = '/allultraman/' + url.replace('../', '')
            png_title = li_selector.xpath('//ul[@class="lists"]/li[3]/text()').get()
            png_title = png_title.encode('iso-8859-1').decode('gbk')
            # print(li, png_title)
            png_content = get_response(png_url).content
            # Save the image bytes as <title>.png inside the Ultraman folder
            with open(os.path.join(dirname, f'{png_title}.png'), 'wb') as f:
                f.write(png_content)
            print(png_title, 'Image download complete')
        else:
            continue
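
The folder check and file write in the full code can also be wrapped in one small helper. The following is only a sketch: save_image is a hypothetical name, and the demo writes fake bytes to a temporary directory rather than a downloaded image.

```python
import os
import tempfile

def save_image(folder, title, content):
    """Write image bytes to folder/title.png, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)   # no error if the folder already exists
    path = os.path.join(folder, f'{title}.png')
    with open(path, 'wb') as f:
        f.write(content)
    return path

# Demo with fake bytes in a temporary directory (not real image data).
demo_dir = os.path.join(tempfile.mkdtemp(), 'Ultraman')
saved = save_image(demo_dir, 'demo', b'\x89PNG fake bytes')
print(os.path.exists(saved))
```

os.path.join picks the right path separator for the current operating system, so the same code runs on Windows, Linux, and macOS.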
 

When the crawler reaches Ultraman Nexus, the XPath below returns None. I spent a long time debugging without finding the cause, so the if url: check simply skips Ultraman Nexus. If anyone knows why this happens, please let me know!

url = li_selector.xpath('//div[@class="left"]/figure/img/@src').get()
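
The check matters because get() returns None when the XPath matches nothing, and calling replace() on None raises an AttributeError. Below is a minimal sketch of that guard; build_image_url is a hypothetical helper and the src value is made up. It does not explain why the Nexus page has no match.

```python
def build_image_url(base, src):
    """Return the full image path, or None when the XPath found no image."""
    if src is None:                     # get() matched nothing on this page
        return None
    return base + src.replace('../', '')

print(build_image_url('/allultraman/', '../img/hero.png'))
print(build_image_url('/allultraman/', None))
```

To investigate further, one option is to print li_html for that page and check whether a div with class "left" is actually present in the returned HTML.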
