
Python crawler basics: a simple web page collector

Simple Web Page Collector

We have already written a simple crawler that fetches a browser page. In practice, though, our needs are rarely as simple as crawling the Sogou home page or the Bilibili home page; whatever the case, we want to be able to crawl a specific page that contains the information we are after.

I don't know whether, like me, you tried to crawl a search results page such as Baidu's right after learning the basics. A page like this one:

[Screenshot: a Baidu search results page, with the address bar URL underlined in red]

Notice the part underlined in red: that is the page I opened. Now I want to crawl the data on this page. Based on the code we learned earlier, it should look something like this:

import requests

if __name__ == "__main__":
    # Specify the URL (a Baidu search results page; the domain is restored here)
    url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=2&tn=93923645_hao_pg&wd=%E5%A5%A5%E7%89%B9%E6%9B%BC&rsv_spt=1&oq=%25E7%2588%25AC%25E5%258F%2596%25E7%2599%25BE%25E5%25BA%25A6%25E9%25A6%2596%25E9%25A1%25B5&rsv_pq=b233dcfd0002d2d8&rsv_t=ccdbEuqbJfqtjnkFvevj%2BfxQ0Sj2UP88ixXHTNUNsmTa9yWEWTUEgxTta9r%2Fj3mXxDs%2BT1SU&rqlang=cn&rsv_dl=tb&rsv_enter=1&rsv_sug3=8&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&rsv_btype=t&inputT=1424&rsv_sug4=1424"

    # Send the request
    response = requests.get(url)

    # Get the data (the response body as text)
    page_text = response.text

    # Store it
    with open("./Otman.html", "w", encoding="utf-8") as fp:
        fp.write(page_text)

    print("Crawl successful!!!")

However, when we open the saved file, we find that the result is not quite what we expected.

[Screenshot: the saved HTML file opens as a blank page]

The file we saved is a blank page. Why is that?

In fact, swapping the URL for a Sogou one may make this even more intuitive (for some reason Sogou would not open on my end, so I used Baidu as the example; you can write the Sogou search version yourself). Running the same code with a Sogou URL gives a result like this:

[Screenshot: Sogou's response page, which reports abnormal access from the network]

We find that one of the sentences says "there is abnormal access from the network".

What this means is that Sogou or Baidu noticed that it was a crawler program that sent the request, not a human operator.

So what is the rationale for this?

Simply put, a request sent by a program looks different from one sent through a browser. The server being requested relies on the User-Agent header to determine the identity of the visitor: if it appears to be a browser, the request is accepted; otherwise it may be rejected. This is a very common anti-crawling mechanism.
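You can see why the server can tell the difference by inspecting the User-Agent that requests sends by default when you set nothing yourself (a minimal sketch; the exact version string depends on your installed requests version):

import requests

with requests.Session() as s:
    # These are the default headers requests attaches when you provide none
    print(s.headers["User-Agent"])  # e.g. "python-requests/2.31.0"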

Does that mean there's nothing we can do?

But as the saying goes, however high the devil climbs, the Dao rises higher. Since the server identifies visitors by their user-agent, we will simply have the crawler imitate a browser's user-agent.

In Python, data we want to pass along with a request, such as the user-agent header, is usually supplied as a dictionary.

That's how it's written:

header = {
    "user-agent": ""  # the user-agent value is a long string
}
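For reference, a desktop Chrome user-agent looks roughly like the following (the version numbers here are only illustrative and will differ on your machine; copy your own value using the steps below):

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}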

So how do you get the user-agent value?

1. Open any web page, right-click, and select "Inspect".

2. Select "Network" (Google Chrome)(If it's Chinese, select the "Network" item)

[Screenshot: the developer tools panel with the Network tab selected]

3. If you find that the Network panel is blank, like this, refresh the page.

[Screenshot: an empty Network panel before refreshing]

After refreshing it looks like this:

[Screenshot: the Network panel filled with requests after refreshing]

Then click any of the items circled in red. You will see something like the following; find the "user-agent" entry in the request headers and copy its value.

[Screenshot: the request headers, with the user-agent value highlighted]

With "user-agent", we're rewriting our code for crawling the web page, and we're good to go

import requests

if __name__ == "__main__":
    # Specify the URL (the same Baidu search results page as before)
    url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=2&tn=93923645_hao_pg&wd=%E5%A5%A5%E7%89%B9%E6%9B%BC&rsv_spt=1&oq=%25E7%2588%25AC%25E5%258F%2596%25E7%2599%25BE%25E5%25BA%25A6%25E9%25A6%2596%25E9%25A1%25B5&rsv_pq=b233dcfd0002d2d8&rsv_t=ccdbEuqbJfqtjnkFvevj%2BfxQ0Sj2UP88ixXHTNUNsmTa9yWEWTUEgxTta9r%2Fj3mXxDs%2BT1SU&rqlang=cn&rsv_dl=tb&rsv_enter=1&rsv_sug3=8&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&rsv_btype=t&inputT=1424&rsv_sug4=1424"

    # Imitate a browser's "user-agent", i.e. UA masquerading
    header = {
        "user-agent": ""  # paste the copied user-agent value here
    }

    # Send the request with the masqueraded headers
    response = requests.get(url, headers=header)

    # Get the data
    page_text = response.text

    # Store it
    with open("./Ultraman (UA masquerade).html", "w", encoding="utf-8") as fp:
        fp.write(page_text)

    print("Crawl successful!!!")

Run it again and open the file

[Screenshot: the saved page now renders as the real search results page]

This time it worked, which means our crawler program fooled the server perfectly.
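If you would rather confirm success programmatically than by opening the file, you can also check the response before writing it out (a small optional addition to the script above; status_code and raise_for_status are standard requests features):

    # Optional checks, placed right after requests.get(...) in the script above
    response.raise_for_status()      # raises an HTTPError for 4xx/5xx responses
    print(response.status_code)      # expect 200
    print(len(response.text))        # a blocked or blank page is usually very short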

That is all for this article on the basics of a simple Python web page collector. For more on Python web page collectors, please search my earlier articles or continue browsing the related articles below. I hope you will keep supporting me!