Simple Web Page Collector
We have already written a simple crawler that fetches a browser home page. But in practice our needs are rarely as simple as crawling the Sogou or Bilibili home page; what we really want is to crawl a specific page that contains the information we need.
I don't know whether, like me, you tried to crawl a search results page such as Baidu's right after learning the basics. A page like this one:
Notice the part underlined in red; this is the page I opened. Now I want to crawl the data on this page, and based on the code we learned earlier, it should look something like this:
import requests

if __name__ == "__main__":
    # Specify the URL (the Baidu search page shown above; the host was restored here)
    url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=2&tn=93923645_hao_pg&wd=%E5%A5%A5%E7%89%B9%E6%9B%BC&rsv_spt=1&oq=%25E7%2588%25AC%25E5%258F%2596%25E7%2599%25BE%25E5%25BA%25A6%25E9%25A6%2596%25E9%25A1%25B5&rsv_pq=b233dcfd0002d2d8&rsv_t=ccdbEuqbJfqtjnkFvevj%2BfxQ0Sj2UP88ixXHTNUNsmTa9yWEWTUEgxTta9r%2Fj3mXxDs%2BT1SU&rqlang=cn&rsv_dl=tb&rsv_enter=1&rsv_sug3=8&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&rsv_btype=t&inputT=1424&rsv_sug4=1424"
    # Send the request
    response = requests.get(url)
    # Get the page data
    page_text = response.text
    # Store it
    with open("./Otman.html", "w", encoding="utf-8") as fp:
        fp.write(page_text)
    print("Crawl successful!!!")
However, when we open the saved file, the result is not quite what we expected.
The file we saved is essentially a blank page. Why is that?
In fact, switching the URL to a Sogou search makes the problem even more obvious (for some reason Sogou never opens on my machine, so I use Baidu as the example here; you can write the Sogou version yourself). Running the same code with a Sogou search URL gives this result:
We found that one of the sentences on the page reads "There is anomalous access in the network".
What this means is that Sogou or Baidu noticed that the request was sent by a crawler program, not by a human using a browser.
So what is the rationale for this?
Simply put, a request sent by a program looks different from a request sent by a browser. The server relies on the user-agent header to determine the identity of the visitor: if the user-agent looks like a browser, the request is accepted; otherwise it is rejected. This is a very common anti-crawling mechanism.
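To see why the server can tell the difference, it helps to look at what requests sends by default. Here is a minimal sketch (the exact version number in the string depends on your installed requests version):

import requests

# Inspect the headers that requests attaches to every request by default.
# The default "User-Agent" is something like "python-requests/2.x.x",
# which immediately identifies the visitor as a script, not a browser.
print(requests.utils.default_headers())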
Does that mean there's nothing we can do?
As the saying goes, the devil may be one foot tall, but the Dao is ten feet tall. Since the server identifies visitors by their user-agent, we will simply have the crawler fake a browser's user-agent.
In Python, when we need to pass structured data such as request headers (including the user-agent), we usually use a dictionary.
That's how it's written:
header = { "user-agent": "" # user-agent value is a long string }
So how do you get the user-agent value?
1. Open any web page, right-click, and select "Inspect".
2. Select "Network" (Google Chrome)(If it's Chinese, select the "Network" item)
3. If you find that the page is blank, like this, then refresh the page.
After refreshing it looks like this:
Then click any of the requests circled in red. You will see something like the following; find the "user-agent" entry under the request headers and copy its value.
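For reference, a desktop Chrome user-agent looks roughly like the string below. This one is purely illustrative; the exact value depends on your browser and version, so paste the one you copied from your own DevTools:

# Illustrative only: your own copied value will differ.
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}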
With "user-agent", we're rewriting our code for crawling the web page, and we're good to go
import requests

if __name__ == "__main__":
    # Specify the URL (the Baidu search page shown above; the host was restored here)
    url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=2&tn=93923645_hao_pg&wd=%E5%A5%A5%E7%89%B9%E6%9B%BC&rsv_spt=1&oq=%25E7%2588%25AC%25E5%258F%2596%25E7%2599%25BE%25E5%25BA%25A6%25E9%25A6%2596%25E9%25A1%25B5&rsv_pq=b233dcfd0002d2d8&rsv_t=ccdbEuqbJfqtjnkFvevj%2BfxQ0Sj2UP88ixXHTNUNsmTa9yWEWTUEgxTta9r%2Fj3mXxDs%2BT1SU&rqlang=cn&rsv_dl=tb&rsv_enter=1&rsv_sug3=8&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&rsv_btype=t&inputT=1424&rsv_sug4=1424"
    # Emulate a browser's "user-agent", i.e. UA masquerading
    header = {
        "user-agent": ""  # paste the user-agent value you copied here
    }
    # Send the request with the masqueraded headers
    response = requests.get(url, headers=header)
    # Get the page data
    page_text = response.text
    # Store it
    with open("./Ultraman(UA masquerade).html", "w", encoding="utf-8") as fp:
        fp.write(page_text)
    print("Crawl successful!!!")
Run it again and open the file
This time it worked, which means that our crawler program fooled the server perfectly
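As a small extension beyond the code above, we do not have to paste the whole search URL from the address bar; requests can build the query string for us. This is only a sketch and assumes Baidu accepts a plain wd keyword parameter (the other parameters from the copied URL are simply dropped here):

import requests

if __name__ == "__main__":
    header = {
        "user-agent": ""  # paste your copied user-agent value here
    }
    keyword = input("Enter a search keyword: ")
    # Let requests URL-encode the query parameters instead of
    # hand-copying the whole search URL from the address bar.
    response = requests.get(
        "https://www.baidu.com/s",
        params={"wd": keyword},
        headers=header,
    )
    filename = "./" + keyword + ".html"
    with open(filename, "w", encoding="utf-8") as fp:
        fp.write(response.text)
    print("Crawl successful, saved to", filename)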
This concludes this article on the simple Python crawler web page collector. For more on Python web page collectors, please search my earlier articles or continue browsing the related articles below. I hope you will support me in the future!