SoFunction
Updated on 2024-11-13

Setting cookies in a Python crawler to get around website interception when scraping Mayi Duanzu (ant short-term rentals)

When writing a Python crawler, we sometimes run into anti-crawling measures that deny us access. For example, when we try to crawl Mayi Duanzu (ant short-term rental) data, the site responds with the notice "the current access is suspected of a hacker attack and has been intercepted by the site administrator", as shown in the figure below. In this case we need to set a cookie in order to crawl, which is introduced in detail below. Many thanks to my student Cheng Feng for providing the idea; as they say, the waves behind push on the waves ahead!

I. Website Analysis and Crawler Interception

When we open Mayi Duanzu and search for Guiyang city, the results come back as shown below.

[Figure]

We can see that the short-term rental listings are laid out in a regular pattern, as shown in the figure below; this is the information we want to crawl.

[Figure]

Inspecting the elements in the browser, we can see that the information we need for each rental sits under a <dd></dd> node.

[Figure]


The house name is located under the <div class="room-detail clearfloat"></div> node, as shown below.

[Figure]

Next we write a simple BeautifulSoup crawler.

# -*- coding: utf-8 -*-
import urllib
from bs4 import BeautifulSoup

url = '/guiyang/?map=no'  # site domain omitted in the original
response = urllib.urlopen(url)
contents = response.read()
soup = BeautifulSoup(contents, "html.parser")
print soup
# Short-term rental names
for tag in soup.find_all('dd'):
    for name in tag.find_all(attrs={"class": "room-detail clearfloat"}):
        fname = name.find('p').get_text()
        print u'[Name of short-term rental]', fname.replace('\n', '').strip()

Unfortunately, this fails: the response shows that the site's anti-crawling protection is still in place.

[Figure]
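As a quick check (a minimal sketch, not part of the original article), you can also detect the interception programmatically: if the page parses but contains none of the <dd> listing nodes identified above, the request was most likely blocked.

from bs4 import BeautifulSoup

def is_blocked(html):
    # The listing page contains one <dd> node per rental; the interception
    # page contains none, so an empty result suggests the crawler was blocked
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all('dd')) == 0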

II. Setting Cookies for the BeautifulSoup Crawler

The code with the request headers added is shown below; the code and results are given first, and how to obtain the cookie is shown afterwards.

# -*- coding: utf-8 -*-
import urllib2
import re
from bs4 import BeautifulSoup

# Crawler function
def gydzf(url):
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers = {"User-Agent": user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    for tag in soup.find_all('dd'):
        # Short-term rental name
        for name in tag.find_all(attrs={"class": "room-detail clearfloat"}):
            fname = name.find('p').get_text()
            print u'[Name of short-term rental]', fname.replace('\n', '').strip()
        # Short-term rental price
        for price in tag.find_all(attrs={"class": "moy-b"}):
            string = price.find('p').get_text()
            fprice = re.sub(u"[¥]+", u"", string)  # strip the currency sign
            fprice = fprice[0:5]
            print u'[Short-term rental price]', fprice.replace('\n', '').strip()
        # Rating and number of comments
        for score in tag.find_all('ul'):
            fscore = score.get_text()
            print u'[Rating/reviews/occupants]', fscore.replace('\n', '').strip()
        # Detail-page link
        url_dzf = tag.find(attrs={"target": "_blank"})
        urls = url_dzf.attrs['href']
        print u'[Web link]', urls.replace('\n', '').strip()
        urlss = '' + urls + ''  # domain prefix omitted in the original
        print urlss

# Main function
if __name__ == '__main__':
    i = 1
    while i < 10:
        print u'Page number', i
        url = '/guiyang/' + str(i) + '/?map=no'
        gydzf(url)
        i = i + 1
    else:
        print u"The end."

The output is shown below:

Page number 1
[Name of short-term rental] Datang Dongyuan Fortune Plaza - City Minimalist Duplex B&B
[Short-term rental price] 298
[Rating/reviews/occupants] 5.0 out of 5 - 5 reviews - 2 bedrooms - sleeps 3
[Web link] /room/851634765
/room/851634765
[Name of short-term rental] Datang Dongyuan Fortune Plaza - Fresh Lemon Duplex B&B
[Short-term rental price] 568
[Rating/reviews/occupants] 2 reviews - 3 bedrooms - sleeps 6
[Web link] /room/851634467
/room/851634467

...

Page number 9
[Name of short-term rental] [Next to the park at North High-Speed Railway Station] American style + oversized, comfortable and cozy
[Short-term rental price] 366
[Rating/reviews/occupants] 3 reviews - 2 bedrooms - sleeps 5
[Web link] /room/851018852
/room/851018852
[Name of short-term rental] Danyingpo (near Zhongda International Shopping Center) Scandinavian fresh three-bedroom
[Short-term rental price] 298
[Rating/reviews/occupants] 3 bedrooms - sleeps 6
[Web link] /room/851647045
/room/851647045

[Figure]

Next we want to crawl each listing's detail page.

[Figure]

Here is the method for obtaining the cookie: open the page in a browser, right-click and choose "Inspect", then refresh the page. Find the page request under the "Network" tab and click it; the required information is in the Headers panel that pops up.

[Figure]

The two most common parameters are Cookie and User-Agent, as shown below:

[Figure]

Then we just set these parameters in the Python code and call urllib2.urlopen() to submit the request; the core code is as follows:

user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Chrome/61.0.3163.100 Safari/537.36"
 cookie="mediav=%7B%22eid%22%3A%22387123...b3574ef2-21b9-11e8-b39c-1bc4029c43b8"
 headers={"User-Agent":user_agent,"Cookie":cookie}
 request=(url,headers=headers)
 response=(request)
 contents = ()
 soup = BeautifulSoup(contents, "")
 for tag1 in soup.find_all(attrs={"class":"main"}):

Note that the cookie expires roughly every hour, so it has to be refreshed by hand: copy the new value from the browser's developer tools and update the cookie variable (and, if necessary, the user_agent variable) in the code.
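To avoid editing the source every hour, one option is to keep the cookie string in a side file. The following is a minimal sketch (not part of the original article); the filename cookie.txt is an assumption:

# -*- coding: utf-8 -*-
import urllib2

def load_headers(cookie_path="cookie.txt"):
    # Read the current cookie from a side file (assumed name) so that only
    # the file, not the crawler source, needs updating when the cookie expires
    with open(cookie_path) as f:
        cookie = f.read().strip()
    user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36")
    return {"User-Agent": user_agent, "Cookie": cookie}

def fetch(url):
    # Submit the request with the freshly loaded headers
    request = urllib2.Request(url, headers=load_headers())
    return urllib2.urlopen(request).read()

The complete code is shown below: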

# -*- coding: utf-8 -*-
import urllib2
import re
from bs4 import BeautifulSoup
import codecs
import csv

c = open("mayi.csv", "wb")  # output file; the filename was lost in the original
c.write(codecs.BOM_UTF8)  # BOM so Excel opens the UTF-8 file correctly
writer = csv.writer(c)
writer.writerow(["Name of short-term rental", "Address", "Price", "Score", "Number of occupants", "Price per person"])

# Crawl the detail page
def getInfo(url, fname, fprice, fscore, users):
    # Use the User-Agent and Cookie found in the browser's developer mode as request
    # headers to avoid the anti-crawler check; the cookie value below has to be
    # replaced with a fresh one from the developer tools every so often
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
    cookie = "mediav=%7B%22eid%22%3A%22387123%22eb7; mayi_uuid=1582009990674274976491; sid=42200298656434922.85.130.130"
    headers = {"User-Agent": user_agent, "Cookie": cookie}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    # Short-term rental address
    for tag1 in soup.find_all(attrs={"class": "main"}):
        print u'Short-term rental address:'
        for tag2 in tag1.find_all(attrs={"class": "desWord"}):
            address = tag2.find('p').get_text()
            print address
        # Number of occupants
        print u'Number of occupants:'
        for tag4 in tag1.find_all(attrs={"class": "w258"}):
            yy = tag4.find('span').get_text()
            print yy
        fname = fname.encode("utf-8")
        address = address.encode("utf-8")
        fprice = fprice.encode("utf-8")
        fscore = fscore.encode("utf-8")
        fpeople = yy[2:3].encode("utf-8")
        ones = int(float(fprice)) / int(float(fpeople))  # price per person
        # Store locally
        writer.writerow([fname, address, fprice, fscore, fpeople, ones])

# Crawler function
def gydzf(url):
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers = {"User-Agent": user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    for tag in soup.find_all('dd'):
        # Short-term rental name
        for name in tag.find_all(attrs={"class": "room-detail clearfloat"}):
            fname = name.find('p').get_text()
            print u'[Name of short-term rental]', fname.replace('\n', '').strip()
        # Short-term rental price
        for price in tag.find_all(attrs={"class": "moy-b"}):
            string = price.find('p').get_text()
            fprice = re.sub(u"[¥]+", u"", string)
            fprice = fprice[0:5]
            print u'[Short-term rental price]', fprice.replace('\n', '').strip()
        # Rating and number of comments
        for score in tag.find_all('ul'):
            fscore = score.get_text()
            print u'[Rating/reviews/occupants]', fscore.replace('\n', '').strip()
        # Detail-page link
        url_dzf = tag.find(attrs={"target": "_blank"})
        urls = url_dzf.attrs['href']
        print u'[Web link]', urls.replace('\n', '').strip()
        urlss = '' + urls + ''  # domain prefix omitted in the original
        print urlss
        getInfo(urlss, fname, fprice, fscore, user_agent)

# Main function
if __name__ == '__main__':
    i = 0
    while i < 33:
        print u'Page number', (i + 1)
        if i == 0:
            url = '/guiyang/?map=no'
        if i > 0:
            num = i + 2  # the first page has no number; numbering starts at 2 for the second page
            url = '/guiyang/' + str(num) + '/?map=no'
        gydzf(url)
        i = i + 1

c.close()

The output is as follows, with the results stored in a local CSV file:

[Figure]
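To sanity-check the stored file, here is a small sketch (not from the original article; mayi.csv is the filename assumed in the code above) that reads the CSV back:

import csv

with open("mayi.csv", "rb") as f:  # Python 2: open in binary mode for csv
    reader = csv.reader(f)
    for row in reader:
        print ", ".join(row)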

Alternatively, you can try Selenium to crawl Mayi Duanzu, which should also be a viable method.
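As a rough sketch of that alternative (not tested against the live site; the domain is omitted here just as elsewhere in this article, and the selectors are the ones identified in the analysis above):

from selenium import webdriver

BASE = ''  # site domain, omitted throughout this article; fill in before running
driver = webdriver.Chrome()  # assumes chromedriver is installed
driver.get(BASE + '/guiyang/?map=no')
# Each listing sits under a <dd> node, as in the BeautifulSoup version
for dd in driver.find_elements_by_tag_name('dd'):
    details = dd.find_elements_by_class_name('room-detail')
    if details:
        print details[0].text
driver.quit()

One advantage of this approach is that the browser session manages cookies itself, so there is no need to copy the cookie by hand every hour.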

This concludes the article on setting cookies in a Python crawler to get around website interception when scraping Mayi Duanzu. For more on crawling Mayi Duanzu with Python, please search my previous articles or continue browsing the related articles below. I hope you will support me in the future!