I. Main objectives
Recently I've been playing with Python web crawlers and came across the selenium module, so I decided to try something interesting with it and record my learning process along the way.
II. Preliminary preparations
- Operating system: Windows 10
- Browser: Google Chrome
- Browser driver: ChromeDriver (my version: 89.0.4389.128)
- The modules used in the program:
import csv
import os
import re
import json
import time
import requests
from selenium.webdriver import Chrome
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.common.by import By
from selenium.webdriver.support import ui
from selenium.webdriver.support import expected_conditions
from lxml import etree

chrome = Chrome(executable_path='chromedriver')
- All third-party packages used can be installed with pip install.
- The last line of the above code creates the browser object.
III. Analysis of the approach
1. A quick look at the homepage shows that you need to log in before you can access any information, so the first step is to simulate the login.
When you open the login page, it defaults to QR-code login. We won't use that, since it is not convenient to automate; instead we simulate clicking the button that switches to the account/password login form and type the credentials in there. Here is how to drive the browser through that series of operations:
# Get the login page
chrome.get(url)
# Switch from QR-code login to account/password login
chrome.find_element_by_class_name('zppp-panel-qrcode-bar__triangle').click()
chrome.find_element_by_xpath('//div[@class="zppp-panel-normal__inner"]/ul/li[2]').click()
# Find the account and password input fields
user_name = chrome.find_elements_by_xpath('//div[@class="zppp-input__container"]/input')[0]
pass_word = chrome.find_elements_by_xpath('//div[@class="zppp-input__container"]/input')[1]
# Enter the account and password of the account to log in with
user_name.send_keys('**********')
pass_word.send_keys('***********')
# Click Login when you're done typing
chrome.find_element_by_class_name('zppp-submit').click()
# The slider verification is completed manually here
# Move the mouse yourself
2. After logging in and taking a rough look at the home page, I decided to start crawling from the city list, so the first task is to find where that data sits in the original page source, as shown in the figure below.
I use requests to fetch the original page, match the content we need with a regular expression (the part highlighted in red in the figure above), and then parse it to get each city together with its corresponding url:
resp = requests.get(url, headers=headers)
if resp.status_code == 200:
    html = resp.text
    json_data = re.search(r'<script>__INITIAL_STATE__=(.*?)</script>', html).groups()[0]
    data = json.loads(json_data)
    cityMapList = data['cityList']['cityMapList']  # dict
    for letter, citys in cityMapList.items():
        # print(f'-----{letter}-------')
        for city in citys:  # citys is a list with dictionaries nested inside it
            '''
            {
                'name': 'Anshan',
                'url': '///anshan/',
                'code': '601',
                'pinyin': 'anshan'
            }
            '''
            city_name = city['name']
            city_url = 'https:' + city['url']
At this point we have every city and its url. Crawling all of them would mean a fairly large amount of data, so we can filter down to just the cities we actually need and reduce the workload; which cities to keep can be changed however we like.
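As a minimal filtering sketch (assuming the city_name and city_url variables from the loop above, and that get_city_job is the search function defined later in the full source; the city names in query_citys are just examples):

query_citys = ('Chengdu',)          # only crawl the cities listed here
if city_name in query_citys:
    print(f'Acquiring {city_name} information')
    get_city_job(city_url)          # search and parse jobs for this city
    time.sleep(3)                   # small pause between cities
else:
    pass                            # skip cities that are not in the list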
3. Next we can do a job search. Since we are crawling with Python, let's query Python-related jobs.
First we need to find the search input box and its selector, type in the query (here, Python), and then find the search button on its right (the magnifying glass) and click it. The following code simulates these browser operations:
# Find the input box (a WebElement) by class_name
input_seek: WebElement = chrome.find_element_by_class_name('zp-search__input')
input_seek.send_keys('Python')  # Type Python
# Find the search button and click it
click: WebElement = chrome.find_element_by_xpath('//div[@class="zp-search__common"]//a')
click.click()
# Switch to the newly opened window
chrome.switch_to.window(chrome.window_handles[1])
One thing to note here: after typing Python and clicking the search button, a new window pops up, but the program driving the browser is still attached to the first window. You therefore need to call chrome.switch_to.window(chrome.window_handles[n]) to switch windows, where n is the index of the target window and the very first window is index 0.
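A minimal sketch of the handle bookkeeping (the index order is an assumption: the original window comes first and newly opened windows are appended after it):

print(chrome.window_handles)                        # list of handles for all open windows
chrome.switch_to.window(chrome.window_handles[1])   # attach the driver to the new results window
# ... scrape the results window ...
chrome.switch_to.window(chrome.window_handles[0])   # switch back to the original window if needed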
4. Data parsing and extraction
It is easy to see that the information we need sits under class="positionlist", and further inspection shows the data lives under the a tags, so the next step is to extract it with XPath:
root = etree.HTML(html)
divs = root.xpath('//div[@class="positionlist"]')  # element objects
for div in divs:
    # Job title (each xpath call returns a list)
    position = div.xpath('.//a//div[@class="iteminfo__line1__jobname"]/span[1]')
    # Company
    company = div.xpath('.//a//div[@class="iteminfo__line1__compname"]/span/text()')
    # Salary
    money = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/p/text()')
    # City
    city = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[1]/text()')
    # Experience
    experience = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[2]/text()')
    # Academic qualifications
    education = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[3]/text()')
    # Company scale
    scale = div.xpath('.//a//div[@class="iteminfo__line2__compdesc"]/span[1]/text()')
    # Number of employees
    people = div.xpath('.//a//div[@class="iteminfo__line2__compdesc"]/span[2]/text()')
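Each xpath() call above returns a list with one entry per job card, so the separate lists still have to be zipped back together into one record per job. A minimal sketch, assuming the position, company and money lists from the loop above (the remaining fields work the same way):

items = {}
for position_, company_, money_ in zip(position, company, money):
    items['position'] = position_.get('title')  # the job title is stored in the title attribute
    items['company'] = company_
    items['money'] = money_.strip()
    print(items)                                # or hand the record to a save function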
5. Get the next page
Find the next page button and simulate a browser click to get all the data for each page.
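The full source below does this recursively (parse() calls next_page(), which calls parse() again on the new page). As a hedged alternative, the same idea can be written as a simple loop; parse_current_page is a hypothetical helper standing in for the parsing step:

while True:
    parse_current_page(chrome.page_source)   # hypothetical helper: extract and save one page of results
    time.sleep(0.5)
    button = chrome.find_elements_by_xpath('//div[@class="soupager"]/button[@class="btn soupager__btn"]')
    if not button:
        break                                 # no "next page" button left, last page reached
    button[0].click()                         # go to the next page
    time.sleep(1)                             # give the new page a moment to load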
IV. Full source code
import csv
import os
import re
import json
import time
import requests
from selenium.webdriver import Chrome
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.common.by import By
from selenium.webdriver.support import ui
from selenium.webdriver.support import expected_conditions
from lxml import etree

chrome = Chrome(executable_path='chromedriver')


# Simulate login
def login(url):
    # Get the login page
    chrome.get(url)
    # Switch from QR-code login to account/password login
    chrome.find_element_by_class_name('zppp-panel-qrcode-bar__triangle').click()
    chrome.find_element_by_xpath('//div[@class="zppp-panel-normal__inner"]/ul/li[2]').click()
    # Find the account and password input fields
    user_name = chrome.find_elements_by_xpath('//div[@class="zppp-input__container"]/input')[0]
    pass_word = chrome.find_elements_by_xpath('//div[@class="zppp-input__container"]/input')[1]
    # Enter your Zhilian account and password here
    user_name.send_keys('***********')
    pass_word.send_keys('**********')
    # Click Login when you're done typing
    chrome.find_element_by_class_name('zppp-submit').click()

    # The slider verification is completed manually here
    # Leave time to finish it by hand
    time.sleep(10)
    get_allcity('/citymap')


# Fetch the information for all cities while logged in
def get_allcity(url):
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        html = resp.text
        json_data = re.search(r'<script>__INITIAL_STATE__=(.*?)</script>', html).groups()[0]
        data = json.loads(json_data)
        cityMapList = data['cityList']['cityMapList']  # dict
        for letter, citys in cityMapList.items():
            # print(f'-----{letter}-------')
            for city in citys:  # citys is a list with dictionaries nested inside it
                '''
                {
                    'name': 'Anshan',
                    'url': '///anshan/',
                    'code': '601',
                    'pinyin': 'anshan'
                }
                '''
                city_name = city['name']
                city_url = 'https:' + city['url']

                # Filter cities
                query_citys = ('Chengdu',)
                if city_name in query_citys:
                    print(f'Acquiring {city_name} information')
                    get_city_job(city_url)
                    time.sleep(3)
                else:
                    # print(f'{city_name} is not in the search!')
                    pass
    else:
        print('Web page fetch failed')


def get_city_job(url):
    chrome.get(url)  # Open the city page
    # Find the input box by class_name
    input_seek: WebElement = chrome.find_element_by_class_name('zp-search__input')
    input_seek.send_keys('Python')  # Type Python
    click: WebElement = chrome.find_element_by_xpath('//div[@class="zp-search__common"]//a')  # Find the search button
    click.click()

    # Switch to the second window
    chrome.switch_to.window(chrome.window_handles[1])
    time.sleep(1)

    # Wait for the div element with class_name "sou-main__list" to appear
    ui.WebDriverWait(chrome, 30).until(
        expected_conditions.visibility_of_all_elements_located((By.CLASS_NAME, 'sou-main__list')),
        'The element looked for never appeared'
    )

    # Check whether the current query returned any results
    no_content = chrome.find_elements_by_class_name('positionlist')
    if not no_content:
        print('Python jobs not found in current city')
    else:
        # Extract the search results
        parse(chrome.page_source)


def parse(html):
    root = etree.HTML(html)
    divs = root.xpath('//div[@class="positionlist"]')  # element objects
    items = {}
    for div in divs:
        # Positions
        position = div.xpath('.//a//div[@class="iteminfo__line1__jobname"]/span[1]')
        # Company
        company = div.xpath('.//a//div[@class="iteminfo__line1__compname"]/span/text()')
        # Salary
        money = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/p/text()')
        # City
        city = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[1]/text()')
        # Experience
        experience = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[2]/text()')
        # Academic qualifications
        education = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[3]/text()')
        # Company scale
        scale = div.xpath('.//a//div[@class="iteminfo__line2__compdesc"]/span[1]/text()')
        # Number of employees
        people = div.xpath('.//a//div[@class="iteminfo__line2__compdesc"]/span[2]/text()')

        for position_, company_, money_, city_, experience_, education_, scale_, people_ in zip(
                position, company, money, city, experience, education, scale, people):
            # title="python crawler engineer" -> read the value of the title attribute
            string = position_.get('title')
            items['position'] = string
            items['company'] = company_
            items['money'] = money_.strip()
            items['city'] = city_
            items['experience'] = experience_
            items['education'] = education_
            items['scale'] = scale_
            items['people'] = people_
            itempipeline(items)

    # Get the next page
    next_page()


def itempipeline(items):
    has_header = os.path.exists(save_csv)  # Does the file (and its header row) already exist?
    with open(save_csv, 'a', encoding='utf8', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=items.keys())
        if not has_header:
            writer.writeheader()  # Write the file header
        writer.writerow(items)


def next_page():
    # Find the next page button
    time.sleep(0.5)
    button = chrome.find_elements_by_xpath('//div[@class="soupager"]/button[@class="btn soupager__btn"]')
    if not button:
        print(f'Acquisition complete, please check {save_csv}!')
        exit()
    else:
        button[0].click()  # Click on the next page
        time.sleep(1)
        parse(chrome.page_source)


if __name__ == '__main__':
    n = 0
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3823.400 QQBrowser/10.7.4307.400',
        'Cookie': 'aQQ_ajkguid=B4D4C2CC-2F46-D252-59D7-83356256A4DC; id58=e87rkGBclxRq9+GOJC4CAg==; _ga=GA1.2.2103255298.1616680725; 58tj_uuid=4b56b6bf-99a3-4dd5-83cf-4db8f2093fcd; wmda_uuid=0f89f6f294d0f974a4e7400c1095354c; wmda_new_uuid=1; wmda_visited_projects=%3B6289197098934; als=0; cmctid=102; ctid=15; sessid=E454865C-BA2D-040D-1158-5E1357DA84BA; twe=2; isp=true; _gid=GA1.2.1192525458.1617078804; new_uv=4; obtain_by=2; xxzl_cid=184e09dc30c74089a533faf230f39099; xzuid=7763438f-82bc-4565-9fe8-c7a4e036c3ee'
    }
    save_csv = ''  # path of the output CSV file
    login('/login?bkUrl=%2F%%2Fblank%3Fhttps%3A%2F%%2Fbeijing%2F')
V. Partial results
VI. Summary
Personally, I think Zhilian's anti-crawling measures are still fairly friendly. Why? While testing the program I simulated logging in dozens of times within a short period, and although I initially worried that my IP would get blocked, nothing happened in the end. Also note that selenium is heavily affected by network speed: if the wait time is set too long it slows the program down, but if it is too short you may lose data.
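One way to balance those two extremes is an explicit wait, which the full source above already uses. A minimal sketch, assuming the chrome driver object created earlier and that the sou-main__list element only becomes visible once the results page has fully loaded:

from selenium.webdriver.support import ui, expected_conditions
from selenium.webdriver.common.by import By

# Waits at most 30 seconds, but continues as soon as the element is visible
ui.WebDriverWait(chrome, 30).until(
    expected_conditions.visibility_of_all_elements_located((By.CLASS_NAME, 'sou-main__list')),
    'The element looked for never appeared'
)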
That's all for this article on crawling Zhilian recruitment data with Python and selenium. For more content on crawling Zhilian with selenium, please search my earlier articles or keep browsing the related articles below. I hope you will continue to support me!