I. Main objectives
Recently I've been playing with Python web crawlers and came across the selenium module, so I decided to try something interesting with it and record my learning process along the way.
II. Preliminary preparations
- Operating system: Windows 10
- Browser: Google Chrome
- Browser driver: ChromeDriver (my version: 89.0.4389.128)
- The modules used in the program:
import csv
import os
import re
import json
import time
import requests
from selenium.webdriver import Chrome
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.common.by import By
from selenium.webdriver.support import ui
from selenium.webdriver.support import expected_conditions
from lxml import etree

chrome = Chrome(executable_path='chromedriver')
- All third-party packages used can be installed with pip install.
- The last line of the above code creates the browser object.
III. Analysis of the approach
1. A quick look at the homepage shows that you need to log in before you can access any information, so the first step is to simulate the login.
When you open the login page, it defaults to QR-code login. We won't use that, since it is not convenient to automate; instead we simulate clicking the button that switches to the account/password login form and type the credentials in there. Here is how to drive the browser through that series of operations:
# Get the login page
chrome.get(url)
# Switch from QR-code login to account/password login
chrome.find_element_by_class_name('zppp-panel-qrcode-bar__triangle').click()
chrome.find_element_by_xpath('//div[@class="zppp-panel-normal__inner"]/ul/li[2]').click()
# Find the account and password input fields
user_name = chrome.find_elements_by_xpath('//div[@class="zppp-input__container"]/input')[0]
pass_word = chrome.find_elements_by_xpath('//div[@class="zppp-input__container"]/input')[1]
# Enter the account and password of the account to log in with
user_name.send_keys('**********')
pass_word.send_keys('***********')
# Click Login when you're done typing
chrome.find_element_by_class_name('zppp-submit').click()
# The slider verification is completed manually here
# Move the mouse yourself
2. After logging in and taking a rough look at the home page, I decided to start crawling from the city list, so the first task is to find where that data sits in the original page source, as shown in the figure below.
I use requests to fetch the original page, match the content we need with a regular expression (the part highlighted in red in the figure above), and then parse it to get each city together with its corresponding url:
resp = requests.get(url, headers=headers)
if resp.status_code == 200:
    html = resp.text
    json_data = re.search(r'<script>__INITIAL_STATE__=(.*?)</script>', html).groups()[0]
    data = json.loads(json_data)
    cityMapList = data['cityList']['cityMapList']  # dict
    for letter, citys in cityMapList.items():
        # print(f'-----{letter}-------')
        for city in citys:  # citys is a list with dictionaries nested inside it
            '''
            {
                'name': 'Anshan',
                'url': '///anshan/',
                'code': '601',
                'pinyin': 'anshan'
            }
            '''
            city_name = city['name']
            city_url = 'https:' + city['url']
At this point we have every city and its url. Crawling all of them would mean a fairly large amount of data, so we can filter down to just the cities we actually need and reduce the workload; which cities to keep can be changed however we like.
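As a minimal filtering sketch (assuming the city_name and city_url variables from the loop above, and that get_city_job is the search function defined later in the full source; the city names in query_citys are just examples):

query_citys = ('Chengdu',)          # only crawl the cities listed here
if city_name in query_citys:
    print(f'Acquiring {city_name} information')
    get_city_job(city_url)          # search and parse jobs for this city
    time.sleep(3)                   # small pause between cities
else:
    pass                            # skip cities that are not in the list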
3. Next we can do a job search. Since we are crawling with Python, let's query Python-related jobs.
First we need to find the search input box and its selector, type in the query (here, Python), and then find the search button on its right (the magnifying glass) and click it. The following code simulates these browser operations:
# Find the input box (a WebElement) by class_name
input_seek: WebElement = chrome.find_element_by_class_name('zp-search__input')
input_seek.send_keys('Python')  # Type Python
# Find the search button and click it
click: WebElement = chrome.find_element_by_xpath('//div[@class="zp-search__common"]//a')
click.click()
# Switch to the newly opened window
chrome.switch_to.window(chrome.window_handles[1])
One thing to note here: after typing Python and clicking the search button, a new window pops up, but the program driving the browser is still attached to the first window. You therefore need to call chrome.switch_to.window(chrome.window_handles[n]) to switch windows, where n is the index of the target window and the very first window is index 0.
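A minimal sketch of the handle bookkeeping (the index order is an assumption: the original window comes first and newly opened windows are appended after it):

print(chrome.window_handles)                        # list of handles for all open windows
chrome.switch_to.window(chrome.window_handles[1])   # attach the driver to the new results window
# ... scrape the results window ...
chrome.switch_to.window(chrome.window_handles[0])   # switch back to the original window if needed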
4. Data parsing and extraction
It is easy to see that the information we need sits under class="positionlist", and further inspection shows the data lives under the a tags, so the next step is to extract it with XPath:
root = etree.HTML(html)
divs = root.xpath('//div[@class="positionlist"]')  # element objects
for div in divs:
    # Job title (each xpath call returns a list)
    position = div.xpath('.//a//div[@class="iteminfo__line1__jobname"]/span[1]')
    # Company
    company = div.xpath('.//a//div[@class="iteminfo__line1__compname"]/span/text()')
    # Salary
    money = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/p/text()')
    # City
    city = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[1]/text()')
    # Experience
    experience = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[2]/text()')
    # Academic qualifications
    education = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[3]/text()')
    # Company scale
    scale = div.xpath('.//a//div[@class="iteminfo__line2__compdesc"]/span[1]/text()')
    # Number of employees
    people = div.xpath('.//a//div[@class="iteminfo__line2__compdesc"]/span[2]/text()')
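Each xpath() call above returns a list with one entry per job card, so the separate lists still have to be zipped back together into one record per job. A minimal sketch, assuming the position, company and money lists from the loop above (the remaining fields work the same way):

items = {}
for position_, company_, money_ in zip(position, company, money):
    items['position'] = position_.get('title')  # the job title is stored in the title attribute
    items['company'] = company_
    items['money'] = money_.strip()
    print(items)                                # or hand the record to a save function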
5. Get the next page
Find the next page button and simulate a browser click to get all the data for each page.
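The full source below does this recursively (parse() calls next_page(), which calls parse() again on the new page). As a hedged alternative, the same idea can be written as a simple loop; parse_current_page is a hypothetical helper standing in for the parsing step:

while True:
    parse_current_page(chrome.page_source)   # hypothetical helper: extract and save one page of results
    time.sleep(0.5)
    button = chrome.find_elements_by_xpath('//div[@class="soupager"]/button[@class="btn soupager__btn"]')
    if not button:
        break                                 # no "next page" button left, last page reached
    button[0].click()                         # go to the next page
    time.sleep(1)                             # give the new page a moment to load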
IV. Full source code
import csv
import os
import re
import json
import time
import requests
from selenium.webdriver import Chrome
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.common.by import By
from selenium.webdriver.support import ui
from selenium.webdriver.support import expected_conditions
from lxml import etree

chrome = Chrome(executable_path='chromedriver')


# Simulate login
def login(url):
    # Get the login page
    chrome.get(url)
    # Switch from QR-code login to account/password login
    chrome.find_element_by_class_name('zppp-panel-qrcode-bar__triangle').click()
    chrome.find_element_by_xpath('//div[@class="zppp-panel-normal__inner"]/ul/li[2]').click()
    # Find the account and password input fields
    user_name = chrome.find_elements_by_xpath('//div[@class="zppp-input__container"]/input')[0]
    pass_word = chrome.find_elements_by_xpath('//div[@class="zppp-input__container"]/input')[1]
    # Enter your Zhilian account and password here
    user_name.send_keys('***********')
    pass_word.send_keys('**********')
    # Click Login when you're done typing
    chrome.find_element_by_class_name('zppp-submit').click()

    # The slider verification is completed manually here
    # Leave time to finish it by hand
    time.sleep(10)
    get_allcity('/citymap')


# Fetch the information for all cities while logged in
def get_allcity(url):
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        html = resp.text
        json_data = re.search(r'<script>__INITIAL_STATE__=(.*?)</script>', html).groups()[0]
        data = json.loads(json_data)
        cityMapList = data['cityList']['cityMapList']  # dict
        for letter, citys in cityMapList.items():
            # print(f'-----{letter}-------')
            for city in citys:  # citys is a list with dictionaries nested inside it
                '''
                {
                    'name': 'Anshan',
                    'url': '///anshan/',
                    'code': '601',
                    'pinyin': 'anshan'
                }
                '''
                city_name = city['name']
                city_url = 'https:' + city['url']

                # Filter cities
                query_citys = ('Chengdu',)
                if city_name in query_citys:
                    print(f'Acquiring {city_name} information')
                    get_city_job(city_url)
                    time.sleep(3)
                else:
                    # print(f'{city_name} is not in the search!')
                    pass
    else:
        print('Web page fetch failed')


def get_city_job(url):
    chrome.get(url)  # Open the city page
    # Find the input box by class_name
    input_seek: WebElement = chrome.find_element_by_class_name('zp-search__input')
    input_seek.send_keys('Python')  # Type Python
    click: WebElement = chrome.find_element_by_xpath('//div[@class="zp-search__common"]//a')  # Find the search button
    click.click()

    # Switch to the second window
    chrome.switch_to.window(chrome.window_handles[1])
    time.sleep(1)

    # Wait for the div element with class_name "sou-main__list" to appear
    ui.WebDriverWait(chrome, 30).until(
        expected_conditions.visibility_of_all_elements_located((By.CLASS_NAME, 'sou-main__list')),
        'The element looked for never appeared'
    )

    # Check whether the current query returned any results
    no_content = chrome.find_elements_by_class_name('positionlist')
    if not no_content:
        print('Python jobs not found in current city')
    else:
        # Extract the search results
        parse(chrome.page_source)


def parse(html):
    root = etree.HTML(html)
    divs = root.xpath('//div[@class="positionlist"]')  # element objects
    items = {}
    for div in divs:
        # Positions
        position = div.xpath('.//a//div[@class="iteminfo__line1__jobname"]/span[1]')
        # Company
        company = div.xpath('.//a//div[@class="iteminfo__line1__compname"]/span/text()')
        # Salary
        money = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/p/text()')
        # City
        city = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[1]/text()')
        # Experience
        experience = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[2]/text()')
        # Academic qualifications
        education = div.xpath('.//a//div[@class="iteminfo__line2__jobdesc"]/ul/li[3]/text()')
        # Company scale
        scale = div.xpath('.//a//div[@class="iteminfo__line2__compdesc"]/span[1]/text()')
        # Number of employees
        people = div.xpath('.//a//div[@class="iteminfo__line2__compdesc"]/span[2]/text()')

        for position_, company_, money_, city_, experience_, education_, scale_, people_ in zip(
                position, company, money, city, experience, education, scale, people):
            # title="python crawler engineer" -> read the value of the title attribute
            string = position_.get('title')
            items['position'] = string
            items['company'] = company_
            items['money'] = money_.strip()
            items['city'] = city_
            items['experience'] = experience_
            items['education'] = education_
            items['scale'] = scale_
            items['people'] = people_
            itempipeline(items)

    # Get the next page
    next_page()


def itempipeline(items):
    has_header = os.path.exists(save_csv)  # Does the file (and its header row) already exist?
    with open(save_csv, 'a', encoding='utf8', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=items.keys())
        if not has_header:
            writer.writeheader()  # Write the file header
        writer.writerow(items)


def next_page():
    # Find the next page button
    time.sleep(0.5)
    button = chrome.find_elements_by_xpath('//div[@class="soupager"]/button[@class="btn soupager__btn"]')
    if not button:
        print(f'Acquisition complete, please check {save_csv}!')
        exit()
    else:
        button[0].click()  # Click on the next page
        time.sleep(1)
        parse(chrome.page_source)


if __name__ == '__main__':
    n = 0
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3823.400 QQBrowser/10.7.4307.400',
        'Cookie': 'aQQ_ajkguid=B4D4C2CC-2F46-D252-59D7-83356256A4DC; id58=e87rkGBclxRq9+GOJC4CAg==; _ga=GA1.2.2103255298.1616680725; 58tj_uuid=4b56b6bf-99a3-4dd5-83cf-4db8f2093fcd; wmda_uuid=0f89f6f294d0f974a4e7400c1095354c; wmda_new_uuid=1; wmda_visited_projects=%3B6289197098934; als=0; cmctid=102; ctid=15; sessid=E454865C-BA2D-040D-1158-5E1357DA84BA; twe=2; isp=true; _gid=GA1.2.1192525458.1617078804; new_uv=4; obtain_by=2; xxzl_cid=184e09dc30c74089a533faf230f39099; xzuid=7763438f-82bc-4565-9fe8-c7a4e036c3ee'
    }
    save_csv = ''  # path of the output CSV file
    login('/login?bkUrl=%2F%%2Fblank%3Fhttps%3A%2F%%2Fbeijing%2F')
V. Partial results
VI. Summary
Personally, I think Zhilian's anti-crawling measures are still fairly friendly. Why? While testing the program I simulated logging in dozens of times within a short period, and although I initially worried that my IP would get blocked, nothing happened in the end. Also note that selenium is heavily affected by network speed: if the wait time is set too long it slows the program down, but if it is too short you may lose data.
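One way to balance those two extremes is an explicit wait, which the full source above already uses. A minimal sketch, assuming the chrome driver object created earlier and that the sou-main__list element only becomes visible once the results page has fully loaded:

from selenium.webdriver.support import ui, expected_conditions
from selenium.webdriver.common.by import By

# Waits at most 30 seconds, but continues as soon as the element is visible
ui.WebDriverWait(chrome, 30).until(
    expected_conditions.visibility_of_all_elements_located((By.CLASS_NAME, 'sou-main__list')),
    'The element looked for never appeared'
)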
That's all for this article on crawling Zhilian recruitment data with Python and selenium. For more content on crawling Zhilian with selenium, please search my earlier articles or keep browsing the related articles below. I hope you will continue to support me!