SoFunction
Updated on 2024-11-19

Python-Selenium Automated Crawler

Brief introduction:

Selenium is a web automation testing tool, originally developed for automated website testing. It drives the browser directly and supports all major browsers, including interface-less (headless) browsers such as PhantomJS (whose developer announced in 2018 that development was suspended, since headless Chrome via chromedriver provides the same functionality). Selenium can receive commands that make the browser load pages, fetch the required data, and even take page screenshots.

1. Installation

pip install selenium -i /simple

2. Download the browser driver

Google Chrome is used here.

/mirrors/chromedriver/

Check your browser version to download the corresponding driver.

Put the unzipped driver in a directory of your own.
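The driver only works when its major version matches the browser's major version, which is why you must download the driver corresponding to your browser. A minimal sketch of that check (the function names and version strings are illustrative, not part of Selenium):

```python
def major_version(version: str) -> int:
    """Extract the major version number from a dotted version string."""
    return int(version.split(".")[0])


def driver_matches_browser(driver_version: str, browser_version: str) -> bool:
    """A chromedriver build is compatible when the major versions agree."""
    return major_version(driver_version) == major_version(browser_version)


# e.g. Chrome 114.0.5735.198 needs a 114.x chromedriver
print(driver_matches_browser("114.0.5735.90", "114.0.5735.198"))  # True
```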

3. Examples

3.1 Download the corresponding version of the browser driver

/mirrors/chromedriver/

Put the unzipped driver in your own directory.

3.2 Test code, open a web page and get the title of the page

from selenium.webdriver import Chrome


if __name__ == '__main__':
    web = Chrome()
    web.get("")
    print(web.title)

3.3 A small sample

from selenium.webdriver import Chrome


if __name__ == '__main__':
    web = Chrome()
    url = '/acm/home'
    web.get(url)
    # Get the a tag to click on
    el = web.find_element_by_xpath('/html/body/div/div[3]/div[1]/div[1]/div[1]/div/a')
    # Click
    el.click()
    # Crawl the desired content
    lists = web.find_elements_by_xpath("/html/body/div/div[3]/div[1]/div[2]/div[@class='platform-item js-item ']"
                                       "/div[2]/div[1]/h4/a")
    print(len(lists))
    for i in lists:
        print(i.text)

3.4 Automatic input and jump

from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
import time

if __name__ == '__main__':
    web = Chrome()
    url = '/acm/home'
    web.get(url)

    el = web.find_element_by_xpath('/html/body/div/div[3]/div[1]/div[1]/div[1]/div/a')

    el.click()
    time.sleep(1)
    input_el = web.find_element_by_xpath('/html/body/div/div[3]/div[1]/div[1]/div[1]/form/input[1]')
    input_el.send_keys('Cattlemen', Keys.ENTER)
    # do something

4. Turn on headless mode

Headless mode runs the browser without a visible interface.

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

option = Options()  # Instantiate the Options object
option.add_argument("--headless")  # Add the headless argument to the Options object

if __name__ == '__main__':
    # Specify the driver location, otherwise Selenium looks in the Python interpreter directory.
    web = Chrome(executable_path=r'D:\PyProject\spider\venv\Scripts\chromedriver.exe', options=option)
    web.get("")
    print(web.title)

5. Save a screenshot of the page

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

option = Options()  # Instantiate the Options object
option.add_argument("--headless")  # Add the headless argument to the Options object

if __name__ == '__main__':
    web = Chrome()
    web.maximize_window()  # Maximize the browser window
    web.get("")
    print(web.title)
    web.save_screenshot('screenshot.png')  # Save a screenshot of the current page to the current folder
    web.close()  # Close the current page

6. Analog inputs and clicks

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

option = Options()  # Instantiate the Options object
option.add_argument("--headless")  # Add the headless argument to the Options object

if __name__ == '__main__':
    web = Chrome()
    web.maximize_window()  # Maximize the browser window
    web.get("https://www.baidu.com")
    el = web.find_element_by_id('kw')
    el.send_keys('Harris-H')
    btn = web.find_element_by_id('su')
    btn.click()
    # web.close()  # Close the current page

It seems that Baidu can now recognize Selenium and also requires image verification.

6.1 Finding nodes based on text values

# Find the node whose link text is exactly "Baidu."
driver.find_element_by_link_text("Baidu.")
# Get a list of elements whose link text contains the given text (fuzzy match)
driver.find_elements_by_partial_link_text("Degree.")

6.2 Getting the text of the current node

ele.text  # Get the text of the current node
ele.get_attribute("data-click")  # Get the value of the corresponding attribute

6.3 Printing some information about the current page

print(driver.page_source)  # Print the page's source code
print(driver.get_cookies())  # Print the page's cookies
print(driver.current_url)  # Print the current page's URL

6.4 Close the browser

driver.close()  # Close the current page
driver.quit()  # Close the browser entirely

6.5 Mouse scrolling simulation

from selenium.webdriver import Chrome
import time

if __name__ == '__main__':

    driver = Chrome()

    driver.get(
        "/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=78000241_12_hao_pg&wd=selenium%20js%E6%BB%91%E5%8A%A8&fenlei=256&rsv_pq=8215ec3a00127601&rsv_t=a763fm%2F7SHtPeSVYKeWnxKwKBisdp%2FBe8pVsIapxTsrlUnas7%2F7Hoo6FnDp6WsslfyiRc3iKxP2s&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=31&rsv_sug1=17&rsv_sug7=100&rsv_sug2=0&rsv_btype=i&inputT=9266&rsv_sug4=9770")
    # 1. Scroll down the page
    js = "document.documentElement.scrollTop=1000"
    # Execute the js
    driver.execute_script(js)
    time.sleep(2)
    # Scroll back to the top
    js = "document.documentElement.scrollTop=0"
    driver.execute_script(js)  # Execute the js

    time.sleep(2)
    driver.quit()
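The two scroll snippets differ only in the target offset, so the JS string can be parameterized with a small helper (the helper name is my own, not a Selenium API):

```python
def scroll_js(offset: int) -> str:
    """Build the JS snippet that scrolls the page to a vertical offset in pixels."""
    return f"document.documentElement.scrollTop={offset}"


# driver.execute_script(scroll_js(1000))  # scroll down
# driver.execute_script(scroll_js(0))     # back to the top
print(scroll_js(1000))  # document.documentElement.scrollTop=1000
```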

7. Set browser options

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://110.52.235.176:9999")  # Add a proxy
options.add_argument("--headless")  # Headless mode
options.add_argument("--lang=en-US")  # Display web pages in English
prefs = {"profile.managed_default_content_settings.images": 2, '': 2}  # Disable image rendering
options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(executable_path=r"D:\ProgramApp\chromedriver\chromedriver.exe", chrome_options=options)

driver.get("/ip")
 

8. Verify slider movement

Goal: Sliding CAPTCHA

  • 1. Locate the slide button
  • 2. Press and hold the slider
  • 3. Slide the button

import time
from selenium import webdriver
from selenium.webdriver import ActionChains

if __name__ == '__main__':
    chrome_obj = webdriver.Chrome()
    chrome_obj.get('/demo/2017/unlock/')

    # 1. Locate the slide button
    click_obj = chrome_obj.find_element_by_xpath('//div[@class="bar1 bar"]/div[@class="slide-to-unlock-handle"]')

    # 2. Press and hold
    # Create an action chain object; the argument is the browser object
    action_obj = ActionChains(chrome_obj)

    # Click and hold; the parameter is the located button
    action_obj.click_and_hold(click_obj)

    # Get the slider's width and height
    size_ = click_obj.size
    width_ = 298 - size_['width']  # The track width minus the slider width is the distance to slide along the x-axis (to the right)
    print(width_)
    # 3. Slide by that distance
    action_obj.move_by_offset(width_, 0).perform()

    # 4. Release the slider
    action_obj.release()

    time.sleep(6)
    chrome_obj.quit()
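The offset arithmetic above (track width minus handle width) can be isolated in a helper so it is easy to verify; the 298 px track width comes from the example page, and the function name is illustrative:

```python
def slide_distance(track_width: int, handle_width: int) -> int:
    """Distance the handle must travel: the track width minus the handle's own width."""
    return track_width - handle_width


# A 298 px track with a 40 px handle leaves 258 px to slide
print(slide_distance(298, 40))  # 258
```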

9. Open multiple windows and page switching

Sometimes there are many sub-tabs in one window. Selenium provides switch_to.window to handle this; you can find the handle of the page you want to switch to in driver.window_handles.

from selenium import webdriver

if __name__ == '__main__':
    driver = webdriver.Chrome()

    driver.get("/")
    driver.implicitly_wait(2)
    driver.execute_script("window.open('/')")
    driver.switch_to.window(driver.window_handles[1])

    print(driver.page_source)

10. Cookie operations

# 1. Get all cookies:
for cookie in driver.get_cookies():
    print(cookie)
# 2. Get a cookie's value by its key:
value = driver.get_cookie(key)
# 3. Delete all cookies:
driver.delete_all_cookies()
# 4. Delete a single cookie:
driver.delete_cookie(key)
# 5. Add a cookie:
driver.add_cookie({"name": "password", "value": "111111"})
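get_cookies() returns a list of dicts keyed by "name", so looking one up by name (roughly what get_cookie does) can be sketched in pure Python; the helper name is my own:

```python
def find_cookie(cookies, name):
    """Return the first cookie dict whose 'name' field matches, else None."""
    for cookie in cookies:
        if cookie.get("name") == name:
            return cookie
    return None


cookies = [{"name": "password", "value": "111111"}]
print(find_cookie(cookies, "password"))  # {'name': 'password', 'value': '111111'}
print(find_cookie(cookies, "missing"))   # None
```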
 

11. Simulate login

Here is an analog login to our school's Registrar's Office:

from selenium.webdriver import Chrome

if __name__ == '__main__':
    web = Chrome()
    web.get('/')
    username = web.find_element_by_id('userAccount')
    username.send_keys('xxxxxxx')  # Your student number goes here
    password = web.find_element_by_id('userPassword')
    password.send_keys('xxxxxxx')  # Your password goes here
    btn = web.find_element_by_xpath('//*[@]/li[4]/button')
    btn.click()
    # do something
    # do something

 

Since there is no slider or other CAPTCHA to validate, it's pretty simple. After that, just perform whatever operations you need.

12. Advantages and disadvantages

Selenium can execute js on the page, which makes it very easy to handle js-rendered data and simulated logins.
Selenium is very inefficient because of the number of requests it sends while fetching a page, so in many cases it should be used with discretion.

This concludes the article on the Python-Selenium automated crawler. For more on Selenium automated crawlers, please search my previous articles or continue browsing the related articles below. I hope you will support me in the future!