SoFunction
Updated on 2024-11-12

The selenium module for Python crawlers

I. What is Selenium

Selenium is a complete web application testing system that covers test recording (Selenium IDE), writing and running tests (Selenium Remote Control), and parallel test execution (Selenium Grid). Its core, Selenium Core, is based on JsUnit and written entirely in JavaScript, so it can run in any browser that supports JavaScript.

Selenium is an automated testing tool that drives a real browser and supports multiple browsers. In crawlers it is mainly used to handle pages that are rendered by JavaScript.

II. Installing selenium

When writing a crawler in Python, we mainly use selenium's WebDriver. We can first check which browsers it supports:

from selenium import webdriver
help(webdriver)

The result is as follows. It shows that selenium supports basically all common browsers:

NAME

PACKAGE CONTENTS:

  • android (package)
  • blackberry (package)
  • chrome (package)
  • common (package)
  • edge (package)
  • firefox (package)
  • ie (package)
  • opera (package)
  • phantomjs (package)
  • remote (package)
  • safari (package)
  • support (package)
  • webkitgtk (package)

VERSION: 3.14.1

1. PhantomJS: a browser without a graphical interface (headless browser)

PhantomJS is a WebKit-based server-side JavaScript API. It supports the Web without needing a browser, runs fast, and natively supports a range of Web standards: DOM handling, CSS selectors, JSON, and so on.

PhantomJS can be used for page automation, web monitoring, web page screenshots, and headless testing.

2. Download the browser driver

When selenium was upgraded to 3.0, different browser drivers were standardized. If you want to use selenium to drive different browsers, you must download and set up different browser drivers separately.

Download address for each browser:

  • Firefox driver: geckodriver
  • Chrome driver: chromedriver (a Taobao mirror is also available)
  • Internet Explorer driver: IEDriverServer
  • Edge driver: MicrosoftWebDriver

Note: The WebDriver must be compatible with both the browser version and the selenium version.

View the mapping of drivers to browser versions:

For Chrome, find the driver that matches the browser version directly; only the major version needs to match.
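The "major version only" rule can be checked mechanically. The sketch below is a minimal illustration and not part of selenium: major_version and driver_matches_browser are hypothetical helper names, the version strings are made-up examples, and it assumes a driver release that follows the browser's own dotted version numbering.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def driver_matches_browser(browser_version: str, driver_version: str) -> bool:
    """True when browser and driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Made-up version strings, for illustration only
print(driver_matches_browser("120.0.1000.10", "120.0.1000.5"))
print(driver_matches_browser("120.0.1000.10", "119.0.900.1"))
```

Comparing only the first component keeps the check tolerant of the build and patch numbers, which normally differ between a browser and its driver.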

After downloading, unzip it to any directory (the path should not contain Chinese characters).

The browser driver file (an .exe file after downloading and unzipping on Windows) must be placed in the same-level directory before it can be used. Alternatively, configure your computer's environment variables.

You can manually create a directory for storing browser drivers, e.g. C:\driver, and put the downloaded driver files (e.g. chromedriver, geckodriver) into it.

My Computer --> Properties --> System Settings --> Advanced --> Environment Variables --> System Variables --> Path: add the C:\driver directory to the value of Path.

III. Basic use of selenium

1. Declare the browser object

We saw above that selenium supports many browsers. To declare and launch one, use the code below. Only two examples are shown here; every other supported browser can be launched the same way.

from selenium import webdriver

browser = webdriver.Chrome()
# browser = webdriver.Firefox()

Headless startup

Headless Chrome is Chrome without a user interface: it lets you run your programs with all the features Chrome supports without opening a browser window. Compared with a regular browser, Headless Chrome makes it more convenient to test web applications, capture website screenshots, crawl information, and so on. Compared with older tools such as PhantomJS and SlimerJS, Headless Chrome is much closer to a real browser environment.

Headless Chrome version requirements: according to the official documentation, macOS and Linux require Chrome 59+, Windows requires Chrome 60+, and chromedriver must be version 2.30+.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

chrome_options = webdriver.ChromeOptions()
# Use headless browser mode
chrome_options.add_argument('--headless')  # Run without a browser window
chrome_options.add_argument('--disable-gpu')  # Without this option, element positioning sometimes misbehaves
# Launch a browser and get the source code of the page
browser = webdriver.Chrome(options=chrome_options)
mainUrl = "/"
browser.get(mainUrl)
print(f"browser text = {browser.page_source}")
browser.quit()

2. Visit the page

When the following code runs, it automatically opens the Chrome browser, visits Baidu, prints the source code of the Baidu home page, and then closes the browser.

from selenium import webdriver

browser = webdriver.Chrome()

browser.get("")
print(browser.page_source)
browser.quit()

3. Find elements

1. Finding a single element

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("/")
input_first = browser.find_element_by_id("q")
input_second = browser.find_element_by_css_selector("#q")
input_third = browser.find_element_by_xpath('//*[@id="q"]')
print(input_first)
print(input_second)
print(input_third)
browser.quit()

Here we get the same element in three different ways: the first by id, the second with a CSS selector, and the third with an XPath expression. All three return the same result.

Commonly used methods for finding elements:

  • find_element_by_name: locate by the element's name attribute
  • find_element_by_id: locate by the element's id
  • find_element_by_xpath: locate by XPath expression
  • find_element_by_link_text: locate by the complete text of a hyperlink
  • find_element_by_partial_link_text: locate by part of a hyperlink's text
  • find_element_by_tag_name: locate by tag name
  • find_element_by_class_name: locate by class name
  • find_element_by_css_selector: locate by CSS selector

Examples:

XPath locators can be written in many ways; here are a few common ones:

dr.find_element_by_xpath("//*[@id='kw']")
dr.find_element_by_xpath("//*[@name='wd']")
dr.find_element_by_xpath("//input[@class='s_ipt']")
dr.find_element_by_xpath("/html/body/form/span/input")
dr.find_element_by_xpath("//span[@class='soutu-btn']/input")
dr.find_element_by_xpath("//form[@id='form']/span/input")
dr.find_element_by_xpath("//input[@id='kw' and @name='wd']")

CSS locators can also be written in many ways; here are a few common ones:

dr.find_element_by_css_selector("#kw")
dr.find_element_by_css_selector("[name=wd]")
dr.find_element_by_css_selector(".s_ipt")
dr.find_element_by_css_selector("html > body > form > span > input")
dr.find_element_by_css_selector("span.soutu-btn > input#kw")
dr.find_element_by_css_selector("form#form > span > input")

How to use xpath

1. First method: locate by absolute path (you will probably never use this one)

dr.find_element_by_xpath("/html/body/div/form/input")

2. Second method: locate by relative path; two slashes indicate a relative path

dr.find_element_by_xpath("//input//div")

3. Third method: locate by element index

dr.find_element_by_xpath("//input[4]")

4. Fourth method: locate using xpath + node attributes (can be combined with methods 2 and 3)

dr.find_element_by_xpath("//input[@id='kw1']")
dr.find_element_by_xpath("//input[@type='name' and @name='kw1']")

5. Fifth method: match on partial attribute values (the most powerful method)

dr.find_element_by_xpath("//input[starts-with(@id,'nice')]")
dr.find_element_by_xpath("//input[ends-with(@id,'beautiful')]")  # ends-with is XPath 2.0; browsers implement XPath 1.0, so this form usually fails
dr.find_element_by_xpath("//input[contains(@id,'So beautiful')]")

6. Sixth method: combine the methods above

dr.find_element_by_xpath("//input[@id='kw1']//input[starts-with(@id,'nice')]/div[1]/form[3]")
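Several of these XPath forms can be tried without a browser. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath (paths, attribute predicates, positional indexes; no starts-with or contains), on a small made-up document. The markup and ids are assumptions for illustration; in a crawler the corresponding expressions would be passed to selenium instead.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a page (hypothetical markup)
PAGE = """
<html>
  <body>
    <form id="form">
      <span><input id="kw" name="wd" class="s_ipt"/></span>
      <span><input id="su" type="submit"/></span>
    </form>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Path written out step by step from the root (the "absolute path" style)
by_path = root.find("body/form/span/input")

# Relative path: search the whole tree and filter by an attribute
by_attr = root.find(".//input[@id='kw']")

# Positional index: the input inside the second <span> of the form
by_index = root.find(".//form/span[2]/input")

# Two attribute predicates chained on the same element
by_both = root.find(".//input[@id='kw'][@name='wd']")

print(by_path.get("id"), by_attr.get("name"), by_index.get("id"), by_both.get("class"))
```

The step-by-step path breaks as soon as the page layout changes, while the attribute-based forms keep working, which is why the attribute-based methods are usually preferred.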

The following approach is more general. It relies on the By class, which must be imported.

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()

browser.get("")
input_first = browser.find_element(By.ID, "q")
print(input_first)
browser.quit()

This approach is equivalent to the ones described above. In browser.find_element(By.ID, "q"), By.ID can be replaced with any of the other locator strategies.
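For reference, the By strategies are plain strings, and each find_element_by_* helper corresponds to one of them. The sketch below lists that correspondence and wraps it in a hypothetical helper, find_legacy, which assumes only a driver object exposing find_element(by, value); the helper and dictionary names are illustrative and not part of selenium. This is mainly useful when moving code to selenium 4, where the find_element_by_* helpers were deprecated and later removed.

```python
# Locator strategy strings; these equal the constants on selenium's By class
LEGACY_TO_BY = {
    "find_element_by_id": "id",                                # By.ID
    "find_element_by_name": "name",                            # By.NAME
    "find_element_by_xpath": "xpath",                          # By.XPATH
    "find_element_by_link_text": "link text",                  # By.LINK_TEXT
    "find_element_by_partial_link_text": "partial link text",  # By.PARTIAL_LINK_TEXT
    "find_element_by_tag_name": "tag name",                    # By.TAG_NAME
    "find_element_by_class_name": "class name",                # By.CLASS_NAME
    "find_element_by_css_selector": "css selector",            # By.CSS_SELECTOR
}


def find_legacy(driver, legacy_name, value):
    """Look up an element using the generic find_element call.

    legacy_name is the name of an old helper, e.g. "find_element_by_id".
    """
    strategy = LEGACY_TO_BY[legacy_name]
    return driver.find_element(strategy, value)
```

With a real driver, find_legacy(browser, "find_element_by_id", "q") would behave like browser.find_element(By.ID, "q").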

2. Finding multiple elements

find_elements locates multiple elements and find_element locates a single one; otherwise they are used the same way. An example:

browser.get("/")
lis = browser.find_elements_by_css_selector('.service-bd li')
print(lis)
browser.quit()

The result is a list of elements.

Of course, the same can be done with the By class:

lis = browser.find_elements(By.CSS_SELECTOR,'.service-bd li')

Multiple-element lookup has the same set of methods as single-element lookup:

  • find_elements_by_name
  • find_elements_by_id
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector

4. Interacting with elements

Generally speaking, these are the most commonly used WebDriver methods for operating on elements:

  • click: click the element
  • send_keys: simulate keystrokes on the element
  • clear: clear the element's contents, if possible
  • submit: submit the element's contents, if possible
  • text: get the element's text
1. Keyboard events

Keyboard keys are provided by the Keys class, which must be imported:

from selenium.webdriver.common.keys import Keys

Send key presses via send_keys():

send_keys(Keys.TAB)  # TAB
send_keys(Keys.ENTER)  # Enter
send_keys(Keys.CONTROL, 'a')  # Ctrl+A: select everything in the input box
send_keys(Keys.CONTROL, 'x')  # Ctrl+X: cut the content of the input box

import time

browser.get("")
input_str = browser.find_element_by_id('q')
input_str.send_keys("ipad")
time.sleep(1)
input_str.clear()
input_str.send_keys("MacBook pro")
button = browser.find_element_by_class_name('btn-search')
button.click()

Running this shows the program automatically opening Chrome, visiting Taobao, entering ipad, deleting it, entering MacBook pro, and clicking search.

2. Mouse events

Mouse events generally include right click, double click, drag, hovering over an element, and so on. They require the ActionChains class, imported as follows:

from selenium.webdriver.common.action_chains import ActionChains

Common ActionChains methods:

  • context_click(): right click
  • double_click(): double click
  • drag_and_drop(): drag and drop
  • move_to_element(): hover the mouse over an element
  • perform(): execute all actions stored in the ActionChains

Double-click example:

qqq = driver.find_element_by_xpath("xxx")  # Locate the element to be double-clicked
ActionChains(driver).double_click(qqq).perform()  # Double-click the located element

Drag-and-drop example:

from selenium.webdriver import ActionChains

url = "/try/?filename=jqueryui-api-droppable"
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
target = browser.find_element_by_css_selector('#droppable')
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
actions.perform()

For more operations, see:

/#.action_chains

5. Execute JavaScript

This is a very useful method: you can call JavaScript directly to perform some operations. The following example visits Zhihu, scrolls to the bottom of the page with JavaScript, and then shows an alert box.

browser.get("/explore")
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("To Bottom")')

6. Get DOM

1. Get element attributes: get_attribute('class')

url = '/explore'
browser.get(url)
logo = browser.find_element_by_id('zh-top-link-logo')
print(logo)
print(logo.get_attribute('class'))

2. Get the text value: text

url = '/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.text)

3. Get the ID, location, tag name, and size

  • id
  • location
  • tag_name
  • size

url = '/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.id)
print(input.location)
print(input.tag_name)
print(input.size)

7. Frame

Many web pages contain frame tags, so when crawling data we need to switch into a frame and back out of it. The following example demonstrates this.

The commonly used methods here are switch_to.frame() and switch_to.parent_frame().

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Chrome()
url = '/try/?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
print(source)

# Inside the frame, the outer page's logo cannot be found
try:
    logo = browser.find_element_by_class_name('logo')
except NoSuchElementException:
    print('NO LOGO')
# Back in the parent frame, the logo is reachable again
browser.switch_to.parent_frame()
logo = browser.find_element_by_class_name('logo')
print(logo)
print(logo.text)

8. Waiting

1. Implicit waiting

When an implicit wait is set, if WebDriver does not find an element in the DOM, it keeps waiting for up to the configured time before raising an element-not-found exception.

In other words, when an element does not appear immediately, the implicit wait keeps looking it up in the DOM for a period of time. The default time is 0.

If the element has still not loaded once the specified time has passed, an exception is raised. If the element loads earlier, execution continues immediately without waiting for the rest of the time.

from selenium import webdriver

browser = webdriver.Chrome()
browser.implicitly_wait(10)
browser.get('/explore')
input = browser.find_element_by_class_name('zu-top-add-question')
print(input)

2. Explicit waiting

Specify a wait condition and a maximum wait time; the condition is checked repeatedly within that time.

If the condition is met, the wait returns immediately. If not, it keeps waiting up to the maximum time you specified and raises an exception if the condition is still not met.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('/')
wait = WebDriverWait(browser, 10)
input = wait.until(EC.presence_of_element_located((By.ID, 'q')))
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')))
print(input, button)

In the example above, EC.presence_of_element_located() checks whether the element has appeared in the DOM, and EC.element_to_be_clickable() checks whether the element is clickable.
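The idea behind an explicit wait can be sketched in a few lines of plain Python. This is a simplified illustration of the polling idea, not selenium's implementation: wait_until is a hypothetical name, the condition here takes no arguments (WebDriverWait passes the driver to it), and it raises the built-in TimeoutError rather than selenium's TimeoutException. The 0.5 second default mirrors WebDriverWait's default polling interval.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Usage: a stand-in condition that only succeeds on the third poll
attempts = []


def ready_on_third_try():
    attempts.append(1)
    return "loaded" if len(attempts) >= 3 else False


print(wait_until(ready_on_third_try, timeout=2.0, poll_interval=0.01))
```

Because the loop returns as soon as the condition holds, an explicit wait costs only as much time as the page actually needs, unlike a fixed time.sleep().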

3. Commonly used conditions:

  • title_is: the title equals the given text
  • title_contains: the title contains the given text
  • presence_of_element_located: the element has been loaded; pass in a locator tuple, e.g. (By.ID, 'p')
  • visibility_of_element_located: the element is visible; pass in a locator tuple
  • visibility_of: the element is visible; pass in the element object
  • presence_of_all_elements_located: elements matching the locator have been loaded; pass in a locator tuple
  • text_to_be_present_in_element: the element's text contains the given text
  • text_to_be_present_in_element_value: the element's value contains the given text
  • frame_to_be_available_and_switch_to_it: the frame has loaded, and the driver switches to it
  • invisibility_of_element_located: the element is not visible
  • element_to_be_clickable: the element is clickable
  • staleness_of: whether the element is still in the DOM; useful for telling whether the page has been refreshed
  • element_to_be_selected: the element is selected; pass in the element object
  • element_located_to_be_selected: the element is selected; pass in a locator tuple
  • element_selection_state_to_be: pass in the element object and a state; returns True if they match, otherwise False
  • element_located_selection_state_to_be: pass in a locator tuple and a state; returns True if they match, otherwise False
  • alert_is_present: whether an alert is present

Example: checking a blog's title

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("/101718qiong/")

title = EC.title_is("Silence&QH - Blogland")  # Check that the title matches exactly
print(title(driver))

title1 = EC.title_contains("Silence&QH")  # Check that the title contains the text
print(title1(driver))

r1 = EC.title_is("Silence&QH - Blogland")(driver)  # Alternative, more compact form
r2 = EC.title_contains("Silence&QH")(driver)
print(r1)
print(r2)

For more operational references:

/#.expected_conditions

9. Browser operations

Maximizing and minimizing the browser

browser.maximize_window()  # Maximize the browser window
browser.minimize_window()  # Minimize the browser window

Setting the browser window size

browser.set_window_size(480, 800)  # Set the window to 480 wide and 800 high

Browser back and forward

  • back(): go back
  • forward(): go forward

import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('/')
browser.get('/')
browser.get('/')
browser.back()
time.sleep(1)
browser.forward()
browser.quit()

10. Cookie operations

  • get_cookies()
  • delete_all_cookies()
  • add_cookie()

browser.get('/explore')
print(browser.get_cookies())
browser.add_cookie({'name': 'name', 'domain': '', 'value': 'zhaofan'})
print(browser.get_cookies())
browser.delete_all_cookies()
print(browser.get_cookies())
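A common use of these methods in crawlers is keeping a login session between runs. The sketch below is one possible way to do that, assuming only the get_cookies() and add_cookie() methods listed above; save_cookies, load_cookies, and the file path are hypothetical names chosen for illustration. With a real browser, the target site must already be open before add_cookie() is called, because a cookie can only be set for the domain of the current page.

```python
import json


def save_cookies(driver, path):
    """Write the driver's current cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(driver.get_cookies(), f)


def load_cookies(driver, path):
    """Read cookies from a JSON file and add them to the driver.

    Returns the number of cookies that were added.
    """
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

A typical flow would be: log in once, call save_cookies(browser, "cookies.json"), and on later runs open the site, call load_cookies(browser, "cookies.json"), then reload the page.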

11. Multi-window management

A new tab is opened by executing a JavaScript command. The open tabs are listed in browser.window_handles.

The first tab is browser.window_handles[0]. current_window_handle returns the handle of the current window.

import time

browser.get('')
browser.execute_script('window.open()')
print(browser.window_handles)
browser.switch_to.window(browser.window_handles[1])
browser.get('')
time.sleep(1)
browser.switch_to.window(browser.window_handles[0])
browser.get('')
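Indexing window_handles by position works when only two tabs are open, but it assumes an ordering. A slightly more defensive pattern is to remember the handles that existed before the new tab was opened and switch to whichever handle is new. The helper below is a sketch under that assumption; switch_to_new_window is a hypothetical name, and it relies only on the window_handles attribute and switch_to.window() method used above.

```python
def switch_to_new_window(driver, known_handles):
    """Switch to the first window handle that is not in known_handles.

    Returns the new handle, or None if no new window has appeared.
    """
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    return None


# Intended use with a real driver (not executed here):
#   before = set(browser.window_handles)
#   browser.execute_script('window.open()')
#   switch_to_new_window(browser, before)
```

Returning None instead of raising leaves it to the caller to decide whether a missing window is an error or simply means the tab has not opened yet.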

12. Exception handling

There are quite a few exception types; see the reference on the official website:

/#

Here is just a simple demonstration of looking up an element that does not exist:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException

browser = webdriver.Chrome()
try:
    browser.get('')
except TimeoutException:
    print('Time Out')
try:
    browser.find_element_by_id('hello')
except NoSuchElementException:
    print('No Element')
finally:
    browser.quit()

13. Warning box handling

Handling JavaScript-generated alert, confirm, and prompt boxes in WebDriver is simple: use switch_to.alert to locate the box, then use text, accept(), dismiss(), or send_keys() to operate on it.

Methods

  • text: returns the text in the alert/confirm/prompt
  • accept(): accepts the current warning box
  • dismiss(): dismisses the current warning box
  • send_keys(keysToSend): sends text to the warning box; keysToSend is the text to send

Demonstration

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

driver = webdriver.Chrome(r"F:\Chrome\ChromeDriver\chromedriver")
driver.implicitly_wait(10)
driver.get('')

# Hover over the "Settings" link
link = driver.find_element_by_link_text('Settings')
ActionChains(driver).move_to_element(link).perform()

# Open search settings
driver.find_element_by_link_text("Search Settings").click()

# Wait 2 seconds here, otherwise an error may occur
time.sleep(2)
# Save settings
driver.find_element_by_class_name("prefpanelgo").click()
time.sleep(2)

# Accept the warning box
driver.switch_to.alert.accept()

driver.quit()

14. Drop-down box selection

WebDriver provides the Select class for handling drop-down boxes. Import it as follows:

from selenium.webdriver.support.select import Select

Methods of the Select class:

  • select_by_value("select_value"): select by the value attribute of the option
  • select_by_index(index_value): select by the index of the option
  • select_by_visible_text("text_value"): select by the visible text of the option

The following example uses the drop-down box in Baidu's search settings.

from selenium import webdriver
from selenium.webdriver.support.select import Select
from time import sleep

driver = webdriver.Chrome(r"F:\Chrome\ChromeDriver\chromedriver")
driver.implicitly_wait(10)
driver.get('')

# 1. Click the "Settings" link
driver.find_element_by_link_text('Settings').click()
sleep(1)
# 2. Open the search settings
driver.find_element_by_link_text("Search Settings").click()
sleep(2)

# 3. Number of search results displayed
sel = driver.find_element_by_xpath("//select[@id='nr']")
Select(sel).select_by_value('50')  # Show 50 results

sleep(3)
driver.quit()

15. File upload

When the upload function is implemented with an input tag, it can be treated as an input box: pass the path of the local file to send_keys() to upload the file.

from selenium import webdriver
import os

driver = webdriver.Chrome()
file_path = 'file:///' + os.path.abspath('')
driver.get(file_path)

# Locate the upload button and add a local file
driver.find_element_by_name("file").send_keys('D:\\upload_file.txt')

driver.quit()

16. Window screenshot

Automation cases are executed by a program, so the error messages they print are sometimes not very clear. If you take a screenshot of the current window when a script fails, the picture shows the cause of the error directly. WebDriver provides the screenshot function get_screenshot_as_file() to capture the current window.

Screenshot method:

get_screenshot_as_file(self, filename) captures the current window and saves the image locally.

from selenium import webdriver
from time import sleep

driver = webdriver.Firefox(executable_path=r"F:\GeckoDriver\geckodriver")
driver.get('')

driver.find_element_by_id('kw').send_keys('selenium')
driver.find_element_by_id('su').click()
sleep(2)

# 1. Capture the current window and specify where to save the screenshot (the image is written in PNG format)
driver.get_screenshot_as_file("D:\\baidu_img.png")

driver.quit()
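Since the screenshot is most useful at the moment something fails, the capture can be tied to error handling. The helper below is a minimal sketch of that idea: run_with_screenshot is a hypothetical name, and the only WebDriver call it assumes is get_screenshot_as_file(), described above.

```python
def run_with_screenshot(driver, action, screenshot_path):
    """Run action(driver); if it raises, save a screenshot and re-raise."""
    try:
        return action(driver)
    except Exception:
        # Capture the window as it looked when the step failed
        driver.get_screenshot_as_file(screenshot_path)
        raise
```

Re-raising keeps the original exception and traceback intact, so the screenshot adds information without hiding the failure. With a real driver, a step could be wrapped as run_with_screenshot(driver, lambda d: d.find_element_by_id('su').click(), "error.png").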

17. Close the browser

In the previous examples we have been using the quit() method, which quits the associated driver and closes all windows. WebDriver also provides the close() method, which closes only the current window. For example, when a test case opens several windows and we want to close just one of them, we use close().

  • close() closes a single window
  • quit() closes all windows

18. Avoiding detection of selenium

Many large websites now detect selenium. For example, when we visit Taobao and similar sites with a normal browser, the value of window.navigator.webdriver is undefined, whereas it is true when the site is accessed through selenium. How can this be fixed?

Setting a Chromedriver startup parameter solves the problem. Before starting Chromedriver, set Chrome's experimental option excludeSwitches to ['enable-automation']. The full code is below:

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)

19. Examples

Automatic login to CSDN

import time
import numpy as np
from numpy import random
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def ease_out_expo(x):
    # Exponential ease-out: fast at the start, slowing toward the end
    if x == 1:
        return 1
    else:
        return 1 - pow(2, -10 * x)


def get_tracks(distance, seconds, ease_func):
    # Build the cumulative offsets and the per-step movements along the easing curve
    tracks = [0]
    offsets = [0]
    for t in np.arange(0.0, seconds + 0.1, 0.1):
        ease = globals()[ease_func]
        offset = round(ease(t / seconds) * distance)
        tracks.append(offset - offsets[-1])
        offsets.append(offset)
    return offsets, tracks


def drag_and_drop(browser, offset):
    # Locate the slider element
    WebDriverWait(browser, 20).until(
        EC.visibility_of_element_located((By.XPATH, "//*[@class='nc_iconfont btn_slide']"))
    )
    knob = browser.find_element_by_xpath("//*[@class='nc_iconfont btn_slide']")
    offsets, tracks = get_tracks(offset, 0.2, 'ease_out_expo')
    ActionChains(browser).click_and_hold(knob).perform()
    for x in tracks:
        ActionChains(browser).move_by_offset(x, 0).perform()
    # Pause for a random 0.6 to 1.4 seconds, then release the mouse button
    ActionChains(browser).pause(random.uniform(6, 14) / 10).release().perform()


chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--start-maximized")
browser = webdriver.Chrome(options=chrome_options)

browser.get('')
browser.find_element_by_id('kw').send_keys('CSDN')
browser.find_element_by_id('su').click()
WebDriverWait(browser, 20).until(
    EC.visibility_of_element_located((By.PARTIAL_LINK_TEXT, "-Professional IT Technology Community"))
)

browser.find_element_by_partial_link_text('-Professional IT Technology Community').click()

browser.switch_to.window(browser.window_handles[1])  # Switch to the newly opened window
time.sleep(1)
browser.find_element_by_partial_link_text('Login').click()
browser.find_element_by_link_text('Account Password Login').click()
browser.find_element_by_id('all').send_keys('yangbobin')
browser.find_element_by_name('pwd').send_keys('pass-word')
browser.find_element_by_css_selector("button[data-type='account']").click()
time.sleep(5)  # Wait for the slider module and other JS files to finish loading
while True:
    # Perform the mouse drag-and-drop action
    drag_and_drop(browser, 261)
    # Wait for the JS verification to run; without the wait, errors are likely
    time.sleep(2)
    # Check whether verification succeeded; if the refresh link appears, click it and retry
    WebDriverWait(browser, 20).until(
        EC.visibility_of_element_located((By.LINK_TEXT, "Refresh"))
    )
    browser.find_element_by_link_text('Refresh').click()

Automatic login to 163 mailbox

First, note that the 163 login form is inside an iframe, so switch into it before locating the fields:

browser.switch_to.frame('x-URS-iframe')
