Crawling WeChat official account articles with Python and selenium: code walkthrough

Reference: adding cookies with selenium webdriver. https:///article/

Requirement:

I want to read the historical articles of a WeChat official account, but it is inconvenient to find where I left off every time.

Approach:

1. Use selenium to open the official account's article history, keep scrolling to the bottom so that everything loads, and collect the URLs of all historical articles.

2. Iterate over the URLs and download the articles locally.

Implementation

1. Open the WeChat client, click on an official account -> enter the account -> open the article history link (open it in a browser), then obtain the cookies through the developer tools and save them as an Excel file.

2. Start webdriver and add the corresponding cookies.

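from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# The driver class is missing in the source; Chrome is assumed here
browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)
# Visit any address first; cookies can only be set after a page has been loaded
# (the host part of this address is missing in the source)
browser.get('/get')
# Add the cookies; df is the DataFrame loaded from the saved Excel cookie file
for i in range(len(df)):
  cookie_dict = {
          "domain": df.loc[i, 'DomaiN'],
          'name': df.loc[i, 'Name'],
          'value': str(df.loc[i, 'Value']),
          "expires": df.loc[i, "Expires/Max-Age"],
          'path': '/',}
  browser.add_cookie(cookie_dict)
# weixin_url is the article history link opened in step 1
browser.get(weixin_url)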
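
One detail that may be worth handling when building the cookies: the WebDriver cookie format uses the key `expiry` (an integer Unix timestamp), while the saved sheet holds whatever the developer tools displayed. The following is a minimal sketch, not the author's code. The helper name `build_cookie`, the sample rows, and the domain `mp.weixin.qq.com` are assumptions for illustration, and it assumes the Expires/Max-Age column holds either an ISO 8601 timestamp or the word "Session".

```python
from datetime import datetime, timezone


def build_cookie(row):
    """Turn one saved cookie row (a plain mapping) into a dict for add_cookie."""
    cookie = {
        "domain": str(row["DomaiN"]),
        "name": str(row["Name"]),
        "value": str(row["Value"]),
        "path": "/",
    }
    expires = str(row.get("Expires/Max-Age", "")).strip()
    if expires and expires.lower() not in ("session", "nan"):
        try:
            parsed = datetime.strptime(expires, "%Y-%m-%dT%H:%M:%S.%fZ")
            # WebDriver expects whole seconds since the Unix epoch
            cookie["expiry"] = int(parsed.replace(tzinfo=timezone.utc).timestamp())
        except ValueError:
            pass  # unknown format: fall back to a session cookie
    return cookie


# Hypothetical rows standing in for the saved Excel sheet
rows = [
    {"DomaiN": "mp.weixin.qq.com", "Name": "demo_key", "Value": 123,
     "Expires/Max-Age": "2017-01-13T00:00:00.000Z"},
    {"DomaiN": "mp.weixin.qq.com", "Name": "demo_session", "Value": "abc",
     "Expires/Max-Age": "Session"},
]
cookies = [build_cookie(row) for row in rows]
```

With a DataFrame, the same helper could be fed from `df.to_dict('records')`, passing each resulting dict to `browser.add_cookie`.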

3. Control the browser to scroll down.

Looking at page_source, you can see the element that indicates whether the article list has reached the very bottom.

<div class="loadmore with_line" style="display: none;" >
    <div class="tips_wrp">
      <span class="tips js_no_more_msg" style="display: none;">No more</span>
      <span class="tips js_need_add_contact" style="display: none;">Follow the official account to receive more messages</span>
    </div>
  </div>

Use the driver to execute JS.

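%%time
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# Decide whether the bottom has been reached by checking the style of the
# "No more" element, and keep scrolling until it becomes visible.
# The text in the XPath must match what the live page actually renders.
no_more_xpath = '//span[@class="tips js_no_more_msg" and text()="No more"]'
no_more_msg_style = 'display: none;'
while True:
  wait.until(EC.presence_of_element_located((By.XPATH, no_more_xpath)))
  # Selenium 3 API; Selenium 4 uses browser.find_element(By.XPATH, ...)
  no_more = browser.find_element_by_xpath(no_more_xpath)
  now_style = no_more.get_attribute('style')
  if str(now_style).find(no_more_msg_style) == -1:
    # "No more" is visible, so everything has been loaded
    break
  else:
    # Pause for a moment and wait for the browser to load
    time.sleep(5)
    # Scroll to the bottom via JS
    browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')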
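
A `while True` loop like this keeps running if the "No more" marker never becomes visible, for example when the cookies have expired. As a hedged sketch of one way to guard against that, the scrolling can be separated from the stop condition and capped at a maximum number of rounds. The names `scroll_until_done` and `max_rounds` and the fake page state below are hypothetical and only serve to make the example runnable without a browser.

```python
import time


def scroll_until_done(is_done, scroll_once, max_rounds=200, pause=0.0):
    """Call scroll_once() until is_done() is true or max_rounds is reached.

    Returns the number of scrolls that were performed.
    """
    rounds = 0
    while not is_done() and rounds < max_rounds:
        scroll_once()
        rounds += 1
        time.sleep(pause)  # give the page time to load more items
    return rounds


# Fake page: it reports "done" after three scrolls
state = {"scrolls": 0}


def fake_scroll():
    state["scrolls"] += 1


finished_rounds = scroll_until_done(lambda: state["scrolls"] >= 3, fake_scroll)

# A page that never finishes stops at the cap instead of looping forever
capped_rounds = scroll_until_done(lambda: False, lambda: None, max_rounds=5)
```

In the real crawl, `is_done` would check that the style of the "No more" element no longer contains `display: none;`, and `scroll_once` would call `browser.execute_script` as above, with `pause` set to a few seconds.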

4. Extract the key information.

From the HTML, the article URL can be parsed out of each <div ... msg> card.

<div class="weui_msg_card js_card" msg>
      <div class="weui_msg_card_hd">2017年1月13日</div>
      <div class="weui_msg_card_bd">
         <!-- image-text message -->
             <!-- ordinary image-text message -->
            <div  class="weui_media_box appmsg js_appmsg" hrefs="/s?__biz=MzI5MDQ4NzU5MA==&mid=2247483748&idx=1&sn=e804e638484794181a27c094f81be8e1&chksm=ec1e6d2ddb69e43bd3e1f554c2d0cedb37f099252f122cee1ac5052b589b56f428b2c304de8e&scene=38#wechat_redirect" data-t="0">
              <span class="weui_media_hd js_media" style="background-image:url(/mmbiz_jpg/XibhQ5tjv6dG9B4GF1C9MGBJO5AR2wvjCL9LgdcFgAdEgyU8wZFuDXoH9O9dNvafwK3RibCjUyiarIlUDlkxbcyfQ/640?wx_fmt=jpeg)" data-s="640" hrefs="/s?__biz=MzI5MDQ4NzU5MA==&mid=2247483748&idx=1&sn=e804e638484794181a27c094f81be8e1&chksm=ec1e6d2ddb69e43bd3e1f554c2d0cedb37f099252f122cee1ac5052b589b56f428b2c304de8e&scene=38#wechat_redirect" data-type="APPMSG">
              </span>
              <div class="weui_media_bd js_media" data-type="APPMSG">
                <h4 class="weui_media_title" hrefs="/s?__biz=MzI5MDQ4NzU5MA==&mid=2247483748&idx=1&sn=e804e638484794181a27c094f81be8e1&chksm=ec1e6d2ddb69e43bd3e1f554c2d0cedb37f099252f122cee1ac5052b589b56f428b2c304de8e&scene=38#wechat_redirect">
                  What's wrong with admitting you're a refugee?
                </h4>
                <p class="weui_media_desc">The chains are heavy enough. Refuse to be morally abducted</p>
                <p class="weui_media_extra_info">2017年1月13日</p>
              </div>
            </div> 
      </div>
    </div>
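
To illustrate where the title and link sit in this markup, here is a small offline sketch that parses a saved HTML fragment with the standard library's `html.parser` instead of selenium. This is a swapped-in technique for illustration, not the approach used in this article. The class name `TitleLinkParser` and the shortened `hrefs` value in the sample are assumptions.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (title, hrefs) pairs from <h4 class="weui_media_title" hrefs="...">."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None   # hrefs value of the <h4> currently being read
        self._text = []     # text chunks seen inside that <h4>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h4" and attrs.get("class") == "weui_media_title":
            self._href = attrs.get("hrefs", "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h4" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None


# Shortened sample; the real hrefs value is much longer
sample = """
<div class="weui_media_bd js_media" data-type="APPMSG">
  <h4 class="weui_media_title" hrefs="/s?__biz=EXAMPLE&amp;mid=1">
    What's wrong with admitting you're a refugee?
  </h4>
  <p class="weui_media_extra_info">2017年1月13日</p>
</div>
"""
parser = TitleLinkParser()
parser.feed(sample)
links = parser.items
```

The same parser could be fed `browser.page_source` once the page has been scrolled to the bottom, which avoids one browser round trip per element.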

The articles fall mainly into two types:

<div class="weui_media_bd js_media" data-type="APPMSG">
<div class="weui_media_bd js_media" data-type="TEXT">

Each article is also marked as original or not.

Final implementation:

%%time
from selenium.common.exceptions import NoSuchElementException

result = []
errlist = []
# Get all message cards first
el_divs = browser.find_elements_by_xpath('//div[@class="weui_msg_card_list"]/div[@class="weui_msg_card js_card"]')
i = 0
for div in el_divs:
  date = title = url = yuanchuang = ''
  try:
    date = div.find_element_by_xpath('.//div[@class="weui_msg_card_hd"]').get_attribute('innerHTML')
    el_content = div.find_element_by_xpath('.//div[@class="weui_media_bd js_media"]')
    if el_content.get_attribute('data-type') == 'APPMSG':
      el = el_content.find_element_by_xpath('./h4[@class="weui_media_title"]')
      title = el.text
      url = el.get_attribute('hrefs')
      xb = el_content.find_element_by_xpath('./p[@class="weui_media_extra_info"]').text
      # The label searched for must match the text the live page actually shows
      yuanchuang = 'Original' if xb.find('Original') != -1 else ''
    elif el_content.get_attribute('data-type') == 'TEXT':
      title = 'accompanying text'
      url = el_content.find_element_by_xpath('./div').text
      yuanchuang = 'Original'
    else:
      # Other unidentified types: keep the raw HTML for later analysis
      errlist.append([i, div.get_attribute('innerHTML')])
  except NoSuchElementException:
    # A node was not found: record the card instead of stopping the loop
    errlist.append([i, div.get_attribute('innerHTML')])
  print(str(i), ':', date, title, url, yuanchuang)
  result.append([date, title, yuanchuang, url])
  i = i + 1

5. Save the collected URLs to Excel.

import pandas as pd

dfout = pd.DataFrame(result, columns=['date', 'title', 'original', 'address'])
with pd.ExcelWriter(savename) as writer:
  dfout.to_excel(writer, index=False, sheet_name='Sheet1')

The saved sheet contains the columns date, title, original, and address.

6. Finally, iterate over the saved link addresses and fetch and save them one by one with requests. To assemble the articles into a document with a table of contents, you can refer to:

Notes on crawling an Excel VBA reference manual in practice: https:///article/
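
As a rough sketch of step 6: the `hrefs` values collected above are relative paths, so they need a host before they can be fetched. The example below assumes `https://mp.weixin.qq.com` as that host, which is not stated in this article. It uses the standard library's `urllib` in place of requests so that it runs without extra packages. The helper names `absolute_url`, `safe_filename`, and `save_article` are hypothetical.

```python
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen

BASE_URL = "https://mp.weixin.qq.com"  # assumed host for the relative hrefs


def absolute_url(href, base=BASE_URL):
    """Join a relative hrefs value with the assumed host."""
    return urljoin(base, href)


def safe_filename(date, title, ext=".html"):
    """Build a file name without whitespace or characters that file systems reject."""
    name = re.sub(r'[\\/:*?"<>|\s]+', "_", f"{date}_{title}").strip("_")
    return name + ext


def save_article(url, path, timeout=30):
    """Download one article and write the raw bytes to path (not called here)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response, open(path, "wb") as out:
        out.write(response.read())


url = absolute_url("/s?__biz=EXAMPLE&mid=1#wechat_redirect")
filename = safe_filename("2017-01-13", "What's wrong: a test?")
```

Looping over the rows of the saved sheet and calling `save_article(absolute_url(address), safe_filename(date, title))` for each one would then download the articles. Whether the pages can be fetched without the browser cookies is not covered here.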

Pitfalls:

1. find_element_by_xpath must be used together with NoSuchElementException handling, otherwise a node that cannot be found raises an error. At first I used find_elements_by_xpath to avoid errors on missing nodes, but it turned out to run abnormally slowly; the cause still needs to be investigated.

2. The cookies are obtained manually, and they have to be fetched again if they have gone unused for too long. Could pyautogui be used to drive the WeChat client and obtain them automatically?

3. During development and trial runs, the article types were not handled properly at first, which made the execution take a long time. Catch exceptions properly first, then analyze the nodes that failed step by step.
