SoFunction
Updated on 2024-12-16

Python: automatically open the browser, download a zip file, extract its contents and write them to Excel

preamble

Please go easy on me; I wrote some of this code while I was still learning, so please point out any details that aren't handled well ~~~

First of all, here's a picture of the result. I left a few parts out, such as the browser startup, but I'm sure that once you go through the details you'll pick those up quickly.

downloading

Libraries used and general ideas

This part uses the time, selenium, urllib, re, requests and os libraries.

coding

#!/usr/bin/python3
# coding=utf-8
import time
from selenium import webdriver
from urllib.parse import quote, unquote
import re
import requests
import os

# The following two parameters help avoid anti-crawling measures; other articles include them, but I don't use them here
headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}
params = {
    'from': 'search',
    'seid': '9698329271136034665'
}


class Download_file():
    def __init__(self, url, order_number, file_path):
        self.url = url
        self.order_number = order_number
        self.file_path = file_path

    # Get the download links for the files
    def _get_files_url(self):
        # Open the page with Google Chrome
        driver = webdriver.Chrome()
        # Load the url
        driver.get(self.url)
        print(driver.title)
        time.sleep(5)
        # Locate the corresponding element by tag id
        driver.switch_to.frame(0)
        driver.find_element_by_id('search_id').send_keys(self.order_number)
        # Specific pages need specific handling; the button I need here has no id, it uses ng-click="queryCheckRecordByTid()"
        driver.find_element_by_class_name('btn').click()
        # driver.find_element_by_id('su').click()
        time.sleep(3)
        # The AngularJS markup is annoying, so I start by finding the parent of the target tag
        # and then reach the target tag through the parent
        dd = driver.find_elements_by_class_name('col-xs-2')
        # There are two <a></a> tags under this parent, and I only need the first one
        url_list = []
        for i in dd:
            # The download url happens to be the first <a>, and find_element returns only the first match, so this is the correct url
            a = i.find_element_by_xpath('.//a')
            # print(a.get_attribute('href'))
            url_list.append(a.get_attribute('href'))
        # download_btn[0].click()
        time.sleep(3)
        driver.quit()
        return url_list

    # Download the files
    def download_save(self):
        # The matches may include None, so do a little filtering
        url_list = self._get_files_url()
        url_list = list(filter(lambda x: x is not None, url_list))
        if len(url_list) == 0:
            return False
        # Create a folder to hold the zips
        # Changing the working directory makes it flexible to create files in a user-specified directory
        os.chdir(self.file_path)
        if os.path.exists(self.file_path + '/' + 'Download_Files') == False:
            os.mkdir('Download_Files')
        # Change the working directory
        os.chdir(self.file_path + '/' + 'Download_Files/')
        for url in url_list:
            # The link carries the author and filename, but URL-encoded: extract the target string with a regex first, then decode it back to Chinese
            ret = re.search(r'_.*\.zip$', url)
            file_info = unquote(ret.group())
            file_author = file_info.split('_')[1]
            file_title = file_info.split('_')[2]
            file_object = requests.get(url)
            file_name = file_author + '_' + file_title
            print('Downloading: %s' % file_name)
            with open(file_name, 'wb') as f:
                f.write(file_object.content)


    # def auto_fill(self):

if __name__ == '__main__':
    url = 'http://***'
    order_id = '***'
    file_path = 'D:/For discipline/Get_excel'
    test = Download_file(url, order_id, file_path)
    test.download_save()

explanation

I use the selenium library to open the target page; the _get_files_url method locates the input box and the hyperlink elements and returns the hyperlink addresses. The download_save method then fetches the files and saves them locally; the details of handling the storage directory and the filenames can be seen in the code.
Note that this is just one case, not a universal solution: every page's front end is written differently, so each page needs its own analysis. I'm not posting the site I used here because it involves my girlfriend's business, so it isn't appropriate to share.
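The filename-decoding step deserves a closer look: the download link ends with a URL-encoded "_author_title.zip" tail, which a regex can isolate and urllib can decode. A runnable mini-example, using a made-up link (the URL and names are assumptions for illustration):

```python
from urllib.parse import unquote
import re

# Hypothetical download link: the path ends with "_<author>_<title>.zip", percent-encoded
url = 'http://example.com/files/report_%E5%BC%A0%E4%B8%89_%E8%AE%BA%E6%96%87.zip'

# Extract the "_author_title.zip" tail, then decode the percent-encoding back to Chinese
ret = re.search(r'_.*\.zip$', url)
file_info = unquote(ret.group())        # '_张三_论文.zip'
file_author = file_info.split('_')[1]   # '张三'
file_title = file_info.split('_')[2]    # '论文.zip'
print(file_author, file_title)
```

Splitting on '_' yields an empty first element (the match starts with '_'), which is why the author and title sit at indices 1 and 2.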

Extract the contents and fill in

Libraries used

This part uses the time, xlwt, re, pickle, os, zipfile and BeautifulSoup libraries.

coding

#!/usr/bin/python3
# coding=utf-8
import os
import time
import xlwt
import zipfile
import re
import pickle
from bs4 import BeautifulSoup
from Download_files import Download_file


class get_excel():
    def __init__(self, file_path):
        self.file_path = file_path

    # Extract the target file
    def _unzip_files(self):
        '''
        This function decompresses the target files and returns a list of files to be processed.
        :return:
        '''
        files_list = os.listdir(self.file_path)
        # The filenames are in a list; in case other files are present, match the zips with a regex first
        files_list = list(filter(lambda x: re.search(r'\.zip$', x) != None, files_list))
        title_list = []
        for file in files_list:
            title = file.split('.')[0].split('_')[1]
            with zipfile.ZipFile(self.file_path + '/' + file, 'r') as z:
                # The code is a bit long, mostly for filtering out the target file
                target_file = list(filter(lambda x: re.search(r'Comparison Report\.html$', x) != None, z.namelist()))
                # The following approach is the more flexible one
                contentb = z.read(target_file[0])
                # The big headache here is that the return value is binary: even after decoding,
                # some things in the page can't be converted, which makes the regexes fail to match.
                # So I save it to disk and re-read it instead.
                if os.path.exists(self.file_path + '/' + title + '_' + 'Comparison Report.html') == False:
                    with open(self.file_path + '/' + title + '_' + 'Comparison Report.html', 'wb') as fb:
                        pickle.dump(contentb, fb)
                # with open(self.file_path + '/' + target_file[0], 'r', encoding='utf-8') as fa:
                #     contenta = fa.read()
                #     print(contenta)
                #     sentence = str(re.search(r'<b [^"]*red tahoma.*</b>$', contenta))
                #     value = re.search(r'\d.*%', sentence)
                #     info = [author, title, value]
                #     repetition_rate.append(info)
                title_list.append(target_file[0])
        return files_list, title_list

    # Read the contents of an html file
    def read_html(self):
        '''
        The previous function has extracted the target files, but html files are troublesome to read,
        so here I use the BeautifulSoup library to pull out the information I want
        and return it in a list.
        :return:
        '''
        files_list, title_list = self._unzip_files()
        repetition_rate = []
        for file in files_list:
            # Take out the author and the title; both are written into excel
            file = file.split('.')
            file = file[0].split('_')
            author = file[0]
            title = file[1]
            # The comparison report has already been unpacked, so just read it
            with open(self.file_path + '/' + title + '_' + 'Comparison Report.html', 'rb') as f:
                # Below is standard BeautifulSoup usage; see the official docs if it is unclear
                content = f.read()
                content = BeautifulSoup(content, "html.parser")
                # print(type(content))
                # After searching the internet, this finally finds the duplication rate I want
                value = content.find('b', {"class": "red tahoma"}).string
                repetition_rate.append([author, title, value])
        return repetition_rate

    def write_excel(self):
        '''
        Generate the xls table.
        :return:
        '''
        workbook = xlwt.Workbook(encoding='utf-8')
        booksheet = workbook.add_sheet('Sheet1')
        # Set the borders
        borders = xlwt.Borders()  # Create Borders
        borders.left = xlwt.Borders.THIN    # DASHED dashed line, NO_LINE none, THIN solid line
        borders.right = xlwt.Borders.THIN   # borders.right=1 also indicates a solid line
        borders.top = xlwt.Borders.THIN
        borders.bottom = xlwt.Borders.THIN
        borders.left_colour = 0x40
        borders.right_colour = 0x40
        borders.top_colour = 0x40
        borders.bottom_colour = 0x40
        style1 = xlwt.XFStyle()
        style1.borders = borders
        # Set the background colour; these operations feel a lot like js and css
        pattern = xlwt.Pattern()  # Create the Pattern
        pattern.pattern = xlwt.Pattern.SOLID_PATTERN
        pattern.pattern_fore_colour = 44
        style = xlwt.XFStyle()
        style.pattern = pattern
        repetition_rate = self.read_html()
        # Write the header row
        booksheet.write(0, 0, 'Author', style)
        booksheet.write(0, 1, 'Title', style)
        booksheet.write(0, 2, 'Repetition rate', style)
        for item in repetition_rate:
            booksheet.write(repetition_rate.index(item) + 1, 0, item[0], style1)
            booksheet.write(repetition_rate.index(item) + 1, 1, item[1], style1)
            booksheet.write(repetition_rate.index(item) + 1, 2, item[2], style1)
        s = 'Repetition rate.xls'
        workbook.save(self.file_path + '/' + s)


if __name__ == '__main__':
    # Check for the Download_Files folder
    file_path = 'D:/For discipline/Get_excel'
    url = 'http://***'
    order_number = '***'
    if os.path.exists('./Download_Files') == False:
        get_file = Download_file(url, order_number, file_path)
        get_file.download_save()
    os.chdir(file_path + '/Download_Files')
    test = get_excel('D:/For discipline/Get_excel/Download_Files')
    test.write_excel()
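The border and fill styling in write_excel comes from xlwt's Borders, Pattern and XFStyle objects. A minimal standalone sketch of that styling, with made-up cell data and a temporary output path chosen for the example:

```python
import os
import tempfile
import xlwt

# Thin borders on data cells, a solid background fill on the header row
workbook = xlwt.Workbook(encoding='utf-8')
sheet = workbook.add_sheet('Sheet1')

borders = xlwt.Borders()
borders.left = xlwt.Borders.THIN      # THIN solid, DASHED dashed, NO_LINE none
borders.right = xlwt.Borders.THIN
borders.top = xlwt.Borders.THIN
borders.bottom = xlwt.Borders.THIN
cell_style = xlwt.XFStyle()
cell_style.borders = borders

pattern = xlwt.Pattern()
pattern.pattern = xlwt.Pattern.SOLID_PATTERN
pattern.pattern_fore_colour = 44      # a palette index, as in the article
header_style = xlwt.XFStyle()
header_style.pattern = pattern

sheet.write(0, 0, 'Author', header_style)
sheet.write(0, 1, 'Title', header_style)
sheet.write(0, 2, 'Repetition rate', header_style)
sheet.write(1, 0, 'Zhang San', cell_style)      # example data, not from the article
sheet.write(1, 1, 'Some thesis', cell_style)
sheet.write(1, 2, '12.3%', cell_style)

out = os.path.join(tempfile.gettempdir(), 'Repetition rate.xls')
workbook.save(out)
```

Note that xlwt writes the legacy .xls format only; for .xlsx you would reach for openpyxl instead.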

explanation

Since the download is a zip file, it has to be unpacked first, and the library for that is zipfile. Note that the archive is only read while the script executes; it is not actually extracted into the directory. The unpacked archive contains a lot of redundant files, so I use a regex to match just the one I need, which reduces the work. Most of the code in this part is about getting to the target value, the (author, title, repetition rate) triple, which involves converting between byte streams and strings; after failing to decode reliably, I chose the somewhat redundant route of saving the html file and re-reading it. Writing the information to excel is not complicated. I mostly haven't explained the individual library methods, since a quick search or the official docs will cover those; the point here is to lay out my approach.
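The zip-reading flow described above can be sketched with the standard library alone. This builds an in-memory archive holding a fake "Comparison Report.html" (the filename and markup are assumptions mirroring the structure the article describes), then reads the target member without unpacking the archive to disk:

```python
import io
import re
import zipfile

# Build a small in-memory zip with a decoy file and the target report
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('extras/readme.txt', 'not interesting')
    z.writestr('Comparison Report.html',
               '<html><body><b class="red tahoma">12.3%</b></body></html>')

with zipfile.ZipFile(buf, 'r') as z:
    # namelist() returns every member path; filter for the report with a regex
    target = [n for n in z.namelist() if re.search(r'Comparison Report\.html$', n)]
    content = z.read(target[0]).decode('utf-8')   # read() returns bytes

# Pull the duplication rate out of the markup (a plain regex suffices for this tiny sample)
value = re.search(r'class="red tahoma">([^<]+)<', content).group(1)
print(value)   # 12.3%
```

For a real archive you would pass the file path to zipfile.ZipFile instead of a BytesIO buffer; the namelist/read pattern is the same.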

P.S. Using BeautifulSoup in Python to get the data inside a tag

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc is the page source
for k in soup.find_all('a'):
    print(k)            # the whole <a> tag
    print(k['class'])   # the class attribute of the a tag
    print(k['id'])      # the id value of the a tag
    print(k['href'])    # the href value of the a tag
    print(k.string)     # the string inside the a tag
    # k.get('class') achieves the same effect as k['class']
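To make the snippet above concrete, here is a runnable version against a made-up one-tag HTML string (the tag's attribute values are invented for the example):

```python
from bs4 import BeautifulSoup

html = '<a id="link1" class="nav" href="http://example.com">Home</a>'
soup = BeautifulSoup(html, 'html.parser')

for k in soup.find_all('a'):
    print(k['class'])      # class is multi-valued, so this is a list: ['nav']
    print(k['id'])         # 'link1'
    print(k['href'])       # 'http://example.com'
    print(k.string)        # 'Home'
    print(k.get('class'))  # .get() does the same, but returns None if the attribute is absent
```

The k.get() form is safer when some tags may lack the attribute, since k['class'] raises KeyError in that case.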

That's all for this article on using Python to automatically open the browser, download a zip file and extract its contents to Excel. For more on this topic, please search my earlier articles or continue browsing the related articles below, and I hope you'll continue to support me!

reference articles

  • Excel operations: Python3 read and write excel sheet data
  • selenium operations: Positioning and switching frames (iframes) in selenium, using the Python selenium library
  • Unpacking and reading zip files: Python reads and writes zip archives