Flowchart of the website crawl:
To build the project, we need to apply the following knowledge:
One: Get the page
1. Find the web page's URL pattern;
2. Use a for loop to build the links to the first 4 pages of the site;
3. Use the Network tab to find the Headers information;
4. Use the requests.get() function to request pages with the headers.
Two: Parse the web page
1. Use BeautifulSoup to parse the page;
2. Call the find_all() method on the BeautifulSoup object to locate the tags that each contain the information about a single movie;
3. Use find() to extract the serial number, movie title, rating, and testimonial;
4. Use Tag['attribute name'] to extract the movie details link.
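These steps can be tried out on a small HTML fragment before running the full crawl. A minimal sketch, using a made-up fragment shaped roughly like one Douban movie entry (the real markup is richer):

from bs4 import BeautifulSoup

# A made-up fragment shaped like one movie entry (illustrative only)
html = '''
<div class="item">
  <em>1</em>
  <span class="title">The Shawshank Redemption</span>
  <span class="rating_num">9.7</span>
  <span class="inq">Hope is a good thing.</span>
  <a href="https://example.com/movie/1"></a>
</div>
'''

bs = BeautifulSoup(html, 'html.parser')
# find_all() returns every tag matching the name and attributes
items = bs.find_all('div', class_='item')
for item in items:
    print(item.find('em').text)                         # serial number
    print(item.find('span', class_='title').text)       # movie title
    print(item.find('span', class_='rating_num').text)  # rating
    print(item.find('span', class_='inq').text)         # testimonial
    print(item.find('a')['href'])                       # Tag['attribute'] for the link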
Three: Store the data
1. Use with open() as ... to create a csv file to write to;
2. Use csv.DictWriter() to convert the file object to a DictWriter object;
3. Use the fieldnames parameter to set the header of the csv file;
4. Use writeheader() to write the table header;
5. Use writerows() to write the rows to the csv file.
Implementation Code:
import csv
import requests
from bs4 import BeautifulSoup

# Set up a list to store the information about each movie
data_list = []

# Set up the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}

# Iterate through the numbers 0 to 3 with a for loop
for page_number in range(4):
    # Set the link of the page to request
    url = 'https://movie.douban.com/top250?start={}&filter='.format(page_number * 25)
    # Request the page
    movies_list_res = requests.get(url, headers=headers)
    # Parse the content of the requested page
    bs = BeautifulSoup(movies_list_res.text, 'html.parser')
    # Search the page for all the tags that contain the information about a single movie
    movies_list = bs.find_all('div', class_='item')
    # Iterate through the results with a for loop
    for movie in movies_list:
        # Extract the movie's serial number
        movie_num = movie.find('em').text
        # Extract the movie's name
        movie_name = movie.find('span').text
        # Extract the movie's rating
        movie_score = movie.find('span', class_='rating_num').text
        # Extract the movie's testimonial (a few entries have none, so guard against that)
        inq_tag = movie.find('span', class_='inq')
        movie_instruction = inq_tag.text if inq_tag else ''
        # Extract the movie's details link
        movie_link = movie.find('a')['href']
        # Add the information to a dictionary whose keys match the csv fieldnames below
        movie_dict = {
            'Serial number': movie_num,
            'Movie name': movie_name,
            'Rating': movie_score,
            'Testimonials': movie_instruction,
            'Link': movie_link
        }
        # Print the movie's information
        print(movie_dict)
        # Store the information about each movie
        data_list.append(movie_dict)

# Create a csv file to store the movie information
# (the filename was missing in the original; 'douban_movies.csv' is a placeholder)
with open('douban_movies.csv', 'w', encoding='utf-8-sig', newline='') as f:
    # Convert the file object to a DictWriter object
    f_csv = csv.DictWriter(f, fieldnames=['Serial number', 'Movie name', 'Rating', 'Testimonials', 'Link'])
    # Write the table header
    f_csv.writeheader()
    # Write every movie dictionary to the csv file
    f_csv.writerows(data_list)
Code Analysis:
(1) Looking at the number of movies on one page of the site, you can see that a page holds information about only 25 movies. That means we need to crawl the first 4 pages (100 = 25 × 4) of the site to collect the movie information. Here we use a for loop to traverse the first four pages of data, as the sketch below shows.
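A minimal sketch of the URL pattern (the base URL is Douban's Top 250 page; the start parameter grows in steps of 25):

for page_number in range(4):
    # Pages 1-4 correspond to start offsets 0, 25, 50 and 75
    url = 'https://movie.douban.com/top250?start={}&filter='.format(page_number * 25)
    print(url)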
(2) Open the browser's developer tools with a keyboard shortcut (Windows users can press Ctrl + Shift + I or F12; Mac users can press Command + Option + I).
Next, use the pointer tool in the developer tools to take a general look at where the information to be crawled sits in the first two movies, and observe whether there is a pattern. You will find that the serial number, movie title, rating, testimonial, and details link of each movie all sit inside a tag whose class attribute has the value "item".
(3) Check the robots protocol of Douban Movie Top 250: it contains no Disallow: /top250 rule, which suggests that this page may be crawled.
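If you prefer to verify this in code rather than by reading robots.txt in the browser, the standard library's urllib.robotparser can do the check. A small sketch (the printed result depends on Douban's live robots.txt and the user agent you pass):

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots protocol
rp = RobotFileParser()
rp.set_url('https://movie.douban.com/robots.txt')
rp.read()
# True means a crawler with this user agent may fetch the page
print(rp.can_fetch('*', 'https://movie.douban.com/top250'))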
(4) On the web, requests carry the browser's information in the request header (Request Header). If we copy the browser's information and set the corresponding parameters in the request header when the crawler initiates its request, the crawler can successfully masquerade as a browser.
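In the requests library this just means passing a headers dictionary. A minimal sketch (the User-Agent string below is one example copied from a browser; any current browser's string works):

import requests

# The User-Agent string is copied from the Network tab of the developer tools
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
res = requests.get('https://movie.douban.com/top250', headers=headers)
# 200 means the server accepted the masqueraded request
print(res.status_code)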
(5) Code ideas
1) Skillful use of the pointer tool in the developer tools makes it easy to locate data.
2) After using the pointer tool to locate each piece of data, look for the patterns among them.
3) If the tag you want to extract has an attribute, you can use find(HTML element name, HTML attribute name='value') to extract it; if it has no attribute, you can find a tag with an attribute near it and then call find() from that tag to extract it (see the sketch below).
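A minimal sketch of the two cases, using a made-up fragment in which the em tag has no attribute but its parent div does:

from bs4 import BeautifulSoup

html = '<div class="pic"><em>1</em></div><span class="rating_num">9.7</span>'
bs = BeautifulSoup(html, 'html.parser')

# The tag has an attribute: pass the attribute to find() directly
print(bs.find('span', class_='rating_num').text)  # 9.7

# The tag has no attribute: locate a nearby tag that does have one,
# then call find() again from that tag
print(bs.find('div', class_='pic').find('em').text)  # 1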
After crawling the information through the steps above, we come to the final step of our crawler: storing the data.
(6) Storage of data
1) The syntax for calling the DictWriter class in the csv module is csv.DictWriter(f, fieldnames). In the syntax, the parameter f is the file object opened by the open() function, and the parameter fieldnames is used to set the file header;
2) Calling csv.DictWriter(f, fieldnames) returns a DictWriter object;
3) The resulting DictWriter object can call the writeheader() method, which writes the fieldnames to the first line of the csv file;
4) Finally, the writerows() method writes multiple dictionaries to the csv file. These four steps are sketched below.
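A minimal sketch of the four storage steps in isolation (the row and filename are illustrative, not the crawler's real output):

import csv

rows = [
    {'Serial number': '1', 'Movie name': 'Example', 'Rating': '9.7',
     'Testimonials': 'An example testimonial.', 'Link': 'https://example.com/1'}
]
# newline='' prevents blank lines between rows on Windows
with open('example.csv', 'w', encoding='utf-8-sig', newline='') as f:
    f_csv = csv.DictWriter(f, fieldnames=['Serial number', 'Movie name',
                                          'Rating', 'Testimonials', 'Link'])
    f_csv.writeheader()    # step 3: write the fieldnames as the first line
    f_csv.writerows(rows)  # step 4: write each dictionary as one row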
Running results:
Generated CSV file:
Summary
This concludes our article on using Python to crawl the first one hundred Douban movies. For more content on crawling Douban movies with Python, please search my earlier articles or continue browsing the related articles below. I hope you will continue to support me!