SoFunction
Updated on 2024-11-19

Reading HTML and JSON data with pandas: implementation examples

Pandas is a powerful data analysis library that provides many flexible and efficient ways to process and analyze data. This article will introduce how to use Pandas to read HTML data and JSON data, and show some common application scenarios.

I. Reading HTML pages

HTML (Hypertext Markup Language) is a standard markup language used to create web pages. Web pages usually consist of HTML tags and content that describe the structure and style of the page. On a web page, data can be presented in the form of tables, lists, or other forms. Pandas can read this HTML data and convert it into dataframes for further analysis and processing.

1. Read HTML data

Pandas provides the read_html() function, which reads data directly from HTML files or URLs. The following is the basic syntax for reading HTML data:

import pandas as pd

data = pd.read_html('')  # Read data from an HTML file
data = pd.read_html('/')  # Retrieve data from a URL

This function returns a list of all HTML tables. Each table is converted into a dataframe that can be manipulated like any other dataframe.
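As a minimal sketch of this list-of-tables behaviour, the snippet below parses a small HTML string held in memory instead of a file or URL. The page content and the names HTML_PAGE and load_tables are made up for illustration, and read_html() needs an HTML parser such as lxml (or beautifulsoup4 with html5lib) to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content, used only for illustration.
HTML_PAGE = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>AAA</td><td>$10.00</td></tr>
  <tr><td>BBB</td><td>$20.00</td></tr>
</table>
<table>
  <tr><th>Code</th><th>Volume</th></tr>
  <tr><td>X1</td><td>100</td></tr>
</table>
"""


def load_tables(html):
    """Return one DataFrame per <table> element found in the HTML text."""
    # Wrapping the text in StringIO makes pandas treat it as file content.
    return pd.read_html(StringIO(html))
```

Calling load_tables(HTML_PAGE) should give a list with one dataframe per table, so the first table is available as element 0 of the list.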

2. Processing HTML data

Once we have read the HTML data into Pandas, we can process and analyze it using a variety of methods. Here are some common operations.

  • View Data

The head() method lets you view the first few rows of the data; it shows the first 5 rows by default.

print(data[0].head())  # View the first 5 rows of the first table
  • Data Cleaning

HTML data usually contains some unwanted rows or columns, which can be removed using Pandas' data cleaning methods.

clean_data = data[0].dropna()  # Remove rows containing NaN values
clean_data = clean_data.drop(columns=['Unnamed: 0'])  # Delete the specified column
  • Data Conversion

Sometimes, some columns in HTML data may be incorrectly recognized as strings, which can be converted to the correct data type using Pandas' data conversion methods.

clean_data['Price'] = clean_data['Price'].str.replace('$', '', regex=False).astype(float)  # Strip the literal '$' and convert the price column to floats
  • Data Analysis

Once data cleansing and transformation is complete, data can be analyzed using various methods provided by Pandas, such as calculating statistical metrics like mean, median, and standard deviation.

mean_price = clean_data['Price'].mean()  # Calculate the average price
median_price = clean_data['Price'].median()  # Calculate the median price
std_price = clean_data['Price'].std()  # Calculate the standard deviation of the prices
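The cleaning, conversion, and analysis steps above can be sketched end to end on a small hand-built table. The dataframe named raw is hypothetical and only stands in for data[0] as returned by read_html(); the column names follow the ones used in this section.

```python
import pandas as pd

# Hypothetical table, standing in for data[0] returned by read_html().
raw = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2, 3],
        "Name": ["AAA", "BBB", "CCC", "DDD"],
        "Price": ["$10.00", "$20.00", None, "$30.00"],
    }
)

# Drop rows with missing values, then drop the leftover index column.
clean_data = raw.dropna().drop(columns=["Unnamed: 0"])

# regex=False treats "$" as a literal character rather than a regex anchor.
clean_data["Price"] = (
    clean_data["Price"].str.replace("$", "", regex=False).astype(float)
)

mean_price = clean_data["Price"].mean()
median_price = clean_data["Price"].median()
std_price = clean_data["Price"].std()  # sample standard deviation (ddof=1)

print(mean_price, median_price, std_price)
```

The row with the missing price is removed before the conversion, so the statistics are computed over the three remaining prices.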

3. Practical applications

The following practical example demonstrates how to use Pandas to read and process HTML data. Suppose you want to analyze securities data from a website that presents the data as an HTML table. You can use Pandas to read this data and analyze it further.

First, the Pandas library needs to be installed. It can be installed using the following command:

pip install pandas

The following code can then be used to read the HTML data:

import pandas as pd

data = pd.read_html('/')

Next you can view the first few rows of the data and perform data cleaning and transformation:

clean_data = data[0].dropna()
clean_data['Price'] = clean_data['Price'].str.replace('$', '', regex=False).astype(float)

Finally, the data is analyzed and the results are printed:

mean_price = clean_data['Price'].mean()
median_price = clean_data['Price'].median()
std_price = clean_data['Price'].std()

print('Average price:', mean_price)
print('Median price:', median_price)
print('Standard deviation of prices:', std_price)

These steps make it easy to read and analyze HTML data to obtain statistical indicators about security prices.

II. Reading JSON files

JSON is a commonly used data exchange format. Pandas provides the read_json() function, which reads data directly from a JSON file or URL. The following is the basic syntax for reading JSON data:

import pandas as pd

data = pd.read_json('')  # Read data from a JSON file
data = pd.read_json('/')  # Retrieve data from a URL
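As a small illustration that does not depend on a file or URL, the sketch below reads a JSON document held in a string. The records and the name JSON_TEXT are invented for the example; the text is wrapped in StringIO so that pandas treats it as file content.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON document: a list of records, one object per row.
JSON_TEXT = (
    '[{"name": "John", "age": 30, "city": "New York"},'
    ' {"name": "Jane", "age": 25, "city": "Boston"}]'
)

data = pd.read_json(StringIO(JSON_TEXT))

print(data.shape)
print(data.head())
```

Each object in the list becomes one row, and the object keys become the column names.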

1. Processing JSON data

Once the JSON data has been read into Pandas, various methods can be used to process and analyze it. Here are some common operations.

  • View Data

Use the head() method to view the first few rows of the data; it shows the first 5 rows by default.

print(data.head())  # View the first 5 rows of data
  • Data Cleaning

When working with JSON data, you may encounter some missing values or outliers. Pandas provides methods to handle these situations.

Clearing Missing Values: Use the dropna() method to remove rows or columns that contain missing values.

data.dropna()  # Delete rows containing missing values
data.dropna(axis=1)  # Remove columns containing missing values

Fill missing values: using the fillna() method you can replace the missing values with the specified values.

data.fillna(0)  # Replace missing values with 0
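The difference between these calls is easier to see on a tiny table. The sketch below uses a made-up dataframe with one missing age; note that dropna() and fillna() return new objects and leave the original data unchanged unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical records with one missing value in the "age" column.
data = pd.DataFrame(
    {
        "name": ["John", "Jane", "Alex"],
        "age": [30.0, np.nan, 41.0],
        "city": ["New York", "Boston", "Chicago"],
    }
)

rows_kept = data.dropna()        # drops the row that has a missing age
cols_kept = data.dropna(axis=1)  # drops the "age" column instead
filled = data.fillna(0)          # keeps everything, missing age becomes 0

print(rows_kept.shape, cols_kept.shape)
print(filled)
```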
  • Data Conversion

Pandas provides methods to convert data types, as well as reshape and pivot data.

Convert data type: Use the astype() method to convert a column of data to the specified data type. It returns a new column, so assign the result back to keep it.

data['column_name'] = data['column_name'].astype(int)  # Convert a column of data to an integer type

Reshape data: Using the pivot() method you can convert data from a long format to a wide format.

data.pivot(index='column1', columns='column2', values='value')  # Convert data from long to wide format
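To make the long-to-wide conversion concrete, here is a minimal sketch on invented data. The column names column1, column2, and value mirror the placeholders used above; the values themselves are hypothetical.

```python
import pandas as pd

# Long format: one row per (column1, column2) pair.
long_data = pd.DataFrame(
    {
        "column1": ["2024-01", "2024-01", "2024-02", "2024-02"],
        "column2": ["A", "B", "A", "B"],
        "value": [1, 2, 3, 4],
    }
)

# Wide format: one row per column1 value, one column per column2 value.
wide_data = long_data.pivot(index="column1", columns="column2", values="value")

print(wide_data)
```

pivot() raises a ValueError when the same index and columns pair occurs more than once; in that case pivot_table() with an aggregation function is the usual alternative.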
  • Data Analysis

Pandas provides rich methods for data analysis, including data aggregation, data sorting, and data statistics.

Data Aggregation: Using the groupby() method you can group data and perform aggregation operations.

data.groupby('column').sum()  # Group by column and calculate the sum of each group

Sorting data: Using the sort_values() method you can sort the data by the specified columns.

data.sort_values('column')  # Sort data by column

Data statistics: using the describe() method you can calculate statistical indicators of the data, such as the mean, median, standard deviation and so on.

data.describe()  # Calculate summary statistics of the data
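The three analysis operations can be tried together on a small made-up table. In this sketch the grouping key is called column, as in the placeholders above, and a hypothetical numeric column named value is added so there is something to aggregate.

```python
import pandas as pd

# Hypothetical data: a grouping key and a numeric column.
data = pd.DataFrame(
    {
        "column": ["b", "a", "b", "a"],
        "value": [4, 1, 2, 3],
    }
)

totals = data.groupby("column")["value"].sum()  # one total per group
ordered = data.sort_values("value")             # ascending by default
summary = data["value"].describe()              # count, mean, std, min, quartiles, max

print(totals)
print(ordered)
print(summary)
```

Selecting the value column before calling sum() restricts the aggregation to that column; without the selection, groupby('column').sum() aggregates every remaining column.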

2. Output data

After processing and analyzing the data, the results can be saved as files in other formats, such as CSV, Excel, and so on.

  • Save as CSV file: Use the to_csv() method to save the data as a CSV file.

data.to_csv('')  # Save data as CSV files
  • Save as Excel File: Use the to_excel() method to save the data as an Excel file.

data.to_excel('')  # Save the data as an Excel file
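A quick way to check what to_csv() writes is to send the output to an in-memory buffer and read it back. The sketch below assumes a small invented dataframe; with a real file, a path would be passed instead of the buffer. Only the CSV case is shown, since to_excel() additionally needs an Excel engine such as openpyxl to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical data to export.
data = pd.DataFrame({"name": ["John", "Jane"], "age": [30, 25]})

buffer = StringIO()
data.to_csv(buffer, index=False)  # index=False leaves out the row labels
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back should reproduce the same rows and columns.
restored = pd.read_csv(StringIO(csv_text))
```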

Supplement: Resolving the ValueError raised when Pandas reads a JSON file

Description of the problem

When we use the read_json function of Pandas to read JSON files, we sometimes encounter the following ValueError:

ValueError: Trailing data

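This error means that the parser found extra content after the first complete JSON value. The most common cause is a file that holds several JSON objects, one per line, read without lines=True; stray characters after the closing bracket have the same effect.

Other malformed endings also make the file invalid JSON and raise a ValueError, although the message may differ. For example, in the following JSON file, we will notice an extra comma after the last field: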

{
    "name": "John",
    "age": 30,
    "city": "New York",
}

If we read that file using the read_json function of Pandas, we get a ValueError.

Solution

1. Modify the JSON file

The easiest way is to modify the JSON file by removing the extra commas or brackets. For large JSON files, you can use a professional JSON editor to edit them. For small JSON files, we can manually remove the extra commas or brackets and save the modified file.

2. Setting the parameters of the read_json function

When the file contains one JSON object per line, we can also solve this problem by setting the parameters of the read_json function instead of modifying the file. Specifically, we need to use the following two parameters:

  • lines=True: read the file line by line, treating each line as a separate JSON object.
  • orient='records': treat each parsed object as one record, that is, one row of the DataFrame.

For example, here is an example of a problem solved using these two parameters:

import pandas as pd

df = pd.read_json('', lines=True, orient='records')

Here, we read a JSON file containing multiple JSON objects as a single DataFrame object. If you want to read each JSON object as a separate DataFrame object, you can use the following method:

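import pandas as pd
from io import StringIO

with open('') as f:
    for line in f:
        if line.strip():  # Skip blank lines
            # Each line holds one JSON object, so parse it as a one-line document
            df = pd.read_json(StringIO(line), lines=True)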

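This method reads the JSON file line by line and parses each line into a separate DataFrame object. It works when each line of the file is a complete JSON object, which is the usual cause of the Trailing data error. It cannot repair invalid JSON such as a trailing comma inside an object; that still has to be corrected in the file itself.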
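As a hedged illustration of the one-object-per-line case, the sketch below keeps two made-up records in a string instead of a file. Reading them without lines=True is expected to fail with a ValueError, while reading them with lines=True returns one row per line. The names JSON_LINES and error_message are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical file content: one JSON object per line.
JSON_LINES = '{"name": "John", "age": 30}\n{"name": "Jane", "age": 25}\n'

# Without lines=True the parser stops after the first object and
# reports whatever follows as an error.
try:
    pd.read_json(StringIO(JSON_LINES))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# With lines=True every line is parsed as its own record.
df = pd.read_json(StringIO(JSON_LINES), lines=True)

print(error_message)
print(df)
```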

In summary, this article describes how to use Pandas to read and process HTML and JSON data. Through the functions of Pandas, you can easily read data from JSON files or HTML, and convert it to DataFrame, and then use the various methods provided by Pandas for data cleaning, conversion and analysis.
