1. Introduction
In data analysis, handling data types correctly is the foundation of accurate results. Pandas provides a rich data type system and flexible type conversion methods. This article explains in detail how to view the data types of Pandas data structures and how to perform effective type conversions.
2. Viewing data types
2.1 Viewing the data type of a Series/DataFrame
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000.0, 60000.5, 70000.0],
    'Join_Date': pd.to_datetime(['2020-01-15', '2019-05-20', '2021-11-10']),
    'Is_Manager': [True, False, True]
}
df = pd.DataFrame(data)

# View the data types of the entire DataFrame
print(df.dtypes)
"""
Name                  object
Age                    int64
Salary               float64
Join_Date     datetime64[ns]
Is_Manager              bool
dtype: object
"""

# View the data type of a single column
print(df['Age'].dtype)  # Output: int64
Explanation:
- The dtypes attribute returns the data type of each column in a DataFrame
- For a Series, use the dtype attribute to get its data type
- Common Pandas data types include: object (string), int64 (integer), float64 (floating-point number), datetime64 (date/time), bool (boolean), etc.
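When you only need the columns of a certain type, select_dtypes() is a convenient companion to dtypes. Below is a minimal sketch against the df built above; the printed column lists are what that example data would produce.

# Keep only the numeric columns (int and float; bool is excluded)
print(df.select_dtypes(include='number').columns.tolist())    # ['Age', 'Salary']

# Keep only the datetime columns
print(df.select_dtypes(include='datetime').columns.tolist())  # ['Join_Date']

# Drop the object (string) columns
print(df.select_dtypes(exclude='object').dtypes)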
2.2 Check DataFrame memory usage
# Check the memory usage of each column
print(df.memory_usage())
"""
Index         128
Name           24
Age            24
Salary         24
Join_Date      24
Is_Manager      3
dtype: int64
"""

# Check detailed memory usage (deep=True calculates the real memory usage of object columns)
print(df.memory_usage(deep=True))
"""
Index         128
Name          174
Age            24
Salary         24
Join_Date      24
Is_Manager      3
dtype: int64
"""
Application scenario: Optimize memory usage when processing large data sets.
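As a rough illustration of that scenario, the sketch below builds a hypothetical one-million-row integer column and compares its footprint before and after downcasting; the exact byte counts depend on your platform and pandas version.

big = pd.Series(range(1_000_000))               # stored as int64 by default
print(big.memory_usage(deep=True))              # roughly 8 MB of values plus the index

small = pd.to_numeric(big, downcast='integer')  # smallest integer type that fits the data
print(small.dtype)                              # int32 here, since all values fit in 32 bits
print(small.memory_usage(deep=True))            # roughly half the original footprint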
3. Data type conversion methods
3.1 Use astype() for type conversion
# Convert Age from int64 to float64
df['Age'] = df['Age'].astype('float64')
print(df['Age'].dtype)  # Output: float64

# Convert Salary from float64 to int64 (the fractional part is truncated)
df['Salary'] = df['Salary'].astype('int64')
print(df['Salary'])
"""
0    50000
1    60000
2    70000
Name: Salary, dtype: int64
"""

# Convert Is_Manager from bool to str
df['Is_Manager'] = df['Is_Manager'].astype('str')
print(df['Is_Manager'].dtype)  # Output: object
Notes:
- Converting to a narrower type may lose data (for example, float to int truncates the decimal part)
- An exception is raised when a value cannot be converted (for example, casting a non-numeric string to a numeric type); both behaviours are illustrated in the sketch below
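A quick sketch of both behaviours, using throwaway Series (the variable names are illustrative):

# Data loss: float -> int truncates toward zero rather than rounding
s = pd.Series([1.9, 2.5, -3.7])
print(s.astype('int64').tolist())   # [1, 2, -3]

# Failure: a non-numeric string cannot be cast to a numeric type
try:
    pd.Series(['1', '2', 'three']).astype('int64')
except ValueError as err:
    print('Conversion failed:', err)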
3.2 Convert to categorical data (category)
# Convert the Name column to the category type
df['Name'] = df['Name'].astype('category')
print(df['Name'].dtype)  # Output: category

# View the categories
print(df['Name'].cat.categories)
"""
Index(['Alice', 'Bob', 'Charlie'], dtype='object')
"""

# Compare the memory footprint of category vs. object
print(f"Category memory: {df['Name'].memory_usage(deep=True)}")
df['Name'] = df['Name'].astype('object')
print(f"Object memory: {df['Name'].memory_usage(deep=True)}")
"""
Category memory: 174
Object memory: 180
"""
Application scenario: when a column contains many repeated values, the category type can save a significant amount of memory.
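The three-row DataFrame above is too small to show the effect, so here is a sketch with a hypothetical low-cardinality column of 300,000 city names; the reported sizes are approximate.

cities = pd.Series(['Beijing', 'Shanghai', 'Shenzhen'] * 100_000)
print(cities.memory_usage(deep=True))       # roughly 20 MB as object dtype

cities_cat = cities.astype('category')
print(cities_cat.memory_usage(deep=True))   # roughly 0.3 MB as category (int8 codes + 3 categories)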
3.3 Date and time conversion
# Convert strings to datetime
date_str = pd.Series(['2023-01-01', '2023-02-15', '2023-03-20'])
dates = pd.to_datetime(date_str)
print(dates.dtype)  # Output: datetime64[ns]

# Handle multiple date formats (format='mixed' is required from pandas 2.0 onward when layouts differ)
mixed_dates = pd.Series(['01-01-2023', '2023/02/15', '15-March-2023'])
dates = pd.to_datetime(mixed_dates, format='mixed')
print(dates)
"""
0   2023-01-01
1   2023-02-15
2   2023-03-15
dtype: datetime64[ns]
"""

# Extract date components
df['Year'] = df['Join_Date'].dt.year
df['Month'] = df['Join_Date'].dt.month
print(df[['Join_Date', 'Year', 'Month']])
"""
   Join_Date  Year  Month
0 2020-01-15  2020      1
1 2019-05-20  2019      5
2 2021-11-10  2021     11
"""
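Two related options worth knowing, sketched here with illustrative data: errors='coerce' for strings that cannot be parsed, and an explicit format when every string shares one layout.

# Unparseable strings: errors='coerce' turns them into NaT instead of raising an error
raw = pd.Series(['2023-01-01', 'not a date', '2023-03-20'])
print(pd.to_datetime(raw, errors='coerce'))   # the second value becomes NaT

# When all strings share one layout, an explicit format is unambiguous and faster
print(pd.to_datetime(pd.Series(['15/01/2020', '20/05/2019']), format='%d/%m/%Y'))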
3.4 Use pd.to_numeric() for numeric conversion
# Create a Series containing numeric strings and missing values
mixed_data = pd.Series(['1', '2.5', '3.0', 'four', None])

# Safely convert to numeric (values that cannot be converted become NaN)
numeric_data = pd.to_numeric(mixed_data, errors='coerce')
print(numeric_data)
"""
0    1.0
1    2.5
2    3.0
3    NaN
4    NaN
dtype: float64
"""

# Downcast to the smallest possible float type
numeric_data = pd.to_numeric(mixed_data, errors='coerce', downcast='float')
print(numeric_data)
"""
0    1.0
1    2.5
2    3.0
3    NaN
4    NaN
dtype: float32
"""
Advantages: it is safer than astype() and can handle mixed-type data.
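A common follow-up, sketched below, is to coerce, fill the gaps, and only then cast to the integer type you actually want (the fill value 0 and the variable name cleaned are just for illustration).

cleaned = (pd.to_numeric(mixed_data, errors='coerce')  # 'four' and None become NaN
             .fillna(0)                                # replace NaN with a sentinel value
             .astype('int64'))                         # now an integer dtype is possible
print(cleaned.tolist())  # [1, 2, 3, 0, 0] -- note that 2.5 is truncated to 2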
4. Special type conversion techniques
4.1 Automatically infer type using infer_objects()
# Create a DataFrame whose columns are all stored as the generic object dtype
df_mixed = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['4', '5', '6'],
    'C': [True, False, True]
}, dtype='object')
print(df_mixed.dtypes)  # all three columns report object

# Automatically infer more appropriate types
df_inferred = df_mixed.infer_objects()
print(df_inferred.dtypes)
"""
A     int64
B    object
C      bool
dtype: object
"""
4.2 Use convert_dtypes() to convert to the best type
# Convert to the best supported types
df_best = df.convert_dtypes()
print(df_best.dtypes)
"""
Name                  string
Age                    Int64
Salary                 Int64
Join_Date     datetime64[ns]
Is_Manager            string
Year                   Int64
Month                  Int64
dtype: object
"""
# Is_Manager reports string here because it was cast to str in section 3.1
Note: This method will try to convert to the best type that supports missing values (such as StringDtype, Int64, etc.).
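The practical benefit of those nullable dtypes is that they can hold missing values without silently switching the column to float, as this small sketch shows:

s = pd.Series([1, 2, None])
print(s.dtype)             # float64: the missing value forces the column to float

s_nullable = s.convert_dtypes()
print(s_nullable.dtype)    # Int64: the integers are kept and the gap becomes <NA>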
4.3 Custom conversion functions
# Use apply() with a custom conversion function
def convert_salary(salary):
    if salary > 60000:
        return 'High'
    elif salary > 50000:
        return 'Medium'
    else:
        return 'Low'

df['Salary_Level'] = df['Salary'].apply(convert_salary)
print(df[['Salary', 'Salary_Level']])
"""
   Salary Salary_Level
0   50000          Low
1   60000       Medium
2   70000         High
"""
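For this particular kind of banding, pd.cut() is a vectorised alternative to apply(); the sketch below uses bin edges chosen to mirror convert_salary, and the column name Salary_Level_cut is just for illustration.

df['Salary_Level_cut'] = pd.cut(
    df['Salary'],
    bins=[-float('inf'), 50000, 60000, float('inf')],
    labels=['Low', 'Medium', 'High']
)
print(df[['Salary', 'Salary_Level_cut']])  # same Low/Medium/High banding as above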
5. Best practices for type conversion
- Check before converting: use dtypes to see the current types
- Handle missing values: deal with missing values before converting, otherwise the result may be unexpected
- Choose the appropriate type: pick the most memory-efficient type that fits the data
- Prefer safe conversions: favour safe methods such as pd.to_numeric()
- Test the result: verify that the data meets expectations after conversion
- Optimize categorical data: use the category type for low-cardinality columns to save memory
- Large file processing: specify the dtype parameter when reading large files to reduce memory, as shown below
# Specify data types when reading a CSV
dtype_spec = {
    'user_id': 'int32',
    'product_id': 'category',
    'rating': 'float32'
}
# pd.read_csv('large_file.csv', dtype=dtype_spec)
6. Summary
1. Viewing data types:
- dtypes shows the column types of a DataFrame
- dtype shows the type of a Series
- memory_usage() analyzes memory consumption
2. Type conversion method:
- astype() basic type conversion
- pd.to_datetime() date and time conversion
- pd.to_numeric() safe numeric conversion
- Category type saves memory
- convert_dtypes() automatically selects the best type
3. Advanced tips:
- Custom conversion functions
- Specify the type when reading data
- Optimize performance with categorical data
Correctly understanding and handling Pandas data types is a key step in data preprocessing. Choosing appropriate types not only ensures correct calculations, but also significantly improves memory efficiency and computation speed. Mastering these viewing and conversion techniques will make your data analysis work more efficient and reliable.
In actual work, it is recommended to:
- Check the data type of each column right after importing data
- Convert to the appropriate types according to the analysis requirements
- Pay special attention to the memory impact of types when processing large datasets
- Establish a standardized process for checking data types