1. Introduction
In data analysis, handling data types correctly is the foundation of accurate results. Pandas provides a rich data type system and flexible type conversion methods. This article explains in detail how to view the data types of Pandas data structures and how to perform effective type conversions.
2. Viewing data types
2.1 Viewing the data type of a Series/DataFrame
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000.0, 60000.5, 70000.0],
    'Join_Date': pd.to_datetime(['2020-01-15', '2019-05-20', '2021-11-10']),
    'Is_Manager': [True, False, True]
}
df = pd.DataFrame(data)

# View the data types of the entire DataFrame
print(df.dtypes)
"""
Name                  object
Age                    int64
Salary               float64
Join_Date     datetime64[ns]
Is_Manager              bool
dtype: object
"""

# View the data type of a single column
print(df['Age'].dtype)  # Output: int64
Explanation:
- The dtypes attribute returns the data type of each column in a DataFrame
- For a Series, use the dtype attribute to get its data type
- Common Pandas data types include: object (string), int64 (integer), float64 (floating-point number), datetime64 (date/time), bool (boolean), etc.
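When you only need the columns of a certain type, select_dtypes() is a convenient companion to dtypes. Below is a minimal sketch against the df built above; the printed column lists are what that example data would produce.

# Keep only the numeric columns (int and float; bool is excluded)
print(df.select_dtypes(include='number').columns.tolist())    # ['Age', 'Salary']

# Keep only the datetime columns
print(df.select_dtypes(include='datetime').columns.tolist())  # ['Join_Date']

# Drop the object (string) columns
print(df.select_dtypes(exclude='object').dtypes)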
2.2 Check DataFrame memory usage
# Check the memory usage of each column
print(df.memory_usage())
"""
Index         128
Name           24
Age            24
Salary         24
Join_Date      24
Is_Manager      3
dtype: int64
"""

# Check detailed memory usage (deep=True calculates the real memory usage of object columns)
print(df.memory_usage(deep=True))
"""
Index         128
Name          174
Age            24
Salary         24
Join_Date      24
Is_Manager      3
dtype: int64
"""
Application scenario: Optimize memory usage when processing large data sets.
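As a rough illustration of that scenario, the sketch below builds a hypothetical one-million-row integer column and compares its footprint before and after downcasting; the exact byte counts depend on your platform and pandas version.

big = pd.Series(range(1_000_000))               # stored as int64 by default
print(big.memory_usage(deep=True))              # roughly 8 MB of values plus the index

small = pd.to_numeric(big, downcast='integer')  # smallest integer type that fits the data
print(small.dtype)                              # int32 here, since all values fit in 32 bits
print(small.memory_usage(deep=True))            # roughly half the original footprint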
3. Data type conversion methods
3.1 Use astype() for type conversion
# Convert Age from int64 to float64
df['Age'] = df['Age'].astype('float64')
print(df['Age'].dtype)  # Output: float64

# Convert Salary from float64 to int64 (the fractional part is truncated)
df['Salary'] = df['Salary'].astype('int64')
print(df['Salary'])
"""
0    50000
1    60000
2    70000
Name: Salary, dtype: int64
"""

# Convert Is_Manager from bool to str
df['Is_Manager'] = df['Is_Manager'].astype('str')
print(df['Is_Manager'].dtype)  # Output: object
Notes:
- Converting to a narrower type may lose data (for example, float to int truncates the decimal part)
- An exception is raised when a value cannot be converted (for example, casting a non-numeric string to a numeric type); both behaviours are illustrated in the sketch below
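A quick sketch of both behaviours, using throwaway Series (the variable names are illustrative):

# Data loss: float -> int truncates toward zero rather than rounding
s = pd.Series([1.9, 2.5, -3.7])
print(s.astype('int64').tolist())   # [1, 2, -3]

# Failure: a non-numeric string cannot be cast to a numeric type
try:
    pd.Series(['1', '2', 'three']).astype('int64')
except ValueError as err:
    print('Conversion failed:', err)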
3.2 Convert to categorical data (category)
# Convert the Name column to the category type
df['Name'] = df['Name'].astype('category')
print(df['Name'].dtype)  # Output: category

# View the categories
print(df['Name'].cat.categories)
"""
Index(['Alice', 'Bob', 'Charlie'], dtype='object')
"""

# Compare the memory footprint of category vs. object
print(f"Category memory: {df['Name'].memory_usage(deep=True)}")
df['Name'] = df['Name'].astype('object')
print(f"Object memory: {df['Name'].memory_usage(deep=True)}")
"""
Category memory: 174
Object memory: 180
"""
Application scenario: when a column contains many repeated values, the category type can save a significant amount of memory.
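The three-row DataFrame above is too small to show the effect, so here is a sketch with a hypothetical low-cardinality column of 300,000 city names; the reported sizes are approximate.

cities = pd.Series(['Beijing', 'Shanghai', 'Shenzhen'] * 100_000)
print(cities.memory_usage(deep=True))       # roughly 20 MB as object dtype

cities_cat = cities.astype('category')
print(cities_cat.memory_usage(deep=True))   # roughly 0.3 MB as category (int8 codes + 3 categories)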
3.3 Date and time conversion
# Convert strings to datetime
date_str = pd.Series(['2023-01-01', '2023-02-15', '2023-03-20'])
dates = pd.to_datetime(date_str)
print(dates.dtype)  # Output: datetime64[ns]

# Handle multiple date formats (format='mixed' is required from pandas 2.0 onward when layouts differ)
mixed_dates = pd.Series(['01-01-2023', '2023/02/15', '15-March-2023'])
dates = pd.to_datetime(mixed_dates, format='mixed')
print(dates)
"""
0   2023-01-01
1   2023-02-15
2   2023-03-15
dtype: datetime64[ns]
"""

# Extract date components
df['Year'] = df['Join_Date'].dt.year
df['Month'] = df['Join_Date'].dt.month
print(df[['Join_Date', 'Year', 'Month']])
"""
   Join_Date  Year  Month
0 2020-01-15  2020      1
1 2019-05-20  2019      5
2 2021-11-10  2021     11
"""
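Two related options worth knowing, sketched here with illustrative data: errors='coerce' for strings that cannot be parsed, and an explicit format when every string shares one layout.

# Unparseable strings: errors='coerce' turns them into NaT instead of raising an error
raw = pd.Series(['2023-01-01', 'not a date', '2023-03-20'])
print(pd.to_datetime(raw, errors='coerce'))   # the second value becomes NaT

# When all strings share one layout, an explicit format is unambiguous and faster
print(pd.to_datetime(pd.Series(['15/01/2020', '20/05/2019']), format='%d/%m/%Y'))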
3.4 Use pd.to_numeric() for numeric conversion
# Create a Series containing numeric strings and missing values
mixed_data = pd.Series(['1', '2.5', '3.0', 'four', None])

# Safely convert to numeric (values that cannot be converted become NaN)
numeric_data = pd.to_numeric(mixed_data, errors='coerce')
print(numeric_data)
"""
0    1.0
1    2.5
2    3.0
3    NaN
4    NaN
dtype: float64
"""

# Downcast to the smallest possible float type
numeric_data = pd.to_numeric(mixed_data, errors='coerce', downcast='float')
print(numeric_data)
"""
0    1.0
1    2.5
2    3.0
3    NaN
4    NaN
dtype: float32
"""
Advantages: it is safer than astype() and can handle mixed-type data.
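A common follow-up, sketched below, is to coerce, fill the gaps, and only then cast to the integer type you actually want (the fill value 0 and the variable name cleaned are just for illustration).

cleaned = (pd.to_numeric(mixed_data, errors='coerce')  # 'four' and None become NaN
             .fillna(0)                                # replace NaN with a sentinel value
             .astype('int64'))                         # now an integer dtype is possible
print(cleaned.tolist())  # [1, 2, 3, 0, 0] -- note that 2.5 is truncated to 2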
4. Special type conversion techniques
4.1 Automatically infer type using infer_objects()
# Create a DataFrame whose columns are all stored as the generic object dtype
df_mixed = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['4', '5', '6'],
    'C': [True, False, True]
}, dtype='object')
print(df_mixed.dtypes)  # all three columns report object

# Automatically infer more appropriate types
df_inferred = df_mixed.infer_objects()
print(df_inferred.dtypes)
"""
A     int64
B    object
C      bool
dtype: object
"""
4.2 Use convert_dtypes() to convert to the best type
# Convert to the best supported types
df_best = df.convert_dtypes()
print(df_best.dtypes)
"""
Name                  string
Age                    Int64
Salary                 Int64
Join_Date     datetime64[ns]
Is_Manager            string
Year                   Int64
Month                  Int64
dtype: object
"""
# Is_Manager reports string here because it was cast to str in section 3.1
Note: This method will try to convert to the best type that supports missing values (such as StringDtype, Int64, etc.).
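The practical benefit of those nullable dtypes is that they can hold missing values without silently switching the column to float, as this small sketch shows:

s = pd.Series([1, 2, None])
print(s.dtype)             # float64: the missing value forces the column to float

s_nullable = s.convert_dtypes()
print(s_nullable.dtype)    # Int64: the integers are kept and the gap becomes <NA>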
4.3 Custom conversion functions
# Use apply() with a custom conversion function
def convert_salary(salary):
    if salary > 60000:
        return 'High'
    elif salary > 50000:
        return 'Medium'
    else:
        return 'Low'

df['Salary_Level'] = df['Salary'].apply(convert_salary)
print(df[['Salary', 'Salary_Level']])
"""
   Salary Salary_Level
0   50000          Low
1   60000       Medium
2   70000         High
"""
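For this particular kind of banding, pd.cut() is a vectorised alternative to apply(); the sketch below uses bin edges chosen to mirror convert_salary, and the column name Salary_Level_cut is just for illustration.

df['Salary_Level_cut'] = pd.cut(
    df['Salary'],
    bins=[-float('inf'), 50000, 60000, float('inf')],
    labels=['Low', 'Medium', 'High']
)
print(df[['Salary', 'Salary_Level_cut']])  # same Low/Medium/High banding as above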
5. Best practices for type conversion
- Check before converting: use dtypes to see the current types
- Handle missing values: deal with missing values before converting, otherwise the result may be unexpected
- Choose the appropriate type: pick the most memory-efficient type that fits the data
- Prefer safe conversions: favour safe methods such as pd.to_numeric()
- Test the result: verify that the data meets expectations after conversion
- Optimize categorical data: use the category type for low-cardinality columns to save memory
- Large file processing: specify the dtype parameter when reading large files to reduce memory, as shown below
# Specify data types when reading a CSV
dtype_spec = {
    'user_id': 'int32',
    'product_id': 'category',
    'rating': 'float32'
}
# pd.read_csv('large_file.csv', dtype=dtype_spec)
6. Summary
1. Viewing data types:
- dtypes shows the column types of a DataFrame
- dtype shows the type of a Series
- memory_usage() analyzes memory consumption
2. Type conversion method:
- astype() basic type conversion
- pd.to_datetime() date and time conversion
- pd.to_numeric() safe numeric conversion
- Category type saves memory
- convert_dtypes() automatically selects the best type
3. Advanced tips:
- Custom conversion functions
- Specify the type when reading data
- Optimize performance with categorical data
Correctly understanding and handling Pandas data types is a key step in data preprocessing. Choosing appropriate types not only ensures correct calculations, but also significantly improves memory efficiency and computation speed. Mastering these viewing and conversion techniques will make your data analysis work more efficient and reliable.
In actual work, it is recommended to:
- Check the data type of each column right after importing data
- Convert to the appropriate types according to the analysis requirements
- Pay special attention to the memory impact of types when processing large datasets
- Establish a standardized process for checking data types