There are three common ways to handle missing values:
- Option 1: Remove the samples (rows) that contain missing values
- Option 2: Remove the columns (feature vectors) that contain missing values
- Option 3: Fill the missing values with some value (0, mean, median, etc.)
dropna and fillna are available on both DataFrame and Series; here we mainly discuss the DataFrame versions.
For option 1:
Use DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameter description:
- axis:
  - axis=0: remove rows containing missing values
  - axis=1: remove columns containing missing values
- how: works together with axis
  - how='any': drop the row or column if it contains any missing value
  - how='all': drop the row or column only if all of its values are missing
- thresh: keep a row or column only if it has at least thresh non-missing values along the axis; otherwise it is dropped.
  - For example, with axis=0, thresh=10, a row with fewer than 10 non-missing values is dropped.
- subset: list
  - The columns in which to look for missing values
- inplace: whether to operate on the original data. If True, the original is modified and None is returned; otherwise a new copy with the missing values removed is returned.
It is recommended to write out the default parameters explicitly when calling the function, so the code is quick to understand.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

# Drop the rows where at least one element is missing.
>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25

# Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns')
       name
0    Alfred
1    Batman
2  Catwoman

# Drop the rows where all elements are missing.
>>> df.dropna(how='all')
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

# Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2)
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

# Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'born'])
     name        toy       born
1  Batman  Batmobile 1940-04-25

# Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True)
>>> df
     name        toy       born
1  Batman  Batmobile 1940-04-25
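The examples above apply thresh and how='all' to rows only. The same parameters also work column-wise with axis=1. The following is a minimal sketch on a made-up DataFrame (the column names a, b, c and the values are assumptions for illustration, not data from this article):

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" is complete, "b" has one value, "c" has none.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 5.0, np.nan],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

# Keep only the columns with at least 2 non-missing values.
kept_thresh = df.dropna(axis=1, thresh=2)

# Drop only the columns in which every value is missing.
kept_all = df.dropna(axis=1, how="all")

print(list(kept_thresh.columns))  # ['a']
print(list(kept_all.columns))     # ['a', 'b']
```

Each call passes either thresh or how, not both, since the two express competing rules for the same decision.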
For option 2:
You can use dropna (with axis=1) or the drop function: DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
- labels: a single label or a list of row or column labels to delete
- axis: 0 for rows, 1 for columns
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

# Delete columns
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

# Delete rows (by index label)
>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11
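The drop examples above name the columns by hand. To tie drop back to option 2, the columns that contain missing values can be computed first and then passed to drop, which should give the same result as dropna with axis=1. This is a rough sketch on made-up data; the variable names cols_with_na, via_drop and via_dropna are only illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical data: only column "B" contains a missing value.
df = pd.DataFrame({
    "A": [0, 4, 8],
    "B": [1.0, np.nan, 9.0],
    "C": [2, 6, 10],
})

# isna().any() yields one boolean per column: True if any value is missing.
cols_with_na = df.columns[df.isna().any()].tolist()

# Two routes to the same result.
via_drop = df.drop(columns=cols_with_na)
via_dropna = df.dropna(axis=1, how="any")

print(cols_with_na)                 # ['B']
print(via_drop.equals(via_dropna))  # True
```

Computing the list explicitly is useful when you want to inspect or log which columns are about to be removed before dropping them.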
For option 3:
Use DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
- value: scalar, dict, Series, or DataFrame
  - A dict lets you specify which value to fill each column with.
- method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None
  - By default the fill runs down each column.
  - ffill / pad: fill a missing value with the previous valid value
  - backfill / bfill: fill a missing value with the next valid value
  - Recent pandas versions deprecate this parameter in favor of DataFrame.ffill() and DataFrame.bfill().
- limit: the maximum number of missing values to fill. It is rarely needed.
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list('ABCD'))
>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

# Use 0 to replace all missing values
>>> df.fillna(0)
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0  4

# Fill missing values using the previous or the next valid value
>>> df.fillna(method='ffill')
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4
>>> df.fillna(method='bfill')
     A    B   C  D
0  3.0  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  3.0 NaN  5
3  NaN  3.0 NaN  4

# Use a different fill value for each column: replace all NaN elements
# in columns 'A', 'B', 'C', and 'D' with 0, 1, 2, and 3 respectively.
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4

# Replace only the first missing value in each column
>>> df.fillna(value=values, limit=1)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  NaN  1
2  NaN  1.0  NaN  5
3  NaN  3.0  NaN  4
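Option 3 also mentions filling with the mean or median, which the examples above do not show. One way to do this, sketched here on made-up data, is to pass the Series returned by df.mean() as the fill value, since fillna matches a Series to the columns by label. The sketch also uses DataFrame.ffill(), the method form of forward filling:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 4.0, 8.0],
})

# df.mean() returns one mean per column (missing values are skipped);
# fillna then fills each column with its own mean.
filled_mean = df.fillna(df.mean())

# Forward fill: each missing value takes the previous valid value in its
# column. The leading NaN in "B" has no previous value and stays missing.
filled_ffill = df.ffill()

print(filled_mean["A"].tolist())  # [1.0, 2.0, 3.0]
print(filled_mean["B"].tolist())  # [6.0, 4.0, 8.0]
```

Replacing df.mean() with df.median() gives a median fill in the same way.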
House price analysis:
In this problem, only the total_bedrooms column has missing values. Applying the three options above, the processing code is:
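# housing is the DataFrame that holds the house price data.
# Each call below returns a new object; assign the result to keep it.

# Option 1: remove the rows with a missing value in "total_bedrooms".
housing.dropna(subset=["total_bedrooms"])

# Option 2: remove the column "total_bedrooms" from the data.
housing.drop("total_bedrooms", axis=1)

# Option 3: fill the missing values with the median of "total_bedrooms".
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)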
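To see what each option returns, here is a small self-contained sketch. The four-row housing DataFrame below is a made-up stand-in, not the real data set, and the names option1, option2 and option3 are only illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data: one missing bedroom count.
housing = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the rows where "total_bedrooms" is missing.
option1 = housing.dropna(subset=["total_bedrooms"])

# Option 2: drop the "total_bedrooms" column entirely.
option2 = housing.drop("total_bedrooms", axis=1)

# Option 3: fill the missing values with the column median.
median = housing["total_bedrooms"].median()
option3 = housing.assign(
    total_bedrooms=housing["total_bedrooms"].fillna(median)
)

print(option1.shape)                       # (3, 2)
print(list(option2.columns))               # ['total_rooms']
print(option3["total_bedrooms"].tolist())  # [129.0, 190.0, 190.0, 235.0]
```

None of the three calls modifies housing itself, so the options can be compared side by side before committing to one.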
Scikit-learn provides the Imputer class for handling missing values (in recent versions it has been replaced by SimpleImputer in sklearn.impute). A tutorial on how to use it is here: https:///article/