If you've gone through the process of data cleaning, you'll understand what I mean. That is the purpose of this article: to make data cleaning easier for the reader.
In fact, I realized not long ago that many datasets share similar patterns when it comes to data cleaning. That's when I started organizing and compiling data cleaning code that I think also applies to other common scenarios.
Since these common scenarios involve different types of datasets, this article focuses on showing and explaining what the code can be used to accomplish, so that the reader can adapt it more easily.
Data Cleaning Toolkit
In the code snippets below, the data cleaning code is wrapped in a number of functions, and the purpose of each one is fairly intuitive. You can use the functions as they are, or take the code out of the functions and use it directly after adjusting a few parameters.
1. Deletion of multiple columns of data
def drop_multiple_col(col_names_list, df):
    '''
    AIM    -> Drop multiple columns based on their column names
    INPUT  -> List of column names, df
    OUTPUT -> updated df with dropped columns
    ------
    '''
    # Drop the listed columns in place, then return the same DataFrame
    df.drop(col_names_list, axis=1, inplace=True)
    return df
Sometimes, not all columns are useful for our analysis. In that case, df.drop is a convenient way to delete the columns you select.
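As a rough usage sketch, dropping columns could look like the following. The sample data and column names ("tmp_1", "tmp_2" and so on) are made up for illustration, and this variant returns a new DataFrame instead of modifying the original in place.

```python
import pandas as pd

# Hypothetical sample data; the column names are invented for this example.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "tmp_1": [0, 0, 0],
    "tmp_2": [9, 9, 9],
})

# Drop two unwanted columns; a new DataFrame is returned and df is left untouched.
trimmed = df.drop(columns=["tmp_1", "tmp_2"])

# errors="ignore" skips names that do not exist instead of raising a KeyError.
trimmed_safe = df.drop(columns=["tmp_1", "missing_col"], errors="ignore")

print(list(trimmed.columns))
print(list(trimmed_safe.columns))
```

Returning a new DataFrame keeps the raw data available, which can be handy if you later find that a dropped column was needed after all.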
2. Converting Dtypes
def change_dtypes(col_int, col_float, df):
    '''
    AIM    -> Changing dtypes to save memory
    INPUT  -> List of column names (int, float), df
    OUTPUT -> updated df with smaller memory
    ------
    '''
    # Downcast the listed columns to 32-bit types
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')
    return df
When we are dealing with larger datasets, converting columns to smaller 'dtypes' is a simple way to save memory. If you are interested in learning how to use 'Pandas' with large data, I highly recommend reading the article 'Why and How to Use Pandas with Large Data' (/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c).
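To get a feel for the saving, one option is to compare the memory footprint before and after the conversion. This is only a sketch on made-up data (the "count" and "price" columns are assumptions); the actual saving depends on your dataset, and 32-bit types only make sense when the values fit their range and precision.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one 64-bit integer column and one 64-bit float column.
df = pd.DataFrame({
    "count": np.arange(1000, dtype="int64"),
    "price": np.linspace(0.0, 1.0, 1000),  # float64 by default
})

# Memory used by the column data in bytes (index excluded).
before = df.memory_usage(index=False).sum()

# Downcast both columns to their 32-bit counterparts.
df = df.astype({"count": "int32", "price": "float32"})

after = df.memory_usage(index=False).sum()

print(before, after)
```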
3. Conversion of categorical variables to numerical variables
def convert_cat2num(df):
    # Convert categorical variable to numerical variable
    num_encode = {'col_1': {'YES': 1, 'NO': 0},
                  'col_2': {'WON': 1, 'LOSE': 0, 'DRAW': 0}}
    # Nested dict: {column name: {old value: new value}}
    df.replace(num_encode, inplace=True)
Some machine learning models require their input variables to be numerical. This is when we need to convert the categorical variables into numerical variables and then use them as inputs to the model. For data visualization tasks, I would recommend keeping the categorical variables, so that the results are easier to interpret and understand.
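A possible alternative to df.replace is to map each column with Series.map, sketched below on made-up data. Note that this is a swapped-in technique rather than the function above: with map, any value missing from the mapping becomes NaN, which makes unexpected categories easy to spot.

```python
import pandas as pd

# Hypothetical sample data using the same placeholder column names as above.
df = pd.DataFrame({
    "col_1": ["YES", "NO", "YES"],
    "col_2": ["WON", "LOSE", "DRAW"],
})

num_encode = {
    "col_1": {"YES": 1, "NO": 0},
    "col_2": {"WON": 1, "LOSE": 0, "DRAW": 0},
}

# Work on a copy so the categorical version stays available for plotting.
encoded = df.copy()
for col, mapping in num_encode.items():
    # Values that are not keys of the mapping would turn into NaN here.
    encoded[col] = encoded[col].map(mapping)

print(encoded)
```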
4. Checking for missing data
def check_missing_data(df):
    # check for any missing data in the df (display in descending order)
    return df.isnull().sum().sort_values(ascending=False)
If you want to check how much missing data is in each column, this is probably the fastest way to do it. This method will give you a better idea of which columns have more missing data and help you decide what action you should take next in your data cleansing and data analysis efforts.
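As an illustration, the same idea can report both the count and the share of missing values per column. The sample data below is invented; using mean() on the boolean mask to get a fraction is a small extension of the function above, not part of it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in two of the three columns.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": ["x", None, "y", "z"],
    "c": [1, 2, 3, 4],
})

# Number of missing values per column, largest first.
missing_count = df.isnull().sum().sort_values(ascending=False)

# Fraction of missing values per column (mean of the True/False mask).
missing_share = df.isnull().mean().sort_values(ascending=False)

print(missing_count)
print(missing_share)
```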
5. Removing strings from a column
def remove_col_str(df):
    # remove a portion of string in a dataframe column - col_1
    df['col_1'] = df['col_1'].replace('\n', '', regex=True)
    # remove all the characters after &# (including &#) for column - col_1
    df['col_1'] = df['col_1'].replace(' &#.*', '', regex=True)
Sometimes you may see newline characters or some strange symbols in a string column. You can easily deal with the problem by using df['col_1'].replace, where 'col_1' is a column in the data frame df.
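For string columns, a similar cleanup can be sketched with the .str.replace accessor, shown here on made-up values. The first call removes a literal newline; the second uses the same regular expression as above to cut everything from " &#" onward.

```python
import pandas as pd

# Hypothetical sample data containing a newline and an HTML entity remnant.
df = pd.DataFrame({"col_1": ["hello\nworld", "price &#8364; 10", "clean"]})

# Remove literal newline characters (no regular expression needed).
cleaned = df["col_1"].str.replace("\n", "", regex=False)

# Remove " &#" and everything after it.
cleaned = cleaned.str.replace(r" &#.*", "", regex=True)

print(cleaned.tolist())
```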
6. Removing leading spaces in a column
def remove_col_white_space(df, col):
    # col is the name of the string column to clean
    # remove white space at the beginning of string
    df[col] = df[col].str.lstrip()
When data is very messy, many unexpected things can happen. It is very common to have some spaces at the beginning of a string. This method is useful when you want to remove those leading spaces from a column.
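The difference between stripping only leading whitespace and stripping both ends can be seen in a small sketch like this one. The data is made up; str.strip is mentioned as a related option, not as part of the function above.

```python
import pandas as pd

# Hypothetical sample data with leading and trailing spaces and one missing value.
df = pd.DataFrame({"col_1": ["  apple", "banana  ", "  cherry  ", None]})

# Remove whitespace at the beginning of each string only.
left = df["col_1"].str.lstrip()

# Remove whitespace at both ends.
both = df["col_1"].str.strip()

# Missing values are left as missing by both calls.
print(left.tolist())
print(both.tolist())
```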
7. Concatenating two columns of string data (under certain conditions)
def concat_col_str_condition(df):
    # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
    mask = df['col_1'].str.endswith('pil', na=False)
    col_new = df[mask]['col_1'] + df[mask]['col_2']
    # replace the 'pil' with a space
    col_new = col_new.replace('pil', ' ', regex=True)
    return col_new
This method is useful when you wish to combine two columns of string data together under certain conditions. For example, you want to splice the first and second columns of data together when the first column ends with certain specific letters. Depending on your needs, you can also remove the ending letters after the splicing is complete.
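One way to write the combined result back into the data frame is sketched below. This is a variation under stated assumptions: the sample values and the "col_new" column are invented, and the trailing 'pil' is removed before concatenating rather than replaced with a space afterwards.

```python
import pandas as pd

# Hypothetical sample data; only the first row ends with 'pil'.
df = pd.DataFrame({
    "col_1": ["abcpil", "xyz", "pilot", None],
    "col_2": ["123", "456", "789", "000"],
})

# True where col_1 ends with 'pil'; missing values count as False.
mask = df["col_1"].str.endswith("pil", na=False)

# Start from col_1 and overwrite only the rows that match the condition.
df["col_new"] = df["col_1"]
df.loc[mask, "col_new"] = (
    df.loc[mask, "col_1"].str.replace(r"pil$", "", regex=True)
    + df.loc[mask, "col_2"]
)

print(df)
```

Anchoring the pattern with `$` removes 'pil' only at the end of the string, so a value such as 'pilot' would be unaffected even if it matched the mask.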
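8. Converting timestamps (from string to datetime format)
import pandas as pd

def convert_str_datetime(df, str_col):
    '''
    AIM    -> Convert datetime(String) to datetime(format we want)
    INPUT  -> df, name of the string column holding the timestamps
    OUTPUT -> updated df with new datetime format
    ------
    '''
    # Parse the strings and insert the result as a new third column 'timestamp'
    df.insert(loc=2, column='timestamp',
              value=pd.to_datetime(df[str_col], format='%Y-%m-%d %H:%M:%S.%f'))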
When working with time series data, you may come across timestamp columns stored as strings. In that case, we have to convert the strings to a datetime format, with the format specified according to our needs, before we can use the data for meaningful analysis and presentation.
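A minimal sketch of the conversion, assuming a string column named "raw_ts" (a hypothetical name) in the same format as above. Passing errors="coerce" is an optional choice that turns unparseable strings into NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data; "raw_ts" is an invented column name.
df = pd.DataFrame({
    "raw_ts": [
        "2019-01-31 13:45:10.123456",
        "2019-02-01 00:00:00.000000",
        "not a date",
    ]
})

# Parse the strings; values that do not match the format become NaT.
df["timestamp"] = pd.to_datetime(
    df["raw_ts"], format="%Y-%m-%d %H:%M:%S.%f", errors="coerce"
)

# Once converted, datetime attributes are available on each value.
first = df["timestamp"].iloc[0]
print(first.year, first.month, first.day)
```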