Pandas: Dealing with missing values in datasets
In the real world nothing is perfect. There will be times when the dataset you have to work with is incomplete and values are missing, and there are multiple ways to deal with missing values.
Check with the data source for a better quality dataset
If you have plenty of time, you can ask the data source for a better quality dataset. When this is possible, it is usually the best thing to do.
Drop missing values
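If the missing values belong to columns that you don't need, you can just drop the whole column. In pandas, the dropna() method of a dataframe with axis set to 1 drops every column that contains at least one missing value. To drop one specific column by name, whatever it contains, use the drop() method instead.
Syntax: How to drop columns from a dataframe
df.dropna(axis = 1, inplace = True)
df.drop(columns = ["column_name"], inplace = True)
- axis : if set to 1, dropna() drops whole columns; if set to 0 (the default), it drops rows.
- inplace : if True, the operation takes place in the current dataframe instead of returning a new one.
- columns : the names of the columns that drop() should remove.
- subset : in dropna(), the labels along the other axis to check for missing values. With axis = 1 these are row labels, not column names, so passing a column name here raises a KeyError.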
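As a minimal sketch of dropping columns, the example below uses a small made-up dataframe (the column names and values are assumptions for illustration only). It leaves out subset, because with axis = 1 that parameter refers to row labels, and uses drop() to remove a column by name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are made up for illustration.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [31.0, np.nan, 45.0],
    "city": ["Oslo", "Rome", None],
})

# Drop every column that contains at least one missing value.
no_missing_cols = df.dropna(axis=1)
print(list(no_missing_cols.columns))

# Drop one specific column by name, whatever it contains.
without_city = df.drop(columns=["city"])
print(list(without_city.columns))
```

Both calls return a new dataframe here, so the original df is left untouched.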
Another option is to delete only the rows in which a specific column has a missing value. To do this, call dropna() with the column name in the subset parameter and set the axis parameter to 0, which means drop rows.
df.dropna(subset = ["column_name"], inplace = True, axis = 0)
Omitting the subset parameter makes dropna() check all columns, so a row with a missing value in any column is dropped.
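The difference between passing subset and omitting it can be sketched on a small made-up dataframe (the column names "age" and "score" are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values in two different columns.
df = pd.DataFrame({
    "age": [31.0, np.nan, 45.0, 28.0],
    "score": [7.5, 8.0, np.nan, 6.0],
})

# Drop only the rows where "age" is missing; gaps in "score" are ignored.
rows_with_age = df.dropna(subset=["age"], axis=0)

# Without subset, drop every row that has a missing value in any column.
complete_rows = df.dropna(axis=0)

print(rows_with_age)
print(complete_rows)
```

The first result keeps the row whose score is missing, while the second keeps only fully populated rows.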
Replace missing values
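Another option is to replace missing values with the average of the values in the column or with the most frequently occurring value. In pandas we can replace values using the replace() method of a dataframe or of a single column.
Syntax: How to replace values in a dataframe
dataframe["column_name"] = dataframe["column_name"].replace(old_value, new_value)
Assign the result back to the column. Calling replace() with inplace = True on a selected column may act on a temporary copy in recent pandas versions and leave the dataframe unchanged.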
Example: calculating the average value of a column and replacing NaN with it.
import numpy as np
mean = dataframe["column_name"].mean()
dataframe["column_name"] = dataframe["column_name"].replace(np.nan, mean)
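To make the mean replacement concrete, here is a small sketch on made-up data (the "age" column is an assumption for illustration). It also shows fillna(), a pandas method designed specifically for filling missing values, which the article does not use but which gives the same result here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing ages.
df = pd.DataFrame({"age": [30.0, np.nan, 50.0, np.nan]})

# mean() skips NaN by default, so this is (30 + 50) / 2.
mean_age = df["age"].mean()

# Two equivalent ways to fill the gaps with the mean.
via_replace = df["age"].replace(np.nan, mean_age)
via_fillna = df["age"].fillna(mean_age)

print(via_replace.tolist())
```

Both calls return a new Series, so assign the result back to the column if you want to keep it.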
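To replace missing values with the most frequent value, use the mode() function instead of mean(). Because several values can tie for most frequent, mode() returns a Series rather than a single value, so take its first element.
import numpy as np
mode = dataframe["column_name"].mode().iloc[0]
dataframe["column_name"] = dataframe["column_name"].replace(np.nan, mode)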
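The sketch below, on made-up data (the "rating" column is an assumption for illustration), shows why the result of mode() needs special handling: it is a Series that can hold more than one value when there is a tie.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 5.0 is the most frequent rating.
df = pd.DataFrame({"rating": [5.0, 3.0, 5.0, np.nan, 3.0, 5.0]})

# mode() ignores NaN by default and returns a Series of the top value(s).
modes = df["rating"].mode()
most_frequent = modes.iloc[0]

# Fill the missing rating with the single most frequent value.
filled = df["rating"].replace(np.nan, most_frequent)
print(filled.tolist())

# With a tie, mode() returns every tied value in sorted order.
tied = pd.Series([1, 2, 2, 1]).mode()
print(tied.tolist())
```

Taking the first element with .iloc[0] is one simple way to break a tie; whether that choice is appropriate depends on your data.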
Calling replace() on the dataframe itself, without selecting a column, applies the replacement to all columns of the dataframe.
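As a rough sketch of the whole-dataframe case, the example below uses made-up columns "a" and "b" (assumptions for illustration). It also shows one way to fill each column with its own mean, by passing the per-column means to fillna(); that second step goes beyond what the article describes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 4.0, 6.0],
})

# Replace every NaN in every column with the same value.
all_zero = df.replace(np.nan, 0)

# Fill each column with its own mean: df.mean() is a Series indexed by
# column name, and fillna() matches those labels to the columns.
per_column = df.fillna(df.mean())

print(all_zero)
print(per_column)
```

A single replacement value for the whole dataframe is rarely meaningful when columns have different scales, which is why the per-column variant can be the more useful one.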
I hope you found this article useful and that it helps you in your day-to-day work with Pandas!