Handling Missing Numerical Data with Simple Imputer

The SimpleImputer class from sklearn.impute is a common way to handle missing numerical data. It replaces missing values using one of four strategies: the column mean, the column median, the most frequent value (mode), or a constant that you supply.
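To make the four strategies concrete, here is a minimal sketch that applies each one to the same small column. The values and the fill_value of 0.0 are made up for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with a single missing entry (values are made up).
X = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform returns a 2-D array; row 3 is the one that was missing.
    results[strategy] = float(imputer.fit_transform(X)[3, 0])

# "constant" uses fill_value; 0.0 is an arbitrary choice for this sketch.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
results["constant"] = float(constant_imputer.fit_transform(X)[3, 0])

print(results)
# {'mean': 3.0, 'median': 2.0, 'most_frequent': 2.0, 'constant': 0.0}
```

The mean is pulled upward by the value 7.0, while the median and mode are not, which is one reason the median is often preferred for skewed columns.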



Why We Use Simple Imputer

1. Clean Data: Many machine learning algorithms, including most scikit-learn estimators, cannot process missing values. Filling in missing data ensures the dataset is complete.

2. Consistency: Avoid errors caused by missing data.

3. Improved Model Accuracy: Handling missing data correctly can lead to better model performance.

4. Automation: Automates the process of filling missing values in large datasets.


Where Simple Imputer Works


1. Numerical Features: Best suited for filling gaps in numeric datasets.

2. Machine Learning Pipelines: Commonly used in preprocessing stages.

3. Real-World Applications:

Healthcare: Impute missing blood pressure or glucose readings.

Finance: Fill missing customer credit scores or income data.

Retail: Handle missing sales data for products.
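For the pipeline use case in item 2, a rough sketch might look like the following. The data is invented, and LinearRegression is only a stand-in for whatever model follows the imputer.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data (made up): two features with gaps, one numeric target.
X = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0], [4.0, 40.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# The imputer runs first, so the regressor never sees NaN.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X, y)

# New rows with missing values are filled with the medians learned during fit.
predictions = model.predict(np.array([[np.nan, 20.0], [3.0, np.nan]]))
print(predictions)
```

Keeping the imputer inside the pipeline means the fill values are learned only from the data passed to fit, which helps avoid leaking information from validation or test rows.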


How to Use Simple Imputer


Steps:

1. Import the SimpleImputer class.

2. Instantiate it with a specific strategy (mean, median, most_frequent, or constant).

3. Fit the imputer to the data.

4. Transform the data to fill missing values.
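The steps above can also be run as separate fit and transform calls, which is useful when the fill value should be learned from training data and then reused on new data. This is a small sketch; the X_train and X_test arrays are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # step 1: import the class

# Hypothetical train/test split of a single numeric feature.
X_train = np.array([[10.0], [20.0], [np.nan], [30.0]])
X_test = np.array([[np.nan], [50.0]])

imputer = SimpleImputer(strategy="mean")  # step 2: choose a strategy
imputer.fit(X_train)                      # step 3: learn the column mean (20.0)
print(imputer.statistics_)                # learned fill value per column

X_train_filled = imputer.transform(X_train)  # step 4: fill the training gaps
X_test_filled = imputer.transform(X_test)    # reuse the training mean on test data
print(X_test_filled)
```

The missing test value is filled with 20.0, the mean of the training column, rather than a statistic computed from the test data.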


Code Example


# Import libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Create a sample dataset with missing values
data = {
    "Age": [25, np.nan, 35, 29, np.nan],
    "Salary": [50000, 60000, np.nan, 80000, 70000]
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)

# Initialize the SimpleImputer with strategy 'mean'
imputer = SimpleImputer(strategy="mean")

# Fit the imputer and transform the data
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nData After Imputation:")
print(df_imputed)



Output


Original Data:
    Age   Salary
0  25.0  50000.0
1   NaN  60000.0
2  35.0      NaN
3  29.0  80000.0
4   NaN  70000.0

Data After Imputation:
         Age   Salary
0  25.000000  50000.0
1  29.666667  60000.0
2  35.000000  65000.0
3  29.000000  80000.0
4  29.666667  70000.0


Real-Life Example:


Scenario: Healthcare Data


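A hospital has collected patient data for a diabetes study. However, some patients have missing blood sugar levels. Using SimpleImputer, we can fill these gaps with the mean of the available readings, so researchers can analyze every patient record instead of dropping the incomplete ones. Mean imputation keeps the column average unchanged, but it is still an estimate, so results should be interpreted with the imputed values in mind.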


Code Example


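# Imports (repeated so this example runs on its own)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Dataset with missing blood sugar readings
data_health = {
    "Patient_ID": [101, 102, 103, 104],
    "Blood_Sugar_Level": [120, np.nan, 135, np.nan]
}
df_health = pd.DataFrame(data_health)

# Impute missing values with the mean
imputer = SimpleImputer(strategy="mean")

# fit_transform returns an (n, 1) array; ravel() flattens it to a 1-D column
df_health["Blood_Sugar_Level"] = imputer.fit_transform(df_health[["Blood_Sugar_Level"]]).ravel()

print(df_health)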


Output:

   Patient_ID  Blood_Sugar_Level
0         101              120.0
1         102              127.5
2         103              135.0
3         104              127.5
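One caveat worth checking on data like this: filling gaps with the mean leaves the column average unchanged but reduces its spread, because every imputed value sits exactly at the center. The sketch below illustrates this with made-up readings, not real patient data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up readings; two of the six values are missing.
readings = pd.DataFrame(
    {"Blood_Sugar_Level": [110.0, np.nan, 140.0, 125.0, np.nan, 95.0]}
)

# Fill the gaps with the column mean and flatten to a 1-D array.
filled = SimpleImputer(strategy="mean").fit_transform(readings).ravel()

# Compare the observed values with the imputed column.
observed = readings["Blood_Sugar_Level"].dropna()
mean_before, mean_after = observed.mean(), filled.mean()
std_before, std_after = observed.std(), pd.Series(filled).std()

print("mean:", mean_before, "->", mean_after)
print("std: ", std_before, "->", std_after)
```

The mean stays at 117.5 while the standard deviation drops, so analyses that depend on variability may be affected when many values are imputed.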


Conclusion


SimpleImputer is an essential tool in data preprocessing for handling missing numerical data. Its flexibility and ease of use make it suitable for diverse applications across industries like healthcare, finance, and retail.

