Complete Case Analysis


The default treatment of missing data in most statistical packages is complete case (CC) analysis, performed by casewise deletion: any observation with a missing value on any variable is automatically discarded, and only complete observations are analyzed. The main advantage of this approach is its simplicity, since standard complete-data statistical methods can be applied directly. Its disadvantages stem from the information lost by discarding incomplete cases. This is particularly acute in multivariate settings, where incomplete cases often make up a substantial portion of the data set, so deleting them throws away large amounts of data. This loss of information causes two problems: loss of precision (greater uncertainty in the estimates) and bias when the missing data mechanism is not MCAR but only MAR, because in that situation the complete observations are not a random sample of all the observations.
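The bias under MAR can be illustrated with a small simulation. The data-generating process and the missingness rates below are illustrative assumptions, not taken from any real study: y depends on x, and under MAR the probability that y is missing depends on the (fully observed) x.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# x is always observed; y is correlated with x
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)

# MCAR: y is missing with a fixed 30% probability, independent of everything
mcar_mask = rng.random(n) < 0.30

# MAR: y is more likely to be missing when the observed x is large
mar_mask = rng.random(n) < np.where(x > 0, 0.50, 0.10)

print("true mean of y:      ", round(y.mean(), 3))
print("complete-case, MCAR: ", round(y[~mcar_mask].mean(), 3))  # close to the truth
print("complete-case, MAR:  ", round(y[~mar_mask].mean(), 3))   # biased downward
```

Under MAR, cases with large x (and therefore large y) are dropped more often, so the complete-case mean of y is pulled below the true mean; under MCAR the retained cases remain a random subsample and the estimate stays unbiased.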

Complete case analysis

Complete case analysis, also known as listwise deletion (LD), uses only the cases in a data set that have no missing values on any of the variables. This can discard a significant amount of information even in data sets with a modest number of variables. For example, with ten variables each independently observed with probability 90%, the probability that a case contains no missing values is only about 35%. When data are MCAR, the complete cases form a random subsample of the population, so the estimates obtained are unbiased. There can, however, be a significant loss of efficiency in the parameter estimates when a large amount of data is discarded, and if the data are MAR or MNAR the results of LD will most likely be biased.
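The 35% figure follows directly from the independence assumption, since the ten observation events multiply:

```python
# Chance that a case is fully observed: each of 10 variables is observed
# independently with probability 0.90, so all ten survive with prob 0.9**10
p_complete = 0.90 ** 10
print(round(p_complete, 3))      # 0.349 -- only about 35% of cases are complete

# Expected fraction of cases discarded by listwise deletion
print(round(1 - p_complete, 3))  # 0.651
```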


Available case analysis

Available case analysis, also known as pairwise deletion (PD), uses all the available data rather than only the cases with no missing values. This avoids throwing away possibly useful information, especially when there are few or no complete cases. In almost all SEM packages, the sample mean and sample covariance are sufficient input to fit a model. In available case analysis, the sample mean and sample variance of each variable are computed from all the observed cases for that variable, and the covariance between a pair of variables is computed from all the cases observed for that pair. Brown (1983) investigated this method in the context of factor analysis. It leads to unbiased estimates if data are MCAR, and to biased estimates for MAR data. One major shortcoming of the available case method is that it can produce a covariance matrix that is not positive definite. Notwithstanding this problem, because it uses more data, available case analysis is expected to be more efficient than complete case analysis. A simulation by Kim and Curry (1977) supports this expectation when data are MCAR and the correlations between variables are modest. Other simulations, however, indicate the superiority of complete case analysis in the presence of large correlations (Azen and Van Guilder, 1981). Marsh (1998) performed a simulation indicating that this method can lead to substantially biased test statistics, depending on the percentage of missing data and the sample size.
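As a sketch of the difference, pandas' `DataFrame.cov()` uses pairwise-complete observations by default, so pairwise and listwise covariances can be compared directly. The toy data below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data with missing values scattered across variables
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "y": [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    "z": [1.0, 3.0, 2.0, np.nan, 6.0, 5.0],
})

# Pairwise deletion: each covariance is computed from the cases
# observed for that particular pair of variables
cov_pairwise = df.cov()

# Listwise deletion: drop any row with a missing value first
cov_listwise = df.dropna().cov()

print("pairwise:\n", cov_pairwise, "\n")
print("listwise:\n", cov_listwise, "\n")

# The pairwise matrix is not guaranteed to be positive (semi-)definite;
# a negative eigenvalue here would signal that problem
print("pairwise eigenvalues:", np.linalg.eigvalsh(cov_pairwise.values))
```

Because each entry of the pairwise matrix is estimated from a different subset of cases, the entries need not be mutually consistent, which is exactly why the resulting matrix can fail to be positive definite.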


Steps for Complete Case Analysis


1. Understand the Dataset:

Familiarize yourself with the structure of the dataset, including variable types (categorical, numerical), and missing values.

2. Check Missing Values:

Identify rows with missing values in any variable.

3. Remove Rows with Missing Values:

Exclude rows that contain any missing values (listwise deletion).

4. Verify the Dataset:

Check the resulting dataset to ensure it is complete and ready for analysis.

5. Proceed with Analysis:

Use the cleaned dataset for descriptive or inferential statistical analysis.



Example Code (Python)


import pandas as pd
import numpy as np

# Sample DataFrame with missing values in 'Age' and 'Income'
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, np.nan, 40, 22],
    'Gender': ['F', 'M', 'M', 'M', 'F'],
    'Income': [50000, np.nan, 70000, 80000, 45000],
    'City': ['NY', 'LA', 'SF', 'LA', 'NY']
}
df = pd.DataFrame(data)

# Step 1: Inspect the dataset
print("Original Dataset:")
print(df)

# Step 2: Identify missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Step 3: Perform complete case analysis (remove rows with missing values)
df_complete = df.dropna()

# Step 4: Verify the dataset
print("\nDataset after Complete Case Analysis:")
print(df_complete)

# Step 5: Example analysis (summary statistics)
print("\nDescriptive Statistics (Numerical Variables):")
print(df_complete.describe())

print("\nFrequency Distribution (Categorical Variables):")
for col in df_complete.select_dtypes(include=['object']):
    print(f"{col}:\n{df_complete[col].value_counts()}\n")


Output:

1. The original dataset contains all rows, including those with missing values.

2. After dropping rows with missing values, the cleaned dataset contains only complete cases.

3. Descriptive statistics (for numerical variables) and frequency distribution (for categorical variables) are computed for the cleaned data.
