Missing Values
Missing values can arise from several sources, including data corruption, failure to record data, and responses omitted during data collection.
Missing values in data represent instances where no data value is stored for a variable in an observation. These are often represented as NaN (Not a Number), NA (not available), None, or some other placeholder in datasets.
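As a rough illustration of the placeholders mentioned above, the sketch below counts missing entries per column in a small made-up dataset. The records, the `PLACEHOLDERS` set, and the `is_missing` helper are hypothetical names chosen for this example; real datasets may use other markers (for example -999 or "unknown"), so the set would need adjusting.

```python
import math

# Hypothetical records; None, NaN, and the string "NA" all stand for "no value".
records = [
    {"age": 34, "income": 52000.0, "city": "Oslo"},
    {"age": None, "income": float("nan"), "city": "NA"},
    {"age": 29, "income": 61000.0, "city": None},
]

# Assumed text placeholders; extend this to match the dataset at hand.
PLACEHOLDERS = {"NA", "N/A", ""}


def is_missing(value):
    """Return True if value is None, a float NaN, or a known placeholder string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() in PLACEHOLDERS


# Count missing entries column by column.
missing_counts = {
    column: sum(is_missing(row[column]) for row in records)
    for column in records[0]
}
print(missing_counts)  # {'age': 1, 'income': 1, 'city': 2}
```

Note that NaN has to be tested with `math.isnan` because `float("nan") == float("nan")` is False, so an equality check would miss it.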
The presence of missing values can significantly affect statistical tests, data visualizations, and machine learning models, leading to biased or inaccurate results. Handling missing values properly is necessary to make accurate inferences from the data.
😎 Explain the differences between Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), and provide a real-world example for each.
Missing Completely at Random (MCAR): The missingness is purely random and unrelated to any variable, observed or unobserved. For example, a malfunctioning sensor sometimes fails to record a reading; data from other sensors, or from the same sensor at other times, tells us nothing about which readings are missing.
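Missing at Random (MAR): The missingness is related to other observed variables but, once those are accounted for, not to the missing value itself. For example, survey respondents in certain job types or neighborhoods might be less likely to disclose their income. Even though you don't have their income directly, the observed job type or neighborhood explains the missingness and can inform how you handle it. (If people withheld their income because of the amount itself, that would be MNAR.)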
Missing Not at Random (MNAR): The missingness is directly related to the missing value itself. This is tricky! For example, patients dropping out of a drug trial specifically because they're experiencing severe side effects – the reason for the missing outcome data is the negative outcome.
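To make the three mechanisms concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption made for illustration: the synthetic age and income data, the drop probabilities, and the `observed_mean` helper. It drops income values under each mechanism and compares the mean of what remains with the true mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic population: income rises with age, plus noise (made-up numbers).
n = 20_000
ages = [random.uniform(20, 70) for _ in range(n)]
incomes = [20_000 + 800 * a + random.gauss(0, 5_000) for a in ages]
median_income = statistics.median(incomes)


def observed_mean(drop_probability):
    """Mean income over the rows that survive a per-row drop probability."""
    kept = [
        inc for age, inc in zip(ages, incomes)
        if random.random() >= drop_probability(age, inc)
    ]
    return statistics.fmean(kept)


true_mean = statistics.fmean(incomes)

# MCAR: every row has the same chance of losing its income value.
mcar_mean = observed_mean(lambda age, inc: 0.3)

# MAR: the chance depends only on age, which is always observed.
mar_mean = observed_mean(lambda age, inc: 0.6 if age > 45 else 0.1)

# MNAR: the chance depends on the income value that goes missing.
mnar_mean = observed_mean(lambda age, inc: 0.6 if inc > median_income else 0.1)

for label, value in [("true", true_mean), ("MCAR", mcar_mean),
                     ("MAR", mar_mean), ("MNAR", mnar_mean)]:
    print(f"{label:>4}: {value:,.0f}")
```

In this setup the MCAR mean should stay close to the true mean, while the MAR and MNAR means should both come out too low. The difference is that under MAR the bias can be corrected using the observed variable (here, age), whereas under MNAR the observed data alone cannot reveal or fix it.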
