The ‘Robust’ Data Scientist: Winning with Messy Data and Pingouin

In real-world data science, the idealized conditions presented in academic textbooks frequently diverge from practical reality. Outside the meticulously curated environments of theoretical exercises, data scientists are routinely confronted with pervasive outliers, heavily skewed distributions, and highly variable data that defy classical statistical assumptions. This gap between theory and practice underscores a critical need for adaptable methodologies, a need increasingly met by the principles of robust statistics.

The Pervasiveness of Imperfect Data

The notion that data in the wild is often "messy" is not a lament but a fundamental truth of modern data acquisition. From sensor readings in industrial IoT applications to customer behavior logs on e-commerce platforms, and even scientific experimental results, data rarely arrives in a pristine, perfectly bell-curved state. Data collection processes are susceptible to errors, instrumentation noise, human input mistakes, and inherent complexities of the phenomena being measured. Outliers, for instance, can arise from legitimate but extreme events, data entry errors, or sensor malfunctions. Similarly, skewed distributions are common in economic data (e.g., income, wealth), biological measurements (e.g., drug response times), or web analytics (e.g., visit duration), where a small number of observations significantly diverge from the majority.

Classical statistical methods, such as the t-test or Analysis of Variance (ANOVA), are powerful tools built upon specific assumptions, including normality, homoscedasticity (equal variances), and independence of observations. When these assumptions are violated, the reliability of the p-values, confidence intervals, and effect sizes derived from these tests can be severely compromised, leading to erroneous conclusions and potentially flawed decision-making. Ignoring these violations, a common pitfall in high-pressure data analysis environments, can result in models that perform poorly in production or insights that are simply not valid.

The Rise of Robust Statistics: A Historical Context

The recognition of data’s inherent messiness and its implications for statistical inference is not new. Statisticians have long grappled with the limitations of classical parametric tests when assumptions are not met. The development of non-parametric methods in the mid-20th century, such as the Mann-Whitney U test and the Wilcoxon Signed-Rank Test, marked a significant step towards robustness. These methods eschew strict distributional assumptions, relying instead on ranks or signs of differences, thereby becoming less sensitive to outliers and non-normal data.

More broadly, robust statistics emerged as a distinct field aimed at developing methods that are less sensitive to small deviations from model assumptions, particularly concerning the tails of distributions or the presence of outliers. Pioneering work by statisticians like John Tukey in the 1960s and 70s emphasized exploratory data analysis and robust estimation techniques, laying the groundwork for a more practical and resilient approach to data analysis. This evolution reflects a growing understanding that while ideal conditions simplify mathematical proofs, real-world applications demand methods that can withstand the rigors of imperfect data.

Pingouin: A Practitioner’s Ally in Robust Analysis

In this context, specialized libraries like Pingouin have become indispensable for modern data scientists. Pingouin is a comprehensive open-source statistical package for Python, designed to be user-friendly while offering a wide array of statistical tests, including many robust alternatives to classical methods. It simplifies the process of performing complex statistical analyses, making advanced techniques accessible even to those without a deep theoretical background in mathematical statistics. By abstracting away the computational complexities, Pingouin allows data scientists to focus on the interpretation of results and the implications for their specific domain problems.

A previous exploration using Pingouin highlighted its utility in building robust Exploratory Data Analysis (EDA) pipelines, specifically in detecting violations of assumptions like homoscedasticity and normality. The crucial follow-up question, however, is: what action should be taken when these diagnostic tests fail? The answer is not to discard valuable data or resort to overly aggressive data cleaning that might distort underlying patterns. Instead, it is to pivot towards robust statistical methods that are inherently designed to operate effectively in such challenging conditions.

This article delves into the practical craftsmanship of deploying robust statistics within typical data science workflows. We will explore three common scenarios where classical tests falter and demonstrate how Pingouin provides elegant, robust solutions. Our "choose your own adventure" approach will illustrate how to navigate the "ugliest" aspects of real-world data, ensuring reliable and valid results even when the data refuses to conform to textbook ideals.

Case Study: Navigating the Wine Quality Dataset

To anchor our exploration, we will utilize a well-known, notoriously "messy" dataset: the combined red and white wine quality dataset. This dataset, available publicly, presents a rich tableau of chemical properties and quality ratings for thousands of wine samples. Its inherent characteristics, including varying distributions and the presence of outliers across different chemical parameters, make it an excellent testbed for robust statistical methods.

Our initial setup involves installing the necessary libraries and loading the dataset:

!pip install pingouin pandas

import pandas as pd
import pingouin as pg

# Loading our messy, real-world-like dataset, containing red and white wine samples
url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
df = pd.read_csv(url)

# Take a small peek at what we are about to deal with
df.head()

As many practitioners familiar with this dataset will attest, it frequently fails to meet the stringent assumptions of classical parametric tests across various features. This makes it an ideal candidate to demonstrate Pingouin’s robust capabilities through a series of practical "adventures."

Challenge 1: Non-Normal Distributions – The Mann-Whitney U Test

One of the most frequent challenges in data analysis is encountering non-normally distributed data. Suppose a data scientist is tasked with comparing the alcohol content between white wine samples and red wine samples. A natural first inclination might be to use an independent samples t-test. However, the t-test assumes that the data within each group is approximately normally distributed. Let’s perform a normality test using Pingouin:

white_wine_alcohol = df[df['type'] == 'white']['alcohol']
red_wine_alcohol = df[df['type'] == 'red']['alcohol']

print("Normality test for White Wine Alcohol content:")
print(pg.normality(white_wine_alcohol))
print("\nNormality test for Red Wine Alcohol content:")
print(pg.normality(red_wine_alcohol))

Upon executing these tests, one would observe extremely low p-values for both white and red wine alcohol content, indicating a severe departure from normality. While non-normality itself doesn’t directly imply the presence of outliers or extreme skewness, a strong deviation often suggests these characteristics are present, or that the distribution is fundamentally different from a Gaussian shape (e.g., bimodal, exponential). In such a scenario, a standard t-test would yield unreliable p-values and confidence intervals, potentially leading to incorrect conclusions about the true difference in alcohol content between wine types.
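To see what drives such a failure, it helps to quantify the shape of a distribution directly. The short sketch below uses simulated right-skewed data as a stand-in for a feature like alcohol content (the variable name and parameters are invented for illustration, not taken from the wine dataset):

```python
import numpy as np
import pandas as pd

# Simulated right-skewed values standing in for a non-normal feature
rng = np.random.default_rng(0)
alcohol_like = pd.Series(rng.lognormal(mean=2.3, sigma=0.4, size=1000))

# Positive skew and positive excess kurtosis both flag departures
# from a Gaussian shape, even before any formal normality test
print("skew:", round(alcohol_like.skew(), 2))
print("excess kurtosis:", round(alcohol_like.kurtosis(), 2))
```

A skew well above zero and heavy tails like these are exactly the conditions under which the t-test's normality assumption breaks down.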

The robust fix for comparing two independent groups when normality is violated is the Mann-Whitney U test, also known as the Wilcoxon rank-sum test. Unlike the t-test, which compares means, the Mann-Whitney U test compares the ranks of observations across the two groups. Conceptually, it asks whether one group’s values tend to be larger or smaller than the other’s, without assuming any specific distribution. This rank-based approach is powerful: by converting raw values into ranks, the distorting influence of extreme outliers is largely neutralized. An outlier of enormous magnitude still receives the highest rank, but its absolute distance from the next-highest value no longer disproportionately sways the test statistic.
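A two-line illustration (with made-up numbers) shows why ranking defuses outliers:

```python
import pandas as pd

# The wild outlier (45.0) still gets the top rank, but its distance from
# the rest of the data no longer influences a rank-based statistic
values = pd.Series([9.5, 10.1, 10.4, 11.0, 45.0])
print(values.rank().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Whether the largest value were 11.5 or 45.0, the ranks (and hence the test statistic) would be identical.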

Here’s how to apply it using Pingouin:

# Separating our two groups
red_wine_alcohol = df[df['type'] == 'red']['alcohol']
white_wine_alcohol = df[df['type'] == 'white']['alcohol']

# Running the robust Mann-Whitney U test
mwu_results = pg.mwu(x=red_wine_alcohol, y=white_wine_alcohol)
print(mwu_results)

The output might appear as follows:

         U-val alternative     p-val       RBC      CLES
MWU  3829043.5   two-sided  0.181845 -0.022193  0.488903

In this result, the p-value of 0.181845 is well above the conventional significance level of 0.05, so the Mann-Whitney U test finds no statistically significant difference in alcohol content between red and white wines. Crucially, this conclusion is robust: it is largely unaffected by the non-normal distributions or any outliers present in the alcohol data. A data scientist can report this finding with confidence that the statistical integrity of the analysis holds despite the "messiness" of the underlying data. The CLES (Common Language Effect Size) of 0.488903 reinforces it: a randomly selected red wine has roughly a 48.9% chance of containing more alcohol than a randomly selected white wine, so close to the 50% expected under no difference that the practical difference is negligible.
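For intuition on what CLES measures, it is simply the probability that a randomly drawn value from one group exceeds a randomly drawn value from the other. A quick simulation with invented, heavily overlapping groups (not the wine data) yields a value near 0.5:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10.4, 1.0, 500)  # stand-in for one group's values
y = rng.normal(10.5, 1.2, 500)  # stand-in for the other group's values

# CLES: fraction of all (x, y) pairs in which x exceeds y
cles = (x[:, None] > y[None, :]).mean()
print(f"CLES ~ {cles:.3f}")  # near 0.5 -> negligible practical difference
```

When the two groups barely differ, roughly half of all cross-group comparisons go each way, which is exactly what a CLES near 0.5 expresses.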

Challenge 2: Paired Data Discrepancies – The Wilcoxon Signed-Rank Test

Another common analytical scenario involves comparing two measurements taken from the same subject or unit. For instance, a pharmaceutical company might measure a patient’s blood sugar level before and after administering a new drug, or a chemist might measure two different acidity levels (e.g., fixed acidity vs. volatile acidity) within the same wine sample. The core interest here lies in the differences between these paired measurements. A standard paired t-test would typically be applied, but it assumes that these differences are normally distributed. When this assumption is violated, the paired t-test can yield unreliable confidence intervals and p-values, potentially obscuring real effects or falsely indicating them.

The ideal robust alternative for such situations is the Wilcoxon Signed-Rank Test. This non-parametric test is the robust sibling of the paired t-test. It operates by first calculating the differences between each pair of measurements. Then, it ranks the absolute values of these differences and assigns the original sign (positive or negative) back to the ranks. By comparing the sum of positive ranks to the sum of negative ranks, it assesses whether there is a systematic shift in one direction. Similar to the Mann-Whitney U test, its reliance on ranks makes it highly resistant to outliers in the differences and does not require the differences to be normally distributed.
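The mechanics described above can be sketched by hand on a toy set of paired measurements (the numbers below are invented purely for illustration):

```python
import numpy as np
from scipy.stats import rankdata

# Toy paired measurements (invented numbers)
before = np.array([5.0, 6.2, 5.8, 7.1, 6.5])
after = np.array([5.4, 6.0, 6.3, 7.9, 6.9])

diff = after - before            # paired differences
ranks = rankdata(np.abs(diff))   # rank the absolute differences (ties averaged)
w_plus = ranks[diff > 0].sum()   # sum of ranks for positive differences
w_minus = ranks[diff < 0].sum()  # sum of ranks for negative differences

# A lopsided split between the two rank sums signals a systematic shift
print(w_plus, w_minus)  # 14.0 1.0
```

Because only ranks of the differences enter the statistic, one wildly extreme pair cannot dominate the result the way it would in a paired t-test.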

In Pingouin, the pg.wilcoxon() function is used, taking the two columns containing the paired measures as input. Let’s compare ‘fixed acidity’ and ‘volatile acidity’ in our wine dataset, treating them as paired measures within each wine sample:

# Run the robust Wilcoxon signed-rank test for paired data
wilcoxon_results = pg.wilcoxon(x=df['fixed acidity'], y=df['volatile acidity'])
print(wilcoxon_results)

The output from this test would likely be:

          W-val alternative  p-val  RBC  CLES
Wilcoxon    0.0   two-sided    0.0  1.0   1.0

This result shows a p-value of effectively zero, indicating an extremely statistically significant difference. The W-val of 0.0, combined with RBC (rank-biserial correlation) and CLES values of 1.0, points to a "perfect separation": ‘fixed acidity’ exceeds ‘volatile acidity’ in essentially every wine sample. This is not just a statistically significant difference; it suggests that the two properties operate on entirely different magnitude tiers and represent distinct chemical characteristics of the wine. The robustness of the Wilcoxon test ensures that this strong conclusion is not an artifact of a few extreme samples or of non-normal distributions of the acidity differences.

Challenge 3: Unequal Variances Across Groups – Welch’s ANOVA

For our third and final adventure, consider a scenario where a data scientist wants to determine if residual sugar levels in wine differ significantly across various quality ratings. Wine quality ratings typically range from 3 to 9, taking integer values, which can be treated as discrete categories. A traditional one-way ANOVA is the go-to test for comparing means across three or more independent groups. However, ANOVA assumes homoscedasticity – that the variance of the dependent variable (residual sugar) is equal across all groups (quality ratings).

Real-world data often violates this assumption. For instance, the variability in residual sugar might be very high in wines of mediocre quality (rating 5 or 6) because these categories might encompass a broader range of styles and winemaking practices, while top-quality wines (rating 8 or 9) might have very consistent, low sugar levels. If a diagnostic test like Pingouin’s Levene test for homoscedasticity reveals a dramatic failure (e.g., extremely low p-value), then a classical one-way ANOVA would produce misleading results, as its F-statistic and p-values are sensitive to unequal variances.

The robust solution for comparing means across multiple groups when variances are unequal is Welch’s ANOVA. Unlike standard ANOVA, Welch’s ANOVA does not pool a single error variance across all groups. Instead, it weights each group by the inverse of its variance and adjusts the denominator degrees of freedom (via a Satterthwaite-type approximation) to reflect how unequal the spreads are. Noisy groups therefore carry less weight, which keeps the comparison fair and the statistical inference reliable even in the presence of heteroscedasticity.

Here’s how to implement Welch’s ANOVA using Pingouin:

# Run Welch's ANOVA to compare sugar across quality ratings
welch_results = pg.welch_anova(data=df, dv='residual sugar', between='quality')
print(welch_results)

The output from Welch’s ANOVA might look like this:

    Source  ddof1      ddof2          F         p-unc       np2
0  quality      6  54.507934  10.918282  5.937951e-08  0.008353

Even in a situation where a traditional one-way ANOVA would have struggled due to unequal variances across quality groups, Welch’s ANOVA provides a robust and reliable conclusion. The very small p-value (5.937951e-08) is clear and compelling evidence that residual sugar levels differ significantly across wine quality ratings. This robust finding allows a data scientist to confidently state that wine quality has an association with residual sugar content.

However, it is crucial to interpret the effect size alongside statistical significance. The partial eta-squared (np2) value of 0.008353 is very low: quality rating explains less than 1% of the variance in residual sugar content. So while the group differences are statistically reliable, residual sugar is just one minor piece of a much larger and more complex puzzle underlying wine quality, a nuance a robust data scientist would highlight to provide a balanced perspective.

Broader Implications for Data Science Practice and Expert Commentary

These three scenarios vividly illustrate a fundamental shift in the paradigm of effective data science. The ability to navigate and extract meaningful insights from imperfect, real-world data is paramount. Leading statisticians and data science thought leaders consistently underscore the critical need for methods resilient to data imperfections. Industry experts often emphasize that blindly applying classical parametric tests without validating assumptions is a common pitfall that can lead to costly business decisions or misinterpretations of scientific findings. The increasing complexity and volume of data sources, coupled with the inherent messiness of data generation processes, have prompted a consensus among industry professionals that a purely classical statistical approach is often insufficient.

The "robust" data scientist is not merely an individual who can write code or build models; they are a craftsman capable of applying the right statistical tool for the job, even when the data is uncooperative. This means understanding the assumptions behind various tests, diagnosing when those assumptions are violated, and possessing the knowledge and tools to pivot to robust alternatives. Libraries like Pingouin play a pivotal role in democratizing these advanced statistical methods, making them accessible to a wider audience of data practitioners. By abstracting the complex mathematical implementations, Pingouin empowers data scientists to focus on the inferential questions at hand, rather than getting bogged down in the minutiae of statistical theory.

Conclusion

Through these practical examples, we have traversed a path that pairs common messy-data problems with sophisticated yet accessible robust statistical strategies. The journey underscores a vital lesson: true proficiency in data science does not hinge on the unattainable goal of having perfectly clean data or meticulously tuning it to fit rigid assumptions. Instead, it lies in the mastery of knowing how to proceed effectively when data presents its inevitable challenges. Pingouin’s suite of functions, by implementing a variety of robust tests, provides an invaluable escape route from the "failed assumptions" trap. It enables data scientists to extract mathematically sound and reliable insights with minimal additional effort, fostering greater confidence in their analytical conclusions and the data-driven decisions that follow. The rise of the robust data scientist signifies a maturation of the field, embracing the complexities of real-world data to deliver more accurate, reliable, and impactful results.

Ivan Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
