Stop Plotting Randomly: A Systematic Framework for Exploratory Data Analysis

LDS Team
Let's Data Science

You have a new dataset. It has 50 columns, 100,000 rows, and messy variable names. The overwhelming temptation is to immediately import libraries and start generating random histograms, hoping a pattern magically appears. This "spaghetti at the wall" approach is a primary reason data science projects fail or drag on for weeks.

True Exploratory Data Analysis (EDA) is not just visualization; it is an interrogation. It is a systematic process of questioning data to reveal its structure, its secrets, and its broken parts. Without a structured framework, you risk missing critical data leakage, ignoring silent data corruption, or engineering features based on false assumptions.

This article outlines a four-phase EDA framework—Structure, Uniqueness, Relationships, and Anomalies—that transforms raw data into a clear roadmap for modeling.

Why is a systematic EDA framework necessary?

A systematic EDA framework ensures comprehensive data understanding by enforcing a checklist of statistical and visual checks. This prevents confirmation bias, where analysts only look for patterns they expect to find, and exposes "silent failures" like non-random missingness or subtle data leakage. It turns ad-hoc exploration into a repeatable scientific process.

Phase 1: The Structural Health Check

Before visualizing distributions, you must understand the container your data arrives in. This phase diagnoses the technical health of the dataset.

Checking Dimensions and Types

The first step determines if the data matches your expectations. A common issue is numerical data loading as objects (strings) due to a single "N/A" or special character.

python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Setting style for visibility
sns.set_theme(style="whitegrid")

# Simulating a dataset for this walkthrough
data = {
    'price': np.random.lognormal(mean=10, sigma=1, size=1000),
    'sq_ft': np.random.normal(1500, 500, 1000),
    'type': np.random.choice(['Apartment', 'House', 'Condo'], 1000),
    'year_built': np.random.randint(1950, 2023, 1000),
    'garage': np.random.choice([0, 1, np.nan], 1000) # Introducing missing values
}
df = pd.DataFrame(data)

# The first command you should run
print(df.info())

Expected Output:

text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   price       1000 non-null   float64
 1   sq_ft       1000 non-null   float64
 2   type        1000 non-null   object 
 3   year_built  1000 non-null   int64  
 4   garage      658 non-null    float64
dtypes: float64(3), int64(1), object(1)
memory usage: 39.2+ KB
None

⚠️ Common Pitfall: If a column like price appears as object (Dtype), do not proceed. Pandas has likely detected a string character (like "$") mixed with numbers. You must coerce these types before analysis.
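
In practice, the fix is usually pd.to_numeric with errors="coerce" after stripping the offending characters. A minimal sketch, using a hypothetical price_raw series that mixes currency strings with plain numbers:

python
# Hypothetical messy column: numbers mixed with "$", "," and "N/A"
raw = pd.Series(["1200", "$1,500", "N/A", "1800"], name="price_raw")

# Strip everything that is not a digit or decimal point, then coerce;
# values that still cannot be parsed become NaN instead of raising
cleaned = raw.str.replace(r"[^0-9.]", "", regex=True).replace("", np.nan)
price_numeric = pd.to_numeric(cleaned, errors="coerce")

print(price_numeric.dtype)         # float64
print(price_numeric.isna().sum())  # count of unparseable rows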

What is the "cardinality check" and why does it matter?

Cardinality refers to the number of unique values in a column. Checking cardinality helps identify categorical variables masquerading as numbers (e.g., zip codes) and high-cardinality features that will bloat one-hot encodings or tempt tree-based models into overfitting.

If a categorical column has too many unique values, you may need strategies like Frequency Encoding.

python
# Checking unique values
print(df.nunique())
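
As a preview of that strategy, frequency encoding replaces each category with how often it occurs. A minimal sketch on the type column (low-cardinality here, but the same pattern applies to something like zip codes):

python
# Frequency encoding sketch: map each category to its relative frequency
freq_map = df['type'].value_counts(normalize=True)
df['type_freq'] = df['type'].map(freq_map)

print(df[['type', 'type_freq']].head())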

Phase 2: Univariate Analysis (The Soloists)

Once the structure is solid, we examine variables in isolation. The goal is to understand the shape of the data. Is it symmetrical? Is it skewed? Is it multi-modal?

Visualizing Numerical Distributions

For continuous variables, we look for skewness and spread.

$$\text{Skewness} = \frac{E[(X - \mu)^3]}{\sigma^3}$$

In Plain English: Skewness measures the asymmetry of your data. A skewness of 0 is a perfect bell curve. Positive skew means the "tail" drags out to the right (like income or house prices). Negative skew means the tail drags to the left. If your data is heavily skewed, linear models will struggle, and you might need a log transformation.
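
To put a number on it, pandas exposes a skew() method, and np.log1p is a common remedy for heavy right skew. A quick sketch on the simulated price column:

python
# Lognormal prices should show strong positive skew
print(f"Skewness before: {df['price'].skew():.2f}")

# log1p(x) = log(1 + x) handles zeros safely; skew should shrink toward 0
price_log = np.log1p(df['price'])
print(f"Skewness after log1p: {price_log.skew():.2f}")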

We use histograms and boxplots simultaneously. The histogram shows the shape; the boxplot highlights the outliers.

python
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Histogram
sns.histplot(df['price'], kde=True, ax=axes[0])
axes[0].set_title('Price Distribution (Histogram)')

# Boxplot
sns.boxplot(x=df['price'], ax=axes[1])
axes[1].set_title('Price Boxplot')

plt.show()

Visualizing Categorical Distributions

For categorical data, we care about class imbalance. If 99% of your target variable is "No Fraud" and 1% is "Fraud," your accuracy metrics will lie to you.

💡 Pro Tip: Always visualize categorical counts as percentages, not just raw numbers. It makes understanding the magnitude of imbalance easier.
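
For example, value_counts(normalize=True) gives the shares directly, and a simple bar chart makes any imbalance obvious:

python
# Class shares as percentages rather than raw counts
type_pct = df['type'].value_counts(normalize=True) * 100
print(type_pct.round(1))

# Bar chart of the percentages
type_pct.plot(kind='bar')
plt.ylabel('Share of rows (%)')
plt.title('Property Type Distribution')
plt.show()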

For handling these categorical features later, you will likely need Categorical Encoding.

Phase 3: Bivariate Analysis (The Relationships)

Data science happens when variables interact. In this phase, we test hypotheses: "Does square footage actually predict price?"

Numerical vs. Numerical: The Correlation Matrix

We quantify relationships using the Pearson Correlation Coefficient (r):

$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}$$

In Plain English: This formula calculates how much two variables change together (covariance) divided by how much they change individually (variance). The result is a standardized number between -1 and 1.

  • +1: Perfect synchronization (as X goes up, Y goes up).
  • 0: No linear relationship (random chaos).
  • -1: Perfect inverse (as X goes up, Y goes down).

However, reliance on correlation alone is dangerous. r only captures linear relationships. A parabolic relationship (U-shape) could have a correlation of 0 but be perfectly predictive.

⚠️ Common Pitfall: Never trust the correlation number without seeing the scatter plot. Anscombe's Quartet is a famous dataset where four totally different plots have the exact same correlation statistics.

python
# Correlation Matrix
corr = df[['price', 'sq_ft', 'year_built']].corr()

# Heatmap visualization
plt.figure(figsize=(6, 4))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

# Scatter plot validation
sns.scatterplot(x='sq_ft', y='price', data=df, alpha=0.5)
plt.title("Price vs. Square Footage")
plt.show()

If you have too many features, you should consult our guide on Feature Selection to narrow down the noise.

Categorical vs. Numerical

To see how a category affects a number (e.g., "Do Houses cost more than Condos?"), use grouped boxplots.

python
plt.figure(figsize=(8, 5))
sns.boxplot(x='type', y='price', data=df)
plt.title("Price Distribution by Property Type")
plt.show()

If the boxes (Interquartile Ranges) overlap significantly, the categorical variable might not have strong predictive power for that numerical target.
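
To back the visual up with numbers, a quick groupby summary shows how far apart the quartiles of each group actually sit:

python
# Compare quartiles per property type to judge how much the boxes overlap
group_stats = df.groupby('type')['price'].describe()[['25%', '50%', '75%']]
print(group_stats)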

Phase 4: The "Dark Matter" (Missingness & Anomalies)

Real-world data is messy. Ignoring gaps or weird spikes ensures model failure.

How should we analyze missing data?

Don't just fill NaN with zero. You must visualize patterns of missingness. Is the data missing completely at random (MCAR), or is it missing because of what actually happened? (e.g., A "Null" garage might mean "No Garage," not "We forgot to write it down").

Using a heatmap to visualize missingness is often more revealing than a simple count.

python
plt.figure(figsize=(10, 4))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Data Heatmap (Yellow = Missing)")
plt.show()
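
A per-column tally in percentage terms complements the heatmap and tells you which columns need an imputation decision at all:

python
# Percentage of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False).round(1))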

For a deep dive on how to fix these gaps, read Missing Data Strategies.

Detecting Outliers

Outliers skew averages and confuse gradient descent. We can detect them using the IQR (Interquartile Range) method or Z-Scores.

$$Z = \frac{x - \mu}{\sigma}$$

In Plain English: The Z-score tells you how many standard deviations a data point is from the average. If a Z-score is greater than 3, that data point is extremely rare (0.15% probability in a normal distribution). It is likely an outlier or an error.
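
A minimal sketch of both rules on the simulated price column (the 1.5 × IQR and |Z| > 3 cutoffs are conventions, not laws):

python
# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df[(df['price'] < q1 - 1.5 * iqr) | (df['price'] > q3 + 1.5 * iqr)]

# Z-score rule: flag points more than 3 standard deviations from the mean
z_scores = (df['price'] - df['price'].mean()) / df['price'].std()
z_outliers = df[z_scores.abs() > 3]

print(f"IQR outliers: {len(iqr_outliers)}, Z-score outliers: {len(z_outliers)}")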

For complex, multi-dimensional datasets, simple Z-scores fail. You will need advanced algorithms like Isolation Forest or Local Outlier Factor to find anomalies based on density rather than just distance.
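
As a rough sketch using scikit-learn's IsolationForest (assuming scikit-learn is available; the contamination value is an assumption you should tune):

python
from sklearn.ensemble import IsolationForest

# Fit on the numeric features; contamination is an assumed outlier share
features = df[['price', 'sq_ft', 'year_built']]
iso = IsolationForest(contamination=0.03, random_state=42)
labels = iso.fit_predict(features)  # -1 = anomaly, 1 = normal

print(f"Flagged anomalies: {(labels == -1).sum()} of {len(features)} rows")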

Conclusion

EDA is not a passive step; it is an active defense against bad models. By following a systematic framework—checking structure, analyzing univariate distributions, validating bivariate relationships, and scrutinizing anomalies—you move from guessing to knowing.

When you finish your EDA, you should know:

  1. Which features need transformation (skewed data).
  2. Which features are redundant (high correlation).
  3. How to handle missing gaps (imputation strategy).
  4. Whether your classes are imbalanced.

From here, you are ready to engineer features that actually work. To continue building your pipeline, check out our guide on Feature Engineering.


Hands-On Practice

Following the systematic framework outlined in the article—Structure, Uniqueness, Relationships, and Anomalies—we will perform a structured Exploratory Data Analysis (EDA) on the Customer Analytics dataset. This process moves beyond random plotting to rigorously check data types, distribution shapes, correlations, and potential outliers using pure Python and Matplotlib.

Dataset: Customer Analytics (Data Analysis). A rich customer dataset with 1,200 rows designed for EDA, data profiling, correlation analysis, and outlier detection. It contains intentional correlations (strong, moderate, non-linear), ~5% missing values, ~3% outliers, varied distributions, and business context for storytelling.

Try It Yourself

Data Analysis: 1,200 customer records with demographics, behavior, and churn data

By systematically checking the structure, distributions, relationships, and anomalies, we have validated the dataset's quality before attempting any complex modeling. This process revealed the distribution of spending, confirmed potential relationships between income and purchase habits, and identified outliers that could skew linear models.