>Exploratory Data Analysis

> It is important to understand what you can do before you learn to measure how well you seem to have done it. (John Tukey)

\ Otho Mantegazza _ Dataviz for Scientists _ Part 2.3

BOXPLOT

Year: 1977

Author: John Tukey

Book: Exploratory Data Analysis

The Boxplot is one of the main visual models used to explore data. It shows Summary Quantile Statistics and outlier for a stratified set of data.

Exploratory Data Analysis

Investigative Work

Exploratory data analysis, as stated by Tukey, is the investigative work on data.

When you explore data, you leave no stone unturned. Relying on graphical methods, robust summary statistic and dimension reduction, you quickly gain insights in all possible patterns, correlations and cause-effect relationships that are in the data. In technical terms, you generate hypothesis.

Confirmation

After you have done your investigative work, you should switch from inspector to judge and test your hypothesis with inference and tests.

But there’s no point in testing the statistical relevance of poorly formulated hypotheses. Exploratory Data Analysis is fundamental in modern statistics, because it allows to formulate the best hypothesis possible.

Visual Immediacy

When you explore data, you have to turn them into insightful formats. Most of the time this involves turning data into graphical and visual shapes.

STEM AND LEAF

Year: 1977

Author: John Tukey

Book: Exploratory Data Analysis

A graphical intuitive representation of car prices.

A big part Tukey’s book “Exploratory Data Analysis” relies on graphical representation of data that you can draw yourself with pen and paper. Luckily today you can use powerful software designed for data exploration purpose, such as the Tidyverse.

Semantics

If you want to represent data intuitively, first you have to learn terms that allow you to describe the structure of a dataset semantically.

The Tidy Data Theory lets us do just that.

TIDY DATA THEORY

Year: 2014

Author: Hadley Wickham

Paper: Tidy Data

A common framework to organize data semantically: if you organize data based on their structure, it’s easier for you to make sense of them, to realize what data you have and what’s missing.

If you organize data with a common framework, it’s also easier to share them with others.

  • Each column is a variable.
  • Each row is an observation.
  • Each cell is a value.
  • Each dataset is one observational unit.

Exploratory Data Analysis

  • Visualize

  • Summarize

  • Stratify

  • Transform

More on EDA (in R).

# all functions to manipulate data
import pandas as pd

# Seaborn Object Interface
import seaborn.objects as so

# The palmer penguins dataset;
# that we are going to use for practice
from palmerpenguins import load_penguins
penguins = load_penguins()

One variable summaries

Histogram

(
  so.Plot(
      data = penguins,
      x = 'bill_depth_mm'
  )
  .add(
    so.Bar(),
    so.Hist()
  )
  .layout(size=(5, 6))
)

Histogram

(
  so.Plot(
      data = penguins,
      x = 'bill_depth_mm'
  )
  .add(
    so.Bars(),
    # The number of bins is arbitrary.
    # Different number of bins can 
    # highlight different patterns
    so.Hist(bins = 30)
  )
  .layout(size=(5, 6))
)

Histogram

(
  so.Plot(
      data = penguins,
      x = 'bill_depth_mm',
      # You can stratify
      # by sex
      color = 'sex'
  )
  .add(
    so.Bars(),
    so.Hist(bins = 30),
    so.Stack()
  )
  .layout(size=(5, 6))
)

Histogram

(
  so.Plot(
      data = penguins,
      x = 'bill_depth_mm',
      # You can stratify
      # by sex
      color = 'sex'
  )
  .add(
    so.Bars(),
    so.Hist(bins = 30),
    so.Stack()
  )
  # and by species
  .facet(row = "species")
  .layout(size=(5, 6))
)

Exercise

Describe an histogram in terms of the grammar of graphics.

Which step do you have to define explicitly? Which step are defined implicitly by Seaborn?

Stripchart

(
  so.Plot(
      data = penguins,
      x = 'bill_depth_mm',
      y = 'species'
  )
  .add(
    so.Dots()
  )
  .layout(size=(5, 6))
)

Stripchart

(
  so.Plot(
      data = penguins,
      x = 'bill_depth_mm',
      y = 'species'
  )
  .add(
    so.Dots(),
    # jitter the points
    # to reduce overplotting
    so.Jitter()
  )
  .layout(size=(5, 6))
)

Stripchart

(
  so.Plot(
      data = penguins,
      x = 'bill_depth_mm',
      y = 'species'
  )
  .add(
    so.Dots(),
    so.Jitter()
  )
  # Show the median
  .add(
    so.Dash(),
    so.Agg('median')
  )
  .layout(size=(5, 6))
)

Stripchart

(
  so.Plot(
      data = penguins,
      x = 'bill_depth_mm',
      y = 'species',
      # stratify by sex
      color = 'sex'
  )
  .add(
    so.Dots(),
    so.Jitter(),
    so.Dodge()
  )
  .add(
    so.Dash(),
    so.Agg('median'),
    so.Dodge()
  )
  .layout(size=(5, 6))
)

Two variables

Let’s use the diamonds dataset.

from seaborn import load_dataset
diamonds = load_dataset('diamonds')

Scatterplot

(
  so.Plot(
    data = diamonds,
    x = 'carat',
    y = 'price'
  )
  .add(
    so.Dot(alpha = .05)
  )
  .layout(size=(5, 6))
)

Scatterplot

(
  so.Plot(
    # remove points with
    # unlikely values
    data = diamonds.query(
      'carat < 3', 
      ),
    x = 'carat',
    y = 'price'
  )
  .add(
    so.Dot(alpha = .05)
  )
  .layout(size=(5, 6))
)

Scatterplot

(
  so.Plot(
     data = diamonds.query(
      'carat < 3', 
      ),
    x = 'carat',
    y = 'price'
  )
  .add(
    so.Dot(alpha=.05)
  )
  # add an approximative
  # data summary
  .add(
    so.Line(
      color = '#000000'
      ),
    so.PolyFit(
      order = 4
      )
  )
  .layout(size=(5, 6))
)

Scatterplot

(
  so.Plot(
   data = diamonds.query(
      'carat < 3', 
      ),
    x = 'carat',
    y = 'price'
  )
  .add(
    so.Dot(
      alpha=0.1
    ),
    # map and additional variable
    color = 'clarity'
  )
  .add(
    so.Line(
      color = '#000000'
      ),
    so.PolyFit(
      order = 4
      )
  )
  .scale(color="flare")
  .layout(size=(5, 6))
)

Scatterplot

(
  so.Plot(
   data = diamonds.query(
      'carat < 3', 
      ),
    x = 'carat',
    y = 'price'
  )
  .add(
    so.Dot(
      alpha=0.1
    ),
    color = 'clarity'
  )
  .add(
    so.Line(
      color = '#000000'
      ),
    so.PolyFit(
      order = 4
      )
  )
  # use facets to stratify
  # by cut
  .facet(
    row="cut",
    wrap = 3
  )
  .scale(color="flare")
  .layout(size=(5, 6))
)

Exercise

Describe the faceted scatterplot, that you can find on the previous page in terms of the grammar of graphics.

Which step do you have to define explicitly? Which step are defined implicitly by Seaborn?

Residual Plot

We can transform the data with statistical models to highlight the patterns that are hidden in them.

We can also use visual exploration of the output of statistical model, to see if the model fit the data properly.

# import a framework 
# for statistical inference
import statsmodels.formula.api as smf

Residual Plot

(
  so.Plot(
    data=penguins,
    x='bill_length_mm',
    y='bill_depth_mm'
  )
  .add(
    so.Dots(),
    color = 'sex'
  )
  .add(
    so.Line(),
    so.PolyFit(order = 1)
  )
  .facet(
    row = 'species',
    wrap = 2
    )
  .layout(size=(5, 6))
)

Residual Plot

# We can define a model
# before plotting
fit = smf.ols(
    'bill_depth_mm ~ bill_length_mm + species + bill_length_mm*species',
      data = penguins
).fit()

penguins_fit = (
  penguins
    .assign(
      fitted_bill_depth = fit.fittedvalues,
      residuals = lambda x: x.bill_depth_mm - x.fitted_bill_depth
    )
  )

Residual Plot

(
  so.Plot(
    data=penguins_fit,
    x='bill_length_mm',
    y='residuals'
  )
  .add(
    so.Dots(),
    color = 'sex'
  )  
  .add(
    so.Line(),
    so.PolyFit(order = 1)
  )
  .facet(
    row = 'species',
    wrap = 2
    )
  .layout(size=(5, 6))
)

Seaborn Standard Interface

Some visual model are not yet implemented in the object interface.

To make heatmaps or pair plots we have to use Seaborn’s standard API.

import seaborn as sns

Heatmap

Heatmap are useful for exploring big datasets, where many observation are similar to one another. To avoid overplotting, on those datasets, you can turn scatterplots into heatmaps.

In a heatmap we map a quantitative value to a color. Heatmaps can be used both with categorical x and y axes, or binned continuous axes.

Heatmap

# set the right size for the figure
sns.set(rc={
  'figure.figsize':(5,6)
  })

sns.histplot(
  data=diamonds,
  x='carat',
  y='price',
  cbar=True
);

Heatmap

sns.set(rc={
  'figure.figsize':(5,6)
  })

sns.histplot(
  # remove outliers
  data=(diamonds
    .query('carat <= 3')),
  x='carat',
  y='price',
  cbar=True, 
);

Pairs

If you want to get a quick overview of

Pairs

sns.set(
  font_scale=2
  )

sns.pairplot(
  data = diamonds
)

# Figure on the next slide

Pairs

sns.set(
  font_scale=2
  )

sns.pairplot(
  data = diamonds,
  # improve pairs look
  kind='hist',
  plot_kws=dict(bins = 10),
  diag_kws=dict(bins = 10),
  corner= True
)

# Figure on the next slide

Exercise - Detective Hat

A very bad criminal organization, have hidden a message for one of his hitmen in this file.

You have intercepted the file, but you must decode the message. You don’t have much time to stop a catastrophe. Work fast!

More details in the next page →

Exercise - Detective Hat

The aim of this exercise is to let you practice making many exploratory graphs, quickly.

Visualize the content of this dataset in different ways, until you find the secret message. Be fast, you have a lot of data to explore.

Be essential. You, right here, right now, are the only person that needs to understand these graphs. Do not waste your time making the graphs nicer, change only what you need to change to understand them better.

Show the data. The message is often well hidden, if you summarize the data too much, they might get lost.