> It is important to understand what you can do before you learn to measure how well you seem to have done it. (John Tukey)
Otho Mantegazza, Dataviz for Scientists, Part 2.3

BOXPLOT
Year: 1977
Author: John Tukey
Book: Exploratory Data Analysis
The boxplot is one of the main visual models used to explore data. It shows summary quantile statistics and outliers for a stratified set of data.
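As a rough sketch of what a boxplot summarizes, the snippet below computes the quartiles, whisker ends and outliers for each group of a stratified dataset. The function name `boxplot_summary`, the group names and the numbers are made up for illustration. The 1.5 × IQR rule for outliers is the usual Tukey convention, and the quartile method is one of several in common use, so plotting libraries may give slightly different values.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Return the numbers a Tukey-style boxplot draws for one group."""
    data = sorted(values)
    # Quartiles; the "inclusive" method is one of several conventions.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Tukey's fences: points beyond 1.5 * IQR from the box count as outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value inside the fences
        "whisker_high": inside[-1],  # largest value inside the fences
        "outliers": outliers,
    }

# Hypothetical measurements, stratified into two groups.
groups = {
    "control": [2, 4, 4, 5, 6, 7, 8, 9, 30],
    "treated": [10, 11, 12, 13, 14, 15, 16, 17, 18],
}
summaries = {name: boxplot_summary(vals) for name, vals in groups.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Each group gets its own box, which is what makes the boxplot convenient for comparing strata side by side.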
Exploratory data analysis, as Tukey describes it, is investigative work on data.
When you explore data, you leave no stone unturned. Relying on graphical methods, robust summary statistics and dimension reduction, you quickly gain insight into all the possible patterns, correlations and cause-effect relationships in the data. In technical terms, you generate hypotheses.
After you have done your investigative work, you should switch from inspector to judge and test your hypotheses with statistical inference.
But there’s no point in testing the statistical relevance of poorly formulated hypotheses. Exploratory data analysis is fundamental in modern statistics because it allows you to formulate the best possible hypotheses.
When you explore data, you have to turn them into insightful formats. Most of the time this involves turning data into graphical and visual shapes.

STEM AND LEAF
Year: 1977
Author: John Tukey
Book: Exploratory Data Analysis
An intuitive graphical representation of car prices.
A big part of Tukey’s book “Exploratory Data Analysis” relies on graphical representations of data that you can draw yourself with pen and paper. Luckily, today you can use powerful software designed for data exploration, such as the Tidyverse.
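A stem-and-leaf display is simple enough to sketch in a few lines of code. The example below is a minimal version that assumes non-negative integers and splits each value into tens (the stem) and units (the leaf). The function name `stem_and_leaf` and the prices are hypothetical, not the values from the book.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display for non-negative integers."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # tens become the stem, units the leaf
        leaves[stem].append(str(leaf))
    lines = []
    # Keep empty stems so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem:>2} | {' '.join(leaves[stem])}".rstrip())
    return lines

# Hypothetical prices in hundreds of dollars.
prices = [41, 43, 47, 52, 52, 58, 61, 89]
display = stem_and_leaf(prices)
print("\n".join(display))
```

Read sideways, the display works like a histogram that still shows every individual value.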
If you want to represent data intuitively, first you have to learn terms that allow you to describe the structure of a dataset semantically.
The Tidy Data Theory lets us do just that.

A common framework to organize data semantically: if you organize data based on their structure, it’s easier for you to make sense of them, to realize what data you have and what’s missing.
If you organize data with a common framework, it’s also easier to share them with others.
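To make the idea concrete, here is a small sketch that reshapes a "wide" table, with one column per measurement day, into a tidy one where every row is a single observation and every column a single variable. The helper `to_tidy`, the column names and the values are all hypothetical. Dataframe libraries offer equivalent reshaping functions, and plain dictionaries are used here only to keep the example self-contained.

```python
def to_tidy(wide_rows, id_column, variable_name, value_name):
    """Turn one row per subject into one row per observation."""
    tidy = []
    for row in wide_rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row
            tidy.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy

# Hypothetical plant heights, with one measurement missing.
wide = [
    {"plant": "A", "day_1": 2.0, "day_2": 3.5},
    {"plant": "B", "day_1": 1.5, "day_2": None},
]
tidy = to_tidy(wide, "plant", "day", "height_cm")

# In the tidy form, missing observations are explicit rows that are easy to find.
missing = [row for row in tidy if row["height_cm"] is None]
for row in tidy:
    print(row)
```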
Visualize
Summarize
Stratify
Transform
Describe a histogram in terms of the grammar of graphics.
Which steps do you have to define explicitly? Which steps are defined implicitly by Seaborn?
Let’s use the diamonds dataset.
Describe the faceted scatterplot that you can find on the previous page in terms of the grammar of graphics.
Which steps do you have to define explicitly? Which steps are defined implicitly by Seaborn?
We can transform the data with statistical models to highlight the patterns that are hidden in them.
We can also visually explore the output of a statistical model to see whether the model fits the data properly.
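One common way to check a model visually is to look at its residuals. The sketch below fits a straight line by ordinary least squares to made-up data that actually follow a curve, then computes the residuals. The helper `fit_line` and the data are assumptions for illustration, and in practice you would plot the residuals rather than print them.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

# Hypothetical data that follow a quadratic curve, not a line.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]

slope, intercept = fit_line(xs, ys)
# Residual = observed value minus the value predicted by the model.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(residuals)
```

Here the residuals are positive at both ends and negative in the middle. A systematic shape like this suggests that the straight line misses part of the pattern, whereas residuals scattered around zero with no structure would suggest a reasonable fit.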
Heatmaps are useful for exploring big datasets in which many observations are similar to one another. To avoid overplotting on those datasets, you can turn scatterplots into heatmaps.
In a heatmap we map a quantitative value to a color. Heatmaps can be used either with categorical x and y axes or with binned continuous axes.
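The binning step that turns a scatterplot into a heatmap can be sketched as follows. Points are counted per rectangular bin, and each count is then mapped to a "color", represented here by a text character so that the example needs no plotting library. The function `bin_2d`, the bin widths, the shade characters and the points are all made up, and the rendering assumes non-negative coordinates.

```python
from collections import Counter

def bin_2d(points, x_width, y_width):
    """Count points per rectangular bin; keys are (x_bin, y_bin) indices."""
    return Counter((int(x // x_width), int(y // y_width)) for x, y in points)

# Hypothetical observations with continuous, non-negative x and y values.
points = [
    (0.1, 0.2), (0.4, 0.3),              # two points in bin (0, 0)
    (0.2, 1.9),                          # one point in bin (0, 1)
    (1.5, 0.1), (1.7, 0.4), (1.6, 0.2),  # three points in bin (1, 0)
    (1.9, 1.8),                          # one point in bin (1, 1)
]
counts = bin_2d(points, x_width=1.0, y_width=1.0)

# Map each count to a character: a stand-in for a color scale.
shades = " .:#"  # from empty to dense
x_bins = range(max(x for x, _ in counts) + 1)
y_bins = range(max(y for _, y in counts) + 1)
rows = []
for y_bin in reversed(y_bins):  # draw the highest y values at the top
    rows.append("".join(
        shades[min(counts[(x_bin, y_bin)], len(shades) - 1)] for x_bin in x_bins
    ))
print("\n".join(rows))
```

However many points fall into a bin, it is drawn only once, which is why the heatmap does not suffer from overplotting.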
If you want to get a quick overview of


A very bad criminal organization has hidden a message for one of its hitmen in this file.
You have intercepted the file, but you must decode the message. You don’t have much time to stop a catastrophe. Work fast!
More details on the next page →
The aim of this exercise is to let you practice making many exploratory graphs, quickly.
Visualize the content of this dataset in different ways, until you find the secret message. Be fast, you have a lot of data to explore.
Be essential. You, right here, right now, are the only person who needs to understand these graphs. Do not waste your time making the graphs nicer; change only what you need to change to understand them better.
Show the data. The message is well hidden; if you summarize the data too much, it might get lost.