> It is important to understand what you can do before you learn to measure how well you seem to have done it. (John Tukey)
Otho Mantegazza | Dataviz for Scientists | Part 2.3
Exploratory data analysis, as Tukey described it, is investigative work on data.
When you explore data, you leave no stone unturned. Relying on graphical methods, robust summary statistics and dimension reduction, you quickly gain insight into the patterns, correlations and possible cause-effect relationships in the data. In technical terms, you generate hypotheses.
After you have done your investigative work, you should switch from inspector to judge and test your hypotheses with statistical inference.
But there’s no point in testing the statistical relevance of poorly formulated hypotheses. Exploratory Data Analysis is fundamental in modern statistics, because it allows you to formulate the best possible hypotheses.
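To make "robust summary statistics" concrete, here is a minimal sketch that compares the mean with the median on a handful of made-up values containing one outlier. The numbers are purely illustrative and do not come from any dataset used in this course.

```python
import statistics

# Hypothetical measurements with one extreme outlier (made-up values).
values = [4.8, 5.0, 5.1, 5.2, 5.3, 50.0]

# The mean is pulled strongly towards the outlier...
mean = statistics.mean(values)

# ...while the median stays close to the bulk of the data.
median = statistics.median(values)

# Median absolute deviation (MAD): a robust counterpart of the standard deviation.
mad = statistics.median(abs(v - median) for v in values)

print(f"mean={mean:.2f} median={median:.2f} mad={mad:.2f}")
```

A large gap between mean and median, as here, is itself a hint that the data deserve a closer look.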
When you explore data, you have to turn them into insightful formats. Most of the time this involves turning data into graphical and visual shapes.
If you want to represent data intuitively, first you have to learn terms that allow you to describe the structure of a dataset semantically.
The Tidy Data Theory lets us do just that.
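As a small illustration of what "tidy" means in practice, the sketch below reshapes a hypothetical wide table into a tidy one with pandas. The column names (`sample`, `day_1`, `day_2`, `measurement`) are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical "wide" table: one row per sample, one column per time point.
wide = pd.DataFrame(
    {
        "sample": ["a", "b"],
        "day_1": [1.2, 0.9],
        "day_2": [1.5, 1.1],
    }
)

# Tidy ("long") form: every row is one observation, every column is one variable.
tidy = wide.melt(id_vars="sample", var_name="day", value_name="measurement")

print(tidy)
```

In the tidy table, each variable (`sample`, `day`, `measurement`) can be mapped directly to a visual property such as an axis or a color.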
Visualize
Summarize
Stratify
Transform
Describe a histogram in terms of the grammar of graphics.
Which steps do you have to define explicitly? Which steps are defined implicitly by Seaborn?
Let’s use the diamonds dataset.
Describe the faceted scatterplot that you can find on the previous page in terms of the grammar of graphics.
Which steps do you have to define explicitly? Which steps are defined implicitly by Seaborn?
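For reference, a faceted scatterplot can be sketched in Seaborn as below. The data are a synthetic stand-in that borrows three column names from the diamonds dataset (`carat`, `price`, `cut`). The choice of variables and of the faceting column is an assumption and may differ from the plot on the previous page.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-in for the diamonds data (values are made up).
# With network access, sns.load_dataset("diamonds") loads the real dataset.
carat = rng.uniform(0.2, 2.5, size=n)
df = pd.DataFrame(
    {
        "carat": carat,
        "price": 4000 * carat**1.7 * rng.lognormal(0, 0.2, size=n),
        "cut": rng.choice(["Fair", "Good", "Ideal"], size=n),
    }
)

# One scatterplot panel per level of "cut": this is the faceting (stratifying) step.
grid = sns.relplot(
    data=df,
    x="carat",
    y="price",
    col="cut",
    kind="scatter",
    height=3,
)
```

Compared with a single scatterplot, the only new explicit element is `col="cut"`; the layout of the panels and the shared axes are handled by Seaborn.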
We can transform the data with statistical models to highlight the patterns that are hidden in them.
We can also visually explore the output of a statistical model, to see whether the model fits the data properly.
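The two ideas above can be sketched together: transform the data, fit a simple model, then plot the residuals to check the fit. The example assumes synthetic data that follow a power law and uses an ordinary least-squares line on log-transformed values; both the data and the choice of model are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
n = 300

# Synthetic data following a noisy power law (made-up parameters).
carat = rng.uniform(0.2, 2.5, size=n)
price = 4000 * carat**1.7 * rng.lognormal(0, 0.2, size=n)

# Transform: a power law becomes a straight line on log-log axes.
log_carat = np.log(carat)
log_price = np.log(price)

# Fit a straight line by least squares; polyfit returns the slope first.
slope, intercept = np.polyfit(log_carat, log_price, deg=1)

fit = pd.DataFrame(
    {
        "log_carat": log_carat,
        "fitted": intercept + slope * log_carat,
    }
)
fit["residual"] = log_price - fit["fitted"]

# Residuals versus fitted values: a structureless cloud around zero
# suggests that the model captures the main pattern.
ax = sns.scatterplot(data=fit, x="fitted", y="residual")
ax.axhline(0, color="black", linewidth=1)
```

If the residuals show a curve or a funnel shape instead, the model is missing part of the pattern and is worth revisiting.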
Heatmaps are useful for exploring big datasets, where many observations are similar to one another. To avoid overplotting on those datasets, you can turn scatterplots into heatmaps.
In a heatmap we map a quantitative value to a color. Heatmaps can be used either with categorical x and y axes or with binned continuous axes.
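Turning a scatterplot into a heatmap with binned continuous axes can be sketched as follows. The data are synthetic, and the number of bins (30) is an arbitrary choice for illustration.

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(3)

# Synthetic, heavily overlapping points (made-up data).
x = rng.normal(size=5000)
y = 0.5 * x + rng.normal(scale=0.5, size=5000)

# Bin both continuous axes and count the observations in each cell.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=30)

# histogram2d puts x along the rows; transpose so x runs along the horizontal axis.
# The quantitative value (count per cell) is mapped to color.
ax = sns.heatmap(counts.T, cmap="viridis", cbar_kws={"label": "count"})

# heatmap draws row 0 at the top; flip the axis so y increases upward.
ax.invert_yaxis()
```

The tick labels here are bin indices, not data values; `x_edges` and `y_edges` hold the actual bin boundaries. Seaborn can also do the binning and the drawing in one call with `sns.histplot(x=x, y=y)`.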
If you want to get a quick overview of
A very bad criminal organization has hidden a message for one of its hitmen in this file.
You have intercepted the file, but you must decode the message. You don’t have much time to stop a catastrophe. Work fast!
More details on the next page →
The aim of this exercise is to let you practice making many exploratory graphs, quickly.
Visualize the content of this dataset in different ways, until you find the secret message. Be fast: you have a lot of data to explore.
Be essential. You, right here, right now, are the only person who needs to understand these graphs. Do not waste your time making the graphs nicer; change only what you need to change to understand them better.
Show the data. The message is well hidden; if you summarize the data too much, it might get lost.