>Better Graphs - I

> How to use graphical variables effectively.

\ Otho Mantegazza _ Dataviz For Scientists _ Part 3.1

AUTHOR: Jacques Bertin

YEAR: 1967

BOOK: The Semiology of Graphics

Examples of planar and retinal variables.

Graphical Variables

When you draw a data visualization on a two dimensional screen, you map values from data variables to graphical variables:

Planar Variables: x, y.
Retinal Variables: colour hue, color value, shape, orientation, size, area, texture.

In terms of the grammar of graphics, the graphical variables are the aesthetics that you map your data to.

Planar Variables

The x and y planar variables are largely perceived as a quantitative linear space. And they are great for representing quantitative and qualitative data.

The planar variables are the x and y coordinates on your planar screen, which readily translate to the x and y positions in your graph.

(f you use cartesian coordinates or to a transformed version of them if you use more elaborate coordinate systems)

Planar Variables

You can place qualitative variables both in the x and y variable. In this case the x might stop being the independent variable, and the y stops being the response.

Often there is no clear hypothetical relationship of cause effect between two variables, in that case you can invert the x and the y freely.

Retinal Variables

The retinal variables are all those other graphical variables, that cannot be interpreted directly as a position on the screen’s x and y.

The most important retinal variables are colour hue, color value, shape, orientation, size, area, texture.

Each one has its own peculiarities and its own rules about how it can be used best.

Colour

Colours can be mapped both to categorical and continuous variables.

With some caveats, colours are a multidimensional space:

Not perceived in a fully linear way.
Perceived in different ways by different people.

Color Spaces

If you find it hard to plan colours, don’t worry, colours are complex for everyone.

On a screen, colours are defined as three hexadecimal strings, that combine 256 levels of red, green and blue.

Color Spaces

Colors are perceived non linearly, and the model of how colors are perceived by people gets constantly updated. The most used model is CIECAM02.

What should interest you is:

HUE: what we call colour.
LIGHTNESS: how close to black?
SATURATION: how close to gray?

Categorical Variables

You can use colours to encode for categorical variables.

If the categorical variable is not ordered you should modulate the colors hue, with also small changes to saturation and lightness.

Always check if your colour palette is accessible by colour blind people.

Continous Variables

Using colours to encode continuous variables is somehow easier.

Check that your colour palette is colour-blind friendly.
Check that lightness and saturation change consistently with the data.
You can modulate the hue resembling colours of natural phenomena; such as clouds, sunsets, rivers…

Colour Perception

If you check that your palette are friendly to colour blind people, you can also detect unwanted patterns perception patterns.

You can use Firefox accessibility tools to simulate colour blind perception.

Good Colour Palettes

Categorical

In categorical palettes, you should be able to distinguish colours, even in small plotting characters.

Check if colours are different, even when plotted in black and white. Otherwise consider using and additional graphical variable to encode the information.

Quantitative

Continuous colour palette should be perceived linearly and univocally throughout the spectra. Check that this is true also for color blindness and black and white.

You should handle ordered categorical variables as if they were quantitative, not categorical.

Good Colour Palettes

Remember, data visualization is processed intuitively by the readers.

Colours have meaning. Don’t represent ice coverage in red, don’t represent the warming of the ocean in blue, ask yourself if your colour palette is appropriate for the topic your data are about.

Colours in Python

A presentation of the cubehelix color palette.

There are plenty of colour palette available in Python through Seaborn or other tools, so it’s unlikely that you’ll have to design your own.

It’s more likely that you’ll have to be able to choose a good one.

The Seaborn’s tutorial on how to choose a colour palette, is a great place to start.

Look also at Matplotlib’s palette guide, and at the blog post presenting the viridis and batlow perceptually uniform color maps.

WARMING STRIPES

Year: 2018

Author: Ed Hawkins

Source: Climate Lab Book

Researcher Ed Hawkins was searching for an intuitive way to represent global warming.

He designed the concept of climate stripes, removing everything that’s not a direct mapping of data from an heatmap of average temperature.

On the x axis, each stripe is a year, from the year when the first data recording is available reliably.

The colour represents the relative change in temperature.

The y axis is not mapped to the data.

Exercise

Check the climate stripes with a tool to simulate colour blind vision. Are the climate stripes colour blind friendly? Motivate your anwser.

Colours in Seaborn

Seaborns has many function to explore, set and generate color palettes:

set_palette() to set a color palette.
color_palette() to explore the existing color palettes, or to build new ones.

Colors in Seaborn

# All functions to manipulate data
import pandas as pd

# Seaborn Object Interface
import seaborn.objects as so
# and other Seaborn functions
import seaborn as sns

# The palmer penguins dataset;
# that we are going to use for practice
from palmerpenguins import load_penguins
penguins = load_penguins()

# The diamonds dataset
from seaborn import load_dataset
diamonds = load_dataset('diamonds')

Colours - Continuous

# set the right size for the figure
sns.set(
  rc={
    'figure.figsize':(4.5,6)
  })

sns.histplot(
  data=(diamonds
    .query('carat <= 3')),
  x='carat',
  y='price',
  cbar=True
);

Colours - Continuous

# set the right size for the figure
sns.set(
  rc={
    'figure.figsize':(4.5,6)
  })

sns.histplot(
  data=(diamonds
    .query('carat <= 3')),
  x='carat',
  y='price',
  cbar=True,
  # change color palette
  cmap='dark:#0A9_r'
);

Colours - Categorical

(
  so.Plot(
    data = penguins,
    x = 'bill_length_mm',
    y = 'bill_depth_mm',
    color = 'species'
 )
 .add(
  so.Dot()
 )
 .layout(size=(5, 6))
)

Colours - Categorical

(
  so.Plot(
    data = penguins,
    x = 'bill_length_mm',
    y = 'bill_depth_mm',
    color = 'species'
 )
 .add(
  so.Dot()
 )
 # Set the color scale manually
 .scale(color = [
  '#007a49',
  '#fdcb55',
  '#3f98d4'
 ])
 .layout(size=(5, 6))
)

Shape

You can use marker shapes to encode categorical information.

Shapes are simple and easy to understand.

They can’t be used to represent quantitative data. They could be used for ordered categorical data, but I’d advise against this practice.

Shape

(
  so.Plot(
    data = penguins,
    x = 'bill_length_mm',
    y = 'bill_depth_mm',
    marker = 'species'
 )
 .add(
  so.Dot()
 )
 .layout(size=(5, 6))
)

Shape

# set markers manually
(
  so.Plot(
    data = penguins,
    x = 'bill_length_mm',
    y = 'bill_depth_mm',
    marker = 'species'
 )
 .add(
  so.Dot()
 )
 .scale(
  marker = [
    'o',
    '^',
    's'
  ]
 )
 .layout(size=(5, 6))
)

Shape

# use empty shapes  
(
  so.Plot(
    data = penguins,
    x = 'bill_length_mm',
    y = 'bill_depth_mm',
    marker = 'species'
 )
 .add(
  so.Dot(
    fill = False,
    stroke = 1.5
    )
 )
 .scale(
  marker = [
    'o',
    '^',
    's'
  ]
 )
 .layout(size=(5, 6))
)

Shape

# double encode species
# with shape and colour
(
  so.Plot(
    data = penguins,
    x = 'bill_length_mm',
    y = 'bill_depth_mm',
    marker = 'species',
    color = 'species'
 )
 .add(
  so.Dot(
    stroke = 1.5,
    fill = False
    ),
 )
 .scale(
  marker = [
    'o',
    '^',
    's'
  ],
  color = [
  '#007a49',
  '#fdcb55',
  '#3f98d4'
 ]
 )
 .layout(size=(5, 6))
)

Size and Area

Linear size and area are often mapped to quantitative values, such as absolute measurements and percentages.

Size is often used together with colour to display a quantitative value stratified by a qualitative factor.

Size - Bar Chart

# color and clarity of diamonds
# are ordered categorical 
# variables
(
  so.Plot(
    data = diamonds,
    y = 'color',
    color = 'clarity'
  )
  .add(
    so.Bars(),
    so.Count(),
    so.Stack()
  )
  .layout(size=(5, 6))
)

Size - Bar Chart

# map the ordered variable
# "clarity" to an ordered
# color scale
(
  so.Plot(
    data = diamonds,
    y = 'color',
    color = 'clarity'
  )
  .add(
    so.Bars(),
    so.Count(),
    so.Stack()
  )
  .scale(color = 'flare')
  .layout(size=(5, 6))
)

Size - Bar Chart

# Map additional variables
# to facets
(
  so.Plot(
    data = diamonds,
    y = 'color',
    color = 'clarity'
  )
  .add(
    so.Bars(),
    so.Count(),
    so.Stack()
  )
  .scale(
    color = 'flare'
    )
  .facet(
    row = 'cut',
    wrap = 3
  )
  .layout(size=(5, 6))
)

Size - Bar Chart

When we use a barolot, we map the data to the length of the bars, not to the area.

The area of the bars is directly proportional to the data values.

Though, conceptually, if we would be mapping the data to the bars’ area instead of their length, we could no represent negative values.

Area - Pie Chart

To make a pie chart in Python we have to use, instead of Seaborn, the old interface of Pyplot.

import matplotlib.pyplot as plt

Area - Pie Chart

# count diamonds by clarity
counts = (diamonds
    .query('color == "J"')
    .groupby('clarity')
    .size())

# transform to percent
perc = (counts.values/
        counts.values.sum())
labels = counts.index

# define a color palette of the 
# same length of the data
colors = sns.color_palette(
  'viridis', 
  n_colors = perc.size
  )

# plot the py chart
plt.pie(
  perc,
  labels = labels, 
  colors = colors, 
  autopct='%.0f%%'
  );

Area - Pie Chart

Pie charts get a bad reputation for not being a nuanced analytical graphs.

But pie charts are effective at representing percentages, and they outscore barcharts when the number of slices is high.

Can you describe the pie chart from the previous page in terms of the grammar of graphics?

Area of Circles and Shapes

You can encode a continuous variable in the area of the circles in a scatterplot, or in the area the plotting character of your choice.

Our perception is not as good at comparing areas, so use this retinal variable with parsimony.

You can map data to the radius of circles or to the area directly. Mapping data to the radius might be perceptually better, although neither choice is optimal.

Radius of Circles - Data

planets = sns.load_dataset('planets')

Radius of Circles

# by default seaborn maps
# the magnitude of the
# data to the diameter of
# the circles
(
  so.Plot(
    data = planets,
    x = 'distance',
    y = 'orbital_period',
    pointsize = 'mass'
  )
  .add(
    so.Dots()
  )
  .scale(
    x = 'log',
    y = 'log'
  )
  .layout(size=(5, 6))
)

Area As Uncertainty

A light coloured area is often used to encode for a measurement of uncertainty or dispersion of the data.

For example, the confidence interval of a regression model, or the prediction of how a natural phenomena will evolve in space and time.

How to represent uncertainty intuitively, is an active field of research.

Area As Uncertainty

sns.lmplot(
  data = penguins,
  x = 'body_mass_g',
  y = 'flipper_length_mm'
)

Area As Uncertainty

# a spurious correlation
# with higher uncertainty
sns.lmplot(
  data = penguins,
  x = 'bill_length_mm',
  y = 'bill_depth_mm'
)

Area As Uncertainty

# That improves if we
# stratify the data 
# by species
sns.lmplot(
  data = penguins,
  x = 'bill_length_mm',
  y = 'bill_depth_mm',
  hue = 'species'
)

Area As Uncertainty

HURRICANE PROBABILITY CONE

Year: Current

Website: NOAA - Hurricane Center

The cone shows the forecasted path of the hurricane, from its current position. The size of the cone grows with the uncertainty in the path’s forecast

There’s evidence that we intuitively associate the size of the cone to the forecasted size of the hurricane, not to the uncertainty of its path.

Area As Uncertainty

How to represent uncertainty is an active area of research.

You can check Dr. Lace Padilla’s Work, for an overview of the best practices and the latest findings.

Orientation

The orientation of plotting characters is used to show the vectorial orientation of dimensions such as wind or other types of movements on a map.

The orientation of plotting characters is often used combined with their length, to show intensity and direction.

Orientation

In Seaborn or Pyplot there is no way to control the orientation of a plotting character directly. So you’ll have to use an arrow, calculating their start and end points from data.

Orientation - Data

from pyreadr import read_r
from math import sin, cos, pi
wind = read_r(
  'data/wind.data.RData'
  )

wind = wind['wind.data']

wind = (
  wind
   .query('lon > -5')
  )

Orientation

ax = plt.figure(
  figsize=(10,12)
  )

plt.scatter(
  wind.lon,
  wind.lat,
  s = 1
  )

arrow_scale = .15

for index, row in wind.iterrows():
    plt.arrow(
      x = row['lon'],
      y = row['lat'],
      dx = cos(
        row['dir']*pi/180
      )*arrow_scale,
      dy = sin(
        row['dir']*pi/180
      )*arrow_scale,
      color = '#000',
      head_width=0.1,
      head_length=0.1
      )

Orientation and Size

ax = plt.figure(
  figsize=(10,12)
  )

plt.scatter(
  wind.lon,
  wind.lat,
  s = 1
  )

arrow_scale = .05

for index, row in wind.iterrows():
    plt.arrow(
      x = row['lon'],
      y = row['lat'],
      dx = row['speed']*cos(
        row['dir']*pi/180
      )*arrow_scale,
      dy = row['speed']*sin(
        row['dir']*pi/180
      )*arrow_scale,
      color = '#000',
      head_width=0.1,
      head_length=0.1
      )

Orientation and Size

Texture

The texture is often used to distinguish various types of lines.

It can also be used to fill shapes in a semi-quantitative way. This aspect fell in disuse, but it can be a good choice for printer-friendly visualization.

Unfortunately python does not have good support for texture filling areas; the example on the side is from texture.js.

Texture - Line Type

healthexp = sns.load_dataset('healthexp')

(
  so.Plot(
    data = healthexp,
    x="Year",
    y="Life_Expectancy",
    linestyle="Country"
  )
  .add(
    so.Line()
  )
)

Exercise #1

On 2021-02-16, TidyTuesday published a challenge to remember the work of W.E.B Du Bois.

On this Github Page you can find the data for 10 of the renown Du Bois’ charts

Redesign or just redraw one or more of Du Bois’ graphs. Feel free to modify them as much as you want, but explain your design choices and how they improve the original graph.

Exercise #2

On 2023-08-22, TidyTuesday published a challenge on UNHCR’s migration data.

You can find data and further information in this github repository.

Explore those data, find a message that you would like to communicate, design a graph to convey that message. Explain your stylistical choices, how do they help you convey your message.