Tidying data

4 minutes read

Tidy data

Tidying data is the practice of turning untidy data into tidy data, and is a necessary step in the preparation of data visualisation based on the Grammar of Graphics. Below are some of the more common operations for tidying data.

Separating data

This table contains 2 variables in the rate column:

country	year	rate
Belgium	2020	14.444/11.522.440
Belgium	2021	14.791/11.554.767
Bulgaria	2020	23.128/6.916.548
Bulgaria	2021	22.305/6.951.482
Czechia	2020	8.610/10.494.836
Czechia	2021	8.990/10.693.939

To make the table tidy, the values in the rate column need to be split into 2 new columns. This operation is often called separating or spreading a column into new columns. In order to separate or spread a column, you need to provide the character (or characters) that are used as the separator between the values in the column. In this case, the separator is the forward slash “/”.

After separating the values in a column, you need to give the newly created columns new column names (which are the variable names of the created tidy data set).

Making data wider

The table below is untidy, because the count column contains the values for 2 variables (cases and population).

country	year	type	count
Belgium	2020	cases	14.444
Belgium	2020	population	11.522.440
Belgium	2021	cases	14.791
Belgium	2021	population	11.554.767
Bulgaria	2020	cases	23.128
Bulgaria	2020	population	6.916.548
Bulgaria	2021	cases	22.305
Bulgaria	2021	population	6.951.482
Czechia	2020	cases	8.610
Czechia	2020	population	10.494.836
Czechia	2021	cases	8.990
Czechia	2021	population	10.693.939

To make the data tidy, the values in the count column need to be organised in two columns, based on the values in the type column. This operation is called making data wider, or pivot data wider.

Making data longer

Consider the following untidy data table:

country	2020	2021
Belgium	14.444	14.791
Bulgaria	23.128	22.305
Czechia	8.610	8.990

This table is untidy, because the 2020 and 2021 column names are values of the year variable, so this variable is spread over multiple columns. The tidy version of this table looks like this:

country	year	cases
Belgium	2020	14.444
Bulgaria	2020	23.128
Czechia	2020	8.610
Belgium	2021	14.791
Bulgaria	2021	22.305
Czechia	2021	8.990

The operation of taking the names of multiple columns and “collapsing” them into a single column (in this case this is the year column) and putting the values of the original columns into another column (this is the cases column here) is called making data longer, pivotting data longer or unpivotting data).

This operation usually results in tables that are longer (with more rows, and less columns) than the original table. You can read more about wide and long data in Wide versus long data.

Transposing data

Below is another example of an untidy data table:

country	Belgium	Belgium	Bulgaria	Bulgaria	Czechia	Czechia
year	2020	2021	2020	2021	2020	2021
cases	14.444	14.791	23.128	22.305	8.610	8.990
population	11.522.440	11.554.767	6.916.548	6.951.482	10.494.836	10.693.939

This table is untidy, because it has the variables as rows and the observations in the columns, instead of vice versa. The operation to make this data table tidy is called transposing a table. Transposing a table will make rows out of the columns, and columns out of the rows.

Intro to tidy data

Advantages of tidy data

Grammar of Graphics in practice: Tableau

Grammar of Graphics in practice: Vega-Lite

Grammar of Graphics in practice: ggplot2

Grammar of Graphics in practice: Observable Plot