🌱 Tidy Data Research

Why

Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset.

Data that is not tidy invites errors and requires extra computation for aggregate functions (sum, avg, etc.).

Tidy data

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

Said the same but differently:

variables are columns (each contains all values that measure the same attribute across units)
observations are rows (each contains all values measured on the same unit across attributes)
values are individual cells where variables and observations meet; each value belongs to one variable and one observation

there should be one type of observational unit per table

For a given dataset, it is usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general.

A general rule of thumb:

it is easier to describe functional relationships between variables (columns) than between rows
it is easier to make comparisons between groups of observations (rows) than between groups of columns

Not tidy data

Can be useful for data entry as it can reduce duplicate entries

don’t combine more than one variable in a single column
be wary of type columns
don’t split a single observation across rows
observations are from a specific point in time; don’t combine them
Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.

Tidy datasets are all alike, but every messy dataset is messy in its own way.

Most of these issues can be solved by one of the following:

melting: turning columns into rows (sometimes a hidden categorical variable exists in the columns of untidy data)
string splitting
casting.
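
String splitting is the least obvious of the three, so here is a minimal sketch of it. It assumes a made-up column, `sex_age`, that packs two variables into one string separated by an underscore. The column name, the encoding, the sample values, and the `split_column` helper are all hypothetical and only serve the illustration.

```python
# Hypothetical records: the "sex_age" column stores two variables in one string.
rows = [
    {"id": 1, "sex_age": "m_25"},
    {"id": 2, "sex_age": "f_31"},
]


def split_column(row, column, names, sep="_"):
    """Return a copy of `row` with `column` replaced by one key per name."""
    parts = row[column].split(sep)
    out = {k: v for k, v in row.items() if k != column}
    out.update(zip(names, parts))  # split values are still strings
    return out


tidy = [split_column(r, "sex_age", ["sex", "age"]) for r in rows]
```

After the split each variable has its own column; converting `age` to a number would be a separate transform step.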

Examples

Non tidy

|              | treatment a | treatment b |
| ------------ | ----------- | ----------- |
| john smith   | –           | 2           |
| jane doe     | 16          | 11          |
| mary johnson | 3           | 1           |

Non tidy transposed

|             | john smith | jane doe | mary johnson |
| ----------- | ---------- | -------- | ------------ |
| treatment a | –          | 16       | 3            |
| treatment b | 2          | 11       | 1            |

tidy

| person       | treatment | result |
| ------------ | --------- | ------ |
| john smith   | a         | –      |
| jane doe     | a         | 16     |
| mary johnson | a         | 3      |
| john smith   | b         | 2      |
| jane doe     | b         | 11     |
| mary johnson | b         | 1      |
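
To make the "extra computation" point concrete, the sketch below rebuilds the tidy table from the non-tidy one in plain Python and then averages the results per treatment. It assumes the missing value (–) can be represented as `None`; the `mean_result` helper is a hypothetical name for this illustration.

```python
from statistics import mean

# The non-tidy table from above; None stands in for the missing value.
wide = [
    {"person": "john smith", "treatment a": None, "treatment b": 2},
    {"person": "jane doe", "treatment a": 16, "treatment b": 11},
    {"person": "mary johnson", "treatment a": 3, "treatment b": 1},
]

# The column headers hide a variable (the treatment), so each header becomes a value.
tidy = [
    {"person": row["person"], "treatment": col.split()[-1], "result": row[col]}
    for col in ("treatment a", "treatment b")
    for row in wide
]


def mean_result(treatment):
    """Average the observed results of one treatment, skipping missing values."""
    return mean(
        r["result"]
        for r in tidy
        if r["treatment"] == treatment and r["result"] is not None
    )


print(mean_result("a"))  # 9.5
```

With the tidy layout the aggregate is one filter over one column. With the wide layout the code would need to know which headers are treatments before it could compute anything.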

Melting

Raw data

| row | a | b | c |
| --- | - | - | - |
| A   | 1 | 4 | 7 |
| B   | 2 | 5 | 8 |
| C   | 3 | 6 | 9 |

Molten data

| row | columns | value |
| --- | ------- | ----- |
| A   | a       | 1     |
| B   | a       | 2     |
| C   | a       | 3     |
| A   | b       | 4     |
| B   | b       | 5     |
| C   | b       | 6     |
| A   | c       | 7     |
| B   | c       | 8     |
| C   | c       | 9     |
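
The step from raw to molten data can be sketched as a small function. This is a plain-Python illustration, not the implementation of any particular library; the `melt` signature and its parameter names are assumptions made for this example.

```python
def melt(rows, id_vars, var_name="columns", value_name="value"):
    """Turn every non-id column into its own row holding a (name, value) pair."""
    value_vars = [c for c in rows[0] if c not in id_vars]
    return [
        {**{k: row[k] for k in id_vars}, var_name: col, value_name: row[col]}
        for col in value_vars
        for row in rows
    ]


raw = [
    {"row": "A", "a": 1, "b": 4, "c": 7},
    {"row": "B", "a": 2, "b": 5, "c": 8},
    {"row": "C", "a": 3, "b": 6, "c": 9},
]

molten = melt(raw, id_vars=["row"])
```

Three rows with three value columns become nine rows, one per original cell, in the same order as the molten table above.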

Casting is the opposite of melting: it turns rows back into columns.
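
A minimal sketch of casting, again in plain Python and under the assumption that each (row, column) pair appears only once in the molten data. The `cast` function and its parameters are hypothetical names for this illustration.

```python
def cast(rows, id_var, var_name, value_name):
    """Spread name/value pairs back into one column per distinct name."""
    wide = {}  # dicts keep insertion order, so the original row order survives
    for r in rows:
        record = wide.setdefault(r[id_var], {id_var: r[id_var]})
        record[r[var_name]] = r[value_name]
    return list(wide.values())


# The molten table from the melting example, one dict per row.
molten = [
    {"row": r, "columns": c, "value": v}
    for c, values in {"a": (1, 2, 3), "b": (4, 5, 6), "c": (7, 8, 9)}.items()
    for r, v in zip("ABC", values)
]

raw = cast(molten, id_var="row", var_name="columns", value_name="value")
```

If the same pair occurred more than once, the later value would silently overwrite the earlier one, so real data would need an aggregation rule at that point.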

Manipulation

Filter: subsetting or removing observations based on some condition.
Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).
Sort: changing the order of observations.
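
The four operations can be sketched on the tidy treatment table from the examples. This is an illustration in plain Python, assuming the missing result is stored as `None`; the variable names are chosen for this example only.

```python
from collections import defaultdict
from math import log

tidy = [
    {"person": "john smith", "treatment": "a", "result": None},
    {"person": "jane doe", "treatment": "a", "result": 16},
    {"person": "mary johnson", "treatment": "a", "result": 3},
    {"person": "john smith", "treatment": "b", "result": 2},
    {"person": "jane doe", "treatment": "b", "result": 11},
    {"person": "mary johnson", "treatment": "b", "result": 1},
]

# Filter: remove observations with a missing result.
observed = [r for r in tidy if r["result"] is not None]

# Transform: add a variable derived from an existing one (log-transformation).
transformed = [{**r, "log_result": log(r["result"])} for r in observed]

# Aggregate: collapse the results of each treatment into a single sum.
totals = defaultdict(int)
for r in observed:
    totals[r["treatment"]] += r["result"]

# Sort: order observations by result, largest first.
ranked = sorted(observed, key=lambda r: r["result"], reverse=True)

print(dict(totals))  # {'a': 19, 'b': 14}
```

Each step reads one column by name and leaves the table shape unchanged, which is what lets the steps be chained in any order.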
