🌱 Tidy Data Research
Videos
- https://www.youtube.com/watch?v=oQuupzfX9OQ
- https://www.youtube.com/watch?v=-gI5MN0jkOA
- https://www.youtube.com/watch?v=K-ss_ag2k9E
Why
Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset.
Leaving data untidy invites errors and requires extra computation before aggregate functions (sum, average, etc.) can be applied.
Tidy data
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Said the same but differently:
- Variables are columns: each contains all values that measure the same attribute across units.
- Observations are rows: each contains all values measured on the same unit across attributes.
- Values are individual cells, where variables and observations meet; each value belongs to a variable and an observation.
- There should be one type of observational unit per table.
For a given dataset, it is usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general.
A general rule of thumb:
- It is easier to describe functional relationships between variables (columns) than between rows.
- It is easier to make comparisons between groups of observations (rows) than between groups of columns.
Not tidy data
Untidy layouts can be useful for data entry because they reduce duplicate entries.
- Don't combine more than one variable in a single column.
- Be wary of type columns.
- Don't split observations across rows.
- Observations are from a specific point in time; don't combine them.

The most common problems in messy datasets:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
Tidy datasets are all alike, but every messy dataset is messy in its own way.
Most of these issues can be solved by one of the following:
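- melting: turning columns into rows (sometimes a hidden categorical variable exists in the column headers of untidy data)
- string splitting: separating a column that holds several variables into one column per variable
- casting: turning rows back into columns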
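String splitting is the least obvious of the three, so here is a minimal sketch in plain Python. The records, the `group` column, and the `m_0-14` encoding (sex, underscore, age range) are all hypothetical and only illustrate two variables packed into one column.

```python
# Hypothetical messy records: "group" stores two variables (sex and age
# range) in a single string, joined by an underscore.
messy = [
    {"site": "x", "group": "m_0-14", "cases": 3},
    {"site": "x", "group": "f_0-14", "cases": 5},
    {"site": "y", "group": "m_15-24", "cases": 2},
]


def split_group(record):
    """Return a copy of record with 'group' split into 'sex' and 'age'."""
    sex, age = record["group"].split("_", 1)
    tidy_record = {k: v for k, v in record.items() if k != "group"}
    tidy_record["sex"] = sex
    tidy_record["age"] = age
    return tidy_record


tidy = [split_group(r) for r in messy]
```

After the split, each variable has its own column, so filtering by `sex` or grouping by `age` no longer needs string handling.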
Examples
Not tidy
| | treatment a | treatment b |
| ------------ | ----------- | ----------- |
| john smith | – | 2 |
| jane doe | 16 | 11 |
| mary johnson | 3 | 1 |
Not tidy, transposed
| | john smith | jane doe | mary johnson |
| ----------- | ---------- | -------- | ------------ |
| treatment a | – | 16 | 3 |
| treatment b | 2 | 11 | 1 |
Tidy
| person | treatment | result |
| ------------ | --------- | ------ |
| john smith | a | – |
| jane doe | a | 16 |
| mary johnson | a | 3 |
| john smith | b | 2 |
| jane doe | b | 11 |
| mary johnson | b | 1 |
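To illustrate why the tidy layout helps with aggregates, the sketch below computes means over the tidy table using plain Python. The helper name `mean_by` and the use of `None` for the missing result are assumptions made for this example.

```python
from collections import defaultdict

# The tidy table above; None stands in for the missing result.
tidy = [
    {"person": "john smith", "treatment": "a", "result": None},
    {"person": "jane doe", "treatment": "a", "result": 16},
    {"person": "mary johnson", "treatment": "a", "result": 3},
    {"person": "john smith", "treatment": "b", "result": 2},
    {"person": "jane doe", "treatment": "b", "result": 11},
    {"person": "mary johnson", "treatment": "b", "result": 1},
]


def mean_by(rows, group_key, value_key):
    """Mean of value_key for each distinct group_key, skipping missing values."""
    groups = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            groups[row[group_key]].append(row[value_key])
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}


by_treatment = mean_by(tidy, "treatment", "result")
by_person = mean_by(tidy, "person", "result")
```

The same function answers both questions because every variable is a column. With either untidy layout, one of the two groupings would require walking across column headers instead.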
Melting
Raw data
| row | a | b | c |
| --- | - | - | - |
| A | 1 | 4 | 7 |
| B | 2 | 5 | 8 |
| C | 3 | 6 | 9 |
Molten data
| row | columns | value |
| --- | ------- | ----- |
| A | a | 1 |
| B | a | 2 |
| C | a | 3 |
| A | b | 4 |
| B | b | 5 |
| C | b | 6 |
| A | c | 7 |
| B | c | 8 |
| C | c | 9 |
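The step from the raw table to the molten table can be sketched in a few lines of plain Python. The function name `melt` and its parameters are chosen for this example and are not taken from a specific library.

```python
def melt(rows, id_key, var_name="columns", value_name="value"):
    """Turn every non-id column into its own (id, column, value) row."""
    value_keys = [key for key in rows[0] if key != id_key]
    return [
        {id_key: row[id_key], var_name: key, value_name: row[key]}
        for key in value_keys  # one pass per original column
        for row in rows
    ]


# The raw data table above, one dict per row.
raw = [
    {"row": "A", "a": 1, "b": 4, "c": 7},
    {"row": "B", "a": 2, "b": 5, "c": 8},
    {"row": "C", "a": 3, "b": 6, "c": 9},
]

molten = melt(raw, "row")
```

Three rows with three value columns become nine rows, in the same order as the molten table.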
Casting is the opposite of melting: it turns rows back into columns.
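As a rough sketch, casting the molten table back to its raw shape could look like this in plain Python. The function name `cast` and its parameters are assumptions for illustration only.

```python
def cast(rows, id_key, var_name, value_name):
    """Spread the values found in var_name back out into one column each."""
    wide = {}
    for row in rows:
        # Create the wide record the first time an id is seen.
        record = wide.setdefault(row[id_key], {id_key: row[id_key]})
        record[row[var_name]] = row[value_name]
    return list(wide.values())


# The molten table above, rebuilt column by column.
molten = [
    {"row": r, "columns": c, "value": v}
    for c, values in (("a", (1, 2, 3)), ("b", (4, 5, 6)), ("c", (7, 8, 9)))
    for r, v in zip("ABC", values)
]

wide = cast(molten, "row", "columns", "value")
```

The result has one record per `row` value with the columns `a`, `b`, and `c` restored.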
Manipulation
- Filter: subsetting or removing observations based on some condition.
- Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
- Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).
- Sort: changing the order of observations.
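The four verbs can be sketched on the tidy treatment table with plain Python. Representing the missing result as `None` and naming the new variable `log_result` are assumptions made for this example.

```python
import math

# The tidy treatment table; None stands in for the missing result.
rows = [
    {"person": "john smith", "treatment": "a", "result": None},
    {"person": "jane doe", "treatment": "a", "result": 16},
    {"person": "mary johnson", "treatment": "a", "result": 3},
    {"person": "john smith", "treatment": "b", "result": 2},
    {"person": "jane doe", "treatment": "b", "result": 11},
    {"person": "mary johnson", "treatment": "b", "result": 1},
]

# Filter: drop observations with a missing result.
observed = [r for r in rows if r["result"] is not None]

# Transform: add a variable derived from an existing one.
transformed = [{**r, "log_result": math.log(r["result"])} for r in observed]

# Aggregate: collapse the results of each treatment into a single total.
totals = {}
for r in observed:
    totals[r["treatment"]] = totals.get(r["treatment"], 0) + r["result"]

# Sort: order observations from highest to lowest result.
ranked = sorted(observed, key=lambda r: r["result"], reverse=True)
```

Each step takes a list of observations and returns a new one, so the steps can be chained in any order.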