Tracking data provenance

When wrangling complex data sets into a form suitable for analysis there may be many steps where transformations and filtering of the data is performed in a data pipeline. At each stage we want to reassure ourselves that if we have removed data we have done so for the right reasons, and for scientific publication we need to communicate what processing the data has undergone, and whether the data processing unevenly affects the groups we wish to compare. In frequently updated analyses this must be automated.

dtrackr provides a generic capability that can address this and other problems. By extending the normal data manipulation functions provided by dplyr, we allow operations on a dataframe to be recorded as metadata, as the dataframe passes through a data pipeline, in a “history graph”. We can export the history graph as a flowchart which helps with documentation, and makes accurate reporting in the scientific literature simpler.

Tracking the data through the pipeline also allows us a way to capture where and why data is being excluded, and see both a visual summary and a detailed line list of exclusions. This can help with identifying and rectifying data quality issues.

In the situation where data is analysed interactively, as pipeline code is being created, tracking the history of a dataframe, and visualising it, allows us to identify if data has been manipulated in the way we intend, and helps debug pipeline code, or uncover unsupported assumptions about the data.

There are other places where tracking data provenance could also help, by visualising the steps an individual data item takes through a pipeline we can compare different data, such as new versions of data sets, and quickly validate our data pipeline for use with new data.

# devtools::load_all()
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dtrackr)
#> 
#> Attaching package: 'dtrackr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     add_tally, bind_cols, bind_rows
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following object is masked from 'package:utils':
#> 
#>     history
#> The following object is masked from 'package:base':
#> 
#>     comment

Basic operation - comments and stratification

The main documentation for data pipelines is provided by the comment function. This (and all other functions) uses a glue package specification to define the comment text. This glue specification can use a range of variables to describe the data as it passes through the pipeline. Firstly it can use any global variables such as filename in this example. Secondly the .count variable is the number of rows in the current group. Thirdly the .strata variable is defined to be a description of the group(s) we are currently in, but in grouped data the grouping variables (in this case Species) can also be used. Finally the .total variable returns the whole size of the ungrouped data-set.

Comments can either be a single .headline or a list of .messages. Setting either of these to “” disables them for a given comment. As in the example, thanks to glue, any expression can be evaluated in the messages but be warned, debugging them is hard. If an error in the glue spec is present dtrackr will try and tell you what is the problem was and what variables should have been available for the glue spec. (N.B. A common mistake is to use .message rather than .messages when providing a glue spec.)

# devtools::load_all()
filename = "~/tmp/iris.csv"
# this is just a pretend example
# iris = read.csv(filename)


iris %>%
  track(
    .headline = "Iris data:",
    .messages = c(
      "loaded from \"{filename}\"",
      "starts with {.count} items")
  ) %>%
  group_by(Species) %>%
  comment(
    .headline = "{.strata}",
    .messages = c(
    "In {Species}",
    "there are {.count} items",
    "that is {sprintf('%1.0f',.count/.total*100)}% of the total"),
    .tag = "note1"
    ) %>%
  ungroup() %>%
  comment("Final data has {.total} items", .tag="note2") %>%  
  flowchart()
%0 6:s->7 3:s->6 4:s->6 5:s->6 2:s->3 2:s->4 2:s->5 1:s->2 7 Final data has 150 items6 150 items3 Species:setosaIn setosathere are 50 itemsthat is 33% of the total4 Species:versicolorIn versicolorthere are 50 itemsthat is 33% of the total5 Species:virginicaIn virginicathere are 50 itemsthat is 33% of the total2 stratify by Species1 Iris data:loaded from "~/tmp/iris.csv"starts with 150 items

Status - Further analysis in the workflow

In the middle of a pipeline you may wish to document something about the data that is more complex than the simple counts. comment has a more sophisticated counterpart status. status is essentially a dplyr summarisation step which is connected to a glue specification output, that is recorded in the data frame history. In plain English this means you can do an arbitrary summarisation and put the result into the flowchart, without interrupting the data flow, for example:


iris %>%
  track("starts with {.count} items") %>%
  group_by(Species) %>%
  status(
    petalMean = sprintf("%1.1f", mean(Petal.Width)),
    petalSd = sprintf("%1.1f", sd(Petal.Width)),
    .messages = c(
    "In {Species} the petals are",
    "on average {petalMean} \u00B1 {petalSd} cms wide")) %>%
  ungroup(.messages = "ends with {.total} items") %>%
  flowchart()
%0 3:s->6 4:s->6 5:s->6 2:s->3 2:s->4 2:s->5 1:s->2 6 ends with 150 items3 Species:setosaIn setosa the petals areon average 0.2 ± 0.1 cms wide4 Species:versicolorIn versicolor the petals areon average 1.3 ± 0.2 cms wide5 Species:virginicaIn virginica the petals areon average 2.0 ± 0.3 cms wide2 stratify by Species1 starts with 150 items

A frequent use case for a more detailed description is to have a subgroup count within a flowchart. This is different enough to have its own function count_subgroup(). This works best for factor subgroup columns but other data will be converted to a factor automatically. As this uses glue specifications a modified formatted output is also possible as in this example, where the subgroup percentages are calculated to 1 decimal place.

ggplot2::diamonds %>%
  track() %>%
  group_by(cut) %>%
  count_subgroup(
    color,
    .messages = "colour {.name}: {sprintf('%1.1f%%', {.count}/{.subtotal}*100)}"
  ) %>%
  ungroup() %>%
  flowchart()
%0 3:s->8 4:s->8 5:s->8 6:s->8 7:s->8 2:s->3 2:s->4 2:s->5 2:s->6 2:s->7 1:s->2 8 53940 items3 cut:Faircolour D: 10.1%colour E: 13.9%colour F: 19.4%colour G: 19.5%colour H: 18.8%colour I: 10.9%colour J: 7.4%4 cut:Goodcolour D: 13.5%colour E: 19.0%colour F: 18.5%colour G: 17.8%colour H: 14.3%colour I: 10.6%colour J: 6.3%5 cut:Idealcolour D: 13.2%colour E: 18.1%colour F: 17.8%colour G: 22.7%colour H: 14.5%colour I: 9.7%colour J: 4.2%6 cut:Premiumcolour D: 11.6%colour E: 16.9%colour F: 16.9%colour G: 21.2%colour H: 17.1%colour I: 10.4%colour J: 5.9%7 cut:Very Goodcolour D: 12.5%colour E: 19.9%colour F: 17.9%colour G: 19.0%colour H: 15.1%colour I: 10.0%colour J: 5.6%2 stratify by cut1 53940 items

Filtering, exclusions and inclusions

Documenting the data set is only useful if you can also manipulate it, and one part of this is including and excluding things we don’t want. The standard dplyr::filter approach works for this, and we can use the before and after .count.in and .count.out, and the difference between the two .excluded to document what the result was.

In this example we exclude items that are more than one standard deviation above the mean. The default message (.messages = "excluded {.excluded} items") has been left as is which simply returns how many things have been excluded. With no customization the goal is for the pipeline to look as much as possible like a dplyr pipeline and it should be possible to take a normal dplyr pipeline and insert track() at the front and flowchart() at the end to get a documented pipeline.


iris %>%
  track() %>%
  group_by(Species) %>%
  filter(
    Petal.Width < mean(Petal.Width)+sd(Petal.Width)
  ) %>%
  ungroup() %>%
  flowchart()
%0 2:s->6 2:e->3 2:e->4 2:e->5 1:s->2 6 130 items2 stratify by Species3 Species:setosaexcluded 9 items4 Species:versicolorexcluded 5 items5 Species:virginicaexcluded 6 items1 150 items

This is useful but the reason for exclusion is not as clear as we would like, and this does not scale particularly well to multiple criteria, which are typical of the filters needed to massage real life data. For this we have written exclude_all which takes multiple criteria and applies them in parallel, combining the result at the end, after documenting their individual effects. Rather than the logical expression expected by dplyr::filter we provide matching criteria as a formula relating the filtering criteria on the left hand side, to the glue specification on the right hand side (a trick inspired by dplyr::case_when()’s syntax). This is very much slower than filter but gives fine control over the output.

The logic of exclude_all() is reversed compared to dplyr::filter(). In filter() only items for which the filtering criteria is TRUE are INCLUDED. In this example there are no missing values, however the behaviour of the filter() when the criteria cannot be evaluated and NA values are generated is to exclude them.

In exclude_all() there can be multiple criteria and ALL items that match any of the criteria are EXCLUDED. If a particular criteria cannot be evaluated for a data item the behaviour of exclude_all() is controlled by the na.rm parameter. This defaults to FALSE which means that any values that cannot be evaluated are NOT excluded.

For each different criteria the number of items excluded is recorded in the history graph in a manner defined by the right hand side of the exclusion criteria formulae.


dataset1 = iris %>%
  track() %>%
  comment("starts with {.count} items") %>%
  exclude_all(
    Species=="versicolor" ~ "removing {.excluded} versicolor"
  ) %>%
  group_by(Species) %>%
  comment("{Species} has {.count} items") %>%
  exclude_all(
    Petal.Width > mean(Petal.Width)+sd(Petal.Width) ~ "{.excluded} with petals > 1 SD wider than the mean",
    Petal.Length > mean(Petal.Length)+sd(Petal.Length) ~ "{.excluded} with petals > 1 SD longer than the mean",
    Sepal.Width > mean(Sepal.Width)+sd(Sepal.Width) ~ "{.excluded} with sepals > 1 SD wider than the mean",
    Sepal.Length > mean(Sepal.Length)+sd(Sepal.Length) ~ "{.excluded} with sepals > 1 SD longer than the mean"
  ) %>%
  comment("{Species} now has {.count} items") %>%
  ungroup() %>%
  comment("ends with {.total} items")

dataset1 %>% flowchart()
%0 11:s->12 9:s->11 10:s->11 5:s->9 5:e->7 6:s->10 6:e->8 4:s->5 4:s->6 2:s->4 2:e->3 1:s->2 12 ends with 67 items11 67 items9 Species:setosasetosa now has 33 items10 Species:virginicavirginica now has 34 items5 Species:setosasetosa has 50 items6 Species:virginicavirginica has 50 items7 Species:setosa9 with petals > 1 SD wider than the mean6 with petals > 1 SD longer than the mean6 with sepals > 1 SD wider than the mean10 with sepals > 1 SD longer than the mean8 Species:virginica6 with petals > 1 SD wider than the mean6 with petals > 1 SD longer than the mean8 with sepals > 1 SD wider than the mean8 with sepals > 1 SD longer than the mean4 stratify by Species2 starts with 150 items3 removing 50 versicolor1 150 items

Exclusions produced like this are additive and the items may be excluded by more than one exclusion criteria, and the list of exclusion counts won’t necessarily add up to an exclusion total.

Sometimes inclusion criteria are more important. For this we use include_any which works in a similar manner but INCLUDING items which match ANY of the supplied criteria, essentially combining the criteria with a logical OR operation, and in this case resulting in very different result from our previous example.


dataset2 = iris %>%
  track() %>%
  comment("starts with {.count} items") %>%
  include_any(
    Species=="versicolor" ~ "{.included} versicolor",
    Species=="setosa" ~ "{.included} setosa"
  ) %>%
  #mutate(Species = forcats::fct_drop(Species)) %>%
  group_by(Species) %>%
  comment("{Species} has {.count} items") %>%
  include_any(
    Petal.Width < mean(Petal.Width)+sd(Petal.Width) ~ "{.included} with petals <= 1 SD wider than the mean",
    Petal.Length < mean(Petal.Length)+sd(Petal.Length) ~ "{.included} with petals <= 1 SD longer than the mean",
    Sepal.Width < mean(Sepal.Width)+sd(Sepal.Width) ~ "{.included} with sepals <= 1 SD wider than the mean",
    Sepal.Length < mean(Sepal.Length)+sd(Sepal.Length) ~ "{.included} with sepals <= 1 SD longer than the mean"
  ) %>%
  comment("{Species} now has {.count} items") %>%
  ungroup() %>%
  comment("ends with {.total} items")
  
dataset2 %>% flowchart()
%0 11:s->12 9:s->11 10:s->11 7:s->9 8:s->10 5:s->7 6:s->8 4:s->5 4:s->6 3:s->4 2:s->3 1:s->2 12 ends with 99 items11 99 items9 Species:setosasetosa now has 49 items10 Species:versicolorversicolor now has 50 items7 Species:setosainclusions:41 with petals <= 1 SD wider than the mean44 with petals <= 1 SD longer than the mean44 with sepals <= 1 SD wider than the mean40 with sepals <= 1 SD longer than the mean8 Species:versicolorinclusions:45 with petals <= 1 SD wider than the mean44 with petals <= 1 SD longer than the mean42 with sepals <= 1 SD wider than the mean41 with sepals <= 1 SD longer than the mean5 Species:setosasetosa has 50 items6 Species:versicolorversicolor has 50 items4 stratify by Species3 inclusions:50 versicolor50 setosa2 starts with 150 items1 150 items

Excluded data

When considering a data pipeline it is sometimes important to know what has been excluded and at what stage. This can help for debugging or for addressing data quality issues. dtrackr can collect all data excluded at the same time as the history graph, along with a record of when in the pipeline and why an item was excluded. This behaviour is enabled by the capture_exclusions() flag.


tmp = iris %>%
  track() %>% 
  capture_exclusions() %>%
  exclude_all(
    Petal.Length > 5.8 ~ "{.excluded} long ones",
    Petal.Length < 1.3 ~ "{.excluded} short ones",
    .stage = "petal length exclusion"
  ) %>%
  comment("leaving {.count}") %>%
  group_by(Species) %>%
  filter(
    Sepal.Length >= quantile(Sepal.Length, 0.05),
    .messages="removing {.count.in-.count.out} with sepals < q 0.05",
    .type = "comment",
    .stage = "sepal length exclusion"
  ) %>%
  comment("leaving {.count}") %>%
  exclude_all(
    Petal.Width < 0.2 ~ "{.excluded} narrow ones",
    Petal.Width > 2.1 ~ "{.excluded} wide ones"
  ) %>%
  comment("leaving {.count}")

tmp %>% flowchart()
%0 8:s->14 8:e->11 9:s->15 9:e->12 10:s->16 10:e->13 5:s->8 6:s->9 7:s->10 4:s->5 4:s->6 4:s->7 3:s->4 1:s->3 1:e->2 14 Species:setosaleaving 3915 Species:versicolorleaving 4716 Species:virginicaleaving 248 Species:setosaleaving 439 Species:versicolorleaving 4710 Species:virginicaleaving 3511 Species:setosa4 narrow ones0 wide ones12 Species:versicolor0 narrow ones0 wide ones13 Species:virginica0 narrow ones11 wide ones5 Species:setosaremoving 3 with sepals < q 0.056 Species:versicolorremoving 3 with sepals < q 0.057 Species:virginicaremoving 2 with sepals < q 0.054 stratify by Species3 leaving 1331 150 items2 13 long ones4 short ones

Give the previous data pipeline we can identify the items that were excluded, the stage of the pipeline at which they were excluded and details of the code that resulted in them being excluded.

tmp %>% excluded()
#> # A tibble: 40 × 9
#>    .stage     .strata .message Sepal.Length Sepal.Width Petal.Length Petal.Width
#>    <chr>      <chr>   <glue>   <chr>        <chr>       <chr>        <chr>      
#>  1 petal len… ""      13 long… 6.3          3.3         6            2.5        
#>  2 petal len… ""      13 long… 7.1          3           5.9          2.1        
#>  3 petal len… ""      13 long… 7.6          3           6.6          2.1        
#>  4 petal len… ""      13 long… 7.3          2.9         6.3          1.8        
#>  5 petal len… ""      13 long… 7.2          3.6         6.1          2.5        
#>  6 petal len… ""      13 long… 7.7          3.8         6.7          2.2        
#>  7 petal len… ""      13 long… 7.7          2.6         6.9          2.3        
#>  8 petal len… ""      13 long… 7.7          2.8         6.7          2          
#>  9 petal len… ""      13 long… 7.2          3.2         6            1.8        
#> 10 petal len… ""      13 long… 7.4          2.8         6.1          1.9        
#> # ℹ 30 more rows
#> # ℹ 2 more variables: Species <chr>, .filter <chr>

Advanced handling of the history graph.

The history graph is a stored as an attribute on a tracked dataframe. The contents of this attribute is a list of dataframes including an edge list and node list. These can be imported into other graph processing packages, and visualised in different ways. Alternatively they could be used for automated testing of data pipelines, for example.

tmp2 = tmp %>% p_get()

# the nodes, .id is a graph unique identifier
tmp2$nodes %>% glimpse()
#> Rows: 16
#> Columns: 6
#> $ .id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
#> $ .rank   <dbl> 1, 1, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6
#> $ .strata <chr> "", "", "", "", "Species:setosa", "Species:versicolor", "Speci…
#> $ .label  <chr> "150 items<BR ALIGN='LEFT'/>", "13 long ones<BR ALIGN='LEFT'/>…
#> $ .type   <chr> "info", "exclusion", "info", "stratify", "comment", "comment",…
#> $ .stage  <chr> "", "petal length exclusion", "", "", "sepal length exclusion"…

# the edges, .to and .from are foreign keys for .id
tmp2$edges %>% glimpse()
#> Rows: 15
#> Columns: 5
#> $ .to          <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
#> $ .from        <dbl> 1, 1, 3, 4, 4, 4, 5, 6, 7, 8, 9, 10, 8, 9, 10
#> $ .rel         <chr> "exclusion", "info", "stratify", "comment", "comment", "c…
#> $ .strata      <chr> "", "", "", "Species:setosa", "Species:versicolor", "Spec…
#> $ .strata.prev <chr> "", "", "", "", "", "", "Species:setosa", "Species:versic…

The GraphViz language provides many options for formatting the flowchart. Rather than try and provide an interface for them, we have gone for sane defaults. If you want to change this or use a different layout engine the GraphViz output can be retrieved and edited directly. Alternatively if the output you need is different, rendered SVG output can be edited by hand.

cat(tmp %>% p_get_as_dot())
#> digraph {
#>      graph [layout = 'dot',
#>         splines='ortho',
#>         rankdir = 'TB',
#>         outputorder = 'edgesfirst',
#>         bgcolor = 'white',
#>         ranksep = '0.25',
#>         nodesep = '0.2',
#>         newrank='true']
#> 
#>     node [fontname = 'Helvetica',
#>         fontsize = '8',
#>         shape='box',
#>         fixedsize = 'false',
#>         margin = '0.1,0.1',
#>         width = '0',
#>         height = '0',
#>         style = 'filled',
#>         color = 'black',
#>         fontcolor = 'black',
#>         labeljust='l']
#> 
#>     edge [fontname = 'Helvetica',
#>         fontsize = '8',
#>         len = '0.5',
#>         color = 'black',
#>         arrowsize = '0.5']
#>     
#> { rank='same';
#> '14' [label=<<B>Species:setosa</B><BR ALIGN='LEFT'/>leaving 39<BR ALIGN='LEFT'/>>,group='Species:setosa',fillcolor='white'];
#> '15' [label=<<B>Species:versicolor</B><BR ALIGN='LEFT'/>leaving 47<BR ALIGN='LEFT'/>>,group='Species:versicolor',fillcolor='white'];
#> '16' [label=<<B>Species:virginica</B><BR ALIGN='LEFT'/>leaving 24<BR ALIGN='LEFT'/>>,group='Species:virginica',fillcolor='white'];
#> }
#> { rank='same';
#> '8' [label=<<B>Species:setosa</B><BR ALIGN='LEFT'/>leaving 43<BR ALIGN='LEFT'/>>,group='Species:setosa',fillcolor='white'];
#> '9' [label=<<B>Species:versicolor</B><BR ALIGN='LEFT'/>leaving 47<BR ALIGN='LEFT'/>>,group='Species:versicolor',fillcolor='white'];
#> '10' [label=<<B>Species:virginica</B><BR ALIGN='LEFT'/>leaving 35<BR ALIGN='LEFT'/>>,group='Species:virginica',fillcolor='white'];
#> '11' [label=<<B>Species:setosa</B><BR ALIGN='LEFT'/>4 narrow ones<BR ALIGN='LEFT'/>0 wide ones<BR ALIGN='LEFT'/>>,group='Species:setosa',fillcolor='grey80'];
#> '12' [label=<<B>Species:versicolor</B><BR ALIGN='LEFT'/>0 narrow ones<BR ALIGN='LEFT'/>0 wide ones<BR ALIGN='LEFT'/>>,group='Species:versicolor',fillcolor='grey80'];
#> '13' [label=<<B>Species:virginica</B><BR ALIGN='LEFT'/>0 narrow ones<BR ALIGN='LEFT'/>11 wide ones<BR ALIGN='LEFT'/>>,group='Species:virginica',fillcolor='grey80'];
#> }
#> { rank='same';
#> '5' [label=<<B>Species:setosa</B><BR ALIGN='LEFT'/>removing 3 with sepals &lt; q 0.05<BR ALIGN='LEFT'/>>,group='Species:setosa',fillcolor='white'];
#> '6' [label=<<B>Species:versicolor</B><BR ALIGN='LEFT'/>removing 3 with sepals &lt; q 0.05<BR ALIGN='LEFT'/>>,group='Species:versicolor',fillcolor='white'];
#> '7' [label=<<B>Species:virginica</B><BR ALIGN='LEFT'/>removing 2 with sepals &lt; q 0.05<BR ALIGN='LEFT'/>>,group='Species:virginica',fillcolor='white'];
#> }
#> { rank='same';
#> '4' [label=<stratify by Species<BR ALIGN='LEFT'/>>,group='',fillcolor='white'];
#> }
#> { rank='same';
#> '3' [label=<leaving 133<BR ALIGN='LEFT'/>>,group='',fillcolor='white'];
#> }
#> { rank='same';
#> '1' [label=<150 items<BR ALIGN='LEFT'/>>,group='',fillcolor='white'];
#> '2' [label=<13 long ones<BR ALIGN='LEFT'/>4 short ones<BR ALIGN='LEFT'/>>,group='',fillcolor='grey80'];
#> }
#> 
#> 
#> '10' -> '16' [tailport='s',weight='100']
#> '9' -> '15' [tailport='s',weight='100']
#> '8' -> '14' [tailport='s',weight='100']
#> '10' -> '13' [tailport='e',weight='1']
#> '9' -> '12' [tailport='e',weight='1']
#> '8' -> '11' [tailport='e',weight='1']
#> '7' -> '10' [tailport='s',weight='100']
#> '6' -> '9' [tailport='s',weight='100']
#> '5' -> '8' [tailport='s',weight='100']
#> '4' -> '7' [tailport='s',weight='100']
#> '4' -> '6' [tailport='s',weight='100']
#> '4' -> '5' [tailport='s',weight='100']
#> '3' -> '4' [tailport='s',weight='100']
#> '1' -> '3' [tailport='s',weight='100']
#> '1' -> '2' [tailport='e',weight='1']
#> }

Combined data flows

Data sets that undergo different processing may be joined into a single dataset. This is supported in dtrackr and demonstrated in the “joining-pipelines” vignette.