Skip to contents

Joins across data sets

Joining dtrackr tracked data is supported and allows us to combine linked data sets. In this toy example the data sets are characters from a popular film from my youth.

# here we create a set of linked data from the starwars data
# in a real example these data sets would have come from different places
people = starwars %>% select(-films, -vehicles, -starships)
vehicles = starwars %>% select(name,vehicles) %>% unnest(cols = c(vehicles))
starships = starwars %>% select(name,starships) %>% unnest(cols = c(starships))
films = starwars %>% select(name,films) %>% unnest(cols = c(films))
# these 4 data frames are linked together by the name attribute

# we track both input data sets:
tmp1 = people %>% track() %>% comment("People df {.total}")
tmp2 = films %>% track() %>% comment("Films df {.total}") %>% comment("a test comment")

# and here we (re)join the two data sets:
tmp1 %>% 
  inner_join(tmp2, by="name") %>% 
  comment("joined {.total}") %>% 
  flowchart()
%0 6:s->7 2:s->6 5:s->6 1:s->2 4:s->5 3:s->4 7 joined 1736 Inner join by name87 on LHS173 on RHS173 in linked set2 People df 875 a test comment1 87 items4 Films df 1733 173 items
# The join message used by inner_join here is configurable but defaults to 
# {.count.lhs} on LHS
# {.count.rhs} on RHS
# {.count.out} in linked set

All dplyr join types are supported by dtrackr which allows us to report on the numbers on either side of the join and on the resulting total. This can help detect if any data items are lost during the join. However we do not yet capture data that becomes excluded during joins, as the interpretation depends on the type of join employed.

Unions

Another type of binary operator is a union. This is a simpler problem and works as expected. In this example the early part of the pipeline is detected to be the same on both branches of the data flow. This therefore results in a flow that splits then subsequently joins again during the union (bind_rows) operator.

tmp = people %>% comment("start")

tmp1 = tmp %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
  )

tmp2 = tmp %>% include_any(
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans")

tmp3 = bind_rows(tmp1,tmp2) %>% comment("{.count} human,droids and gungans") 
tmp3 %>% flowchart()
%0 5:s->6 2:s->5 4:s->5 3:s->4 1:s->2 1:s->3 6 44 human,droids and gungans5 Union44 in union2 inclusions:35 humans6 droids4 3 gungans3 inclusions:3 gungans1 start

Other dplyr set operations are supported such as setdiff(), union(), union_all() and intersect() which are included in the function documentation.