Stratifying your analysis

Grouping a tracked dataframe works as it does in dplyr. A group_by() operation can sometimes create a very large number of groups, for example in a group_by() then summarise() step that aggregates data on a fine scale, such as by day in a time series. Tracking such a step is generally a bad idea, because the resulting flowchart will have too many branches to be legible. dtrackr detects this and pauses tracking of the dataframe with a warning. It is then up to the user to resume() tracking once the large number of groups has been resolved, e.g. with dplyr::ungroup(). The limit is configurable with options("dtrackr.max_supported_groupings"=XX) and defaults to 16. See dplyr::group_by() for the underlying function.
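The pause-and-resume behaviour described above can be sketched as follows. This is a minimal illustration rather than a recommended setting: the limit of 2 is an assumption chosen only so that the three species in iris exceed it, and the exact wording of the warning is not shown.

```r
library(dplyr)
library(dtrackr)

# Lower the limit for this session (2 is an illustrative value only;
# the documented default is 16). options() returns the previous setting.
old = options("dtrackr.max_supported_groupings" = 2)

# iris has 3 species, which exceeds the limit of 2, so this group_by()
# is expected to pause tracking and emit a warning.
tmp = iris %>% track() %>% group_by(Species)

# Once the large grouping has been resolved, tracking can be resumed.
tmp = tmp %>% ungroup() %>% resume()

# Restore the previous option value.
options(old)
```

Restoring the option afterwards keeps the lowered limit from affecting later pipelines in the same session.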

Usage

# S3 method for class 'trackr_df'
group_by(
  .data,
  ...,
  .messages = "stratify by {.cols}",
  .headline = NULL,
  .tag = NULL,
  .maxgroups = .defaultMaxSupportedGroupings()
)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.

...

In group_by(), variables or computations to group by. Computations are always done on the ungrouped data frame. To perform computations on the grouped data, you need to use a separate mutate() step before the group_by(). Computations are not allowed in nest_by(). In ungroup(), variables to remove from the grouping. Named arguments passed on to dplyr::group_by():

.add

When FALSE, the default, group_by() will override existing groups. To add to the existing groups, use .add = TRUE.

This argument was previously called add, but that prevented creating a new grouping variable called add, and conflicts with our naming conventions.

.drop

Drop groups formed by factor levels that don't appear in the data? The default is TRUE except when .data has been previously grouped with .drop = FALSE. See group_by_drop_default() for details.

x

A tbl()

.messages

a set of glue specs. The glue code can use any global variable, or {.cols} which is the columns that are being grouped by.

.headline

a headline glue spec. The glue code can use any global variable, or {.cols}.

.tag

if you want the summary data from this step in the future then give it a name with .tag.

.maxgroups

the maximum number of subgroups allowed before the tracking is paused.

Value

the .data but grouped.

Examples

library(dplyr)
library(dtrackr)

tmp = iris %>% track() %>% group_by(Species, .messages="stratify by {.cols}")
tmp %>% comment("{.strata}") %>% history()
#> dtrackr history:
#> number of flowchart steps: 3 (approx)
#> tags defined: <none>
#> items excluded so far: <not capturing exclusions>
#> last entry / entries:
#> ├ [Species:setosa]: "Species:setosa", "Species:setosa"
#> ├ [Species:versicolor]: "Species:versicolor", "Species:versicolor"
#> └ [Species:virginica]: "Species:virginica", "Species:virginica"
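The .maxgroups argument sets the same limit for a single call instead of through the global option. The sketch below assumes the mtcars data set, whose cyl column has three distinct values; the limits of 2 and 5 are illustrative choices, not recommendations.

```r
library(dplyr)
library(dtrackr)

# 3 groups exceed a limit of 2, so tracking is expected to pause with a
# warning; the data itself is still grouped by cyl.
paused = mtcars %>% track() %>% group_by(cyl, .maxgroups = 2)

# 3 groups are within a limit of 5, so tracking should continue as normal.
tracked = mtcars %>% track() %>% group_by(cyl, .maxgroups = 5)
tracked %>% history()
```

In both cases the returned data frame is grouped; only the tracking of later steps differs.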
