ggrrr supports a simple pass-through caching layer. It assumes calculations are deterministic: given a stated set of inputs, it computes the result once and caches it for reuse.

The cache location is set through the cache.dir option. The default is the user cache directory, but in an analysis project a specific cache directory may be preferred.

# In an analysis project the cache might live in a sub-directory of the project
# options("cache.dir"=here::here("cache"))

# In a package you might use a package-specific cache
# options("cache.dir"=rappdirs::user_cache_dir("my-package"))

# For our purposes a temporary directory is OK, but a cache in tempdir() only persists for as long as the session is active.
options("cache.dir"= tempdir())
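
As a rough sketch of how a location might be resolved from this option, the snippet below reads cache.dir and falls back to a per-user cache directory. The helper name resolve_cache_dir and the "my-package" label are hypothetical, and tools::R_user_dir() (base R 4.0 or later) is only a stand-in for whatever default ggrrr actually uses.

```r
# Hypothetical helper, not part of ggrrr: find (and create) the cache directory.
resolve_cache_dir = function(default = tools::R_user_dir("my-package", which = "cache")) {
  # Use the cache.dir option if it is set, otherwise the per-user default
  dir = getOption("cache.dir", default = default)
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  dir
}
```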

Caching an item

The cached item is the result of evaluating an expression within the current environment. We manually specify which items in the environment may affect the outcome. The cache key is a combination of the hash of the expression and the hash of the input data.

We do not check the rest of the environment, so it is up to the user to make sure that all variables which influence the output of the expression are stated. Likewise, if the code in the expression changes a new value will be cached, but if the contents of a function used in the expression change this might not be picked up.
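
The idea of a key built from the expression and its stated inputs can be sketched in base R. This is a toy illustration under stated assumptions, not the ggrrr implementation: toy_cached and hash_object are hypothetical names, md5 over serialized objects is an arbitrary choice, and the file naming is invented for the example.

```r
# Hypothetical helper: md5 hash of any R object, using base R only.
hash_object = function(x) {
  f = tempfile()
  on.exit(unlink(f))
  writeBin(serialize(x, NULL), f)
  unname(tools::md5sum(f))
}

# Hypothetical stand-in for cached(); NOT the ggrrr implementation.
toy_cached = function(expr, ..., .dir = tempdir()) {
  code = substitute(expr)                  # the unevaluated expression
  code_hash = hash_object(deparse(code))   # changes when the code changes
  input_hash = hash_object(list(...))      # changes when a stated input changes
  path = file.path(.dir, paste0("toy-", code_hash, "-", input_hash, ".rds"))
  if (file.exists(path)) return(readRDS(path))
  value = eval(code, envir = parent.frame())
  saveRDS(value, path)
  value
}

x = 3
toy_cached({ x^2 }, x)  # evaluated, then written to disk
toy_cached({ x^2 }, x)  # same code and input, so read back from disk
```

Only the objects passed after the expression feed into the key, which is why an unstated variable, or a change inside a function the expression calls, goes unnoticed in a scheme like this.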

# Ten iterations of the same code
# only the first iteration is actually calculated. The rest are loaded from disk.

for (i in 1:10) {
  
  quantCutOff = 0.95
  start = Sys.time()
  
  fit = cached({
    
    expensiveDiamonds = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
    glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds )
    
  }, diamonds, quantCutOff )
  
  duration = Sys.time()-start
  cat("iteration:",i," duration: ",duration,"\n")
  
}
## caching item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## iteration: 1  duration:  1.159843
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 2  duration:  0.09126902
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 3  duration:  0.09219599
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 4  duration:  0.09015965
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 5  duration:  0.1410227
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 6  duration:  0.08353686
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 7  duration:  0.08875036
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 8  duration:  0.08673692
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 9  duration:  0.0854876
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 10  duration:  0.08891273

Re-caching when code changes

A change in the code (in this example a trivial change in a variable name) should result in the item being re-cached. Whitespace changes do not trigger recalculation.

fit = cached({
  expensiveDiamonds2 = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
  glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds2 )
}, diamonds, quantCutOff )
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
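
One way whitespace-insensitivity can arise in R is by comparing the parsed expression rather than the source text. The sketch below is an assumption about the general mechanism, not a description of ggrrr's internals; it uses only base R parse() and deparse().

```r
# Parse three snippets without keeping source references
a = parse(text = "{ y = x + 1 }", keep.source = FALSE)[[1]]
b = parse(text = "{\n  y   =   x+1\n}", keep.source = FALSE)[[1]]
c = parse(text = "{ z = x + 1 }", keep.source = FALSE)[[1]]

# Deparsing normalises layout, so only a real code change shows up
same_layout = identical(deparse(a), deparse(b))   # TRUE: whitespace only
same_names = identical(deparse(a), deparse(c))    # FALSE: variable renamed
```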

Forcing execution of cached item

If you want to bypass the cache for a one-off execution, the .nocache option achieves that.

fit = cached({
  expensiveDiamonds = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
  glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds )
}, diamonds, quantCutOff, .nocache = TRUE )
## caching item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Globally disabling cache

You might want to cache interim results while developing (for speed) and disable the cache completely during final analysis.

oldopt = options(cache.disable = TRUE)

fit = cached({
  expensiveDiamonds = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
  glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds )
}, diamonds, quantCutOff)
## caching item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
options(oldopt)
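
The save-and-restore pattern above can be wrapped so that the option is always reset, even if the code fails part way through. This is a minimal sketch: with_cache_disabled is a hypothetical helper name, and it relies only on the cache.disable option shown above and on base R options() returning the previous values.

```r
# Hypothetical helper, not part of ggrrr: run code with caching switched off.
with_cache_disabled = function(code) {
  old = options(cache.disable = TRUE)
  on.exit(options(old), add = TRUE)  # restore the previous setting on exit
  force(code)                        # code is evaluated lazily, i.e. after the option is set
}

# Inside the call the option is TRUE; afterwards it is back to its old value
with_cache_disabled(getOption("cache.disable"))
```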

Manually removing items from cache

In general, items can be left in the cache; they are removed once they are determined to be stale. The cache can be flushed manually as follows. The interactive option is usually omitted when running at the command line.

cache_clear(interactive = FALSE)

Caching multiple types of data

Controlling the naming of the cache items helps to keep track of the items in the cache if you are manually inspecting it. If the items follow a naming convention then deleting items from the cache can then be done on a type by type basis.

fit = cached({
  expensiveDiamonds = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
  glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds )
}, diamonds, quantCutOff, .prefix = "diamonds")
## caching item: /tmp/RtmpZBelAG/diamonds-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
fit2 = cached({
  fastCars = mtcars %>% mutate(fast = hp>quantile(hp,quantCutOff))
  glm(fast ~ mpg+cyl+disp+drat+wt+gear, family = binomial(link='logit'), data=fastCars )
}, mtcars, quantCutOff, .prefix = "cars")
## caching item: /tmp/RtmpZBelAG/cars-4d33d39f25b44f2dd0e3f1c284e46470-d80edaedd01c0689a508e11d7fa6c365.rda
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cache_clear(.prefix = "cars", interactive = FALSE)

cache_clear(.prefix = "diamonds", interactive = FALSE)

Caching data in multiple stages of analysis

One use of this is to cache an analysis in stages and to delete the cached content of unsuccessful stages.

N.B. as cache_clear uses regular expressions, it is probably best to avoid full stop characters in the prefix.

oldopts = options(cache.item.prefix = "stage-1")

fit = cached({
  expensiveDiamonds = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
  glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds )
}, diamonds, quantCutOff)
## caching item: /tmp/RtmpZBelAG/stage-1-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
options(cache.item.prefix = "stage-2")

fit2 = cached({
  fastCars = mtcars %>% mutate(fast = hp>quantile(hp,quantCutOff))
  glm(fast ~ mpg+cyl+disp+drat+wt+gear, family = binomial(link='logit'), data=fastCars )
}, mtcars, quantCutOff)
## caching item: /tmp/RtmpZBelAG/stage-2-4d33d39f25b44f2dd0e3f1c284e46470-d80edaedd01c0689a508e11d7fa6c365.rda
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cache_clear(.prefix = "stage-1", interactive = FALSE)
cache_clear(.prefix = "stage-2", interactive = FALSE)

options(oldopts)
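
The note above about full stops in prefixes comes down to general regular expression behaviour: an unescaped "." matches any character. The sketch below illustrates this with base R list.files() on made-up file names. How cache_clear builds its pattern is not shown in this document, so treat this as an illustration of the risk rather than of its internals.

```r
# A throwaway directory with three made-up cache files
cache = tempfile("cache-")
dir.create(cache)
file.create(file.path(cache, c("stage.1-aaa.rda", "stage-1-bbb.rda", "stageX1-ccc.rda")))

# Unescaped: "." matches any character, so all three files are selected
loose = list.files(cache, pattern = "^stage.1")

# Escaped: only the file whose name really starts with "stage.1"
strict = list.files(cache, pattern = "^stage\\.1")
```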

Caching a URL download

This fetches a file from the internet and caches it locally. It does not check for remote changes. As before, a re-download can be forced with .nocache.

file = cache_download(
  "https://raw.githubusercontent.com/terminological/arear/main/data-raw/NHSSurgeCapacityMarch2020.csv",
  .extn = "csv",
  .nocache = TRUE
)
## downloading item: NHSSurgeCapacityMarch2020.csv

Downloading changing data

If you want a daily download, setting the .stale parameter to one day will download the data at most once per day. Staleness is determined by the number of days from 2am on the current day in the current time zone. An item cached for only one day becomes stale at 2am on the day after it is cached. The time is configurable; for example, options(cache.time_day_starts = 0) would set this to midnight. Automated analyses that use caches and updated data should make sure that they do not run across this time point, otherwise they may unexpectedly end up using old data.

# number of days before reloading
file = cache_download(
  "https://raw.githubusercontent.com/terminological/arear/main/data-raw/NHSSurgeCapacityMarch2020.csv",
  .stale = 1,
  .extn = "csv"
)
## downloading item: NHSSurgeCapacityMarch2020.csv
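
The staleness rule described above can be sketched as a date comparison in which each "day" starts at a configurable hour. This is one reading of the rule, not the ggrrr implementation: is_stale and its arguments are hypothetical, and the example fixes the time zone to UTC to keep the arithmetic simple.

```r
# Hypothetical helper, not part of ggrrr: has an item outlived its allowance?
is_stale = function(cached_at, now, stale_days = 1, day_starts = 2, tz = "UTC") {
  shift = day_starts * 3600                      # hours to seconds
  # Shifting both times back makes each "day" begin at day_starts o'clock
  cached_day = as.Date(cached_at - shift, tz = tz)
  current_day = as.Date(now - shift, tz = tz)
  as.numeric(current_day - cached_day) >= stale_days
}

cached_at = as.POSIXct("2024-03-10 10:00:00", tz = "UTC")
is_stale(cached_at, as.POSIXct("2024-03-11 01:59:00", tz = "UTC"))  # FALSE: before 2am
is_stale(cached_at, as.POSIXct("2024-03-11 02:00:00", tz = "UTC"))  # TRUE: 2am the next day
```

Under this reading, a job that starts at 01:55 and finishes at 02:05 would see the same item as fresh at the start and stale at the end, which is the situation the warning about automated analyses refers to.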