ggrrr
supports a super simple passthrough caching layer.
This assumes calculations are deterministic, and with a stated set of
inputs will compute and cache the result.
Setting the caching location should be transparent. The default directory is the user cache directory, but in an analysis project a specific cache directory may be preferred
# Setting a cache in an analysis project might be in a sub-directory of the project
# options("cache.dir"=here::here("cache"))
# If in a package you might be using a package specific cache
# options("cache.dir"=rappdirs::user_cache_dir("my-package"))
# for our purposes a temporary directory is OK, but caching to a tempdir will only persist as long as the session is active.
options("cache.dir"= tempdir())
Caching an item
The item caches is the result of running an expression with in the current environment. We manually specify which items in the environment may affect the outcome. The cache key is a combination of the hash of the expression and the input data.
We do not check the rest of the environment so it is up to the user to make sure all variables which influence the expression output are states. Likewise if the code in the expression changes a new value will be cached but if the contents of a function used in the expression changes thsi might not be picked up.
# Ten iterations of the same code
# only the first iteration is actually calculated. The rest are loaded from disk.
for (i in 1:10) {
quantCutOff = 0.95
start = Sys.time()
fit = cached({
expensiveDiamonds = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds )
}, diamonds, quantCutOff )
duration = Sys.time()-start
cat("iteration:",i," duration: ",duration,"\n")
}
## caching item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## iteration: 1 duration: 1.159843
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 2 duration: 0.09126902
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 3 duration: 0.09219599
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 4 duration: 0.09015965
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 5 duration: 0.1410227
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 6 duration: 0.08353686
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 7 duration: 0.08875036
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 8 duration: 0.08673692
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 9 duration: 0.0854876
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## iteration: 10 duration: 0.08891273
Re-caching when code changes
A change in the code (in this example a trivial change in a variable name) will result in the item being re-cached. White space changes do not trigger recalculation.
fit = cached({
expensiveDiamonds2 = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds2 )
}, diamonds, quantCutOff )
## using cached item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
Forcing execution of cached item
If you want to defeat the caching for a one off execution the
.nocache
option achieves that.
fit = cached({
expensiveDiamonds = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds )
}, diamonds, quantCutOff, .nocache = TRUE )
## caching item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Globally disabling cache
You might want to cache interim results while developing (for speed) and disable the cache completely during final analysis.
oldopt = options(cache.disable = TRUE)
fit = cached({
expensiveDiamonds = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds )
}, diamonds, quantCutOff)
## caching item: /tmp/RtmpZBelAG/cached-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
options(oldopt)
Manually removing items from cache
In general items can be left in the cache, and they will get removed when it is determined that they are stale. A manual flush of the cache can be done like this. Usually the interactive option will be omitted if run at the command line
cache_clear(interactive = FALSE)
Caching multiple types of data
Controlling the naming of the cache items helps to keep track of the items in the cache if you are manually inspecting it. If the items follow a naming convention then deleting items from the cache can then be done on a type by type basis.
fit = cached({
expensiveDiamonds = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds )
}, diamonds, quantCutOff, .prefix = "diamonds")
## caching item: /tmp/RtmpZBelAG/diamonds-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
fit2 = cached({
fastCars = mtcars %>% mutate(fast = hp>quantile(hp,quantCutOff))
glm(fast ~ mpg+cyl+disp+drat+wt+gear, family = binomial(link='logit'), data=fastCars )
}, mtcars, quantCutOff, .prefix = "cars")
## caching item: /tmp/RtmpZBelAG/cars-4d33d39f25b44f2dd0e3f1c284e46470-d80edaedd01c0689a508e11d7fa6c365.rda
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cache_clear(.prefix = "cars", interactive = FALSE)
cache_clear(.prefix = "diamonds", interactive = FALSE)
Caching data in multiple stages of analysis
One use of this is to allow analysis to be cached in stages, and deleting cached content for unsuccessful stages.
N.b. as cache_clear uses regex it is probably best to avoid full stop characters in the prefix.
oldopts = options(cache.item.prefix = "stage-1")
fit = cached({
expensiveDiamonds = diamonds %>% mutate(expensive = price>quantile(price,quantCutOff))
glm(expensive ~ carat+cut+color+clarity+depth+table, family = binomial(link='logit'), data=expensiveDiamonds )
}, diamonds, quantCutOff)
## caching item: /tmp/RtmpZBelAG/stage-1-4d33d39f25b44f2dd0e3f1c284e46470-c6e03ac8f53dbecd19a32787db92df7a.rda
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
options(cache.item.prefix = "stage-2")
fit2 = cached({
fastCars = mtcars %>% mutate(fast = hp>quantile(hp,quantCutOff))
glm(fast ~ mpg+cyl+disp+drat+wt+gear, family = binomial(link='logit'), data=fastCars )
}, mtcars, quantCutOff)
## caching item: /tmp/RtmpZBelAG/stage-2-4d33d39f25b44f2dd0e3f1c284e46470-d80edaedd01c0689a508e11d7fa6c365.rda
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cache_clear(.prefix = "stage-1", interactive = FALSE)
cache_clear(.prefix = "stage-2", interactive = FALSE)
options(oldopts)
Caching a url download
Getting a file from the internet and locally caching. This does not check for remote changes. as before forcing the re-download can be achieved with .nocache.
file = cache_download(
"https://raw.githubusercontent.com/terminological/arear/main/data-raw/NHSSurgeCapacityMarch2020.csv",
.extn = "csv",
.nocache = TRUE
)
## downloading item: NHSSurgeCapacityMarch2020.csv
Downloading changing data
If you want a daily download setting the .stale parameter to one day
will download the data at most once per day. Staleness is determined by
the number of days from 2am on the current day in the current time-zone.
A item cached for only one day becomes stale at 2am the day after it is
cached. The time is configurable and for example
option(cache.time_day_starts = 0)
would set this to be
midnight. Automated analysis using caches and updated data should ensure
that analysis does not run over this time point otherwise it may end up
unexpectedly using old data.
# number of days before reloading
file = cache_download(
"https://raw.githubusercontent.com/terminological/arear/main/data-raw/NHSSurgeCapacityMarch2020.csv",
.stale = 1,
.extn = "csv"
)
## downloading item: NHSSurgeCapacityMarch2020.csv