A workflow manager for R

targets

Aurélien Ginolhac

University of Luxembourg

Thursday, the 6th of June, 2024

Workflow managers

  • Workflow Managers are designed to compose and execute a series of computational steps.
  • Workflows are typically represented as a visual graph where nodes are connected together.
  • Workflow managers support abstractions and provide automation.
  • Scientific workflow systems enable large scale scientific experiments.
  • They make computational methods reproducible, portable, maintainable, and shareable.

Makefiles

It started with Makefiles, when computing power was limited: compile objects (*.o) only when needed, that is when the source (*.c) has been modified. make was first released in April 1976.

Dependency rules.

target: dependencies
      commands

# This is a comment line
CC=gcc
# CFLAGS will be the options passed to the compiler.
CFLAGS= -c -Wall
all: prog

prog: main.o factorial.o hello.o
    $(CC) main.o factorial.o hello.o -o prog

main.o: main.c
    $(CC) $(CFLAGS) main.c

factorial.o: factorial.c
    $(CC) $(CFLAGS) factorial.c

hello.o: hello.c
    $(CC) $(CFLAGS) hello.c

clean:
    rm -rf *.o

Compile with make (rule all)
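The rebuild rule behind make can be sketched in a few lines of base R: a target is rebuilt when it is missing or when one of its dependencies is newer. The helper needs_rebuild() and the file names are hypothetical, for illustration only; make itself is of course not written in R.

```r
# Hypothetical helper: TRUE when `target` must be rebuilt from `deps`.
needs_rebuild <- function(target, deps) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(deps) > file.mtime(target))
}

# Demo in a temporary directory with made-up file names.
tmp <- tempfile("make_demo_")
dir.create(tmp)
src <- file.path(tmp, "hello.c")
obj <- file.path(tmp, "hello.o")

writeLines("int main(void) { return 0; }", src)
missing_obj <- needs_rebuild(obj, src)   # the object does not exist yet

writeLines("fake object file", obj)
# Set explicit timestamps so the comparison does not depend on clock resolution.
now <- Sys.time()
Sys.setFileTime(src, now - 60)
Sys.setFileTime(obj, now)
up_to_date <- needs_rebuild(obj, src)    # the object is newer than the source

Sys.setFileTime(src, now + 60)           # simulate an edit of the source
after_edit <- needs_rebuild(obj, src)
```

Here missing_obj and after_edit are TRUE while up_to_date is FALSE: only a missing or stale object triggers a recompilation.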

{targets} and companion package tarchetypes

  • Saving you time and stress
  • Understand how it is implemented in targets
    • Define your targets
    • Connect targets to create the dependencies
    • Check dependencies with tar_visnetwork()
    • Embrace either one, or both combined
      • Dynamic branching
      • Static branching
    • Run only what needs to be executed, in a fresh session with {callr}
    • Embrace literate programming with qmd or Rmd docs
    • Bundle dependencies in a document with tar_render()/tar_quarto()
  • Be better at scheduling your work

Folder structure

├── .git/
├── _targets.R
├── _targets/
├── Repro.Rproj
├── R/
│   ├── functions.R
│   └── utils.R
├── run.R*
├── renv/
├── renv.lock
└── report.qmd

Targets

  • With renv: snapshot your package environment
  • _targets.R is the only mandatory file
  • Use an R sub-folder for functions; it gets closer to a package
  • An Rmd/qmd file gathers the results in a report
  • In an RStudio project
  • Version tracked with git
  • An executable run.R lets you use the Build Tools in RStudio

DatasauRus example, smart animation caching

This example is available in the targets_demos repo

targets script _targets_ds_fun1.R

library(targets)
library(tarchetypes)
source("R/plotting.R")
# load the tidyverse quietly for each target
# which each runs in a fresh R session
tar_option_set(packages = "tidyverse")

list(
  # track if distant file has changed
  tar_url(ds_file, "https://raw.githubusercontent.com/jumpingrivers/datasauRus/main/inst/extdata/DatasaurusDozen-Long.tsv"),
  tar_target(ds, read_tsv(ds_file, show_col_types = FALSE)),
  tar_target(all_facets, facet_ds(ds)),
  # animation is worth caching  ~ 1 min
  tar_target(anim, anim_ds(ds), 
             packages = c("ggplot2", "gganimate", "gifski")),
  tar_file(gif, {
    anim_save("ds.gif", animation = anim, title_frame = TRUE)
    # anim_save returns NULL, we need to get the file output path
    "ds.gif"},
             packages = c("gganimate")),
  tar_quarto(report, "ds1.qmd")
)

Corresponding Directed Acyclic Graph

  • Directed: each edge has a one-way direction, from a target to the targets that depend on it.
  • Acyclic: no loop, no ambiguity.
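Why acyclic matters can be sketched in base R: without loops, a valid execution order always exists. The dependency list below mirrors the pipeline above, except for the inputs of report, which are an assumption since they are declared inside ds1.qmd. topo_order() is a hypothetical helper, not a targets function.

```r
# Each target is listed with its upstream dependencies.
# The dependencies of `report` are assumed: they are declared in ds1.qmd.
deps <- list(
  ds_file    = character(0),
  ds         = "ds_file",
  all_facets = "ds",
  anim       = "ds",
  gif        = "anim",
  report     = c("all_facets", "gif")
)

# Kahn-style ordering: repeatedly pick the targets whose dependencies are done.
topo_order <- function(deps) {
  done <- character(0)
  remaining <- names(deps)
  while (length(remaining) > 0) {
    is_ready <- vapply(deps[remaining], function(d) all(d %in% done), logical(1))
    ready <- remaining[is_ready]
    if (length(ready) == 0) stop("cycle detected: not a DAG")
    done <- c(done, ready)
    remaining <- setdiff(remaining, ready)
  }
  done
}

run_order <- topo_order(deps)
```

With a loop such as list(a = "b", b = "a"), no target is ever ready and the function stops: the order would be ambiguous.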

Manifest, a good companion to the DAG

Precise description of steps in a table

> tar_manifest()
# A tibble: 6 × 2
  name       command     
  <chr>      <chr>  
1 ds_file    "\"https://raw.gi[...]n/inst/extdata/DatasaurusDozen-Long.tsv\""
2 ds         "read_tsv(ds_file, show_col_types = FALSE)"    
3 anim       "anim_ds(ds)"
4 all_facets "facet_ds(ds)"
5 gif        "{\n     anim_save(\"ds.gif\", animation = anim, title_frame = TRUE)\n     \"ds.gif\"\n }"  
6 report     "tarchetypes::tar_quarto_run(args = list(input = \"ds1.qmd\", \n     execute = TRUE,

Literate programming

We recommend rendering the document from within a target (tar_render()/tar_quarto()) rather than using Target Markdown, which overloads the document.

Multi-projects in one folder

Like the targets_demos repo which has 4 projects

Config file: _targets.yaml

targets needs an R script and a store location

ds_linear:
  store: _ds_1
  script: _targets_ds_1.R
ds_fun_linear:
  store: _ds_fun1
  script: _targets_ds_fun1.R
ds_dynamic:
  store: _ds_2
  script: _targets_ds_2.R
ds_static:
  store: _ds_3
  script: _targets_ds_3.R
  reporter_make: verbose_positives # do not display skipped targets

Usage

In your Rmd/qmd/console, set one environment variable:

Sys.setenv(TAR_PROJECT = "ds_fun_linear")
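A minimal sketch of how the config file and the environment variable fit together, assuming {targets} is installed. It runs in a throwaway directory (demo_dir is a made-up name) and reuses the ds_fun_linear entry from the YAML above; tar_config_set() writes the entry and tar_config_get() reads back what the selected project resolves to.

```r
library(targets)

# Work in a throwaway directory so that no real project is touched.
demo_dir <- tempfile("tar_config_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

# Register one project in _targets.yaml, reusing the names shown above.
tar_config_set(
  script  = "_targets_ds_fun1.R",
  store   = "_ds_fun1",
  project = "ds_fun_linear"
)

# Select the project, then ask targets which script and store it resolves.
Sys.setenv(TAR_PROJECT = "ds_fun_linear")
active_script <- tar_config_get("script")
active_store  <- tar_config_get("store")

# Clean up the session state changed for the demo.
Sys.unsetenv("TAR_PROJECT")
setwd(old_wd)
```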

Custom Building Tool

Tools > Project Options > Build Tools > Custom

run.R:

#!/usr/bin/env Rscript
# Optional environment variable, needed with more than one targets script
Sys.setenv(TAR_PROJECT = "ds_fun_linear")
targets::tar_make()

Running targets

  • Useful shortcut Shift-Ctrl-B

  • Animation takes the most time

Issue on Windows

It seems that a custom build script does not work on Windows.

Re-running with the same shortcut: only what is needed

Without changes

Change in facet_ds()

facet_wrap(vars(dataset), ncol = 4) # <- 3
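The idea behind this invalidation can be sketched in base R: fingerprint the code of a function and compare it with the fingerprint stored at the previous run. This is an illustration only, not the actual implementation in targets; fingerprint() and the simplified stand-ins for facet_ds() are hypothetical.

```r
# Hypothetical helper: fingerprint a function from its deparsed code.
fingerprint <- function(fun) {
  path <- tempfile()
  on.exit(unlink(path))
  writeLines(deparse(fun), path)
  unname(tools::md5sum(path))
}

# Simplified stand-ins for facet_ds(): only the number of columns differs.
facet_ds_previous  <- function(ds) list(data = ds, ncol = 3)
facet_ds_unchanged <- function(ds) list(data = ds, ncol = 3)
facet_ds_edited    <- function(ds) list(data = ds, ncol = 4)

# Same code, same fingerprint: the target can be skipped.
can_skip <- fingerprint(facet_ds_previous) == fingerprint(facet_ds_unchanged)
# Edited code, new fingerprint: the target and its downstream targets re-run.
must_rerun <- fingerprint(facet_ds_previous) != fingerprint(facet_ds_edited)
```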

Dynamic branching

Often we start from multiple files

data/
├── dset_10.tsv
├── dset_11.tsv
├── dset_12.tsv
├── dset_13.tsv
├── dset_1.tsv
├── dset_2.tsv
├── dset_3.tsv
├── dset_4.tsv
├── dset_5.tsv
├── dset_6.tsv
├── dset_7.tsv
├── dset_8.tsv
└── dset_9.tsv

And we want to apply the same treatment to each

Functional programming again, iteration for what’s needed

Done with the pattern = map() argument. Use cross() for combinations.

tar_target(ds, read_tsv(dset, show_col_types = FALSE),
           pattern = map(dset)),
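The difference between map() and cross() can be pictured with a base R analogy, a sketch of the semantics rather than how targets builds its branches: map() pairs elements position by position, cross() takes every combination. The vectors x and y are toy inputs.

```r
x <- c("a", "b")
y <- c(1, 2)

# map(x, y)-like: elements are paired position by position, 2 branches
mapped <- unname(unlist(Map(function(xi, yi) paste(xi, yi), x, y)))

# cross(x, y)-like: every combination of x and y, 4 branches
grid <- expand.grid(x = x, y = y, stringsAsFactors = FALSE)
crossed <- paste(grid$x, grid$y)

n_mapped  <- length(mapped)
n_crossed <- length(crossed)
```

The number of branches grows as the product of the input lengths with cross(), which is worth keeping in mind before crossing long inputs.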

Directly with tar_files_input() (pair of targets)

Changing one input file

Re-run only one file and downstream dependencies

✔ skipped target dset_files
[...]
✔ skipped branch dset_1357daeb5edc5b3b
▶ dispatched branch dset_376af7da24ddcfc7
● completed branch dset_376af7da24ddcfc7 [0.001 seconds]
✔ skipped branch dset_fc156975d3544187
[...]
✔ skipped branch ds_4bc1a3d4ea6fdf12
▶ dispatched branch ds_501bf242796ba6b2
● completed branch ds_501bf242796ba6b2 [0.892 seconds]
✔ skipped branch ds_c601ea8afad80c5f
● completed pattern ds
✔ skip branch summary_stat_ad2f392a
[...]
✔ skipped branch summary_stat_aad2733c0eca3cae
▶ dispatched branch summary_stat_0f7ac98a50809586
● completed branch summary_stat_0f7ac98a50809586 [0.02 seconds]
✔ skipped branch summary_stat_9cefee38f54d6115
[...]
✔ skipped branch plots_aad2733c0eca3cae
▶ dispatched branch plots_0f7ac98a50809586
● completed branch plots_0f7ac98a50809586 [0.031 seconds]
● completed pattern plots
▶ dispatched target report
● completed target report [13.378 seconds]
▶ ended pipeline [16.281 seconds]

Automatic aggregation

For vectors/tibbles, aggregation happens directly

> tar_read(ds)
# A tibble: 1,846 × 3
   dataset     x     y
   <chr>   <dbl> <dbl>
 1 away     32.3  61.4
 2 away     53.4  26.2
 3 away     63.9  30.8

Use branches for subsetting

> tar_read(ds, branches = 2L)
# A tibble: 142 × 3
   dataset     x     y
   <chr>   <dbl> <dbl>
 1 star     58.2  91.9
 2 star     58.2  92.2
 3 star     58.7  90.3
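What this aggregation amounts to can be sketched in base R with toy branches (two rows each, values taken from the outputs above): reading the whole target behaves like row-binding the branches, while branches = 2L behaves like picking one element. This is an analogy, not the code targets runs.

```r
# Two toy branches standing for two of the input files.
branches <- list(
  data.frame(dataset = "away", x = c(32.3, 53.4), y = c(61.4, 26.2)),
  data.frame(dataset = "star", x = c(58.2, 58.7), y = c(91.9, 90.3))
)

# Like tar_read(ds): branches are row-bound into a single table.
aggregated <- do.call(rbind, branches)

# Like tar_read(ds, branches = 2L): only the second branch.
second <- branches[[2L]]

n_rows_all    <- nrow(aggregated)
n_rows_second <- nrow(second)
```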

For plots, use iteration = "list"

tar_target(plots, ggplot(ds, aes(x, y)) +
             geom_point() +
             labs(title = unique(ds$dataset)),
           pattern = map(ds),
           iteration = "list")
> tar_read(plots, branches = 2L)
## $plots_a55f1afc

Then this list can be used by patchwork

library(patchwork)
wrap_plots(tar_read(plots)) +
   plot_annotation(title = "13 datasets bundled 
     with patchwork") & theme_void()

Static branching, with dynamic inside

Dynamic branch names are not meaningful, just hashes

Multi-folders input data

We still have multiple files per folder

circles/
├── dset_2.tsv
└── dset_3.tsv
lines/
├── dset_11.tsv
├── dset_12.tsv
├── dset_13.tsv
├── dset_6.tsv
├── dset_7.tsv
├── dset_8.tsv
└── dset_9.tsv
others/
├── dset_10.tsv
├── dset_1.tsv
├── dset_4.tsv
└── dset_5.tsv

Dynamic vs Static

Dynamic                                  | Static
Pipeline creates new targets at runtime. | All targets defined in advance.
Cryptic target names.                    | Friendly target names.
Scales to hundreds of branches.          | Does not scale as easily for tar_visnetwork() etc.
No metaprogramming required.             | Familiarity with metaprogramming is helpful.

Static branching is most useful for a smaller number of heterogeneous targets.

Dynamic within static, best of both worlds

More difficult to write with tar_map() (see example)

But meaningful names and combine when needed:

Use tar_manifest() to display exactly the command to be run

Parallel static branches and combine

From _targets_ds_3.R, static branches:

# Static branching with dynamic branching inside
values <- tibble(
  folders = c("lines", "circles", "others")
)

# tar_map() generates R expressions, and substitutes the desired 'values'
mapped <- tar_map(
  values = values,
  names = "folders", # to avoid targets reporting "files_lines_lines"
  tar_target(filenames, fs::dir_ls(folders, glob = "*tsv")),
  # filenames is not of format file, no checksum is done
  # we need a dynamic pattern at this step to read them dynamically too
  tar_target(files, format = "file", filenames, 
             pattern = map(filenames)),
  # Dynamic within static
  tar_target(ds, read_tsv(files, show_col_types = FALSE),
             pattern = map(files)),
  tar_target(summary_stat, summarise(ds, m_x = mean(x), m_y = mean(y)),
             pattern = map(ds)),
  tar_target(plots, ggplot(ds, aes(x, y)) +
               geom_point(),
             pattern = map(ds),
             iteration = "list"),
  # Patchwork each group into one plot
  tar_target(patch_plots, 
             wrap_plots(plots) + 
               # Title: the last bit of patch_plots_{circles,lines,others}
               plot_annotation(title = stringr::str_split_i(tar_name(), '_', -1)),
             packages = "patchwork")
)

Combining step

# We want to combine the 3 tibbles of summary stats into one tibble
# Each of them is actually composed of 2, 4 and 7 tibbles
stat_combined <- tar_combine(
  stat_summaries,
  mapped[["summary_stat"]],
  # Force evaluation using triple bang (!!!)
  command = dplyr::bind_rows(!!!.x, .id = "ds_type")
)
# And the plots now, a patchwork of patchwork
plot_combined <- tar_combine(
  plots_agg,
  mapped[["patch_plots"]],
  # Force evaluation of all patchwork plots again with triple bang!
  command = {wrap_plots(list(!!!.x), ncol = 2) + 
               plot_annotation(title = "Master Saurus")},
  packages = "patchwork"
)
# Wrap all targets in one list
list(mapped, 
     stat_combined, 
     plot_combined, 
     tar_quarto(report, "ds3.qmd"))

Manifest

tar_manifest() (paged version in ds3.qmd)

# A tibble: 21 × 4
   name                 command                                                                pattern description
   <chr>                <chr>                                                                  <chr>   <chr>      
 1 filenames_circles    "fs::dir_ls(\"circles\", glob = \"*tsv\")"                             NA      circles    
 2 filenames_others     "fs::dir_ls(\"others\", glob = \"*tsv\")"                              NA      others     
 3 filenames_lines      "fs::dir_ls(\"lines\", glob = \"*tsv\")"                               NA      lines      
 4 files_circles        "filenames_circles"                                                    map(fi… circles    
 5 files_others         "filenames_others"                                                     map(fi… others     
 6 files_lines          "filenames_lines"                                                      map(fi… lines      
 7 ds_circles           "read_tsv(files_circles, show_col_types = FALSE)"                      map(fi… circles    
 8 ds_others            "read_tsv(files_others, show_col_types = FALSE)"                       map(fi… others     
 9 ds_lines             "read_tsv(files_lines, show_col_types = FALSE)"                        map(fi… lines      
10 summary_stat_circles "summarise(ds_circles, m_x = mean(x), m_y = mean(y))"                  map(ds… circles    
11 plots_circles        "ggplot(ds_circles, aes(x, y)) + geom_point()"                         map(ds… circles    
12 summary_stat_others  "summarise(ds_others, m_x = mean(x), m_y = mean(y))"                   map(ds… others     
13 plots_others         "ggplot(ds_others, aes(x, y)) + geom_point()"                          map(ds… others     
14 plots_lines          "ggplot(ds_lines, aes(x, y)) + geom_point()"                           map(ds… lines      
15 summary_stat_lines   "summarise(ds_lines, m_x = mean(x), m_y = mean(y))"                    map(ds… lines      
16 patch_plots_circles  "wrap_plots(plots_circles) + plot_annotation(title = stringr::str_spl… NA      circles    
17 patch_plots_others   "wrap_plots(plots_others) + plot_annotation(title = stringr::str_spli… NA      others     
18 patch_plots_lines    "wrap_plots(plots_lines) + plot_annotation(title = stringr::str_split… NA      lines      
19 stat_summaries       "dplyr::bind_rows(summary_stat_lines = summary_stat_lines, \n     sum… NA      NA         
20 plots_agg            "wrap_plots(list(patch_plots_lines = patch_plots_lines, \n  …          NA      Key step t…
21 report               "tarchetypes::tar_quarto_run(args = list(input = \"ds3.qmd\", \n…      NA      Rendering …

Final plot

Descriptions, free text field

Recent addition, showing up in tar_manifest() and network

plot_combined <- tar_combine(
  plots_agg,
  mapped[["patch_plots"]],
  command = wrap_plots(list(!!!.x), ncol = 2) + plot_annotation(title = "Master Saurus"),
  packages = "patchwork",
  description = "Key step to wrap plots"
)

list(mapped, stat_combined, plot_combined, tar_quarto(report, "ds3.qmd", description = "Rendering quarto doc"))

Also useful for selection of targets using tar_described_as():

tar_manifest(names = tar_described_as(starts_with("survival model")))

Static-in-static

Dynamic branches still have cryptic names. What if we want to go fully static, where all steps are known upfront?

Nested tar_map(), a toy example:

library(targets)
library(tarchetypes)
mapped <- tar_map(
  #unlist = FALSE, # Return a nested list from tar_map()
  values = list(model = c("mod_1", "mod_2")),
  tar_target(
    distrib,
    tar_name(),
  ),
  # static in static
  tar_map(
    values = list(sim = c("A", "B")),
    tar_target(
      estim,
      paste(distrib, tar_name()),
    )
  )
)
combined <- tar_combine(combi, 
                        # select all estimations
                        tar_select_targets(mapped, starts_with("estim")), 
                        command = paste(!!!.x))
list(mapped, combined)

No more square targets, no pattern = map(...)

Full static for datasauRus, _targets_ds_4.R

mapped <- tar_map(
  values = values,
  names = "names", # to avoid targets reporting "files_data.lines"
  # special pair of targets
  # readr is in charge of the aggregation (bind_rows())
  tar_file_read(files, fs::dir_ls(folders, glob = "*tsv"), read_tsv(file = !!.x, show_col_types = FALSE)),
  # nested tar_map
  tar_map(
    values = list(funs = c("mean", "sd")),
    tar_target(summary, summarise(files, x_sum = funs(x), y_sum = funs(y)))
  )
)
mcombined <- tar_combine(mean_combine, 
                         # tarchetypes helper to select all averages 
                         tar_select_targets(mapped, contains("_mean_")),
                         # .x placeholder all matching targets
                         # !!! unquote-splice operator
                         command = bind_rows(!!!.x, .id = "set"))

scombined <- tar_combine(sd_combine, 
                         # tarchetypes helper to select all standard deviations
                         tar_select_targets(mapped, contains("_sd_")),
                         # .x placeholder all matching targets
                         # !!! unquote-splice operator
                         command = bind_rows(!!!.x, .id = "set"))

combi <- tar_combine(stats, mcombined, scombined)

list(mapped, mcombined, scombined, combi)

Corresponding DAG

> tar_read(mean_combine)
# A tibble: 3 × 3
  set                  x_sum y_sum
  <chr>                <dbl> <dbl>
1 summary_mean_circles  54.3  47.8
2 summary_mean_lines    54.3  47.8
3 summary_mean_others   54.3  47.8

And final stat object:

> tar_read(stats)
# A tibble: 6 × 3
  set                  x_sum y_sum
  <chr>                <dbl> <dbl>
1 summary_mean_circles  54.3  47.8
2 summary_mean_lines    54.3  47.8
3 summary_mean_others   54.3  47.8
4 summary_sd_circles    16.7  26.9
5 summary_sd_lines      16.7  26.9
6 summary_sd_others     16.7  26.9

Better project design

Thinking about what makes a good target helps the coding tremendously. Good targets:

  1. Are large enough to subtract a decent amount of runtime when skipped.
  2. Are small enough that some targets can be skipped even if others need to run.
  3. Invoke no side effects (tar_target(format = "file") can save files.)
  4. Return a single value that is:
    • Easy to understand and introspect.
    • Meaningful to the project […]

William Landau

Data storage, rds is the default, but quite slow

Watch out

For malicious promises!

Relevant blog post: CVE-2024-27322 Should Never Have Been Assigned And R Data Files Are Still Super Risky Even In R 4.4.0 by Bob Rudis

From {tarchetypes}:

  • tar_fst_tbl() for tibbles ({fst})

  • tar_qs() for lists (Quick serialization of objects {qs})

Excellent possibilities for debugging

  • Finish the pipeline anyway
    • tar_option_set(error = "null")
    • Useful for dynamic branching
  • Error messages
    • tar_meta(fields = error, complete_only = TRUE)
  • Save a targets workspace
    • tar_option_set(workspace_on_error = TRUE)
    • list workspaces: tar_workspaces()
    • load one: tar_workspace(analysis_02de2921); all objects and variables are then visible interactively
    • also: tar_traceback(analysis_02de2921)
  • Pause the pipeline with the targets debug option.
    • tar_option_set(debug = "analysis_58_b59aa384")
    • see example

Simplify the layers

Remember that all code runs in a fresh session, so it needs to load its package dependencies.

To avoid it:

  • Remove {callr}: tar_make(callr_function = NULL)

  • Or the opposite, remove {targets}:

# What about just {callr} without {targets}?
callr::r( # same error
  func = function() {
    set.seed(-1012558151) # from tar_meta(name = dataset1, field = seed)
    library(targets)
    suppressMessages(tar_load_globals())
    data <- simulate_data(units = 100)
    analyze_data(data)
  },
  show = TRUE
)

Missing parts

HPC

  • {crew} for autoscaling on workers

library(targets)
library(crew)
tar_option_set(
  controller = crew_controller_local(workers = 2)
)
tar_crew()
#> # A tibble: 10 × 5
#>    controller    worker launches seconds targets
#>    <chr>          <int>    <int>   <dbl>   <int>
#>  1 my_controller      1        1   103.      104
#>  2 my_controller      2        1   100.      100
  • {crew.cluster} for job scheduler submission
    • sge
    • slurm

Cloud computing

  • AWS
  • GCP

Before we stop

Highlights

  • targets, dependencies manager, re-run what’s needed

William Landau intro:

Further reading 📚

Acknowledgments 🙏 👏

  • Eric Koncina early adopter of targets
  • William Landau main developer of targets

Thank you for your attention!