Controlling Splitting Behavior

Gabriel Becker

2024-09-20

Controlling Facet Levels

Provided Functions

By default, split_*_by(varname, ...) generates a facet for each level the variable varname takes in the data - including unobserved ones in the factor case. This behavior can be customized in various ways.
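As a minimal sketch of this default, the block below uses a small, hypothetical data set (not the vehicle data introduced later) in which the factor level "c" is declared but never observed. The names df, grp, and x are made up for illustration.

```r
library(rtables)

# Toy data (hypothetical): level "c" is declared but has no observations
df <- data.frame(
  grp = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  x = c(1, 2, 3)
)

# Default level-based faceting: no split_fun is supplied
lyt <- basic_table() %>%
  split_rows_by("grp") %>%
  analyze("x")

tbl <- build_table(lyt, df)
tbl

# Row labels of the result; the unobserved level "c" is among them
row.names(tbl)
```

The facet for "c" is generated even though no rows fall into it, which is the behavior the split functions discussed below are meant to adjust.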

The most straightforward way to customize which facets are generated by a split is with one of the split functions or split function families provided by rtables.

These predefined split functions and function factories implement commonly desired customization patterns of splitting behavior (i.e., faceting behavior). They include:

The first four of these are fairly self-describing and for brevity, we refer our readers to ?split_funcs for details including working examples.
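As a hedged illustration, the sketch below applies two of the functions documented under ?split_funcs, drop_split_levels and keep_split_levels, to a small hypothetical data set. The data and object names are assumptions made for this example only.

```r
library(rtables)

# Toy data (hypothetical): level "c" is declared but never observed
df <- data.frame(
  grp = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  x = c(1, 2, 3)
)

# drop_split_levels is used directly as a split function:
# the unobserved level "c" no longer generates a facet
tbl_drop <- basic_table() %>%
  split_rows_by("grp", split_fun = drop_split_levels) %>%
  analyze("x") %>%
  build_table(df)

# keep_split_levels() is a factory: it returns a split function
# which keeps only the named level(s)
tbl_keep <- basic_table() %>%
  split_rows_by("grp", split_fun = keep_split_levels("a")) %>%
  analyze("x") %>%
  build_table(df)

row.names(tbl_drop)
row.names(tbl_keep)
```

Note the difference in usage: drop_split_levels is passed as is, while keep_split_levels is called with arguments and its return value is passed as split_fun.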

Controlling Combinations of Levels Across Multiple Variables

Often with nested splitting involving multiple variables, the values of the variables in question are logically nested, meaning that certain values of the inner variable are only coherent in combination with a specific value or values of the outer variable.

As an example, suppose we have a variable vehicle_class, which can take the values "auto" and "boat", and a variable vehicle_type, which can take the values "car", "truck", "suv", "sailboat", and "cruiseliner". The combination ("auto", "cruiseliner") does not make sense and will never occur in any (correctly cleaned) data set; nor does the combination ("boat", "truck").

We will showcase strategies to deal with this in the next sections using the following artificial data:

set.seed(0)
levs_type <- c("car", "truck", "suv", "sailboat", "cruiseliner")

vclass <- sample(c("auto", "boat"), 1000, replace = TRUE)
auto_inds <- which(vclass == "auto")
vtype <- rep(NA_character_, 1000)
vtype[auto_inds] <- sample(
  c("car", "truck"), ## suv missing on purpose
  length(auto_inds),
  replace = TRUE
)
vtype[-auto_inds] <- sample(
  c("sailboat", "cruiseliner"),
  1000 - length(auto_inds),
  replace = TRUE
)

vehic_data <- data.frame(
  vehicle_class = factor(vclass),
  vehicle_type = factor(vtype, levels = levs_type),
  color = sample(
    c("white", "black", "red"), 1000,
    prob = c(1, 2, 1),
    replace = TRUE
  ),
  cost = ifelse(
    vclass == "boat",
    rnorm(1000, 100000, sd = 5000),
    rnorm(1000, 40000, sd = 5000)
  )
)
head(vehic_data)
#>   vehicle_class vehicle_type color      cost
#> 1          boat     sailboat black 100393.81
#> 2          auto          car white  38150.17
#> 3          boat     sailboat white  98696.13
#> 4          auto        truck white  37677.16
#> 5          auto        truck black  38489.27
#> 6          boat  cruiseliner black 108709.72

trim_levels_in_group

The trim_levels_in_group split function factory creates split functions which deal with this issue empirically: any combination which is observed in the data being tabulated will appear as nested facets within the table, while combinations which are not observed will not.

If we use default level-based faceting, we get several logically incoherent cells within our table:

library(rtables)

lyt <- basic_table() %>%
  split_cols_by("color") %>%
  split_rows_by("vehicle_class") %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt, vehic_data)
#>                   black      white        red   
#> ————————————————————————————————————————————————
#> auto                                            
#>   car                                           
#>     Mean        40431.92    40518.92   38713.14 
#>   truck                                         
#>     Mean        40061.70    40635.74   40024.41 
#>   suv                                           
#>     Mean           NA          NA         NA    
#>   sailboat                                      
#>     Mean           NA          NA         NA    
#>   cruiseliner                                   
#>     Mean           NA          NA         NA    
#> boat                                            
#>   car                                           
#>     Mean           NA          NA         NA    
#>   truck                                         
#>     Mean           NA          NA         NA    
#>   suv                                           
#>     Mean           NA          NA         NA    
#>   sailboat                                      
#>     Mean        99349.69    99996.54   101865.73
#>   cruiseliner                                   
#>     Mean        100212.00   99340.25   100363.52

This is obviously not the table we want, as the majority of its space is taken up by meaningless combinations. If we use trim_levels_in_group to trim the levels of vehicle_type separately within each level of vehicle_class, we get a table which only has meaningful combinations:

lyt2 <- basic_table() %>%
  split_cols_by("color") %>%
  split_rows_by("vehicle_class", split_fun = trim_levels_in_group("vehicle_type")) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt2, vehic_data)
#>                   black      white        red   
#> ————————————————————————————————————————————————
#> auto                                            
#>   car                                           
#>     Mean        40431.92    40518.92   38713.14 
#>   truck                                         
#>     Mean        40061.70    40635.74   40024.41 
#> boat                                            
#>   sailboat                                      
#>     Mean        99349.69    99996.54   101865.73
#>   cruiseliner                                   
#>     Mean        100212.00   99340.25   100363.52

Note, however, that it does not contain all meaningful combinations, only those that were actually observed in our data, which happens not to include the perfectly valid ("auto", "suv") combination.

To restrict level combinations to those which are valid regardless of whether the combination was observed, we must use trim_levels_to_map() instead.

trim_levels_to_map

trim_levels_to_map is similar to trim_levels_in_group in that its purpose is to avoid combinatorial explosion when nesting splitting with logically nested variables. Unlike its sibling function, however, with trim_levels_to_map we define the exact set of allowed combinations a priori, and that exact set of combinations is produced in the resulting table, regardless of whether they are observed or not.

library(tibble)
map <- tribble(
  ~vehicle_class, ~vehicle_type,
  "auto",         "truck",
  "auto",         "suv",
  "auto",         "car",
  "boat",         "sailboat",
  "boat",         "cruiseliner"
)

lyt3 <- basic_table() %>%
  split_cols_by("color") %>%
  split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt3, vehic_data)
#>                   black      white        red   
#> ————————————————————————————————————————————————
#> auto                                            
#>   car                                           
#>     Mean        40431.92    40518.92   38713.14 
#>   truck                                         
#>     Mean        40061.70    40635.74   40024.41 
#>   suv                                           
#>     Mean           NA          NA         NA    
#> boat                                            
#>   sailboat                                      
#>     Mean        99349.69    99996.54   101865.73
#>   cruiseliner                                   
#>     Mean        100212.00   99340.25   100363.52

Now we see that the "auto", "suv" combination is again present, even though it is populated with NAs (because there is no data in that category), but the logically invalid combinations are still absent.
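A map does not have to be typed out in full by hand. The base R sketch below, which rests on the assumption that trim_levels_to_map accepts an ordinary data.frame with character columns, starts from the combinations observed in a small hypothetical data set and appends a combination that is known to be valid but was not observed. The names obs, observed, extra, and map2 are made up for illustration.

```r
# Toy data (hypothetical), mirroring the structure of the vehicle data
obs <- data.frame(
  vehicle_class = factor(c("auto", "auto", "boat", "boat", "auto")),
  vehicle_type = factor(
    c("car", "truck", "sailboat", "cruiseliner", "car"),
    levels = c("car", "truck", "suv", "sailboat", "cruiseliner")
  )
)

# Observed combinations, one row each, converted to character columns
observed <- unique(obs[, c("vehicle_class", "vehicle_type")])
observed[] <- lapply(observed, as.character)

# A combination known a priori to be valid, but absent from the data
extra <- data.frame(
  vehicle_class = "auto",
  vehicle_type = "suv",
  stringsAsFactors = FALSE
)

# Combined map: observed combinations plus the extra valid one
map2 <- rbind(observed, extra)
rownames(map2) <- NULL
map2
```

Under the stated assumption, map2 could then be passed as trim_levels_to_map(map2) in place of a fully hand-written map.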

Combining Levels

Another very common manipulation of faceting in a table context is the introduction of combination levels that are not explicitly modeled in the data. Most often, this involves the addition of an “overall” category, but in both principle and practice it can involve any arbitrary combination of levels.

rtables explicitly supports this via the add_overall_level (for the overall case) and add_combo_levels split function factories.

add_overall_level

add_overall_level accepts valname, the name of the new level, as well as label and first (whether the new level should come first, if TRUE, or last, if FALSE, in the ordering).

Building further on our arbitrary vehicles table, we can use this to create an “all colors” category:

lyt4 <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("color", split_fun = add_overall_level("allcolors", label = "All Colors")) %>%
  split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt4, vehic_data)
#>                 All Colors     black      white        red   
#>                  (N=1000)     (N=521)    (N=251)     (N=228) 
#> —————————————————————————————————————————————————————————————
#> auto                                                         
#>   car                                                        
#>     Mean         40095.49    40431.92    40518.92   38713.14 
#>   truck                                                      
#>     Mean         40194.68    40061.70    40635.74   40024.41 
#>   suv                                                        
#>     Mean            NA          NA          NA         NA    
#> boat                                                         
#>   sailboat                                                   
#>     Mean        100133.22    99349.69    99996.54   101865.73
#>   cruiseliner                                                
#>     Mean        100036.76    100212.00   99340.25   100363.52

With the column counts turned on, we can see that the “All Colors” column encompasses the full 1000 (completely fake) vehicles in our data set.
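The first argument controls where the new level is placed. As a minimal sketch on a small hypothetical data set (the data and object names are assumptions for this example), setting first = FALSE should move the overall column to the end:

```r
library(rtables)

# Toy data (hypothetical): two black vehicles, one red
df <- data.frame(
  color = factor(c("red", "black", "black")),
  cost = c(10, 20, 30)
)

lyt <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by(
    "color",
    # first = FALSE requests that the overall level come last
    split_fun = add_overall_level("allcolors", label = "All Colors", first = FALSE)
  ) %>%
  analyze("cost")

tbl <- build_table(lyt, df)
tbl

# Column counts, in column order
col_counts(tbl)
```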

To add more arbitrary combinations, we use add_combo_levels.

add_combo_levels

add_combo_levels allows us to add one or more arbitrary combination levels to the faceting structure of our table.

We do this by defining a combination data.frame which describes the levels we want to add. A combination data.frame has the following columns and one row for each combination to add:

  • valname - string indicating the name of the value, which will appear in paths.
  • label - a string indicating the label which should be displayed when rendering.
  • levelcombo - character vector of the individual levels to be combined in this combination level.
  • exargs - a list (usually list()) of extra arguments which should be passed to analysis and content functions when tabulated within this column or row.
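The columns listed above can also be assembled without tibble. The base R sketch below builds such a data.frame for two hypothetical combinations, on the assumption that add_combo_levels accepts a plain data.frame whose levelcombo and exargs columns are list columns. The name combodf_base and the combinations themselves are made up for illustration.

```r
# One row per combination level; character columns first
combodf_base <- data.frame(
  valname = c("whitered", "anycolor"),
  label = c("White or Red", "Any Color"),
  stringsAsFactors = FALSE
)

# levelcombo and exargs hold one list element per row (list columns)
combodf_base$levelcombo <- list(
  c("white", "red"),
  c("white", "black", "red")
)
combodf_base$exargs <- list(list(), list())

str(combodf_base)
```

Assigning the list columns after construction avoids data.frame() trying to expand each list element into its own column.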

Suppose we wanted combination levels for all non-white colors, and for white and black colors. We do this like so:

combodf <- tribble(
  ~valname, ~label, ~levelcombo, ~exargs,
  "non-white", "Non-White", c("black", "red"), list(),
  "blackwhite", "Black or White", c("black", "white"), list()
)


lyt5 <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("color", split_fun = add_combo_levels(combodf)) %>%
  split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt5, vehic_data)
#>                   black      white        red      Non-White   Black or White
#>                  (N=521)    (N=251)     (N=228)     (N=749)       (N=772)    
#> —————————————————————————————————————————————————————————————————————————————
#> auto                                                                         
#>   car                                                                        
#>     Mean        40431.92    40518.92   38713.14    39944.93       40460.77   
#>   truck                                                                      
#>     Mean        40061.70    40635.74   40024.41    40050.66       40243.57   
#>   suv                                                                        
#>     Mean           NA          NA         NA          NA             NA      
#> boat                                                                         
#>   sailboat                                                                   
#>     Mean        99349.69    99996.54   101865.73   100179.72      99567.50   
#>   cruiseliner                                                                
#>     Mean        100212.00   99340.25   100363.52   100258.56      99937.47

Fully Customizing Split (Facet) Behavior

Beyond the ability to select common splitting customizations from the split functions and split function factories rtables provides, we can also fully customize every aspect of splitting behavior by creating our own split functions. While it is possible to do so by hand, the primary way we do this is via the make_split_fun() function, which accepts functions implementing different component behaviors and combines them into a split function which can be used in a layout.

Splitting, or faceting as it is done in rtables, can be thought of as the combination of 3 steps:

  1. preprocessing - transformation of the incoming data which will be faceted.
  2. splitting - mapping the incoming data to a set of 1 or more subsets representing individual facets.
  3. postprocessing - operations on the facets, e.g., combining them, removing them, etc.

The make_split_fun() function allows us to specify custom behaviors for each of these steps independently when defining custom splitting behavior via the pre, core_split, and post arguments, which dictate the above steps, respectively.
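The three steps can be pictured with a purely conceptual base R sketch. This is not how rtables implements splitting internally; facet_sketch and its helper functions are hypothetical names used only to show how independent pre, core, and post stages compose.

```r
# Conceptual sketch only: NOT the rtables implementation
facet_sketch <- function(df, var, pre = identity, post = identity) {
  df <- pre(df)                                 # 1. preprocessing
  facets <- split(df, df[[var]], drop = FALSE)  # 2. core splitting
  post(facets)                                  # 3. postprocessing
}

# Toy data (hypothetical): level "white" is declared but unobserved
toy <- data.frame(
  color = factor(c("red", "black", "black"), levels = c("black", "red", "white")),
  cost = c(10, 20, 30)
)

# pre: drop unobserved factor levels
drop_unobserved <- function(df) droplevels(df)

# post: reorder facets from fewest to most rows
smallest_first <- function(facets) {
  facets[order(vapply(facets, nrow, integer(1)))]
}

# No customization: one facet per level, including the empty one
names(facet_sketch(toy, "color"))

# Customized pre and post steps around the same core split
res <- facet_sketch(toy, "color", pre = drop_unobserved, post = smallest_first)
names(res)
```

Because each stage only sees the output of the previous one, the pre and post pieces can be swapped or combined without touching the core split, which is the design make_split_fun() exposes.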

The pre argument accepts zero or more pre-processing functions, which must accept df, spl, vals, and labels, and can optionally accept .spl_context. They then manipulate df (the incoming data for the split) and return a modified data.frame. This modified data.frame must contain all columns present in the incoming data.frame, but can add columns if necessary. Note, however, that these new columns cannot be used in the layout as split or analysis variables, because they will not be present when validity checking is done.

The pre-processing component is useful for things such as manipulating factor levels, e.g., to trim unobserved ones or to reorder levels based on observed counts, etc.
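As a hedged sketch of the second use case, the hypothetical pre-processing function below reorders the levels of the split variable so that the most frequently observed level comes first. The names order_by_count and count_splfun, and the toy data, are assumptions made for this example and are not part of rtables.

```r
library(rtables)

# Hypothetical pre-processing function: most frequent level first
order_by_count <- function(df, spl, vals, labels, ...) {
  var <- spl_variable(spl)
  # Tabulate the observed values and sort by decreasing frequency
  counts <- sort(table(df[[var]]), decreasing = TRUE)
  # Rebuild the factor with levels in that order; observations are unchanged
  df[[var]] <- factor(as.character(df[[var]]), levels = names(counts))
  df
}

count_splfun <- make_split_fun(pre = list(order_by_count))

# Toy data (hypothetical): "b" occurs 3 times, "c" twice, "a" once
df <- data.frame(
  grp = factor(c("a", "b", "b", "b", "c", "c")),
  x = c(1, 2, 3, 4, 5, 6)
)

tbl <- basic_table() %>%
  split_rows_by("grp", split_fun = count_splfun) %>%
  analyze("x") %>%
  build_table(df)
tbl
```

The function returns the full data.frame with all of its original columns, as required of pre-processing functions.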

For a more detailed discussion on what custom split functions do, and an example of a custom split function not implemented via make_split_fun(), see ?custom_split_funs.

An Example Custom Split Function

Here we will implement an arbitrary, custom split function where we specify both pre- and post-processing instructions. It is unusual for users to need to override the core splitting logic (and, in fact, doing so is currently only supported in row space), so we leave it out of our example here but will provide another narrow example of that usage below.

An Illustrative Example of A Custom Split Function

First, we define two aspects of ‘pre-processing step’ behavior:

  1. A function which reverses the order of the levels of a variable (while retaining which level is associated with which observation), and
  2. A function factory which creates a function that removes a level and the data associated with it.

## reverse order of levels
rev_lev <- function(df, spl, vals, labels, ...) {
  ## in the split_rows_by() and split_cols_by() cases,
  ## spl_variable() gives us the variable
  var <- spl_variable(spl)
  vec <- df[[var]]
  levs <- if (is.character(vec)) unique(vec) else levels(vec)
  df[[var]] <- factor(vec, levels = rev(levs))
  df
}

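## factory: returns a pre-processing function which removes the level torem
## and the observations associated with it
rem_lev_facet <- function(torem) {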
  function(df, spl, vals, labels, ...) {
    var <- spl_variable(spl)
    vec <- df[[var]]
    bad <- vec == torem
    df <- df[!bad, ]
    levs <- if (is.character(vec)) unique(vec) else levels(vec)
    df[[var]] <- factor(as.character(vec[!bad]), levels = setdiff(levs, torem))
    df
  }
}

Next, we implement our post-processing function. Here we will reorder the facets based on the amount of data each of them represents.

sort_them_facets <- function(splret, spl, fulldf, ...) {
  ord <- order(sapply(splret$datasplit, nrow))
  make_split_result(
    splret$values[ord],
    splret$datasplit[ord],
    splret$labels[ord]
  )
}

Finally, we construct our custom split function and use it to create our table:

silly_splfun1 <- make_split_fun(
  pre = list(
    rev_lev,
    rem_lev_facet("white")
  ),
  post = list(sort_them_facets)
)

lyt6 <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("color", split_fun = silly_splfun1) %>%
  split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt6, vehic_data)
#>                    red        black  
#>                  (N=228)     (N=521) 
#> —————————————————————————————————————
#> auto                                 
#>   car                                
#>     Mean        38713.14    40431.92 
#>   truck                              
#>     Mean        40024.41    40061.70 
#>   suv                                
#>     Mean           NA          NA    
#> boat                                 
#>   sailboat                           
#>     Mean        101865.73   99349.69 
#>   cruiseliner                        
#>     Mean        100363.52   100212.00

Overriding the Core Split Function

Currently, overriding core split behavior is only supported in functions used for row splits.

Next, we write a custom core-splitting function which divides the observations into 4 groups: the first 100, observations 101-500, observations 501-900, and the last 100. We could claim this was to test for structural bias in the first and last observations, but really it is simply to illustrate overriding the core splitting machinery and has no meaningful statistical purpose.

silly_core_split <- function(spl, df, vals, labels, .spl_context) {
  make_split_result(
    c("first", "lowmid", "highmid", "last"),
    datasplit = list(
      df[1:100, ],
      df[101:500, ],
      df[501:900, ],
      df[901:1000, ]
    ),
    labels = c(
      "first 100",
      "obs 101-500",
      "obs 501-900",
      "last 100"
    )
  )
}

We can use this to construct a splitting function. This can be combined with pre- and post-processing functions, as each of the stages is performed independently, but in this case we won’t, because our core splitting behavior is such that pre- or post-processing does not make much sense.

even_sillier_splfun <- make_split_fun(core_split = silly_core_split)

lyt7 <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("color") %>%
  split_rows_by("vehicle_class", split_fun = even_sillier_splfun) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt7, vehic_data)
#>                   black       white        red   
#>                  (N=521)     (N=251)     (N=228) 
#> —————————————————————————————————————————————————
#> first 100                                        
#>   car                                            
#>     Mean        40496.05    37785.41    37623.17 
#>   truck                                          
#>     Mean        41094.17    40437.29    37866.81 
#>   suv                                            
#>     Mean           NA          NA          NA    
#>   sailboat                                       
#>     Mean        100560.80   102017.05   101185.96
#>   cruiseliner                                    
#>     Mean        100838.12   96952.27    100610.71
#> obs 101-500                                      
#>   car                                            
#>     Mean        39350.88    41185.98    37978.72 
#>   truck                                          
#>     Mean        40166.87    41385.32    39885.72 
#>   suv                                            
#>     Mean           NA          NA          NA    
#>   sailboat                                       
#>     Mean        98845.47    99563.02    101462.79
#>   cruiseliner                                    
#>     Mean        101558.62   99039.91    97335.05 
#> obs 501-900                                      
#>   car                                            
#>     Mean        40721.82    40379.48    38681.26 
#>   truck                                          
#>     Mean        39951.92    39846.89    39840.39 
#>   suv                                            
#>     Mean           NA          NA          NA    
#>   sailboat                                       
#>     Mean        99533.20    100347.18   102732.12
#>   cruiseliner                                    
#>     Mean        99140.43    100074.43   101994.99
#> last 100                                         
#>   car                                            
#>     Mean        45204.44    40626.95    41214.33 
#>   truck                                          
#>     Mean        38920.70    40620.47    42899.14 
#>   suv                                            
#>     Mean           NA          NA          NA    
#>   sailboat                                       
#>     Mean        99380.21    97644.77    101691.92
#>   cruiseliner                                    
#>     Mean        100017.53   99581.94    100751.30

Design of Pre- and Post-Processing Functions For Use in make_split_fun

Pre-processing and post-processing functions in the custom-splitting context are best thought of as (and implemented as) independent, atomic building blocks for the desired overall behavior. This allows them to be reused in a flexible mix-and-match way.

rtables provides several behavior components implemented as either functions or function factories:

  • Pre-processing “behavior blocks”
    • drop_facet_levels - drop unobserved levels in the variable being split
  • Post-processing “behavior blocks”
    • trim_levels_in_facets - provides trim_levels_in_group behavior
    • add_overall_facet - add a combination facet for the full data
    • add_combo_facet - add a single combination facet (can be used more than once in a single make_split_fun call)
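As a minimal sketch of mixing these blocks, the example below combines drop_facet_levels as a pre-processing step with add_overall_facet as a post-processing step. The toy data and the names block_splfun and "allcolors" are assumptions made for this illustration.

```r
library(rtables)

# Toy data (hypothetical): level "white" is declared but unobserved
df <- data.frame(
  color = factor(c("red", "black", "black"), levels = c("black", "red", "white")),
  cost = c(10, 20, 30)
)

block_splfun <- make_split_fun(
  # pre: drop unobserved levels of the variable being split
  pre = list(drop_facet_levels),
  # post: append a facet covering the full data
  post = list(add_overall_facet("allcolors", "All Colors"))
)

tbl <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("color", split_fun = block_splfun) %>%
  analyze("cost") %>%
  build_table(df)
tbl

# Column counts, in column order
col_counts(tbl)
```

Because each block handles exactly one concern, replacing or adding a behavior (for example, a further add_combo_facet call in post) does not require rewriting the others.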