phylepic charts: combining phylogenetic trees with epidemic curves

This document steps through creating and customising phylepic charts using an example dataset.

Customising the plot

Looking to the full dataset instead of just that single clade, the default plot needs some configuration to be useful:

phydata <- phylepic(tree, metadata, name, collection_date)

plot(phydata)
#> Warning: Removed 7 rows containing non-finite outside the scale range
#> (`stat_week()`). [phylepic: plot.epicurve]

The date scale

We can choose the first day of the week to match the local conventions:

options(phylepic.week_start = "Monday")

The date range is too big for this type of chart, so we want to focus in on a time period near the outbreak clusters Clust-04 and Clust-05. By using scale_x_week, we get some conveniences such as being able to specify the breaks (tick labels) in intervals of weeks:

date_scale <- scale_x_week(
  name = "Date of sample collection",
  limits = as.Date(c("2023-08-14", "2023-11-15")),
  date_labels = "%d %b",
  week_breaks = 4L,
  week_minor_breaks = 2L
)

We can also control the relative heights of the two rows of panels in the final grid using height.tree (choosing here to make is 6 times the height of the epidemic curve). Our date scale is passed to both the relevant panels:

plot(phydata, scale.date = date_scale, height.tree = 6)
#> Warning: Removed 28 rows containing non-finite outside the scale range
#> (`stat_week()`). [phylepic: plot.epicurve]

Note that dates outside our chosen date limits are drawn on the calendar as triangles at the corresponding edge of the scale. Some rows are missing date metadata, so this allows us to distinguish them visually. This feature is enabled by the default oob (out of bounds) argument to scale_x_week.

Manipulating the tree

It would be nice to represent the genomic clusters using colours on the tree tips instead of a separate column of metadata tiles, since genomic clusters are based on the tree structure. The tree is drawn using ggraph and its dendrogram layout. In order to add a new layer of tip dots, we need to intercept the ggraph plot before it’s assembled into the final chart.

Firstly, let’s manually define a colour scale for the clusters so that we can have it be consistent between panels:

cluster_scale <- scale_colour_brewer(
  # the name affects the legend title
  name = "Cluster",
  # these 3 parameters affect the colour choice
  type = "qual",
  palette = 2,
  direction = -1,
  # don't drop unused levels; we want consistency between panels
  drop = FALSE,
  # suppress the explicit NA entry in the legend; not all tips are in a cluster
  na.translate = FALSE,
  # we'll use this scale later for both fill and colour aesthetics
  aesthetics = c("fill", "colour"),
)

To manipulate the tree, we use plot_tree, which creates the base plot to which we can add our new layers:

plot_tree(phydata) +
  # `filter = leaf` in ggraph geoms means that they only draw the tips
  ggraph::geom_node_point(aes(filter = leaf, colour = cluster), size = 2, show.legend = FALSE) +
  cluster_scale
#> Warning: Removed 11 rows containing missing values or values outside the scale range
#> (`geom_point()`). [phylepic: plot.tree]

Note that the guides from each panel will be merged in the final plot. If we don’t put show.legend = FALSE here, the guide for the cluster aesthetic will show circles instead of squares for the legend keys.

All of the columns from our metadata frame are available (for the tips) to use in ggraph’s aesthetic mappings.

Next, we’ll want to hide the redundant column of coloured tiles corresponding to the cluster. Remember that this column was drawn because the default plot.bars panel makes a column of tiles for each factor column in out data frame. We can use plot_bars to override the default behaviour. This helper takes arguments in the form <column name> = <scale definition>:

plot(
  phydata,
  plot.tree = function(x) {
    # this function will be called with x = phydata
    plot_tree(x) +
      ggraph::geom_node_point(aes(filter = leaf, colour = cluster), size = 2, show.legend = FALSE) +
      cluster_scale
  },
  plot.bars = function(x) {
    plot_bars(
      x,
      # 'source' is the name of the corresponding metadata column
      source = scale_fill_hue(
        name = "Source",
        # this just changes the colours
        h.start = 30,
        # as above, we want to turn off drop and na.translate
        drop = FALSE,
        na.translate = FALSE
      ),
      # if we wanted more tile columns, we would add them here
    )
  },
  scale.date = date_scale,
  width.tree = 20,  # new: also specify the relative widths of the 4 columns:
  width.date = 12,  #
  width.legend = 4, #
  height.tree = 6
)
#> Warning: Removed 11 rows containing missing values or values outside the scale range
#> (`geom_point()`). [phylepic: plot.tree]
#> Warning: Removed 28 rows containing non-finite outside the scale range
#> (`stat_week()`). [phylepic: plot.epicurve]

Each plot.* argument will normally be given a function that is called with a single parameter: the phylepic object provided to plot. The function must produce a ggplot object suitable for aligning in the final chart.

For plot.bars in the above, we only wanted to override the default options without adding any new custom layers. In common cases like these we can reduce the boilerplate a bit:

plot(
  phydata,
  plot.bars = function(x) {
    plot_bars(
      x,
      source = scale_fill_hue(...)
    )
  }
)

# equivalent to the above
plot(
  phydata,
  plot.bars = plot_bars(
    source = scale_fill_hue(...),
  )
)

This is because the plot_* helpers return a function instead of a ggplot if their first parameter is omitted.

Making use of the calendar and epidemic curve

The previous plot shows two clear genomic clusters associated with concurrent outbreaks. To make this point more clearly, we would like to indicate the two clusters on the epidemic curve (using the same colour scale).

To achieve this, we pass fill = cluster to both the plot.epicurve and plot.calendar panels:

plot(
  phydata,
  plot.tree = function(x) {
    plot_tree(x) +
      ggraph::geom_node_point(aes(filter = leaf, colour = cluster), size = 2, show.legend = FALSE) +
      cluster_scale
  },
  plot.bars = plot_bars(
    source = scale_fill_hue(
      name = "Source", h.start = 30, drop = FALSE, na.translate = FALSE
    ),
  ),
  plot.epicurve = plot_epicurve(fill = cluster),
  plot.calendar = plot_calendar(
    fill = cluster,
    labels = "%d",
  ),
  scale.date = date_scale,
  scale.fill = cluster_scale, # new: pass the scale to both panels
  width.tree = 20,
  width.date = 12,
  width.legend = 4,
  height.tree = 6
)
#> Warning: Removed 11 rows containing missing values or values outside the scale range
#> (`geom_point()`). [phylepic: plot.tree]
#> Warning: Dropped 8 calendar arrows with no fill or colour (`geom_calendar()`)
#> [phylepic: plot.calendar]
#> Warning: Removed 28 rows containing non-finite outside the scale range
#> (`stat_week()`). [phylepic: plot.epicurve]

Because the cluster colour scales are consistent across panels, the legend guides are automatically combined.

The warning messages coming from ggplot are annotated to indicate which panel was responsible. Here the warnings are telling us that

11 tree tips couldn’t be drawn because they have no cluster,
8 rows (a subset of those 11) have known dates that are outside the calendar axis range, but the arrows couldn’t be drawn because they have no cluster and we configured the cluster scale with na.translate = FALSE, and
28 rows didn’t contribute to the epidemic curve because their date was missing or out of range.

phylepic charts: combining phylogenetic trees with epidemic curves

Preparing the data

Basic plotting

Customising the plot

The date scale

Manipulating the tree

Making use of the calendar and epidemic curve