tidysq
package is meant to store and conduct operations
on biological sequences. This vignette provides a guide to basic usage
of tidysq
, i.e. reading, manipulating and writing sequences
to file.
The most recent version of tidysq
can be installed with
install_github()
function from devtools
.
Biological sequences can be and often are represented as strings –
sequences of letters. For example, a DNA sequence can take the form of
"TAGGCCCTAGACCTG"
, where A
means adenine,
C
– cytosine, G
– guanine and T
–
thymine. Exact IUPAC recommendations for one-letter codes can be found
in Cornish-Bowden A. Nomenclature for incompletely specified bases
in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. 1985
May 10;13(9):3021-30. doi: 10.1093/nar/13.9.3021. PMID: 2582368; PMCID:
PMC341218.
Within tidysq
package sequence data is stored in
sq
objects, that is, vectors of biological sequences. They
can be created from string vectors as above:
sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG"))
sq_dna
#> basic DNA sequences list:
#> [1] TAGGCCCTAGACCTG <15>
#> [2] TAGGCCCTGGGCATG <15>
There are several thing to note. First, each sequence is an element
of sq
object. Many operations are vectorized — they are
applied to all sequences of a vector — and sq
objects are
no different in this regard. Second, the first line of output says:
basic DNA sequences list
. This means that all sequences of
this object are of DNA type and do not use ambiguous letters (more about
that in “Advanced alphabet techniques” vignette).
Manipulating sequence objects is an integral part of
tidysq
. sq
objects can be easily subsetted
using usual R syntax:
Extracting subsequences is a bit more complicated than that — because
it uses designated function bite()
. Its syntax, however,
closely resembles that of base R — indexing starts with one and negative
indices are interpreted as “anything except that”. It returns an
sq
object with all sequences subsetted:
bite(sq_dna, 5:10)
#> basic DNA sequences list:
#> [1] CCCTAG <6>
#> [2] CCCTGG <6>
bite(sq_dna, c(-9, -11, -13))
#> basic DNA sequences list:
#> [1] TAGGCCCTGCTG <12>
#> [2] TAGGCCCTGCTG <12>
It’s possible to reverse sequences using this function:
# Don't do it like that!
bite(sq_dna, 15:1)
#> basic DNA sequences list:
#> [1] GTCCAGATCCCGGAT <15>
#> [2] GTACGGGTCCCGGAT <15>
However, this usage is strongly discouraged, because it’s both
ineffective and works badly with sequences of different lengths.
Instead, there is a designated function reverse()
:
reverse(sq_dna)
#> basic DNA sequences list:
#> [1] GTCCAGATCCCGGAT <15>
#> [2] GTACGGGTCCCGGAT <15>
Note that it is very different to base rev()
, which
reverses only the order of sequences, not letters:
We can combine two or more sq
objects using base
c()
function:
tidysq
offers two functions specific to DNA/RNA
sequences, namely complement()
and
translate()
. The former creates sequences with
complementary bases, that is, replaces A
with
T
, C
with G
and vice
versa. The latter translates input to amino acid sequences using the
translation table with three-letter codons.
These functions can be called as shown below:
complement(sq_dna)
#> basic DNA sequences list:
#> [1] ATCCGGGATCTGGAC <15>
#> [2] ATCCGGGACCCGTAC <15>
#> [3] CAGGTCTAGGGCCTA <15>
#> [4] CATGCCCAGGGCCTA <15>
translate(sq_dna)
#> basic amino acid sequences list:
#> [1] *ALDL <5>
#> [2] *ALGM <5>
#> [3] VQIPD <5>
#> [4] VRVPD <5>
One noteworthy feature here is that translation can be done with any genetic code table of those listed on this Wikipedia page:
Motifs are short subsequences. These are often searched for in
biological sequences. tidysq
has two distinct functions
that allow the user to perform such search.
One of them is a %has%
operator that takes
sq
object and character vector as parameters respectively.
It returns a logical vector of the same length as sq
object, where each element says whether all motifs passed as strings
were found in given sequence:
sq_dna %has% "ATC"
#> [1] FALSE FALSE TRUE FALSE
# It can be used to subset sq
sq_dna[sq_dna %has% c("AG", "CC")]
#> basic DNA sequences list:
#> [1] TAGGCCCTAGACCTG <15>
#> [2] TAGGCCCTGGGCATG <15>
#> [3] GTCCAGATCCCGGAT <15>
It says nothing about motif placement within sequence nor it exact
form, however. In this case, there is find_motifs()
function that returns a whole tibble
(from
tibble
package; basically improved version of
data.frame
) with various info about found motifs. Important
thing to note here is that the second argument is a character vector of
sequence names to avoid embedding potentially long sequences in
resulting tibble
potentially many times:
find_motifs(sq_dna, c("seq1", "seq2", "rev1", "rev2"), c("ATC", "TAG"))
#> # A tibble: 4 × 5
#> names found sought start end
#> <chr> <dna_bsc> <chr> <int> <int>
#> 1 rev1 ATC <3> ATC 7 9
#> 2 seq1 TAG <3> TAG 1 3
#> 3 seq1 TAG <3> TAG 8 10
#> 4 seq2 TAG <3> TAG 1 3
You can also provide this function with a data.frame
(or, what we recommend, tibble
) containing one column
called sq
, containing the sequences and the other column
name
containing the names.
sqibble <- tibble::tibble(sq = sq_dna,
name = c("seq1", "seq2", "rev1", "rev2"))
# does the same as the call from previous chunk of code
find_motifs(sqibble, c("ATC", "TAG"))
#> # A tibble: 4 × 5
#> names found sought start end
#> <chr> <dna_bsc> <chr> <int> <int>
#> 1 rev1 ATC <3> ATC 7 9
#> 2 seq1 TAG <3> TAG 1 3
#> 3 seq1 TAG <3> TAG 8 10
#> 4 seq2 TAG <3> TAG 1 3
There are ambiguous DNA bases in IUPAC codes and these can be used in
motifs. One of them is "N"
— its meaning is “any of
A
, C
, G
or T
:
find_motifs(sqibble, "GNCC")
#> # A tibble: 7 × 5
#> names found sought start end
#> <chr> <dna_bsc> <chr> <int> <int>
#> 1 seq1 GGCC <4> GNCC 3 6
#> 2 seq1 GCCC <4> GNCC 4 7
#> 3 seq1 GACC <4> GNCC 10 13
#> 4 seq2 GGCC <4> GNCC 3 6
#> 5 seq2 GCCC <4> GNCC 4 7
#> 6 rev1 GTCC <4> GNCC 1 4
#> 7 rev2 GTCC <4> GNCC 7 10
This example displays the difference between "sought"
and "found"
columns. The former contains the string
representation of motif that the user was looking for, while the latter
contains a tidysq
-encoded sequence with an “instance” of
motif.
Two additional characters are reserved because of their special
meaning in motifs. "^"
means that this motif must be found
at the start of a sequence, while "$"
means the same, but
with the end instead. They can be mixed with ambiguous letters, of
course:
After doing computations the user might wish to save their sequences
for future use. One of the most popular formats for storing biological
sequences is FASTA. tidysq
allows the user to write
sequences to FASTA file with write_fasta()
function.
Important thing to remember here that the arguments for the function are
analogous to those used in find_motifs()
– either
sq
object and a vector of names or a tibble
with columns of sequences and names: