library(tidysq)
#>
#> Attaching package: 'tidysq'
#> The following object is masked from 'package:base':
#>
#> paste
Sequences in sq objects are compressed to take up less storage space. To achieve that, sq objects store an alphabet attribute that serves as a dictionary of possible symbols. This attribute can be accessed by its namesake function:
sq_dna <- sq(c("CTGAATGCAGT", "ATGCCGT", "CAGACCATT"))
alphabet(sq_dna)
#> <tidysq alphabet[5]>
#> [1] A C G T -
Manually assigning a different alphabet is strongly discouraged, as it may result in undesirable behavior.
Alphabets can be divided into standard and non-standard types. Both groups behave similarly, but standard alphabets offer additional functionality thanks to their biological interpretation.
Standard alphabets can be subdivided into basic and extended alphabets, two closely linked groups. Every standard alphabet corresponds to a type: if an sq object has that type, then its alphabet attribute holds that alphabet as its value.
There are three predefined basic alphabets: for DNA, RNA and amino acid sequences. Each consists of all letter codes used for bases (or residues) of the given type, as well as the gap letter “-” and (in the amino acid case) the stop letter “*”. Alphabets are stored as character vectors with an added sq_alphabet class that provides additional methods. For instance, the amino acid alphabet contains the following letters: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, -, *.
A basic DNA or RNA alphabet is required for the translate() operation.
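As a minimal sketch of that requirement, the snippet below builds a basic DNA object by passing the type name as the alphabet argument and then translates it. The sequences are made up for illustration and the call relies on the default settings of translate(); no output is shown because the printed form may differ between versions.

```r
library(tidysq)

# Hypothetical coding sequences; lengths are multiples of three
sq_cds <- sq(c("ATGGCCTGA", "ATGTTTGGA"), alphabet = "dna_bsc")

# The alphabet attribute is the basic DNA alphabet
alphabet(sq_cds)

# Translate with default settings
sq_protein <- translate(sq_cds)
sq_protein
```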
For each basic alphabet there is an extended counterpart. These three extended alphabets contain all letters from the respective basic ones and, additionally, ambiguous letters (that is, letters that mean “X-or-Y-or-Z base”, where X, Y and Z are chosen from the corresponding basic alphabet).
Both basic and extended alphabets can be obtained with the get_standard_alphabet() function. It interprets the type argument, so the user does not have to remember the exact type name (although consistent naming benefits code readability).
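The relationship between a basic alphabet and its extended counterpart can be checked with a short sketch. It uses only the type names "rna_bsc" and "rna_ext" that appear elsewhere in this text; other spellings accepted by the type interpretation are not assumed here.

```r
library(tidysq)

rna_bsc <- get_standard_alphabet("rna_bsc")
rna_ext <- get_standard_alphabet("rna_ext")

rna_bsc
rna_ext

# Every basic letter is also part of the extended alphabet
all(rna_bsc %in% rna_ext)

# The gap letter belongs to both alphabets
"-" %in% rna_bsc
"-" %in% rna_ext
```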
When an sq object has an extended type, it can be converted to the basic one with the remove_ambiguous() function. It removes either every sequence that contains an ambiguous element or just the ambiguous elements themselves, depending on the value of the by_letter parameter. In the example below, N is such an element:
sq_rna <- sq(c("UCGGNNCAGNN", "AUUCGGUGA", "CNCUUANNNCNU"))
sq_rna
#> extended RNA sequences list:
#> [1] UCGGNNCAGNN <11>
#> [2] AUUCGGUGA <9>
#> [3] CNCUUANNNCNU <12>
remove_ambiguous(sq_rna)
#> basic RNA sequences list:
#> [1] <NULL> <0>
#> [2] AUUCGGUGA <9>
#> [3] <NULL> <0>
remove_ambiguous(sq_rna, by_letter = TRUE)
#> basic RNA sequences list:
#> [1] UCGGCAG <7>
#> [2] AUUCGGUGA <9>
#> [3] CCUUACU <7>
Should the user wish to keep the original lengths of sequences unchanged, it’s more appropriate to use the substitute_letters() function instead. The most obvious replacement is the gap letter “-”, present in all standard alphabets:
substitute_letters(sq_rna, c(N = "-"))
#> atp (atypical alphabet) sequences list:
#> [1] UCGG--CAG-- <11>
#> [2] AUUCGGUGA <9>
#> [3] C-CUUA---C-U <12>
Notice, however, that the returned object has the atp type instead. Handling this is covered in the chapter about changing sq types.
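Both effects, the preserved lengths and the changed type, can be inspected programmatically. The sketch below assumes that sq_type() and get_sq_lengths() are available in the installed version of tidysq; sq_gapped is a name introduced here for illustration.

```r
library(tidysq)

sq_rna <- sq(c("UCGGNNCAGNN", "AUUCGGUGA", "CNCUUANNNCNU"))
sq_gapped <- substitute_letters(sq_rna, c(N = "-"))

# Type before and after the substitution
sq_type(sq_rna)
sq_type(sq_gapped)

# Sequence lengths stay the same
get_sq_lengths(sq_rna)
get_sq_lengths(sq_gapped)
```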
The non-standard alphabet group consists of two types: untyped (unt) and atypical (atp). The former results when no alphabet is specified and no standard alphabet contains all letters appearing in the sequences. The latter, on the other hand, is used whenever the user specifies the alphabet explicitly. The difference is best shown with calls to the sq() constructor function:
sq(c("PFN&I&VO*&P", "&IO*&PVO"))
#> unt (unspecified type) sequences list:
#> [1] PFN&I&VO*&P <11>
#> [2] &IO*&PVO <8>
sq(c("PFN&I&VO*&P", "&IO*&PVO"),
   alphabet = c("F", "I", "N", "O", "P", "V", "&", "*"))
#> atp (atypical alphabet) sequences list:
#> [1] PFN&I&VO*&P <11>
#> [2] &IO*&PVO <8>
As with standard alphabets, atypical ones can also contain more letters than actually appear in the sequences:
sq(c("PFN&I&VO*&P", "&IO*&PVO"),
   alphabet = c("E", "F", "I", "N", "O", "P", "Q", "V", "&", "*", ":"))
#> atp (atypical alphabet) sequences list:
#> [1] PFN&I&VO*&P <11>
#> [2] &IO*&PVO <8>
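The printed sequences do not reveal the unused letters, but the alphabet attribute does. A small sketch, with sq_atp as a name introduced here for illustration:

```r
library(tidysq)

sq_atp <- sq(c("PFN&I&VO*&P", "&IO*&PVO"),
             alphabet = c("E", "F", "I", "N", "O", "P", "Q", "V", "&", "*", ":"))

# Letters that never occur in the sequences are still part of the alphabet
alphabet(sq_atp)
c("E", "Q", ":") %in% alphabet(sq_atp)
```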
The main use of atypical alphabets is to let the user handle data with multicharacter letters. For example, amino acid sequences are sometimes described using three-character codes. These can be handled as shown below (although all codes should be specified, not only a handful):
sq_multichar <- sq(c("TyrGlyArgArgAsp", "AspGlyArgGly", "CysGluGlyTyrProArg"),
                   alphabet = c("Arg", "Asp", "Cys", "Glu", "Gly", "Pro", "Tyr"))
sq_multichar
#> atp (atypical alphabet) sequences list:
#> [1] Tyr Gly Arg Arg Asp <5>
#> [2] Asp Gly Arg Gly <4>
#> [3] Cys Glu Gly Tyr Pro Arg <6>
Each of these letters is treated as a whole, meaning that it is indivisible. This can be observed during a letter replacement operation.
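A minimal sketch of such a replacement follows. The target code "R" is chosen arbitrarily for illustration, and get_sq_lengths() is assumed to be available; if letters are indeed indivisible, each sequence keeps its number of letters after the substitution.

```r
library(tidysq)

sq_multichar <- sq(c("TyrGlyArgArgAsp", "AspGlyArgGly", "CysGluGlyTyrProArg"),
                   alphabet = c("Arg", "Asp", "Cys", "Glu", "Gly", "Pro", "Tyr"))

# Replace the whole three-character letter "Arg" with the hypothetical code "R"
sq_replaced <- substitute_letters(sq_multichar, c(Arg = "R"))
sq_replaced

# Number of letters per sequence, for comparison with the original object
get_sq_lengths(sq_multichar)
get_sq_lengths(sq_replaced)
```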
As shown in previous chapters, substitute_letters() returns an sq object of the atp type. If this type isn’t satisfactory, the user can call the typify() function, which creates a new sq object with the desired type (backticks are necessary when the substituted letter isn’t a valid variable name):
sq_unt <- sq(c("UCGG&&CAG&&", "AUUCGGUGA", "C&CUUA&&&C&U"))
sq_sub <- substitute_letters(sq_unt, c(`&` = "-"))
sq_sub
#> atp (atypical alphabet) sequences list:
#> [1] UCGG--CAG-- <11>
#> [2] AUUCGGUGA <9>
#> [3] C-CUUA---C-U <12>
typify(sq_sub, "rna_bsc")
#> basic RNA sequences list:
#> [1] UCGG--CAG-- <11>
#> [2] AUUCGGUGA <9>
#> [3] C-CUUA---C-U <12>
However, typify() has one requirement: the sq object being typified must not contain any letters that are absent from the target alphabet. For instance, the following call won’t work:
typify(sq_sub, "dna_bsc")
#> Error: sq object contains letters that do not appear in the alphabet of target type
The user isn’t left to guess whether a sequence contains invalid letters. The find_invalid_letters() function returns a list of character vectors, where each vector contains the invalid letters of the corresponding sequence:
find_invalid_letters(sq_sub, "dna_bsc")
#> [[1]]
#> [1] "U"
#>
#> [[2]]
#> [1] "U"
#>
#> [[3]]
#> [1] "U"
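The per-sequence result can be collapsed into a single set of offending letters and used as a guard before typify(). This is only a sketch of one possible workflow, not a pattern prescribed by the package; the variable names are introduced here for illustration.

```r
library(tidysq)

sq_unt <- sq(c("UCGG&&CAG&&", "AUUCGGUGA", "C&CUUA&&&C&U"))
sq_sub <- substitute_letters(sq_unt, c(`&` = "-"))

# Collect the distinct invalid letters across all sequences
invalid <- unique(unlist(find_invalid_letters(sq_sub, "dna_bsc")))

if (length(invalid) == 0) {
  # Safe to convert: every letter belongs to the target alphabet
  sq_dna <- typify(sq_sub, "dna_bsc")
} else {
  # Report which letters still need to be substituted
  message("Cannot typify; invalid letters: ", toString(invalid))
}
```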
All invalid letters have to be substituted before the object is passed to typify(). A more involved call that replaces every ambiguous letter with the gap letter “-” can be constructed as follows:
ambiguous_letters <- setdiff(
  get_standard_alphabet("rna_ext"),
  get_standard_alphabet("rna_bsc")
)
encoding <- rep("-", length(ambiguous_letters))
names(encoding) <- ambiguous_letters
encoding
#> W S M K R Y B D H V N
#> "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-"
sq_rna_sub <- substitute_letters(sq_rna, encoding)
typify(sq_rna_sub, "rna_bsc")
#> basic RNA sequences list:
#> [1] UCGG--CAG-- <11>
#> [2] AUUCGGUGA <9>
#> [3] C-CUUA---C-U <12>