CNVreg
Package to Perform Copy Number Variant Association Analysis with Penalized Regression
# load the package
library("CNVreg")
The CNVreg package provides functions to perform copy number variant (CNV) association analysis with a penalized regression model.
This package represents the CNVs over a genomic region as a piecewise constant curve that captures both the dosage and the length of each CNV. Association is then evaluated by regressing the outcome trait on all CNV fragments in the region while adjusting for covariates, which yields a CNV effect estimate at each genomic position. The penalized regression model combines a Lasso penalty, which performs variable selection, with a weighted fusion penalty, which encourages adjacent CNVs to share similar effect sizes.
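As a rough illustration of the piecewise constant representation, the following base R sketch cuts a region at every CNV breakpoint and records dosage times length for each fragment and individual. The toy coordinates, the `dosage` column, and the helper name `frag_value()` are assumptions made for illustration; this is not the package's internal code.

```r
# Toy CNV events for two individuals (hypothetical coordinates and dosage).
cnv <- data.frame(ID = c("A", "B"), BP1 = c(100, 300), BP2 = c(400, 600),
                  dosage = c(1, 1))

# Fragment boundaries are all breakpoints observed in the population.
breaks <- sort(unique(c(cnv$BP1, cnv$BP2)))
start  <- head(breaks, -1)
end    <- tail(breaks, -1)

# Hypothetical helper: for one individual's events, return
# dosage * fragment length for every fragment that an event fully covers.
frag_value <- function(events) {
  vapply(seq_along(start), function(k) {
    covered <- events$BP1 <= start[k] & events$BP2 >= end[k]
    sum(events$dosage[covered]) * (end[k] - start[k])
  }, numeric(1))
}

# One row per individual, one column per fragment.
design <- t(sapply(split(cnv, cnv$ID), frag_value))
colnames(design) <- paste0("frag", seq_along(start))
print(design)
```

Individual A covers the first two fragments and individual B the last two, so the shared middle fragment is non-zero for both.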
This package has 3 main functions:
`prep()`: Data preprocessing and format conversion.
`cvfit_WTSMTH()`: Model fitting and effect estimation with cross-validation (CV). The CV procedure tunes the model by selecting the best pair of candidate tuning parameters.
`fit_WTSMTH()`: Model fitting and effect estimation with a given pair of tuning parameters.
All examples below use the example data included in the CNVreg package.
The `CNVCOVY` dataset included in this package contains a small sample of data for demonstration purposes. It has 4 separate data files: copy number variant data in `CNV`, covariate data in `Cov`, and outcome traits in `Y_QT` (quantitative) and `Y_BT` (binary).
`CNV`: A data frame describing CNV data in PLINK format with 5 variables: `ID`, `CHR`, `BP1`, `BP2`, and `TYPE`.
`Cov`: A data frame with 3 variables: `ID`, `Sex`, and `Age`.
`Y_QT` and `Y_BT`: Each is a data frame for an outcome trait. `Y_QT` contains a quantitative trait and `Y_BT` contains a binary trait. Both have 2 variables: `ID` and `Y`.
Here is how you can load and view the summary of the datasets:
# load the example dataset
data("CNVCOVY", package="CNVreg")
# view the dataset
summary(CNV)
#> ID CHR BP1 BP2
#> Length:2680 Min. :1 Min. :118956400 Min. :118956600
#> Class :character 1st Qu.:1 1st Qu.:175325200 1st Qu.:175325800
#> Mode :character Median :1 Median :203709300 Median :203709400
#> Mean :1 Mean :191999781 Mean :192000245
#> 3rd Qu.:1 3rd Qu.:229563000 3rd Qu.:229564400
#> Max. :1 Max. :238592100 Max. :238593100
#> TYPE
#> Min. :0.0000
#> 1st Qu.:1.0000
#> Median :1.0000
#> Mean :0.9828
#> 3rd Qu.:1.0000
#> Max. :3.0000
summary(Cov)
#> ID Sex Age
#> Length:900 Min. :0.0000 Min. :50.00
#> Class :character 1st Qu.:0.0000 1st Qu.:59.00
#> Mode :character Median :1.0000 Median :70.00
#> Mean :0.5011 Mean :69.82
#> 3rd Qu.:1.0000 3rd Qu.:80.00
#> Max. :1.0000 Max. :89.00
summary(Y_QT)
#> ID Y
#> Length:900 Min. :-4.8950
#> Class :character 1st Qu.:-2.3411
#> Mode :character Median :-1.3730
#> Mean : 0.4128
#> 3rd Qu.: 3.5676
#> Max. :16.7008
summary(Y_BT)
#> ID Y
#> Length:900 Min. :0.0000
#> Class :character 1st Qu.:0.0000
#> Mode :character Median :0.0000
#> Mean :0.4056
#> 3rd Qu.:1.0000
#> Max. :1.0000
In brief, the dataset contains 2680 CNV records, two covariates (`Sex` and `Age`), and outcome traits for 900 individuals.
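Before preprocessing your own data, it can help to confirm that it follows the same layout as the example data. The sketch below runs a few such checks on toy data frames in base R. The toy values and the helper name `check_inputs()` are hypothetical, and the checks are suggestions rather than requirements documented by the package.

```r
# Toy inputs that mimic the layout of CNV, Cov, and Y_QT (values are made up).
cnv_toy <- data.frame(ID = c("U1", "U2"), CHR = c(1L, 1L),
                      BP1 = c(100, 300), BP2 = c(400, 600), TYPE = c(1L, 3L))
cov_toy <- data.frame(ID = c("U1", "U2", "U3"), Sex = c(0, 1, 1),
                      Age = c(55, 63, 71))
y_toy   <- data.frame(ID = c("U1", "U2", "U3"), Y = c(-1.2, 0.4, 2.3))

# Hypothetical helper: stop with an error if the layout looks wrong.
check_inputs <- function(CNV, Y, Z) {
  stopifnot(
    all(c("ID", "CHR", "BP1", "BP2", "TYPE") %in% names(CNV)),
    all(c("ID", "Y") %in% names(Y)),
    "ID" %in% names(Z),
    all(CNV$BP1 < CNV$BP2),   # every event has positive length
    !anyDuplicated(Y$ID),     # one outcome per individual
    all(CNV$ID %in% Y$ID)     # assumed: every CNV carrier has an outcome
  )
  invisible(TRUE)
}

check_inputs(cnv_toy, y_toy, cov_toy)
```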
prep() function
The `prep()` function converts an individual’s CNV events within a genomic region into fragments and filters out rare events. It also analyzes the adjacency relationships between CNV fragments and prepares different weight options for the penalized regression analysis. The following shows how to use `prep()` to preprocess `CNV`, `Cov`, and an outcome trait. The outcome trait can be a continuous trait (`Y_QT`) or a binary trait (`Y_BT`). The syntax of the `prep()` call is the same for continuous and binary outcomes.
prep()
The function `prep()` has 4 inputs.
`CNV`: a data frame describing CNVs in PLINK format with 5 variables: `ID`, `CHR`, `BP1`, `BP2`, and `TYPE`.
`Y`: a data frame describing outcomes with 2 variables: `ID` and `Y`.
`Z`: a data frame describing covariates. If `Z` is provided, one variable must be `ID`; the other variables can be any covariates of interest.
`rare.out`: a number in [0, 0.5). The default value is 0.05, which excludes CNVs with frequency < \(5\%\).
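To make the role of `rare.out` concrete, the sketch below filters the columns of a toy fragment matrix by frequency in base R. The matrix, the frequency definition (share of samples with a non-zero value), and the variable names are assumptions for illustration and may differ from what `prep()` does internally.

```r
n <- 200

# Toy fragment matrix: rows are samples, columns are fragments.
toy_design <- cbind(
  common = c(rep(200, 60), rep(0, n - 60)),  # 30% of samples carry it
  rare   = c(rep(200, 4),  rep(0, n - 4)),   #  2% of samples carry it
  medium = c(rep(100, 20), rep(0, n - 20))   # 10% of samples carry it
)

rare_out <- 0.05

# Assumed frequency definition: fraction of samples with a non-zero entry.
freq <- colMeans(toy_design != 0)

# Keep fragments whose frequency is not below the threshold.
kept <- toy_design[, freq >= rare_out, drop = FALSE]
colnames(kept)
```

Only the `rare` column falls below the 5% threshold and is dropped.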
# data preprocessing for a quantitative(continuous) outcome Y_QT
frag_data_QT <- prep(CNV = CNV, Y = Y_QT, Z = Cov, rare.out = 0.05)
The result `frag_data_QT` is the output of the `prep()` function. It has a specially designed “WTsmth.data” format for direct use in the next step of the CNV association analysis, and it has 6 components.
`design`: a matrix of the CNV fragments with n by p dimensions, where n is the number of samples and p is the total number of CNV fragments. The row names are sample IDs, in the same order as the names of the outcome (`frag_data_QT$Y`).
`Z`: a matrix of covariates with sample IDs as row names. The row names are in the same order as the names of the outcome (`frag_data_QT$Y`).
`Y`: a named numeric vector of outcomes with sample IDs as names. The names are in the same order as the row names of the CNV design matrix `frag_data_QT$design` and the covariates `frag_data_QT$Z`.
`weight.structure`: a matrix that describes the adjacency structure of the CNV fragments. The matrix is sparse: most values are zero, and the non-zero values mark pairs of adjacent CNV fragments that are overlapped by at least one CNV event in the population.
`weight.options`: 6 different weight options that encourage differential information sharing based on the relationship between adjacent CNV fragments: equal weight `eql`, cosine-similarity based weight `wcs`, inverse frequency weight `wif`, and the combination of each of these 3 weights with the frequency (k) of any CNV events within each CNV-active region (`keql`, `kwcs`, and `kwif`). Refer to the user manual for more information.
`CNVR.info`: a summary of the positions of all CNV fragments and their adjacency information. Each row represents a CNV fragment, and the fragment names match the column names in `frag_data_QT$design`.
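The sketch below illustrates how an adjacency structure of this kind can drive a fusion penalty, using base R only. The toy difference matrix `D`, the weights `w`, the helper name `penalty()`, and the squared-difference form of the smoothness term are all assumptions made for illustration; consult the user manual for the penalty that the package actually implements.

```r
# Four toy fragments: 1-2-3 are adjacent, fragment 4 stands alone.
# Each row of D takes the difference between two adjacent fragments.
D <- rbind(c(-1,  1, 0, 0),
           c( 0, -1, 1, 0))
w <- c(1, 1)  # equal weights for both adjacent pairs (assumed)

# Hypothetical penalty: Lasso term plus weighted squared differences.
penalty <- function(beta, lambda1, lambda2) {
  lasso  <- sum(abs(beta))
  fusion <- sum((w * as.vector(D %*% beta))^2)
  lambda1 * lasso + lambda2 * fusion
}

smooth <- c(0.5, 0.5, 0.5, 0)  # adjacent fragments share one effect
rough  <- c(1.5, 0.0, 0.0, 0)  # same L1 norm, but an abrupt jump

penalty(smooth, lambda1 = 1, lambda2 = 1)  # 1.5
penalty(rough,  lambda1 = 1, lambda2 = 1)  # 3.75
```

Both coefficient vectors have the same Lasso term, so the difference in penalty comes entirely from the smoothness term, which favors the vector with similar effects on adjacent fragments.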
frag_data_QT
# Format of `prep()` function output
str(frag_data_QT)
#> List of 6
#> $ design :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#> .. ..@ i : int [1:5552] 9 13 15 21 24 38 55 57 62 66 ...
#> .. ..@ p : int [1:20] 0 177 320 465 608 752 1189 1914 2517 3081 ...
#> .. ..@ Dim : int [1:2] 900 19
#> .. ..@ Dimnames:List of 2
#> .. .. ..$ : chr [1:900] "U1" "U10" "U100" "U101" ...
#> .. .. ..$ : chr [1:19] "del1" "del3" "del4" "del5" ...
#> .. ..@ x : num [1:5552] 200 200 200 200 200 200 200 200 200 200 ...
#> .. ..@ factors : list()
#> $ Z :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#> .. ..@ i : int [1:1351] 3 4 5 6 7 9 10 12 14 15 ...
#> .. ..@ p : int [1:3] 0 451 1351
#> .. ..@ Dim : int [1:2] 900 2
#> .. ..@ Dimnames:List of 2
#> .. .. ..$ : chr [1:900] "U1" "U10" "U100" "U101" ...
#> .. .. ..$ : chr [1:2] "Sex" "Age"
#> .. ..@ x : num [1:1351] 1 1 1 1 1 1 1 1 1 1 ...
#> .. ..@ factors : list()
#> $ Y : Named num [1:900] -1.816 -0.321 -2.624 -0.688 3.288 ...
#> ..- attr(*, "names")= chr [1:900] "U1" "U10" "U100" "U101" ...
#> $ weight.structure:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#> .. ..@ i : int [1:22] 1 1 2 2 3 3 4 4 5 5 ...
#> .. ..@ p : int [1:20] 0 0 1 3 5 6 7 9 11 12 ...
#> .. ..@ Dim : int [1:2] 15 19
#> .. ..@ Dimnames:List of 2
#> .. .. ..$ : NULL
#> .. .. ..$ : NULL
#> .. ..@ x : num [1:22] -1 1 -1 1 -1 1 -1 1 -1 1 ...
#> .. ..@ factors : list()
#> $ weight.options : num [1:6, 1:15] 0 0 0 0 0 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:6] "eql" "keql" "wcs" "kwcs" ...
#> .. ..$ : NULL
#> $ CNVR.info :'data.frame': 19 obs. of 7 variables:
#> ..$ grid.id : int [1:19] 1 3 4 5 6 8 9 10 11 13 ...
#> ..$ CNV.id : int [1:19] 1 2 2 2 2 3 3 3 3 4 ...
#> ..$ freq : int [1:19] 177 143 145 143 144 437 725 603 564 640 ...
#> ..$ CHR : int [1:19] 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ lower.boundary: num [1:19] 1.19e+08 1.21e+08 1.21e+08 1.21e+08 1.21e+08 ...
#> ..$ upper.boundary: num [1:19] 1.19e+08 1.21e+08 1.21e+08 1.21e+08 1.21e+08 ...
#> ..$ deldup : chr [1:19] "del" "del" "del" "del" ...
#> - attr(*, "class")= chr "WTsmth.data"
# data preprocessing with a binary trait
frag_data_BT <- prep(CNV = CNV, Y = Y_BT, Z = Cov, rare.out = 0.05)
The result `frag_data_BT` is the output of the `prep()` function with a binary trait and has the same “WTsmth.data” format. It contains the same list components as in the continuous scenario described earlier: `design`, `Z`, `Y`, `weight.structure`, `weight.options`, and `CNVR.info`.
Here is an alternative way to prepare the data when performing CNV association analysis with the same set of CNVs for multiple outcome traits. Since the `CNV` data and the `Cov` data are unchanged and only the outcome trait `Y_BT` differs, we can manually format `Y_BT` to match the format of `frag_data_QT$Y`.
# This is useful when we have a large CNV data set and perform association analysis for multiple traits with the same set of CNV data.
## copy frag_data_QT
#frag_data_BT <- frag_data_QT
#
### replace Y with Y_BT in the correct format: ordered named vector
### order the samples in Y_BT as in frag_data_QT$Y
#rownames(Y_BT) <- Y_BT$ID
#
#frag_data_BT$Y <- Y_BT[names(frag_data_QT$Y), "Y"] |> drop()
### carry the sample IDs over as the names of the new outcome vector
#names(frag_data_BT$Y) <- names(frag_data_QT$Y)
## Directly replacing Z in the copied object is also possible; keep in mind to use the correct variable names and sample ID order.
The `cvfit_WTSMTH()` function analyzes the association between a continuous or binary trait and `CNV` while adjusting for the covariates `Cov`.
We already have `frag_data_QT` prepared in the `prep()` step. We can now perform CNV association analysis for a continuous outcome, using CV to select the tuning parameters and then fitting the final model with the selected parameters.
set.seed(12345)
QT_TUNE <- cvfit_WTSMTH(data = frag_data_QT,
lambda1=seq(-8, -3, 1),
lambda2 = seq(12, 25, 2),
weight="eql",
family = "gaussian",
cv.control = list(n.fold = 5L,
n.core = 1L,
stratified = FALSE),
verbose = FALSE)
The `cvfit_WTSMTH()` function takes the output of the `prep()` function as its main input, for example, `frag_data_QT` prepared for the continuous trait or `frag_data_BT` prepared for the binary trait.
`lambda1` and `lambda2` take the candidate tuning parameters that control variable selection (`lambda1`) and effect smoothness (`lambda2`). The provided values are transformed to 2^(`lambda1`) and 2^(`lambda2`). Default values are provided for both vectors, and the user can customize the range and step size of the candidate tuning parameters. In most cases, the user will need to run the function more than once, adjusting the range and step size according to the previous round of model fitting, in order to locate a reasonable range.
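To see what the candidate grid looks like on the transformed scale, the base R sketch below enumerates all pairs from the two vectors used in the call above and applies the 2^ transformation. The column names `penalty1` and `penalty2` are hypothetical labels chosen for this illustration.

```r
# Candidate exponents, as passed to cvfit_WTSMTH() in the example above.
lambda1 <- seq(-8, -3, 1)
lambda2 <- seq(12, 25, 2)   # stops at 24, the last value not exceeding 25

# Every (lambda1, lambda2) pair that CV would evaluate.
grid <- expand.grid(lambda1 = lambda1, lambda2 = lambda2)

# Values on the transformed scale, 2^lambda1 and 2^lambda2.
grid$penalty1 <- 2^grid$lambda1
grid$penalty2 <- 2^grid$lambda2

nrow(grid)     # 6 * 7 = 42 candidate pairs
head(grid, 3)
```

Because the exponents are equally spaced, the transformed values are spaced geometrically, so a step of 1 doubles the penalty weight.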
`weight` has six different options, as described earlier. Since we only have a small dataset, varying the `weight` option will not have much influence on the model fitting results. In real CNV data with different similarity patterns and CNV frequencies, different `weight` options are expected to have different effects.
`family` has two options: `gaussian` for a continuous outcome, and `binomial` for a binary outcome.
This function also supports parallel computing and a custom number of CV folds through the `cv.control` list.
`n.fold` controls the number of folds used in CV.
`n.core` controls the number of cores used in parallel computing.
`stratified` only applies to a binary outcome. We skip it here and describe it in the binary section.
If `verbose` = TRUE, the function prints messages about its current progress.
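When running on a multi-core machine, one possible way to fill in `cv.control` is sketched below. It uses `parallel::detectCores()` from base R; the variable names and the choice to leave one core free are assumptions for illustration, not package recommendations.

```r
# Number of cores reported by the system; detectCores() may return NA.
n_avail <- parallel::detectCores()
if (is.na(n_avail)) n_avail <- 1L

# Leave one core free, but always use at least one.
n_core <- max(1L, as.integer(n_avail) - 1L)

# Same structure as the cv.control list used in the example above.
cv_control <- list(n.fold = 5L, n.core = n_core, stratified = FALSE)
str(cv_control)
```

The resulting list could then be passed as the `cv.control` argument in place of the literal list shown earlier.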
The output of the `cvfit_WTSMTH()` function is a list object containing 3 elements: `Loss`, `selected.lambda`, and `coef`.
Loss
`Loss` keeps track of the average validation loss in CV for each pair of candidate tuning parameters \(\lambda_{1}\) and \(\lambda_{2}\). In the following table, the minimum loss (0.990641, at `lambda1` = -5 and `lambda2` = 20) identifies the pair of values that is selected to fit the final model.
In this simulated data, the loss varies little across \(\lambda_{2}\) for a fixed \(\lambda_{1}\). One reason is that \(\lambda_{2}\) controls the effect smoothness between adjacent CNVs, and the simulated data contain only a small number of adjacent CNVs that share effects with other CNVs, so changing \(\lambda_{2}\) has no prominent effect in this case. With more adjacent CNVs that share effects, the loss should vary more across \(\lambda_{2}\).
lambda2 (rows) / lambda1 (columns) | -8 | -7 | -6 | -5 | -4 | -3 |
---|---|---|---|---|---|---|
12 | 1.004534 | 1.003579 | 0.998444 | 0.993686 | 0.997349 | 1.02374 |
14 | 1.008237 | 1.007236 | 1.002091 | 0.997715 | 1.002085 | 1.02904 |
16 | 1.005668 | 1.004547 | 0.999433 | 0.995208 | 0.999796 | 1.02829 |
18 | 1.001295 | 1.000519 | 0.995631 | 0.991565 | 0.996885 | 1.02515 |
20 | 0.999942 | 0.998498 | 0.993783 | 0.990641 | 0.996864 | 1.02829 |
22 | 0.998218 | 0.996816 | 0.992557 | 0.991599 | 1.000824 | 1.04408 |
24 | 0.998674 | 0.995685 | 0.992663 | 0.995447 | 1.015663 | 1.10326 |
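The selection rule can be checked by hand. The base R sketch below re-enters the loss values from the table above as a plain matrix and looks up the pair with the smallest loss. The object name `loss` and its layout are chosen for this illustration and are not necessarily how `QT_TUNE$Loss` is stored.

```r
# Loss values transcribed from the table: rows are lambda2, columns lambda1.
loss <- matrix(c(
  1.004534, 1.003579, 0.998444, 0.993686, 0.997349, 1.02374,
  1.008237, 1.007236, 1.002091, 0.997715, 1.002085, 1.02904,
  1.005668, 1.004547, 0.999433, 0.995208, 0.999796, 1.02829,
  1.001295, 1.000519, 0.995631, 0.991565, 0.996885, 1.02515,
  0.999942, 0.998498, 0.993783, 0.990641, 0.996864, 1.02829,
  0.998218, 0.996816, 0.992557, 0.991599, 1.000824, 1.04408,
  0.998674, 0.995685, 0.992663, 0.995447, 1.015663, 1.10326
), nrow = 7, byrow = TRUE,
dimnames = list(as.character(seq(12, 24, 2)), as.character(seq(-8, -3, 1))))

# Position (row, column) of the smallest average validation loss.
idx <- which(loss == min(loss), arr.ind = TRUE)

best_lambda2 <- as.numeric(rownames(loss)[idx[1, 1]])
best_lambda1 <- as.numeric(colnames(loss)[idx[1, 2]])
c(lambda1 = best_lambda1, lambda2 = best_lambda2)
```

The lookup returns `lambda1` = -5 and `lambda2` = 20, the pair with loss 0.990641.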
selected.lambda
`selected.lambda` holds the optimal tuning parameters, that is, the pair from the candidate lists with the lowest loss, which can be confirmed against the `Loss` table.
# selected optimal tuning parameters with minimum loss
QT_TUNE$selected.lambda
#> [1] -5 20
coef
`coef` shows the estimated beta coefficients at the selected tuning parameters. It contains the `(Intercept)`, the CNV fragments (with detailed position and type information), and the covariate effects. In this small example, we can print all coefficient estimates, but you can modify the code to show only the non-zero ones.
The following lists the coefficients for `(Intercept)` and the covariates. The CNV characteristics (CHR, CNV.start, CNV.end, and deldup) are intentionally left as `NA`s for these rows in the original output, so here we only show the effect estimates.
##coefficients of intercept and covariates
QT_TUNE$coef[c(1, 21:22), c("Vnames", "coef") ]
#> Vnames coef
#> 1 (Intercept) -1.99855
#> 21 Sex 0.00000
#> 22 Age 0.00000
The following lists the coefficients for the CNVs and the corresponding plots.
We highlight the regions with adjacent CNVs. From the coefficient estimates, the model selects several CNVs with non-zero effects (data points). Among these data points, the red ones have stronger effects than the black ones, which are likely noise.
We also zoom in on two highlighted regions with strong signals and adjacent CNVs to show the effect smoothness within the regions.
The results illustrate the variable selection and effect smoothness of the penalized regression method for CNV association analysis.
# estimated coefficients for CNV
QT_TUNE$coef[2:20, ]
#> Vnames CHR CNV.start CNV.end deldup coef
#> 2 del1 1 118956400 118956600 del 0.0000000000
#> 3 del3 1 121299300 121299500 del 0.0048659084
#> 4 del4 1 121299500 121299700 del 0.0046967844
#> 5 del5 1 121299700 121299800 del 0.0046911466
#> 6 del6 1 121299800 121300400 del 0.0051456642
#> 7 del8 1 175325200 175325400 del -0.0001740566
#> 8 del9 1 175325400 175325500 del 0.0000000000
#> 9 del10 1 175325500 175325600 del 0.0000000000
#> 10 del11 1 175325600 175325800 del 0.0000000000
#> 11 del13 1 203709300 203709400 del -0.0002731880
#> 12 del17 1 229563000 229563200 del 0.0042092931
#> 13 del18 1 229563200 229563500 del 0.0046547584
#> 14 del19 1 229563500 229563900 del 0.0048148656
#> 15 del20 1 229563900 229564400 del 0.0057050100
#> 16 del23 1 235735000 235735100 del 0.0000000000
#> 17 del25 1 238591800 238592100 del 0.0000000000
#> 18 del26 1 238592100 238592900 del 0.0000000000
#> 19 del27 1 238592900 238593100 del 0.0000000000
#> 20 dup15 1 212455200 212455300 dup 0.0000000000
# non-zero coefficients
# QT_TUNE$coef[which(abs(QT_TUNE$coef$coef)>0), ]
set.seed(12345)
BT_TUNE <- cvfit_WTSMTH(frag_data_BT,
lambda1 = seq(-5.25, -4.75, 0.25),
lambda2 = seq(2, 8, 2),
weight="eql",
family="binomial",
cv.control = list(n.fold = 5L,
n.core = 1L,
stratified = FALSE),
iter.control = list(max.iter = 8L,
tol.beta = 10^(-3),
tol.loss = 10^(-6)),
verbose = FALSE)
cvfit_WTSMTH()
The CNV association analysis for a binary outcome has similar inputs to the analysis for a continuous outcome. The differences are:
The output `frag_data_BT` from the `prep()` step has a binary trait `Y_BT` ready for CNV association analysis with a binary outcome.
Choose `family` = “binomial” for a binary trait.
There are a few more options specifically designed for the binary trait.
`stratified` within the `cv.control` list: If one category of the binary outcome is considered “rare”, `stratified` = TRUE is recommended to make sure that each fold has the same proportion of cases and controls.
`iter.control`: For a binary outcome, we can also adjust the `iter.control` list to set the thresholds at which the coefficient estimates are deemed converged. Refer to the user manual for more details.
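The idea behind stratified splitting can be sketched in a few lines of base R: assign folds separately within cases and within controls, so that each fold keeps the overall case proportion. The helper name `stratified_folds()` and the toy outcome are hypothetical, and the package may implement stratification differently.

```r
set.seed(2024)

# Toy binary outcome with a rare category: 10 cases, 40 controls.
y <- c(rep(1, 10), rep(0, 40))
n_fold <- 5L

# Hypothetical helper: shuffle within each class, then deal out fold labels.
stratified_folds <- function(y, n_fold) {
  fold <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    idx <- idx[sample.int(length(idx))]              # random order in class
    fold[idx] <- rep_len(seq_len(n_fold), length(idx))
  }
  fold
}

fold <- stratified_folds(y, n_fold)
table(fold, y)
```

Each of the 5 folds receives 2 cases and 8 controls, whereas an unstratified random split could leave some folds with very few cases.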
cvfit_WTSMTH()
The output of the `cvfit_WTSMTH()` function is the same kind of list object containing 3 elements: `Loss`, `selected.lambda`, and `coef`.
Loss
`Loss` keeps track of the average validation loss in CV for each pair of candidate tuning parameters \(\lambda_{1}\) and \(\lambda_{2}\). In the following table, the minimum loss (0.277811, at `lambda1` = -5 and `lambda2` = 6) identifies the pair of values that is selected to fit the final model.
Since the regression for a binary trait takes longer to converge, here we only use a short list of candidate tuning parameters for illustration purposes.
lambda2 (rows) / lambda1 (columns) | -5.25 | -5 | -4.75 |
---|---|---|---|
2 | 0.281147 | 0.280297 | 0.279914 |
4 | 0.279573 | 0.278999 | 0.278696 |
6 | 0.278115 | 0.277811 | 0.277819 |
8 | 0.278173 | 0.278394 | 0.279038 |
selected.lambda
`selected.lambda` holds the optimal tuning parameters, that is, the pair from the candidate lists with the lowest loss, which can be confirmed against the `Loss` table.
# selected optimal tuning parameters with minimum loss
BT_TUNE$selected.lambda
#> [1] -5 6
coef
`coef` holds the estimated beta coefficients at the selected tuning parameters. It contains the `(Intercept)`, the CNV fragments (with detailed position and type information), and the covariate effects. In this small data example, we can print all coefficient estimates, but you can modify the code to show only the non-zero ones or the first few.
The following lists the coefficients for `(Intercept)` and the covariates.
BT_TUNE$coef[c(1, 21:22), c("Vnames", "coef") ]
#> Vnames coef
#> 1 (Intercept) -1.953386
#> 21 Sex 0.000000
#> 22 Age 0.000000
The following lists the coefficients for the CNVs and the corresponding plots.
BT_TUNE$coef[2:20, ]
#> Vnames CHR CNV.start CNV.end deldup coef
#> 2 del1 1 118956400 118956600 del 0.000000e+00
#> 3 del3 1 121299300 121299500 del 2.879470e-03
#> 4 del4 1 121299500 121299700 del 3.530621e-03
#> 5 del5 1 121299700 121299800 del 3.023385e-03
#> 6 del6 1 121299800 121300400 del 4.925100e-03
#> 7 del8 1 175325200 175325400 del 0.000000e+00
#> 8 del9 1 175325400 175325500 del 0.000000e+00
#> 9 del10 1 175325500 175325600 del 0.000000e+00
#> 10 del11 1 175325600 175325800 del 4.010457e-05
#> 11 del13 1 203709300 203709400 del 0.000000e+00
#> 12 del17 1 229563000 229563200 del 1.650097e-03
#> 13 del18 1 229563200 229563500 del 2.782514e-03
#> 14 del19 1 229563500 229563900 del 4.412988e-03
#> 15 del20 1 229563900 229564400 del 5.905259e-03
#> 16 del23 1 235735000 235735100 del 0.000000e+00
#> 17 del25 1 238591800 238592100 del 0.000000e+00
#> 18 del26 1 238592100 238592900 del 0.000000e+00
#> 19 del27 1 238592900 238593100 del 0.000000e+00
#> 20 dup15 1 212455200 212455300 dup 0.000000e+00
The user can choose the function `fit_WTSMTH()` to show the regression result for an arbitrary combination of tuning parameters without going through the CV process. Although `fit_WTSMTH()` is much faster, we recommend that the user stick with the parameter tuning procedure of `cvfit_WTSMTH()` to find the best parameters and the best fitted model.
The `fit_WTSMTH()` function and the `cvfit_WTSMTH()` function use the same analytical methods to perform CNV association analysis with penalized regression. Unlike `cvfit_WTSMTH()`, which selects the optimal combination of \(\lambda_{1}\) and \(\lambda_{2}\) from a series of candidates, `fit_WTSMTH()` takes user-specified values of \(\lambda_{1}\) and \(\lambda_{2}\) and estimates the coefficients for the given pair of parameters.
If we already have the best pair of parameters \(\lambda_{1}\) and \(\lambda_{2}\) selected by the CV procedure, we can refit the regression model, and the coefficient estimates are the same as in the tuned model.
# we know the optimal tuning parameters and directly apply them here.
QT_fit <- fit_WTSMTH(frag_data_QT,
lambda1 = -5,
lambda2 = 20,
weight="eql",
family="gaussian")
The following lists the coefficients for `(Intercept)` and the covariates.
QT_fit[c(1, 21:22), c("Vnames", "coef") ]
#> Vnames coef
#> 1 (Intercept) -1.99855
#> 21 Sex 0.00000
#> 22 Age 0.00000
The following lists the coefficients for the CNVs and the corresponding plots.
QT_fit[2:20, ]
#> Vnames CHR CNV.start CNV.end deldup coef
#> 2 del1 1 118956400 118956600 del 0.0000000000
#> 3 del3 1 121299300 121299500 del 0.0048659084
#> 4 del4 1 121299500 121299700 del 0.0046967844
#> 5 del5 1 121299700 121299800 del 0.0046911466
#> 6 del6 1 121299800 121300400 del 0.0051456642
#> 7 del8 1 175325200 175325400 del -0.0001740566
#> 8 del9 1 175325400 175325500 del 0.0000000000
#> 9 del10 1 175325500 175325600 del 0.0000000000
#> 10 del11 1 175325600 175325800 del 0.0000000000
#> 11 del13 1 203709300 203709400 del -0.0002731880
#> 12 del17 1 229563000 229563200 del 0.0042092931
#> 13 del18 1 229563200 229563500 del 0.0046547584
#> 14 del19 1 229563500 229563900 del 0.0048148656
#> 15 del20 1 229563900 229564400 del 0.0057050100
#> 16 del23 1 235735000 235735100 del 0.0000000000
#> 17 del25 1 238591800 238592100 del 0.0000000000
#> 18 del26 1 238592100 238592900 del 0.0000000000
#> 19 del27 1 238592900 238593100 del 0.0000000000
#> 20 dup15 1 212455200 212455300 dup 0.0000000000
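One way to confirm that a refit reproduces the tuned model is to compare the two coefficient tables programmatically rather than by eye. The base R sketch below does this on small stand-in data frames whose values are copied from the printed output above. The helper name `same_estimates()` and the tolerance are assumptions for illustration.

```r
# Stand-in for a few rows of QT_TUNE$coef (values copied from the output).
cv_coef <- data.frame(
  Vnames = c("(Intercept)", "del3", "del4", "Sex"),
  coef   = c(-1.99855, 0.0048659084, 0.0046967844, 0)
)

# Stand-in for the refit at the same (lambda1, lambda2).
refit_coef <- cv_coef

# Hypothetical helper: same variables in the same order, and
# numerically equal estimates up to a small tolerance.
same_estimates <- function(a, b, tol = 1e-8) {
  identical(a$Vnames, b$Vnames) &&
    isTRUE(all.equal(a$coef, b$coef, tolerance = tol))
}

same_estimates(cv_coef, refit_coef)   # TRUE

# A fit at other tuning parameters would generally differ.
other_coef <- refit_coef
other_coef$coef[2] <- 0.01
same_estimates(cv_coef, other_coef)   # FALSE
```

In a live session, the same helper could be applied to `QT_TUNE$coef` and `QT_fit`, since both contain the `Vnames` and `coef` columns shown above.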
If we instead choose an arbitrary pair of tuning parameters, the fitted model is unlikely to achieve optimal variable selection and effect smoothness.
In summary, the `cvfit_WTSMTH()` function is recommended for model fitting with CV.