First, let us split the nhanes3_newborn dataset into training data and
test data.
library(mixgb)
data("nhanes3_newborn")
set.seed(2022)
n <- nrow(nhanes3_newborn)
idx <- sample(1:n, size = round(0.7 * n), replace = FALSE)
train.data <- nhanes3_newborn[idx, ]
test.data <- nhanes3_newborn[-idx, ]
We can use the training data to generate m imputed datasets and save
their imputation models. To achieve this, users need to set
save.models = TRUE. By default, imputation models for all variables with
missing values in the training data will be saved (save.vars = NULL).
However, unseen data may have missing values in other variables. To be
thorough, users can save models for all variables by setting
save.vars = colnames(train.data). Note that this may take significantly
longer, as it requires training and saving a model for each variable.
When users are confident that only certain variables will have missing
values in the new data, it is advisable to specify the names or indices
of these variables in save.vars rather than saving models for all
variables.
params <- list(
max_depth = 3,
gamma = 0,
eta = 0.3,
min_child_weight = 1,
subsample = 0.7,
nthread = 2
)
# obtain m imputed datasets for train.data and save imputation models
mixgb.obj <- mixgb(data = train.data, m = 5, xgb.params = params,
                   save.models = TRUE, save.vars = NULL)
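When a sample of the new data is already available, the choice of save.vars discussed above can be informed by comparing missingness patterns. The following is a minimal base R sketch under that assumption. The helper vars_with_na() and the toy data frames are hypothetical illustrations and are not part of mixgb.

```r
# Hypothetical helper (not part of mixgb): names of the columns that
# contain at least one missing value.
vars_with_na <- function(df) {
  names(df)[colSums(is.na(df)) > 0]
}

# Toy stand-ins for train.data and test.data (made-up values).
toy.train <- data.frame(x = c(1, NA, 3, 4), y = c(2, 4, 6, 8))
toy.new <- data.frame(x = c(NA, 2), y = c(NA, 5))

train.na <- vars_with_na(toy.train)
new.na <- vars_with_na(toy.new)

# Variables that are incomplete in the new data but complete in the
# training data. With save.vars = NULL no model would be saved for them.
uncovered <- setdiff(new.na, train.na)

# A candidate value for save.vars that covers both datasets.
candidate.save.vars <- union(train.na, new.na)
```

In this toy case y is complete in the training data but incomplete in the new data, so it would need to be listed in save.vars explicitly.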
When save.models = TRUE, mixgb() will return an object containing the
following:
imputed.data: a list of m imputed datasets for the training data.
XGB.models: a list of m sets of XGBoost models for the variables
specified in save.vars.
params: a list of parameters that are required for imputing new data
using impute_new() later on.
We can access the m imputed datasets from the saved imputer object by
using $imputed.data.
train.imputed <- mixgb.obj$imputed.data
# the 5th imputed dataset
head(train.imputed[[5]])
#>    HSHSIZER HSAGEIR HSSEX DMARACER DMAETHNR DMARETHN BMPHEAD BMPRECUM BMPSB1
#> 1:        7       2     1        1        1        3    42.1     64.9    6.8
#> 2:        4       3     2        2        3        2    42.6     67.1    8.8
#> 3:        3       9     2        2        3        2    46.5     64.3    8.6
#> 4:        3       9     2        1        3        1    46.2     68.5   10.8
#> 5:        5       4     1        1        3        1    44.7     63.0    6.0
#> 6:        5      10     1        1        3        1    45.2     72.0    5.4
#>    BMPSB2 BMPTR1 BMPTR2 BMPWT DMPPIR HFF1 HYD1
#> 1:    7.8    9.0   10.0  8.45  1.701    2    1
#> 2:    8.8   13.3   12.2  8.70  0.102    2    1
#> 3:    8.0   10.4    9.2  8.00  0.359    1    3
#> 4:   10.0   16.6   16.0  8.98  0.561    1    3
#> 5:    5.8    9.0    9.0  7.60  2.379    2    1
#> 6:    5.4    9.2    9.4  9.00  2.173    2    2
To impute new data with this saved imputer object, we can use the
impute_new() function. Users can choose whether to use the new data for
initial imputation. By default, information from the training data is
used to initially impute the missing data in the new dataset
(initial.newdata = FALSE). After this, the missing values in the new
dataset will be imputed using the saved models from the imputer object.
This process will be considerably faster because it does not involve
rebuilding the imputation models.
test.imputed <- impute_new(object = mixgb.obj, newdata = test.data)
If PMM is used in mixgb(), predicted values of missing entries in the
new dataset will be matched with donors from the training data.
Additionally, users can set the number of donors to be used in PMM when
imputing new data. The default setting pmm.k = NULL indicates that the
same setting as the training object will be used.
Similarly, users can set the number of imputed datasets m in
impute_new(). Note that this value has to be less than or equal to the m
value specified in mixgb(). If this value is not specified, the function
will use the same m value as the saved object.
test.imputed <- impute_new(object = mixgb.obj, newdata = test.data,
                           initial.newdata = FALSE, pmm.k = 3, m = 4)
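The donor matching mentioned above can be pictured with a simplified base R sketch of predictive mean matching. This is an illustration of the general idea only and is not mixgb's internal implementation. The function pmm_match() and all numeric values are hypothetical, and the argument k plays the role that pmm.k plays in impute_new().

```r
# Simplified PMM (hypothetical helper, not mixgb code): for each predicted
# value in the new data, find the k donors from the training data whose own
# predicted values are closest, then draw the observed value of one of them.
pmm_match <- function(pred.new, pred.donor, obs.donor, k = 5) {
  stopifnot(length(pred.donor) == length(obs.donor), k >= 1)
  k <- min(k, length(obs.donor))
  vapply(pred.new, function(p) {
    nearest <- order(abs(pred.donor - p))[seq_len(k)]
    obs.donor[nearest[sample.int(k, 1)]]
  }, numeric(1))
}

# Made-up predictions and observed values for five training donors.
pred.donor <- c(40.2, 42.5, 44.1, 45.8, 47.0)
obs.donor <- c(40.0, 43.0, 44.0, 46.5, 47.2)

# Made-up predictions for two missing entries in the new data.
pred.new <- c(42.0, 46.1)

set.seed(1)
imputed.k3 <- pmm_match(pred.new, pred.donor, obs.donor, k = 3)

# With k = 1 the closest donor is always chosen, so the result is
# deterministic.
imputed.k1 <- pmm_match(pred.new, pred.donor, obs.donor, k = 1)
```

Every imputed value is an observed value taken from the training data, which is why a larger k spreads the draws over more donors while k = 1 always returns the single closest match.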