Other functions in bulkreadr

Ezekiel Ogundepo and Ernest Fokoué

The bulkreadr package in R includes specialized functions beyond bulk data reading, aimed at enhancing data analysis efficiency. These functions are designed to operate on individual vectors, except for inspect_na() and fill_missing_values(), which work on data frames.

pull_out()

pull_out() is similar to [. It acts on vectors, matrices, arrays and lists to extract or replace parts. It is pleasant to use with the magrittr (⁠%>%⁠) and base(|>) operators.


library(bulkreadr)
library(dplyr)

top_10_richest_nig <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze", "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor", "Jimoh Ibrahim", "Tony Elumelu")

top_10_richest_nig %>% 
  pull_out(c(1, 5, 2))
#> [1] "Aliko Dangote"    "Abdulsamad Rabiu" "Mike Adenuga"
top_10_richest_nig %>% 
  pull_out(-c(1, 5, 2))
#> [1] "Femi Otedola"   "Arthur Eze"     "Cletus Ibeto"   "Orji Uzor Kalu"
#> [5] "ABC Orjiakor"   "Jimoh Ibrahim"  "Tony Elumelu"

convert_to_date()

convert_to_date() parses an input vector into POSIXct date-time object. It is also powerful to convert from excel date number like 42370 into date value like 2016-01-01.


## ** heterogeneous dates **

dates <- c(
  44869, "22.09.2022", NA, "02/27/92", "01-19-2022",
  "13-01-  2022", "2023", "2023-2", 41750.2, 41751.99,
  "11 07 2023", "2023-4"
  )

# Convert to POSIXct or Date object

convert_to_date(dates)
#>  [1] "2022-11-04" "2022-09-22" NA           "1992-02-27" "2022-01-19"
#>  [6] "2022-01-13" "2023-01-01" "2023-02-01" "2014-04-21" "2014-04-22"
#> [11] "2023-07-11" "2023-04-01"

# It can also convert date time object to date object 

convert_to_date(lubridate::now())
#> [1] "2024-05-26"

inspect_na()

inspect_na() summarizes the rate of missingness in each column of a data frame. For a grouped data frame, the rate of missingness is summarized separately for each group.


# dataframe summary

inspect_na(airquality)
#> # A tibble: 6 × 3
#>   col_name   cnt  pcnt
#>   <chr>    <int> <dbl>
#> 1 Ozone       37 24.2 
#> 2 Solar.R      7  4.58
#> 3 Wind         0  0   
#> 4 Temp         0  0   
#> 5 Month        0  0   
#> # ℹ 1 more row

Grouped dataframe summary

airquality %>% 
  group_by(Month) %>% 
  inspect_na()
#> # A tibble: 25 × 4
#> # Groups:   Month [5]
#>   Month col_name   cnt  pcnt
#>   <int> <chr>    <int> <dbl>
#> 1     5 Ozone        5  16.1
#> 2     5 Solar.R      4  12.9
#> 3     5 Wind         0   0  
#> 4     5 Temp         0   0  
#> 5     5 Day          0   0  
#> # ℹ 20 more rows

fill_missing_values()

fill_missing_values() is an efficient function that addresses missing values in a data frame. It uses imputation by function, also known as column-based imputation, to impute the missing values. It supports various imputation methods for continuous variables, including minimum, maximum, mean, median, harmonic mean, and geometric mean. For categorical variables, missing values are replaced with the mode of the column. This approach ensures accurate and consistent replacements derived from individual columns, resulting in a complete and reliable dataset for improved analysis and decision-making.


df <- tibble::tibble(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Sepal.Width = c(4.1, 3.6, 3, 3, 2.9, 2.5, 2.4),
  Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
  Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species = c("setosa", NA, "versicolor", "setosa",
    NA, "virginica", "setosa"
  )
)
df
#> # A tibble: 7 × 5
#>   Sepal_Length Sepal.Width Petal_Length Petal_Width Species   
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>     
#> 1          5.2         4.1          1.5        NA   setosa    
#> 2          5           3.6          1.4         0.2 <NA>      
#> 3          5.7         3            4.2         1.2 versicolor
#> 4         NA           3            1.4         0.2 setosa    
#> 5          6.2         2.9         NA           1.3 <NA>      
#> # ℹ 2 more rows

If you do not specify selected_variables (i.e., leave it as NULL), the function will impute missing values for all columns in the dataframe.

# Impute using the mean

result_df_mean <- fill_missing_values(df, method = "mean")

result_df_mean
#> # A tibble: 7 × 5
#>   Sepal_Length Sepal.Width Petal_Length Petal_Width Species   
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>     
#> 1         5.2          4.1          1.5        0.94 setosa    
#> 2         5            3.6          1.4        0.2  setosa    
#> 3         5.7          3            4.2        1.2  versicolor
#> 4         5.72         3            1.4        0.2  setosa    
#> 5         6.2          2.9          3          1.3  setosa    
#> # ℹ 2 more rows

If you specify column names, only those columns will be imputed. For example, impute for variables Petal_Length and Petal_Width using the geometric mean.


result_df_geomean <- fill_missing_values(df, selected_variables = c
("Petal_Length", "Petal_Width"), method = "geometric")

result_df_geomean
#> # A tibble: 7 × 5
#>   Sepal_Length Sepal.Width Petal_Length Petal_Width Species   
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>     
#> 1          5.2         4.1         1.5        0.732 setosa    
#> 2          5           3.6         1.4        0.2   <NA>      
#> 3          5.7         3           4.2        1.2   versicolor
#> 4         NA           3           1.4        0.2   setosa    
#> 5          6.2         2.9         2.22       1.3   <NA>      
#> # ℹ 2 more rows

If you specify column positions, only the columns at those positions will be imputed.

# Impute using the maximum method

result_df_max <- fill_missing_values(df, selected_variables = c
(2, 3), method = "max")

result_df_geomean
#> # A tibble: 7 × 5
#>   Sepal_Length Sepal.Width Petal_Length Petal_Width Species   
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>     
#> 1          5.2         4.1         1.5        0.732 setosa    
#> 2          5           3.6         1.4        0.2   <NA>      
#> 3          5.7         3           4.2        1.2   versicolor
#> 4         NA           3           1.4        0.2   setosa    
#> 5          6.2         2.9         2.22       1.3   <NA>      
#> # ℹ 2 more rows

Impute missing values (NAs) in a grouped data frame

You can use the fill_missing_values() in a grouped data frame by using other grouping and map functions. Here is an example of how to do this:

sample_iris <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", "setosa", "versicolor", "setosa",
          "virginica", "virginica", "setosa")
)
sample_iris
#> # A tibble: 7 × 4
#>   Sepal_Length Petal_Length Petal_Width Species   
#>          <dbl>        <dbl>       <dbl> <chr>     
#> 1          5.2          1.5         0.3 setosa    
#> 2          5            1.4         0.2 setosa    
#> 3          5.7          4.2         1.2 versicolor
#> 4         NA            1.4         0.2 setosa    
#> 5          6.2         NA           1.3 virginica 
#> # ℹ 2 more rows
sample_iris %>%
  group_by(Species) %>%
  group_split() %>%
  map_df(fill_missing_values, method = "median")
#> # A tibble: 7 × 4
#>   Sepal_Length Petal_Length Petal_Width Species   
#>          <dbl>        <dbl>       <dbl> <chr>     
#> 1          5.2          1.5         0.3 setosa    
#> 2          5            1.4         0.2 setosa    
#> 3          5.2          1.4         0.2 setosa    
#> 4          5.5          3.7         0.2 setosa    
#> 5          5.7          4.2         1.2 versicolor
#> # ℹ 2 more rows