Hi everyone, this page introduces AnalysisLin, my personal package for exploratory data analysis (EDA). It includes several functions designed to assist with EDA. They are based on what I learned during my academic years, and I use them for my own EDA work.
The table below summarizes the functions covered on this page.
library(knitr)  # provides kable()
# One column per category; empty strings pad the shorter columns
df <- data.frame(
  Descriptive_Statistics = c("desc_stat()", "", "", "", "", ""),
  Data_Visualization = c("hist_plot()", "dens_plot()", "bar_plot()", "pie_plot()", "qq_plot()", "missing_value_plot()"),
  Correlation_Analysis = c("corr_matrix()", "corr_cluster()", "", "", "", ""),
  Feature_Engineering = c("missing_impute()", "pca()", "", "", "", "")
)
kable(df)
| Descriptive_Statistics | Data_Visualization | Correlation_Analysis | Feature_Engineering |
|---|---|---|---|
| desc_stat() | hist_plot() | corr_matrix() | missing_impute() |
| | dens_plot() | corr_cluster() | pca() |
| | bar_plot() | | |
| | pie_plot() | | |
| | qq_plot() | | |
| | missing_value_plot() | | |
Well-known built-in datasets such as iris, mtcars, and airquality are used to demonstrate what each function in the package does. If you have not installed the package yet, run the following:
install.packages("AnalysisLin")
Exploratory Data Analysis, in simple words, is the process of getting to know your data.
The first function in the package is desc_stat(). It computes a range of statistical metrics so that you can gain a thorough understanding of your data.
These metrics give detailed insight into the dataset. If you do not want some of them to be computed, you can set them to FALSE, and the unwanted metrics will not appear in the output.
Furthermore, desc_stat() can also compute kurtosis and skewness, and run the Shapiro-Wilk, Anderson-Darling, Lilliefors, and Jarque-Bera tests.
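To make a few of these metrics concrete, here is a minimal sketch in base R. It does not call desc_stat() and is not how the package computes its output; the moment-based formulas for skewness and kurtosis and the choice of iris$Sepal.Length are assumptions made for illustration.

```r
# Base-R sketch of a few descriptive metrics (not AnalysisLin's implementation)
x <- iris$Sepal.Length

m  <- mean(x)
s2 <- mean((x - m)^2)                    # population (biased) variance

# Moment-based skewness and kurtosis; kurtosis is 3 for a normal distribution
skewness <- mean((x - m)^3) / s2^(3 / 2)
kurtosis <- mean((x - m)^4) / s2^2

# Shapiro-Wilk normality test from the stats package
sw <- shapiro.test(x)

summary_row <- data.frame(
  variable  = "Sepal.Length",
  mean      = m,
  sd        = sd(x),
  median    = median(x),
  skewness  = skewness,
  kurtosis  = kurtosis,
  shapiro_p = sw$p.value
)
print(summary_row)
```

A small Shapiro-Wilk p-value suggests the variable departs from normality, which is worth cross-checking against the skewness and kurtosis values.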
To visualize a histogram for every numerical variable:
To visualize the density of every numerical variable in two rows of subplots:
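For readers who want to see what such plots involve, the following is a rough base-R equivalent. It uses hist() and density() from base R rather than hist_plot() or dens_plot(), and the 2 x 2 layout is an assumption that suits the four numerical columns of iris.

```r
# Base-R sketch: histograms and densities for all numerical columns
num_vars <- iris[sapply(iris, is.numeric)]   # keep numerical columns only

old_par <- par(mfrow = c(2, 2))              # two rows of subplots
for (v in names(num_vars)) {
  hist(num_vars[[v]], main = v, xlab = v)
}

par(mfrow = c(2, 2))
for (v in names(num_vars)) {
  plot(density(num_vars[[v]]), main = v, xlab = v)
}
par(old_par)                                 # restore the previous layout
```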
A Quantile-Quantile (QQ) plot is a graphical tool used to assess whether a dataset follows a normal distribution. It compares the quantiles of the observed data to the quantiles of the expected distribution.
To check the normality of numerical variables by drawing QQ plots:
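The quantile comparison described above can be sketched in base R with qqnorm() and qqline(). This is an illustration of the idea, not the package's qq_plot(); mtcars$mpg is an arbitrary example variable.

```r
# Base-R sketch of a normal QQ plot
x <- mtcars$mpg

qqnorm(x, main = "Normal QQ plot: mpg")  # observed vs. theoretical quantiles
qqline(x, col = "red")                   # reference line through the quartiles

# The same pairing done by hand: sorted data against normal quantiles
theoretical <- qnorm(ppoints(length(x)))
observed    <- sort(x)
head(cbind(theoretical, observed))
```

Points that follow the reference line closely suggest approximate normality; systematic curvature at the ends points to heavy or light tails.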
To visualize a correlation table for all variables:
To visualize a correlation map along with the correlation table:
You may also choose the type of correlation: Pearson or Spearman.
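The difference between the two correlation types can be seen with cor() from base R. This sketch does not use corr_matrix(); the four mtcars columns are chosen only to keep the output small.

```r
# Base-R sketch: Pearson vs. Spearman correlation tables
num <- mtcars[, c("mpg", "hp", "wt", "qsec")]

pearson  <- cor(num, method = "pearson")    # linear association
spearman <- cor(num, method = "spearman")   # monotonic (rank-based) association

round(pearson, 2)
round(spearman, 2)

# Spearman is Pearson correlation computed on the ranks
cor(rank(num$mpg), rank(num$wt))
```

Spearman is usually the safer choice when relationships are monotonic but not linear, or when outliers are present.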
To visualize the percentage of missing values in each variable.
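As a point of reference, the percentage of missing values per variable can be computed in one line of base R. The bar chart below is only a stand-in for the package's plot, and airquality is used because it contains missing values in two columns.

```r
# Base-R sketch: percentage of missing values per variable
missing_pct <- colMeans(is.na(airquality)) * 100
round(missing_pct, 1)

barplot(sort(missing_pct, decreasing = TRUE),
        ylab = "Missing values (%)", las = 2)
```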
Imputing missing values is a way to get more information out of data that contains them. However, the imputation method needs to be chosen carefully to keep the results as accurate as possible.
mode: replace missing values with the most frequent value.
median: replace missing values with the median.
locf: replace missing values with the last observed value.
knn: replace missing values using k-nearest neighbors; k needs to be chosen.
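The first three methods are simple enough to sketch in base R. The helper names impute_mode(), impute_median(), and impute_locf() are hypothetical and are not part of AnalysisLin; knn is left out because it needs a distance computation across several variables.

```r
# Hypothetical helpers illustrating mode, median, and locf imputation

impute_mode <- function(x) {
  obs <- x[!is.na(x)]
  ux  <- unique(obs)
  mode_value <- ux[which.max(tabulate(match(obs, ux)))]  # most frequent value
  x[is.na(x)] <- mode_value
  x
}

impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

impute_locf <- function(x) {
  # Position of the most recent non-missing observation at each index
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  idx[idx == 0] <- NA          # leading NAs have nothing to carry forward
  x[idx]
}

x <- c(4, NA, 1, 1, NA, 10)
impute_mode(x)     # 4 1 1 1 1 10
impute_median(x)   # 4 2.5 1 1 2.5 10
impute_locf(x)     # 4 4 1 1 1 10
```

The three results differ on the same input, which is why the choice of method matters: locf only makes sense when row order is meaningful, for example in a time series.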
Principal Component Analysis (PCA) can help you reduce the number of variables in a dataset. To perform and visualize PCA on selected variables:
To visualize the scree plot and biplot:
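For comparison, the same analysis can be sketched with prcomp() from base R. This is not the package's pca() function; the four numerical iris columns and the decision to center and scale them are assumptions for the example.

```r
# Base-R sketch: PCA with a scree plot and a biplot
pca_fit <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca_fit)

# Proportion of variance explained by each principal component
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
round(var_explained, 3)

screeplot(pca_fit, type = "lines", main = "Scree plot")
biplot(pca_fit, scale = 0)   # observations and variable loadings together
```

The scree plot helps decide how many components to keep, and the biplot shows which original variables drive each component.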