In this vignette, I will introduce you to the main features of the
tdata package. I will use various datasets to demonstrate
how to perform common tasks, such as defining frequency types and
converting data between frequencies.
Please note that currently, only one section is provided in this vignette. Additional examples will be added in subsequent updates.
Let’s get started!
In the first example, I will use oil price data. The required data
can be downloaded with the Quandl package using the
following code (note that the end date in this example may differ from
yours):
To manipulate data using the tdata package, we generally
need to create a variable. In this example, we’ll create a variable from
the oil price data. First, we’ll use the values in the first column to
define a frequency. Since the first column contains a list of dates,
we’ll use a ‘List-Date’ frequency:
Now that we have defined the frequency, we can create a variable using the following code:
This creates an array where each element is labeled by a date. We can
print this variable using the print function:
## Variable:
## Name = Oil Price
## Length = 3466
## Frequency Class = List (Date): Ld
## Start Frequency = 20230608
## Fields: NULL
We can also convert the variable back to a data.frame using the
as.data.frame function:
In this section, we’ll convert var_dl to a daily
variable. This can be done by sorting the data and filling in any gaps.
The convert.to.daily function can do this for us:
Using this function is more efficient than manually sorting the data
and filling in gaps because var_daily, as a daily variable,
only stores a single date: the frequency of the first observation. Other
frequencies (or dates) are inferred from this first date (this holds for
every frequency type in the tdata package except the ‘List’ types).
We can print the starting frequency using the print function:
## Frequency: 20100104 (Daily: d)
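To make “sorting and filling in gaps” concrete, here is a minimal base-R sketch of the idea. It does not use tdata and is not how convert.to.daily is implemented; the dates and values are made up for illustration. It also shows why a start date plus the values is enough to recover every date of a daily series.

```r
# Hypothetical date-labelled observations, out of order and with gaps
dates  <- as.Date(c("2010-01-06", "2010-01-04", "2010-01-08"))
values <- c(78.2, 77.1, 79.5)

# Step 1: sort by date
ord    <- order(dates)
dates  <- dates[ord]
values <- values[ord]

# Step 2: build the full daily calendar and fill the missing days with NA
all_days <- seq(from = dates[1], to = dates[length(dates)], by = "day")
daily    <- rep(NA_real_, length(all_days))
daily[match(dates, all_days)] <- values

# A daily series only needs its start date: the i-th date is start + (i - 1)
start_date <- all_days[1]
recovered  <- start_date + (seq_along(daily) - 1)

print(data.frame(date = recovered, value = daily))
```

The NA entries produced in step 2 are the reason the daily variable is not plotted directly later on.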
Each frequency in the tdata package has a string
representation and a class ID. We can get these values using the
following code:
class_id <- get.class.id(var_daily$startFrequency)
str_rep <- as.character(var_daily$startFrequency)
# print both values in a single line
print(paste0("class_id: ", class_id, ", str_rep: ", str_rep))
## [1] "class_id: d, str_rep: 20100104"
Plotting the data is straightforward. We simply convert the data to a
data.frame using the as.data.frame function
and then plot it. However, I won’t plot the daily variable in this
example: because the original data was a ‘List-Date’ variable, the daily
variable contains many NA values. In the next section, I’ll aggregate the
data and plot it.
In this section, we’ll convert the daily variable to a weekly
variable. Unlike the previous conversion, this involves aggregating the
data rather than sorting and filling in gaps. To do this, we’ll need to
use an aggregator function that takes an array of data as an argument
and returns a scalar value. Summary statistic functions such as
mean and median are natural choices for this
(we’ll also need to handle NA values). In this example,
I’ll use a built-in function to get the last available data point in
each week as the representative value for that week. Here’s the
code:
The second argument, "mon", specifies that the week
starts on Monday. Note that the weekly frequency points to the first day
of the week. We can now convert the variable to a
data.frame and plot it using the following code:
df_var_weekly <- as.data.frame(var_weekly)
par(las = 2, cex.axis = 0.8)
plot(factor(rownames(df_var_weekly)),
     df_var_weekly$`Oil Price`,
     xlab = NULL, ylab = "$",
     main = "Weekly Oil Price")
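For readers who want to see what a “last available value” aggregator does, here is a minimal base-R sketch of the weekly aggregation idea. It does not call tdata; the function name last_available and the data are hypothetical, and the built-in aggregator in the package may behave differently in edge cases.

```r
# Aggregator: last non-NA value in a block, or NA if the block has no data
last_available <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else x[length(x)]
}

# Hypothetical daily data covering two weeks (2010-01-04 is a Monday)
days   <- seq(as.Date("2010-01-04"), by = "day", length.out = 14)
values <- c(77.1, NA, 78.2, NA, 79.5, NA, NA,
            80.0, 80.4, NA, NA, NA, NA, NA)

# Label each day by the Monday that starts its week.
# as.POSIXlt()$wday is 0 for Sunday, 1 for Monday, ..., 6 for Saturday.
days_since_monday <- (as.POSIXlt(days)$wday + 6) %% 7
week_start        <- days - days_since_monday

# One value per week, indexed by the first day of that week
weekly <- tapply(values, format(week_start), last_available)
print(weekly)
```

Swapping last_available for function(x) mean(x, na.rm = TRUE) would give a weekly average instead.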
There are other frequency types and conversion functions available in
the tdata package that you can explore on your own.
In this subsection, I will talk about some other functions in the
tdata package. These are not the main functions of the package, but
they cover related topics in working with time and data.
In this subsection, we will discuss long-run growth and use the tdata package to calculate and plot it. First, let’s review some mathematical concepts.
A variable can change continuously or discretely over time: \[\begin{align} &y(t)=y(0)e^{g_1}e^{g_2}\ldots e^{g_t}\\ &y_t=y_0(1+g_1)(1+g_2)\ldots(1+g_t) \end{align}\]
The starting condition is represented by \(y(0)\) and \(y_0\), while \(y(t)\) and \(y_t\) represent the value of the variable after \(t\) periods. The values \(g_i \times 100\) for \(i=1,\ldots,t\) are the continuous or discrete growth rates in the individual periods.
Recall that we have two formulas for calculating growth rate in one period: \[\begin{align} &G_c=(\ln{\frac{y(t+1)}{y(t)}})\times 100\\ &G_d=(\frac{y_{t+1}}{y_t}-1)\times 100 \end{align}\]
where \(y_i>0\) for all \(i\). Also, recall the approximation \(\ln(1+x) \approx x\) when \(x\) is small. Therefore, for variables whose growth rates are small, such as GDP, it might not be important which formula we choose. However, assume that the growth rate is large, e.g., in one period the value increases from 1 to 2. The discrete formula gives 100%, while the continuous formula gives 69.3%. Does this mean the continuous formula is wrong? Definitely not. When we say “the value increases from 1 to 2, so it grew by 100%”, we are assuming that the growth is discrete. Therefore, if you are wondering which of these formulas to use, think about your assumptions and your mathematical model. It is generally easier to work with the continuous formula because logarithms are easy to differentiate and turn products of growth factors into sums.
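The two one-period formulas can be compared numerically with a few lines of base R. This is only an illustrative sketch; the helper names growth_continuous and growth_discrete are mine and are not part of tdata.

```r
# One-period growth rates in percent; both require positive values
growth_continuous <- function(y0, y1) log(y1 / y0) * 100
growth_discrete   <- function(y0, y1) (y1 / y0 - 1) * 100

# Large change: the two definitions differ a lot
big_c <- growth_continuous(1, 2)      # about 69.3
big_d <- growth_discrete(1, 2)        # 100

# Small change: the two definitions nearly coincide
small_c <- growth_continuous(100, 102)  # about 1.98
small_d <- growth_discrete(100, 102)    # 2, up to floating-point rounding

print(c(big_c = big_c, big_d = big_d, small_c = small_c, small_d = small_d))
```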
The interest rate is a similar concept to the growth rate, so let’s study these formulas from that perspective. Assume that you lend an amount \(y_0\) at time \(t=0\). If the annual interest rate is \(r\) and interest is paid once a year, at the end of the first year you will have \(y_0(1+r)\). Now let’s assume that interest is paid monthly and that you lend out each interest payment too. At the end of the first month, you will have \(y_0(1+\frac{r}{12})\). Then, at the end of the second month, you will have \(y_0(1+\frac{r}{12})^2\), and so on. At the end of the year, you will have \(y_0(1+\frac{r}{12})^{12}\). At the end of the second year, you will have \(y_0(1+\frac{r}{12})^{24}\). At the end of \(t\) years, you will have \(y_0(1+\frac{r}{12})^{12t}\). We can similarly split the year into smaller periods and calculate the limit as the number of periods per year moves from 12 toward infinity. If we use L’Hôpital’s rule to calculate this limit, we find the continuous formula: \(y_t=y_0 e^{rt}\). Compared to the previous discussion, we have a single scalar rate \(r\) here instead of a different rate in each period.
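The limit argument can be checked numerically. In the sketch below, the amount, rate, and horizon are arbitrary illustrative values; the code compounds the same annual rate over more and more periods per year and compares the result with the continuous formula.

```r
y0    <- 100    # hypothetical amount lent at t = 0
r     <- 0.05   # hypothetical annual interest rate
years <- 10     # horizon t

# Value after `years` years when interest is compounded n times per year
compound <- function(n) y0 * (1 + r / n)^(n * years)

periods_per_year <- c(1, 12, 365, 1e6)
discrete_values  <- sapply(periods_per_year, compound)
continuous_value <- y0 * exp(r * years)

print(data.frame(periods_per_year, discrete_values))
print(continuous_value)
```

The discrete values increase with the compounding frequency and approach the continuous value from below.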
Let’s get back to the long-run formulas. We are looking for a scalar value \(g\) in which we can summarize all the other \(g_i\)s, such that: \[\begin{align} &y(t)=y(0)e^{g_c t}\\ &y_t=y_0(1+g_d)^t \end{align}\]
This single value does what all the other growth rates achieve together: it takes us from the starting point (\(y_0\) or \(y(0)\)) to the end point (\(y_t\) or \(y(t)\)). We call \(\bar{G}_c = g_c\times 100\) and \(\bar{G}_d=g_d \times 100\) the continuous and discrete long-run growth rates. It is important to note that \(\bar{G}_c\) and \(\bar{G}_d\) are not exactly equal (the reasoning is similar to the discussion about \(G_c\) and \(G_d\) above).
Given the data \(\{y_i\}_{i=0}^t\) where \(y_i>0\) for all \(i\), the following formulas calculate the long-run continuous growth rate: \[\begin{align} &\bar{G}_c = \frac{\ln{\frac{y(t)}{y(0)}}}{t}\times 100\\ &\bar{G}_c = \frac{G_{c1}+G_{c2}+\ldots+G_{ct}}{t} \end{align}\]
in which \(G_{ci}\) is the continuous growth rate in period \(i\). The second formula follows from the first because \(\ln{x}+\ln{y}=\ln{xy}\) for \(x,y>0\), so the sum of the one-period log ratios collapses to \(\ln{\frac{y(t)}{y(0)}}\).
The two similar formulas for the discrete case are:
\[\begin{align} &\bar{G}_d = \left(\left(\frac{y_t}{y_0}\right)^{\frac{1}{t}}-1\right)\times 100\\ &\bar{G}_d =\left(\sqrt[t]{\prod_{i=1}^{t}\left(1+\frac{G_{di}}{100}\right)}-1\right)\times 100 \end{align}\]
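These are more complicated than the continuous case: the continuous long-run rate is an arithmetic mean of the one-period rates, while the discrete one is based on a geometric mean. As I said before, it is much easier to work under the continuous assumption.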
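As a quick check of the four formulas above, here is a base-R sketch on a small made-up series. The function names longrun_continuous and longrun_discrete are hypothetical helpers for this illustration, not tdata functions.

```r
# Long-run growth rates in percent from the end points of a positive series
longrun_continuous <- function(y) {
  n <- length(y) - 1                       # number of periods t
  log(y[length(y)] / y[1]) / n * 100
}
longrun_discrete <- function(y) {
  n <- length(y) - 1
  ((y[length(y)] / y[1])^(1 / n) - 1) * 100
}

# Made-up data: grows by 10%, 10% and 5% (discrete rates)
y <- c(100, 110, 121, 127.05)

# One-period growth rates in percent
G_c <- diff(log(y)) * 100
G_d <- (y[-1] / y[-length(y)] - 1) * 100

# Second form of each formula: arithmetic mean vs. geometric mean
avg_c  <- mean(G_c)
geom_d <- (prod(1 + G_d / 100)^(1 / length(G_d)) - 1) * 100

print(c(continuous = longrun_continuous(y), mean_of_Gc = avg_c))
print(c(discrete = longrun_discrete(y), geometric_mean = geom_d))
```

In each printed pair, the two numbers agree up to floating-point error, and the continuous rate is slightly below the discrete one.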
It is important to note that these formulas are not applicable when \(\{y_i\}_{i=0}^t\) contains zero or negative values, because the ratios and logarithms involved require positive data. In such cases, alternative methods may be needed to measure growth or change over time. One possible approach is to add a constant to all values to make them positive before calculating the growth rates. However, this approach may not always be suitable, and it is crucial to carefully evaluate the context of the data before applying any transformation.
The following code creates two tdata variables which
contain the real GDP per capita (PPP) of Korea and Iran from 1990 to
2021. The data source is “World Development Indicators” and the data code is
“NY.GDP.PCAP.PP.KD”.
y_kor <- variable(data = c(12656, 13882, 14591, 15436, 16698, 18120, 19365, 20368, 19184, 21233, 22964, 23894, 25591, 26260, 27516, 28641, 29991, 31570, 32275, 32364, 34394, 35389, 36049, 37021, 37967, 38829, 39815, 40957, 41966, 42759, 42397, 44232),
startFrequency = f.yearly(1990),
name = "Korea, GDP per capita, PPP (constant 2017 international $), World Bank")
y_iri <- variable(data = c(9442, 10240, 10331, 10114, 9904, 10007, 10504, 10495, 10548, 10590, 11026, 11098, 11879, 12786, 13127, 13329, 13781, 14690, 14526, 14474, 15099, 15302, 14542, 14113, 14539, 14011, 14969, 15163, 14629, 14084, 14432, 15005),
startFrequency = f.yearly(1990),
name = "Iran, Rep., GDP per capita, PPP (constant 2017 international $), World Bank")
We can use the get.longrun.growth function to calculate
the long-run growth rates and plot the trends.
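As a cross-check that does not rely on get.longrun.growth, the continuous long-run rates can also be computed by hand from the end points of the two series. The sketch below uses base R only; the first and last values are copied from the data above, and everything else (names, the trend construction) is my own illustration.

```r
# End points copied from the series above (1990 and 2021)
kor <- c(first = 12656, last = 44232)
iri <- c(first = 9442,  last = 15005)
periods <- 2021 - 1990                 # 31 yearly periods

# Continuous long-run growth rate (as a fraction, not percent)
g_c <- function(v) log(v[["last"]] / v[["first"]]) / periods

g_kor <- g_c(kor)
g_iri <- g_c(iri)
print(round(c(Korea = g_kor, Iran = g_iri) * 100, 2))

# Implied smooth trend: starts at the 1990 value and ends at the 2021 value
years     <- 1990:2021
trend_kor <- kor[["first"]] * exp(g_kor * (years - 1990))
trend_iri <- iri[["first"]] * exp(g_iri * (years - 1990))

plot(years, trend_kor, type = "l", ylim = range(trend_kor, trend_iri),
     xlab = "Year", ylab = "GDP per capita (trend)")
lines(years, trend_iri, lty = 2)
legend("topleft", legend = c("Korea", "Iran"), lty = c(1, 2))
```

By construction, each trend line passes through the first and last observation of its series.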
Data cleaning is a crucial early step in the data analytics process,
which may involve handling missing or NA observations. There are several
methods to deal with missing values, including replacing them with
substitutes from other records or datasets, estimating values based on
other available information through imputation, or using a mathematical
function to fit a curve to the available data points through
interpolation. However, sometimes it may be necessary to remove
NA observations from the data before analysis. In this
section, we will discuss a common scenario when working with
cross-sectional data where NA values exist in some variables.
Let’s begin with a simple example. Consider the following data table, where the variables are in columns and the observations are in rows:
\[\begin{equation} \begin{array}{c|ccc} & V_1 & V_2 & V_3 \\ \hline O_1 & 1 & 2 & 3 \\ O_2 & 4 & NA & 5 \\ O_3 & 7 & NA & 9 \\ O_4 & 10 & 11 & 12 \end{array} \end{equation}\]
If we remove \(V_2\), the resulting data table will have 8 data points (4 rows by 2 columns). However, if we remove observations \(O_2\) and \(O_3\), the final data table will have only 6 data points (2 rows by 3 columns). Assuming that all data points are equally important, removing \(V_2\) would be the best strategy. Now, consider the following structure:
\[\begin{equation} \begin{array}{c|ccc} & V_1 & V_2 & V_3 \\ \hline O_1 & NA & 2 & NA \\ O_2 & 4 & 5 & 5 \\ O_3 & NA & 6 & 9 \\ O_4 & NA & 11 & 12 \end{array} \end{equation}\]
In this case, the best strategy would be to remove \(V_1\) and \(O_1\). Now, consider the following data table:
\[\begin{equation} \begin{array}{c|cccc} & V_1 & V_2 & V_3 & V_4\\ \hline O_1 & NA & 2 & NA & NA \\ O_2 & 5 & 6 & 7 & 8 \\ O_3 & 9 & NA & 11 & 12 \\ O_4 & 13 & 14 & NA & NA \\ O_5 & 17 & 18 & 19 & 20 \end{array} \end{equation}\]
Can you see what the best strategy would be in this case? What if it is more important to keep variables than observations?
The remove.na.strategies function in the
tdata package can help you decide on a proper order for
removing columns and rows. Here is an example using the data from the
previous example:
st <- remove.na.strategies(data = matrix(c(NA,2,NA,NA,5,6,7,8,9,NA,
11,12,13,14,NA,NA,17,18,19,20),
ncol = 4,
byrow = TRUE))
The first element of the output shows the best strategy, in which the
final data table has 9 data points: it removes the columns with the indices
reported in colRemove (which is 2) and the rows with the indices
reported in rowRemove (which are 1 and 4). The second-best
strategy results in a final matrix with 8 data points and suggests
removing the rows with the indices reported in rowRemove (which are
1, 3 and 4) and no columns.
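The reported counts can be verified with a brute-force search in base R. The sketch below is only an illustration of the idea and is not the algorithm used by remove.na.strategies: for every non-empty set of columns to keep, it keeps the rows that are complete in those columns and counts the remaining cells. The helper name count_fun is hypothetical.

```r
m <- matrix(c(NA, 2, NA, NA, 5, 6, 7, 8, 9, NA,
              11, 12, 13, 14, NA, NA, 17, 18, 19, 20),
            ncol = 4, byrow = TRUE)

# Score of a candidate table: number of remaining cells
count_fun <- function(n_rows, n_cols) n_rows * n_cols

results <- list()
for (k in seq_len(ncol(m))) {
  for (cols in combn(ncol(m), k, simplify = FALSE)) {
    # rows without NA in the kept columns
    rows <- which(complete.cases(m[, cols, drop = FALSE]))
    results[[length(results) + 1]] <- list(
      colKeep = cols,
      rowKeep = rows,
      score   = count_fun(length(rows), length(cols))
    )
  }
}

scores <- vapply(results, function(r) r$score, numeric(1))
best   <- results[[which.max(scores)]]

# Express the result as indices to remove, as in the text above
col_remove <- setdiff(seq_len(ncol(m)), best$colKeep)
row_remove <- setdiff(seq_len(nrow(m)), best$rowKeep)
print(list(colRemove = col_remove, rowRemove = row_remove, count = best$score))
```

This enumerates all column subsets, so it is only practical for a small number of columns.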
We can use the countFun argument to increase the weight
of columns. For example, if we use
countFun = function(nRows, nCols) nRows * nCols^2, the best
strategy would be to remove the rows with the indices reported in
rowRemove (which are 1, 3 and 4) and no columns.
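The change in ranking is easy to verify by hand. The following lines simply evaluate the weighting function from the text on the dimensions of the two candidate tables discussed above (3 rows by 3 columns, and 2 rows by 4 columns); the variable names are mine.

```r
count_fun <- function(nRows, nCols) nRows * nCols^2

# Remove column 2 and rows 1, 4: a 3 x 3 table remains
score_drop_column <- count_fun(3, 3)   # 27

# Remove rows 1, 3, 4 and no columns: a 2 x 4 table remains
score_keep_columns <- count_fun(2, 4)  # 32

print(c(drop_column = score_drop_column, keep_columns = score_keep_columns))
```

With the squared column count, keeping all four columns scores higher, which is why the ranking of the two strategies is reversed.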