latentcor.Rmd
R package latentcor
utilizes the powerful
semi-parametric latent Gaussian copula models to estimate latent
correlations between mixed data types. The package allows to estimate
correlations between any of continuous/binary/ternary/zero-inflated
(truncated) variable types. The underlying implementation takes
advantage of fast multi-linear interpolation scheme with a clever choice
of grid points that give the package a small memory footprint, and
allows to use the latent correlations with sub-sampling and
bootstrapping.
No R software package is currently available that allows accurate and
fast correlation estimation from mixed variable data in a unifying
manner. The R package latentcor
, introduced here, thus
represents the first stand-alone R package for computation of latent
correlation that takes into account all variable types
(continuous/binary/ordinal/zero-inflated), comes with an optimized
memory footprint, and is computationally efficient, essentially making
latent correlation estimation almost as fast as rank-based correlation
estimation.
First, we will generate a pair of variables with different types using a sample size \(n=100\) which will serve as example data. Here first variable will be ternary, and second variable will be continuous.
The output of gen_data
is a list with 2 elements:
names(simdata)
#> [1] "X" "plotX"
X
: a matrix (\(100\times
2\)), the first column is the ternary variable; the second column
is the continuous variable.
X = simdata$X
head(X, n = 6L)
#> [,1] [,2]
#> [1,] 0 -1.1452083
#> [2,] 1 -1.5583566
#> [3,] 2 -1.1125013
#> [4,] 1 0.3972036
#> [5,] 1 -0.4723548
#> [6,] 1 -0.3982294
plotX
: NULL (showplot = FALSE
, can be
changed to display the plot of generated data ingen_data
input).
simdata$plotX
#> NULL
Then we can estimate the latent correlation matrix based on these 2
variables using latentcor
function.
The output of latentcor
is a list with several
elements:
names(estimate)
#> [1] "zratios" "K" "R" "Rpointwise" "plotR"
zratios
is a list has the same length as the number of
variables. Here the first element is a (\(2\times1\)) vector indicating the
cumulative proportions for zeros and ones in the ternary variable
(e.g. first element in vector is the proportion of zeros, second element
in vector is the proportion of zeros and ones.) The second element of
the list is NA for continuous variable.
estimate$zratios
#> [[1]]
#> [1] 0.3 0.8
#>
#> [[2]]
#> [1] NA
K
: Kendall \(\tau\)
(\(\tau_{a}\)) correlation matrix for
these 2 variables.
estimate$K
#> [,1] [,2]
#> [1,] 1.0000000 0.1923232
#> [2,] 0.1923232 1.0000000
Rpointwise
: matrix of pointwise estimated correlations.
Due to pointwise estimation, Rpointwise
is not guaranteed
to be positive semi-definite
estimate$Rpointwise
#> [,1] [,2]
#> [1,] 1.0000000 0.3537794
#> [2,] 0.3537794 1.0000000
R
: estimated final latent correlation matrix, this
matrix is guaranteed to be strictly positive definite (through
nearPD
projection and parameter nu
, see
Mathematical framework for estimation) if
use.nearPD = TRUE
.
estimate$R
#> [,1] [,2]
#> [1,] 1.0000000 0.3534256
#> [2,] 0.3534256 1.0000000
plotR
: NULL by default as showplot = FALSE
in latentcor
. Otherwise displays a heatmap of latent
correlation matrix.
estimate$plotR
#> NULL
We use the build-in dataset mtcars
:
head(mtcars, n = 6L)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Let’s take a look at the unique values for each variable to determine the corresponding data type.
apply(mtcars, 2, table)
#> $mpg
#>
#> 10.4 13.3 14.3 14.7 15 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2 19.7 21
#> 2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2
#> 21.4 21.5 22.8 24.4 26 27.3 30.4 32.4 33.9
#> 2 1 2 1 1 1 2 1 1
#>
#> $cyl
#>
#> 4 6 8
#> 11 7 14
#>
#> $disp
#>
#> 71.1 75.7 78.7 79 95.1 108 120.1 120.3 121 140.8 145 146.7 160
#> 1 1 1 1 1 1 1 1 1 1 1 1 2
#> 167.6 225 258 275.8 301 304 318 350 351 360 400 440 460
#> 2 1 1 3 1 1 1 1 1 2 1 1 1
#> 472
#> 1
#>
#> $hp
#>
#> 52 62 65 66 91 93 95 97 105 109 110 113 123 150 175 180 205 215 230 245
#> 1 1 1 2 1 1 1 1 1 1 3 1 2 2 3 3 1 1 1 2
#> 264 335
#> 1 1
#>
#> $drat
#>
#> 2.76 2.93 3 3.07 3.08 3.15 3.21 3.23 3.54 3.62 3.69 3.7 3.73 3.77 3.85 3.9
#> 2 1 1 3 2 2 1 1 1 1 1 1 1 1 1 2
#> 3.92 4.08 4.11 4.22 4.43 4.93
#> 3 2 1 2 1 1
#>
#> $wt
#>
#> 1.513 1.615 1.835 1.935 2.14 2.2 2.32 2.465 2.62 2.77 2.78 2.875 3.15
#> 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 3.17 3.19 3.215 3.435 3.44 3.46 3.52 3.57 3.73 3.78 3.84 3.845 4.07
#> 1 1 1 1 3 1 1 2 1 1 1 1 1
#> 5.25 5.345 5.424
#> 1 1 1
#>
#> $qsec
#>
#> 14.5 14.6 15.41 15.5 15.84 16.46 16.7 16.87 16.9 17.02 17.05 17.3 17.4
#> 1 1 1 1 1 1 1 1 1 2 1 1 1
#> 17.42 17.6 17.82 17.98 18 18.3 18.52 18.6 18.61 18.9 19.44 19.47 19.9
#> 1 1 1 1 1 1 1 1 1 2 1 1 1
#> 20 20.01 20.22 22.9
#> 1 1 1 1
#>
#> $vs
#>
#> 0 1
#> 18 14
#>
#> $am
#>
#> 0 1
#> 19 13
#>
#> $gear
#>
#> 3 4 5
#> 15 12 5
#>
#> $carb
#>
#> 1 2 3 4 6 8
#> 7 10 3 10 1 1
Then we can estimate the latent correlation matrix for all variables
of mtcars
by using latentcor
function.
estimate_mtcars = latentcor(mtcars, types = c("con", "ter", "con", "con", "con", "con", "con", "bin", "bin", "ter", "con"))
Note that the determination of variable types can also be done
automatically by latentcor
package using
get_types
function:
This function is run automatically inside latentcor
if
the types
are not supplied by the user, however the
automatic determination of types takes extra time, so we recommend to
specify types
explicitly if they are known in advance.
The output of latentcor
for mtcars
:
names(estimate_mtcars)
#> [1] "zratios" "K" "R" "Rpointwise" "plotR"
zratios
: zratios for corresponding variables in
mtcars
.
estimate_mtcars$zratios
#> [[1]]
#> [1] NA
#>
#> [[2]]
#> [1] 0.34375 0.56250
#>
#> [[3]]
#> [1] NA
#>
#> [[4]]
#> [1] NA
#>
#> [[5]]
#> [1] NA
#>
#> [[6]]
#> [1] NA
#>
#> [[7]]
#> [1] NA
#>
#> [[8]]
#> [1] 0.5625
#>
#> [[9]]
#> [1] 0.59375
#>
#> [[10]]
#> [1] 0.46875 0.84375
#>
#> [[11]]
#> [1] NA
K
: Kendall \(\tau\)
(\(\tau_{a}\)) correlation matrix for
variables in mtcars
.
estimate_mtcars$K
#> mpg cyl disp hp drat wt
#> mpg 1.0000000 -0.6431452 -0.7580645 -0.7278226 0.45564516 -0.7197581
#> cyl -0.6431452 1.0000000 0.6592742 0.6310484 -0.44354839 0.5907258
#> disp -0.7580645 0.6592742 1.0000000 0.6532258 -0.48991935 0.7358871
#> hp -0.7278226 0.6310484 0.6532258 1.0000000 -0.37298387 0.6008065
#> drat 0.4556452 -0.4435484 -0.4899194 -0.3729839 1.00000000 -0.5383065
#> wt -0.7197581 0.5907258 0.7358871 0.6008065 -0.53830645 1.0000000
#> qsec 0.3125000 -0.3649194 -0.2983871 -0.4657258 0.03225806 -0.1411290
#> vs 0.4173387 -0.4475806 -0.4274194 -0.4435484 0.26411290 -0.3467742
#> am 0.3286290 -0.2842742 -0.3649194 -0.2116935 0.40120968 -0.4314516
#> gear 0.3427419 -0.3326613 -0.3770161 -0.2197581 0.45967742 -0.4314516
#> carb -0.4395161 0.3326613 0.3608871 0.5161290 -0.08266129 0.3245968
#> qsec vs am gear carb
#> mpg 0.31250000 0.41733871 0.32862903 0.34274194 -0.43951613
#> cyl -0.36491935 -0.44758065 -0.28427419 -0.33266129 0.33266129
#> disp -0.29838710 -0.42741935 -0.36491935 -0.37701613 0.36088710
#> hp -0.46572581 -0.44354839 -0.21169355 -0.21975806 0.51612903
#> drat 0.03225806 0.26411290 0.40120968 0.45967742 -0.08266129
#> wt -0.14112903 -0.34677419 -0.43145161 -0.43145161 0.32459677
#> qsec 1.00000000 0.46774194 -0.11895161 -0.07258065 -0.44354839
#> vs 0.46774194 1.00000000 0.08467742 0.15322581 -0.36088710
#> am -0.11895161 0.08467742 1.00000000 0.43346774 -0.03629032
#> gear -0.07258065 0.15322581 0.43346774 1.00000000 0.06854839
#> carb -0.44354839 -0.36088710 -0.03629032 0.06854839 1.00000000
Rpointwise
: matrix of pointwise estimated correlations
for mtcars
.
estimate_mtcars$Rpointwise
#> mpg cyl disp hp drat wt
#> mpg 1.0000000 -0.9990000 -0.9286530 -0.9099905 0.65616525 -0.9046652
#> cyl -0.9990000 1.0000000 0.9990000 0.9900378 -0.77195772 0.9525997
#> disp -0.9286530 0.9990000 1.0000000 0.8552768 -0.69582182 0.9151697
#> hp -0.9099905 0.9900378 0.8552768 1.0000000 -0.55293425 0.8097609
#> drat 0.6561652 -0.7719577 -0.6958218 -0.5529342 1.00000000 -0.7483492
#> wt -0.9046652 0.9525997 0.9151697 0.8097609 -0.74834918 1.0000000
#> qsec 0.4713967 -0.6540431 -0.4517316 -0.6680316 0.05064917 -0.2198737
#> vs 0.8727316 -0.9623421 -0.8905658 -0.9188458 0.57685875 -0.7416377
#> am 0.7178533 -0.7124468 -0.7888268 -0.4746999 0.85723711 -0.9121559
#> gear 0.6234660 -0.7106595 -0.6786359 -0.4119442 0.80260415 -0.7617271
#> carb -0.6368382 0.6025491 0.5370028 0.7247928 -0.12947951 0.4880685
#> qsec vs am gear carb
#> mpg 0.47139674 0.8727316 0.71785333 0.6234660 -0.63683817
#> cyl -0.65404313 -0.9623421 -0.71244682 -0.7106595 0.60254911
#> disp -0.45173164 -0.8905658 -0.78882677 -0.6786359 0.53700280
#> hp -0.66803158 -0.9188458 -0.47469992 -0.4119442 0.72479279
#> drat 0.05064917 0.5768588 0.85723711 0.8026041 -0.12947951
#> wt -0.21987367 -0.7416377 -0.91215587 -0.7617271 0.48806852
#> qsec 1.00000000 0.9599123 -0.27004807 -0.1385035 -0.64170875
#> vs 0.95991229 1.0000000 0.27236999 0.4087924 -0.76863616
#> am -0.27004807 0.2723700 1.00000000 0.9941583 -0.08284094
#> gear -0.13850346 0.4087924 0.99415828 1.0000000 0.13086286
#> carb -0.64170875 -0.7686362 -0.08284094 0.1308629 1.00000000
R
: estimated final latent correlation matrix for
mtcars
.
estimate_mtcars$R
#> mpg cyl disp hp drat wt
#> mpg 1.0000000 -0.9608749 -0.9404695 -0.9153219 0.66638283 -0.9179476
#> cyl -0.9608749 1.0000000 0.9650156 0.9346973 -0.74121240 0.9125875
#> disp -0.9404695 0.9650156 1.0000000 0.8622172 -0.71261891 0.9339336
#> hp -0.9153219 0.9346973 0.8622172 1.0000000 -0.55835884 0.8154551
#> drat 0.6663828 -0.7412124 -0.7126189 -0.5583588 1.00000000 -0.7640859
#> wt -0.9179476 0.9125875 0.9339336 0.8154551 -0.76408588 1.0000000
#> qsec 0.4827478 -0.5741600 -0.4631657 -0.6785315 0.07261583 -0.2322193
#> vs 0.8541286 -0.9212463 -0.8488590 -0.9228202 0.53220265 -0.7085704
#> am 0.6957798 -0.6710066 -0.7387695 -0.4747351 0.82546004 -0.8694336
#> gear 0.6378821 -0.6636071 -0.7061776 -0.4231995 0.82301630 -0.7835243
#> carb -0.6403560 0.5944601 0.5464094 0.7269360 -0.13825503 0.4939057
#> qsec vs am gear carb
#> mpg 0.48274782 0.8541286 0.69577977 0.6378821 -0.64035597
#> cyl -0.57415996 -0.9212463 -0.67100656 -0.6636071 0.59446007
#> disp -0.46316567 -0.8488590 -0.73876946 -0.7061776 0.54640944
#> hp -0.67853152 -0.9228202 -0.47473508 -0.4231995 0.72693596
#> drat 0.07261583 0.5322026 0.82546004 0.8230163 -0.13825503
#> wt -0.23221928 -0.7085704 -0.86943361 -0.7835243 0.49390573
#> qsec 1.00000000 0.8398496 -0.21520264 -0.1246468 -0.66023665
#> vs 0.83984962 1.0000000 0.33897317 0.3619248 -0.72932209
#> am -0.21520264 0.3389732 1.00000000 0.9336415 -0.06770134
#> gear -0.12464675 0.3619248 0.93364152 1.0000000 0.11758995
#> carb -0.66023665 -0.7293221 -0.06770134 0.1175900 1.00000000
plotR
: NULL by default as showplot = FALSE
in latentcor
. Otherwise displays a heatmap of latent
correlation matrix for mtcars
(See heatmap of latent
correlation (approx) for mtcars).
estimate_mtcars$plotR
#> NULL
While latentcor
can determine the types of each variable
automatically, it is recommended to call get_types
first
and then supply types
explicitly to save the computation
time, especially when using latentcor with sub-sampling (which we
illustrate below).
First, we will generate variables with different types using a sample size \(n=100\) which will serve as an example data for subsampling.
To use the data with subsampling, we recommend to first run
get_types
on the full data
types = get_types(simdata2$X)
types
#> [1] "ter" "ter" "ter" "con" "bin" "bin" "bin"
Then, when doing subsampling, we recommend to explicitly supply
identified types to latentcor
. We illustrate using 10
subsamples, each of size 80.
start_time = proc.time()
for (s in 1:10){
# Select a random subsample of size 80
subsample = sample(1:100, 80)
# Estimate latent correlation on subsample specifying the types
Rs = latentcor(simdata2$X[subsample, ], types = types)
}
proc.time() - start_time
#> user system elapsed
#> 0.03 0.00 0.03
Compared with
start_time = proc.time()
for (s in 1:10){
# Select a random subsample of size 80
subsample = sample(1:100, 80)
# Estimate latent correlation on subsample specifying the types
Rs = latentcor(simdata2$X[subsample, ], types = get_types(simdata2$X))
}
proc.time() - start_time
#> user system elapsed
#> 0.03 0.00 0.03