Introduction

R package latentcor utilizes the powerful semi-parametric latent Gaussian copula models to estimate latent correlations between mixed data types. The package allows to estimate correlations between any of continuous/binary/ternary/zero-inflated (truncated) variable types. The underlying implementation takes advantage of fast multi-linear interpolation scheme with a clever choice of grid points that give the package a small memory footprint, and allows to use the latent correlations with sub-sampling and bootstrapping.

Statement of need

No R software package is currently available that allows accurate and fast correlation estimation from mixed variable data in a unifying manner. The R package latentcor, introduced here, thus represents the first stand-alone R package for computation of latent correlation that takes into account all variable types (continuous/binary/ordinal/zero-inflated), comes with an optimized memory footprint, and is computationally efficient, essentially making latent correlation estimation almost as fast as rank-based correlation estimation.

Getting started

A simple example with two variables

First, we will generate a pair of variables with different types using a sample size \(n=100\) which will serve as example data. Here first variable will be ternary, and second variable will be continuous.

simdata = gen_data(n = 100, types = c("ter", "con"))

The output of gen_data is a list with 2 elements:

names(simdata)
#> [1] "X"     "plotX"
  • X: a matrix (\(100\times 2\)), the first column is the ternary variable; the second column is the continuous variable.
X = simdata$X
head(X, n = 6L)
#>      [,1]       [,2]
#> [1,]    0 -1.1452083
#> [2,]    1 -1.5583566
#> [3,]    2 -1.1125013
#> [4,]    1  0.3972036
#> [5,]    1 -0.4723548
#> [6,]    1 -0.3982294
  • plotX: NULL (showplot = FALSE, can be changed to display the plot of generated data ingen_data input).
simdata$plotX
#> NULL

Then we can estimate the latent correlation matrix based on these 2 variables using latentcor function.

estimate = latentcor(X, types = c("ter", "con"))

The output of latentcor is a list with several elements:

names(estimate)
#> [1] "zratios"    "K"          "R"          "Rpointwise" "plotR"
  • zratios is a list has the same length as the number of variables. Here the first element is a (\(2\times1\)) vector indicating the cumulative proportions for zeros and ones in the ternary variable (e.g. first element in vector is the proportion of zeros, second element in vector is the proportion of zeros and ones.) The second element of the list is NA for continuous variable.
estimate$zratios
#> [[1]]
#> [1] 0.3 0.8
#> 
#> [[2]]
#> [1] NA
  • K: Kendall \(\tau\) (\(\tau_{a}\)) correlation matrix for these 2 variables.
estimate$K
#>           [,1]      [,2]
#> [1,] 1.0000000 0.1923232
#> [2,] 0.1923232 1.0000000
  • Rpointwise: matrix of pointwise estimated correlations. Due to pointwise estimation, Rpointwise is not guaranteed to be positive semi-definite
estimate$Rpointwise
#>           [,1]      [,2]
#> [1,] 1.0000000 0.3537794
#> [2,] 0.3537794 1.0000000
  • R: estimated final latent correlation matrix, this matrix is guaranteed to be strictly positive definite (through nearPD projection and parameter nu, see Mathematical framework for estimation) if use.nearPD = TRUE.
estimate$R
#>           [,1]      [,2]
#> [1,] 1.0000000 0.3534256
#> [2,] 0.3534256 1.0000000
  • plotR: NULL by default as showplot = FALSE in latentcor. Otherwise displays a heatmap of latent correlation matrix.
estimate$plotR
#> NULL

Example with mtcars dataset

We use the build-in dataset mtcars:

head(mtcars, n = 6L)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Let’s take a look at the unique values for each variable to determine the corresponding data type.

apply(mtcars, 2, table)
#> $mpg
#> 
#> 10.4 13.3 14.3 14.7   15 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2 19.7   21 
#>    2    1    1    1    1    2    1    1    1    1    1    1    1    2    1    2 
#> 21.4 21.5 22.8 24.4   26 27.3 30.4 32.4 33.9 
#>    2    1    2    1    1    1    2    1    1 
#> 
#> $cyl
#> 
#>  4  6  8 
#> 11  7 14 
#> 
#> $disp
#> 
#>  71.1  75.7  78.7    79  95.1   108 120.1 120.3   121 140.8   145 146.7   160 
#>     1     1     1     1     1     1     1     1     1     1     1     1     2 
#> 167.6   225   258 275.8   301   304   318   350   351   360   400   440   460 
#>     2     1     1     3     1     1     1     1     1     2     1     1     1 
#>   472 
#>     1 
#> 
#> $hp
#> 
#>  52  62  65  66  91  93  95  97 105 109 110 113 123 150 175 180 205 215 230 245 
#>   1   1   1   2   1   1   1   1   1   1   3   1   2   2   3   3   1   1   1   2 
#> 264 335 
#>   1   1 
#> 
#> $drat
#> 
#> 2.76 2.93    3 3.07 3.08 3.15 3.21 3.23 3.54 3.62 3.69  3.7 3.73 3.77 3.85  3.9 
#>    2    1    1    3    2    2    1    1    1    1    1    1    1    1    1    2 
#> 3.92 4.08 4.11 4.22 4.43 4.93 
#>    3    2    1    2    1    1 
#> 
#> $wt
#> 
#> 1.513 1.615 1.835 1.935  2.14   2.2  2.32 2.465  2.62  2.77  2.78 2.875  3.15 
#>     1     1     1     1     1     1     1     1     1     1     1     1     1 
#>  3.17  3.19 3.215 3.435  3.44  3.46  3.52  3.57  3.73  3.78  3.84 3.845  4.07 
#>     1     1     1     1     3     1     1     2     1     1     1     1     1 
#>  5.25 5.345 5.424 
#>     1     1     1 
#> 
#> $qsec
#> 
#>  14.5  14.6 15.41  15.5 15.84 16.46  16.7 16.87  16.9 17.02 17.05  17.3  17.4 
#>     1     1     1     1     1     1     1     1     1     2     1     1     1 
#> 17.42  17.6 17.82 17.98    18  18.3 18.52  18.6 18.61  18.9 19.44 19.47  19.9 
#>     1     1     1     1     1     1     1     1     1     2     1     1     1 
#>    20 20.01 20.22  22.9 
#>     1     1     1     1 
#> 
#> $vs
#> 
#>  0  1 
#> 18 14 
#> 
#> $am
#> 
#>  0  1 
#> 19 13 
#> 
#> $gear
#> 
#>  3  4  5 
#> 15 12  5 
#> 
#> $carb
#> 
#>  1  2  3  4  6  8 
#>  7 10  3 10  1  1

Then we can estimate the latent correlation matrix for all variables of mtcars by using latentcor function.

estimate_mtcars = latentcor(mtcars, types = c("con", "ter", "con", "con", "con", "con", "con", "bin", "bin", "ter", "con"))

Note that the determination of variable types can also be done automatically by latentcor package using get_types function:

estimate_mtcars = latentcor(mtcars, types = get_types(mtcars))

This function is run automatically inside latentcor if the types are not supplied by the user, however the automatic determination of types takes extra time, so we recommend to specify types explicitly if they are known in advance.

The output of latentcor for mtcars:

names(estimate_mtcars)
#> [1] "zratios"    "K"          "R"          "Rpointwise" "plotR"
  • zratios: zratios for corresponding variables in mtcars.
estimate_mtcars$zratios
#> [[1]]
#> [1] NA
#> 
#> [[2]]
#> [1] 0.34375 0.56250
#> 
#> [[3]]
#> [1] NA
#> 
#> [[4]]
#> [1] NA
#> 
#> [[5]]
#> [1] NA
#> 
#> [[6]]
#> [1] NA
#> 
#> [[7]]
#> [1] NA
#> 
#> [[8]]
#> [1] 0.5625
#> 
#> [[9]]
#> [1] 0.59375
#> 
#> [[10]]
#> [1] 0.46875 0.84375
#> 
#> [[11]]
#> [1] NA
  • K: Kendall \(\tau\) (\(\tau_{a}\)) correlation matrix for variables in mtcars.
estimate_mtcars$K
#>             mpg        cyl       disp         hp        drat         wt
#> mpg   1.0000000 -0.6431452 -0.7580645 -0.7278226  0.45564516 -0.7197581
#> cyl  -0.6431452  1.0000000  0.6592742  0.6310484 -0.44354839  0.5907258
#> disp -0.7580645  0.6592742  1.0000000  0.6532258 -0.48991935  0.7358871
#> hp   -0.7278226  0.6310484  0.6532258  1.0000000 -0.37298387  0.6008065
#> drat  0.4556452 -0.4435484 -0.4899194 -0.3729839  1.00000000 -0.5383065
#> wt   -0.7197581  0.5907258  0.7358871  0.6008065 -0.53830645  1.0000000
#> qsec  0.3125000 -0.3649194 -0.2983871 -0.4657258  0.03225806 -0.1411290
#> vs    0.4173387 -0.4475806 -0.4274194 -0.4435484  0.26411290 -0.3467742
#> am    0.3286290 -0.2842742 -0.3649194 -0.2116935  0.40120968 -0.4314516
#> gear  0.3427419 -0.3326613 -0.3770161 -0.2197581  0.45967742 -0.4314516
#> carb -0.4395161  0.3326613  0.3608871  0.5161290 -0.08266129  0.3245968
#>             qsec          vs          am        gear        carb
#> mpg   0.31250000  0.41733871  0.32862903  0.34274194 -0.43951613
#> cyl  -0.36491935 -0.44758065 -0.28427419 -0.33266129  0.33266129
#> disp -0.29838710 -0.42741935 -0.36491935 -0.37701613  0.36088710
#> hp   -0.46572581 -0.44354839 -0.21169355 -0.21975806  0.51612903
#> drat  0.03225806  0.26411290  0.40120968  0.45967742 -0.08266129
#> wt   -0.14112903 -0.34677419 -0.43145161 -0.43145161  0.32459677
#> qsec  1.00000000  0.46774194 -0.11895161 -0.07258065 -0.44354839
#> vs    0.46774194  1.00000000  0.08467742  0.15322581 -0.36088710
#> am   -0.11895161  0.08467742  1.00000000  0.43346774 -0.03629032
#> gear -0.07258065  0.15322581  0.43346774  1.00000000  0.06854839
#> carb -0.44354839 -0.36088710 -0.03629032  0.06854839  1.00000000
  • Rpointwise: matrix of pointwise estimated correlations for mtcars.
estimate_mtcars$Rpointwise
#>             mpg        cyl       disp         hp        drat         wt
#> mpg   1.0000000 -0.9990000 -0.9286530 -0.9099905  0.65616525 -0.9046652
#> cyl  -0.9990000  1.0000000  0.9990000  0.9900378 -0.77195772  0.9525997
#> disp -0.9286530  0.9990000  1.0000000  0.8552768 -0.69582182  0.9151697
#> hp   -0.9099905  0.9900378  0.8552768  1.0000000 -0.55293425  0.8097609
#> drat  0.6561652 -0.7719577 -0.6958218 -0.5529342  1.00000000 -0.7483492
#> wt   -0.9046652  0.9525997  0.9151697  0.8097609 -0.74834918  1.0000000
#> qsec  0.4713967 -0.6540431 -0.4517316 -0.6680316  0.05064917 -0.2198737
#> vs    0.8727316 -0.9623421 -0.8905658 -0.9188458  0.57685875 -0.7416377
#> am    0.7178533 -0.7124468 -0.7888268 -0.4746999  0.85723711 -0.9121559
#> gear  0.6234660 -0.7106595 -0.6786359 -0.4119442  0.80260415 -0.7617271
#> carb -0.6368382  0.6025491  0.5370028  0.7247928 -0.12947951  0.4880685
#>             qsec         vs          am       gear        carb
#> mpg   0.47139674  0.8727316  0.71785333  0.6234660 -0.63683817
#> cyl  -0.65404313 -0.9623421 -0.71244682 -0.7106595  0.60254911
#> disp -0.45173164 -0.8905658 -0.78882677 -0.6786359  0.53700280
#> hp   -0.66803158 -0.9188458 -0.47469992 -0.4119442  0.72479279
#> drat  0.05064917  0.5768588  0.85723711  0.8026041 -0.12947951
#> wt   -0.21987367 -0.7416377 -0.91215587 -0.7617271  0.48806852
#> qsec  1.00000000  0.9599123 -0.27004807 -0.1385035 -0.64170875
#> vs    0.95991229  1.0000000  0.27236999  0.4087924 -0.76863616
#> am   -0.27004807  0.2723700  1.00000000  0.9941583 -0.08284094
#> gear -0.13850346  0.4087924  0.99415828  1.0000000  0.13086286
#> carb -0.64170875 -0.7686362 -0.08284094  0.1308629  1.00000000
  • R: estimated final latent correlation matrix for mtcars.
estimate_mtcars$R
#>             mpg        cyl       disp         hp        drat         wt
#> mpg   1.0000000 -0.9608749 -0.9404695 -0.9153219  0.66638283 -0.9179476
#> cyl  -0.9608749  1.0000000  0.9650156  0.9346973 -0.74121240  0.9125875
#> disp -0.9404695  0.9650156  1.0000000  0.8622172 -0.71261891  0.9339336
#> hp   -0.9153219  0.9346973  0.8622172  1.0000000 -0.55835884  0.8154551
#> drat  0.6663828 -0.7412124 -0.7126189 -0.5583588  1.00000000 -0.7640859
#> wt   -0.9179476  0.9125875  0.9339336  0.8154551 -0.76408588  1.0000000
#> qsec  0.4827478 -0.5741600 -0.4631657 -0.6785315  0.07261583 -0.2322193
#> vs    0.8541286 -0.9212463 -0.8488590 -0.9228202  0.53220265 -0.7085704
#> am    0.6957798 -0.6710066 -0.7387695 -0.4747351  0.82546004 -0.8694336
#> gear  0.6378821 -0.6636071 -0.7061776 -0.4231995  0.82301630 -0.7835243
#> carb -0.6403560  0.5944601  0.5464094  0.7269360 -0.13825503  0.4939057
#>             qsec         vs          am       gear        carb
#> mpg   0.48274782  0.8541286  0.69577977  0.6378821 -0.64035597
#> cyl  -0.57415996 -0.9212463 -0.67100656 -0.6636071  0.59446007
#> disp -0.46316567 -0.8488590 -0.73876946 -0.7061776  0.54640944
#> hp   -0.67853152 -0.9228202 -0.47473508 -0.4231995  0.72693596
#> drat  0.07261583  0.5322026  0.82546004  0.8230163 -0.13825503
#> wt   -0.23221928 -0.7085704 -0.86943361 -0.7835243  0.49390573
#> qsec  1.00000000  0.8398496 -0.21520264 -0.1246468 -0.66023665
#> vs    0.83984962  1.0000000  0.33897317  0.3619248 -0.72932209
#> am   -0.21520264  0.3389732  1.00000000  0.9336415 -0.06770134
#> gear -0.12464675  0.3619248  0.93364152  1.0000000  0.11758995
#> carb -0.66023665 -0.7293221 -0.06770134  0.1175900  1.00000000
estimate_mtcars$plotR
#> NULL

Example using latentcor with subsampling

While latentcor can determine the types of each variable automatically, it is recommended to call get_types first and then supply types explicitly to save the computation time, especially when using latentcor with sub-sampling (which we illustrate below).

First, we will generate variables with different types using a sample size \(n=100\) which will serve as an example data for subsampling.

simdata2 = gen_data(n = 100, types = c(rep("ter", 3), "con", rep("bin", 3)))

To use the data with subsampling, we recommend to first run get_types on the full data

types = get_types(simdata2$X)
types
#> [1] "ter" "ter" "ter" "con" "bin" "bin" "bin"

Then, when doing subsampling, we recommend to explicitly supply identified types to latentcor. We illustrate using 10 subsamples, each of size 80.

start_time = proc.time()
for (s in 1:10){
  # Select a random subsample of size 80
  subsample = sample(1:100, 80)
  # Estimate latent correlation on subsample specifying the types
  Rs = latentcor(simdata2$X[subsample, ], types = types)
}
proc.time() - start_time
#>    user  system elapsed 
#>    0.03    0.00    0.03

Compared with

start_time = proc.time()
for (s in 1:10){
  # Select a random subsample of size 80
  subsample = sample(1:100, 80)
  # Estimate latent correlation on subsample specifying the types
  Rs = latentcor(simdata2$X[subsample, ], types = get_types(simdata2$X))
}
proc.time() - start_time
#>    user  system elapsed 
#>    0.03    0.00    0.03

References

Croux, Christophe, Peter Filzmoser, and Heinrich Fritz. 2013. “Robust Sparse Principal Component Analysis.” Technometrics 55 (2): 202–14.
Filzmoser, Peter, Heinrich Fritz, and Klaudius Kalcher. 2021. pcaPP: Robust PCA by Projection Pursuit. https://CRAN.R-project.org/package=pcaPP.
Fox, John. 2019. Polycor: Polychoric and Polyserial Correlations. https://CRAN.R-project.org/package=polycor.
Liu, Han, John Lafferty, and Larry Wasserman. 2009. “The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs.” Journal of Machine Learning Research 10 (10).