Estimation of latent correlation matrix from observed data of (possibly) mixed types (continuous/binary/truncated/ternary) based on the latent Gaussian copula model. Missing values (NA) are allowed. The estimation is based on pairwise complete observations.

latentcor(
  X,
  types = NULL,
  method = c("approx", "original"),
  use.nearPD = TRUE,
  nu = 0.001,
  tol = 1e-08,
  ratio = 0.9,
  showplot = FALSE
)

Arguments

X

A numeric matrix or numeric data frame (n by p), where n is number of samples, and p is number of variables. Missing values (NA) are allowed, in which case the estimation is based on pairwise complete observations.

types

A vector of length p indicating the type of each of the p variables in X. Each element must be one of "con" (continuous), "bin" (binary), "ter" (ternary) or "tru" (truncated). If the vector has length 1, then all p variables are assumed to be of the same type that is supplied. The default value is NULL, and the variable types are determined automatically using function get_types. As automatic determination of variable types takes extra time, it is recommended to supply the types explicitly when they are known in advance.

method

The calculation method for latent correlations. Either "original" or "approx". If method = "approx", multilinear approximation method is used, which is much faster than the original method, see Yoon et al. (2021) for timing comparisons for various variable types. If method = "original", optimization of the bridge inverse function is used. The default is "approx".

use.nearPD

Logical indicator. use.nearPD = TRUE gets nearest positive definite matrix for the estimated latent correlation matrix with shrinkage adjustment by nu. Output R is the same as Rpointwise if use.nearPD = FALSE. Default value is TRUE.

nu

Shrinkage parameter for the correlation matrix, must be between 0 and 1. Guarantees that the minimal eigenvalue of returned correlation matrix is greater or equal to nu. When nu = 0, no shrinkage is performed, the returned correlation matrix will be semi-positive definite but not necessarily strictly positive definite. When nu = 1, the identity matrix is returned (not recommended). The default (recommended) value is 0.001.

tol

When method = "original", specifies the desired accuracy of the bridge function inversion via uniroot optimization and is passed to optimize. The default value is 1e-8. When method = "approx", this parameter is ignored.

ratio

When method = "approx", specifies the boundary value for multilinear interpolation, must be between 0 and 1. When ratio = 0, no linear interpolation is performed (the slowest execution) which is equivalent to method = "original". When ratio = 1, linear interpolation is always performed (the fastest execution) but may lead to high approximation errors. The default (recommended) value is 0.9 which controls the approximation error and has fast execution, see Yoon et al. (2021) for details. When method = "original", this parameter is ignored.

showplot

Logical indicator. showplot = TRUE generates a ggplot object plotR with the heatmap of latent correlation matrix R. plotR = NULL if showplot = FALSE. Default value is FALSE.

Value

latentcor returns

  • zratios: A list of of length p corresponding to each variable. Returns NA for continuous variable; proportion of zeros for binary/truncated variables; the cumulative proportions of zeros and ones (e.g. first value is proportion of zeros, second value is proportion of zeros and ones) for ternary variable.

  • K: (p x p) Kendall Tau (Tau-a) Matrix for X

  • R: (p x p) Estimated latent correlation matrix for X

  • Rpointwise: (p x p) Point-wise estimates of latent correlations for X. This matrix is not guaranteed to be semi-positive definite. This is the original estimated latent correlation matrix without adjustment for positive-definiteness.

  • plotR: Heatmap plot of latent correlation matrix R, NULL if showplot = FALSE

Details

The function estimates latent correlation by calculating inverse bridge function (Fan et al., 2017) evaluated at the value of sample Kendall's tau (\(\hat \tau\)). The bridge function F connects Kendall's tau to latent correlation r so that \(F(r) = E(\hat \tau)\). The form of function F depends on the variable types (continuous/binary/truncated/ternary), but is exact. The exact form of inverse is not available, so has to be evaluated numerically for each pair of variables leading to Rpointwise.

When method = "original", the inversion is done numerically by solving $$minimize_r (F(r) - \hat \tau)^2$$ using optimize. The parameter tol is used to control the accuracy of the solution.

When method = "approx", the inversion is done approximately by interpolating previously calculated and stored values of \(F^{-1}(\hat \tau)\). This is significantly faster than the original method (Yoon et al., 2021) for binary/ternary/truncated cases, however the approximation errors may be non-negligible on some regions of the space. The parameter ratio controls the region where the interpolation is performed with default recommended value of 0.9 giving a good balance of accuracy and computational speed . Increasing the value of ratio may improve speed (but possibly sacrifice the accuracy), whereas decreasing the value of ratio my improve accuracy (but possibly sacrifice the speed). See Yoon et al. 2021 and vignette for more details.

In case the pointwise estimator Rpointwise is has negative eigenvalues, it is projected onto the space of positive semi-definite matrices using nearPD. The parameter nu further allows to perform additional shrinkage towards identity matrix (desirable in cases where the number of variables p is very large) as $$R = (1 - \nu) \tilde R + \nu I,$$ where \(\tilde R\) is Rpointwise after projection by nearPD.

References

Fan J., Liu H., Ning Y. and Zou H. (2017) "High dimensional semiparametric latent graphical model for mixed data" doi:10.1111/rssb.12168 .

Yoon G., Carroll R.J. and Gaynanova I. (2020) "Sparse semiparametric canonical correlation analysis for data of mixed types" doi:10.1093/biomet/asaa007 .

Yoon G., Müller C.L., Gaynanova I. (2021) "Fast computation of latent correlations" doi:10.1080/10618600.2021.1882468 .

Examples

# Example 1 - truncated data type, same type for all variables
# Generate data
X = gen_data(n = 300, types = rep("tru", 5))$X
# Estimate latent correlation matrix with original method and check the timing
start_time = proc.time()
R_org = latentcor(X = X, types = "tru", method = "original")$R
proc.time() - start_time
#>    user  system elapsed 
#>    0.52    0.00    0.52 
# Estimate latent correlation matrix with approximation method and check the timing
start_time = proc.time()
R_approx = latentcor(X = X, types = "tru", method = "approx")$R
proc.time() - start_time
#>    user  system elapsed 
#>    0.01    0.00    0.01 
# Heatmap for latent correlation matrix.
Heatmap_R_approx = latentcor(X = X, types = "tru", method = "approx",
                             showplot = TRUE)$plotR
#> Warning: `gather_()` was deprecated in tidyr 1.2.0.
#> Please use `gather()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

# Example 2 - ternary/continuous case
X = gen_data()$X
# Estimate latent correlation matrix with original method
R_nc_org = latentcor(X = X, types = c("ter", "con"), method = "original")$R
# Estimate latent correlation matrix with aprroximation method
R_nc_approx = latentcor(X = X, types = c("ter", "con"), method = "approx")$R