Generates data of mixed types from the latent Gaussian copula model.

gen_data(
  n = 100,
  types = c("ter", "con"),
  rhos = 0.5,
  copulas = "no",
  XP = NULL,
  showplot = FALSE
)

Arguments

n

A positive integer indicating the sample size. The default value is 100.

types

A character vector giving the type of each variable: "con" (continuous), "bin" (binary), "tru" (truncated) or "ter" (ternary). The number of variables is the length of types, that is, p = length(types). The default value is c("ter", "con"), which creates two variables: the first is ternary and the second is continuous.

rhos

A vector of the lower-triangular elements of the desired correlation matrix, listed column by column, e.g. rhos = c(.3, .5, .7) means the correlation matrix is matrix(c(1, .3, .5, .3, 1, .7, .5, .7, 1), 3, 3). If a scalar is supplied (length(rhos) = 1), an equi-correlation matrix is assumed, with all pairwise correlations equal to rhos. The default value is 0.5, which means the correlation between any two variables is 0.5.
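
To make the ordering of rhos concrete, the following base R sketch rebuilds the correlation matrix from the example above. It is an illustration only, not the package's internal code; the names p, Sigma and Sigma_eq are hypothetical.

```r
# Illustrative sketch (base R only); not taken from the gen_data source.
rhos <- c(.3, .5, .7)
p <- 3

# Start from the identity and fill the lower triangle column by column.
Sigma <- diag(p)
Sigma[lower.tri(Sigma)] <- rhos

# Mirror the lower triangle into the upper triangle to make Sigma symmetric.
Sigma[upper.tri(Sigma)] <- t(Sigma)[upper.tri(Sigma)]

# Scalar case: every off-diagonal entry equals the single value supplied.
Sigma_eq <- matrix(0.5, p, p)
diag(Sigma_eq) <- 1
```

Here Sigma[2, 1], Sigma[3, 1] and Sigma[3, 2] are .3, .5 and .7 respectively, which matches the matrix(...) call quoted in the description.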

copulas

A vector indicating the copula transformation f for each of the p variables, i.e. U = f(Z). Each element can be "no" (f is the identity), "expo" (exponential transformation) or "cube" (cubic transformation). If the vector has length 1, the same transformation is applied to all p variables. The default value is "no": no copula transformation for any of the variables.
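
As a rough illustration of the three options, the sketch below applies each transformation to a latent vector. It assumes that "expo" means f(z) = exp(z) and "cube" means f(z) = z^3, which is the natural reading of the description but is not confirmed here; apply_copula is a hypothetical helper, not a function exported by the package.

```r
# Illustrative sketch; apply_copula is a hypothetical helper.
# Assumed forms: "no" -> z, "expo" -> exp(z), "cube" -> z^3.
apply_copula <- function(z, copula = c("no", "expo", "cube")) {
  copula <- match.arg(copula)
  switch(copula,
         no   = z,       # identity: latent values are left unchanged
         expo = exp(z),  # strictly positive output
         cube = z^3)     # monotone, preserves sign
}

z <- c(-1, 0, 2)
u_no   <- apply_copula(z, "no")
u_expo <- apply_copula(z, "expo")
u_cube <- apply_copula(z, "cube")
```

All three maps are strictly increasing, so they change the marginal distribution of a variable without changing the ordering of its latent values.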

XP

A list of length p giving the proportion of zeros (for binary and truncated variables) or the proportions of zeros and ones (for ternary variables) for each variable. For a continuous variable, NA should be supplied. If NULL, the elements of XP are generated automatically according to data type:

  • continuous: NA.

  • binary or truncated: a number between 0 and 1 giving the proportion of zeros; the default value is 0.5.

  • ternary: a pair of numbers between 0 and 1, where the first is the proportion of zeros and the second is the proportion of ones. The sum of the pair should be between 0 and 1; the default value is c(0.3, 0.5).
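
One way to picture how such proportions translate into observed values is to cut a continuous latent variable at its empirical quantiles. The sketch below does this in base R for a binary, a ternary and a truncated variable. It is an assumption about the mechanism made for illustration, not a copy of the package's internals, and all variable names are hypothetical.

```r
# Illustrative sketch (base R only); the thresholding rule is an assumption.
set.seed(1)
z <- rnorm(100)  # latent continuous variable

# Binary with 30% zeros: values at or below the 0.3 quantile become 0.
cut_bin <- unname(quantile(z, 0.3))
x_bin <- as.numeric(z > cut_bin)

# Ternary with 20% zeros and 40% ones: two cut points, at 0.2 and 0.2 + 0.4.
cuts_ter <- unname(quantile(z, c(0.2, 0.2 + 0.4)))
x_ter <- as.numeric(z > cuts_ter[1]) + as.numeric(z > cuts_ter[2])

# Truncated with 50% zeros: values at or below the median become 0,
# and the rest are shifted so that they stay positive.
cut_tru <- unname(quantile(z, 0.5))
x_tru <- ifelse(z > cut_tru, z - cut_tru, 0)
```

Because the cut points are empirical quantiles of the sample itself, the resulting counts match the requested proportions exactly for n = 100, rather than only on average.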

showplot

Logical indicator. If TRUE, a plot of the data is generated, provided the number of variables p is no more than 3. The default value is FALSE.

Value

gen_data returns a list containing

  • X: Generated data matrix (n by p) of observed variables.

  • plotX: Visualization of the data matrix X: a histogram if p = 1, a 2D scatter plot if p = 2, or a 3D scatter plot if p = 3. NULL if showplot = FALSE.

References

Fan J., Liu H., Ning Y. and Zou H. (2017) "High dimensional semiparametric latent graphical model for mixed data" doi:10.1111/rssb.12168.

Yoon G., Carroll R.J. and Gaynanova I. (2020) "Sparse semiparametric canonical correlation analysis for data of mixed types" doi:10.1093/biomet/asaa007.

Examples

# Generate a single continuous variable with exponential transformation (always greater than 0).
# With showplot = FALSE, plotX is NULL; set showplot = TRUE to get the histogram.
simdata = gen_data(n = 100, copulas = "expo", types = "con", showplot = FALSE)
X = simdata$X; plotX = simdata$plotX
# Generate a pair of variables (ternary and continuous) with default proportions
# and without copula transformation.
simdata = gen_data()
X = simdata$X
# Generate 3 variables (binary, ternary and truncated).
# The corresponding copulas for the variables are "no" (no transformation),
# "cube" (cubic transformation) and "cube" (cubic transformation).
# The binary variable has 30% zeros, the ternary variable has 20% zeros
# and 40% ones, and the truncated variable has 50% zeros.
# Then show the 3D scatter plot (data points project onto 0 or 1 on axis X1;
# onto 0, 1 or 2 on axis X2; onto the nonnegative domain on axis X3).
simdata = gen_data(n = 100, rhos = c(.3, .4, .5), copulas = c("no", "cube", "cube"),
          types = c("bin", "ter", "tru"), XP = list(.3, c(.2, .4), .5), showplot = TRUE)
X = simdata$X; plotX = simdata$plotX
# Count the zeros of the binary variable (n = 100, so the count equals the percentage).
sum(simdata$X[ , 1] == 0)
#> [1] 30
# Count the zeros and ones of the ternary variable.
sum(simdata$X[ , 2] == 0); sum(simdata$X[ , 2] == 1)
#> [1] 20
#> [1] 40
# Count the zeros of the truncated variable.
sum(simdata$X[ , 3] == 0)
#> [1] 50