Generalized PCA

generalizedPCA is an R package that extends principal component analysis to other types of data, much like generalized linear models extend linear regression. The package logisticPCA contains the extension to binary data, among other methods, and this package intends to generalize it to all exponential family distributions. Please note that it is still in the very early stages of development and its conventions may change in the future.

Installation

To install R, visit r-project.org/.

To install the package, first install devtools from CRAN. Then run the following commands.

# install.packages("devtools")
library("devtools")
install_github("andland/generalizedPCA")

Use

The main function is generalizedPCA(). As in generalized linear models, you must specify the distribution of your data. generalizedPCA() currently supports "gaussian", "binomial", "poisson", and "multinomial" data. Unlike standard PCA, it can incorporate weights and missing data. If your data are proportions, you can use family = "binomial" with weights set to a matrix of the number of opportunities. If your data are a multinomial variable with d levels, the input matrix should have d - 1 columns.

The function returns mu, the variable main effects vector of length d, and U, the d x k loadings matrix.
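As a minimal sketch of a call, the example below fits a two-component model to simulated count data. The data are made up for illustration, and the argument names k, M, and family are assumed from the package documentation; since the conventions may change, check ?generalizedPCA before relying on them. The fit only runs if the package is installed.

```r
# Simulated count data: 100 rows, 10 columns (illustrative only).
set.seed(1)
n <- 100
d <- 10
x <- matrix(rpois(n * d, lambda = 3), nrow = n, ncol = d)

# Argument names (k, M, family) are assumptions; see ?generalizedPCA.
if (requireNamespace("generalizedPCA", quietly = TRUE)) {
  fit <- generalizedPCA::generalizedPCA(x, k = 2, M = 4, family = "poisson")
  length(fit$mu)  # main effects, one per column of x
  dim(fit$U)      # loadings, d x k
}
```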

Details

generalizedPCA() estimates the natural parameters of an exponential family distribution in a lower dimensional space. This is done by projecting the natural parameters from the saturated model. The package solves for the rank-k projection matrix, or equivalently the d x k matrix with orthonormal columns, that minimizes the deviance.
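In symbols, the projection can be written roughly as follows. The notation is introduced here for illustration and follows the logisticPCA formulation: X is the n x d data matrix, \tilde{\Theta} is the n x d matrix of saturated-model natural parameters, \mathbf{1}_n is a vector of n ones, and D is the deviance of the chosen family.

```latex
\hat{\Theta} = \mathbf{1}_n \mu^{T} + \left(\tilde{\Theta} - \mathbf{1}_n \mu^{T}\right) U U^{T},
\qquad
\min_{\mu,\; U \,:\, U^{T} U = I_k} \; D\!\left(X; \hat{\Theta}\right)
```

Here mu is the main effects vector of length d and U is the d x k loadings matrix, so U U^T is the rank-k projection matrix.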

For some distributions, the natural parameters from the saturated model are either negative or positive infinity, and an additional tuning parameter M is needed to approximate them. This occurs when family = "binomial" and your data include 0's or 1's or when family = "poisson" and your data include 0's. You can use cv.gpca() to select M by cross validation. Typical values are in the range of 3 to 10.
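The role of M can be sketched in base R. The helper saturated_theta() below is hypothetical and is not the package's internal code; it assumes that infinite saturated natural parameters are simply replaced by -M or +M.

```r
# Hypothetical helper (not part of generalizedPCA): saturated natural
# parameters, with infinite values replaced by -M or +M.
saturated_theta <- function(x, family = c("binomial", "poisson"), M = 4) {
  family <- match.arg(family)
  theta <- switch(family,
    binomial = log(x / (1 - x)),  # logit: -Inf at 0, +Inf at 1
    poisson  = log(x)             # log:   -Inf at 0
  )
  theta[theta == -Inf] <- -M
  theta[theta == Inf] <- M
  theta
}

saturated_theta(c(0, 0.5, 1), family = "binomial", M = 4)  # -4, 0, 4
saturated_theta(c(0, 1, 3), family = "poisson", M = 4)     # -4, 0, log(3)
```

Larger values of M follow the saturated model more closely at the boundary values, which is why M is treated as a tuning parameter.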

A manuscript describing generalized PCA applied to binary data can be found here.

Methods

The generalizedPCA class, gpca, has several methods to make data analysis easier.

  • print(): Prints a summary of the fitted model.
  • fitted(): Returns the fitted low dimensional matrix of either natural parameters or response.
  • predict(): Predicts the PCs on new data. Can also predict the low dimensional matrix of natural parameters or response on new data.
  • plot(): Plots either the deviance trace by the number of iterations, the first two PC loadings, or the first two PC scores using the package ggplot2.
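The methods above might be used as in the sketch below, on simulated binary data. The type values ("link", "response", "PCs", "trace") are assumed to mirror those in logisticPCA and are not confirmed for this package; check the help pages before relying on them. The fit only runs if the package is installed, and plot() additionally needs ggplot2.

```r
# Simulated binary data split into training and new rows (illustrative only).
set.seed(2)
x <- matrix(rbinom(200 * 8, size = 1, prob = 0.3), nrow = 200, ncol = 8)
train <- x[1:150, ]
test <- x[151:200, ]

if (requireNamespace("generalizedPCA", quietly = TRUE)) {
  library(generalizedPCA)
  fit <- generalizedPCA(train, k = 2, M = 4, family = "binomial")

  print(fit)                                    # summary of the fitted model
  theta_hat <- fitted(fit, type = "link")       # assumed: natural parameters
  p_hat <- fitted(fit, type = "response")       # assumed: response scale
  pcs_new <- predict(fit, newdata = test, type = "PCs")  # assumed: PC scores
  plot(fit, type = "trace")                     # assumed: deviance by iteration
}
```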

In addition, there are functions for performing cross-validation.

  • cv.gpca(): Runs cross-validation over the rows of the matrix to assess the fit of M and/or k.
  • plot.cv(): Plots the results of the cv.gpca() function using the package ggplot2.
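A cross-validation run might look like the sketch below. The argument names ks, Ms, and folds are assumed by analogy with logisticPCA and may differ in this package, and calling plot() on the result is assumed to dispatch to the plotting method listed above. The code only runs if the package is installed.

```r
# Simulated count data with zeros, so M matters (illustrative only).
set.seed(3)
x <- matrix(rpois(100 * 6, lambda = 2), nrow = 100, ncol = 6)

if (requireNamespace("generalizedPCA", quietly = TRUE)) {
  library(generalizedPCA)
  # Assumed arguments: candidate ranks (ks), candidate M values (Ms), folds.
  cv_fit <- cv.gpca(x, ks = 1:3, Ms = seq(2, 10, by = 2),
                    family = "poisson", folds = 5)
  print(cv_fit)
  plot(cv_fit)  # assumed to call the cross-validation plotting method
}
```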