Factorial Methods: Overview

This article is Part 1 in a 3-Part series called Factorial Analysis Intro.

Introduction

Background

For some time now, I have been interested in principal component analysis (PCA) as a technique for spotting patterns in data and determining which variables are most useful for data modelling and prediction. However, as with all techniques, PCA has a context and usage limitations that govern its use, and a wise user must come to terms with these in order to avoid misapplication. Remember folks… “garbage in, garbage out” :smile:. This encouraged me to dig deeper to understand where PCA fits in the grand scheme of things, which led me to the discovery of related techniques such as multiple correspondence analysis (MCA) and factorial analysis of mixed data (FAMD). These techniques seem to belong to the same “family” of methods, so the aim of this series of posts is to understand how they are properly applied.

Rationale

The broad purpose of this series of posts is to give me a good general understanding of factorial analysis techniques in a simple, useful way that enables me to use them properly. As such, this is a conceptual overview that does not cover the specific implementation details of these techniques, partly because I am not (presently) well versed in linear algebra, a subject that I would like to tackle in future. I first stumbled upon this approach in the Coursera regression models class that I took as part of their Data Science specialisation. My present plan is to learn factorial methods in the same way: by mastering the essential concepts relating to the selection, application and interpretation of these methods without needing to understand their derivation. Therefore, the overall aim of my study strategy is to use these techniques effectively before learning to implement them.

That said, I found a potentially useful Wikibook that I hope will enable me to learn this material in time. While I am at it, I also want to recover and improve upon the calculus and trigonometry that I essentially left behind in high school. I want to go deeper into these, but prudence and time dictate that I understand how and when to use the existing implementations of these methods before I learn their technical derivation.

Importantly, I don’t think that I have a perfect understanding of these techniques and naturally, I will correct my understanding as errors come to light in future.

Purpose Overview

a) Motivation

The general purpose of the techniques described in this post is to transform a dataset consisting of n observations (rows) and p variables (columns) into a new set of at most p variables, which I will generally refer to as principal components (PCs). Some of the characteristics of these PCs are that they are:

  • linear combinations of the original variables
  • linearly uncorrelated with each other
  • ranked according to the amount of variability that each PC captures from the original dataset.
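These three properties can be checked numerically. The sketch below is an illustrative toy example of my own (not taken from any of the packages discussed later), computing PCs via an eigendecomposition of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy dataset: 100 observations (rows) of 3 quantitative variables (columns)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # induce correlation

Xc = X - X.mean(axis=0)                  # mean-centre each variable
cov = np.cov(Xc, rowvar=False)           # covariance matrix of the variables
eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition
order = np.argsort(eigvals)[::-1]        # rank PCs by variance captured
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# each PC score is a linear combination of the original variables
scores = Xc @ eigvecs
# the PCs are linearly uncorrelated: their covariance matrix is diagonal
pc_cov = np.cov(scores, rowvar=False)
```

The off-diagonal entries of `pc_cov` come out as (numerically) zero, and the eigenvalues arrive ranked from most to least variance captured.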

b) Dimension reduction

This has the important effect of reducing the number of variables required to explain the variations observed in the dataset while retaining the maximum amount of information about the data, a process called dimension reduction.
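As an illustrative sketch of dimension reduction (again a toy example of my own), a common way to decide how many PCs to keep is to look at the cumulative share of variance they explain:

```python
import numpy as np

rng = np.random.default_rng(1)
# 200 observations of 5 variables driven by only 2 underlying sources
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

Xc = X - X.mean(axis=0)
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

explained = eigvals / eigvals.sum()       # share of total variance per PC
cum = np.cumsum(explained)
k = int(np.searchsorted(cum, 0.99) + 1)   # PCs needed to retain 99% of variance
```

Here only two PCs are needed to retain essentially all the variance of the five measured variables, because only two underlying sources generated the data.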

c) Elucidation of hidden influences

Another potentially useful feature of this process is that examining the composition of the PCs, particularly the most informative ones, may reveal the underlying processes or variables that may ultimately be responsible for the patterns observed in the measured variables.

I suspect that this would be possible by looking at which combinations of variables contribute substantially to the PCs that capture the most information (as variance) about the dataset. I think that this is a similar idea to that of latent variables. The exploration of such latent variables is a goal of Factor Analysis, which is a related (but not interchangeable) set of techniques.

This kind of analysis could potentially identify groups or categories of observations that can be suitably labelled according to the results of the analysis. Incidentally, an alternative approach is the use of clustering techniques.

Method Overview

This section gives a broad overview of the classical factorial analyses as I understand them. This is intended to provide a simple comparative overview of when and how to use these techniques. Note that the “Analysis scope” section of each technique is an extension of the general scope outlined above.

a) Principal component analysis (PCA)

  • Analysis scope:

    PCA is computed using only quantitative (continuous numerical) variables, and is NOT suitable for qualitative (categorical) variables. As a rough guide, this means that the data must make logical and practical sense under the numerical transformations required for data preparation.

  • Input data:
    • a dataset comprised of individuals (observations or rows) described by a set of quantitative variables (columns).
  • Data prep: Raw data must be prepared to avoid distorted results due to the effect of different magnitudes and/or units of the input variables on their contribution to the variance of the principal components (see here also).
    • mean-centring (or mean-centering): subtract the mean of each variable from the respective data points of that variable.
    • standard normalisation: divide the mean-centred variables by their respective standard deviations.
  • Applications:
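The two preparation steps above can be sketched numerically (an illustrative example of my own):

```python
import numpy as np

rng = np.random.default_rng(2)
# two variables on very different scales (e.g. metres vs millimetres)
X = np.column_stack([rng.normal(10, 1, 50), rng.normal(0, 1000, 50)])

centred = X - X.mean(axis=0)                     # mean-centring
standardised = centred / X.std(axis=0, ddof=1)   # standard normalisation
```

Whether to stop at mean-centring or go on to full standardisation depends on whether the variables share comparable units; here the second variable would otherwise dominate the variance by magnitude alone.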

b) Multiple correspondence analysis (MCA)

  • Analysis scope:

    MCA can be thought of as PCA for qualitative variables and therefore cannot handle quantitative variables in its computation.

  • Input data:
    • a dataset comprised of individuals described by a set of qualitative variables.
  • Data prep:
    • Presumably one needs to ensure that the category labels of the qualitative variables are appropriately standardised for analysis, i.e. avoid typos as they may be construed as a separate category.
  • Applications:
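To illustrate the typo point above, here is a small sketch (my own example) of the complete disjunctive (indicator) coding that MCA implementations typically build from qualitative data; a misspelt label simply becomes an extra category:

```python
import pandas as pd

# toy qualitative data; "Femle" is a typo for "Female"
df = pd.DataFrame({"sex": ["Male", "Female", "Femle"],
                   "smoker": ["Yes", "No", "No"]})

# indicator (complete disjunctive) coding: one 0/1 column per category level
indicator = pd.get_dummies(df)
# the typo yields three "sex" categories instead of two,
# silently distorting the analysis
```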

c) Factorial analysis of mixed data (FAMD)

  • Analysis scope:

    FAMD can be applied to the analysis of datasets described by a mix of quantitative and qualitative variables. I understand this to be a combination of the functionality of PCA and MCA. A general use case is described in the scope section of the FAMD Wikipedia page.

  • Input data:
    • a dataset comprised of individuals (observations) described by combinations of quantitative and qualitative variables.
  • Data prep:
    • quantitative variables: as per PCA
    • qualitative variables: as per MCA
  • Applications:

d) Multiple factor analysis (MFA)

  • Analysis scope:

    From what I understand, MFA is basically FAMD applied to datasets described by groups of quantitative and/or qualitative variables. The analysis is performed such that variables from one group don’t dominate the analysis by having an unreasonably large influence over the result. Larger groups of variables potentially account for more of the variability within the dataset by virtue of their size alone, and thus may need to be appropriately weighted to remove this potential bias.

  • Input data:
    • Essentially as per FAMD
  • Data prep:
    • Essentially as per FAMD
  • Applications:

FactoMineR Implementation

The page for the FactoMineR R package, which specialises in factorial analyses, goes into more detail about what it refers to as classical and advanced methods, providing a nice summary of the different techniques that are implemented and their use case scenarios. I have taken the liberty of summarising the various methods implemented by FactoMineR to get a feel for what is available, for the simple reason that I will likely be using this package in combination with factoextra in order to perform and visualise factorial analyses. The four analyses summarised above are highlighted below for clarity, as I expect to focus on the application of these methods before exploring others.

| Groups of Observations | Groups of Variables | Variable Type | Technique |
| --- | --- | --- | --- |
| one | one | quant | **PCA** |
| one | one | quant | HCPC |
| one | one | qual | CA |
| one | one | qual | **MCA** |
| one | one | mixed | **FAMD** |
| one | many | mixed | **MFA** |
| one | many | mixed | HMFA |
| one | many | quant | GPA |
| many | one | quant | Dual MFA |

One thing to note: at the bottom of the advanced methods page, the description of FAMD (emphasis mine) is:

“When one set of individuals is described by one set of variables that may be continuous and/or categorical, the analysis proposed is a particular case of MFA called Factor Analysis of Mixed Data.”

Note: To me, this contradicts the description of MFA on Wikipedia and also seems to contradict the description of this method in the context of what they have implemented (see table above). Admittedly, this is my summary of their information. However, the main point is that I need to take care when selecting, using and interpreting the output of any analysis method implementation. This is critical to ensure that I comply with the requirements of both the factorial method under consideration and the specific implementation of said method.

Conclusion

This post has been a rewarding one to compile :smile:. I now feel that I have a much better handle on how to use PCA and related techniques. I am now in a position to get my hands dirty with some of these techniques and understand:

  • their input parameters
  • data preparation requirements
  • comprehensive results interpretation

Onward to greater things :smile:


Written on April 1, 2017