Sliced Inverse Regression

Sometimes, you read something unexpected. I was looking through this paper on statistical perspectives on representation learning and came across this family of methods: “sliced inverse regression.” Sliced inverse regression? What the heck? I think I have a reasonable general knowledge of statistical learning methods, so it’s exciting to come across something new. [update: do not try to find a paper when the only keywords you remember are deep learning, representations, and statistics.]

It’s a sort of supervised dimensionality reduction developed by statisticians. Suppose you’re doing a supervised regression problem in high dimensions, using \(X^m\) to predict \(y\). The idea is to reduce \(X^m\) to \(X^n\) (with \(n < m\)) while preserving as much information about \(y\) as possible. This seems sensible, right? You’d usually use PCA for this or some sort of stupid unsupervised dimensionality reduction. However, that is wasteful because you don’t care about fully preserving the information in \(X^m\); you only need the info relevant to \(y\).
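To see the failure mode concretely, here is a tiny toy demo (mine, not from any of the papers): the direction that actually drives \(y\) has low variance in \(X\), so PCA throws it away first.

```python
# Toy demo: PCA keeps the high-variance nuisance direction and drops the
# low-variance direction that actually determines y.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 1000
x_signal = rng.normal(scale=0.5, size=n)   # low variance, drives y
x_noise = rng.normal(scale=5.0, size=n)    # high variance, irrelevant to y
X = np.column_stack([x_signal, x_noise])
y = np.sin(x_signal) + 0.1 * rng.normal(size=n)

pca = PCA(n_components=1).fit(X)
print(pca.components_)  # roughly [0, 1]: the retained component is the noise axis,
                        # so projecting onto it discards essentially all info about y
```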

There are other things you can do here, e.g. principal components regression; you can even do this with kernel PCA + kernel regression, for which machine learning gives the name spectral regression. According to Wikipedia, people have been doing principal components regression in the econometrics literature since at least the 1970s. Still, if you’ve tried to read the econometrics literature, you will understand why it had to be rediscovered.
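For reference, principal components regression is only a few lines with scikit-learn. A minimal sketch, with an arbitrary choice of component count; swap in KernelPCA plus kernel ridge for the kernelised version.

```python
# Principal components regression: unsupervised PCA, then a linear model on
# the retained components.
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=20, noise=1.0, random_state=0)
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))  # R^2 on the training data
```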

Here’s the OG paper (one thing I do like about some of these JASA papers is that they have invited commentaries). There’s a Python package, Sliced. There is not a heap going on, but he’s implemented two variations on the algorithm, SIR and SAVE, both of which are described below.

Also there’s a tutorial which might give you a better sense of what’s going on.
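A rough usage sketch of the package, going from memory, so treat the class and parameter names as assumptions and check the package docs; the point is that it follows the usual scikit-learn fit/transform pattern.

```python
# Sketch only: names are from memory and may not match the current package.
import numpy as np
from sliced import SlicedInverseRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

sir = SlicedInverseRegression(n_directions=2)  # number of directions to keep
sir.fit(X, y)
X_reduced = sir.transform(X)  # 500 x 2 projection onto the estimated directions
```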

The SIR regression model assumes the following setup:

$$ y = f(b_1 \cdot x, b_2 \cdot x, \ldots, b_k\cdot x, \epsilon) $$

Here, we have unknown row vectors \(b = (b_1, b_2, \ldots, b_k)\), \(\epsilon\) is independent of \(x\), and \(f\) is an arbitrary unknown function on \(R^{k+1}\). If the assumptions hold, we have captured all the information we need to know about \(y\) by projecting the \(p\)-dimensional explanatory variable \(x\) down to the \(k\)-dimensional vector \((b_1 \cdot x, \ldots, b_k \cdot x)\). We can reduce the data by efficiently estimating the \(b\)s when \(k\) is small. Any linear combination of the \(b\)s is called a reduced direction, and the linear space \(B\) generated by the \(b\)s is called the reduced space for convenience.
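To make the setup concrete, here is a toy data-generating process of exactly this form with \(k = 2\); the particular \(f\) and \(b\)s are my own arbitrary choices.

```python
# y depends on the 10-dimensional x only through the two projections b1.x and
# b2.x, plus independent noise.
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 10
X = rng.normal(size=(n, p))
b1 = np.r_[1.0, np.zeros(p - 1)]       # first true direction
b2 = np.r_[0.0, 1.0, np.zeros(p - 2)]  # second true direction
eps = 0.1 * rng.normal(size=n)
y = (X @ b1) / (0.5 + (X @ b2 + 1.5) ** 2) + eps
```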

The sliced inverse regression (SIR) algorithm estimates the reduced directions using inverse regression. We start by standardizing the input variables \(x\) and getting a piecewise estimate of the inverse regression curve \(E(x | y)\); since \(y\) is one-dimensional, this is a collection of simple one-dimensional regressions rather than one big multivariate problem. This is done using the slice means of \(x\): the data is partitioned into several slices based on the value of \(y\) (like literally equally sized slices), and then the mean of \(x\) is computed for each slice. Then, we apply PCA to these slice means of \(x\), use some distributional assumptions to discard components, and tada: we have an approximation to the \(k\)-dimensional subspace that tracks the inverse regression curve \(E(x | y)\). Finally, we get the SIR output by transforming the identified components back to the original scale with an affine retransformation.
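Here is roughly what that looks like in numpy, reusing the toy \(X\) and \(y\) from the snippet above. This is a bare-bones sketch of the steps as I understand them (fixed numbers of slices and directions, no significance test for choosing \(k\)), not the sliced package’s implementation.

```python
# Bare-bones SIR: standardize x, slice on y, PCA on the slice means,
# then map the directions back to the original scale.
import numpy as np

def sir_directions(X, y, n_slices=10, k=2):
    n, p = X.shape

    # 1. standardize x (z = cov^{-1/2} (x - mean))
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - X.mean(axis=0)) @ inv_sqrt

    # 2. partition into (roughly) equal-sized slices by the value of y
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)

    # 3. weighted covariance matrix of the slice means of z
    M = np.zeros((p, p))
    for idx in slices:
        m_h = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m_h, m_h)

    # 4. "PCA on the slice means": top-k eigenvectors of M
    vals, vecs = np.linalg.eigh(M)
    eta = vecs[:, np.argsort(vals)[::-1][:k]]

    # 5. affine retransformation back to the original scale
    return inv_sqrt @ eta  # columns span the estimated reduced space

B_hat = sir_directions(X, y)
print(B_hat[:3])  # loads mostly on the first two coordinates (up to sign/scale)
```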

In the Sliced Average Variance Estimation (SAVE) method, we again divide the data into equal-sized slices, but this time we look at the within-slice variance of the standardized \(x\) (call it \(z\)) rather than the within-slice mean, and use the eigenvectors of the resulting matrix corresponding to the larger eigenvalues to estimate the reduced subspace. This is done using the formula:

$$ SAVE = \sum_h \left(I - var(z | y \in I_h)\right)^2 $$

Here, \(I\) is the identity matrix and \(I_h\) is the \(h\)-th slice. Cook and Weisberg make a few other important points regarding SAVE and its comparison with SIR. SIR is helpful in picking up on any dependence of \(y\) on \(x\) through \(E(y | x)\), \(var(y | x)\), or any other moment. However, SIR has problems with symmetry, which SAVE does not have. Correspondingly, SAVE has problems with linearity, whereas SIR does not. Both methods yield practical graphical diagnostic information. Go read the Cook and Weisberg 1991 companion piece, which has an excellent discussion. Also, there’s an R package, dr, which probably has more up-to-date references to the literature.
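And a matching SAVE sketch under the same assumptions as the SIR snippet above (again a from-scratch toy, not the sliced or dr implementations): identical standardizing and slicing, but the kernel matrix is built from the within-slice covariances instead of the slice means.

```python
# Bare-bones SAVE: same standardization and slicing as SIR, but accumulate
# (I - within-slice covariance of z)^2 instead of slice-mean outer products.
import numpy as np

def save_directions(X, y, n_slices=10, k=2):
    n, p = X.shape
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - X.mean(axis=0)) @ inv_sqrt

    order = np.argsort(y)
    slices = np.array_split(order, n_slices)

    M = np.zeros((p, p))
    for idx in slices:
        D = np.eye(p) - np.cov(Z[idx], rowvar=False)
        M += (len(idx) / n) * (D @ D)  # squared so slices cannot cancel out

    vals, vecs = np.linalg.eigh(M)
    eta = vecs[:, np.argsort(vals)[::-1][:k]]
    return inv_sqrt @ eta

B_save = save_directions(X, y)
```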

An addendum: I then thought lightning was striking twice because I came across the Reverse Cuthill-McKee algorithm. However, this algorithm is much less attractive because it uses a breadth-first search on the graph of the sparse matrix to approximately solve the (NP-complete) bandwidth reduction problem.
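For completeness, scipy ships this as scipy.sparse.csgraph.reverse_cuthill_mckee, so you can watch it squash a random sparse matrix towards the diagonal:

```python
# Reorder a random symmetric sparse matrix with reverse Cuthill-McKee; the
# permuted matrix has its nonzeros pulled in towards the diagonal.
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sparse_random(50, 50, density=0.05, random_state=0, format="csr")
A = (A + A.T).tocsr()  # symmetrize so bandwidth is well defined

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm][:, perm]

def bandwidth(m):
    rows, cols = m.nonzero()
    return int(np.max(np.abs(rows - cols)))

print(bandwidth(A), bandwidth(A_rcm))  # bandwidth should drop after reordering
```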