softImpute Vignette

softImpute is a package for matrix completion using nuclear norm regularization. It offers two algorithms:

One iteratively computes the soft-thresholded SVD of a filled in matrix - an algorithm described in Mazumder et al (2010). This is option type="svd" in the call to softImpute().
The other uses alternating ridge regression, at each stage filling in the missing entries with the latest estimates. This we believe is the faster option, and is option type="als" in the call to softImpute.

The package can deal with both small and very large matrices; the latter are stored in sparse-matrix format using a new S4 class "Incomplete". For example, softImpute can happily fit a rank 100 SVD to the netflix data (480,189 x 17,770, 99% missing) using a machine with about 32Gb of memory. For smaller matrices with missing data, the usual full matrix with NA suffices.

The package has two other notable features. It can compute a (low-rank) SVD of a large sparse matrix, stored in sparse matrix format. This can be done either with nuclear-norm regularization or not, the former being somewhat faster.

There is a function biScale() that generalizes the scale() function in R. It can simultaneously scale a matrix to have row and column means zero, and row and column variances one. Of course, any subset of these can be chosen as well. This function can deal with the missing values, if present. It is also smart enough to deal with large sparse matrices. In this case, it does not actually apply the centering operation and destroy the sparsity; instead it stores the resulting matrix in the new matrix class "SparseplusLowRank". This is a special class used by softImpute, originally created to allow it to avoid ever storing a large, filled in matrix, when the original matrix was stored in sparse matrix format via class "Incomplete".

This vignette is a simple guide to using the package.

What softImpute solves

Here we briefly describe the problem solved. Suppose $X$ is a large $m\times n$ matrix, with many missing entries. Let $\Omega$ contain the pairs of indices $(i,j)$ where $X$ is observed, and let $P_\Omega(X)$ denote a matrix with the entries in $\Omega$ left alone, and all other entries set to zero. So when $X$ has missing entries in $\Omega^\perp$, $P_\Omega(X)$ would set the missing values to zero.

If $\widehat M$ solves this convex-cone problem, then it satisfies the following stationarity condition: \[ {\widehat M}=S_\lambda(Z)\] where \[Z=P_\Omega(X)+P_{\Omega^\perp}(\widehat M).\] Hence $Z$ is the “filled in”" version of $X$. The operator $S_\lambda(Z)$ applied to matrix $Z$ does the following:

Compute the SVD of $Z=UDV^T$, and let $d_i$ be the singular values in $D$.
Soft-threshold the singular values: $d_i^*= (d_i-\lambda)_+$.
Reconstruct: $S_\lambda(Z)=UD^*V^T$. We call this operation the “soft-thresholded SVD”. Notice that for sufficiently large $\lambda$, $D^*$ will be rank-reduced, and hence so will be $UD^*V^T$.

This suggests the obvious iterative algorithm: using the current estimate for $M$, create $Z$, and update $M$ by the soft-thresholded SVD of $Z$.

This is exactly what softImpute does on (small) matrices with missing values stored as NAs. By small we mean small enough that the SVD can be computed by R in a small amount of time.

This is not tenable for very large matrices, like those stored as class "Incomplete". Here we make two very important changes to the recipe:

Re-express $Z$ at each iteration as as \[Z=P_\Omega(X)-P_\Omega(\widehat M) + \widehat M.\] This is of the form "SparseplusLowRank" (assuming $\widehat M$ is low rank), and hence can be stored. Left and right matrix multiplies by skinny matrices can also be efficiently performed.
Anticipating that the solution $\widehat M$ will have low rank, compute only a low-rank SVD of $Z$, using alternating subspace methods.

Indeed, softImpute has a rank argument that allows one to limit the rank of the solution; if the algorithm returns a solution with rank lower than the specified rank $r$, you know it has solved the unrestricted problem.

Consider the alternative criterion \[\min_{A,B}\frac12\|P_\Omega(X)-P_\Omega(AB^T)\|^2_F+\frac{\lambda}2(\|A\|_F^2 +\|B\|_F^2),\] where $A$ and $B$ have each $r$ columns, and let us suppose that $r$ is bigger than or equal to the solution rank of the earlier criterion. This problem is not convex, but remarkably, it has a solution that satisfies ${\widehat A}{\widehat B}^T=\widehat M$!

We can once again characterize the stationarity conditions, now in terms of $\widehat A$ and $\widehat B$. Let \[Z=P_\Omega(X)+P_{\Omega^\perp}({\widehat A}{\widehat B}^T),\] the filled in version of $X$. Then \[\widehat B= ({\widehat A}^T{\widehat A}+\lambda I)^{-1}{\widehat A}^T Z.\] We get $\widehat B$ by ridge regressions of the columns of $Z$ on $\widehat A$. For $\widehat A$ its the same, with the roles of $A$ and $B$ reversed. This again suggests an obvious alternation, now by ridged regressions. After each regression, we update the component $A$ or $B$, and the filled in $Z$. If $r$ is sufficiently large, this again solves the same problem as before.

This last algorithm (softImpute ALS) can be seen as combining the alternating subspace SVD algorithm for computing the SVD with the iterative filling in and SVD calculation. It turns out that this interweaving leads to computational savings, and allows for a very efficient distributed implementation (not covered here).

A simple example

We will start with a small and simple example. Lets generate a small matrix and make some values missing.

require(softImpute)

## Loading required package: softImpute
## Loading required package: Matrix
## Loaded softImpute 1.3

set.seed(1011)
x=matrix(rnorm(30),6,5)
x[sample(1:30,10,replace=FALSE)]=NA
x

##         [,1]     [,2]    [,3]     [,4]     [,5]
## [1,]  0.8655  0.01565  0.1748       NA       NA
## [2,] -0.6004       NA -0.2119       NA       NA
## [3,] -0.7169       NA      NA  0.06437 -0.09754
## [4,]  0.6966 -0.50332  0.5585  1.54376       NA
## [5,]  1.2312 -0.34232 -0.8103 -0.82006 -0.13257
## [6,]  0.2664  0.14486      NA       NA -2.24088

fits=softImpute(x,trace=TRUE,type="svd")

## 1 : obj 0.09323 ratio 0.03955 
## 2 : obj 0.08281 ratio 0.03044 
## 3 : obj 0.07372 ratio 0.0556 
## 4 : obj 0.05961 ratio 0.06921 
## 5 : obj 0.03804 ratio 0.03375 
## 6 : obj 0.02178 ratio 0.01187 
## 7 : obj 0.01407 ratio 0.004026 
## 8 : obj 0.01094 ratio 0.00147 
## 9 : obj 0.00965 ratio 0.0006177 
## 10 : obj 0.00906 ratio 0.0003083 
## 11 : obj 0.00874 ratio 0.0001828 
## 12 : obj 0.00854 ratio 0.000125 
## 13 : obj 0.0084 ratio 9.43e-05 
## 14 : obj 0.00829 ratio 7.553e-05 
## 15 : obj 0.0082 ratio 6.259e-05 
## 16 : obj 0.00812 ratio 5.286e-05 
## 17 : obj 0.00805 ratio 4.515e-05 
## 18 : obj 0.008 ratio 3.883e-05 
## 19 : obj 0.00795 ratio 3.356e-05 
## 20 : obj 0.00791 ratio 2.91e-05 
## 21 : obj 0.00787 ratio 2.532e-05 
## 22 : obj 0.00784 ratio 2.209e-05 
## 23 : obj 0.00781 ratio 1.932e-05 
## 24 : obj 0.00779 ratio 1.695e-05 
## 25 : obj 0.00777 ratio 1.49e-05 
## 26 : obj 0.00775 ratio 1.314e-05 
## 27 : obj 0.00773 ratio 1.162e-05 
## 28 : obj 0.00771 ratio 1.031e-05 
## 29 : obj 0.0077 ratio 9.168e-06

fits

## $u
##          [,1]    [,2]
## [1,] -0.35651 -0.2680
## [2,]  0.30892  0.1716
## [3,] -0.00601  0.3682
## [4,] -0.60584 -0.1107
## [5,]  0.05672 -0.8263
## [6,] -0.63810  0.2610
## 
## $d
## [1] 4.753 2.056
## 
## $v
##          [,1]    [,2]
## [1,] -0.21291 -0.7897
## [2,]  0.04905  0.2236
## [3,] -0.23499  0.4351
## [4,] -0.56011  0.3394
## [5,]  0.76375  0.1482
## 
## attr(,"lambda")
## [1] 0
## attr(,"call")
## softImpute(x = x, type = "svd", trace.it = TRUE)

Since this is a small matrix, it has solved it using repeated SVDs. There is no penalization here ($\lambda=0$), and by default the rank was taken to be 2. Since there is no penalization, if the rank was given to be $\min(m,n)$, then there is no restriction, and any values for the missing data would give the same minimum loss of 0. In other words, either penalization, or a rank restriction (or both) are needed for sensible imputation.

We could use ALS instead here (the default for type= argument)

fita=softImpute(x,trace=TRUE)

## 1 : obj 0.2409 ratio 5.681 
## 2 : obj 0.07848 ratio 0.1434 
## 3 : obj 0.05745 ratio 0.06019 
## 4 : obj 0.035 ratio 0.04943 
## 5 : obj 0.01739 ratio 0.02011 
## 6 : obj 0.01077 ratio 0.004749 
## 7 : obj 0.0091 ratio 0.001136 
## 8 : obj 0.00862 ratio 0.0003569 
## 9 : obj 0.00841 ratio 0.0001602 
## 10 : obj 0.00829 ratio 9.816e-05 
## 11 : obj 0.00819 ratio 7.211e-05 
## 12 : obj 0.00812 ratio 5.752e-05 
## 13 : obj 0.00805 ratio 4.752e-05 
## 14 : obj 0.008 ratio 3.987e-05 
## 15 : obj 0.00795 ratio 3.372e-05 
## 16 : obj 0.00791 ratio 2.866e-05 
## 17 : obj 0.00788 ratio 2.445e-05 
## 18 : obj 0.00785 ratio 2.092e-05 
## 19 : obj 0.00782 ratio 1.796e-05 
## 20 : obj 0.0078 ratio 1.546e-05 
## 21 : obj 0.00778 ratio 1.335e-05 
## 22 : obj 0.00777 ratio 1.158e-05 
## 23 : obj 0.00776 ratio 1.007e-05 
## 24 : obj 0.00774 ratio 8.804e-06

The objectives are different! At this point we are playing with non-convex optimization, and so the solutions can be local minima. Lets use some regularization now, choosing a value for lambda that will give a rank 2 solution (this required trial and error to get it right).

fits2=softImpute(x,rank.max=3,lambda=1.9,trace=TRUE,type="svd")

## 1 : obj 0.3263 ratio 0.004981 
## 2 : obj 0.3263 ratio 0.0005115 
## 3 : obj 0.3263 ratio 0.0002366 
## 4 : obj 0.3263 ratio 0.0001715 
## 5 : obj 0.3263 ratio 0.0001295 
## 6 : obj 0.3263 ratio 9.812e-05 
## 7 : obj 0.3263 ratio 7.43e-05 
## 8 : obj 0.3263 ratio 5.622e-05 
## 9 : obj 0.3263 ratio 4.251e-05 
## 10 : obj 0.3263 ratio 3.21e-05 
## 11 : obj 0.3263 ratio 2.422e-05 
## 12 : obj 0.3263 ratio 1.825e-05 
## 13 : obj 0.3263 ratio 1.375e-05 
## 14 : obj 0.3263 ratio 1.034e-05 
## 15 : obj 0.3263 ratio 7.773e-06

fita2=softImpute(x,rank.max=3,lambda=1.9,trace=TRUE)

## 1 : obj 0.3708 ratio 1.161 
## 2 : obj 0.3342 ratio 0.09962 
## 3 : obj 0.3294 ratio 0.04512 
## 4 : obj 0.328 ratio 0.02393 
## 5 : obj 0.3273 ratio 0.01262 
## 6 : obj 0.3269 ratio 0.006533 
## 7 : obj 0.3267 ratio 0.003423 
## 8 : obj 0.3266 ratio 0.001866 
## 9 : obj 0.3266 ratio 0.001082 
## 10 : obj 0.3265 ratio 0.0006755 
## 11 : obj 0.3265 ratio 0.0004539 
## 12 : obj 0.3264 ratio 0.0003257 
## 13 : obj 0.3264 ratio 0.0002465 
## 14 : obj 0.3264 ratio 0.0001943 
## 15 : obj 0.3264 ratio 0.0001579 
## 16 : obj 0.3264 ratio 0.0001313 
## 17 : obj 0.3264 ratio 0.0001112 
## 18 : obj 0.3264 ratio 9.542e-05 
## 19 : obj 0.3264 ratio 8.281e-05 
## 20 : obj 0.3263 ratio 7.254e-05 
## 21 : obj 0.3263 ratio 6.403e-05 
## 22 : obj 0.3263 ratio 5.69e-05 
## 23 : obj 0.3263 ratio 5.084e-05 
## 24 : obj 0.3263 ratio 4.565e-05 
## 25 : obj 0.3263 ratio 4.116e-05 
## 26 : obj 0.3263 ratio 3.725e-05 
## 27 : obj 0.3263 ratio 3.382e-05 
## 28 : obj 0.3263 ratio 3.079e-05 
## 29 : obj 0.3263 ratio 2.811e-05 
## 30 : obj 0.3263 ratio 2.572e-05 
## 31 : obj 0.3263 ratio 2.358e-05 
## 32 : obj 0.3263 ratio 2.166e-05 
## 33 : obj 0.3263 ratio 1.993e-05 
## 34 : obj 0.3263 ratio 1.837e-05 
## 35 : obj 0.3263 ratio 1.695e-05 
## 36 : obj 0.3263 ratio 1.566e-05 
## 37 : obj 0.3263 ratio 1.449e-05 
## 38 : obj 0.3263 ratio 1.342e-05 
## 39 : obj 0.3263 ratio 1.244e-05 
## 40 : obj 0.3263 ratio 1.155e-05 
## 41 : obj 0.3263 ratio 1.072e-05 
## 42 : obj 0.3263 ratio 9.971e-06 
## final SVD: obj 0.3263

fits2$d

## [1] 0.44478 0.06991 0.00000

These two are the same (modulo convergence criterion). Because the smallest singular value is zero, we know we are in the convex regime, and so both algorithms give the same solution. We can impute the missing values using complete(), which returns the full matrix:

complete(x,fits2)

##         [,1]       [,2]     [,3]       [,4]     [,5]
## [1,]  0.8655  0.0156518  0.17479 -0.0031477 -0.06441
## [2,] -0.6004  0.0013211 -0.21191  0.0004573  0.04221
## [3,] -0.7169  0.0006881  0.00301  0.0643736 -0.09754
## [4,]  0.6966 -0.5033181  0.55848  1.5437566  0.01520
## [5,]  1.2312 -0.3423237 -0.81027 -0.8200643 -0.13257
## [6,]  0.2664  0.1448639 -0.05393 -0.0732347 -2.24088

We can first double center our matrix before completion

xc=biScale(x,col.scale=FALSE,row.scale=FALSE,trace=TRUE)

## Iter 1 Total Changes 1.663 
## Iter 2 Total Changes 0.07349 
## Iter 3 Total Changes 0.002381 
## Iter 4 Total Changes 0.000153 
## Iter 5 Total Changes 1.083e-05 
## Iter 6 Total Changes 7.749e-07 
## Iter 7 Total Changes 5.552e-08 
## Iter 8 Total Changes 3.979e-09 
## Iter 9 Total Changes 2.852e-10

xc

##         [,1]     [,2]    [,3]   [,4]    [,5]
## [1,]  0.1390 -0.09270 -0.0463     NA      NA
## [2,] -0.4470       NA  0.4470     NA      NA
## [3,] -0.8253       NA      NA  0.116  0.7093
## [4,] -0.1982 -0.77993  0.1691  0.809      NA
## [5,]  0.9662  0.01088 -0.5698 -0.925  0.5177
## [6,]  0.3652  0.86175      NA     NA -1.2269
## attr(,"biScale:row")
## attr(,"biScale:row")$center
## [1]  0.38624 -0.49371 -0.23189  0.55450 -0.07532 -0.43900
## 
## attr(,"biScale:row")$scale
## [1] 1 1 1 1 1 1
## 
## attr(,"biScale:column")
## attr(,"biScale:column")$center
## [1]  0.3402 -0.2779 -0.1652  0.1803 -0.5749
## 
## attr(,"biScale:column")$scale
## [1] 1 1 1 1 1
## 
## attr(,"critmat")
##       iter      crit
##  [1,]    1 1.663e+00
##  [2,]    2 7.349e-02
##  [3,]    3 2.381e-03
##  [4,]    4 1.530e-04
##  [5,]    5 1.083e-05
##  [6,]    6 7.749e-07
##  [7,]    7 5.552e-08
##  [8,]    8 3.979e-09
##  [9,]    9 2.852e-10

fits3=softImpute(xc,rank.max=3,lambda=1,type="svd")
fits3$d

## [1] 1.0938 0.6484 0.0000

complete(x,fits3,unscale=TRUE)

##         [,1]     [,2]    [,3]     [,4]     [,5]
## [1,]  0.8655  0.01565  0.1748  0.53515 -0.16774
## [2,] -0.6004 -0.85835 -0.2119 -0.14764 -1.04932
## [3,] -0.7169 -0.78692 -0.3101  0.06437 -0.09754
## [4,]  0.6966 -0.50332  0.5585  1.54376  0.16450
## [5,]  1.2312 -0.34232 -0.8103 -0.82006 -0.13257
## [6,]  0.2664  0.14486 -0.6315 -0.37088 -2.24088

Notice that we completed x with fits3, which was run on the centered version xc. The scaling info is stored on the SVD object as an attribute, and with unscale=TRUE (actually the default), the centering is reversed.

Using the sparse matrix version

So far we have not been worried about matrix size, because x is small. We can convert it to a sparse matrix object

xs=as(x,"Incomplete")
xs

## 6 x 5 sparse Matrix of class "Incomplete"
##                                                
## [1,]  0.8655  0.01565  0.1748  .        .      
## [2,] -0.6004  .       -0.2119  .        .      
## [3,] -0.7169  .        .       0.06437 -0.09754
## [4,]  0.6966 -0.50332  0.5585  1.54376  .      
## [5,]  1.2312 -0.34232 -0.8103 -0.82006 -0.13257
## [6,]  0.2664  0.14486  .       .       -2.24088

Notice that it stores the missing entries as “zeros”, but because it is class "Incomplete", the object “knows”" this is not really the case. In practice, we would not run as() on a huge matrix with tons of missing values, because we probably could not fit it in memory. So we would need a way of getting this matrix into R. This is typically stored on disk in what is known as “market matrix” format, as the triples (i,j,value). We can reverse engineer this here just for a demo:

i=row(x)[!is.na(x)]
j=col(x)[!is.na(x)]
value=x[!is.na(x)]
cbind(i,j,value)

##       i j    value
##  [1,] 1 1  0.86549
##  [2,] 2 1 -0.60042
##  [3,] 3 1 -0.71693
##  [4,] 4 1  0.69656
##  [5,] 5 1  1.23116
##  [6,] 6 1  0.26644
##  [7,] 1 2  0.01565
##  [8,] 4 2 -0.50332
##  [9,] 5 2 -0.34232
## [10,] 6 2  0.14486
## [11,] 1 3  0.17479
## [12,] 2 3 -0.21191
## [13,] 4 3  0.55848
## [14,] 5 3 -0.81027
## [15,] 3 4  0.06437
## [16,] 4 4  1.54376
## [17,] 5 4 -0.82006
## [18,] 3 5 -0.09754
## [19,] 5 5 -0.13257
## [20,] 6 5 -2.24088

For the Netflix data dimensions, there are 99 million non-missing entries, much less than 8.5 trillion entries in the full matrix. We would input the data via the (i,j,value) representation, and then construct xs

Incomplete(i=i,j=j,x=value)

## 6 x 5 sparse Matrix of class "Incomplete"
##                                                
## [1,]  0.8655  0.01565  0.1748  .        .      
## [2,] -0.6004  .       -0.2119  .        .      
## [3,] -0.7169  .        .       0.06437 -0.09754
## [4,]  0.6966 -0.50332  0.5585  1.54376  .      
## [5,]  1.2312 -0.34232 -0.8103 -0.82006 -0.13257
## [6,]  0.2664  0.14486  .       .       -2.24088

Lets pretend this object is huge, for the purposes of our demonstration. We can double center it just as before, and run softImpute()

xsc=biScale(xs,col.scale=FALSE,row.scale=FALSE)
fitss=softImpute(xsc,rank.max=3,lambda=1,trace=TRUE,type="svd")

## Ssvd.als: 1 : obj 0.1248 ratio 1.4 
## Ssvd.als: 2 : obj 0.107 ratio 0.03347 
## Ssvd.als: 3 : obj 0.1065 ratio 0.00196 
## Ssvd.als: 4 : obj 0.1065 ratio 0.0002632 
## Ssvd.als: 5 : obj 0.1065 ratio 7.162e-05 
## Ssvd.als: 6 : obj 0.1064 ratio 3.543e-05 
## Ssvd.als: 7 : obj 0.1064 ratio 2.193e-05 
## Ssvd.als: 8 : obj 0.1064 ratio 1.466e-05 
## Ssvd.als: 9 : obj 0.1064 ratio 1.021e-05 
## Ssvd.als: 10 : obj 0.1064 ratio 7.332e-06 
## Ssvd.als: 1 : obj 0.1049 ratio 0.007423 
## Ssvd.als: 2 : obj 0.1048 ratio 0.0004016 
## Ssvd.als: 3 : obj 0.1047 ratio 2.523e-05 
## Ssvd.als: 4 : obj 0.1047 ratio 1.678e-06 
## 1 : obj 0.1576 ratio 0.01121 
## Ssvd.als: 1 : obj 0.1046 ratio 0.0009217 
## Ssvd.als: 2 : obj 0.1046 ratio 6.32e-05 
## Ssvd.als: 3 : obj 0.1046 ratio 4.108e-06 
## 2 : obj 0.157 ratio 0.00151 
## Ssvd.als: 1 : obj 0.1046 ratio 0.0002208 
## Ssvd.als: 2 : obj 0.1046 ratio 1.726e-05 
## Ssvd.als: 3 : obj 0.1046 ratio 1.192e-06 
## 3 : obj 0.1568 ratio 0.0003757 
## Ssvd.als: 1 : obj 0.1045 ratio 7.01e-05 
## Ssvd.als: 2 : obj 0.1045 ratio 5.495e-06 
## 4 : obj 0.1568 ratio 0.0001153 
## Ssvd.als: 1 : obj 0.1045 ratio 2.461e-05 
## Ssvd.als: 2 : obj 0.1045 ratio 1.919e-06 
## 5 : obj 0.1568 ratio 4.043e-05 
## Ssvd.als: 1 : obj 0.1045 ratio 8.88e-06 
## 6 : obj 0.1568 ratio 1.305e-05 
## Ssvd.als: 1 : obj 0.1045 ratio 3.579e-06 
## 7 : obj 0.1568 ratio 4.999e-06

fitss$d

## [1] 1.0936 0.6485 0.0000

fits3$d

## [1] 1.0938 0.6484 0.0000

Notice here that we get additional trace info with trace=TRUE. Since xs is “huge”, the SVD is computed using alternating subspace methods (with warm starts), and so we are seeing that as inner loops as well.

With huge matrices we would not use the complete() function, but rather request individual predictions. For example entries (2,2), and (3,2) could be imputed via

impute(fitss,i=c(2,3),j=c(2,2))

## [1] -0.8585 -0.7869

Again, the unscale=TRUE default for impute() means that the centering (stored on the object fitss) was reversed.

Reduced rank SVD of large sparse matrices

This is almost an aside, but the tools we developed for large matrix completion problems are also useful for working with large (but sparse) matrices. Suppose we had such a beast, and we wanted to compute its principal components. We would need to column-center the matrix, which would render it no longer sparse. We would also want to compute a few of the largest singular vectors for this beast. Lets see how we do that.

First we will read in our large matrix, again in market-matrix format. For simplicity we use our same matrix x, except now the missing values are zeros.

x0=sparseMatrix(i=i,j=j,x=value)
x0

## 6 x 5 sparse Matrix of class "dgCMatrix"
##                                                
## [1,]  0.8655  0.01565  0.1748  .        .      
## [2,] -0.6004  .       -0.2119  .        .      
## [3,] -0.7169  .        .       0.06437 -0.09754
## [4,]  0.6966 -0.50332  0.5585  1.54376  .      
## [5,]  1.2312 -0.34232 -0.8103 -0.82006 -0.13257
## [6,]  0.2664  0.14486  .       .       -2.24088

x0c=biScale(x0,col.scale=FALSE,row.scale=FALSE,row.center=FALSE)
x0c

## An object of class "SparseplusLowRank"
## Slot "x":
## 6 x 5 sparse Matrix of class "dgCMatrix"
##                                                
## [1,]  0.8655  0.01565  0.1748  .        .      
## [2,] -0.6004  .       -0.2119  .        .      
## [3,] -0.7169  .        .       0.06437 -0.09754
## [4,]  0.6966 -0.50332  0.5585  1.54376  .      
## [5,]  1.2312 -0.34232 -0.8103 -0.82006 -0.13257
## [6,]  0.2664  0.14486  .       .       -2.24088
## 
## Slot "a":
##      [,1] [,2]
## [1,]    0    1
## [2,]    0    1
## [3,]    0    1
## [4,]    0    1
## [5,]    0    1
## [6,]    0    1
## 
## Slot "b":
##      [,1]     [,2]
## [1,]   -1 -0.29038
## [2,]   -1  0.11419
## [3,]   -1  0.04815
## [4,]   -1 -0.13134
## [5,]   -1  0.41183

So the column centered matrix is still stored in sparse format, but now it has class "SparseplusLowRank", with slots a and b, and slot x the original matrix (x0). In fact, the centered matrix is $x+ab^T$.

Now we compute the SVD of this matrix, using our function svd.als() ( a workhorse for softImpute())

svdx0c=svd.als(x0c,rank=3,trace=TRUE)

## Ssvd.als: 1 : obj 0.07908 ratio 4.008 
## Ssvd.als: 2 : obj 0.01221 ratio 0.2448 
## Ssvd.als: 3 : obj 0.00414 ratio 0.00377 
## Ssvd.als: 4 : obj 0.00409 ratio 2.136e-05 
## Ssvd.als: 5 : obj 0.00409 ratio 1.215e-07

svdx0c$d

## [1] 2.106 1.928 1.767

We can compare this to the SVD of the full matrix version of this

x02=as.matrix(x0)
svd(scale(x02,TRUE,FALSE))$d

## [1] 2.10581 1.92845 1.76687 0.48909 0.07725

One can actually use regularization here as well. For large problems, this can speed up convergence, and biases the convergence criterion toward the larger singular values. Note that if $Z=S_\lambda(X)$ has rank $r$, then the rank-$r$ SVD of $X$ will have the same left and right singular vectors as $Z$, and singular values $\lambda$ units higher.

Warm starts and regularization paths

Typically we don’t have a clue about what values of $\lambda$ are reasonable. One useful function is lambda0(), which identifies the smallest value of $\lambda$ for which the rank of the solution is zero.

lam0=lambda0(xs)
lam0

## [1] 2.317

fit0=softImpute(xs,lambda=lam0+.2)
fit0$d

## [1] 0

(If we had used lam0 itself, we would have to increase the number of iterations and decrease the threshold for softImpute to achieve an exact zero with ALS)

This value is actually the largest singular value for the version of xs with the missing values replaced by zeros. Lets check:

xs0=as(xs,"sparseMatrix")
fit0=svd.als(xs0)
fit0$d

## [1] 2.317 1.979

Now, armed with lam0 we could make a sequence of lambda values, decreasing toward zero.

lamseq=exp(seq(from=log(lam0),to=log(1),length=10))
lamseq

##  [1] 2.317 2.110 1.922 1.751 1.595 1.453 1.323 1.205 1.098 1.000

Now the idea is to fit a sequence of models, using these values of lambda, and warms starts. For large matrices, we also want to be clever with the rank, because we could not fit models with very large rank. Here is an example of what we could do.

fits=as.list(lamseq)
ranks=as.integer(lamseq)
rank.max=2
warm=NULL
for( i in seq(along=lamseq)){
  fiti=softImpute(xs,lambda=lamseq[i],rank=rank.max,warm=warm)
  ranks[i]=sum(round(fiti$d,4)>0)
  rank.max=min(ranks[i]+2,4)
  warm=fiti
  fits[[i]]=fiti
  cat(i,"lambda=",lamseq[i],"rank.max",rank.max,"rank",ranks[i],"\n")
  }

## Warning: Convergence not achieved by 100 iterations

## 1 lambda= 2.317 rank.max 3 rank 1 
## 2 lambda= 2.11 rank.max 3 rank 1 
## 3 lambda= 1.922 rank.max 4 rank 2 
## 4 lambda= 1.751 rank.max 4 rank 3 
## 5 lambda= 1.595 rank.max 4 rank 3 
## 6 lambda= 1.453 rank.max 4 rank 3 
## 7 lambda= 1.323 rank.max 4 rank 3 
## 8 lambda= 1.205 rank.max 4 rank 3 
## 9 lambda= 1.098 rank.max 4 rank 3 
## 10 lambda= 1 rank.max 4 rank 3

Notes:

The warning message is for the first fit; it needs more iterations to get convergence to the NULL solution, and in fact we see that it didn’t quite get there, since the rank is 1 (fits[[1]]$d)
We added 2 to the rank achieved for the previous lambda. That is fine for this small problem, but for large problems (Netflix size), you don’t want to be too cavalier with large ranks. Ideally, the rank you use should be sufficiently large to achieve a solution with some soft-thresholded singular values set to zero. This would guarantee that you have solved the regularized problem.