{
"cells": [
{
"cell_type": "raw",
"id": "7c91029c",
"metadata": {},
"source": [
"---\n",
"title: \"Ch12-unsup-lab\"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "67f91564",
"metadata": {},
"source": [
"\n",
"# Lab: Unsupervised Learning\n",
"## Principal Components Analysis\n",
"\n",
"In this lab, we perform PCA on the `USArrests` data set, which is part of the base `R` package.\n",
"The rows of the data set contain the 50 states, in alphabetical order."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "176a8d7e",
"metadata": {
"name": "chunk1"
},
"outputs": [],
"source": [
"states <- row.names(USArrests)\n",
"states"
]
},
{
"cell_type": "markdown",
"id": "09ead673",
"metadata": {},
"source": [
"The columns of the data set contain the four variables."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "eb5f80aa",
"metadata": {
"name": "chunk2"
},
"outputs": [],
"source": [
"names(USArrests)"
]
},
{
"cell_type": "markdown",
"id": "09ae38b1",
"metadata": {},
"source": [
"We first briefly examine the data. We notice that the variables have vastly different means."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f0bc8b67",
"metadata": {
"name": "chunk3"
},
"outputs": [
{
"data": {
"text/html": [
"254.668652883752"
],
"text/latex": [
"254.668652883752"
],
"text/markdown": [
"254.668652883752"
],
"text/plain": [
"[1] 254.6687"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"apply(USArrests, 2, mean)"
]
},
{
"cell_type": "markdown",
"id": "d0cd124e",
"metadata": {},
"source": [
"Note that the `apply()` function allows us to apply a function---in this case, the `mean()` function---to each row or column of the data set. The second input here denotes whether we wish to compute the mean of the rows, $1$, or the columns, $2$. We see that there are on average three times as many rapes as murders, and more than eight times as many assaults as rapes.\n",
"We can also examine the variances of the four variables using the `apply()` function."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "39555396",
"metadata": {
"name": "chunk4"
},
"outputs": [],
"source": [
"apply(USArrests, 2, var)"
]
},
{
"cell_type": "markdown",
"id": "8d0549e0",
"metadata": {},
"source": [
"Not surprisingly, the variables also have vastly different variances:\n",
" the `UrbanPop` variable measures the percentage of the population in each state living in an urban area, which is not a comparable number to the number of rapes\n",
"in each state per 100,000 individuals.\n",
"If we failed to scale the variables before performing PCA, then most of the principal components that we observed would be driven by the `Assault` variable, since it has by far the largest mean and variance.\n",
"Thus, it is important to standardize the variables to have mean zero and standard deviation one before performing PCA.\n",
"\n",
"We now perform principal components analysis using the `prcomp()` function, which is one of several functions in `R` that perform PCA."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "259e0074",
"metadata": {
"name": "chunk5"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Loading required package: Matrix\n",
"\n",
"Loaded glmnet 4.1-2.1\n",
"\n"
]
},
{
"data": {
"text/html": [
"252.299368270238"
],
"text/latex": [
"252.299368270238"
],
"text/markdown": [
"252.299368270238"
],
"text/plain": [
"[1] 252.2994"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pr.out <- prcomp(USArrests, scale = TRUE)"
]
},
{
"cell_type": "markdown",
"id": "e1b1497c",
"metadata": {},
"source": [
"By default, the `prcomp()` function centers the variables to have\n",
"mean zero. By using the option `scale = TRUE`, we scale the\n",
"variables to have standard deviation one. The output from\n",
"`prcomp()` contains a number of useful quantities."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "99306afa",
"metadata": {
"name": "chunk6"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"The downloaded binary packages are in\n",
"\t/var/folders/f4/c1t2d_5d0lb6z9g3pp8gs5_m0000gq/T//RtmppimaFU/downloaded_packages\n"
]
}
],
"source": [
"names(pr.out)"
]
},
{
"cell_type": "markdown",
"id": "fa276cba",
"metadata": {},
"source": [
"The `center` and `scale` components correspond to the means and standard deviations of the variables that were used for scaling prior to implementing PCA."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7329aa48",
"metadata": {
"name": "chunk7"
},
"outputs": [],
"source": [
"pr.out$center\n",
"pr.out$scale"
]
},
{
"cell_type": "markdown",
"id": "f8cd2f2e",
"metadata": {},
"source": [
"The `rotation` matrix provides the principal component loadings;\n",
"each column of `pr.out$rotation` contains the corresponding\n",
"principal component loading vector. ( *This function names it the rotation matrix, because when we matrix-multiply the $\\bf X$ matrix by `pr.out$rotation`, it gives us the coordinates of the data in the rotated coordinate system. These coordinates are the principal component scores.* )\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "ab94cca8",
"metadata": {
"name": "chunk8"
},
"outputs": [],
"source": [
"pr.out$rotation"
]
},
{
"cell_type": "markdown",
"id": "3a0c00b3",
"metadata": {},
"source": [
"We see that there are four distinct principal components. This is to\n",
"be expected because there are in general $\\min(n-1,p)$ informative\n",
"principal components in a data set with $n$ observations and $p$\n",
"variables.\n",
"\n",
"Using the `prcomp()` function, we do not need to explicitly multiply the data by the principal component loading vectors in order to obtain the principal component score vectors. Rather the $50 \\times 4$ matrix `x` has as its columns the principal component score vectors. That is, the $k$th column is the $k$th principal component score vector."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1800b6ec",
"metadata": {
"name": "chunk9"
},
"outputs": [],
"source": [
"dim(pr.out$x)"
]
},
{
"cell_type": "markdown",
"id": "dfeb812d",
"metadata": {},
"source": [
"We can plot the first two principal components as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "162e0807",
"metadata": {
"name": "chunk10"
},
"outputs": [],
"source": [
"biplot(pr.out, scale = 0)"
]
},
{
"cell_type": "markdown",
"id": "99ba225a",
"metadata": {},
"source": [
"The `scale = 0` argument to `biplot()` ensures that the arrows are scaled to represent the loadings; other values for `scale` give slightly different biplots with different interpretations.\n",
"\n",
"Notice that this figure is a mirror image of Figure 12.1. Recall that the principal components are only unique up to a sign change, so we can reproduce Figure 12.1 by making a few small changes:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "3291832f",
"metadata": {
"lines_to_next_cell": 0,
"name": "chunk11"
},
"outputs": [],
"source": [
"pr.out$rotation = -pr.out$rotation\n",
"pr.out$x = -pr.out$x\n",
"biplot(pr.out, scale = 0)"
]
},
{
"cell_type": "markdown",
"id": "e4c7480c",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "713fc509",
"metadata": {},
"source": [
"The `prcomp()` function also outputs the standard deviation of each principal component. For instance, on the `USArrests` data set, we can access these standard deviations as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca555fac",
"metadata": {
"name": "chunk12"
},
"outputs": [],
"source": [
"pr.out$sdev"
]
},
{
"cell_type": "markdown",
"id": "415beff8",
"metadata": {},
"source": [
"The variance explained by each principal component is obtained by squaring these:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b60f1d00",
"metadata": {
"name": "chunk13"
},
"outputs": [],
"source": [
"pr.var <- pr.out$sdev^2\n",
"pr.var"
]
},
{
"cell_type": "markdown",
"id": "85eb1490",
"metadata": {},
"source": [
"To compute the proportion of variance explained by each principal component, we simply divide the variance explained by each principal component by the total variance explained by all four principal components:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21ade5c0",
"metadata": {
"name": "chunk14"
},
"outputs": [],
"source": [
"pve <- pr.var / sum(pr.var)\n",
"pve"
]
},
{
"cell_type": "markdown",
"id": "212a7d24",
"metadata": {},
"source": [
"We see that the first principal component explains $62.0\\,\\%$ of the variance in the data, the next principal component explains $24.7\\,\\%$ of the variance, and so forth.\n",
" We can plot the PVE explained by each component, as well as the cumulative PVE, as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63baa25d",
"metadata": {
"name": "chunk15"
},
"outputs": [],
"source": [
"par(mfrow = c(1, 2))\n",
"plot(pve, xlab = \"Principal Component\",\n",
" ylab = \"Proportion of Variance Explained\", ylim = c(0, 1),\n",
" type = \"b\")\n",
"plot(cumsum(pve), xlab = \"Principal Component\",\n",
" ylab = \"Cumulative Proportion of Variance Explained\",\n",
" ylim = c(0, 1), type = \"b\")"
]
},
{
"cell_type": "markdown",
"id": "726ce33d",
"metadata": {},
"source": [
" The result is shown in Figure 12.3.\n",
"Note that the function `cumsum()` computes the cumulative sum of the elements of a numeric vector. For instance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "053eed07",
"metadata": {
"lines_to_next_cell": 0,
"name": "chunk16"
},
"outputs": [],
"source": [
"a <- c(1, 2, 8, -3)\n",
"cumsum(a)"
]
},
{
"cell_type": "markdown",
"id": "db2158e9",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "292345ef",
"metadata": {},
"source": [
"## Matrix Completion \n",
"We now re-create the analysis carried out on the \\USArrests\\ data in\n",
"Section 12.3. We turn the data frame into a\n",
"matrix, after centering and scaling each column to have mean zero and\n",
"variance one."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b2918f5",
"metadata": {
"name": "chunk17"
},
"outputs": [],
"source": [
"X <- data.matrix(scale(USArrests))\n",
"pcob <- prcomp(X)\n",
"summary(pcob)"
]
},
{
"cell_type": "markdown",
"id": "9e006556",
"metadata": {},
"source": [
"We see that the first principal component explains $62\\%$ of the\n",
"variance. \n",
"\n",
"We saw in Section 12.2.2 that solving the optimization\n",
"problem~(12.6) on a centered data matrix $\\bf X$ is\n",
"equivalent to computing the first $M$ principal\n",
"components of the data. The \n",
"(SVD) is a general algorithm for solving (12.6)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9898f442",
"metadata": {
"name": "chunk18"
},
"outputs": [],
"source": [
"sX <- svd(X)\n",
"names(sX)\n",
"round(sX$v, 3)"
]
},
{
"cell_type": "markdown",
"id": "cfb1a8f3",
"metadata": {},
"source": [
"The `svd()` function returns three components, `u`, `d`, and `v`. The matrix `v` is equivalent to the\n",
"loading matrix from principal components (up to an unimportant sign flip)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2de0edc2",
"metadata": {
"name": "chunk19"
},
"outputs": [],
"source": [
"pcob$rotation"
]
},
{
"cell_type": "markdown",
"id": "a3a1d0ab",
"metadata": {},
"source": [
"The matrix `u` is equivalent to the matrix of *standardized*\n",
"scores, and the standard deviations are in the vector `d`. We can recover the score vectors using the output of `svd()`.\n",
"They are identical to the score vectors output by `prcomp()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a37a5339",
"metadata": {
"name": "chunk20"
},
"outputs": [],
"source": [
"t(sX$d * t(sX$u))\n",
"pcob$x"
]
},
{
"cell_type": "markdown",
"id": "c4df2c8a",
"metadata": {},
"source": [
"While it would be possible to carry out this lab using the `prcomp()` function,\n",
"here we use the `svd()` function in order to illustrate its use."
]
},
{
"cell_type": "markdown",
"id": "2ea3308d",
"metadata": {},
"source": [
"We now omit 20 entries in the $50\\times 2$ data matrix at random. We do so\n",
"by first selecting 20 rows (states) at random, and then selecting one\n",
"of the four entries in each row at random. This ensures that every row has\n",
"at least three observed values.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2b132ac",
"metadata": {
"name": "chunk21"
},
"outputs": [],
"source": [
"nomit <- 20\n",
"set.seed(15)\n",
"ina <- sample(seq(50), nomit)\n",
"inb <- sample(1:4, nomit, replace = TRUE)\n",
"Xna <- X\n",
"index.na <- cbind(ina, inb)\n",
"Xna[index.na] <- NA"
]
},
{
"cell_type": "markdown",
"id": "da1518b1",
"metadata": {},
"source": [
"Here, `ina`\n",
"contains 20 integers from 1 to 50; this represents the states that are selected to contain missing values. And `inb` contains\n",
"20 integers from 1 to 4, representing the features that contain the missing values for each of the selected states.\n",
"To perform the final indexing, we create `index.na`, a two-column\n",
"matrix whose columns are `ina` and `inb`. We have indexed a matrix with a\n",
"matrix of indices!\n",
"\n",
"We now write some code to implement\n",
"Algorithm 12.1.\n",
"We first write a function that takes in a matrix, and returns an approximation to the matrix using the `svd()` function.\n",
" This will be needed in Step 2 of Algorithm 12.1. As mentioned earlier, we could do this using the `prcomp()` function, but instead we use the `svd()` function for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f42b5f46",
"metadata": {
"name": "chunk22"
},
"outputs": [],
"source": [
"fit.svd <- function(X, M = 1) {\n",
" svdob <- svd(X)\n",
" with(svdob,\n",
" u[, 1:M, drop = FALSE] %*%\n",
" (d[1:M] * t(v[, 1:M, drop = FALSE]))\n",
" )\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "93005b0b",
"metadata": {},
"source": [
"Here, we did not bother to explicitly call the `return()` function to return a value from `fit.svd()`; however,\n",
"the computed quantity is automatically returned by `R`. We use the `with()` function to\n",
"make it a little easier to index the elements of `svdob`. As an alternative to using `with()`, we could have written"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ef9fbc4d",
"metadata": {
"name": "chunk23"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "9e50e645",
"metadata": {},
"source": [
"inside the `fit.svd()` function.\n",
"\n",
"To conduct Step 1 of the algorithm, we initialize `Xhat` --- this is $\\tilde{\\bf X}$ in Algorithm 12.1 --- by replacing\n",
"the missing values with the column means of the non-missing entries."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35b22a70",
"metadata": {
"lines_to_next_cell": 2,
"name": "chunk24"
},
"outputs": [],
"source": [
"Xhat <- Xna\n",
"xbar <- colMeans(Xna, na.rm = TRUE)\n",
"Xhat[index.na] <- xbar[inb]"
]
},
{
"cell_type": "markdown",
"id": "ee9573f9",
"metadata": {},
"source": [
"Before we begin Step 2, we set ourselves up to measure the progress of our\n",
"iterations:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7c2293e",
"metadata": {
"name": "chunk25"
},
"outputs": [],
"source": [
"thresh <- 1e-7\n",
"rel_err <- 1\n",
"iter <- 0\n",
"ismiss <- is.na(Xna)\n",
"mssold <- mean((scale(Xna, xbar, FALSE)[!ismiss])^2)\n",
"mss0 <- mean(Xna[!ismiss]^2)"
]
},
{
"cell_type": "markdown",
"id": "8c0c6b84",
"metadata": {},
"source": [
"Here `ismiss` is a new logical matrix with the same dimensions as `Xna`; a given element equals `TRUE` if the corresponding matrix element is missing. This is useful\n",
"because it allows us to access both the missing and non-missing entries. We store the mean of the squared non-missing elements in `mss0`.\n",
"We store the mean squared error of the non-missing elements of the old version of `Xhat` in `mssold`. We plan to store the mean squared error of the non-missing elements of the current version of `Xhat` in `mss`, and will then iterate Step 2 of Algorithm 12.1 until the *relative error*, defined as \n",
"\n",
"`(mssold - mss) / mss0`, falls below `thresh = 1e-7`. ( *Algorithm 12.1 tells us to iterate Step 2 until (12.14) is no longer decreasing. Determining whether (12.14) is decreasing requires us only to keep track of `mssold - mss`. However, in practice, we keep track of `(mssold - mss) / mss0` instead: this makes it so that the number of iterations required for Algorithm 12.1 to converge does not depend on whether we multiplied the raw data $\\bf X$ by a constant factor.* )"
]
},
{
"cell_type": "markdown",
"id": "91516a1e",
"metadata": {},
"source": [
"In Step 2(a) of Algorithm 12.1, we approximate `Xhat` using `fit.svd()`; we call this `Xapp`. In Step 2(b), we use `Xapp` to update the estimates for elements in `Xhat` that are missing in `Xna`. Finally, in Step 2(c), we compute the relative error. These three steps are contained in this `while()` loop:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "238b1219",
"metadata": {
"name": "chunk26"
},
"outputs": [],
"source": [
"while(rel_err > thresh) {\n",
" iter <- iter + 1\n",
" # Step 2(a)\n",
" Xapp <- fit.svd(Xhat, M = 1)\n",
" # Step 2(b)\n",
" Xhat[ismiss] <- Xapp[ismiss]\n",
" # Step 2(c)\n",
" mss <- mean(((Xna - Xapp)[!ismiss])^2)\n",
" rel_err <- (mssold - mss) / mss0\n",
" mssold <- mss\n",
" cat(\"Iter:\", iter, \"MSS:\", mss,\n",
" \"Rel. Err:\", rel_err, \"\\n\")\n",
" }"
]
},
{
"cell_type": "markdown",
"id": "bffcba24",
"metadata": {},
"source": [
"We see that after eight iterations, the relative error has fallen below `thresh = 1e-7`, and so the algorithm terminates. When this happens, the mean squared error of the non-missing elements equals $0.369$.\n",
"\n",
"Finally, we compute the correlation between the 20 imputed values\n",
"and the actual values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53035f28",
"metadata": {
"lines_to_next_cell": 0,
"name": "chunk27"
},
"outputs": [],
"source": [
"cor(Xapp[ismiss], X[ismiss])"
]
},
{
"cell_type": "markdown",
"id": "18d9e883",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "e948a613",
"metadata": {},
"source": [
"In this lab, we implemented Algorithm 12.1 ourselves for didactic purposes. However, a reader who wishes to apply matrix completion to their data should use the `softImpute` package on `CRAN`, which provides a very efficient implementation of a generalization of this algorithm.\n",
"\n",
"## Clustering\n",
"\n",
"### $K$-Means Clustering"
]
},
{
"cell_type": "markdown",
"id": "2b478399",
"metadata": {},
"source": [
"The function `kmeans()` performs $K$-means clustering in\n",
"`R`. We begin with a simple simulated example in which there\n",
"truly are two clusters in the data: the first 25 observations have a\n",
"mean shift relative to the next 25 observations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "afe2d077",
"metadata": {
"lines_to_next_cell": 0,
"name": "chunk28"
},
"outputs": [],
"source": [
"set.seed(2)\n",
"x <- matrix(rnorm(50 * 2), ncol = 2)\n",
"x[1:25, 1] <- x[1:25, 1] + 3\n",
"x[1:25, 2] <- x[1:25, 2] - 4"
]
},
{
"cell_type": "markdown",
"id": "ba9b8069",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"id": "f8c1bd30",
"metadata": {},
"source": [
"We now perform $K$-means clustering with $K=2$."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d56c68e7",
"metadata": {
"name": "chunk29"
},
"outputs": [],
"source": [
"km.out <- kmeans(x, 2, nstart = 20)"
]
},
{
"cell_type": "markdown",
"id": "89463e0f",
"metadata": {},
"source": [
"The cluster assignments of the 50 observations are contained in `km.out$cluster`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "017e4667",
"metadata": {
"name": "chunk30"
},
"outputs": [],
"source": [
"km.out$cluster"
]
},
{
"cell_type": "markdown",
"id": "644fbf89",
"metadata": {},
"source": [
"The $K$-means clustering perfectly separated the observations into two clusters even though we did not supply any group information to `kmeans()`. We can plot the data, with each observation\n",
"colored according to its cluster assignment."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a8b963d",
"metadata": {
"name": "chunk31"
},
"outputs": [],
"source": [
"plot(x, col = (km.out$cluster + 1),\n",
" main = \"K-Means Clustering Results with K = 2\",\n",
" xlab = \"\", ylab = \"\", pch = 20, cex = 2)"
]
},
{
"cell_type": "markdown",
"id": "a0d40fa5",
"metadata": {},
"source": [
"Here the observations can be easily plotted because they are two-dimensional. If there were more than two\n",
"variables then we could instead perform PCA and plot the first two principal components score vectors.\n",
"\n",
"In this example, we knew that there really were two clusters because we generated the data. However, for real data, in general we do not\n",
"know the true number of clusters. We could instead have performed $K$-means clustering on this example with $K=3$."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4bcbb58",
"metadata": {
"name": "chunk32"
},
"outputs": [],
"source": [
"set.seed(4)\n",
"km.out <- kmeans(x, 3, nstart = 20)\n",
"km.out\n",
"plot(x, col = (km.out$cluster + 1),\n",
" main = \"K-Means Clustering Results with K = 3\",\n",
" xlab = \"\", ylab = \"\", pch = 20, cex = 2)"
]
},
{
"cell_type": "markdown",
"id": "513126a7",
"metadata": {},
"source": [
"When $K=3$, $K$-means clustering splits up the two clusters.\n",
"\n",
"To run the `kmeans()` function in `R` with multiple initial cluster assignments, we use the\n",
"`nstart` argument. If a value of `nstart` greater than one is used, then $K$-means clustering will be performed using\n",
"multiple random assignments in Step~1 of Algorithm 12.2, and the `kmeans()` function will report only the best results. Here we compare using `nstart = 1` to `nstart = 20`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "814e5dfb",
"metadata": {
"name": "chunk33"
},
"outputs": [],
"source": [
"set.seed(4)\n",
"km.out <- kmeans(x, 3, nstart = 1)\n",
"km.out$tot.withinss\n",
"km.out <- kmeans(x, 3, nstart = 20)\n",
"km.out$tot.withinss"
]
},
{
"cell_type": "markdown",
"id": "f1267492",
"metadata": {},
"source": [
"Note that `km.out$tot.withinss` is the total within-cluster sum of squares, which we seek to minimize by performing $K$-means clustering (Equation 12.17). The individual within-cluster sum-of-squares are contained in the vector `km.out$withinss`.\n",
"\n",
"We *strongly* recommend always running $K$-means clustering with\n",
"a large value of `nstart`, such as 20 or 50, since otherwise an\n",
"undesirable local optimum may be obtained.\n",
"\n",
"When performing $K$-means clustering, in addition to using multiple initial cluster assignments, it is\n",
"also important to set a random seed using the `set.seed()` function. This way, the\n",
"initial cluster assignments in Step~1 can\n",
"be replicated, and the $K$-means output will be fully reproducible.\n",
"\n",
"### Hierarchical Clustering\n",
"\n",
"The `hclust()` function implements hierarchical clustering in `R`. In the following example we use the data from the previous lab to\n",
" plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering, with Euclidean distance as the dissimilarity measure.\n",
"We begin by clustering observations using complete linkage. The `dist()` function is used to compute the $50 \\times 50$ inter-observation Euclidean distance matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a66e92f6",
"metadata": {
"name": "chunk34"
},
"outputs": [],
"source": [
"hc.complete <- hclust(dist(x), method = \"complete\")"
]
},
{
"cell_type": "markdown",
"id": "ca3cf79b",
"metadata": {},
"source": [
"We could just as easily perform hierarchical clustering with average or single linkage instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c94bdb1",
"metadata": {
"name": "chunk35"
},
"outputs": [],
"source": [
"hc.average <- hclust(dist(x), method = \"average\")\n",
"hc.single <- hclust(dist(x), method = \"single\")"
]
},
{
"cell_type": "markdown",
"id": "85863d99",
"metadata": {},
"source": [
"We can now plot the dendrograms obtained using the usual `plot()` function. The numbers at the bottom of the plot identify each observation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46428894",
"metadata": {
"lines_to_next_cell": 2,
"name": "chunk36"
},
"outputs": [],
"source": [
"par(mfrow = c(1, 3))\n",
"plot(hc.complete, main = \"Complete Linkage\",\n",
" xlab = \"\", sub = \"\", cex = .9)\n",
"plot(hc.average, main = \"Average Linkage\",\n",
" xlab = \"\", sub = \"\", cex = .9)\n",
"plot(hc.single, main = \"Single Linkage\",\n",
" xlab = \"\", sub = \"\", cex = .9)"
]
},
{
"cell_type": "markdown",
"id": "c0c14da6",
"metadata": {},
"source": [
"To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the `cutree()` function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "548fe381",
"metadata": {
"name": "chunk37"
},
"outputs": [],
"source": [
"cutree(hc.complete, 2)\n",
"cutree(hc.average, 2)\n",
"cutree(hc.single, 2)"
]
},
{
"cell_type": "markdown",
"id": "30c02099",
"metadata": {},
"source": [
"The second argument to `cutree()` is the number of clusters we wish to obtain.\n",
"For this data, complete and average linkage generally separate the observations into their correct groups. However, single linkage identifies one point as belonging to its own cluster. A more sensible answer is obtained when four clusters are selected, although there are still two singletons."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9ea1788",
"metadata": {
"lines_to_next_cell": 2,
"name": "chunk38"
},
"outputs": [],
"source": [
"cutree(hc.single, 4)"
]
},
{
"cell_type": "markdown",
"id": "3823ee34",
"metadata": {},
"source": [
"To scale the variables before performing hierarchical clustering of the observations, we use the `scale()` function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "74acd000",
"metadata": {
"lines_to_next_cell": 2,
"name": "chunk39"
},
"outputs": [],
"source": [
"xsc <- scale(x)\n",
"plot(hclust(dist(xsc), method = \"complete\"),\n",
" main = \"Hierarchical Clustering with Scaled Features\")"
]
},
{
"cell_type": "markdown",
"id": "2f93ef80",
"metadata": {},
"source": [
"Correlation-based distance can be computed using the `as.dist()` function, which converts an arbitrary square symmetric matrix into a form that the `hclust()` function recognizes as a distance matrix. However, this only makes sense for data with at least three features since the absolute correlation between any two observations\n",
"with measurements on two features is always 1. Hence, we will cluster a three-dimensional data set. This data set does not contain any true clusters."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e8ae731",
"metadata": {
"lines_to_next_cell": 0,
"name": "chunk40"
},
"outputs": [],
"source": [
"x <- matrix(rnorm(30 * 3), ncol = 3)\n",
"dd <- as.dist(1 - cor(t(x)))\n",
"plot(hclust(dd, method = \"complete\"),\n",
" main = \"Complete Linkage with Correlation-Based Distance\",\n",
" xlab = \"\", sub = \"\")"
]
},
{
"cell_type": "markdown",
"id": "43b96f6a",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"id": "88b90c6d",
"metadata": {},
"source": [
"## NCI60 Data Example"
]
},
{
"cell_type": "markdown",
"id": "e263b013",
"metadata": {},
"source": [
"Unsupervised techniques are often used in the analysis of genomic data. In particular, PCA and hierarchical clustering are popular tools.\n",
" We illustrate these techniques on the `NCI` cancer cell line microarray data, which consists of $6{,}830$ gene expression measurements on $64$ cancer cell lines."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5947bd71",
"metadata": {
"name": "chunk41"
},
"outputs": [],
"source": [
"library(ISLR2)\n",
"nci.labs <- NCI60$labs\n",
"nci.data <- NCI60$data"
]
},
{
"cell_type": "markdown",
"id": "4e5a9590",
"metadata": {},
"source": [
"Each cell line is labeled with a cancer type, given in `nci.labs`. We do not make use of the cancer types in performing PCA and clustering, as these are unsupervised techniques. But\n",
"after performing PCA and clustering, we will\n",
"check to see the extent to which these cancer types agree with the results of these unsupervised techniques.\n",
"\n",
"The data has $64$ rows and $6{,}830$ columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c3fcc73",
"metadata": {
"lines_to_next_cell": 2,
"name": "chunk42"
},
"outputs": [],
"source": [
"dim(nci.data)"
]
},
{
"cell_type": "markdown",
"id": "80f06a85",
"metadata": {},
"source": [
"We begin by examining the cancer types for the cell lines."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "519c2a04",
"metadata": {
"lines_to_next_cell": 2,
"name": "chunk43"
},
"outputs": [],
"source": [
"nci.labs[1:4]\n",
"table(nci.labs)"
]
},
{
"cell_type": "markdown",
"id": "709b4f52",
"metadata": {},
"source": [
"### PCA on the NCI60 Data\n",
"\n",
"We first perform PCA on the data after scaling the variables (genes) to have standard deviation one, although one could reasonably argue that it is better not to scale the genes."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46fb2096",
"metadata": {
"name": "chunk44"
},
"outputs": [],
"source": [
"pr.out <- prcomp(nci.data, scale = TRUE)"
]
},
{
"cell_type": "markdown",
"id": "61ec5600",
"metadata": {},
"source": [
"We now plot the first few principal component score vectors, in order to visualize the data. The observations (cell lines) corresponding to a given cancer type will be plotted in the same color, so that we can see to what extent the observations within a cancer type are similar to each other. We first create a simple function that assigns a distinct color to each element of a numeric vector.\n",
"The function will be used to assign a color to each of the $64$ cell lines, based on the cancer type to which it corresponds."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f453033a",
"metadata": {
"name": "chunk45"
},
"outputs": [],
"source": [
"Cols <- function(vec) {\n",
" cols <- rainbow(length(unique(vec)))\n",
" return(cols[as.numeric(as.factor(vec))])\n",
" }"
]
},
{
"cell_type": "markdown",
"id": "1ad61612",
"metadata": {},
"source": [
"Note that the `rainbow()` function takes as its argument a positive integer, and returns a vector containing that number of distinct colors. We now can plot the principal component score vectors."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02ff1588",
"metadata": {
"name": "chunk46"
},
"outputs": [],
"source": [
"par(mfrow = c(1, 2))\n",
"plot(pr.out$x[, 1:2], col = Cols(nci.labs), pch = 19,\n",
" xlab = \"Z1\", ylab = \"Z2\")\n",
"plot(pr.out$x[, c(1, 3)], col = Cols(nci.labs), pch = 19,\n",
" xlab = \"Z1\", ylab = \"Z3\")"
]
},
{
"cell_type": "markdown",
"id": "60c6f1fa",
"metadata": {},
"source": [
"The resulting plots are shown in Figure 12.17. On the whole, cell lines corresponding to a single cancer type do tend to have similar values on the first few\n",
"principal component score vectors. This indicates that cell lines from the same cancer type tend to have pretty similar gene expression levels.\n",
"\n",
"\\begin{figure}[t]\n",
"\\centering\n",
"\\includegraphics[width=\\textwidth]{Figures/Chapter12/12_17.pdf} \n",
"On the whole, observations belonging to a single cancer type tend to lie near each other\n",
"in this low-dimensional space. It would not have been possible to visualize the data without using a dimension reduction method such as PCA, since based on the full data set there are\n",
" $6{,}830 \\choose 2$ possible scatterplots, none of which would have\n",
"been particularly informative.}\n",
"\\end{figure}\n",
"\n",
"We can obtain a summary of the proportion of variance explained (PVE) of the first few principal components using the `summary()` method for a `prcomp` object (we have truncated the printout):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41bafbcf",
"metadata": {
"name": "chunk47"
},
"outputs": [],
"source": [
"summary(pr.out)"
]
},
{
"cell_type": "markdown",
"id": "f5dffa4a",
"metadata": {},
"source": [
"Using the `plot()` function, we can also plot the variance explained by the first few principal components."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d12593e5",
"metadata": {
"name": "chunk48"
},
"outputs": [],
"source": [
"plot(pr.out)"
]
},
{
"cell_type": "markdown",
"id": "8878408d",
"metadata": {},
"source": [
"Note that the height of each bar in the bar plot is given by squaring the corresponding element of `pr.out$sdev`.\n",
"However, it is more informative to plot the PVE of each principal component (i.e. a scree plot) and the cumulative PVE of each principal component. This can be done with just a little work."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc62cd78",
"metadata": {
"name": "chunk49"
},
"outputs": [],
"source": [
"pve <- 100 * pr.out$sdev^2 / sum(pr.out$sdev^2)\n",
"par(mfrow = c(1, 2))\n",
"plot(pve, type = \"o\", ylab = \"PVE\",\n",
" xlab = \"Principal Component\", col = \"blue\")\n",
"plot(cumsum(pve), type = \"o\", ylab = \"Cumulative PVE\",\n",
" xlab = \"Principal Component\", col = \"brown3\")"
]
},
{
"cell_type": "markdown",
"id": "09c0af89",
"metadata": {},
"source": [
"(Note that the elements of `pve` can also be computed directly from the summary, `summary(pr.out)$importance[2, ]`, and the elements of\n",
"`cumsum(pve)` are given by\n",
"`summary(pr.out)$importance[3, ]`.)\n",
"The resulting plots are shown in Figure 12.18.\n",
"\\begin{figure}[t]\n",
"\\centering\n",
"\\includegraphics[width=\\textwidth]{Figures/Chapter12/12_18.pdf}\n",
"*Right:* the cumulative PVE of the principal components is shown. Together, all principal components explain $100\\,\\%$ of the variance.}\n",
"\\end{figure}\n",
"We see that together,\n",
"the first seven principal components explain around $40\\,\\%$ of the variance in the data. This is not a huge\n",
"amount of the variance. However, looking at the scree plot, we see that while each of the first seven principal components explain a substantial amount of variance, there\n",
"is a marked decrease in the variance explained by further principal components. That is, there is an *elbow* in the plot after approximately the seventh principal component.\n",
"This suggests that there may be little benefit to examining more than seven or so principal components (though even examining seven principal components may be difficult).\n"
]
},
{
"cell_type": "markdown",
"id": "8a0d53e8",
"metadata": {},
"source": [
"### Clustering the Observations of the NCI60 Data\n",
"\n",
"We now proceed to hierarchically cluster the cell lines in the `NCI` data, with the goal of finding out whether or not the observations cluster into distinct types of cancer. To begin, we standardize the variables to have mean\n",
" zero and standard deviation one. As mentioned earlier, this step is optional and should be performed only if we want each gene to be on the same *scale*."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abf3f090",
"metadata": {
"name": "chunk50"
},
"outputs": [],
"source": [
"sd.data <- scale(nci.data)"
]
},
{
"cell_type": "markdown",
"id": "3711073c",
"metadata": {},
"source": [
"We now perform hierarchical clustering of the observations using complete, single, and average linkage. Euclidean distance is used as the dissimilarity measure."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6955e868",
"metadata": {
"name": "chunk51"
},
"outputs": [],
"source": [
"par(mfrow = c(1, 3))\n",
"data.dist <- dist(sd.data)\n",
"plot(hclust(data.dist), xlab = \"\", sub = \"\", ylab = \"\",\n",
" labels = nci.labs, main = \"Complete Linkage\")\n",
"plot(hclust(data.dist, method = \"average\"),\n",
" labels = nci.labs, main = \"Average Linkage\",\n",
" xlab = \"\", sub = \"\", ylab = \"\")\n",
"plot(hclust(data.dist, method = \"single\"),\n",
" labels = nci.labs, main = \"Single Linkage\",\n",
" xlab = \"\", sub = \"\", ylab = \"\")"
]
},
{
"cell_type": "markdown",
"id": "3a190944",
"metadata": {},
"source": [
"\\begin{figure}[p]\n",
"\\centering\n",
"\\includegraphics[width=\\textwidth]{Figures/Chapter12/12_19.pdf}\n",
"as the dissimilarity measure. Complete and average linkage tend to yield\n",
"evenly sized clusters whereas single linkage tends to yield extended clusters to which single leaves are fused one by one.}\n",
"\\end{figure}\n",
"The results are shown in Figure 12.19.\n",
"We see that the choice of linkage certainly does affect the results obtained. Typically, single linkage will tend to yield *trailing* clusters: very large clusters onto which\n",
" individual observations attach one-by-one. On the other hand, complete and average linkage tend to yield more balanced, attractive clusters. For this reason, complete\n",
"and average linkage are generally preferred to single linkage.\n",
"Clearly cell lines within a single cancer type do tend to cluster together, although the clustering is not perfect. We will use complete linkage hierarchical clustering for the analysis that follows."
]
},
{
"cell_type": "markdown",
"id": "587418c3",
"metadata": {},
"source": [
"We can cut the dendrogram at the height that will yield a particular number of clusters, say four:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d20298d",
"metadata": {
"name": "chunk52"
},
"outputs": [],
"source": [
"hc.out <- hclust(dist(sd.data))\n",
"hc.clusters <- cutree(hc.out, 4)\n",
"table(hc.clusters, nci.labs)"
]
},
{
"cell_type": "markdown",
"id": "131f88d8",
"metadata": {},
"source": [
"There are some clear patterns. All the leukemia cell lines fall in cluster $3$, while the breast cancer cell lines are spread out over three different clusters. We can plot the cut on the dendrogram that produces these four clusters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9aeb191c",
"metadata": {
"name": "chunk53"
},
"outputs": [],
"source": [
"par(mfrow = c(1, 1))\n",
"plot(hc.out, labels = nci.labs)\n",
"abline(h = 139, col = \"red\")"
]
},
{
"cell_type": "markdown",
"id": "915d9aeb",
"metadata": {},
"source": [
"The `abline()` function draws a straight line on top of any existing plot in~`R`. The argument `h = 139` plots a horizontal line at height $139$ on the\n",
"dendrogram; this is the height that results in four distinct clusters. It is easy to verify that the resulting clusters are the same as the ones we obtained using `cutree(hc.out, 4)`."
]
},
{
"cell_type": "markdown",
"id": "2891d7d0",
"metadata": {},
"source": [
"Printing the output of `hclust` gives a useful brief summary of the object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b55e6fa2",
"metadata": {
"lines_to_next_cell": 2,
"name": "chunk54"
},
"outputs": [],
"source": [
"hc.out"
]
},
{
"cell_type": "markdown",
"id": "30a5f54b",
"metadata": {},
"source": [
"We claimed earlier in Section 12.4.2 that $K$-means clustering and hierarchical clustering with the dendrogram cut to obtain the same number of clusters can yield very different results.\n",
"How do these `NCI` hierarchical clustering results compare to what we get if we perform $K$-means clustering with $K=4$?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8270dd43",
"metadata": {
"name": "chunk55"
},
"outputs": [],
"source": [
"set.seed(2)\n",
"km.out <- kmeans(sd.data, 4, nstart = 20)\n",
"km.clusters <- km.out$cluster\n",
"table(km.clusters, hc.clusters)"
]
},
{
"cell_type": "markdown",
"id": "c37aeae0",
"metadata": {},
"source": [
"We see that the four clusters obtained using hierarchical clustering and $K$-means clustering are somewhat different. Cluster~$4$ in $K$-means clustering is identical to cluster~$3$\n",
"in hierarchical clustering. However, the other clusters differ: for instance, cluster~$2$ in $K$-means clustering contains a portion of the observations assigned to\n",
"cluster 1 by hierarchical clustering, as well as all of the observations assigned to cluster~$2$ by hierarchical clustering.\n",
"\n",
"Rather than performing hierarchical clustering on the entire data matrix, we can simply perform hierarchical clustering on the first few principal component score vectors,\n",
"as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67423fc7",
"metadata": {
"name": "chunk56"
},
"outputs": [],
"source": [
"hc.out <- hclust(dist(pr.out$x[, 1:5]))\n",
"plot(hc.out, labels = nci.labs,\n",
" main = \"Hier. Clust. on First Five Score Vectors\")\n",
"table(cutree(hc.out, 4), nci.labs)"
]
},
{
"cell_type": "markdown",
"id": "490069ef",
"metadata": {},
"source": [
" Not surprisingly, these results are different from the ones that we\n",
" obtained when we performed hierarchical clustering on the full data\n",
" set. Sometimes performing clustering on the first few principal\n",
" component score vectors can give better results than performing\n",
" clustering on the full data. In this situation, we might view the principal\n",
"component step as one of denoising the data.\n",
"We could also perform $K$-means\n",
" clustering on the first few principal component score vectors rather\n",
" than the full data set.\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"celltoolbar": "Edit Metadata",
"jupytext": {
"formats": "ipynb,Rmd"
},
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "4.0.5"
},
"substitutions": {
"ISLPmod": "`ISLP`",
"Rlang": "`R`",
"mpl": "`matplotlib`",
"numpy": "`numpy`",
"pandas": "`pandas`",
"pylang": "`python`",
"smlib": "`statsmodels`"
}
},
"nbformat": 4,
"nbformat_minor": 5
}