Principal Component Analysis from Scratch (YouTube Video Transcript)

Title: Principal Component Analysis from Scratch

Duration: 01:17:32

(00:00:03) All right, YouTube tells me we should be live by now. You're still seeing the intro screen, which is fine. Sound check: if anyone in chat can hear me, it would be nice to just throw in a message saying "we can hear you". Let me put myself up as well so that people can see me. There's a little bit of a delay, at least that's what I noticed, and there might also be a commercial playing. At least I'm getting a commercial, so I'm just going to skip that on my phone; I'm just using my phone to watch it. I hope people can hear me, otherwise this is going to be a very interesting stream with me talking and no one being able to hear me, but it should be fine.

(00:00:52) All right, so for today: principal component analysis. It's one of the most used techniques in molecular biology, and not just in molecular biology but also in bioinformatics and, well, I think any field where people look at data and do data analysis uses principal component analysis.

(00:01:12) So without further ado... all right, cool, chat says that they can hear me, perfect. Then we'll just jump into it. This is what I want to talk about today, because I think that you need to understand all of these different subsections before you can start understanding principal component analysis. We'll be starting off with autoscaling, which is one of these things that you have to do with your data, and then we'll be talking about the covariance
matrix, eigenvectors and eigenvalues, and variance explained (little bit of a frog in my throat). Then we'll talk about PCA projections, because PCA is nothing more than a reprojection of your original data, and then we'll compare back to prcomp. Hello Arno, thank you for joining the chat. For some reason I can't see the chat messages on my phone, so I have to look at the screen there, but it'll work out. And then something about interpretation. So basically, for principal component analysis, we'll start off by talking a little bit about PCA: what it means, how you can use it, how many principal components you should take, and then we just go through. If anyone has questions during the live stream, just throw your questions in chat and I will get to them if I see them because, like I said, I can't see it on my phone but I can see it over there.

(00:02:51) All right, so first off: what is principal component analysis? Principal component analysis is a dimensionality reduction method. Instead of having to look at 50 different features on 500 different samples, for example, what we can do using PCA is reduce the dimensionality and look at the two main components against each other. This will allow us to find large-scale clustering in data, or at least that's generally what we look at. It is also used to compress and reduce the dimensionality of large data, so this compression is part of the PCA method. It works very basically: compression
and reduction of data. You can use, for example, a couple of principal components to represent the major vectors in your data. See, and now I can't read chat properly because it's so freaking small. "Autoscaling works like normally, or do we have to separately work on the data?" Yeah, no, if you do it from scratch you need to autoscale your data first, because otherwise the covariance matrix goes haywire.

(00:04:06) But like I said, it's a way to reduce dimensionality, so it allows you to look, in a single plot, at the things which have the major effect on your data. Not only that, but if you take the first principal component and you correlate it back to the features that you have, you can also find which features are contributing to the first or to the second principal component.

(00:04:28) It's a very old methodology. It was invented in 1901 by Karl Pearson, the guy that also brought us correlation, covariance and most of our linear modeling toolkit. So it's a very old method, 123 years now, but it's still used every day, although there are newer methods nowadays like t-SNE, which are different types of method. The nice thing I like about PCA is that PCA draws from your data, so it's based on your data, and it doesn't use any of the labels that you assign to the different samples.

(00:05:07) So I have a little example here. Here we see a normal axis system: we have a variable which is plotted on the x-axis and we have another variable plotted on the y-axis,
and then we want to know which are the main components of the data. The first component, of course, you can draw as a line, the red line here. This line captures the most variance in your data, because the variance along the x-axis, so in this direction, is larger than the variance we see like this. The second principal component is always going to be orthogonal, so it's always going to be perpendicular to the first principal component.

(00:05:55) Principal component analysis with two variables is nothing more than rotating the axes that you have. If we would take the x-axis and start rotating it, then the y-axis would rotate as well. At some point, once the x-axis goes through most of the data points, so that the sum of squared distances from the data points to the x-axis is minimized, those are the principal components. So when we do principal component analysis with just two features it's relatively easy, because it's nothing more than just rotating the axes to fit the data best.

(00:06:35) Principal components are linear combinations of the original data. We have the original data, and when we do PCA the first principal component is just a linear combination of the original data. That means that if we take all of the principal components that we have, which is generally the same number as the number of features, then the data that we
describe is exactly identical to the original data that we have. So there is no loss in that sense; there's no information being lost by doing PCA. But we can look at subsets of the data. If we look at the first five principal components in a system where we have 20 features, then of course we are only capturing a certain amount of variance, but if we would look at all 20 components then we would capture the whole original data set.

(00:07:30) PCs are orthogonal. Orthogonal means that they are at a 90° angle relative to each other, like the x-axis and the y-axis in a normal plot. This is important because this will come back, and it also helps us to interpret our data. Generally, when our data consists of 60 or 70 different features, these features are collinear to each other in some way. It's very common to have features, for example the size of an animal and the length of the tail of an animal, that are correlated to each other. The thing that principal component analysis does is it gives us vectors representing the original data, but these vectors are uncorrelated to each other, so the correlation between any two principal components is always going to be zero.

(00:08:26) The variance explained by principal components decreases. The first principal component explains the most variance, the second a little bit less, the third even less, and so on until you hit the last principal component. At that point, if you would sum up
the variance explained across all of your components, you would have 100% of the variance explained, because the original data is represented by a linear combination of all principal components.

(00:08:54) The number of principal components that we can compute is at most equal to the number of features that we have, but it can also be smaller if we have two features which are exactly collinear with each other, so there's no difference between them. For example, if we would measure the body weight in kilograms and also measure the body weight in grams, then these two things would be the same. At that point we will end up with one principal component fewer than the number of features in the original data.

(00:09:31) People always ask me: how many principal components should I keep? The answer to that is that it depends on your application. Depending on what you want to do, you might want to describe your data, so you want to get the main two directions in your data, and then of course you only have to use two principal components. But you can use a lot more. Hey Dion, welcome to the stream. So the number of principal components that you are going to use later on depends a lot on what you want to do. Generally, if you think about QTL mapping or genome-wide association, you
would take one, two or three principal components to describe the overall population structure, and you would generally regress those out of your phenotype. But it really depends on your application; there's nothing really to say how many you should use. If you just want to get your data to be completely orthogonal, then of course you want to keep all of them, because otherwise you start losing information. The more principal components you include, the better you capture the original data, but the harder it again becomes to interpret. So it really depends.

(00:10:56) All right, so for today we're going to do it from scratch. We're going to write our own principal component analysis, and we're not going to use any libraries or any packages; we're just going to use basic R.

(00:11:12) All right, question in chat: "I did not understand what you meant by collinearity of the data. If those two features have different levels of measurement, can PCA be applied to them? Height and weight?" Yeah, so height and weight are collinear, because generally the bigger a person is, the more they will weigh. But this is not a perfect collinearity: if you would take both of the measurements, then some people who are a little bit smaller than others will still weigh a little bit more. So collinearity is when there is a nonzero correlation
between two things. But of course, if we measure body weight in kilograms and we measure body weight in grams, then these two things are identical, so their correlation is going to be exactly one. At that point a single principal component can describe both of these features, so in that sense we will end up with the number of features that we have minus one.

(00:12:19) All right, so from scratch. We're going to use a data set, and this is the data set that everyone uses in R. I like the data set: it's Edgar Anderson's iris data set, and it's a very commonly used data set in R. It has 150 measurements: 50 different flowers from each of three species of irises. You have the setosa, the versicolor and the virginica, and just to make things clear I plotted them here. The setosa is this blue one. The versicolor, like the name says, can be many different colors, but I took a nice purple one which has these yellow and white accents on it. And then you have the virginica, which is a white one with a little bit of a bluish hue on top of it. So very beautiful flowers, very interesting flowers.

(00:13:19) This is the standard data set that almost everyone uses when they do something in R. It's just available: you can type data(iris) and you'll have the data available in R. It has four features: from each of the flowers that they had, they measured the petal length and
width and the sepal length and width. Basically, if you have a flower, then the petals are the leaves that surround the flower, which are the nice colorful ones, and the sepal is where the stem transforms into the flower; these are generally green leaves underneath the flower which more or less support the flower shape. So again, petal and sepal are relatively collinear with each other, and of course the length and the width of the petal and the sepal are also a little bit collinear with each other. But in the end it's just 150 flowers of three different species, and they measured four features. That's the data set, very basically.

(00:14:28) When we load the data in R we can type data(iris), and then we have to get our values from it. I want to transform these into a matrix, so I take the first four columns, which are petal length, petal width, sepal length and sepal width, transform them into a numeric matrix and store this as values. Then there's the fifth column, which contains the name of the species. I'm just going to say as.factor, because it's a factor, so a categorical variable with three levels: setosa, versicolor and virginica.

(00:15:03) All right, so once we have loaded our data we need to autoscale our data. Autoscaling in PCA is done to remove the effect of scale. You can imagine that if petals are between 50 cm and 1 meter, there's a lot of variance in the size of the petals. But if another variable has a small range, so for example
from 3 cm to 4.5 cm, then of course the first one is going to show a lot more variance than the second one. To remove that we need to standardize our data, and in PCA standardization is typically done using autoscaling, which is also called unit variance scaling. Basically it works like this: you center each column around zero, so you take a column of the matrix and you subtract the mean of the column, and then you normalize your variance. Normalizing your variance is done by dividing by the standard deviation. If you have a phenotype which has a large amount of variance, it will have a large standard deviation, so dividing by it means that you end up with values of which almost all lie within a few standard deviations of zero, roughly between minus three and plus three. And since you also centered your data, all the features end up on more or less the same scale.

(00:16:40) All right, so this is the function that you can use to do autoscaling. The autoscale function is relatively simple. It's a function which has a capital X as input. Generally in R, when you use small letters, like a small letter x, it means that you are looking at a vector, and capital letters denote matrices. So it gets a matrix as input, and this matrix is called X. What are we going to do to it? Well, we're going to apply a new function to the columns of X. This is an anonymous function, so we're not going to name it, but what this function will do is take the column that we're
currently looking at, subtract the mean of the column and then divide by the standard deviation of the column, and then we're going to return this whole thing back to the caller. That's how this function works: we apply this function to the columns of the matrix to standardize every column by subtracting the mean and dividing by the standard deviation.

(00:17:49) The next step is going to be computing the covariance matrix. The idea behind covariance is this: if you have two features, and a value of one feature is higher than the average and the corresponding value of the other feature is also above the average, then that means that these two things are more or less similar, because they are both above the mean. Same thing for stuff which is below the mean. So the idea is that you look at a vector of values for one feature and another vector of values for the other feature. After you've autoscaled them, values above zero mean that you are larger than average and values below zero mean that you're smaller than average.

(00:18:40) Covariance works in this way. We have the mean of the column, and the mean of the column in our case is always going to be zero because we autoscaled it, so the mean of x is going to be zero and the mean of y is going to be zero. What are we going to do? We're going to take the values of x, subtract the mean of x, and then multiply this by y minus the mean of y. So we have two different
vectors, for example the sepal length and the sepal width, and now we want to see how they co-vary. If one of them is high and the other one is high, then this contributes positively to the covariance. If one is low, so below the average, it's a negative value, and if the other one is also below the mean it is also a negative value, so you're multiplying two negative values together, which again becomes a positive value. The only way that the covariance cancels out is if one of the measurements of a plant is above the mean and the other measurement is below, because then you get a negative value, which subtracts from the covariance.

(00:19:49) So the covariance function is here; implementing it in R, you can do it like this. Again we have a function, covariance, which takes a matrix as input. The first thing that we do is build a new matrix filled with NAs, because we want to have a matrix to store our covariance values in. We have four features, and the features are in the columns, so that means that we have a matrix which is 4 x 4: the number of columns of X by the number of columns of X. I'm going to give names to the rows and to the columns, and those are just going to be sepal length, petal length, sepal width and so on, so the column names of X. I repeat them twice because they are both the row names and the column names. So in the first line I'm just making an output matrix, a 4 x 4 output matrix, which has the proper names on the
rows and the proper names on the columns. Then I'm going to go through the column names of X for x, and go through the column names again for y, and then I need to compute this term, which I'm going to call vx: from matrix X, take column x and subtract the mean of this column. I'm going to do the same thing for vy: take column y from matrix X and subtract the mean of that column. Then I'm going to multiply vx and vy together, sum all of the values up, and divide by the number of rows of X minus one. The dividing here is standardizing based on the number of values that you have. It is minus one because you lose one degree of freedom. You could argue about why not divide by n, and it doesn't matter too much, but the sample covariance is defined as dividing by n minus one because you lose one degree of freedom through the mean calculations.

(00:22:00) We're going to do this for all of the columns. I'm going to compare column one to column one, column one to column two, column one to three and so on, so I'm just going to iterate through all 16 possible combinations, compute the covariance, and then return the covariance matrix once I'm done. It's a very basic function to compute covariance. There's a function in R that does it as well, but since we're building it from scratch I thought: why not just show you the formula,
explain to you how the formula works. If you have above average and above average, it adds to the covariance; if one of them is above average and the other one is below average, it takes away from the covariance value. And then in the end we divide by the number of comparisons that we did, minus one.

(00:22:54) All right, so this is going to be the hardest part, because we need a third function. The third function is going to be the matrix decomposition function, because PCs are orthogonal. We first have our data, we autoscale our data, and then we need to compute the orthogonal basis, so the eigenvalues and the eigenvectors for our principal component analysis. We need to transform our covariance matrix into an orthogonal matrix, because our covariance matrix is still not uncorrelated.

(00:23:35) We can do that by decomposing a matrix into a product of matrices. We have matrix A, which is our covariance matrix, and now we want to split A into two separate matrices: matrix Q, which is orthogonal, and matrix R, which is an upper triangular matrix. By multiplying Q and R together I get back my covariance matrix, but I can then use this matrix Q to multiply my data with, because matrix Q is orthogonal. That means that if I take my autoscaled data and multiply it by the Q matrix, I end up with a score matrix, and the score matrix is the principal component matrix, because the first principal component is in the first
column, the second principal component is in the second column, and so on. Q is a matrix which holds the eigenvectors; that is how we computed it. So we can take the autoscaled data and project it using the Q matrix, and now we have our principal components. You can think about the eigenvectors in the Q matrix as axis rotations. If we have two orthogonal axes and two features, then the Q matrix is nothing more than a rotation of the two axes, so a rotation of the first axis to more or less change the perspective from which we are looking at the data.

(00:25:25) That is how principal component analysis works: it is nothing more than modifying and tweaking the axes along which we are looking at our data. Orthogonal means that everything is independent of each other. Like I said, we need to decompose our matrix into a product of matrices. We take our covariance matrix, we then compute an orthogonal matrix which, multiplied together with this upper triangular matrix, is identical to A. We then take our Q matrix, which now holds the eigenvectors, so the different rotational vectors for the different axes. If we multiply that with our autoscaled data, we end up with a score matrix in which each individual column is going to be a principal component.

(00:26:22) This looks like a lot of magic, and it kind of is, because linear algebra, or matrix multiplication generally, is relatively magic. But this is something that you just have to,
well, I would say deal with. No, it's not deal with, because you can learn how it works, but it takes a lot of time to read up on how to do decomposition. In the end it is something which is very similar to projection matrices in computer graphics. If you do computer graphics, you also have a triangle, and you multiply the vector, or the quaternion, of the triangle, so each individual point, by a projection matrix to put it somewhere in the 3D world that you are creating. PCA is the same, but it doesn't do it in three dimensions; it does it in X dimensions, where X is the number of features that you are looking at.

(00:27:34) We compute the eigenvectors here through the Gram-Schmidt process. For a 2 x 2 and a 3 x 3 matrix you can always work out the answer directly; for larger problems, however, you generally approximate the eigenvectors numerically. That is the way that it works. So we need to compute Q, the eigenvectors, through something which is called the Gram-Schmidt process: you take an initial estimate and then you refine the estimate the further you go along.

(00:28:11) So how do we compute that? Well, we compute the eigenvectors through QR decomposition using the Gram-Schmidt process. What we do is define a function called eigenvectors, which takes a matrix X as input. In this case X is not our data matrix; it is our covariance matrix. So it takes the 4 by 4
covariance matrix (00:28:38) and then we set a number of iterations, (00:28:41) right, so the more iterations we do, (00:28:44) the better our estimation of the (00:28:48) proper, uh, eigenvector (00:28:52) matrix. But if we do too many iterations (00:28:55) this process will slow down quite (00:28:57) tremendously, so there is actually a (00:28:59) way to make it stop, because at some (00:29:02) point, um, it won't improve further, right. (00:29:06) So at some point we have the best basis, (00:29:08) the most accurate representation of (00:29:11) the, not of the principal component (00:29:13) but of the eigenvectors, and from that (00:29:15) point on, uh, the matrix won't improve (00:29:17) anymore, um, so in theory you could have (00:29:19) an if statement here which (00:29:22) quits, um, after the matrix doesn't (00:29:25) improve anymore. So what are we going to (00:29:27) do? We're going to set pQ as the (00:29:30) identity matrix, so we take a diagonal (00:29:32) matrix, um, which means that it's just a (00:29:35) matrix which is composed of all zeros, but (00:29:38) the diagonal is just ones, right. So this (00:29:41) is going to be, um, our Q matrix (00:29:46) eventually, right. So we start off with a (00:29:48) Q matrix which is a diagonal matrix, um, (00:29:52) and then we are going to compute both (00:29:54) the Q matrix and the R matrix based on (00:29:57) our input value X. So how are we (00:30:00) going to do that? Well, we're going to go (00:30:02) and we are going to iterate, in this case (00:30:04) 100 times, so we're going to make 100 (00:30:07) updates. Every time we're going to (00:30:09) compute the QR decomposition, so we're (00:30:12) going to decompose matrix X into a Q (00:30:16) matrix and an R matrix, right, so a Q (00:30:20) matrix and an R matrix. So we take X, we (00:30:24) compute the QR (00:30:27) decomposition, we're going to take the Q (00:30:29) matrix and call it Q, and we're going to (00:30:32) multiply this Q matrix created from X to
(00:30:36) our identity matrix pQ, right. So pQ is (00:30:40) the matrix that will be updated time and (00:30:43) time again, so we are going to matrix (00:30:45) multiply pQ times Q, so we have Q, and then (00:30:50) what are we going to do? Well, we're now (00:30:51) going to reconstruct what is left of X. (00:30:55) So what are we going to do? We're going (00:30:57) to take the R matrix of the original QR (00:31:00) decomposition and we're going to (00:31:01) multiply that by the Q that we also (00:31:04) decomposed, right, so it's (00:31:06) basically nothing more than multiplying (00:31:09) the two decompositions together in reverse order, R times Q, to (00:31:11) get our new X matrix, and then we're (00:31:14) going to do this again. So the only thing (00:31:17) which is really, or the only thing that (00:31:19) is really changing is the X and the (00:31:21) pQ, um, so the pQ is going to (00:31:25) contain our eigenvectors, while the X (00:31:28) matrix after we're done is going to (00:31:31) contain our diagonal, so this is going to (00:31:35) be the R matrix eventually, right. So once (00:31:38) we are done we take the diagonal of X, (00:31:42) which are our eigenvalues, (00:31:44) and then the eigenvectors are the (00:31:47) columns of pQ, which we can now use (00:31:50) to multiply with our standardized (00:31:55) data. All right, so three functions, I (00:31:58) hope it's clear how we kind of build (00:32:00) these functions, why we need this (00:32:02) function. So the autoscaling is there to (00:32:04) make sure that all of the data is (00:32:06) normalized so that we can compute our (00:32:09) covariance matrix. Our covariance matrix (00:32:12) tells us how features are related to (00:32:15) each other: if one of the (00:32:17) features is high and the other one is (00:32:19) high as well it contributes, if they are (00:32:21) low then it also contributes, but if they (00:32:23) are unequal then it contributes negatively, (00:32:25) so it gives us a measurement
of how (00:32:28) related the different features are, right. (00:32:31) So a high number for covariance means (00:32:33) that two things are very similar when it (00:32:36) comes to, um, looking at them, while then (00:32:40) the QR decomposition will give us more (00:32:43) or less the basis for computing our (00:32:47) principal (00:32:50) components. All right, so let's put it all (00:32:52) together, so I'm going to switch to R and (00:32:55) just chuck in the, um, different, um, (00:32:58) different matrices. So let me open up (00:33:00) Notepad++, um, so here we have the three (00:33:02) functions, uh, we have the autoscale (00:33:05) function, we have the covariance function (00:33:08) and we have the eigenvector (00:33:10) decomposition function, uh, to compute the (00:33:12) eigenvectors and the eigenvalues. All (00:33:16) right, so I'm going to copy this and I'm (00:33:18) going to go and show you guys R, so I'm (00:33:21) just going to copy paste this in, right. (00:33:25) So very basically we have now an (00:33:27) autoscale function, right, so if we (00:33:29) have a vector, so let's just do a vector (00:33:32) 5 6 8 52 32 (00:33:37) 926, um, oh, those are not commas, right, so (00:33:41) I'm just going to make a little (00:33:43) vector, right. So what this is going to (00:33:46) do, I'm going to call this x and I'm (00:33:50) going to say (00:33:55) autoscale as.matrix (00:33:59) x, and then I need to transpose it (00:34:02) because I think if I do as (00:34:04) matrix, no, other way around, so it does (00:34:08) create a matrix the way that I want, (00:34:09) right. So here you see how autoscaling (00:34:11) works, so it tells you that, well, okay, so (00:34:15) this is the value, these are the numbers (00:34:17) that we inputted, so it's going to be the (00:34:19) column of our matrix. So what I'm doing (00:34:21) is, or what it does, it computes the mean (00:34:24) and then it computes the standard (00:34:26) deviation and then it subtracts the mean and divides by the standard deviation, so it (00:34:28) means that the original number five
is (00:34:31) negative 0.7 standard deviations away (00:34:34) from the mean value, while this large (00:34:37) value 95 is 1.93 standard deviations (00:34:42) above the average of the vector, right. So (00:34:45) that is how autoscaling works. So (00:34:48) covariance, so let's load the data set (00:34:50) of iris so that we can actually do that, (00:34:53) so I'm going to load data iris and I'm (00:34:55) going to make sure that we have our (00:34:58) values, so let's look at the first 10, (00:35:00) right, so 1 to 10, so first 10 rows. So (00:35:03) this is what the values look like, so here (00:35:05) you can see the four vectors are the (00:35:08) four columns that we have, so the four (00:35:10) features that were measured on these (00:35:12) iris plants, so we have sepal length, sepal (00:35:14) width, petal length and petal width, um, (00:35:17) and then we can just, um, standardize them, (00:35:20) right, so I could take the sepal (00:35:22) length column and then say auto (00:35:26) scale, right, so again it would transform, (00:35:30) oh, um, (00:35:31) sorry, we need to do at least two columns (00:35:34) in this case, right, so I'm taking the (00:35:36) first two (00:35:38) columns. So the first two columns are 5.1, (00:35:41) 3.5, so if I'm autoscaling it, it tells me (00:35:43) that 5.1 for sepal length is actually (00:35:47) below, um, zero, or 0.9 standard deviations (00:35:51) below the mean, while a sepal width of (00:35:54) three and a half is actually one (00:35:56) standard deviation above the mean, right. (00:35:58) So a value of zero means that it's (00:36:01) exactly equal to the mean and any (00:36:03) positive value means that it's that many (00:36:06) standard deviations above the mean and (00:36:08) negative values that many standard (00:36:09) deviations underneath the mean. Um, I have (00:36:12) my values, right, so let's look at values (00:36:15) 1 to 10 again, so that's how it looks, and (00:36:18) then I also have my labels, um, I think I (00:36:21) called it labels, um, so let's
look at the (00:36:24) first 10 labels, um, so the first 10 are (00:36:27) all setosa plants, so these 10 measurements (00:36:30) all come from (00:36:33) setosas. All right, so we have our data (00:36:35) loaded, um, so then the next step is going (00:36:38) to be to, um, look at the, or do the (00:36:42) computations, right. So I'm going to take (00:36:44) my values, I'm going to autoscale them (00:36:46) and I'm going to call this std, so for (00:36:49) standardized values, um, then I'm going to (00:36:52) compute the covariance matrix, um, and (00:36:54) then I'm going to compute the eigenvectors (00:36:56) and the eigenvalues, um, using (00:36:59) our eigenvectors function. So like I said, we're (00:37:02) doing it from scratch, so it's good to (00:37:04) see all of the functions and how (00:37:06) they exactly tie into each other, um, but (00:37:08) all of these functions have built-in (00:37:10) functions in R, um, so covariance, um, you (00:37:13) can use the cov function, which is the (00:37:16) built-in covariance function, and eigenvectors, (00:37:18) um, they have, uh, the eigen (00:37:21) function, so you can use eigen as well, but (00:37:23) we're going to use eigenvectors for our (00:37:25) own. All right, so let's check back into (00:37:28) R, so I'm going to just do the (00:37:30) computation, right, so I'm going to load (00:37:32) our data, I'm going to subset it and then (00:37:35) I'm going to compute the std, so these (00:37:38) are the standardized values which we (00:37:39) already saw, right, so first 10 again, (00:37:42) minus 0.9 standard deviation, one (00:37:45) standard deviation above the mean, under (00:37:48) the mean, under the mean, right. So if I (00:37:50) look at my covariance, uh, matrix, um, so if (00:37:53) I look at my covariance matrix, the (00:37:56) covariance of a feature with itself (00:37:58) will always be one, right, because sepal (00:38:00) length and sepal length, uh, they are (00:38:04) 100% the same thing, so their covariance (00:38:07) is going to be one. Um, of course it (00:38:09) doesn't
matter which direction I take, so (00:38:12) the upper triangle of the matrix is (00:38:14) going to be equal to the lower triangle, (00:38:16) right, because it doesn't matter if I (00:38:18) compute the sepal length versus sepal width (00:38:21) covariance, it's going to be the same (00:38:23) when I do the reverse, or when I do the (00:38:25) sepal width computation (00:38:28) against the sepal length computation, right. (00:38:31) So here we see that sepal width and sepal (00:38:33) length are actually negatively (00:38:35) covarying with each other, so that means (00:38:38) that the larger the, uh, sepal, the smaller (00:38:42) the, uh, width of, so the larger, or the (00:38:45) larger the length of the sepal, the smaller (00:38:48) the width of the sepal. So that kind of means (00:38:51) that if you have a flower, um, you can (00:38:53) think about, well, you can make a very (00:38:55) long petal or sepal and then you will (00:38:58) have very, very small, uh, width of (00:39:01) them, um, while the sepal length and the (00:39:04) petal length, they are really correlated (00:39:07) with each other, or I'm saying correlated (00:39:09) but they covary with each other quite a (00:39:11) lot, right, so, um, 0.87, that's pretty high, (00:39:15) um, so that means that the longer the sepal (00:39:18) of a flower, the longer the petal of a (00:39:20) flower. And the funny thing is that (00:39:23) you can see that the sepal length is also (00:39:25) positively covarying with the petal (00:39:28) width, um, so it just means that the (00:39:30) longer your sepal is, the larger (00:39:33) the petal that it can support on the (00:39:36) (00:39:37) flower. Um, petal length and petal width (00:39:40) are even higher correlated to each other, (00:39:42) um, so they are very, very similar, so, but (00:39:45) you can see that for sepals it doesn't hold, (00:39:48) so for sepals, the length of the sepal (00:39:50) is not related to the width of the sepal, (00:39:53) but when you look at petals then the (00:39:55) petal width and
the petal length are (00:39:57) highly correlated to each (00:39:59) other. All right, so then when we look at (00:40:01) our eigenvectors, I call those (00:40:04) evf, so if I look at my eigenvectors then (00:40:07) it computes here the eigenvalues, so those (00:40:10) are the diagonal of the R matrix after (00:40:13) decomposition, and this is my eigenvector (00:40:16) matrix. So the first eigenvector (00:40:19) is here, so it assigns a value of 0.52 (00:40:24) to the sepal length, um, it assigns a value (00:40:27) of minus (00:40:28) 0.26 to the sepal width, 0.58 for, um, (00:40:34) petal length, versus sepal length, right. So (00:40:36) the columns here are unitless, um, but (00:40:39) they do kind of relate to how you should (00:40:42) rotate it, right. So the nice thing about (00:40:44) eigenvectors, which I told you, so if I (00:40:47) would take the vectors out and I would (00:40:49) just compute the correlation of them, um, (00:40:52) then the correlation of them, um, should (00:40:56) be zero, which it is not, which is strange, (00:41:00) no, let me see, I'm going to need to (00:41:02) transpose (00:41:04) those, so you can see that they're not (00:41:06) exactly zero, which is, oh no, sorry, sorry, (00:41:09) sorry, it's not the eigenvectors (00:41:13) that are uncorrelated, it is the (00:41:15) principal components that are (00:41:16) uncorrelated, so I'm messing up (00:41:18) myself here, I'm confusing eigenvectors (00:41:21) for principal (00:41:23) components. Anyway, let's go back to the (00:41:26) presentation, (00:41:28) right, so we can autoscale ourselves, (00:41:31) compute the covariance and then compute (00:41:34) the eigenvectors, um, and we can use the (00:41:36) eigenvectors to make the data (00:41:39) uncorrelated, into principal (00:41:43) components. All right, so if we want to (00:41:46) know how much variance can be explained (00:41:49) by each of the principal components that (00:41:51) we're going to calculate, we can actually (00:41:53) use the eigenvalues, um,
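A minimal R sketch of the three from-scratch steps described above (autoscaling, the covariance matrix, and eigenvectors by repeated QR decomposition with Gram-Schmidt). The code shown on screen during the stream is not part of this transcript, so the function names `autoscale`, `covariance`, `gram_schmidt_qr` and `eigenvectors`, the variable `pQ`, and the function bodies are assumptions made for illustration. Only the overall recipe (start from the identity matrix, iterate 100 times, read the eigenvalues off the diagonal) comes from the talk.

```r
# Autoscale: centre each column on its mean and divide by its standard deviation.
autoscale <- function(X) {
  X <- as.matrix(X)
  apply(X, 2, function(col) (col - mean(col)) / sd(col))
}

# Sample covariance matrix of the columns of X (n - 1 denominator).
covariance <- function(X) {
  centered <- sweep(X, 2, colMeans(X))
  (t(centered) %*% centered) / (nrow(X) - 1)
}

# QR decomposition by (modified) Gram-Schmidt: A = Q %*% R,
# with orthonormal columns in Q and an upper triangular R.
gram_schmidt_qr <- function(A) {
  n <- ncol(A)
  Q <- matrix(0, nrow(A), n)
  R <- matrix(0, n, n)
  for (j in seq_len(n)) {
    v <- A[, j]
    for (i in seq_len(j - 1)) {
      R[i, j] <- sum(Q[, i] * v)   # projection onto an earlier basis vector
      v <- v - R[i, j] * Q[, i]    # remove that component
    }
    R[j, j] <- sqrt(sum(v^2))
    Q[, j] <- v / R[j, j]
  }
  list(Q = Q, R = R)
}

# Repeated QR steps: decompose X, accumulate Q into pQ, continue with R %*% Q.
eigenvectors <- function(X, n_iter = 100) {
  pQ <- diag(nrow(X))              # starts as the identity matrix
  for (k in seq_len(n_iter)) {
    d  <- gram_schmidt_qr(X)
    pQ <- pQ %*% d$Q
    X  <- d$R %*% d$Q
  }
  list(values = diag(X), vectors = pQ)
}

values <- as.matrix(iris[, 1:4])
std <- autoscale(values)
cv  <- covariance(std)
ev  <- eigenvectors(cv)
round(ev$values, 2)        # eigenvalues, largest first
round(ev$vectors, 2)       # eigenvectors as columns
```

The result can be compared against the built-in `eigen(cov(scale(values)))`. The eigenvectors should agree up to the sign of each column. A fixed iteration count keeps the sketch short; a convergence check on the off-diagonal entries of `X` would be one way to stop early, as suggested in the talk.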
so the variance (00:41:56) explained by the individual principal (00:41:58) components can be computed by the nth (00:42:01) eigenvalue divided by the sum of all of (00:42:04) the eigenvalues. So we can compute the (00:42:07) variance explained like this, so we can (00:42:09) say I take my eigenvalues and I divide (00:42:12) those by the sum of the eigenvalues, (00:42:15) right. So then I get a vector, um, where (00:42:18) each of the values is divided by the sum, (00:42:21) um, I multiply it with 100 and I round (00:42:24) to one digit behind the comma. If I (00:42:26) then want to see the cumulative sum, so (00:42:28) the cumulative sum should always be (00:42:31) one after the number of components, right. (00:42:34) So in this case we have four features, we (00:42:37) have four components, four vectors, um, so (00:42:41) if I sum them up, um, they should be 100. (00:42:45) Um, and then I can visualize it, it leads (00:42:47) to a plot which looks like this, but (00:42:49) let's very quickly do that in R, right. So (00:42:52) if I compute the variance explained, so (00:42:56) I'm going to just say var explained, (00:42:58) right, so I round, so I take the eigenvalues, (00:43:00) so that's 2.9 divided by the sum (00:43:04) of this, so that's, no, it's 2.9, 3.8, about 4, (00:43:10) so it's like, uh, 2.9 divided by about 4, (00:43:15) uh, you see that the first principal (00:43:16) component is going to explain 73% of the (00:43:20) variance in the data, um, the second principal component (00:43:23) is going to explain (00:43:25) 22.9%, uh, the next one 3.7, and the last (00:43:28) one half a percent of variance explained. (00:43:32) So using the cumsum function we can do (00:43:34) the cumulative sum, so if we do the cumulative sum (00:43:38) we see that it doesn't add up to (00:43:41) 100, it adds up to 100.1, but that is because (00:43:44) we rounded them, right. If we would not (00:43:46) round the variance explained, so if we (00:43:48) would just say don't round them at all, (00:43:50) just multiply them by 100, right, so
(00:43:53) variance explained, do the cumulative sum, um, (00:43:56) then now you see that it will add up (00:43:58) exactly to 100. So by rounding it we kind (00:44:01) of introduce a little bit of a skew, (00:44:03) um, but that's how it works. All right, so (00:44:06) we can basically just plot the (00:44:08) cumulative variance explained and then (00:44:10) we get our graph, which looks like this, (00:44:12) so we can see that the first component (00:44:14) explains around (00:44:15) 73%, uh, the next one, so the first two (00:44:18) components combined explain (00:44:21) 95.8% of the variance seen in the data, (00:44:25) and then with the third component combined we (00:44:27) are already up to (00:44:29) 99.5% of the variance explained, right. So (00:44:32) the variance explained in this case (00:44:34) tells us how many components we should (00:44:37) or can take, uh, to represent our data, so (00:44:41) in this case you would say, well, with two (00:44:42) principal components I'm catching 95% of (00:44:46) the variance that is in my data set, um, (00:44:49) so the two principal components should (00:44:51) be enough to accurately represent the (00:44:54) data that has been measured. (00:44:58) All right, so variance (00:45:01) explained. So the next thing what we want (00:45:04) to do when we want to compute our own (00:45:05) principal components is to do the PCA (00:45:08) projection, so it reconstructs the iris (00:45:11) data as a linear combination of the (00:45:13) original data, right, so that is what PCA (00:45:15) does. So how can we do that? Well, we can (00:45:18) take our standardized data and then multiply (00:45:21) it by the projection matrix, um, so here (00:45:25) we can compute the projection matrix, (00:45:27) which is the principal component matrix, (00:45:29) um, so we can take the first, we can take (00:45:32) all of the eigenvectors, we call that W, (00:45:36) and then what we do is we multiply our (00:45:39) standardized data together with W, right, (00:45:43) so if we do that, um,
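The variance-explained calculation and the projection step described above can be sketched in R as follows. This sketch uses the built-in `scale()`, `cov()` and `eigen()` instead of the from-scratch functions so that it runs on its own. The names `std`, `W` and `P` follow the talk, while `e` and `var_explained` are names assumed here.

```r
values <- as.matrix(iris[, 1:4])
std <- scale(values)             # autoscaled (standardized) data
e   <- eigen(cov(std))           # eigenvalues (largest first) and eigenvectors

# Variance explained: each eigenvalue divided by the sum of all eigenvalues.
var_explained <- e$values / sum(e$values) * 100
round(var_explained, 1)          # per component, one decimal
cumsum(round(var_explained, 1))  # rounding first can leave the total slightly off 100
cumsum(var_explained)            # unrounded: the last entry is 100

# Projection: multiply the standardized data by the eigenvector matrix W.
W <- e$vectors
P <- std %*% W
colnames(P) <- paste0("PC", seq_len(ncol(P)))

# The principal components are uncorrelated: off-diagonal entries are ~0.
round(cor(P), 10)
```

Taking only `W[, 1:2]` in the multiplication would give the two-column score matrix that is used for the partial reconstruction.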
we can put the (00:45:45) column names on, so now P will be (00:45:48) principal component one in the first (00:45:49) column, principal component two in the (00:45:51) second (00:45:52) column. All right, so let's do that in R (00:45:56) so that you guys can see it as well. So P, (00:46:01) right, so this is now the principal (00:46:04) component matrix for our data. So if we (00:46:09) now would calculate the correlation of P, (00:46:11) um, then now all of them should be more (00:46:14) or less zero, which you can see that that (00:46:16) is the case, it is 1.4 * 10^-16, right. So (00:46:20) of course every principal component is (00:46:22) 100% correlated to itself, but it is (00:46:25) uncorrelated to all of the other (00:46:27) principal components, right. So that now (00:46:29) means that each of these vectors of data (00:46:32) is independent of each other, is catching (00:46:35) one unique axis of data, right. So if you (00:46:38) would think about it in a 2D plane, uh, (00:46:41) what we did is, if we have the sepal width (00:46:44) and the sepal length, just two vectors, (00:46:46) right, what we just did is we just (00:46:48) rotated the axes so that the (00:46:51) first axis catches the most variance and (00:46:53) the second axis catches the other (00:46:56) remaining variance in the (00:46:59) data. All right, so we have our projection (00:47:02) matrix, so we have computed our own (00:47:04) principal components, so that is more or (00:47:06) less what we set out to do. Um, so next (00:47:10) step, um, is also we can do a partial (00:47:13) reconstruction, right. So if we are (00:47:15) thinking about compression, um, then what (00:47:18) we could do, instead of storing the (00:47:20) original matrix, we can actually store (00:47:24) part of the principal component matrix. (00:47:26) So if we want to get a scaled-down kind (00:47:28) of image or a scaled-down matrix of the (00:47:32) iris data set, um, this would mean that we (00:47:35) would be able to store just the first (00:47:37) two
principal components, because with (00:47:39) just the first two principal components (00:47:41) we can reconstruct 95% of the variance (00:47:44) in the data, um, so that means that we (00:47:47) could compress this data set by around (00:47:50) half. So that means that instead of (00:47:52) storing the four features we could store (00:47:55) the first two principal components, and (00:47:57) the first two principal components would (00:47:59) still be good enough to reconstruct most (00:48:01) of the variance, right. So if we would do (00:48:04) that, um, let me show you guys how we can (00:48:06) do that, so I'm just going to go to (00:48:08) Notepad, right, so I'm going to take the (00:48:11) first (00:48:12) two columns of the W matrix, I'm going to (00:48:17) multiply them together and then I'm (00:48:19) going to do P, right. So now we see that P (00:48:23) is a two-column matrix, um, and it's still (00:48:27) there, it's still an uncorrelated (00:48:30) matrix, so if we calculate the (00:48:32) correlation of P we can still see that (00:48:35) these are two things which are (00:48:37) independent of each other, and if we (00:48:39) would plot matrix P then it shows us (00:48:42) that this is kind of the structure in (00:48:45) the data, right, so it will tell us the (00:48:47) score on PC1 and the score on PC2. Um, (00:48:51) but let's go back and let's take all of (00:48:54) the principal, or let's take all of the (00:48:56) eigenvectors to P and then we would plot (00:48:59) P, um, and then we see that there's almost (00:49:02) no difference, but that is because the (00:49:04) first two principal components already (00:49:06) caught 95% of the variance in the data. Um, so I realize (00:49:12) that you didn't see R, all right, so let's (00:49:15) do it again, right, so let's go up, (00:49:18) so what we want to do is we can do, so (00:49:23) this is when I look at the first two (00:49:27) principal components of P after I've (00:49:30) used all of the eigenvectors, right, so it (00:49:34) looks like
this. So if I only take the (00:49:38) first two eigenvectors (00:49:41) (00:49:44) then you can see it looks almost (00:49:47) identical, almost none of the, uh, plots, or (00:49:51) almost none of the points have moved, but (00:49:53) that's because the first two principal (00:49:55) components contain more or less the same (00:49:58) amount of information, or 95% of the (00:50:01) information relative to the whole (00:50:03) principal component matrix, right. So (00:50:06) instead of storing a matrix which has (00:50:08) 150 entries, four columns, I could just (00:50:11) store two columns, 150 entries, so (00:50:16) virtually reducing the data set, um, by (00:50:18) around (00:50:20) 50%. Anyway, I'm just going to switch back (00:50:23) to Notepad, right, and I'm going to make (00:50:25) sure that I use all of the principal (00:50:28) components, uh, all of the eigenvectors to (00:50:31) compute the principal (00:50:32) components, so let's do that and then do (00:50:36) a little plot to make sure that it's (00:50:38) there, so plot, and you can see that they (00:50:41) barely move, the points, um, so barely move, (00:50:46) again I realize you guys are not looking (00:50:48) at R, so they barely move, right, almost (00:50:53) identical. All right, so let's start (00:50:56) visualizing it, right, because this is, (00:50:59) uh, one of the visualizations that we can (00:51:01) use, um, principal component one versus (00:51:03) principal component two, that's generally (00:51:05) the two that catch the most variance. Um, (00:51:08) so what if we compare our from-scratch (00:51:12) PCA to the built-in prcomp function, (00:51:16) right, it should be very, very similar, (00:51:18) should be identical. So the way that I'm (00:51:20) going to do this is by saying I'm going (00:51:23) to make a plot which has two, uh, windows, (00:51:26) so there's two plots inside a single window, (00:51:28) so I can do that by setting the (00:51:30) parameter mfrow to say I want to have (00:51:33) one row, two columns, right, so it's a 1
(00:51:36) times 2 plot, so one row, two columns. So (00:51:39) I do the first plot, so I'm plotting the (00:51:41) principal components from scratch in the (00:51:44) first, so on the left side, and then on (00:51:46) the right side I'm going to plot the (00:51:48) built-in prcomp function, right. So our (00:51:51) principal component matrix is called P, (00:51:54) so from P I'm going to take PC1 and (00:51:57) PC2, I'm going to color by the labels, I'm (00:51:59) going to give it a main so we know which (00:52:01) one is which, um, and I'm going to do pch (00:52:04) is 19 to have filled circles, um, and then (00:52:07) I'm going to put a legend there, um, which (00:52:09) is going to be on the bottom right, um, (00:52:12) the legend is going to be the labels, so (00:52:14) I'm going to take the levels of the (00:52:17) labels, so that's going to be setosa, (00:52:19) versicolor and virginica, and the colors are (00:52:22) going to be the unique labels, because (00:52:25) here I use the labels (00:52:27) as the colors, um, same pch. So in R, (00:52:31) doing principal component analysis, it's (00:52:33) just a single call, right, so you can do (00:52:36) prcomp on your values, scale is equal to (00:52:39) TRUE, right. So here there is no (00:52:43) autoscaling that you need to do, no, you (00:52:45) can just give the prcomp function the (00:52:48) values and then you can say, well, I want (00:52:50) to autoscale them, and then from this (00:52:54) extract matrix x, so matrix small x in (00:52:58) the return of the prcomp function, that (00:53:01) is the principal component matrix, um, so (00:53:04) this is called pc, so, um, I can then (00:53:07) plot the PCA again, I'm going to take the (00:53:09) first two columns, I'm going to take the (00:53:11) labels, um, make sure that I know (00:53:14) which one is which. All right, so let's (00:53:17) switch to (00:53:18) R, make sure that we can see the R window, so (00:53:21) let's do the little plot so that we can (00:53:23) compare both of them
together, um, so I'm (00:53:26) going to make sure that my window is a (00:53:27) little bit broader, like this, so we can (00:53:30) fit (00:53:31) both. Right, so, and this is what we see, we (00:53:34) see our from-scratch principal component (00:53:36) analysis, um, and then the function (00:53:39) prcomp. All right, so first question: could (00:53:42) you elaborate more on the value of PC1 (00:53:44) and PC2 and how they explain the data, uh, (00:53:48) repartition? Yeah, so PC1 and PC2 are, so (00:53:53) the data is, so the data in the PCA (00:53:58) matrix is exactly the same as the data (00:54:00) in the original matrix, right, if I use (00:54:03) all principal components, it is just that (00:54:06) it's a different projection, right. So (00:54:09) instead of having one axis which is (00:54:12) petal length, another axis which is petal (00:54:14) width, and then a third axis which is sepal (00:54:17) length and another fourth axis which is (00:54:19) sepal width, I now have four (00:54:24) axes which are not related to the (00:54:27) original (00:54:28) measurements, but they are capturing the (00:54:31) exact same pattern, right. If we could (00:54:33) make a four-dimensional cube, then in the (00:54:37) four-dimensional cube the data has not (00:54:40) changed, the only thing which has changed (00:54:42) is the axis system that we are using to (00:54:45) project the data. So instead of having a (00:54:47) data axis which we can understand as (00:54:50) being sepal length, we now have a first x- (00:54:54) axis, and the property of this axis is it (00:54:58) explains the most variance in the data. (00:55:01) The second axis, this is the y-axis, (00:55:05) used to be sepal width but now it is the (00:55:07) axis which explains the most variance (00:55:11) except for the other axis, right, so it (00:55:13) explains the second most amount of (00:55:16) variance. So the data doesn't change, (00:55:18) right, the data is still exactly the same (00:55:21) as it was, it is just that we move the (00:55:25) axis system,
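The side-by-side comparison and the point made in the answer above (with all components the data is unchanged, only the axis system moves) can be checked with a short R sketch. It uses the built-in `scale()`, `cov()`, `eigen()` and `prcomp()`. The names `values`, `labels`, `std`, `W`, `P` and `pc` follow the talk where it names them, while `back_all`, `back_two` and `kept` are assumed names, and the plot titles are made up for illustration.

```r
values <- as.matrix(iris[, 1:4])
labels <- iris$Species

std <- scale(values)                     # autoscaled data
W   <- eigen(cov(std))$vectors           # eigenvectors as columns
P   <- std %*% W                         # from-scratch scores
colnames(P) <- paste0("PC", 1:4)
pc  <- prcomp(values, scale. = TRUE)$x   # built-in scores

# Side-by-side plot: one row, two columns, coloured by species.
op <- par(mfrow = c(1, 2))
plot(P[, 1:2], col = labels, pch = 19, main = "PCA from scratch")
legend("bottomright", legend = levels(labels),
       col = seq_along(levels(labels)), pch = 19)
plot(pc[, 1:2], col = labels, pch = 19, main = "prcomp")
par(op)

# The two score matrices agree up to the sign of each column.
max(abs(abs(P) - abs(pc)))

# Pairwise distances between plants are unchanged by the rotation.
max(abs(as.vector(dist(std)) - as.vector(dist(P))))

# Rotating back with t(W) recovers the autoscaled data.
back_all <- P %*% t(W)
max(abs(back_all - std))

# Keeping only the first two components gives an approximation that
# retains the share of variance carried by PC1 and PC2.
back_two <- P[, 1:2] %*% t(W[, 1:2])
kept <- sum(back_two^2) / sum(std^2)
round(100 * kept, 1)
```

The three `max(abs(...))` values should be at the level of floating-point noise. If one of the two panels appears mirrored, negating the corresponding column (for example `-P[, 2]`) gives the same picture, because the sign of an eigenvector is arbitrary.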
so instead of having four axes which we (00:55:29) have measured, petal length and width, sepal (00:55:32) length and sepal width, we now have four (00:55:34) different axes through our data, but the (00:55:37) data is still exactly the (00:55:39) same. Also, what does this plot exactly (00:55:42) mean, PC1 versus PC2, how do we analyze PC (00:55:45) in the context of the iris data? Yeah, so (00:55:47) here in the context of the iris data we (00:55:50) might want to know, are these two (00:55:53) different plants or are these three (00:55:55) different species, right. So if we look at (00:55:58) the first principal component we see (00:56:00) that the data kind of splits out into (00:56:03) one very clearly distinct group, right, so (00:56:05) we can see that the setosa is, on the (00:56:09) first principal component axis, I (00:56:11) can, very basically I could just say, well, (00:56:15) if the value, if the score of the plant (00:56:18) or of the measurements is below minus (00:56:21) one it is going to be (00:56:23) a setosa, right. But for the other two species (00:56:26) it's not that clear, we can see that the (00:56:29) versicolor and the virginica, they overlap each (00:56:32) other in the middle here, right. So that (00:56:34) means that these two species on the (00:56:38) first principal component axis, they are (00:56:40) not separated, right, so it means that (00:56:43) there's no clear phenotypic difference (00:56:46) between these two plants when we just (00:56:48) measure these four variables. So from (00:56:51) these four variables we can uniquely (00:56:53) identify the setosa, right, so the setosa is (00:56:57) clearly different from all of the other (00:57:00) plants, but the versicolor and the virginica, (00:57:04) they are not uniquely separable on the (00:57:06) first principal component. The second (00:57:09) principal component in this case doesn't (00:57:11) add anything, it doesn't allow us to (00:57:14) distinguish, um, better, right, we could say, (00:57:17) well, um, if you are high on the second
principal component axis, right, above one, (00:57:25) then you are either a (00:57:28) virginica or you are a setosa, but the main (00:57:32) difference between these plants, like 75% (00:57:35) of the difference in these measurements, (00:57:37) is just captured in the first principal (00:57:40) component, and it allows us to (00:57:42) distinguish setosas from the other two (00:57:46) species very (00:57:47) clearly. We can see that the virginica on (00:57:50) the first axis has slightly higher values (00:57:53) than the versicolor, but they are not (00:57:55) separating, so they are not separating (00:57:58) out of each other. So it still means that (00:58:01) if you would want to make a (00:58:03) determination saying are these really (00:58:05) three different species or are they just (00:58:08) two different species, the answer here (00:58:10) would be that, well, there's some (00:58:12) evidence to say that they are two (00:58:14) species and that the versicolor and the (00:58:17) virginica are more or less similar to (00:58:20) each other, still based on just having (00:58:23) these four measurements on the sepals (00:58:25) and the petals, right, because we only (00:58:28) have four measurements to start off with, (00:58:30) we don't have the color or other things (00:58:32) that we look at, right. So the idea is (00:58:34) that the score on the first principal (00:58:36) component tells us, um, if we can separate (00:58:40) out the groups based on the most, or the (00:58:43) axis with the most variance. So if we want (00:58:46) to know, right, so if we look at, um, the, so (00:58:50) very basically, if we have our matrix P, (00:58:53) right, and we look at PC1, right, so let's (00:58:57) look at (00:58:58) PC1, right, so these are our values. So if (00:59:01) I correlate this to the original values (00:59:05) that we had, right, you can see that petal (00:59:09) length and petal width are highly loaded (00:59:13) on the first principal component, as well (00:59:16) as sepal length, right, so you
can see that (00:59:18) the correlation is almost 0.99, so that (00:59:22) means that the first principal component (00:59:24) axis is a combination, (00:59:27) or is mostly looking at the petal (00:59:30) length, right, so it's looking at the (00:59:32) petal length and it's including a little (00:59:35) bit, or 0.9, the sepal length, right, so (00:59:38) it's a combination axis. So instead of (00:59:40) having one axis which is sepal length, this (00:59:43) axis is catching variance mostly from (00:59:47) petal length, a lot from petal width, but (00:59:50) these two things are relatively (00:59:52) correlated to each other, and also a (00:59:54) little bit of the sepal length. If we look (00:59:57) at the second principal component, right, (00:59:59) so we can just say the correlation of P, (01:00:01) PC2, to all of our values, then we see (01:00:04) that the second principal component (01:00:06) actually catches the variance which is (01:00:08) in the sepal (01:00:10) width, right. So if we want to annotate (01:00:12) our axes, then the principal component (01:00:14) one axis is the sepal, petal length axis (01:00:20) together with the petal width axis, the (01:00:22) second principal component is the axis (01:00:24) which catches the variation of the sepal (01:00:27) width, um, and a little bit of the sepal (01:00:30) length, right. So this is how we can (01:00:33) take PCs and kind of deconstruct our (01:00:36) data into, um, unique vectors, but these (01:00:40) vectors are uncorrelated to each other, (01:00:42) and that is the advantage, because they (01:00:44) are uncorrelated to each other, um, it (01:00:46) means that we can look at one axis and (01:00:48) then look at another axis and put them (01:00:51) perpendicular to each other, right, (01:00:53) because if we would do that for sepal (01:00:55) length and sepal width we would not clearly (01:00:58) see this (01:00:59) difference. All right, so, um, let's go (01:01:03) quickly back, right, so I also put the (01:01:05) plot in the presentation, so I
put it here (01:01:08) right because if we look at our from (01:01:10) scratch principal component (01:01:13) analysis it looks slightly different (01:01:15) than when we did the (01:01:17) prcomp right so we can see that there are (01:01:21) slight differences so what are those (01:01:24) differences well there are no (01:01:28) differences because PCs are linear (01:01:31) combinations right so a principal (01:01:34) component is just (01:01:37) inverted in our case but that doesn't (01:01:40) matter because it doesn't matter if you (01:01:42) if you have the x-axis or if you (01:01:44) have the y-axis going from minus one to (01:01:47) one or if you have it go from minus one to (01:01:50) one the other way around right so in the (01:01:52) end it's the same thing right minus one (01:01:56) minus one is the same as one one it's (01:02:00) just in the opposite direction right so (01:02:02) since they are linear combinations they (01:02:04) can be orthogonal and of course being (01:02:07) orthogonal in the perpendicular (01:02:09) direction is the same as being (01:02:11) orthogonal in the other direction right (01:02:13) so how do we fix this well we just (01:02:15) flip it around right so we just (01:02:18) flip it around um and then you can see (01:02:21) that they are exactly the same PCA plot (01:02:23) right so that's the fix um of course we (01:02:26) can do this in R as well right so in R (01:02:29) when we have our plot here um so the (01:02:32) only thing which I have to do when I do (01:02:33) my plot um so let me do my plot again (01:02:37) let me get my code for the plot let's go (01:02:39) back to notepad so and the only thing (01:02:42) that we can do is if we want to do this (01:02:44) then I know now that on the x-axis I (01:02:48) want to have PC1 right not just PC1 and (01:02:51) PC2 and I can say on the y-axis I want (01:02:55) to have (01:02:57) PC2 but now take the negative value of (01:03:00) PC2 right so I'm just 
going to say (01:03:03) negate the PC2 value and on the x-axis (01:03:07) put the PC1 value so if I will do it (01:03:10) like this right so now um when we look (01:03:14) at it um we can see that they are (01:03:16) exactly identical um so eh the (01:03:18) value here is exactly like that oh crap (01:03:21) you guys can see that (01:03:23) again I'm not paying attention to which (01:03:25) window I (01:03:27) right so if I put it in and I say on the (01:03:29) x plot PC1 on the y plot the negative of (01:03:33) PC2 um then now they look exactly (01:03:35) identical and of course now I need to (01:03:38) move the legend up to here as well so (01:03:40) that it doesn't overlap the points (01:03:42) here uh but you can see that they are (01:03:44) exactly (01:03:47) identical all right so let me go back to (01:03:51) the presentation so we can just fix it (01:03:52) by flipping it around no issue (01:03:55) whatsoever um so we can flip the whole (01:03:57) figure or we can just flip the PC2 axis (01:04:01) right so that happens often in principal (01:04:04) components because it's the same thing (01:04:06) they are linear combinations so it (01:04:08) doesn't matter if you project it in the (01:04:10) positive direction or if you project it (01:04:12) in the negative direction uh like it's (01:04:14) doing (01:04:17) here all right so that's actually (01:04:19) everything that I wanted to say for (01:04:21) today um so principal component analysis (01:04:24) it depends on you doing autoscaling of (01:04:26) your data you then compute a covariance (01:04:29) matrix you then compute the eigenvectors (01:04:32) and the eigenvalues using uh the (01:04:35) Gram-Schmidt (01:04:37) process you then compute your variance (01:04:39) explained based on the eigenvalues and (01:04:43) then you can do a PCA projection which (01:04:45) means taking your standardized data (01:04:47) matrix multiplying it with the (01:04:50) eigenvectors and then you have a (01:04:53) 
re-representation of your data in (01:04:56) principal component space so in (01:04:59) orthogonal principal (01:05:01) components um and then we compared it (01:05:03) back to prcomp um and then we talked a (01:05:06) little bit about the interpretation um (01:05:08) so principal component analysis it's (01:05:10) used a lot to find groups in data to see (01:05:13) how well things separate from each other (01:05:16) um but also to see how good reproducibility (01:05:19) is in experiments right because the (01:05:22) closer things cluster on the first two (01:05:24) principal components the (01:05:26) less variance there is um so (01:05:30) if we would look at this and we can see (01:05:33) that these um the setosas they cluster (01:05:37) really really well on the first (01:05:38) principal component so there's no real (01:05:41) variance in the setosas when it comes to (01:05:45) the sepal length um and the petal length (01:05:49) and we can see that because of the fact (01:05:52) that these two are highly correlated to (01:05:54) the principal component one so that (01:05:56) means that when we look at setosas they (01:05:59) don't have that many differences in (01:06:02) sepal or in petal lengths um if we look at (01:06:05) the versicolors and the virginicas right (01:06:08) so these have much much more spread so (01:06:11) eh if we would look at the (01:06:13) data and we would for example look at (01:06:16) the petal length right then I would (01:06:18) assume that if I make a histogram of uh (01:06:22) the (01:06:23) values right so I'm just going to take (01:06:25) the values (01:06:26) I'm going to say (01:06:28) which (01:06:30) labels is oh (01:06:34) setosa and then I'm going to look at (01:06:41) the Petal.Length
right so this is (01:06:45) a histogram of the setosas the petal (01:06:48) length um if I would do the same thing (01:06:51) for the versicolors versicolor (01:06:56) then you can see that eh these vary (01:06:59) from like 1 cm to 1.8 cm but these vary (01:07:04) from 3 to 5 and a half cm right so here (01:07:07) there's only a 0.8 variance and here (01:07:11) there is a 2 and a half variance right (01:07:13) so that is caught in the principal (01:07:16) component plot as well um because that's (01:07:20) what we see here so there's less (01:07:22) variance in (01:07:24) the petal length (01:07:27) of setosas relative to the other two (01:07:30) species right so we can use it to (01:07:33) interpret our data um and we can kind of (01:07:37) reason about what we see um same thing (01:07:40) would be if we look at the second (01:07:41) principal component axis which is the (01:07:43) sepal width axis right because it's highly (01:07:46) correlated to the sepal width uh we would (01:07:49) assume that if we would look at the sepal (01:07:51) width from uh setosa that there's more (01:07:55) spread in the setosa sepal width than there (01:07:58) is in the versicolor right so let's check (01:08:01) that right so if we do the same one um (01:08:05) again um but now we do the setosa petal (01:08:12) width oh uh petal width probably with a (01:08:15) capital and we do the versicolor petal (01:08:21) width right we would (01:08:24) see that this one varies (01:08:27) 0.6 this one varies 0.8 so not entirely (01:08:33) but that's also because the correlation (01:08:35) is not it's not 100% but eh we can see (01:08:38) that there's at least kind of (01:08:40) the same amount of variance in the (01:08:43) petal width between the setosas and (01:08:46) the versicolors which is kind of what the (01:08:49) second PC tells us a little bit as (01:08:53) well because eh I would have expected it (01:08:56) to be a little bit less but there's a (01:08:59) massive outlier 
here in the setosa um so (01:09:02) and the setosas range from like minus 3 (01:09:05) to one so it's four and these ones go (01:09:08) from minus one to two something so I (01:09:11) would have expected the um sepal width uh (01:09:15) to be a little bit less (01:09:18) variable all right (01:09:21) so are there any questions so far (01:09:28) I understand that it's difficult like um (01:09:32) if I would have to explain (01:09:35) eigenvectors and eigenvalues all the way (01:09:38) from zero it would involve doing a whole (01:09:43) linear algebra lecture right so all of (01:09:46) these things um you can easily find very (01:09:49) good starting points on things like (01:09:51) Wikipedia right so if you want to learn (01:09:53) more about eigenvectors and (01:09:55) eigenvalues definitely take a look at Wikipedia (01:09:59) um they have very good citations to (01:10:01) original literature that you can read um (01:10:04) there's very much or there's probably (01:10:07) like other YouTubers who do like a whole (01:10:09) linear algebra lecture from scratch (01:10:12) right so because you want to start off (01:10:14) with what is a vector multiple vectors (01:10:17) together they form a matrix you can (01:10:21) transform vectors um so how do you do (01:10:24) that so in this case because I just (01:10:27) wanted to give you a high-level (01:10:28) overview of PCA so what is needed (01:10:31) autoscaling covariance eigenvectors and (01:10:34) eigenvalues normally people just use the (01:10:37) prcomp function in R right so they just (01:10:40) use prcomp they don't think about what (01:10:42) is happening a lot of people actually (01:10:44) forget to set scale to true right (01:10:47) because you do need to do (01:10:49) autoscaling always um so the default is (01:10:54) false in prcomp um and this is not due (01:10:58) to scaling not being needed but this is (01:11:01) because of backward (01:11:03) compatibility um so these are the steps 
(01:11:05) to do your own PCA normally people just (01:11:08) use the prcomp function and that's it (01:11:11) and then they plot the first two (01:11:12) principal components they look at it and (01:11:14) then they try to interpret what's going (01:11:16) on um but you can do much more with it (01:11:20) um so you can eh you can modify (01:11:22) so instead of covariance you could look (01:11:24) at the uh the eigenvectors and (01:11:27) eigenvalues based on the correlation as well (01:11:29) um so it allows you more flexibility to (01:11:32) know how principal component analysis is (01:11:35) um is (01:11:37) working all right so if there's no (01:11:39) further questions for today um then (01:11:42) that's what I wanted to talk (01:11:44) about um 1 hour 11 minutes not too bad (01:11:47) not too bad um I've gotten some (01:11:49) complaints that my videos are too long (01:11:52) um which I can understand like I tend to (01:11:54) ramble on about things that don't really (01:11:56) matter too much um but (01:11:59) uh it's the way that it is so um I (01:12:03) might go back to uh streaming on Twitch (01:12:06) and then cutting it up um so that you (01:12:08) guys get like bite-sized videos um but (01:12:12) I do like doing the streams so I might (01:12:14) just keep them on (01:12:16) YouTube all right so no further (01:12:19) questions then um I'm wishing you all a (01:12:23) very happy Sunday um (01:12:28) uh thanks for this lesson if total PC1 (01:12:30) and PC2 accounts for less than 50% (01:12:33) should we present more components yes (01:12:36) yeah well you always like principal (01:12:38) components they are related to the (01:12:41) original data sources that you had right (01:12:44) so if you have 60 different features of (01:12:47) course the first two principal (01:12:49) components are not going to catch all of (01:12:52) your variation right because if you (01:12:54) start off with 60 features you probably (01:12:57) 
need like five or six or seven uh to (01:13:00) explain a reasonable amount of variance (01:13:02) but also what is a reasonable amount of (01:13:04) variance is very dependent on what (01:13:06) you're doing right so one of the things (01:13:09) that I always do when I do PCA is do the (01:13:12) correlation of the PCs back to the raw (01:13:16) unscaled data that I had to see what is (01:13:20) causing or what is how are these (01:13:22) original phenotypes loaded onto the PC (01:13:27) um but yeah no generally you want to (01:13:30) have two or more (01:13:34) components you want to or uh if you have (01:13:36) 60 features but you want to end up with (01:13:39) like 80% to 85% variance explained um (01:13:44) because those are kind of the main (01:13:46) directions in your data right so if you (01:13:48) need four or five components to hit this (01:13:51) 80% explained you would plot PC 1 (01:13:55) versus 2 1 versus 3 2 versus 3 1 versus 4 (01:14:00) 2 versus 4 3 versus 4 and you would look (01:14:03) at all of them to see if you see any (01:14:06) clear grouping right because this clear (01:14:08) grouping will allow you to do (01:14:10) predictions as well right because if we (01:14:12) now measure a new (01:14:14) setosa we then do the computation right (01:14:16) so we multiply the values that we obtain (01:14:19) with the projection matrix and it ends (01:14:22) up being right so it ends up being a (01:14:26) negative scoring one so we would know (01:14:29) that the measurements come from a setosa (01:14:32) right so we can take four (01:14:33) measurements of the plant measure the (01:14:36) four (01:14:37) things multiply this with our (01:14:41) eigenvector matrix and then we will get four (01:14:44) new values which would then allow us to (01:14:47) determine which plant it is without (01:14:49) knowing it right so PCA can also be (01:14:52) used to predict um what predict what you (01:14:56) are (01:14:58) seeing based on the phenotypic 
(01:15:01) measurements that you (01:15:02) have all right so thank you guys for (01:15:05) spending your Sunday with me um please (01:15:09) like the video stream and subscribe (01:15:12) to my YouTube channel if you want to see (01:15:13) more um would you upload a lecture on (01:15:16) metabolomic data analysis like the one (01:15:18) on RNA from scratch I am working on that (01:15:20) Dion I am working on that um I've been (01:15:23) doing a lot of metabolomics (01:15:26) recently um (01:15:28) and I almost have a working pipeline (01:15:32) um very similar to things like uh (01:15:35) MetaboAnalyst uh MS-DIAL um which goes from (01:15:39) raw machine output through all of the (01:15:43) different steps that you normally need (01:15:45) to do like um like scaling the uh (01:15:50) scaling the profiles um and then (01:15:53) determining features and then doing (01:15:54) feature annotation um so I am planning (01:15:57) on doing one of those in the future um (01:16:00) there's also a QTL lecture that I'm (01:16:03) currently preparing based on a request (01:16:06) from last no not last week but the week (01:16:09) before that so from the last stream um (01:16:11) someone asked me could you do a lecture (01:16:13) about QTL um so there will definitely be (01:16:15) a QTL lecture and there will definitely (01:16:17) be a metabolomic lecture um once I get (01:16:22) the preprint out for the QTL mapping (01:16:25) work that we've been doing on the um (01:16:28) head 3 um there will also be a video (01:16:33) about how to do longevity analysis on (01:16:37) the um 3 m that we've currently been (01:16:39) doing so there's a lot of things (01:16:41) in the pipeline um it's just finding the (01:16:44) time to make them um but metabolomics is (01:16:47) definitely on my list um like I said (01:16:50) I've almost got a fully working pipeline (01:16:52) and then we will do metabolomics from (01:16:56) scratch um and then uh it's going to be (01:16:59) 
fun all right thanks so much guys um (01:17:03) enjoy the rest of your Sunday um I hope (01:17:05) you have better weather than me like (01:17:07) here it's been gray and raining the (01:17:09) whole weekend well mostly the whole week (01:17:13) um so yeah it's been a (01:17:15) poor summer here so I hope that you guys have (01:17:17) nice beautiful weather um and that you (01:17:19) can spend some time (01:17:21) outside all right then see you guys next (01:17:24) time ch
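
Editor's note: the recap near the end of the stream (autoscaling, covariance matrix, eigenvectors and eigenvalues, variance explained, projection, comparison with prcomp, the sign flip, and correlating the PCs back to the raw measurements) can be sketched in R roughly as follows. This is a minimal sketch and not the code shown on stream. It assumes the built-in iris data set, it uses base R's eigen() in place of the hand-written eigen decomposition from the stream, and all variable names (X, Z, E, P, ref, new_plant and so on) as well as the measurements of the "new plant" are made up for illustration.

```r
# Minimal from-scratch PCA sketch on the built-in iris data (assumed data set).
X <- as.matrix(iris[, 1:4])   # the four measurements: sepal/petal length and width
Z <- scale(X)                 # autoscaling: mean 0 and standard deviation 1 per column
C <- cov(Z)                   # covariance matrix of the scaled data
eig <- eigen(C)               # base R eigen() stands in for a hand-written solver
E <- eig$vectors              # eigenvectors as columns, sorted by decreasing eigenvalue

# Variance explained per component is each eigenvalue divided by their sum.
var_explained <- eig$values / sum(eig$values)

# Projection: scaled data multiplied by the eigenvector matrix.
P <- Z %*% E
colnames(P) <- paste0("PC", 1:4)

# Reference result; scale. = TRUE has to be set because the default is FALSE.
ref <- prcomp(X, scale. = TRUE)

# Eigenvectors are only defined up to sign, so a component can come out flipped.
# Align the sign of each column of P with prcomp before comparing.
flip <- sign(colSums(P * ref$x))
P_aligned <- sweep(P, 2, flip, `*`)
max_diff <- max(abs(P_aligned - ref$x))   # expected to be numerically close to zero

# Interpretation: correlate the first two PCs with the raw, unscaled measurements.
loadings_cor <- round(cor(P[, 1:2], X), 2)

# Smallest number of components that reaches 80% cumulative variance explained.
n_needed <- which(cumsum(var_explained) >= 0.80)[1]

# Projecting a new plant: standardize with the training mean and sd, then multiply by E.
new_plant <- c(5.0, 3.4, 1.5, 0.2)        # hypothetical measurements
new_z <- (new_plant - attr(Z, "scaled:center")) / attr(Z, "scaled:scale")
new_scores <- new_z %*% E                 # 1 x 4 row of PC scores for the new plant

# Plot PC1 against the negated PC2, which is the flip fix discussed on stream.
plot(P[, 1], -P[, 2], col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "-PC2")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```

Whether a given component needs flipping depends on the signs that eigen() and prcomp happen to return, so the sketch aligns signs by comparison instead of hard-coding which axis to negate. Note also that a new measurement has to be standardized with the mean and standard deviation of the original data before it is multiplied with the eigenvector matrix, otherwise its scores are not comparable to the plotted ones.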
