Recently I started taking the Machine Learning course by Andrew Ng. In this post, I will note down the non-trivial things in this course that do not come up in other courses.
Course structure:
First of all, it is clear from the course structure that this course intends you to know your basics well. The first week covers the definition of machine learning and linear regression in one variable.
Here I encountered the definition of machine learning by Tom Mitchell from CMU:
A well-posed learning problem: A computer program is said to learn from experience E with respect to a task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Week 1 is not very interesting if you have taken a statistics course and roughly know what machine learning is, and only want the details. So I attempted the quizzes and went straight to week 2. There are actually two quizzes in week 1: one on the basic terms of machine learning, and the other on linear regression in one variable. Even if you have never taken statistics, you can still attempt them and start learning. The quizzes were simple but informative.
I scored full marks and went on to week 2.
Week 2:
I had taken an Octave course previously, so I already had Octave installed on my computer and skipped the videos about installing it. I may skip the Octave classes altogether.
Multivariate regression, even if you already know it, is treated a bit differently here, with normalization of the variables, the independent variables called features, and everything written in machine learning notation. So I suggest watching the videos; otherwise you will fall into potholes of terminology and trip up on the quizzes.
Another thing that differs from a statistics background is the emphasis on models for regression. Also, we take the ordinary least squares loss, recast it as a cost function, and then update the coefficients using gradient descent.
This is new.
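Here is a minimal sketch of that cost function and the gradient descent update, written in Python/NumPy rather than the course's Octave, with toy data of my own:

```python
import numpy as np

def cost(X, y, theta):
    """Squared-error cost: J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    m = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / (2 * m)

def gradient_descent(X, y, theta, alpha, iterations):
    """Repeatedly step theta against the gradient of the cost."""
    m = len(y)
    for _ in range(iterations):
        theta = theta - alpha * X.T @ (X @ theta - y) / m
    return theta

# Toy data: y = 1 + 2x, with a column of ones for the intercept term.
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
theta = gradient_descent(X, y, np.zeros(2), alpha=0.5, iterations=2000)
print(theta)              # approaches [1, 2]
print(cost(X, y, theta))  # approaches 0
```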
With multivariate linear regression, we use gradient descent to reach the minimum of the loss function, or the minimum of the cost function as it is called in this course. To reach this minimum, a few techniques are mentioned in week 2.
(1) In the first video, we learn about feature scaling and mean normalization. To cut a long story short:
Feature scaling means bringing all the features onto the same scale, e.g. -1 to 1 or 0 to 1, something like that.
In this case, you just divide each variable by a suitably large number (such as its range) to bring it onto that scale.
Mean normalization is different. It takes a variable and normalizes it in some sense: it subtracts the mean and then divides either by the range of the variable or, in code where you do not have to do the calculation by hand, by the standard deviation.
The reason Andrew Ng gives for this is that wildly different ranges lead to a slow, long-running gradient descent because the contours of the cost function become very elongated. Once we bring the ranges to the same or comparable values, gradient descent works a lot faster.
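To make the two tricks concrete, here is a tiny sketch in Python/NumPy (my own illustration, not code from the course), with made-up house sizes:

```python
import numpy as np

def feature_scale(x):
    """Divide by the range so the feature lands in roughly a unit-sized interval."""
    return x / (x.max() - x.min())

def mean_normalize(x):
    """Subtract the mean, then divide by the range, as described in the lectures."""
    return (x - x.mean()) / (x.max() - x.min())

def standardize(x):
    """The common programming shortcut: subtract the mean, divide by the standard deviation."""
    return (x - x.mean()) / x.std()

sizes = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])  # hypothetical house sizes
print(feature_scale(sizes))
print(mean_normalize(sizes))
print(standardize(sizes))
```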
OK, let's move on to the next video of tricks.
The second one is a really technical video, and I think everyone should watch it to get a primer on the gradient descent problems one may face. It deals with the convergence of the cost function, then with plotting the cost function for different learning rates, and with how to get the learning rate right: try a number of learning rates increasing in a geometric progression, then plot the cost function against the number of iterations for each one to compare their behaviour.
In the ideal graph the cost comes down quickly if you have a good learning rate; it comes down slowly if your learning rate is smaller than needed; and if your learning rate is too large, the cost function can even go up. So this second video is really important and should be watched carefully.
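A sketch of that experiment (in Python/NumPy, with a learning-rate grid of my own choosing rather than the exact values from the course):

```python
import numpy as np

def cost_history(X, y, alpha, iterations=100):
    """Run gradient descent and record the cost J(theta) after every iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iterations):
        theta = theta - alpha * X.T @ (X @ theta - y) / m
        history.append(((X @ theta - y) ** 2).sum() / (2 * m))
    return history

# Toy data and learning rates growing in a rough geometric progression.
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
for alpha in [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]:
    J = cost_history(X, y, alpha)
    print(f"alpha={alpha}: final cost {J[-1]:.3g}")
# Plotting each history against the iteration number shows which alpha makes the
# cost fall quickly, which one crawls, and which one makes the cost blow up.
```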
In the next video, he talks about model selection with features that are manually created rather than naturally occurring. The main motive is to introduce polynomial regression, but another point that comes up is the danger of fitting data blindly. Data should be fit with eyes open, as he points out: if a quadratic is fit to house size versus price, the predicted price will eventually come down for very large sizes just because of the shape of the curve, so he uses a cubic fit instead.
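A small sketch of what those manually created features look like (my own illustration in Python; the sizes and prices are made up):

```python
import numpy as np

size = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])   # raw feature
price = np.array([200.0, 255.0, 300.0, 335.0, 365.0])       # made-up prices

# Polynomial regression is just linear regression on hand-made features:
X = np.column_stack([np.ones_like(size), size, size**2, size**3])

# The squared and cubed features have wildly different ranges from the raw size,
# which is exactly why the feature scaling from the earlier video matters here.
X[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

theta, *_ = np.linalg.lstsq(X, price, rcond=None)   # fit the cubic model
print(theta)
```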
Excited to see what comes next.
So now, for once, he at least mentions the theoretical side. He presents the statistical solution in matrix form and gives an idea of it. He also mentions that in practice it is better to work with gradient descent when the number of features is large (say n > 10,000), as the matrix inversion gets slow.
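That closed-form solution is the normal equation; a minimal sketch in Python (again my own, not the course's Octave code):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (X'X)^(-1) X'y.
    Solving the n-by-n system costs roughly O(n^3) in the number of features,
    which is why gradient descent wins when n is very large."""
    return np.linalg.solve(X.T @ X, X.T @ y)  # solve() instead of an explicit inverse

x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
print(normal_equation(X, y))  # [1, 2], with no learning rate and no iterations
```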
Now there is an Octave exercise. Since I am not an Octave fan myself, I will skip these and go on reading.
This is an update at the end of week 3. Week 3, on logistic regression, was really good. Much better than other internet sources, Ng first builds an understanding of why the cost function is good and much more meaningful than the squared loss, and only at the end collapses the piecewise definition into its compact form. On this point the description is much livelier than books and much more theoretical than the internet resources, and that is what is important about Andrew Ng's course.
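For reference, the piecewise cost he starts with and the compact form he ends up with, sketched in Python (the course states these in math and Octave):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_piecewise(h, y):
    """-log(h) when y = 1, and -log(1 - h) when y = 0, averaged over examples."""
    return np.where(y == 1, -np.log(h), -np.log(1 - h)).mean()

def cost_compact(h, y):
    """The one-line version: -[y*log(h) + (1 - y)*log(1 - h)], averaged."""
    return -(y * np.log(h) + (1 - y) * np.log(1 - h)).mean()

# The two definitions give the same number for any predictions and labels:
h = sigmoid(np.array([-2.0, -0.5, 0.5, 2.0]))
y = np.array([0.0, 0.0, 1.0, 1.0])
print(cost_piecewise(h, y), cost_compact(h, y))
```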
He also talks about using some more advanced optimization algorithms (he does not get into the theoretical background, but talks about how to call them in MATLAB/Octave). I plan to read about them later and write a blog post on the same. But that's for later.
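The course shows these through Octave's fminunc, where you hand over a function that returns the cost and its gradient. I have not dug into the algorithms yet, but as a rough Python analogue (my substitution, using scipy.optimize.minimize with BFGS and toy data of my own) the same pattern looks like this:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Logistic regression cost and gradient, which the optimizer calls as needed."""
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)  # guard against log(0)
    cost = -(y * np.log(h) + (1 - y) * np.log(1 - h)).mean()
    grad = X.T @ (h - y) / len(y)
    return cost, grad

# Toy, slightly overlapping classes so the optimum is finite.
x = np.array([0.10, 0.30, 0.40, 0.45, 0.55, 0.60, 0.70, 0.90])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])

result = minimize(cost_and_grad, x0=np.zeros(2), args=(X, y), jac=True, method="BFGS")
print(result.x)  # fitted theta; no learning rate chosen by hand
```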