
Review of Machine Learning by Andrew Ng

I recently started taking the machine learning course by Andrew Ng. In this post, I will note down the non-trivial things in this course that do not come up in other courses.

Course structure:

First of all, the course structure makes it clear that the course wants you to know your basics well. The first week covers the definition of machine learning and linear regression in one variable.

Here I encountered the definition of machine learning by Tom Mitchell from CMU:
A well-posed learning problem: A computer is said to learn from experience E with respect to a task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.


Week 1 is not very interesting if you have taken a statistics course, already know roughly what machine learning is, and only want the details. So I attempted the quizzes and went straight to week 2. There are actually two quizzes in week 1: one on the basic terms of machine learning, and the other on linear regression in one variable. Even if you have not taken statistics, you can still attempt these and start learning. The quizzes were simple but informative.
I scored perfectly and went on to week 2.
Week 2:
I had taken an Octave course previously, so Octave was already installed on my computer, and I skipped the videos about installing it. I may skip the Octave classes as well.
Multivariate regression, even if you already know it, is treated a bit differently here: variables are normalized, independent variables are called features, and so on, following machine learning notation. So I suggest watching the videos; otherwise you will trip over the terminology and miss quiz questions.
Another difference from a statistics background is the framing: we talk about models for regression. Also, the ordinary least squares loss is modified slightly into what the course calls a cost function, and we then update the coefficients using gradient descent.
This is new.
With multivariate linear regression, we use a different method, namely gradient descent, to reach the minimum of the loss function, or the minimum of the cost function as it is called in this course. To reach this minimum, a few techniques are mentioned in week 2.
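To make the cost function and the update step concrete, here is a minimal sketch in plain Python. It assumes the squared-error cost J = 1/(2m) * sum((h(x) - y)^2) and a simultaneous update of all coefficients; the function names and the toy data are made up for illustration and are not taken from the course material.

```python
def predict(theta, x):
    """Hypothesis h(x) = theta[0] + theta[1]*x[0] + ... (x carries no bias term)."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def cost(theta, X, y):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(X)
    return sum((predict(theta, x) - yi) ** 2 for x, yi in zip(X, y)) / (2 * m)


def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent; all coefficients are updated simultaneously."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    history = []
    for _ in range(iterations):
        errors = [predict(theta, x) - yi for x, yi in zip(X, y)]
        # Partial derivative for the intercept, then one per feature.
        grad = [sum(errors) / m]
        for j in range(n):
            grad.append(sum(e * x[j] for e, x in zip(errors, X)) / m)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
        history.append(cost(theta, X, y))
    return theta, history


# Toy data generated from y = 1 + 2*x1 + 3*x2 (made up for illustration).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.25]]
y = [1.0, 3.0, 4.0, 6.0, 2.75]
theta, history = gradient_descent(X, y, alpha=0.3, iterations=5000)
print([round(t, 4) for t in theta])
```

The `history` list holds the cost after each iteration, which is the quantity one would plot to check that gradient descent is converging.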
(1) In the first video, we learn about feature scaling and mean normalization. To cut a long story short:
Feature scaling brings all the features to the same scale, e.g. -1 to 1 or 0 to 1.
In this case, you just divide each variable by a suitable number (such as its maximum) to bring them all to the same scale.
Mean normalization is different. It subtracts the mean of the variable and then divides either by the range of the variable or, in programs where you do not have to do the calculation by hand, by the standard deviation.
The reason Andrew Ng gives for this is that widely different ranges produce elongated contours of the cost function, so gradient descent takes many iterations to converge. When the ranges are brought to the same or comparable values, gradient descent becomes a lot faster.
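The two transformations can be sketched in a few lines of plain Python. This is only an illustration: the function names and the house sizes are hypothetical, dividing by the largest absolute value is one possible choice of scaling constant, and the population standard deviation is used here although a sample standard deviation would serve equally well.

```python
import statistics


def scale_by_max(values):
    """Feature scaling: divide by the largest absolute value so results lie in [-1, 1]."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


def mean_normalize(values, use_std=False):
    """Mean normalization: subtract the mean, then divide by the range
    (max - min) or, if use_std is True, by the standard deviation."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values) if use_std else max(values) - min(values)
    return [(v - mean) / spread for v in values]


# Hypothetical house sizes in square feet.
sizes = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
print(scale_by_max(sizes))
print(mean_normalize(sizes))
print(mean_normalize(sizes, use_std=True))
```

Both functions assume the feature is not constant; a constant feature would give a zero divisor and would need separate handling.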
OK, let's move on to the next video of tricks.
The second one is a really technical video, and I think everyone should watch it to get a primer on the gradient descent problems one may face. It deals with the convergence of the cost function, with plotting the cost function for different learning rates, and with how to get the learning rate right: try a number of learning rates increasing in a geometric progression, and plot the cost function against the number of iterations for each one to compare them.
Ideally, with a good learning rate the cost comes down fast; with a learning rate smaller than needed it comes down slowly; and with a learning rate that is too large the cost function can even go up. So this second video is really important and should be watched carefully.
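The learning-rate experiment described above can be sketched as follows. This is a minimal illustration on made-up data with one feature; the ratio of 10 between successive learning rates is an arbitrary choice for this sketch, and the cost values are printed rather than plotted to keep the example dependency-free.

```python
def cost_history(xs, ys, alpha, iterations=50):
    """Run gradient descent on y ~ t0 + t1*x and return the cost after each iteration."""
    m = len(xs)
    t0, t1 = 0.0, 0.0
    history = []
    for _ in range(iterations):
        errors = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / m
        g1 = sum(e * x for e, x in zip(errors, xs)) / m
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1
        history.append(sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m))
    return history


# Toy data from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Candidate learning rates in a geometric progression.
alphas = [0.001, 0.01, 0.1, 1.0]
results = {alpha: cost_history(xs, ys, alpha) for alpha in alphas}

for alpha, history in results.items():
    trend = "going up" if history[-1] > history[0] else "coming down"
    print(f"alpha={alpha}: first={history[0]:.4g}, last={history[-1]:.4g} ({trend})")
```

On this data the smallest rate reduces the cost only slowly, the middle rates reduce it faster, and the largest one makes the cost grow, which matches the three cases described in the video.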
In the next video, he talks about model selection with features that are created manually rather than occurring naturally. The main motive is to introduce polynomial regression, but one more point comes up: data should not be fit blindly. Data should be fit with eyes open, as he points out: if a quadratic is fit to house price against house size, the fitted curve will eventually go down for much bigger sizes, so he employs a cubic fit instead.
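A small sketch may help show why a quadratic fit can misbehave for large houses. The helper names and the coefficients below are hypothetical, chosen only so that the quadratic opens downward; they are not fitted to any real data.

```python
def polynomial_features(size, degree):
    """Turn one feature into [size, size^2, ..., size^degree]."""
    return [size ** d for d in range(1, degree + 1)]


def hypothesis(theta, features):
    """Linear model over the engineered features: theta[0] + sum(theta[j] * features[j-1])."""
    return theta[0] + sum(t * f for t, f in zip(theta[1:], features))


# Hypothetical coefficients for a downward-opening quadratic in house size.
theta_quadratic = [50.0, 0.2, -0.00004]

prices = {}
for size in [1000, 2500, 4000, 6000]:
    prices[size] = hypothesis(theta_quadratic, polynomial_features(size, 2))
    print(size, prices[size])
```

With these made-up coefficients the predicted price rises up to a size of 2500 and falls afterwards, which is not how house prices are expected to behave. Note also that the engineered features have very different ranges (size versus size squared), so feature scaling matters even more once they are used with gradient descent.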
Excited to see what comes next.
Next, for once at least, he mentions the theory. He presents the closed-form statistical solution in matrix form (the normal equation) and gives an idea of it. He also mentions that in practice it is better to use gradient descent once the number of features n goes above roughly 10,000, as the matrix inversion gets slower.
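The matrix solution can be sketched in plain Python by solving the system (X^T X) theta = X^T y, which is equivalent to inverting X^T X when that matrix is invertible. The helper names and the toy data are assumptions for illustration; a real implementation would normally use a linear algebra library.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def normal_equation(X, y):
    """Least-squares coefficients from (X^T X) theta = X^T y.
    A column of ones is prepended to X for the intercept."""
    Xb = [[1.0] + list(row) for row in X]
    n = len(Xb[0])
    XtX = [[sum(r[i] * r[j] for r in Xb) for j in range(n)] for i in range(n)]
    Xty = [sum(r[i] * yi for r, yi in zip(Xb, y)) for i in range(n)]
    return solve_linear_system(XtX, Xty)


# Toy data generated from y = 1 + 2*x1 + 3*x2 (made up for illustration).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.25]]
y = [1.0, 3.0, 4.0, 6.0, 2.75]
theta = normal_equation(X, y)
print([round(t, 6) for t in theta])
```

The matrix X^T X has one row and one column per coefficient, and the elimination step costs roughly the cube of that size, which is why this route becomes slow when there are very many features. The sketch assumes X^T X is invertible and does no special handling otherwise.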
 Now, there is an Octave exercise. Since I am not an Octave fan myself, I will skip it and read on.

This is an update at the end of week 3. Week 3, on logistic regression, was really good. Much better than other internet sources, Ng first builds an understanding of why the logistic cost function is a good choice and much more meaningful than the squared loss, and then at the end makes it compact from the piecewise definition. On this point, the description is much more lively than books and much more theoretical than internet resources, and that is what matters about Andrew Ng's course.
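The step from the piecewise definition to the compact form can be checked numerically. The sketch below assumes the usual per-example logistic cost: -log(h) when y = 1 and -log(1 - h) when y = 0, where h is the sigmoid output; the function names are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def cost_piecewise(h, y):
    """Piecewise form: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)


def cost_compact(h, y):
    """Compact form: -y*log(h) - (1 - y)*log(1 - h)."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)


for z in [-2.0, 0.0, 2.0]:
    h = sigmoid(z)
    for y in (0, 1):
        print(z, y, cost_piecewise(h, y), cost_compact(h, y))
```

Because y is always 0 or 1, exactly one of the two terms in the compact form survives, so both functions return the same value. The cost is small when the prediction h is close to the label and grows without bound as h approaches the wrong extreme.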

He also talks about using some more advanced optimization algorithms (without getting into the theoretical background, only how to use them in MATLAB). I plan to read about them later and write a blog post on them. But that's for later.
