
Review of Machine Learning by Andrew Ng

Recently I have started taking the Machine Learning course by Andrew Ng. In this post, I will note down the non-trivial things in this course which do not come up in other courses.

Course structure:

First of all, it is clear from the course structure that this course intends for you to know your basics well. The first week covers the definition of machine learning and linear regression in one variable.

Here I encountered the definition of machine learning by Tom Mitchell from CMU:
A well-posed learning problem: a computer program is said to learn from experience E with respect to a task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.


Week 1 is not very interesting if you have taken a statistics course and already know roughly what machine learning is, and only want the details. So I attempted the quizzes and went straight to week 2. There are actually two quizzes in week 1: one on basic machine learning terms, and the other on linear regression in one variable. Even if you have not taken statistics, you can still attempt them and start learning. The quizzes were simple but informative.
Scored perfect and went to week 2.
Week 2:
I had taken an Octave course previously, so I already had Octave installed on my computer and skipped the videos about installing it. I may skip the Octave classes as well.
Multivariate regression, even if you already know it, is treated a bit differently here: the variables are normalized, the independent variables are called features, and everything is written in machine learning notation. So I suggest watching the videos, otherwise you will hit potholes of terminology and miss points on the quizzes.
Another thing here that differs from a statistics background is that we talk in terms of models for the regressions. Also, the ordinary least squares loss is recast as a cost function, and we then proceed to update the coefficients using gradient descent.
This is new.
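To make this concrete for myself, here is a minimal sketch of batch gradient descent for linear regression in Python/NumPy (the course itself uses Octave; the names alpha and theta follow the course notation, but this particular function is my own illustration, not code from the course):

import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    # X: (m, n) feature matrix with a leading column of ones for the intercept
    # y: (m,) vector of targets
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        errors = X @ theta - y            # h_theta(x) - y for every example
        gradient = (X.T @ errors) / m     # partial derivatives of the cost
        theta -= alpha * gradient         # simultaneous update of all coefficients
    return theta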
With multivariate linear regression, we use gradient descent to reach the minimum of the cost function (the minimum achievable value of the loss, in statistics terms). Now, to reach this minimum, a few techniques are mentioned in week 2 of the course.
(1) In the first video, we learn about feature scaling and mean normalization. To cut a long story short:
Feature scaling means scaling all the features to the same range, e.g. -1 to 1 or 0 to 1, something like that. In this case, you simply divide each variable by a suitably large number to bring them all onto the same scale.
Mean normalization is different. It takes a variable and normalizes it in some sense: it subtracts the mean and then divides either by the range of the variable or, when a library does the calculation for you, by the standard deviation.
The reason Andrew Ng gives for this is that features with very different ranges lead to a bad, long-running gradient descent, because the contour plot of the cost function becomes extremely elongated. When we bring the ranges to the same or comparable values, gradient descent becomes a lot faster.
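A small sketch of mean normalization as I understand it (Python/NumPy again; the function name is mine, not from the course):

import numpy as np

def mean_normalize(X):
    # X: (m, n) raw feature matrix
    mu = X.mean(axis=0)            # per-feature mean
    sigma = X.std(axis=0)          # per-feature standard deviation (the range also works)
    X_norm = (X - mu) / sigma      # every feature now has mean 0 and a comparable scale
    return X_norm, mu, sigma       # keep mu and sigma to apply the same transform to new data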
OK, let's move on to the next video and the next trick.
The second one is a really technical video, and I think everyone should watch it as a primer on the gradient descent problems one may face. It deals with the convergence of the cost function, then with plotting the cost function for different learning rates, and it talks about how to get the learning rate right and what exactly to try: try a number of different learning rates, increasing roughly in geometric progression, and then plot the corresponding cost against the number of iterations to see the performance.
Ideally the cost should come down quickly if you have a good learning rate; it comes down slowly if the learning rate is smaller than needed; and if the learning rate is too large, the cost function can even go up. So this second video is really important and should be watched carefully.
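A hypothetical version of that experiment in Python (the synthetic data and the specific learning rates are my own choices; the idea of sweeping rates in a rough geometric progression is from the video):

import numpy as np

def cost(X, y, theta):
    # squared-error cost J(theta) = (1 / 2m) * sum((X @ theta - y)^2)
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

# Synthetic data, just to make the sketch runnable.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]        # intercept column + 2 features
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=100)

# Sweep learning rates and record the cost at every iteration,
# so each J_history can be plotted against the iteration count.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    theta = np.zeros(X.shape[1])
    J_history = []
    for _ in range(400):
        theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
        J_history.append(cost(X, y, theta))
    print(alpha, J_history[-1])    # a steadily decreasing J_history means alpha is small enough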
In the next video, he talks about model selection with features that are manually created rather than naturally occurring. The main motive here is to introduce the concept of polynomial regression; but another thing that comes up is the danger of fitting data blindly. Data should be fit with eyes open, as he points out: if a quadratic is fitted to house size versus price, the curve will come back down for much bigger sizes purely because of the shape of the fit, and so he employs a cubic fit instead.
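For instance, a hand-made cubic feature set might look like this (made-up numbers; just a sketch of the idea):

import numpy as np

size = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])   # hypothetical house sizes
X = np.c_[size, size**2, size**3]                           # manually created polynomial features
X = (X - X.mean(axis=0)) / X.std(axis=0)                    # feature scaling matters even more here
X = np.c_[np.ones(len(size)), X]                            # add the intercept column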
Excited to see what comes next.
So now, at least once, he mentions the theoretical side. He presents the statistical closed-form solution in matrix form (the normal equation) and gives an idea of it. He also mentions that in practice it is better to work with gradient descent when the number of features is large (roughly more than 10,000), as the matrix inversion becomes slow.
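For reference, the normal equation in code form (a sketch; I use pinv so that a non-invertible X^T X, e.g. from redundant features, does not break it):

import numpy as np

def normal_equation(X, y):
    # closed-form solution theta = (X^T X)^(-1) X^T y, with no learning rate and no iterations
    return np.linalg.pinv(X.T @ X) @ X.T @ y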
Now there is an Octave exercise. Since I am not an Octave fan myself, I will skip these and go on reading further.

This is an update at the end of week 3. Week 3, on logistic regression, was really good, much better than other internet sources. Ng first builds an understanding of why the logistic cost function is good and much more meaningful than the squared loss, and only at the end makes it compact from its piecewise definition. Here the description is much more lively than in books, and much more theoretical than typical internet resources, and that is what is important about Andrew Ng's course.
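For my own notes, that compact cost function sketched in Python (my translation; the course works in Octave):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, theta):
    # piecewise form: -log(h) when y = 1, -log(1 - h) when y = 0,
    # written as the single compact expression derived at the end of week 3
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m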

We also talk about using some more advanced optimization algorithms (the course does not get into their theoretical background, but shows how to call them in Octave/MATLAB). I plan to read about them later and write a blog on the same. But that's for later.
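As a hypothetical Python analogue of that idea (handing the cost and gradient to an off-the-shelf optimizer instead of hand-rolling gradient descent), scipy.optimize.minimize can play a similar role; the data here is synthetic and the choice of BFGS is mine:

import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    # logistic regression cost and gradient, packaged for the optimizer
    m = len(y)
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)   # clip to avoid log(0)
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return cost, grad

# Synthetic data just to make the sketch runnable.
rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
y = (rng.random(200) < sigmoid(X @ np.array([0.5, 2.0, -1.0]))).astype(float)

result = minimize(cost_and_grad, np.zeros(X.shape[1]), args=(X, y),
                  jac=True, method="BFGS")   # jac=True: the function returns (cost, gradient)
theta_opt = result.x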
