
For those wanna-be statisticians

Introduction:

Today I found a question on Reddit asking what you have to read to become a statistician. I started writing an answer and quickly realised it could serve as a checklist for anyone on a self-teaching journey. One problem with any self-teaching journey is that you do not know exactly what to read or where to stop unless you have your checklist straight.
So here is the answer, and from it you can build your own checklist.
For statistics, make sure you have a good probability background, i.e. you understand random variables, expectations, variances, pdfs, cdfs, moment generating functions, techniques for solving probability problems, modes of convergence and the other basics of probability. You will also need a good linear algebra background, as much of statistics relies on matrices and vector spaces. Once you have that, you can balance things by taking MOOCs and reading the topics they teach from standard statistics books. Now, I am assuming that you want to be a statistician, so you will want to tick off the following list:
(1) Descriptive statistics
Histograms, bar charts, box plots, stem-and-leaf plots etc., and describing and understanding basic real-world problems in statistical terms. This is the descriptive level.
(2) Diagnostic and predictive statistics
This requires knowing sufficient and ancillary statistics, the MLE and method-of-moments (MoM) estimators, hypothesis testing with t, z, F and chi-square tests, goodness-of-fit and other tests, and the different types of relationships between variables: univariate and bivariate relations, correlation and dependence of variables, and their effects. These help a statistician understand a problem situation. Predictive statistics means knowing the different types of regression, i.e. linear, logistic, multiple linear etc., and their details. Predictive work basically teaches a statistician to fit the data to some specified pattern and then predict the outcome for new observations.
(3) Forecasting and time series analysis
These are further branches of statistics. Forecasting and time series analysis are used to understand, model and predict things that depend on time, and are therefore far more interesting. There are a number of models under both of these, so a good amount of time is required to cover them.
(4) Bayesian statistics and non-parametric studies:
Although these come under predictive and diagnostic statistics, a lot of books and courses will not go into them while covering regression and other parametric stuff. Bayesian statistics may need a good amount of probability, but once learned it will introduce you to a big area of modern statistics. Also, as data does not always fit all our assumptions, in practice a lot of things are done under the hood with non-parametric methods.
(5) Sample surveying: This is not that important, but a statistician may be asked to handle surveys and similar work in a company and/or in academia, where the sample for research has to be collected by the researchers themselves, so a good understanding of the underlying techniques of sample surveying is also good to have.
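A few of the checklist items above can be tied together in a small Python sketch: draw a simple random sample without replacement (item 5), summarise it (item 1), and build a normal-approximation confidence interval for the mean (item 2). The population here is simulated and the function name `srs_mean_ci` is made up for illustration; treat it as a sketch under those assumptions, not a full survey-estimation procedure.

```python
import random
import statistics
from statistics import NormalDist


def srs_mean_ci(population, n, confidence=0.95, seed=0):
    """Draw a simple random sample of size n without replacement and
    return (sample_mean, lower, upper) for the population mean,
    using a normal approximation."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)

    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)  # uses the n - 1 denominator

    # Finite population correction: sampling without replacement
    # shrinks the variance of the sample mean by (1 - n/N).
    fpc = (1 - n / len(population)) ** 0.5
    std_error = sample_sd / n ** 0.5 * fpc

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return sample_mean, sample_mean - z * std_error, sample_mean + z * std_error


# Hypothetical population: 10,000 simulated values (made-up numbers).
pop_rng = random.Random(42)
population = [pop_rng.lognormvariate(10, 0.5) for _ in range(10_000)]

mean_hat, lower, upper = srs_mean_ci(population, n=200)
print(f"sample mean = {mean_hat:.0f}, 95% CI = ({lower:.0f}, {upper:.0f})")
```

The normal approximation is a simplifying choice; for small samples a t-based interval would be the more usual textbook option.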
So, I think you now have a sense of the things you need to go through. The topics are in order of increasing difficulty, and the later ones are also less mandatory for a statistician to already know. But then again, if you are self-teaching, why be a bad teacher and leave out some of the syllabus!
For linear algebra, you may follow the linear algebra chapters of Michael Artin's Algebra. For basic probability, it is good to follow Sheldon Ross's introduction to probability. For beginner's statistics, read through Introductory Statistics by Sheldon Ross once; the descriptive statistics part is good there.
For the topics in point (2), it will be enough to follow Casella and Berger. For the other topics, you can follow many books and online courses. For regression, time series, forecasting and non-parametric tests, please also go through their R and/or Python implementations, if possible.
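As a starting point on the Python side, here is a minimal sketch of simple linear regression by ordinary least squares, using only the standard library. The function name `fit_line` and the toy data are illustrative assumptions, not taken from any of the books above; for real work you would more likely reach for a dedicated statistics package.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")

    x_bar, y_bar = mean(xs), mean(ys)

    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy data (made up): hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

a, b = fit_line(hours, scores)
print(f"score = {a:.2f} + {b:.2f} * hours")
```

Writing the estimator out by hand once makes it easier to read what a library's regression output is reporting.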
Hope you enjoy the journey in statistics.
I have started compiling some of the necessary building blocks for doing a statistician's or data scientist's job. Please follow the links below to get started with me in:
(1) Time series analysis
(2) pandas use in data science
(3) Regression
(4) Non-parametric tests
(5) A basic understanding of Python
(6) Keras introduction
and many other posts to come.
