
For those wanna-be statisticians

Introduction:

Today, I found a question on Reddit asking what you have to read to become a statistician. I started to write an answer and soon realized that it would make a good checklist for anyone on a self-teaching journey. One of the problems in any self-teaching journey is that you do not know exactly what to read or where to stop unless you get your checklist straight.
So here is the answer, and from it you can build your checklist.
For statistics, make sure you have a good probability background, i.e. you understand random variables, expectations, variances, PDFs, CDFs, moment generating functions, techniques for solving probability problems, modes of convergence and the other basics of probability. You will also need a good linear algebra background, as much of statistics relies on matrices and vector spaces. Once you have that, you can balance taking MOOCs with reading the topics taught in those courses from standard statistics books. Now, I am assuming that you want to be a statistician. Therefore, you will want to tick off the following list:
(1) Descriptive statistics
Histograms, bar charts, box plots, stem-and-leaf plots and so on, together with describing and understanding basic real-world problems in statistical terms. This is the descriptive level.
(2) Diagnostic and predictive statistics
This requires knowing sufficient and ancillary statistics, maximum likelihood estimation (MLE), the method of moments (MoM), hypothesis testing with t, z, F and chi-square tests, goodness-of-fit and other tests, univariate and bivariate relationships, and correlation and dependence between variables and their effects. These help a statistician understand a problem situation. Predictive statistics means knowing the different types of regression, i.e. simple linear, multiple linear, logistic and so on, and their details. Predictive work basically teaches a statistician to fit the data to some specified pattern and then predict upcoming outcomes.
(3) Forecasting and time series analysis
These are branches of statistics in their own right. Forecasting and time series analysis are used to understand, model and predict things that depend on time, and are therefore far more interesting. There are a number of models under both, so they need a good amount of time.
(4) Bayesian statistics and non-parametric methods
Although these fall under predictive and diagnostic statistics, a lot of books and courses do not go into them while covering regression and other parametric stuff. Bayesian statistics may need a good amount of probability, but once learned it introduces you to a big area of modern statistics. Also, since data do not always satisfy our assumptions, in practice a lot of work is done with non-parametric methods.
(5) Sample surveying
This is not as important as the others, but a statistician may be asked to do survey work in a company or in academia, where the sample for research often has to be collected by the researchers themselves. A good understanding of the underlying techniques of sample surveying is therefore also good to have.
So, I think you will now have a sense of the things you need to go through. The topics are listed roughly in order of increasing difficulty, and the later ones are also less mandatory for a statistician to know. But then again, if you are self-teaching, why be a bad teacher and leave out part of the syllabus!
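To make items (1) and (2) of the checklist a little more concrete, here is a minimal Python sketch that computes a few descriptive statistics and a one-sample t statistic using only the standard library. The sample values, the null mean of 12.0 and the function names `describe` and `one_sample_t` are made up for illustration; they are not from any particular book or course.

```python
import math
import statistics

# Hypothetical sample, made up for illustration only.
sample = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 10.0, 13.0]


def describe(data):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation (divides by n - 1)
        "min": min(data),
        "max": max(data),
    }


def one_sample_t(data, mu0):
    """t statistic for the null hypothesis that the population mean equals mu0."""
    standard_error = statistics.stdev(data) / math.sqrt(len(data))
    return (statistics.mean(data) - mu0) / standard_error


summary = describe(sample)
t_stat = one_sample_t(sample, mu0=12.0)

print(summary["mean"], summary["median"], summary["stdev"])  # 13.0 13.0 2.0
print(round(t_stat, 3))  # 1.414
```

To turn the t statistic into a decision you would compare it with the t distribution with n - 1 degrees of freedom, which is exactly the kind of detail the books below cover.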
For linear algebra, you may follow the linear algebra chapters of Michael Artin's Algebra. For basic probability, it is good to follow Introduction to Probability by Sheldon Ross. For beginner-level statistics, read Introductory Statistics by Sheldon Ross once; its descriptive statistics part is good.
For the topics under point (2), Casella and Berger will be enough. For the other topics, you can follow any of a number of books and online courses. For regression, time series, forecasting and non-parametric tests, please also go through their R and/or Python implementations, if possible.
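As a taste of what going through a Python implementation can look like, here is a minimal sketch of simple linear regression fitted by ordinary least squares, written with plain Python so that every step is visible. The data points and the names `fit_line` and `predict` are hypothetical; in real work you would more likely reach for a library such as statsmodels or scikit-learn.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(x, y)


def predict(new_x):
    """Predict the outcome for a new x using the fitted line."""
    return intercept + slope * new_x


print(round(slope, 2), round(intercept, 2))  # 1.97 0.09
print(round(predict(6.0), 2))  # 11.91
```

Writing the fit by hand once makes it easier to read what a library reports later, since the slope is just the ratio of the cross-deviation sum to the squared-deviation sum of x.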
Hope you enjoy the journey in statistics.
I have started compiling some of the necessary building blocks for doing a statistician's or data scientist's job. Please follow the links below to get started with me on:
(1) Time series analysis
(2) pandas use in data science
(3) Regression
(4) Non-parametric tests
(5) A basic understanding of Python
(6) Keras introduction
and many other posts to come.
