
A look at probability for data science

To have a solid foundation in probability theory for data science, let's explore key concepts in a structured manner. We’ll start from the basics and gradually move to more advanced ideas. This overview will give you the necessary theoretical background to understand how probability is applied in data science, particularly in machine learning, statistical modeling, and predictive analytics.

1. Random Variables

A random variable is a variable that takes on different values based on the outcomes of a random phenomenon. Random variables are of two main types:

  • Discrete Random Variables: These take on a countable number of values. For example, the outcome of a die roll (1 through 6) is a discrete random variable.
  • Continuous Random Variables: These take on an uncountable number of values, typically within some interval. For example, the time it takes for a customer to make a purchase in an online store can be modeled as a continuous random variable.
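
To make the two types concrete, here is a minimal sketch that draws values from one discrete and one continuous random variable using Python's standard `random` module. The function names and the assumption that purchase times are exponential with a mean of 5 minutes are illustrative choices, not part of the definitions above.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def roll_die(rng):
    """Discrete: one roll of a fair six-sided die, a value in {1, ..., 6}."""
    return rng.randint(1, 6)

def time_to_purchase(rng, mean_minutes=5.0):
    """Continuous: a waiting time, assumed exponential with the given mean."""
    return rng.expovariate(1.0 / mean_minutes)

rolls = [roll_die(rng) for _ in range(10)]
times = [time_to_purchase(rng) for _ in range(10)]

print(rolls)                          # only the six integer faces can appear
print([round(t, 2) for t in times])   # any non-negative real value can appear
```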

2. Probability Distribution

The probability distribution of a random variable describes how likely the different values of the variable are. For discrete and continuous random variables, we have different methods of representing these distributions:

  • For Discrete Random Variables: We use a probability mass function (PMF). The PMF gives the probability that the random variable takes on a specific value: $P(X = x) = p(x)$, where $X$ is the random variable and $x$ is a particular value.
  • For Continuous Random Variables: We use a probability density function (PDF). The PDF does not give the probability of a specific outcome (for a continuous variable, any single value has probability zero) but rather how densely probability is concentrated around each value. To get the probability of a continuous random variable lying within an interval, we integrate the PDF over that interval: $P(a \leq X \leq b) = \int_a^b f(x)\,dx$.
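
As a rough illustration of the difference, the sketch below sums a PMF for a fair die and numerically integrates a PDF over an interval. The exponential density with rate 0.5 and the helper `integrate` (a simple midpoint rule) are assumptions made for this example.

```python
import math
from fractions import Fraction

# PMF of a fair die: every face has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
p_even = sum(p for x, p in pmf.items() if x % 2 == 0)   # P(X is even)

def pdf(x, lam=0.5):
    """Density of an exponential variable with rate lam (assumed example)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

p_interval = integrate(pdf, 1.0, 3.0)        # P(1 <= X <= 3)
exact = math.exp(-0.5) - math.exp(-1.5)      # closed form, for comparison

print(p_even)
print(p_interval, exact)
```

The PMF values can be added directly, while the PDF only yields a probability once it is integrated over an interval.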

3. Cumulative Distribution Function (CDF)

The CDF of a random variable gives the probability that the variable will take a value less than or equal to a specific value. For both discrete and continuous variables, the CDF is denoted as:

$$F(x) = P(X \leq x)$$

The CDF is useful because it provides an aggregate view of the probability distribution and is always a non-decreasing function.
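
A small sketch of both cases, assuming a fair die for the discrete CDF and an exponential variable with rate 0.5 for the continuous one (both are illustrative choices). It also shows a common use of the CDF: the probability of an interval is a difference of two CDF values.

```python
import math
from fractions import Fraction

def die_cdf(x):
    """F(x) = P(X <= x) for a fair six-sided die."""
    faces_at_or_below = max(0, min(6, math.floor(x)))
    return Fraction(faces_at_or_below, 6)

def exp_cdf(x, lam=0.5):
    """F(x) = 1 - exp(-lam * x) for x >= 0, and 0 otherwise."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

# P(a < X <= b) = F(b) - F(a); here P(X in {3, 4, 5}).
p_3_to_5 = die_cdf(5) - die_cdf(2)

print(die_cdf(0), die_cdf(3.7), die_cdf(6))
print(p_3_to_5)
print(exp_cdf(1.0), exp_cdf(3.0))
```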

4. Expectation (Mean)

The expectation (or mean) of a random variable gives its average or expected value over many trials of the experiment. It is calculated differently for discrete and continuous random variables:

  • For Discrete Variables: $E[X] = \sum_x x \cdot P(X = x)$
  • For Continuous Variables: $E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$

Expectation is one of the most fundamental concepts in probability and provides a measure of the central tendency of a distribution.
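
The two formulas can be sketched as follows, assuming a fair die for the discrete case and an exponential density with rate 0.5 for the continuous case. The `integrate` helper and the finite upper limit of 60 are simplifications for illustration.

```python
import math
from fractions import Fraction

# Discrete: E[X] = sum over x of x * P(X = x), for a fair die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mean_die = sum(x * p for x, p in pmf.items())

# Continuous: E[X] = integral of x * f(x) dx, for an exponential density.
lam = 0.5

def f(x):
    return lam * math.exp(-lam * x)

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

# The upper limit 60 stands in for infinity; the tail beyond it is negligible here.
mean_exp = integrate(lambda x: x * f(x), 0.0, 60.0)

print(mean_die)    # 7/2
print(mean_exp)    # close to 1 / lam = 2
```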

5. Variance and Standard Deviation

Variance measures how much the values of the random variable differ from the mean. A large variance indicates that the random variable takes values that are spread out over a large range. The standard deviation is the square root of the variance and gives a measure of the spread of the random variable in the same units as the variable itself.

  • Variance: $\text{Var}(X) = E[(X - E[X])^2]$
  • Standard Deviation: $\sigma_X = \sqrt{\text{Var}(X)}$
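
As a worked example, the sketch below computes the variance and standard deviation of a fair die directly from the definition, and checks the result against the equivalent identity $\text{Var}(X) = E[X^2] - (E[X])^2$. The die is an assumed example.

```python
import math
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())

# Definition: expected squared deviation from the mean.
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
std_dev = math.sqrt(variance)

# Equivalent shortcut: E[X^2] - (E[X])^2.
variance_alt = sum(x * x * p for x, p in pmf.items()) - mean ** 2

print(variance, variance_alt)   # 35/12 35/12
print(std_dev)                  # in the same units as the die faces
```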

6. Covariance and Correlation

Covariance and correlation describe the relationship between two random variables.

  • Covariance: This measures how two random variables vary together. Positive covariance means that when one variable increases, the other tends to increase as well, while negative covariance means that when one increases, the other tends to decrease. $\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$
  • Correlation: This is a normalized version of covariance that gives the strength and direction of the linear relationship between two random variables. It is dimensionless and always lies between -1 and 1. $\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$
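
A minimal sketch of both quantities on a small made-up data set, treating the observed pairs as the whole population (so the sums are divided by the number of pairs). The variable names `hours` and `score` and their values are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

cov = covariance(hours, score)
corr = correlation(hours, score)
print(cov, round(corr, 4))
```

Rescaling either variable (for example, measuring hours in minutes) changes the covariance but leaves the correlation unchanged, which is why correlation is the easier of the two to compare across data sets.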

7. Independence

Two random variables $X$ and $Y$ are said to be independent if the occurrence of one does not affect the probability distribution of the other. Formally, for all values $x$ and $y$:

$$P(X = x, Y = y) = P(X = x) \cdot P(Y = y)$$

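For continuous variables, independence is defined in the same way in terms of densities: the joint probability density function must factor as $f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)$ for all $x$ and $y$.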
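
For discrete variables, the factorization can be checked directly from a joint PMF. The sketch below does this for two assumed examples: two separately rolled fair dice (independent), and one die paired with the total of both dice (not independent). The helper name `is_independent` is illustrative.

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """Check P(X=x, Y=y) == P(X=x) * P(Y=y) for every pair (x, y)."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginal PMFs, obtained by summing the joint PMF over the other variable.
    px = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    py = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in xs for y in ys)

# Two fair dice rolled separately: 36 equally likely pairs.
two_dice = {(a, b): Fraction(1, 36) for a, b in product(range(1, 7), repeat=2)}

# First die paired with the total of both dice.
die_and_sum = {}
for a, b in product(range(1, 7), repeat=2):
    key = (a, a + b)
    die_and_sum[key] = die_and_sum.get(key, 0) + Fraction(1, 36)

print(is_independent(two_dice))      # True
print(is_independent(die_and_sum))   # False
```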

8. Conditional Probability

Conditional probability measures the probability of an event, given that another event has occurred. If $A$ and $B$ are two events with $P(B) > 0$, the conditional probability of $A$ given $B$ is:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

This concept is especially important in Bayesian statistics, where we update our beliefs based on new information.
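
As a small worked example, take a fair die, let $A$ be "the roll is even" and $B$ be "the roll is greater than 3". The sketch below computes $P(A|B)$ from the definition; the events and helper names are chosen only for illustration.

```python
from fractions import Fraction

outcomes = range(1, 7)   # fair die: each outcome has probability 1/6

def prob(event):
    """Probability of an event, given as a predicate on a single outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def is_even(o):
    return o % 2 == 0

def above_three(o):
    return o > 3

p_b = prob(above_three)                                      # P(B) = 3/6
p_a_and_b = prob(lambda o: is_even(o) and above_three(o))    # P(A and B) = 2/6
p_a_given_b = p_a_and_b / p_b

print(prob(is_even), p_a_given_b)   # 1/2 2/3
```

Knowing that the roll exceeds 3 raises the probability of an even roll from 1/2 to 2/3, because two of the three remaining outcomes (4 and 6) are even.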

9. Bayes' Theorem

Bayes' Theorem relates the two conditional probabilities $P(A|B)$ and $P(B|A)$ and provides a way to update the probability of a hypothesis as more evidence becomes available. It is particularly useful in classification problems and machine learning.

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

This theorem is the foundation of Bayesian inference.
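
A classic way to see the update in action is a diagnostic test. In the sketch below, $A$ is "has the condition" and $B$ is "tests positive". All three input probabilities are hypothetical numbers picked for illustration, and $P(B)$ is obtained from the law of total probability.

```python
from fractions import Fraction

# Hypothetical inputs, chosen only for illustration.
p_a = Fraction(1, 100)               # P(A): prior probability of the condition
p_b_given_a = Fraction(95, 100)      # P(B|A): positive test when condition is present
p_b_given_not_a = Fraction(5, 100)   # P(B|not A): false positive rate

# P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
posterior = p_b_given_a * p_a / p_b

print(posterior, round(float(posterior), 4))   # 19/118 0.161
```

Even with a fairly accurate test, the posterior stays well below 1 here because the prior $P(A)$ is small, so most positive results come from the much larger group without the condition.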

10. Law of Large Numbers (LLN)

The Law of Large Numbers states that as the number of independent, identically distributed trials increases, the sample average of the outcomes converges to the expected value (provided that expected value exists). This law justifies why averages of large samples are often close to the theoretical expectation.
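
The effect is easy to observe by simulation. The sketch below averages rolls of a fair die (expected value 3.5) for increasing sample sizes; the seed and the sample sizes are arbitrary choices, and the exact numbers printed depend on them.

```python
import random

rng = random.Random(42)   # fixed seed so the run is repeatable
expected = 3.5            # E[X] for a fair six-sided die

def sample_mean(n):
    """Average of n simulated die rolls."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

means = {n: sample_mean(n) for n in (10, 100, 1_000, 10_000, 100_000)}

for n, m in means.items():
    print(f"n={n:>6}  mean={m:.4f}  error={abs(m - expected):.4f}")
```

The error does not shrink monotonically from one row to the next, but it tends to get smaller as the sample size grows.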

11. Central Limit Theorem (CLT)

The Central Limit Theorem is a key result in probability theory that explains why sums and averages of samples tend to be approximately normal (Gaussian) when the sample size is large, even if the underlying data are not. It states that the sum (or average) of a large number of independent, identically distributed random variables with finite variance will be approximately normally distributed, regardless of the shape of the original distribution of the variables.
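
A simulation sketch of this, assuming a strongly skewed starting distribution (exponential with rate 1). It standardizes many sample means and checks how often they fall within one and two standard errors of the true mean; for a normal distribution those fractions are about 0.683 and 0.954. The sample size, number of repetitions, and seed are arbitrary choices.

```python
import math
import random

rng = random.Random(7)
lam, n, repeats = 1.0, 50, 5_000
mu = 1.0 / lam       # mean of Exponential(lam)
sigma = 1.0 / lam    # standard deviation of Exponential(lam)

def standardized_mean():
    """Sample mean of n exponential draws, rescaled to mean 0 and variance 1."""
    xbar = sum(rng.expovariate(lam) for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))

z = [standardized_mean() for _ in range(repeats)]

within_one = sum(abs(v) <= 1 for v in z) / repeats
within_two = sum(abs(v) <= 2 for v in z) / repeats
print(within_one, within_two)   # compare with roughly 0.683 and 0.954
```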

12. Common Probability Distributions

Some widely used probability distributions in data science include:

  • Bernoulli Distribution: Models binary outcomes (0 or 1).
  • Binomial Distribution: Models the number of successes in a fixed number of Bernoulli trials.
  • Poisson Distribution: Models the number of events in a fixed interval of time or space, given the events happen independently at a constant rate.
  • Normal Distribution (Gaussian): A symmetric, bell-shaped distribution, widely used due to the CLT.
  • Exponential Distribution: Models the time between events in a Poisson process.
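
The distributions above can be sampled with nothing more than Python's standard `random` module, as in the sketch below. The parameter values are arbitrary, and the Poisson sampler is built by counting exponential gaps within one unit of time, which mirrors the relationship between the Poisson and exponential distributions described in the list.

```python
import random

rng = random.Random(1)

def bernoulli(p):
    """1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

def binomial(n, p):
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(bernoulli(p) for _ in range(n))

def poisson(rate):
    """Events in one unit of time when gaps between events are Exponential(rate)."""
    count, elapsed = 0, rng.expovariate(rate)
    while elapsed <= 1.0:
        count += 1
        elapsed += rng.expovariate(rate)
    return count

def mean(xs):
    return sum(xs) / len(xs)

trials = 20_000
samples = {
    "bernoulli(0.3)": [bernoulli(0.3) for _ in range(trials)],         # mean p = 0.3
    "binomial(10, 0.3)": [binomial(10, 0.3) for _ in range(trials)],   # mean n*p = 3
    "poisson(4)": [poisson(4.0) for _ in range(trials)],               # mean = rate = 4
    "normal(0, 1)": [rng.gauss(0.0, 1.0) for _ in range(trials)],      # mean 0
    "exponential(2)": [rng.expovariate(2.0) for _ in range(trials)],   # mean 1/2
}

for name, xs in samples.items():
    print(f"{name:<18} sample mean = {mean(xs):.3f}")
```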

Applications in Data Science

  1. Predictive Modeling: Many machine learning models, including logistic regression and Bayesian networks, rely on understanding probability distributions and conditional probabilities.
  2. Hypothesis Testing: Probability is essential in statistical inference, allowing us to judge whether an observed result is statistically significant, that is, unlikely to have arisen by chance alone.
  3. Uncertainty Quantification: In real-world applications, it’s essential to quantify the uncertainty of predictions, which often involves computing variances and making use of Bayesian methods.

Understanding probability theory deeply enhances the way you interpret data, design experiments, and implement algorithms in data science.

 
