Introduction:
Picture credit: Peter Laurinec.
I have been studying the PyGAM package for the last couple of days. Now I plan to analyze the code of the PyGAM package thoroughly, with the necessary description of the GAM model and references wherever needed. This is going to be a long post and very technical in nature.
Pre-requisites:
To understand the code of the PyGAM package, you first have to know what a GAM model is. GAM stands for generalized additive model, i.e. a type of statistical model where a target variable Y is roughly represented by an additive combination of a set of different functions. As a formula it can be written as:

g(E[Y]) = β₀ + f₁(x₁) + f₂(x₂) + … + fₙ(xₙ)

where g is called the link function and the fᵢ are different types of functions. In technical terms, in a GAM model the link-transformed expectation of the target variable is assumed to have an additive form in different functions of the independent variables.
Now, there are a number of link functions g(·) used in practice, and GAM models are often named according to the link function and the assumed distribution of Y. For example, if you use the logit function as the link, your GAM model is called LogisticGAM. Similarly, if you use the identity function as the link and consider Y to follow a normal distribution, the GAM model you are building is called LinearGAM. There are a number of other GAM models along the same lines, but we will focus on the different types of GAM models later on.
The second prerequisite is to know about spline functions: what B-splines, P-splines, cyclic splines and the other spline families are, and how they form a basis for spline functions. I have discussed spline functions in this post, and you can also read about them in detail in this pdf here.
For the time being, it is enough to know that B-splines, cyclic splines and a few similar families of functions form bases for the space of spline functions.
Given that, we can rewrite each function as a sum of basis functions, and then the job of fitting a model is nothing but finding the coefficients of those basis functions.
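To make this concrete, here is a minimal sketch of a function written as a weighted sum of B-spline basis functions. It uses scipy rather than pygam, and the knots and coefficients are made-up numbers, purely for illustration:

```python
import numpy as np
from scipy.interpolate import BSpline

# A cubic B-spline basis on [0, 1]; knot vector with repeated boundary knots.
degree = 3
n_basis = 8
interior = np.linspace(0, 1, n_basis - degree + 1)
knots = np.concatenate(([0.0] * degree, interior, [1.0] * degree))

# Hypothetical coefficients beta_k; fitting a GAM means estimating these.
coefs = np.random.default_rng(0).normal(size=n_basis)
f = BSpline(knots, coefs, degree)   # f(x) = sum_k beta_k * B_k(x)

x = np.linspace(0, 1, 5)
print(f(x))                         # evaluate the additive combination of basis functions
```

Fitting a GAM then amounts to estimating a set of coefficients like coefs above for every term, subject to a penalty we will meet later.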
Now, let's jump into the code and understand what is happening under the hood of the PyGAM project.
Basic introduction:
First things first: to understand the code, one has to start from the pygam.py file in the master branch. In the pygam package, one can use the following ready-made GAM models:
(1) linear gam
(2) logistic gam
(3) expectile gam
(4) poisson gam
(5) gamma gam
(6) inverse gaussian gam
Now, from the documentation we can see that linear gam is a GAM model where the link function is the identity function and the underlying error distribution of Y is the normal distribution.
For logistic gam, the link function is the logit function and the error distribution is binomial.
For poisson gam, the link function is the log function and the error distribution is poisson.
For gamma gam, the link function provided is the log function and the error follows a gamma distribution. The point mentioned here is that although the canonical link function for the gamma distribution is the inverse function ( f(x) = 1/x ), it is not used because of numerical instability.
For inverse gaussian gam, the link function provided is the log function and the error follows the inverse gaussian distribution. Here again, the same note as for gamma gam applies.
Pygam also provides an expectile gam, which uses a normal distribution for the error and an identity link, but it minimizes an asymmetrically weighted least squares error rather than the ordinary least squares error of linear gam.
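As a quick illustration, here is how a few of these ready-made models are typically instantiated; the data below is random and only meant to show the call pattern:

```python
import numpy as np
from pygam import LinearGAM, LogisticGAM, ExpectileGAM, s

X = np.random.rand(200, 2)
y_cont = X[:, 0] ** 2 + np.random.normal(scale=0.1, size=200)   # continuous target
y_bin = (y_cont > y_cont.mean()).astype(int)                    # binary target

lin_gam = LinearGAM(s(0) + s(1)).fit(X, y_cont)     # identity link, normal errors
log_gam = LogisticGAM(s(0) + s(1)).fit(X, y_bin)    # logit link, binomial errors
exp_gam = ExpectileGAM(s(0) + s(1), expectile=0.9).fit(X, y_cont)  # asymmetric loss

lin_gam.summary()
```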
Now, the important point is that while these ready-made GAM models are quite enough to start your modeling work, they may not always be sufficient. Therefore, you can always try creating custom GAM models by specifying the link and the distribution on your own.
This is reflected in the code: all the other ready-made models are derived classes of the base GAM class, which can be given link and distribution parameter values. This is where we start to look beyond the pygam.py file into the different links and distributions available in pygam. They are defined in the links.py and distributions.py files under the pygam folder in the master branch.
Currently the links available are:
(1) identity
(2) logit
(3) inv
(4) inv_squared
(5) log
And the distributions available are:
(1) normal, (2) poisson, (3) binomial, (4) gamma, and (5) inverse gaussian distributions.
So basically, although you can run into numerical problems and unstable or unreliable results, you can pretty much create any combination of these links and distributions.
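For example, a custom GAM with a gamma error distribution and a log link can be set up roughly like this; the string names for distribution and link are the short names from distributions.py and links.py, so check them against your installed version:

```python
import numpy as np
from pygam import GAM, s

X = np.random.rand(200, 1)
y = np.random.gamma(shape=2.0, scale=np.exp(1.0 + 2.0 * X[:, 0]))  # positive target

# A custom GAM: choose the error distribution and the link by name,
# instead of using a ready-made class such as GammaGAM.
gam = GAM(s(0), distribution='gamma', link='log').fit(X, y)
gam.summary()
```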
Terms and their specifications:
Now, the best part of GAM models is the non-linear functions which we can combine additively. Here it is necessary to note that there are four different terms we can use (a short usage sketch follows the list):
(1) spline term: minimal name is s, denotes univariate spline functions
(2) factor term: minimal name is f, denotes univariate functions of a categorical (factor) variable, which may be numerically coded
(3) linear term: minimal name is l, denotes univariate linear functions
(4) tensor term: minimal name is te, denotes bivariate non-linear functions, generally used to depict interaction terms
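A minimal sketch of how these four terms are combined when building a model; the column indices and the toy data are made up for illustration:

```python
import numpy as np
from pygam import LinearGAM, s, f, l, te

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(size=300),           # column 0: continuous, non-linear effect
    rng.integers(0, 4, size=300),    # column 1: categorical, coded 0..3
    rng.uniform(size=300),           # column 2: continuous, linear effect
    rng.uniform(size=300),           # column 3: interacts with column 0
])
y = (np.sin(4 * X[:, 0]) + X[:, 1] + 0.5 * X[:, 2]
     + X[:, 0] * X[:, 3] + rng.normal(0, 0.1, 300))

# s() for a smooth, f() for a factor, l() for a linear effect,
# te() for a tensor-product interaction between two features.
gam = LinearGAM(s(0) + f(1) + l(2) + te(0, 3)).fit(X, y)
gam.summary()
```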
Let's dive into each type of term and when to use it.
Spline terms details:
Spline terms are normally used when one knows that the relation between an independent variable and the dependent/target variable is non-linear in nature and can best be depicted by a smooth, piecewise-polynomial curve. For controlling the shape of a spline term we use two parameters: n_splines, which is an integer, and lam, short for lambda, the penalization parameter. There are some other parameters as well, but we will focus on them later.
For now, let's understand what n_splines and lam are and how they change the shape of the spline function being fitted for the feature.
To understand how the shape is determined by n_splines and lam, you will need the idea of a B-spline basis. Basically, n_splines is the number of B-spline basis functions used for the spline term we are trying to estimate. So the higher you set n_splines, the more flexible the curve fitted to your data for that particular feature becomes (the polynomial degree of each basis function is controlled separately, by the spline_order parameter).
Now, another issue with fitting a very flexible curve is that it can get overfitted to the data. That's where lambda comes into play. Here, lambda is a regularization parameter. In the case of a GAM model, to stop overfitting, we add a regularization term to the objective function being minimized. That regularization term depends on the wiggliness of the functions being fitted.
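Both knobs are simply passed to the term. A small sketch (with arbitrary values) fitting the same toy data with a weak and a strong penalty:

```python
import numpy as np
from pygam import LinearGAM, s

X = np.linspace(0, 1, 200).reshape(-1, 1)
y = np.sin(8 * X[:, 0]) + np.random.normal(scale=0.2, size=200)

# More basis functions (n_splines) -> a more flexible curve;
# a larger penalty (lam) -> a smoother, less wiggly curve.
flexible = LinearGAM(s(0, n_splines=25, lam=0.01)).fit(X, y)
smooth = LinearGAM(s(0, n_splines=25, lam=100)).fit(X, y)
```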
In mathematical terms, this wiggliness term is:

∫ (f''(x))² dx

i.e. the wiggliness is measured by the integral of the square of the second derivative of the spline function, and it enters the objective multiplied by lambda. Now, you may ask, how does this decrease wiggliness?
Clearly, the integral of the squared second derivative increases as the curve becomes more wiggly and changes direction up and down more often. Therefore, adding this term to the objective function being minimized resists the creation of highly wiggly curves and therefore overfitting.
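In practice you rarely pick lam by hand. pygam's gridsearch helper tries a grid of lambda values and keeps the model with the best score; the grid below is just an assumed starting point:

```python
import numpy as np
from pygam import LinearGAM, s

X = np.linspace(0, 1, 200).reshape(-1, 1)
y = np.sin(8 * X[:, 0]) + np.random.normal(scale=0.2, size=200)

gam = LinearGAM(s(0, n_splines=25))
# Search over several orders of magnitude of the smoothing penalty.
gam.gridsearch(X, y, lam=np.logspace(-3, 3, 11))
print(gam.lam)   # the penalty chosen by the search
```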
Now, other than n_splines and lambda, there is also an option to force a term to be strictly increasing or strictly decreasing. For this there is a parameter called constraints. If you want your spline term to be monotonically increasing, you have to set constraints equal to monotonic_inc. Similarly, to make it monotonically decreasing, constraints has to be set to monotonic_dec. At this point you may wonder how this option works.
Inside the package, for these constraints, the program sets a very high penalty weight (equal to 10^9), which makes it practically impossible for the fitted curve to break the condition imposed on it. Also, note that if you are setting monotonicity constraints in a GAM, it is not possible to tune the lambda associated with that constraint, as it is an internal parameter.
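Setting a constraint is just another keyword on the term, for example:

```python
from pygam import LinearGAM, s

# Force the smooth on feature 0 to be monotonically increasing
# (use 'monotonic_dec' for a decreasing shape). The large internal
# constraint penalty is applied automatically.
gam = LinearGAM(s(0, constraints='monotonic_inc'))
```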
Factor terms:
This is pretty much a similar class, and it reuses most of the functionality of the spline term, but in meaning, factor terms are different. Factor terms are meant for variables which act as factors, i.e. categorical groupings, rather than continuous drivers of the dependent variable. In the case of a factor term, the variable is treated as categorical, and if you have not one-hot encoded it yourself, the term embeds dummy variables for you. Other than that, there is not much to say about these.
The other two terms are linear terms and intercept terms. The intercept term is fitted by default, whenever fit_intercept is set to True, and it adds one constant value to the equation.
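A small sketch combining a linear term, a factor term and the intercept; here column 1 holds integer-coded categories, and the factor term expands them into dummy variables internally:

```python
import numpy as np
from pygam import LinearGAM, l, f

rng = np.random.default_rng(1)
X = np.column_stack([
    rng.uniform(size=200),           # column 0: continuous, linear effect
    rng.integers(0, 3, size=200),    # column 1: a 3-level factor
])
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, 200)

# fit_intercept=True (the default) adds a constant intercept term to the model.
gam = LinearGAM(l(0) + f(1), fit_intercept=True).fit(X, y)
gam.summary()
```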
Understanding the computation of p-value:
It is a known bug in the pygam package that the reported p-values are smaller than they should be. To understand this bug, we first have to understand how pygam currently solves the problem and what alternatives the literature provides.
Code-navigation-wise, the main function to compute the p-value lies inside the pygam.py file, as the _compute_p_value method, which calculates the p-value for a desired feature. According to the documentation, it follows the version in Simon Wood's book for computing the p-value; this is correct for the un-penalized version of the model, but the values are much smaller than they should be for the penalized version.
Now, let's see inside the code.
First, as we can see, there are some utility lines, such as lines 1262 to 1264. In the case of spline terms, in lines 1267-68, the spline coefficients are centered by subtracting their mean.
Now we come to the main lines, 1270-80. First, let's focus on line 1270. From the utility lines, cov is the covariance matrix of the term being considered, and pinv is the function that calculates the Moore-Penrose pseudo-inverse of a matrix. Hence, inv_cov is the pseudo-inverse of the covariance matrix and rank is the rank of that matrix.
Then the score is created. The score follows the formula

Bᵀ Σ⁻¹ B

where B is the coefficient vector for the term and Σ is the covariance matrix for the term.
Now, if you refer to the reference (Simon Wood's book, page 185), you will read that this quantity Bᵀ Σ⁻¹ B, divided by the rank of the matrix, follows an F(rank, n_samples − edof) distribution under the null hypothesis. And that's how the p-value is calculated.
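The same statistic can be sketched outside pygam with plain numpy and scipy. This mirrors the formula above rather than pygam's exact code, and the inputs at the bottom are placeholders, not values from a real fit:

```python
import numpy as np
from scipy import stats

def term_p_value(B, cov, n_samples, edof):
    """Wald-style p-value for one term: (B' pinv(cov) B) / rank ~ F(rank, n - edof)."""
    inv_cov = np.linalg.pinv(cov)          # Moore-Penrose pseudo-inverse
    rank = np.linalg.matrix_rank(cov)
    score = B @ inv_cov @ B                # B' Sigma^{-1} B
    f_stat = score / rank
    return stats.f.sf(f_stat, rank, n_samples - edof)

# Placeholder inputs, just to show the call.
B = np.array([0.5, -0.2, 0.1])
cov = np.diag([0.04, 0.05, 0.03])
print(term_p_value(B, cov, n_samples=200, edof=10.0))
```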
Here, I would like to point out that there is a conditioning issue with the underlying covariance matrix. The condition number of the covariance matrix is very high in many cases, which produces very large values in the Moore-Penrose pseudo-inverse; that in turn very often yields a high F-statistic and an extremely low p-value.
This is issue number 163, and it is one of the biggest drawbacks of this package at the current time.
Describing the different types of callbacks:
In a model, callbacks are used to capture how the model develops during training. In pygam, the callbacks record quantities such as deviance, diffs and the coefficients to observe the development of the model during training.
If you are used to working with such callbacks, you can skip this part and go forward. Now, let's discuss them.
(1) deviance: deviance is a goodness-of-fit statistic generally used for generalized linear models and generalized additive models whose error distribution belongs to an exponential dispersion family. Deviance can be seen as (twice) the difference between the log-likelihood of a saturated model and the log-likelihood of the model with the currently fitted parameters. Deviance is used to track how the model is training.
(2) diffs: diffs is basically the difference (measured as a norm) between the model coefficients of two successive iterations. From the definition itself, this callback is used to track the progress of model training; a short sketch of requesting these callbacks follows.
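Requesting callbacks is a constructor argument on the GAM classes. A rough sketch (the logs_ attribute holding the recorded history is how recent pygam versions expose it, so verify against your installed version):

```python
import numpy as np
from pygam import LinearGAM, s

X = np.linspace(0, 1, 200).reshape(-1, 1)
y = np.sin(6 * X[:, 0]) + np.random.normal(scale=0.2, size=200)

# Track deviance and coefficient diffs at every iteration of the fit.
gam = LinearGAM(s(0), callbacks=['deviance', 'diffs']).fit(X, y)

# Assumed attribute: the per-iteration history collected by the callbacks.
print(gam.logs_.keys())
```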
Other than these two, accuracy and even the coefficients themselves are also available as callbacks to continually track model training.
Among these, diffs is used to track whether the model is converging to its final version, and the stopping tolerance is often set based on such difference callbacks. Next, we will focus on more GAM modeling procedures and datasets.
Understanding the different datasets linked in the package:
We will explore the different datasets embedded in this package next time.