Information theoretical choice among statistical models
A commonly used rule of thumb states that two models are indistinguishable by the AIC criterion if $|AIC_1 - AIC_2| < 2$.
As a rough rule of thumb, models having their AIC within 1–2 of the minimum have substantial support and should receive consideration in making inferences. Models having their AIC within about 4–7 of the minimum have considerably less support, while models with their AIC more than 10 above the minimum either have essentially no support, and might be omitted from further consideration, or at least fail to explain some substantial structural variation in the data.
Denote the AIC values of the candidate models by $AIC_1, AIC_2, AIC_3, \ldots, AIC_R$. Let $AIC_{\min}$ denote the minimum of those values. Then
$\exp((AIC_{\min} - AIC_i)/2)$ can be interpreted as the relative probability that the $i$th model minimizes the (expected estimated) information loss.
As an example, suppose that there were three models in the candidate set, with AIC values 100, 102, and 110.
Then the second model is $\exp((100 - 102)/2) = 0.368$ times as probable as the first model to minimize the information loss,
and the third model is $\exp((100 - 110)/2) = 0.007$ times as probable as the first model to minimize the information loss.
In this case, we might omit the third model from further consideration and take a weighted average of the first two models, with weights 1 and 0.368, respectively. Statistical inference would then be based on the weighted multimodel.
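The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the function names `relative_likelihoods` and `akaike_weights` are made up for illustration, and the normalization step (dividing by the sum so the weights add up to 1) is an added convenience, not something the example above requires.

```python
import math

def relative_likelihoods(aic_values):
    """Return exp((AIC_min - AIC_i) / 2) for each candidate model."""
    aic_min = min(aic_values)
    return [math.exp((aic_min - aic) / 2.0) for aic in aic_values]

def akaike_weights(aic_values):
    """Rescale the relative likelihoods so that they sum to 1 (illustrative helper)."""
    rel = relative_likelihoods(aic_values)
    total = sum(rel)
    return [r / total for r in rel]

# The three candidate models from the example above.
rel = relative_likelihoods([100.0, 102.0, 110.0])
print([round(r, 3) for r in rel])  # [1.0, 0.368, 0.007]
```

Working with differences from the minimum keeps the exponentials well scaled even when the raw AIC values are large.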
AIC is less preferable for large-scale data sets.
In addition to BIC, you may find it useful to apply the bias-corrected version of AIC, AICc (you may use this
R code or use the formula $AICc = AIC + \frac{2p(p+1)}{n-p-1}$, where $p$ is the number of estimated parameters and $n$ is the sample size).
The same rule of thumb applies.
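As a rough sketch, the AICc correction can be written as a small Python helper. The function name `aicc` and the guard against a non-positive denominator are illustrative choices, not part of any particular library.

```python
def aicc(aic, n, p):
    """Bias-corrected AIC: AIC + 2p(p+1) / (n - p - 1).

    aic: ordinary AIC value
    n:   sample size
    p:   number of estimated parameters
    """
    if n - p - 1 <= 0:
        # The correction is undefined when the sample is this small.
        raise ValueError("AICc requires n > p + 1")
    return aic + (2.0 * p * (p + 1)) / (n - p - 1)

# With n = 20 observations and p = 3 parameters the penalty adds 24 / 16.
print(aicc(100.0, n=20, p=3))  # 101.5
```

The correction term shrinks towards zero as n grows, so AICc and AIC agree for large samples.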
One cannot compare two models if they do not model the same variable.
AIC should work when comparing both nested and nonnested models.
A Gaussian log-likelihood is given by $\log L(\theta) = -\frac{|D|}{2}\log(2\pi) - \frac{1}{2}\log|K| - \frac{1}{2}(x-\mu)^T K^{-1}(x-\mu)$, where $K$ is the covariance structure of your model (and $|K|$ its determinant), $|D|$ is the number of points in your dataset, $\mu$ is the mean response and $x$ is your dependent variable.
AIC is calculated as $2k - 2\log(L)$, where $k$ is the number of fixed effects in your model and $L$ is your likelihood function.
In practice it expresses a trade-off between variance ($2k$) and bias ($2\log(L)$) in your modelling assumptions.
When you calculate your log-likelihood, in practice you look at two terms: a fit term, given by $-\frac{1}{2}(x-\mu)^T K^{-1}(x-\mu)$, and a complexity penalization term, given by $-\frac{1}{2}\log|K|$.
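To make the two terms concrete, here is a minimal pure-Python sketch that evaluates the Gaussian log-likelihood piece by piece and then forms AIC as 2k − 2log(L). The helper names (`cholesky`, `gaussian_loglik_terms`, `aic_from_loglik`) are made up for illustration, and the covariance matrix K is assumed to be symmetric positive definite and passed as a list of lists.

```python
import math

def cholesky(K):
    """Lower-triangular L with L L^T = K (K symmetric positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L

def gaussian_loglik_terms(x, mu, K):
    """Return the (constant, complexity, fit) terms of the Gaussian log-likelihood."""
    n = len(x)
    L = cholesky(K)
    # Solve L z = (x - mu) by forward substitution, so z.z = (x-mu)^T K^{-1} (x-mu).
    r = [xi - mi for xi, mi in zip(x, mu)]
    z = []
    for i in range(n):
        z.append((r[i] - sum(L[i][m] * z[m] for m in range(i))) / L[i][i])
    constant = -0.5 * n * math.log(2.0 * math.pi)
    # log|K| = 2 * sum(log(diag(L))), so this is -1/2 log|K|.
    complexity = -sum(math.log(L[i][i]) for i in range(n))
    fit = -0.5 * sum(zi * zi for zi in z)
    return constant, complexity, fit

def aic_from_loglik(loglik, k):
    """AIC = 2k - 2 log(L)."""
    return 2.0 * k - 2.0 * loglik

# Toy inputs chosen only for illustration.
terms = gaussian_loglik_terms(x=[1.0, 1.0], mu=[0.0, 0.0], K=[[2.0, 1.0], [1.0, 2.0]])
print(aic_from_loglik(sum(terms), k=2))
```

The Cholesky factor is used so that neither the inverse nor the determinant of K has to be formed explicitly; for real work a linear algebra library would normally take its place.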
As an aside, on Wikipedia AIC is also defined as $|D| \log\left(\frac{RSS}{|D|}\right) + 2k$; this form makes it even more obvious why models with different dependent variables are not comparable: the RSS values in the two cases are simply not on a comparable scale.
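A small numerical sketch illustrates the point about the RSS form. The function name `aic_from_rss` and the numbers are hypothetical. Rescaling the response by a factor c multiplies the RSS by c², which shifts AIC by |D| log(c²) even though the model itself is unchanged.

```python
import math

def aic_from_rss(rss, n, k):
    """RSS form of AIC: n * log(RSS / n) + 2k, with n = |D| data points."""
    return n * math.log(rss / n) + 2.0 * k

n, k = 50, 3
rss = 200.0          # hypothetical residual sum of squares
c = 10.0             # same response, expressed in units ten times smaller

aic_original = aic_from_rss(rss, n, k)
aic_rescaled = aic_from_rss(rss * c**2, n, k)

# The gap depends only on the change of units, not on model quality.
print(aic_rescaled - aic_original, n * math.log(c**2))
```

Because a mere change of units moves the value this much, AIC values are only meaningful relative to other models fitted to the same response.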
AIC is based on the KL divergence (roughly speaking, the difference between two distributions). It works by approximating the unknown true distribution of your data and comparing that to the distribution of the data your model assumes. That's why "a smaller AIC score is better": you are closer to the approximate true distribution of your data.
When using AIC:
- You cannot use it to compare models fitted to different data sets.
- You should use the same response variable for all the candidate models.
- You should have $|D| \gg k$, because otherwise you do not get good asymptotic consistency.
Akaike Information Criterion, Shuhua Hu, (Presentation p.17-18)
Applied Multivariate Statistical Analysis, Johnson & Wichern, 6th Ed. (p. 386-387)
A new look at the statistical model identification, H. Akaike, IEEE Transactions on Automatic Control 19 (6): 716–723 (1974)
Model Selection Tutorial #1: Akaike’s Information Criterion, D. Schmidt and E. Makalic, (Presentation p.39)