We want models with a few agents, rather than those with only one
or two or infinitely many.

We want to understand agents that are neither extremely brilliant nor extremely stupid, but rather live somewhere in the middle.

It is the interest in between stasis and utter chaos. The world tends not
to be completely frozen or random, but rather it exists in between these
two states. It is the interest in between control and anarchy. We find robust
patterns of organization and activity in systems that have no central
control or authority. 

We have corporations and human bodies that maintain a recognizable form and activity over long periods of time, even though their constituent parts exist on time scales that are orders of magnitude less long lived.

It is the interest in between the continuous and the discrete. The
behavior of systems as we transition between the continuous and discrete
is often surprising. Many systems do not smoothly move between these
two realms, but instead exhibit quite different patterns of behavior, even
though from the outside they seem so “close.”

my interpretation of Page 7

In a complicated system, the various elements that make up
the system maintain a degree of independence from one another such that
removing one element does not fundamentally alter the system’s behavior apart from that which directly resulted from the piece that was removed.

In a complex system, the dependencies among the elements are such that removing one element changes system behavior to an extent that goes well beyond what is embodied by the particular element that is removed.

Complex systems can be fragile. They can also exhibit an unusual degree of robustness to less radical changes in their component parts as a result of a very powerful organizing force that can overcome a variety of changes to the lower-level components.

my interpretation of Page 9

---

Social agents must predict and react to the actions and predictions of other agents. (p.10)

---

Miller, John H., and Scott E. Page. 2007. Complex Adaptive Systems: An Introduction to Computational Models of Social Life. 1st edition. Princeton, NJ: Princeton University Press.

=============================================================================

A free course on Modelling

https://class.coursera.org/modelthinking-004/lecture/index

Statistical Models

http://ntl.bts.gov/DOCS/98133/ch05/ch05.html

Logistic Modeling

http://ntl.bts.gov/DOCS/98133/ch05/body_ch05_04.html

 

List of computer simulation software

List of discrete event simulation software

Rubin causal model

https://en.wikipedia.org/wiki/Rubin_causal_model

Information theoretical choice among statistical models

Akaike information criterion

A commonly used rule of thumb states that two models are indistinguishable by the AIC criterion if the difference |AIC_1 − AIC_2| < 2.

As a rough rule of thumb, models having their AIC within 1–2 of the minimum have substantial support and should receive consideration in making inferences. Models having their AIC within about 4–7 of the minimum have considerably less support, while models with their AIC > 10 above the minimum have either essentially no support and might be omitted from further consideration or at least fail to explain some substantial structural variation in the data.

Denote the AIC values of the candidate models by AIC_1, AIC_2, AIC_3, ..., AIC_R. Let AIC_min denote the minimum of those values. Then

exp((AIC_min − AIC_i)/2) can be interpreted as the relative probability that the ith model minimizes the (expected estimated) information loss.

As an example, suppose that there were three models in the candidate set, with AIC values 100, 102, and 110.

Then the second model is exp((100 − 102)/2) = 0.368 times as probable as the first model to minimize the information loss,

and the third model is exp((100 − 110)/2) = 0.007 times as probable as the first model to minimize the information loss.

In this case, we might omit the third model from further consideration and take a weighted average of the first two models, with weights 1 and 0.368, respectively. Statistical inference would then be based on the weighted multimodel.
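
The arithmetic in this example is easy to reproduce. A minimal sketch in Python, assuming only the formula exp((AIC_min − AIC_i)/2) quoted above; the function names are my own, and the normalization into weights that sum to 1 is an extra step that the quoted text does not perform:

```python
import math

def relative_likelihoods(aic_values):
    """Return exp((AIC_min - AIC_i) / 2) for each candidate model."""
    aic_min = min(aic_values)
    return [math.exp((aic_min - aic) / 2.0) for aic in aic_values]

def akaike_weights(aic_values):
    """Normalize the relative likelihoods so that they sum to 1 (my addition)."""
    rel = relative_likelihoods(aic_values)
    total = sum(rel)
    return [r / total for r in rel]

aics = [100.0, 102.0, 110.0]           # the three AIC values from the example
rel = relative_likelihoods(aics)
print([round(r, 3) for r in rel])      # [1.0, 0.368, 0.007]
print([round(w, 3) for w in akaike_weights(aics)])
```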

AIC is less preferable for large-scale data sets.

https://en.wikipedia.org/wiki/Bayesian_information_criterion

In addition to BIC, you may find it useful to apply the bias-corrected version of AIC, AICc (you may use this R code or use the formula AICc = AIC + 2p(p+1)/(n − p − 1), where p is the number of estimated parameters and n is the sample size).

Rule-of-thumb will be the same.
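
To see how much the correction matters, here is a small sketch of AICc = AIC + 2p(p+1)/(n − p − 1) in Python. The function name and the numbers (AIC = 100, p = 3, n = 20 or 2000) are made up for illustration only:

```python
def aicc(aic, p, n):
    """Small-sample corrected AIC: AIC + 2p(p+1)/(n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("AICc needs n > p + 1")
    return aic + (2.0 * p * (p + 1)) / (n - p - 1)

# Hypothetical numbers: the same AIC and p at two sample sizes.
small = aicc(aic=100.0, p=3, n=20)     # correction = 24 / 16 = 1.5
large = aicc(aic=100.0, p=3, n=2000)   # correction = 24 / 1996, about 0.012
print(small, round(large, 3))          # 101.5 100.012
```

The correction is noticeable when n is only a few times p and fades away as n grows, so AICc and AIC agree for large samples.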

http://stats.stackexchange.com/questions/8557/testing-the-difference-in-aic-of-two-non-nested-models

One cannot compare two models if they do not model the same variable.

AIC should work when comparing both nested and nonnested models.

A Gaussian log-likelihood is given by: log L(θ) = −(|D|/2) log(2π) − (1/2) log|K| − (1/2)(x−μ)^T K^{−1} (x−μ), where K is the covariance matrix of your model (|K| is its determinant), |D| is the number of points in your dataset, μ is the mean response, and x is your dependent variable.

AIC is calculated as 2k − 2 log(L),
where k is the number of estimated parameters in your model
and L is the maximized value of your likelihood function.

  • if L increases (better likelihood), AIC decreases
  • lower AIC indicates a preferred model among competing models

Important:

  • AIC is only meaningful comparatively between models fitted to the same data.
  • Absolute AIC values do not have standalone interpretation.

Historical source:

  • Akaike introduced it as an estimator of expected information loss between model and reality.

In practice it compares the trade-off between variance (the 2k term) and bias (the 2 log(L) term) in your modelling assumptions.

When you calculate your log-likelihood, in practice you look at two terms: a fit term, −(1/2)(x−μ)^T K^{−1} (x−μ), and a complexity penalization term, −(1/2) log|K|.

Besides the Wikipedia definition, AIC is also defined as |D| log(RSS/|D|) + 2k [3]; this form makes it even more obvious why models with different dependent variables are not comparable: the RSS values in the two cases are simply not comparable.
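
A short sketch of the RSS form, n log(RSS/n) + 2k, comparing a constant model with a straight-line model fitted to the same response. The data are made up, the helper names are my own, and k simply counts the regression coefficients here; this is an illustration under those assumptions, not a general-purpose implementation:

```python
import math

def aic_rss(rss, n, k):
    """Least-squares form of AIC: n * log(RSS / n) + 2k."""
    return n * math.log(rss / n) + 2 * k

def rss_constant(ys):
    """RSS of the model y = a (one coefficient)."""
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)

def rss_line(xs, ys):
    """RSS of the model y = a + b*x (two coefficients), by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data with a clear linear trend; both models use the same ys.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n = len(ys)
aic_const = aic_rss(rss_constant(ys), n, k=1)
aic_line = aic_rss(rss_line(xs, ys), n, k=2)
print(aic_const, aic_line)   # the lower value marks the preferred model
```

Both values depend on the scale of ys through the RSS, which is why the comparison only makes sense when every candidate models the same response.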

AIC is based on the KL divergence (roughly speaking, a measure of the difference between two distributions). It estimates how much information is lost when the distribution your model assumes is used to approximate the unknown true distribution of your data. That’s why “a smaller AIC score is better”: you are closer to the true distribution of your data.

Using AIC:

  • You cannot use it to compare models fitted to different data sets.
  • You should use the same response variable for all the candidate models.
  • You should have |D| >> k, because otherwise you do not get good asymptotic consistency.

[1] http://en.wikipedia.org/wiki/Akaike_information_criterion
[2] Akaike Information Criterion, Shuhua Hu, (Presentation p.17-18)
[3] Applied Multivariate Statistical Analysis, Johnson & Wichern, 6th Ed. (p. 386-387)
[4] A new look at the statistical model identification, H. Akaike, IEEE Transactions on Automatic Control 19 (6): 716–723 (1974)
[5] Model Selection Tutorial #1: Akaike’s Information Criterion, D. Schmidt and E. Makalic, (Presentation p.39)

http://stats.stackexchange.com/questions/48714/prerequisites-for-aic-model-comparison

 

 

******************************************

How Maximized Likelihood Is Calculated

Suppose you have observed data:

y_1,y_2,\dots,y_n

and a model with parameters \theta.

The likelihood is:

L(\theta)=P(\text{data}\mid \theta)

for discrete models, or

L(\theta)=f(\text{data}\mid \theta)

for continuous models.

The “maximized likelihood” means:

  • vary the parameters
  • find the parameter values that make the observed data most probable

That maximum value is the L inserted into AIC.


Concrete Example

Suppose you measure heights:

170,172,168

and assume they come from a normal distribution:

X \sim N(\mu,\sigma^2)

For simplicity assume \sigma=2 is known.

We want to estimate \mu.


Step 1: Write probability density

For a normal distribution:

f(x\mid \mu)=
\frac{1}{\sqrt{2\pi\sigma^2}}
e^{-\frac{(x-\mu)^2}{2\sigma^2}}

Since observations are assumed independent:

L(\mu)
=
\prod_{i=1}^n f(y_i\mid \mu)

So:

L(\mu)
=
f(170\mid\mu)
f(172\mid\mu)
f(168\mid\mu)

Step 2: Plug numbers

Using \sigma=2:

L(\mu)
=
\left(\frac1{\sqrt{8\pi}}\right)^3
e^{-\frac{
(170-\mu)^2+
(172-\mu)^2+
(168-\mu)^2
}{8}}

Now the likelihood is a function of \mu.


Step 3: Maximize likelihood

We ask:

Which \mu makes this expression largest?

Because the exponent is −1/8 times the sum of squared distances from \mu, the likelihood is largest when that sum is smallest, and this happens when:

\mu = \text{sample mean}

So:

\hat\mu = \frac{170+172+168}{3}=170

This is the maximum likelihood estimator (MLE).
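
The claim that the sample mean maximizes the likelihood can be checked with one line of calculus. A sketch, taking the logarithm of the Step 2 likelihood (the logarithm is increasing, so it has the same maximizer) and setting the derivative to zero:

```latex
\ln L(\mu) = -\frac{3}{2}\ln(8\pi) - \frac{1}{8}\sum_{i=1}^{3}(y_i-\mu)^2

\frac{d}{d\mu}\ln L(\mu) = \frac{1}{4}\sum_{i=1}^{3}(y_i-\mu) = 0
\quad\Longrightarrow\quad
\hat\mu = \frac{1}{3}\sum_{i=1}^{3} y_i

\frac{d^2}{d\mu^2}\ln L(\mu) = -\frac{3}{4} < 0
```

The negative second derivative confirms that the stationary point is a maximum.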


Step 4: Compute maximized likelihood

Now substitute:

\mu=170

into the likelihood formula.

Then:

L_{\max}
=
\left(\frac1{\sqrt{8\pi}}\right)^3
e^{-\frac{
0^2+2^2+(-2)^2
}{8}}
=
\left(\frac1{\sqrt{8\pi}}\right)^3 e^{-1}

Numerically:

L_{\max}\approx 0.00292

That is the maximized likelihood.


Step 5: Insert into AIC

There is 1 estimated parameter:

k=1

So:

AIC=2k-2\ln(L_{\max})

becomes:

AIC
=
2(1)-2\ln(0.00292)
\approx 13.67
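
Steps 1 to 5 can be checked numerically. A minimal sketch in Python using only the standard library, with the three heights and \sigma=2 taken from the example; the helper name normal_pdf is my own:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

heights = [170.0, 172.0, 168.0]
sigma = 2.0                              # assumed known, as in the example
mu_hat = sum(heights) / len(heights)     # MLE of mu: the sample mean

# Maximized likelihood: product of the three densities evaluated at mu_hat.
l_max = math.prod(normal_pdf(y, mu_hat, sigma) for y in heights)

k = 1                                    # one estimated parameter (mu)
aic = 2 * k - 2 * math.log(l_max)
print(mu_hat, round(l_max, 5), round(aic, 2))   # 170.0 0.00292 13.67
```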

Important Conceptual Point

Likelihood is NOT:

P(\text{hypothesis}\mid \text{data})

It is:

P(\text{data}\mid \text{hypothesis})

AIC compares how well models generate the observed data while penalizing extra flexibility.


Why log-likelihood is usually used

Products of many probabilities become extremely tiny:

0.001 \times 0.0002 \times \cdots

So people maximize:

\ln L

instead of L.

Since log is monotonic:

  • maximizing L
  • maximizing \ln L

give the same parameter estimates.
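
The "extremely tiny" problem is easy to demonstrate. A sketch with 2000 made-up observations (all equal to 170, with \mu=170 and \sigma=2, so each density is about 0.199): the raw product underflows to zero in double precision, while the sum of logs stays finite.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Hypothetical data: 2000 identical observations.
data = [170.0] * 2000
mu, sigma = 170.0, 2.0

likelihood = 1.0
for y in data:
    likelihood *= normal_pdf(y, mu, sigma)   # shrinks toward zero at every step

log_likelihood = sum(math.log(normal_pdf(y, mu, sigma)) for y in data)

print(likelihood)       # 0.0: the product has underflowed
print(log_likelihood)   # finite, about -3224.17
```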

For independent normal observations:

\ln L
=
-\frac n2\ln(2\pi\sigma^2)
-
\frac1{2\sigma^2}
\sum (y_i-\mu)^2

Maximizing this becomes equivalent to minimizing squared error.
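
A quick check of that equivalence, as a sketch: evaluate the closed-form \ln L and the sum of squared errors over a grid of candidate values of \mu (the grid from 168 to 172 in steps of 0.1 is my own choice) and confirm that both criteria select the same candidate.

```python
import math

def sum_sq_error(ys, mu):
    """Sum of squared distances between the observations and mu."""
    return sum((y - mu) ** 2 for y in ys)

def log_likelihood(ys, mu, sigma):
    """ln L = -(n/2) ln(2 pi sigma^2) - (1 / (2 sigma^2)) * sum (y_i - mu)^2."""
    n = len(ys)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sum_sq_error(ys, mu) / (2 * sigma ** 2)

heights = [170.0, 172.0, 168.0]
sigma = 2.0
candidates = [168 + i / 10 for i in range(41)]   # 168.0, 168.1, ..., 172.0

best_by_loglik = max(candidates, key=lambda m: log_likelihood(heights, m, sigma))
best_by_sse = min(candidates, key=lambda m: sum_sq_error(heights, m))
print(best_by_loglik, best_by_sse)               # 170.0 170.0
```

With \sigma fixed, the first term of \ln L does not depend on \mu, so the ranking of candidates is decided entirely by the squared-error term.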

 
