The maximum likelihood method finds the best-fitting distribution for a population. However, the parameter values that produce this best-fitting distribution are sometimes biased estimators of the true parameters.

The method of maximum likelihood looks for estimators of the population parameters that yield the distribution under which the observed sample statistics are most likely.

In the maximum likelihood method, we choose the population parameters *p*_{1}, *p*_{2}, …, *p*_{n} that maximize the likelihood of the sample statistics *s*_{1}, *s*_{2}, *s*_{3}, conditional on *p*_{1}, *p*_{2}, …, *p*_{n} being the parameters of the population (Wackerly, Mendenhall, & Scheaffer, 2001, p. 449).

Continuous random variables *X*_{1}, …, *X*_{n} **are all statistically independent** of each other if and only if their joint density function can be factored as:

$$f_{X_1,\dots,X_n}(x_1,\dots,x_n) = f_{X_1}(x_1)\, f_{X_2}(x_2) \cdots f_{X_n}(x_n)$$

- This means that Cov(*X*_{i}, *X*_{j}) = 0 for *i* ≠ *j* (Johnson & Wichern, 2007, p. 69).

Suppose there is a sample *x*_{1}, *x*_{2}, …, *x*_{n} of *n* **independent and identically distributed** (iid) observations, coming from a distribution with an unknown *probability density function* (pdf) *ƒ*_{0}(·). It is however surmised that the function *ƒ*_{0} belongs to a certain family of distributions { *ƒ*(·|θ), θ ∈ Θ }, called the parametric model, so that *ƒ*_{0} = *ƒ*(·|θ_{0}). The value θ_{0} is unknown and is referred to as the “*true value*” of the parameter. It is desirable to find some estimator $\hat\theta$ which would be as close to the true value θ_{0} as possible. Both the observed variables *x*_{i} and the parameter θ can be vectors.

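To use the method of maximum likelihood, one first specifies the joint density function for all observations. For an iid sample this joint density function will be

$$f(x_1, x_2, \dots, x_n \mid \theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta) \cdots f(x_n \mid \theta)$$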

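Now we look at this function from a different perspective by considering the observed values *x*_{1}, *x*_{2}, …, *x*_{n} to be fixed “parameters” of this function, whereas θ will be the function’s variable and allowed to vary freely. Viewed this way, the function is called the likelihood:

$$\mathcal{L}(\theta \mid x_1, \dots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)$$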

MLE was first suggested by Fisher in the 1920s. The maximum likelihood estimate is the same whether we maximize the likelihood or the log-likelihood function, since the logarithm is a monotone increasing function. In many situations we maximize the log of the likelihood, since the logarithm turns products into sums, which are easier to work with (Devore, 2007, p. 245).
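As a small illustration of why the two maximizations agree, the sketch below evaluates both the likelihood and the log-likelihood of a normal model on a grid of candidate means and picks the best grid point under each. The data values, the fixed standard deviation of 1, and the grid itself are made-up assumptions for the example, not part of the cited texts.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def likelihood(data, mu, sigma=1.0):
    # Product of the individual densities (iid assumption).
    return math.prod(normal_pdf(x, mu, sigma) for x in data)

def log_likelihood(data, mu, sigma=1.0):
    # The log turns the product above into a sum.
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in data)

data = [4.2, 5.1, 3.8, 6.0, 4.9]            # hypothetical observations
grid = [i / 100 for i in range(300, 701)]   # candidate means 3.00 .. 7.00

best_by_likelihood = max(grid, key=lambda mu: likelihood(data, mu))
best_by_log = max(grid, key=lambda mu: log_likelihood(data, mu))

print(best_by_likelihood, best_by_log)
```

Both searches should land on the same grid point, which for this model is the grid value closest to the sample mean.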

=============================================================================

This example is from Wikipedia: http://en.wikipedia.org/wiki/Maximum_likelihood#Continuous_distribution.2C_continuous_parameter_space

Continuous distribution, continuous parameter space

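For the normal distribution $\mathcal{N}(\mu, \sigma^2)$, which has probability density function

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

the corresponding probability density function for a sample of *n* independent identically distributed normal random variables (the likelihood) is

$$f(x_1, \dots, x_n \mid \mu, \sigma^2) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}\right)$$

or more conveniently:

$$f(x_1, \dots, x_n \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right)$$

where $\bar{x}$ is the sample mean.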

This family of distributions has two parameters, θ = (μ, σ), so we maximize the likelihood, $\mathcal{L}(\mu, \sigma) = f(x_1, \dots, x_n \mid \mu, \sigma)$, over both parameters simultaneously or, if possible, individually.

Since the logarithm is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm. Since maximizing the logarithm often requires simpler algebra, it is the logarithm which is maximized below. (Note: the log-likelihood is closely related to information entropy and Fisher information.)

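Differentiating the log-likelihood with respect to μ and setting the derivative to zero gives

$$0 = \frac{\partial}{\partial\mu} \log f(x_1, \dots, x_n \mid \mu, \sigma^2) = \frac{n(\bar{x}-\mu)}{\sigma^2}$$

which is solved by

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

This is indeed the maximum of the function, since it is the only turning point in μ and the second derivative is strictly less than zero.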

An estimator is said to be **unbiased** if its bias is equal to zero for all values of the parameter θ.

Assume the data follow some unknown distribution *P*(*x* | θ) (where θ is a fixed constant that is part of this distribution, but is unknown), and then we construct some estimator $\hat\theta$ that maps observed data to values that we hope are close to θ. Then the **bias** of this estimator is defined to be

$$\operatorname{Bias}[\hat\theta] = E[\hat\theta] - \theta = E[\hat\theta - \theta]$$

where E[ ] denotes expected value over the distribution *P*(*x* | θ), i.e. averaging over all possible observations *x*.

We can show that the expected value of $\hat\mu$ is equal to the parameter μ of the given distribution (see http://classweb.gmu.edu/tkeller/HANDOUTS/Handout6.pdf),

$$E[\hat\mu] = \mu$$

which means that the maximum-likelihood estimator $\hat\mu$ is unbiased.

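$$
\begin{aligned}
E(\bar{X}) &= E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) \\
&= \frac{1}{n}\, E\left(\sum_{i=1}^{n} X_i\right) \\
&= \frac{1}{n} \sum_{i=1}^{n} E(X_i) && \text{(expectation is a linear operator, so the sum can be taken outside)} \\
&= \frac{1}{n}\, n\, E(X_i) && \text{(there are } n \text{ terms in the sum and } E(X_i) \text{ is the same for all } i\text{)} \\
&= E(X_i) \\
&= \mu
\end{aligned}
$$

Since $E(\bar{X}) = \mu$, $\bar{X}$ is an unbiased estimator of the population mean μ.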
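The unbiasedness of the sample mean can also be checked by simulation. The following is a minimal sketch: the "true" mean and standard deviation, the sample size, the number of repetitions, and the seed are all arbitrary assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

mu, sigma = 10.0, 2.0   # assumed "true" population parameters
n, reps = 10, 20000     # sample size and number of simulated samples

# Draw many samples of size n and record the sample mean of each.
sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]

# If the sample mean is unbiased, this average should be close to mu.
average_of_means = statistics.fmean(sample_means)
print(round(average_of_means, 3))
```

Individual sample means scatter around μ, but their long-run average settles near μ, which is what E(X̄) = μ asserts.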

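If we differentiate the log-likelihood with respect to σ and equate to zero:

$$0 = \frac{\partial}{\partial\sigma} \log f(x_1, \dots, x_n \mid \mu, \sigma^2) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{\sigma^3}$$

the solution is:

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\mu)^2$$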
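The closed-form solutions for the normal model (sample mean, and mean squared deviation with divisor *n*) can be sanity-checked numerically. The sketch below computes them for a small made-up data set and confirms that nudging either parameter lowers the log-likelihood. The function names and the data are hypothetical, chosen only for this illustration.

```python
import math

def normal_log_likelihood(data, mu, sigma):
    """Log-likelihood of iid normal data with mean mu and std dev sigma."""
    n = len(data)
    squared_dev = sum((x - mu) ** 2 for x in data)
    return (-n * math.log(sigma)
            - 0.5 * n * math.log(2 * math.pi)
            - squared_dev / (2 * sigma ** 2))

def normal_mle(data):
    """Closed-form ML estimates: sample mean and variance with divisor n."""
    n = len(data)
    mu_hat = sum(data) / n
    var_hat = sum((x - mu_hat) ** 2 for x in data) / n
    return mu_hat, var_hat

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
mu_hat, var_hat = normal_mle(data)
sigma_hat = math.sqrt(var_hat)
best = normal_log_likelihood(data, mu_hat, sigma_hat)

# Every small perturbation of (mu, sigma) should give a lower log-likelihood.
perturbed = [
    normal_log_likelihood(data, mu_hat + dm, sigma_hat + ds)
    for dm in (-0.1, 0.0, 0.1)
    for ds in (-0.1, 0.0, 0.1)
    if (dm, ds) != (0.0, 0.0)
]
is_local_max = all(value < best for value in perturbed)
print(mu_hat, var_hat, is_local_max)
```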

================================

We can show that this is a biased estimator 🙁

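Inserting $\hat\mu = \bar{x}$ we obtain

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j$$

To calculate its expected value, it is convenient to rewrite the expression in terms of zero-mean random variables (statistical error) $\delta_i \equiv \mu - x_i$. Expressing the estimate in these variables yields

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(\mu - \delta_i)^2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(\mu - \delta_i)(\mu - \delta_j)$$

Simplifying the expression above, utilizing the facts that $E[\delta_i] = 0$ and $E[\delta_i^2] = \sigma^2$ (and, by independence, $E[\delta_i \delta_j] = 0$ for $i \neq j$), allows us to obtain

$$E[\hat\sigma^2] = \frac{n-1}{n}\,\sigma^2$$

This means that the estimator $\hat\sigma^2$ is biased (it will underestimate the variance). However, $\hat\sigma^2$ is consistent.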
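The size of the bias can be seen in a quick simulation. This is a sketch under assumed values: a normal population with variance 4, samples of size 5, and a fixed seed, all picked arbitrarily. With *n* = 5 the factor (*n* − 1)/*n* is 0.8, so the average ML variance estimate should be near 3.2 rather than 4.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

mu, sigma2 = 0.0, 4.0   # assumed "true" mean and variance
n, reps = 5, 50000      # small samples make the bias easy to see

def mle_variance(sample):
    """ML variance estimate: squared deviations from the mean, divided by n."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

estimates = [
    mle_variance([random.gauss(mu, sigma2 ** 0.5) for _ in range(n)])
    for _ in range(reps)
]

mean_estimate = sum(estimates) / reps   # simulated E[sigma_hat^2]
expected = (n - 1) / n * sigma2         # theoretical value (n-1)/n * sigma^2
print(round(mean_estimate, 3), expected)
```

Increasing `n` in this sketch shrinks the gap between the two numbers and the true variance, which is the consistency property in action.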

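In mathematical terms, consistent means that as *n* goes to infinity the estimator converges in probability to its true value:

$$\hat\theta_{\mathrm{mle}} \xrightarrow{\;p\;} \theta_0$$

Under slightly stronger conditions, the estimator converges almost surely (or *strongly*):

$$\hat\theta_{\mathrm{mle}} \xrightarrow{\;\text{a.s.}\;} \theta_0$$

Formally we say that the *maximum likelihood estimator* for θ = (μ, σ^{2}) is:

$$\hat\theta = \left(\hat\mu, \hat\sigma^2\right) = \left(\bar{x},\; \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)$$

Dividing by *n* − 1 instead of *n* gives the sample variance

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

**which is an unbiased estimator of σ^{2} (Devore, 2007, p. 233).**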

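Note that, since *x*_{1}, *x*_{2}, …, *x*_{n} are a random sample from a distribution with mean μ and variance σ^{2}, it follows that for each *i* = 1, 2, …, *n*:

$$E(x_i^2) = \operatorname{Var}(x_i) + [E(x_i)]^2 = \sigma^2 + \mu^2$$

- Wackerly, Mendenhall, & Scheaffer (2001), p. 372

and also

$$E(\bar{x}^2) = \operatorname{Var}(\bar{x}) + [E(\bar{x})]^2 = \frac{\sigma^2}{n} + \mu^2$$

The identity $\operatorname{Var}(\bar{x}) = \sigma^2/n$ is a property of the variance of uncorrelated variables, arising from the Bienaymé formula. The required result is then obtained by substituting these two formulae:

$$E[\hat\sigma^2] = E\left[\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2\right] = (\sigma^2 + \mu^2) - \left(\frac{\sigma^2}{n} + \mu^2\right) = \frac{n-1}{n}\,\sigma^2$$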

If *g*(θ) is any transformation of θ, then the MLE for α = *g*(θ) is by definition

$$\hat\alpha = g(\hat\theta)$$
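The invariance property can be illustrated with the normal model: maximizing the likelihood over the variance and then taking a square root should give the same answer as maximizing directly over the standard deviation. The sketch below does both by grid search on a made-up data set. The grids, function names, and data are assumptions for the example only.

```python
import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

def loglik_in_variance(v):
    """Normal log-likelihood at mu = sample mean, parameterized by variance v."""
    return -0.5 * n * math.log(2 * math.pi * v) - ss / (2 * v)

def loglik_in_sigma(s):
    """The same log-likelihood, parameterized by standard deviation s."""
    return -n * math.log(s) - 0.5 * n * math.log(2 * math.pi) - ss / (2 * s ** 2)

variance_grid = [i / 100 for i in range(100, 901)]  # 1.00 .. 9.00
sigma_grid = [i / 100 for i in range(100, 301)]     # 1.00 .. 3.00

var_hat = max(variance_grid, key=loglik_in_variance)
sigma_hat = max(sigma_grid, key=loglik_in_sigma)

# Invariance: the MLE of sigma equals the square root of the MLE of sigma^2.
print(var_hat, sigma_hat, math.sqrt(var_hat))
```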

The sample variance is

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

where *x*_{1}, …, *x*_{n} is the sample (formally, realizations from a random variable *X*) and $\bar{x}$ is the sample mean.

Dividing by *n* instead gives an **underestimate** of the variance. The use of *n* − 1 instead of *n* in the formula for the sample variance is known as Bessel’s correction, which corrects the bias in the estimation of the population *variance*, and some, but not all, of the bias in the estimation of the population *standard deviation*.
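In Python's standard library the two divisors correspond to two different functions: `statistics.pvariance` divides by *n*, and `statistics.variance` divides by *n* − 1 (Bessel's correction). A short sketch on a made-up data set:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
n = len(data)

biased = statistics.pvariance(data)    # divisor n (the ML estimate)
unbiased = statistics.variance(data)   # divisor n - 1 (Bessel's correction)

# The two estimates differ exactly by the factor (n - 1) / n.
ratio = biased / unbiased
print(biased, unbiased, ratio)
```

For this sample the sum of squared deviations is 32, so the two estimates are 32/8 and 32/7, and their ratio is 7/8.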

*For the unbiased estimation of the standard deviation, see:*

http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

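If the *y*_{i} are independent observations from a normal distribution, Cochran’s theorem shows that *s*^{2} follows a scaled chi-square distribution:

$$(n-1)\,\frac{s^2}{\sigma^2} \sim \chi^2_{n-1}$$

As a direct consequence, it follows that E(*s*^{2}) = σ^{2}.

A chi-square variable has variance equal to twice its degrees of freedom, so $\operatorname{Var}\left((n-1)\,s^2/\sigma^2\right) = 2(n-1)$, and therefore the variance of the sample variance is $\operatorname{Var}(s^2) = 2\sigma^4/(n-1)$.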

If the *y*_{i} are independent and identically distributed, but not **necessarily normally distributed**, then

$$E[s^2] = \sigma^2, \qquad \operatorname{Var}[s^2] = \sigma^4\left(\frac{2}{n-1} + \frac{\kappa}{n}\right)$$

where κ is the excess kurtosis of the distribution. If the conditions of the law of large numbers hold, *s*^{2} is a consistent estimator of σ^{2}.
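For normal data the excess kurtosis is zero, so the variance of *s*^{2} reduces to 2σ^{4}/(*n* − 1). The simulation below is a rough check of that special case. The population parameters, sample size, number of repetitions, and seed are arbitrary assumptions, and `sample_variance` is a helper written for this sketch.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable

sigma = 1.0             # assumed "true" standard deviation
n, reps = 10, 40000     # sample size and number of simulated samples

def sample_variance(values):
    """Unbiased sample variance (divisor n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# One s^2 per simulated normal sample.
s2_values = [
    sample_variance([random.gauss(0.0, sigma) for _ in range(n)])
    for _ in range(reps)
]

mean_s2 = sum(s2_values) / reps              # should be near sigma^2
simulated_var = sample_variance(s2_values)   # spread of s^2 across samples
theoretical_var = 2 * sigma ** 4 / (n - 1)   # 2 * sigma^4 / (n - 1)
print(round(mean_s2, 3), round(simulated_var, 3), round(theoretical_var, 3))
```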

Devore, J. L. (2007). *Probability and Statistics for Engineering and the Sciences, Enhanced Review Edition* (7th ed.). Duxbury Press.

Devore, J. L. *Student Solutions Manual for Devore’s Probability and Statistics for Engineering and the Sciences* (7th ed.). Duxbury Press.

Johnson, R. A., & Wichern, D. W. (2007). *Applied Multivariate Statistical Analysis* (6th ed.). Prentice Hall.

Wackerly, D., Mendenhall, W., & Scheaffer, R. L. (2001). *Mathematical Statistics with Applications* (6th ed.). Duxbury Press.