A Bayesian perspective on estimating mean, variance, and standard deviation from data

Travis E. Oliphant

December 5, 2006

Abstract

After reviewing some classical estimators for mean, variance, and standard deviation and showing that unbiased estimates are not usually desirable, a Bayesian perspective is employed to determine what is known about mean, variance, and standard deviation given only that a data set in fact has a common mean and variance. Maximum entropy is used to argue that the likelihood function in this situation should be the same as if the data were independent and identically distributed Gaussian. A noninformative prior is derived for the mean and variance, and Bayes' rule is used to compute the posterior probability density function (PDF) of $(\mu,\sigma)$ as well as $(\mu,\sigma^2)$ in terms of the sufficient statistics $\bar{x} = \frac{1}{n}\sum_i x_i$ and $C = \frac{1}{n}\sum_i (x_i-\bar{x})^2$. From the joint distribution, marginals are determined. It is shown that $\frac{\mu-\bar{x}}{\sqrt{C}}\sqrt{n-1}$ is distributed as Student-$t$ with $n-1$ degrees of freedom, $\sigma\sqrt{\frac{2}{nC}}$ is distributed as generalized gamma with $c=-2$ and $a=\frac{n-1}{2}$, and $\frac{2\sigma^2}{nC}$ is distributed as inverted gamma with $a=\frac{n-1}{2}$. It is suggested to report the mean of these distributions as the estimate (or the peak if $n$ is too small for the mean to be defined) and a confidence interval surrounding the median.

1 Introduction

A standard concept encountered by anyone exposed to data is the idea of computing a mean, a variance, and a standard deviation from the data. This paper will explore various approaches to computing estimates of mean, standard deviation, and variance from samples and will conclude by recommending a Bayesian approach to inference about these values from data. Typically it is assumed that the data are realizations of a collection of independent, identically distributed (i.i.d.) random variables. This random vector is denoted $X = [X_1, X_2, \ldots, X_n]$, and the joint probability density function (PDF) of $X$ is
$$f_X(X) = \prod_{i=1}^{n} f_X(x_i).$$

1.1 Traditional mean estimate

Commonly, the mean of $X$ is estimated as the sample average:
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
One can then show in a rather satisfying fashion that
$$E[\hat{\mu}] = E[X], \qquad \operatorname{Var}[\hat{\mu}] = \frac{1}{n}\operatorname{Var}[X].$$
These statements are typically used to justify this choice of estimator for the mean as unbiased and consistent. Sometimes this estimator is further justified by noticing that it is also the maximum likelihood (ML) estimate for the mean assuming the noise comes from an exponential family (e.g. Gaussian).

1.2 Traditional variance estimate

The ML estimate for variance assuming Gaussian noise is
$$\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^2.$$
It is sometimes suggested to use instead
$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \hat{\mu})^2$$
to ensure that $E[\hat{\sigma}^2] = \sigma^2$. We are supposed to believe that this is preferable to an estimator that instead minimizes some other metric such as the mean-squared error (which includes both bias and variance). A good discussion of these concepts will also mention that if the $X_i$ are all normal random variables, then $\hat{\mu}$ and $\hat{\sigma}^2$ are independent, $\hat{\mu}$ is normal, and $(n-1)\hat{\sigma}^2/\sigma^2$ is chi-squared with $n-1$ degrees of freedom. Confidence intervals can then be determined from these facts in a straightforward way.

1.3 Standard-deviation estimates

Typically, standard-deviation estimates are obtained using $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$, and little is then said about the uncertainty of this estimate. Often, the square root of the unbiased variance estimate is taken with little justification other than convenience, despite the fact that $\hat{\sigma}$ is generally not an unbiased estimate of $\sigma$ even when $\hat{\sigma}^2$ is.
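To make this last point concrete, a quick simulation can show that the square root of the unbiased variance estimate is biased low as an estimate of $\sigma$. This is a minimal sketch (not part of the original paper) assuming NumPy; the sample size, true $\sigma$, and number of trials are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, trials = 5, 2.0, 200_000       # illustrative choices

x = rng.normal(0.0, sigma, size=(trials, n))
var_ub = x.var(axis=1, ddof=1)            # unbiased variance estimate (divisor n-1)
std_from_ub = np.sqrt(var_ub)             # the conventional standard-deviation estimate

print("mean of unbiased variance estimates:", var_ub.mean())      # close to sigma**2 = 4.0
print("mean of sqrt(unbiased variance):    ", std_from_ub.mean()) # noticeably below sigma = 2.0
```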
1.4 Outline of the paper

In this paper, the mean-squared error of modified classical estimators for the variance and standard deviation will be compared. The point of this comparison is to elucidate which normalization factor gives the smallest error (under the hypothesis of normally distributed data). While instructive, this comparison does not end the discussion, as it does not address the question of whether or not the normalization constant should be the only issue in dispute. As a result, the problem will be addressed from a Bayesian perspective. Under this perspective, I begin with the assumption that the data have a common mean and variance and use maximum entropy (with a flat prior) to assert that the likelihood function is normal. Using a flat prior for $\mu$ and a Jeffreys prior [2] for $\sigma$ (and $\sigma^2$), the posterior probability of $(\mu,\sigma)$ and $(\mu,\sigma^2)$ is derived. From this joint posterior, the posterior probability for $\mu$, $\sigma$, and $\sigma^2$ can be given, which leads to simple rules for an estimate and confidence-interval calculations.

2 Comparing various estimators

Assuming the $X_i$ come from a normal population with mean $\mu$ and variance $\sigma^2$, three estimators for $\sigma$ and $\sigma^2$ will be compared in terms of mean-squared error and bias: 1) the unbiased estimators $\hat{\sigma}_{\mathrm{UB}}$ and $\hat{\sigma}^2_{\mathrm{UB}}$, 2) the maximum-likelihood estimators $\hat{\sigma}_{\mathrm{ML}}$ and $\hat{\sigma}^2_{\mathrm{ML}}$, and 3) the minimum mean-squared-error (MMSE) estimators (among those of a certain class) $\hat{\sigma}_{\mathrm{MMSE}}$ and $\hat{\sigma}^2_{\mathrm{MMSE}}$. All of the estimators considered are of the form
$$\hat{\sigma}^2 = a\sum_{i=1}^{n}(X_i - \hat{\mu})^2, \qquad \hat{\sigma} = \sqrt{a\sum_{i=1}^{n}(X_i - \hat{\mu})^2}.$$
For both classes of estimators, the bias, $E[\hat{\theta}]-\theta$, and the mean-squared error, $E\!\left[(\hat{\theta}-\theta)^2\right] = E[\hat{\theta}^2] - 2\theta E[\hat{\theta}] + \theta^2$, will be calculated assuming $X_i$ comes from a normal distribution with mean $\mu$ and variance $\sigma^2$. The identity
$$\mathrm{MSE}[\hat{\theta}] \equiv E\!\left[(\hat{\theta}-\theta)^2\right] = \operatorname{Var}[\hat{\theta}] + \left(E[\hat{\theta}] - \theta\right)^2$$
will be useful in what follows.

2.1 Estimators of variance

For all three estimators of variance it is known that, under the hypothesis of normally distributed data, $\hat{\sigma}^2/(a\sigma^2)$ is $\chi^2_{n-1}$ and therefore has mean $n-1$ and variance $2(n-1)$. Consequently,
$$E[\hat{\sigma}^2] = a\sigma^2(n-1),$$
$$E[\hat{\sigma}^4] = a^2\sigma^4\left[(n-1)^2 + 2(n-1)\right] = a^2\sigma^4\left(n^2-1\right),$$
$$E\!\left[\left(\hat{\sigma}^2 - \sigma^2\right)^2\right] = E[\hat{\sigma}^4] - 2\sigma^2 E[\hat{\sigma}^2] + \sigma^4 = \sigma^4\left[a^2\left(n^2-1\right) - 2a(n-1) + 1\right].$$
It can be shown that the maximum-likelihood estimator for $\sigma^2$ requires $a_{\mathrm{ML}} = \frac{1}{n}$. The unbiased estimator for $\sigma^2$ obviously requires $a_{\mathrm{UB}} = \frac{1}{n-1}$. The minimum mean-squared-error estimator is found by differentiating the mean-squared-error expression above with respect to $a$ and setting the result equal to zero. This procedure results in $a_{\mathrm{MMSE}} = \frac{1}{n+1}$. The three estimators and their performance are summarized in the following table:
$$\begin{array}{llll}
 & \hat{\sigma}^2 & E[\hat{\sigma}^2] & \mathrm{MSE}[\hat{\sigma}^2] \\
\mathrm{UB} & \frac{1}{n-1}\sum_{i=1}^n (X_i-\hat{\mu})^2 & \sigma^2 & \frac{2\sigma^4}{n-1} \\
\mathrm{ML} & \frac{1}{n}\sum_{i=1}^n (X_i-\hat{\mu})^2 & \frac{n-1}{n}\sigma^2 & \frac{(2n-1)\sigma^4}{n^2} \\
\mathrm{MMSE} & \frac{1}{n+1}\sum_{i=1}^n (X_i-\hat{\mu})^2 & \frac{n-1}{n+1}\sigma^2 & \frac{2\sigma^4}{n+1}
\end{array}$$
It is not difficult to show that for $n > 1$
$$\frac{2}{n+1} < \frac{2n-1}{n^2} < \frac{2}{n-1},$$
and therefore, in a mean-squared sense, the MMSE and ML estimators are both better than the unbiased estimator. This example illustrates a general property: improved estimators are usually possible in a mean-squared sense by using biased estimators. Figure 1 shows $\sqrt{\mathrm{MSE}[\hat{\sigma}^2]}/\sigma^2$ and $E[\hat{\sigma}^2]/\sigma^2$ for the three estimators when $n > 1$.

[Figure 1: Normalized mean, $E[\hat{\sigma}^2]/\sigma^2$, and normalized root-mean-squared error (R-MSE), $\sqrt{\mathrm{MSE}[\hat{\sigma}^2]}/\sigma^2$, of several estimators of $\sigma^2$ (Minimum MSE, Maximum Likelihood, Unbiased) versus $n$.]
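The normalized mean and R-MSE plotted in Figure 1 follow directly from the closed forms above. The following is a minimal sketch (assuming NumPy; the printed $n=5$ values are only an illustrative check) that evaluates them for the three normalizations.

```python
import numpy as np

def variance_estimator_performance(n, a):
    """Normalized mean E[sigma_hat^2]/sigma^2 and normalized RMSE sqrt(MSE)/sigma^2
    for sigma_hat^2 = a * sum((X_i - mu_hat)^2), assuming normal data (Section 2.1)."""
    mean = a * (n - 1)
    rmse = np.sqrt(a**2 * (n**2 - 1) - 2 * a * (n - 1) + 1)
    return mean, rmse

n = np.arange(2, 21)
for label, a in [("unbiased", 1 / (n - 1)),
                 ("max-likelihood", 1 / n),
                 ("min-MSE", 1 / (n + 1))]:
    mean, rmse = variance_estimator_performance(n, a)
    print(f"{label:15s} mean at n=5: {mean[n == 5][0]:.3f}   RMSE at n=5: {rmse[n == 5][0]:.3f}")
```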
3 Estimators for $\sigma$

Estimators for $\sigma$ are not often discussed, but they are often used and should, therefore, receive better treatment. For normally distributed data, the maximum likelihood estimator for $\sigma$ is
$$\hat{\sigma}_{\mathrm{ML}} = \sqrt{\hat{\sigma}^2_{\mathrm{ML}}} = \sqrt{\frac{1}{n}\sum_i (X_i - \hat{\mu})^2},$$
and thus $a_{\mathrm{ML}} = \frac{1}{n}$. The mean and mean-squared error for all three estimators can be computed by noticing that $\hat{\sigma}/(\sigma\sqrt{a})$ is $\chi_{n-1}$ (a chi random variable with $n-1$ degrees of freedom). Because
$$E[\chi_{n-1}] = \sqrt{2}\,\frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)}, \qquad
\operatorname{Var}[\chi_{n-1}] = n - 1 - 2\left[\frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)}\right]^2,$$
we can conclude that
$$E[\hat{\sigma}] = \sigma\sqrt{2a}\,\frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)} = t_n\,\sigma\sqrt{2a},$$
$$E\!\left[(\hat{\sigma}-\sigma)^2\right] = \operatorname{Var}[\hat{\sigma}] + \left(E[\hat{\sigma}]-\sigma\right)^2
= a\sigma^2(n-1) - 2a\sigma^2 t_n^2 + \sigma^2\left(t_n\sqrt{2a} - 1\right)^2
= \sigma^2\left[a(n-1) - 2\sqrt{2a}\,t_n + 1\right],$$
where
$$t_n = \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)}.$$
From these expressions, the unbiased estimator results if $a_{\mathrm{UB}} = 1/(2t_n^2)$, while the minimum mean-squared-error estimator can be found by differentiating the expression for the mean-squared error with respect to $a$ and solving for $a$. The result is $a_{\mathrm{MMSE}} = 2t_n^2/(n-1)^2$. The following table summarizes the estimators and their performance.
$$\begin{array}{llll}
 & \hat{\sigma} & E[\hat{\sigma}] & \mathrm{MSE}[\hat{\sigma}] \\
\mathrm{UB} & \frac{1}{t_n}\sqrt{\frac{1}{2}\sum_{i=1}^n (X_i-\hat{\mu})^2} & \sigma & \sigma^2\left[\frac{n-1}{2t_n^2} - 1\right] \\
\mathrm{ML} & \sqrt{\frac{1}{n}\sum_{i=1}^n (X_i-\hat{\mu})^2} & t_n\sqrt{\frac{2}{n}}\,\sigma & 2\sigma^2\left[1 - t_n\sqrt{\frac{2}{n}} - \frac{1}{2n}\right] \\
\mathrm{MMSE} & \frac{t_n}{n-1}\sqrt{2\sum_{i=1}^n (X_i-\hat{\mu})^2} & \frac{2t_n^2}{n-1}\,\sigma & \sigma^2\left[1 - \frac{2t_n^2}{n-1}\right]
\end{array}$$
It can be shown (or observed from the plots) that
$$1 - \frac{2t_n^2}{n-1} \;<\; 2 - 2t_n\sqrt{\frac{2}{n}} - \frac{1}{n} \;<\; \frac{n-1}{2t_n^2} - 1.$$
Therefore, comparing the estimators on the basis of mean-squared error again results in the MMSE and ML estimators outperforming the unbiased estimator. Figure 2 shows plots of $E[\hat{\sigma}]/\sigma$ and $\sqrt{\mathrm{MSE}[\hat{\sigma}]}/\sigma$ to give some idea of the small-sample performance of these different estimators on normal data.

[Figure 2: Normalized mean, $E[\hat{\sigma}]/\sigma$, and normalized root-mean-squared error (R-MSE), $\sqrt{\mathrm{MSE}[\hat{\sigma}]}/\sigma$, of several estimators of $\sigma$ (Minimum MSE, Maximum Likelihood, Unbiased) versus $n$.]
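The quantities plotted in Figure 2 can likewise be evaluated from the closed forms using $t_n = \Gamma(n/2)/\Gamma((n-1)/2)$. A minimal sketch, assuming SciPy's log-gamma function for numerical stability:

```python
import numpy as np
from scipy.special import gammaln

def t_n(n):
    """t_n = Gamma(n/2) / Gamma((n-1)/2), computed via log-gamma for stability."""
    return np.exp(gammaln(n / 2) - gammaln((n - 1) / 2))

def sigma_estimator_performance(n, a):
    """Normalized mean E[sigma_hat]/sigma and normalized RMSE sqrt(MSE)/sigma for
    sigma_hat = sqrt(a * sum((X_i - mu_hat)^2)) under normal data (Section 3)."""
    tn = t_n(n)
    mean = tn * np.sqrt(2 * a)
    rmse = np.sqrt(a * (n - 1) - 2 * np.sqrt(2 * a) * tn + 1)
    return mean, rmse

n = np.arange(2, 21)
for label, a in [("unbiased", 1 / (2 * t_n(n)**2)),
                 ("max-likelihood", 1 / n),
                 ("min-MSE", 2 * t_n(n)**2 / (n - 1)**2)]:
    mean, rmse = sigma_estimator_performance(n, a)
    print(f"{label:15s} mean/sigma at n=5: {mean[n == 5][0]:.3f}   RMSE/sigma: {rmse[n == 5][0]:.3f}")
```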
4 Bayesian Perspective

Given data $\{x_1, x_2, x_3, \ldots, x_n\}$, the task is to find the mean $\mu$, variance $\sigma^2 = v$, and standard deviation $\sigma$ of these data. As stated, the problem doesn't have a solution. More information is needed in order to work towards an answer. First, assume that the data have a common mean $\mu$ and a common variance $\sigma^2$. The principle of maximum entropy can then be applied under these constraints (using a flat "ignorance" measure) to choose the distribution
$$f(X \mid \mu, \sigma) = \frac{1}{(2\pi)^{n/2}\sigma^n}\exp\left[-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2\right],$$
which adds the least amount of information to the problem beyond the assumption of a common $\mu$ and $\sigma^2$. Notice that we can use maximum entropy (with a flat "ignorance" measure, so that the entropy is $-\int f(x)\log f(x)\,dx$) to justify the common assumption of normal i.i.d. data.

Using Bayes' rule we find that
$$f(\mu,\sigma \mid X) = \frac{f(X \mid \mu,\sigma)\, f(\mu,\sigma)}{f(X)} = D_n f(X \mid \mu, \sigma)\, f(\mu,\sigma),$$
where $D_n$ is a normalizing constant. This distribution tells us all the information that is available about $\mu$ and $\sigma$ given the data $X$. We can use this joint PDF to estimate $\mu$ and/or $\sigma$ and to report confidence in the estimates.

4.1 Choosing the prior $f(\mu,\sigma)$

Central to solving this problem is choosing the prior knowledge for $\mu$ and $\sigma$. Because we can normalize the random variables using $Z = (X-\mu)/\sigma$ to obtain zero-mean, unit-variance random variables, $\mu$ is a location parameter and $\sigma$ is a scale parameter. Following Jaynes's reasoning [1], we choose the prior which expresses complete ignorance except for the fact that $\mu$ is a location parameter and $\sigma$ is a scale parameter. In other words, consider a new problem with data $x'$ which is a shifted and scaled version of the old data. The prior in both of these cases should be the same function, after the prior has been adjusted according to the well-established rules for transforming probability densities. This defines an equation that the prior should satisfy:
$$f(\mu, \sigma) = a f(\mu + b, a\sigma),$$
where $a > 0$ and $b$ is an arbitrary real number. The prior that satisfies this transformation equation is the so-called Jeffreys prior,
$$f(\mu,\sigma) = \frac{\mathrm{const}}{\sigma}.$$
This prior is improper in the sense that it is not normalizable by itself. However, when it is used to find the posterior, a total normalization constant can be found. Specifically,
$$f(\mu,\sigma \mid X) = \frac{D_n}{\sigma^{n+1}}\exp\left[-\frac{1}{2\sigma^2}\sum_i (x_i-\mu)^2\right]
= \frac{D_n}{\sigma^{n+1}}\exp\left[-\frac{1}{2\sigma^2}\left(\sum_i x_i^2 - 2\mu\sum_i x_i + n\mu^2\right)\right]
= \frac{D_n}{\sigma^{n+1}}\exp\left[-\frac{(\mu-\bar{x})^2 + C}{2\sigma^2/n}\right],$$
where
$$\bar{x} = \frac{1}{n}\sum_i x_i, \qquad
C = \overline{x^2} - \bar{x}^2 = \frac{1}{n}\sum_i x_i^2 - \left(\frac{1}{n}\sum_i x_i\right)^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2,$$
$$D_n^{-1} = \int_0^\infty \frac{1}{\sigma^{n+1}}\int_{-\infty}^{\infty}\exp\left[-\frac{(\mu-\bar{x})^2 + C}{2\sigma^2/n}\right]d\mu\, d\sigma, \qquad
D_n = \sqrt{\frac{n^n C^{n-1}}{2^{n-2}\pi}}\;\frac{1}{\Gamma\!\left(\frac{n-1}{2}\right)}.$$
This joint posterior PDF tells the whole story about $\mu$ and $\sigma$ if only samples constrained to have the same $\mu$ and $\sigma$ are given. Using this joint PDF we can compute any desired probability. Notice that $n > 1$ is required or else $D_n \to 0$, which expresses the fact that with $n = 1$ there is no information about $\sigma$ whatsoever.

Later, the joint posterior PDF of $\mu$ and $v = \sigma^2$ will be needed. It is
$$f(\mu, v \mid X) = G_n f(X \mid \mu, v)\, f(\mu, v),$$
where $f(\mu, v) = \mathrm{const}/v$, so that we are just as uninformed about $v$ as about $\sigma$. Then,
$$G_n^{-1} = \int_0^\infty \int_{-\infty}^{\infty} v^{-\frac{n+2}{2}}\exp\left[-\frac{(\mu-\bar{x})^2 + C}{2v/n}\right] d\mu\, dv, \qquad
G_n = \sqrt{\frac{n^n C^{n-1}}{2^{n}\pi}}\;\frac{1}{\Gamma\!\left(\frac{n-1}{2}\right)} = \frac{1}{2} D_n.$$

4.2 Marginal distributions

The joint distributions provide all of the information available about the parameters of interest using the data and the assumptions. Notice that these distributions depend on the data only through the statistics $\bar{x}$ and $C$, and they can be used easily to compute confidence intervals. We can integrate out one of the variables and get the marginal density function of $\mu$ or $\sigma$ separately:
$$f(\mu \mid X) = \int_0^\infty \frac{D_n}{\sigma^{n+1}}\exp\left[-\frac{(\mu-\bar{x})^2 + C}{2\sigma^2/n}\right] d\sigma
= \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)\sqrt{\pi C}}\left[1 + \frac{(\mu-\bar{x})^2}{C}\right]^{-n/2},$$
so that $\frac{\mu - \bar{x}}{\sqrt{C}}\sqrt{n-1}$ is Student-$t$ distributed with $n-1$ degrees of freedom. We naturally need $n > 1$ for this distribution to provide information; when $n = 1$ we have an improper distribution for $\mu$ proportional to $\frac{1}{|x-\mu|}$. For other cases we can deduce:
$$E[\mu \mid X] = \bar{x}, \qquad \operatorname{Var}[\mu \mid X] = \frac{C}{n-3} \quad (n > 3), \qquad \arg\max_\mu f(\mu \mid X) = \bar{x}.$$
The marginal distribution of $\sigma$ is
$$f(\sigma \mid X) = \int_{-\infty}^{\infty} \frac{D_n}{\sigma^{n+1}}\exp\left[-\frac{(\mu-\bar{x})^2 + C}{2\sigma^2/n}\right] d\mu
= \frac{D_n\sqrt{2\pi}}{\sigma^n\sqrt{n}}\exp\left(-\frac{nC}{2\sigma^2}\right)
= \frac{2\left(\frac{nC}{2}\right)^{\frac{n-1}{2}}}{\Gamma\!\left(\frac{n-1}{2}\right)\sigma^n}\exp\left(-\frac{nC}{2\sigma^2}\right), \qquad \sigma > 0.$$
Thus, $\sigma\sqrt{\frac{2}{nC}}$ is generalized-gamma distributed with shape parameters $c = -2$ and $a = \frac{n-1}{2}$. If $n = 1$, the distribution reduces to an improper distribution proportional to $1/\sigma$ (i.e. we have no additional information about $\sigma$ other than what we started with). For other values of $n$ we can find:
$$E[\sigma \mid X] = \sqrt{\frac{n}{2}}\,\frac{\Gamma\!\left(\frac{n-2}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)}\sqrt{C} \quad (n > 2),$$
$$\operatorname{Var}[\sigma \mid X] = \frac{n}{2}\left[\frac{\Gamma\!\left(\frac{n-3}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)} - \frac{\Gamma^2\!\left(\frac{n-2}{2}\right)}{\Gamma^2\!\left(\frac{n-1}{2}\right)}\right] C \quad (n > 3),$$
$$\arg\max_\sigma f(\sigma \mid X) = \sqrt{C}.$$
This distribution does not have a well-defined mean unless $n > 2$ and does not have a well-defined variance unless $n > 3$.
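These marginals are easy to work with numerically. The sketch below (the values of $n$, $\bar{x}$, and $C$ are hypothetical and chosen only for illustration; SciPy is assumed) represents $\mu \mid X$ as a shifted and scaled Student-$t$ and draws from the $\sigma$ marginal through the equivalent representation $\sigma = \sqrt{nC/2}\;Y^{-1/2}$ with $Y \sim \mathrm{Gamma}\!\left(\frac{n-1}{2}\right)$, checking the posterior mean and variance formulas above.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

# Hypothetical sufficient statistics, chosen only for illustration
n, xbar, C = 10, 1.5, 4.0

# mu | X: (mu - xbar)*sqrt(n-1)/sqrt(C) is Student-t with n-1 DOF,
# i.e. mu | X is t(df=n-1) shifted to xbar and scaled by sqrt(C/(n-1)).
mu_post = stats.t(df=n - 1, loc=xbar, scale=np.sqrt(C / (n - 1)))
print("E[mu|X]   =", mu_post.mean(), "(should equal xbar)")
print("Var[mu|X] =", mu_post.var(), "vs closed form C/(n-3) =", C / (n - 3))

# sigma | X: sigma*sqrt(2/(n*C)) is generalized gamma with c = -2, a = (n-1)/2,
# equivalently sigma = sqrt(n*C/2) / sqrt(Y) with Y ~ Gamma((n-1)/2).
y = stats.gamma((n - 1) / 2).rvs(size=500_000, random_state=0)
sigma_samples = np.sqrt(n * C / 2) / np.sqrt(y)
E_sigma = np.sqrt(n / 2) * np.exp(gammaln((n - 2) / 2) - gammaln((n - 1) / 2)) * np.sqrt(C)
print("E[sigma|X] closed form =", E_sigma, "vs Monte Carlo:", sigma_samples.mean())
```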
Finally, the marginal distribution of $v = \sigma^2$ is
$$f(v \mid X) = \int_{-\infty}^{\infty} G_n v^{-\frac{n+2}{2}}\exp\left[-\frac{(\mu-\bar{x})^2 + C}{2v/n}\right] d\mu
= \frac{\left(\frac{nC}{2}\right)^{\frac{n-1}{2}}}{\Gamma\!\left(\frac{n-1}{2}\right) v^{(n+1)/2}}\exp\left(-\frac{nC}{2v}\right), \qquad v > 0.$$
When $n = 1$, this also reduces to an improper distribution proportional to $1/v$. For other values of $n$, $\frac{2\sigma^2}{nC}$ is an inverted-gamma random variable with $a = \frac{n-1}{2}$. Useful parameters of this distribution are
$$E[\sigma^2 \mid X] = \frac{n}{n-3} C \quad (n > 3), \qquad
\operatorname{Var}[\sigma^2 \mid X] = \frac{2n^2 C^2}{(n-3)^2(n-5)} \quad (n > 5), \qquad
\arg\max_{\sigma^2} f(\sigma^2 \mid X) = \frac{n}{n+1} C.$$
Notice that this distribution does not have a well-defined mean unless $n > 3$ and does not have a well-defined variance unless $n > 5$. To illustrate the posterior probabilities for various numbers of samples, Figures 3, 4, and 5 show normalized plots of the Student-$t$, generalized-gamma, and inverted-gamma distributions for $n = 3$, $10$, and $50$, corresponding to the mean, standard deviation, and variance of the data sample.

[Figure 3: Graph of the posterior PDF for the mean for several values of $n$ ($n=3,10,50$). The function $f(t;\nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$ is the PDF of the Student-$t$ distribution with $\nu$ degrees of freedom (DOF).]

[Figure 4: Graph of the posterior PDF for the standard deviation for several values of $n$ ($n=3,10,50$). The function $f(s;c,a) = \frac{|c|\, s^{ca-1}}{\Gamma(a)}\exp(-s^{c})$ for $s > 0$ is the PDF of the generalized-gamma distribution with shape parameters $c$ and $a$; here $c=-2$ and $a=(n-1)/2$.]

[Figure 5: Graph of the posterior PDF for the variance for several values of $n$ ($n=3,10,50$). The function $f(v;a) = \frac{1}{\Gamma(a)} v^{-a-1}\exp\left(-\frac{1}{v}\right)$ for $v > 0$ is the PDF of the inverse-gamma distribution with shape parameter $a$; here $a=(n-1)/2$.]

4.3 Gaussian approximations

The marginal posterior distributions for $\mu$, $\sigma$, and $\sigma^2$ all approach normal distributions as $n \to \infty$. In particular, the posterior distribution for $\mu$ approaches a normal distribution with mean $\bar{x}$ and variance $\frac{C}{n}$. The posterior distribution for $\sigma$ approaches a normal distribution with mean $\sqrt{C}$ and variance $\frac{C}{2n}$. Finally, the posterior distribution for $\sigma^2$ approaches a normal distribution with mean $C$ and variance $\frac{2C^2}{n}$.

4.4 Joint MAP estimators

Joint maximum a posteriori (MAP) estimators are sometimes useful. Because $\mu \mid X$ and $\sigma \mid X$ (and similarly $\mu \mid X$ and $\sigma^2 \mid X$) are not independent, the joint MAP estimator can produce different results than the marginal MAP estimators. These estimators minimize the jointly-uniform loss function. To find the joint estimator we solve
$$\left(\hat{\mu}, \hat{\sigma}\right) = \arg\max_{\mu,\sigma} f(\mu,\sigma \mid X) = \arg\min_{\mu,\sigma}\left[-\log f(\mu,\sigma \mid X)\right]
= \arg\min_{\mu,\sigma}\left[(n+1)\log\sigma + \frac{(\mu-\bar{x})^2 + C}{2\sigma^2/n}\right],$$
and thus
$$0 = \frac{\hat{\mu}-\bar{x}}{\hat{\sigma}^2/n}, \qquad 0 = \frac{n+1}{\hat{\sigma}} - \frac{(\hat{\mu}-\bar{x})^2 + C}{\hat{\sigma}^3/n}.$$
Solving these simultaneously gives
$$\hat{\mu}_{\mathrm{JMAP}} = \bar{x}, \qquad \hat{\sigma}_{\mathrm{JMAP}} = \sqrt{\frac{n}{n+1}\, C}.$$
The joint estimator for $\mu$ and $v$ can also be found in the same way:
$$\left(\hat{\mu}, \hat{v}\right) = \arg\min_{\mu,v}\left[\frac{n+2}{2}\log v + \frac{(\mu-\bar{x})^2 + C}{2v/n}\right].$$
Differentiating results in
$$0 = \frac{\hat{\mu}-\bar{x}}{\hat{v}/n}, \qquad 0 = \frac{n+2}{2\hat{v}} - \frac{(\hat{\mu}-\bar{x})^2 + C}{2\hat{v}^2/n}.$$
Solving these simultaneously gives
$$\hat{\mu}_{\mathrm{JMAP}} = \bar{x}, \qquad \hat{\sigma}^2_{\mathrm{JMAP}} = \frac{n}{n+2}\, C.$$
While the estimator for the mean is uninteresting, a wide variety of normalization constants applied to $C$ show up in this analysis. Using these estimates requires a particular devotion to maximizing $f(\mu,\sigma \mid X)$ instead of other estimation approaches. The most useful approach to understanding estimates of $\mu$, $\sigma$, and $\sigma^2$ is to determine confidence intervals, which is the subject of the next section.
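The closed-form joint MAP values can be checked numerically by minimizing the negative log posterior directly. The sketch below is one way to do so (hypothetical sufficient statistics; SciPy's default bounded quasi-Newton optimizer is assumed).

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical sufficient statistics, chosen only for illustration
n, xbar, C = 10, 1.5, 4.0

def neg_log_posterior(params):
    """-log f(mu, sigma | X), up to an additive constant (Section 4.4)."""
    mu, sigma = params
    return (n + 1) * np.log(sigma) + ((mu - xbar)**2 + C) / (2 * sigma**2 / n)

res = minimize(neg_log_posterior, x0=[0.0, 1.0],
               bounds=[(None, None), (1e-8, None)])   # keep sigma positive
print("numerical joint MAP:", res.x)
print("closed form        :", [xbar, np.sqrt(n * C / (n + 1))])
```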
4.5 Confidence intervals

One of the advantages of the Bayesian perspective is that it automatically provides a method to obtain practical confidence intervals for the estimates. With the probability density function given, confidence intervals can be constructed by finding an interval straddling the mean (or the peak, or the median) with equal probability on either side. Given the nature of the confidence interval as an area, there is some aesthetic value in choosing the median as the middle value to surround.

4.5.1 General case

Suppose there is a parameter $\theta$ with probability density function $f(\theta)$ and cumulative distribution function $F(\theta)$. How is a confidence interval $[a, b]$ constructed about the mean, peak, and/or median? The interval should be such that the probability of $\theta$ lying within the range is $100\alpha$ percent, where $\alpha$ is a given parameter. In other words, the area under $f(\theta)$ over the confidence interval should be $\alpha$. Suppose $\hat{\theta}$ is the position about which an equal-area confidence interval is desired. Then the two end points of the interval can be calculated from
$$P\left\{a \le \theta \le \hat{\theta}\right\} = \frac{\alpha}{2}, \qquad P\left\{\hat{\theta} \le \theta \le b\right\} = \frac{\alpha}{2}.$$
These state that
$$F(\hat{\theta}) - F(a) = \frac{\alpha}{2}, \qquad F(b) - F(\hat{\theta}) = \frac{\alpha}{2},$$
so that
$$a = F^{-1}\!\left[F(\hat{\theta}) - \frac{\alpha}{2}\right], \qquad b = F^{-1}\!\left[F(\hat{\theta}) + \frac{\alpha}{2}\right].$$
For the confidence interval about the median, $F(\hat{\theta}) = \frac{1}{2}$, so that in that important case
$$a = F^{-1}\!\left(\frac{1-\alpha}{2}\right), \qquad b = F^{-1}\!\left(\frac{1+\alpha}{2}\right).$$

4.5.2 Mean

For the case of the mean, we have seen that the distribution of
$$\frac{(\mu \mid X) - \bar{x}}{\sqrt{C/(n-1)}}$$
is Student-$t$ with $n-1$ degrees of freedom. As a result,
$$F_\mu^{-1}(q) = \bar{x} + \sqrt{\frac{C}{n-1}}\, F_t^{-1}(q;\, n-1),$$
where $F_t^{-1}(q;\nu)$ is the inverse cumulative distribution function (CDF) of the Student-$t$ distribution with $\nu$ degrees of freedom. In addition, the mean, the median, and the peak are all the same value. Note also that because the Student-$t$ distribution is symmetric,
$$F_t^{-1}\!\left(\tfrac{1}{2} - q;\, \nu\right) = -F_t^{-1}\!\left(\tfrac{1}{2} + q;\, \nu\right),$$
so that
$$a = \bar{x} - \sqrt{\frac{C}{n-1}}\, F_t^{-1}\!\left(\frac{1+\alpha}{2};\, n-1\right), \qquad
b = \bar{x} + \sqrt{\frac{C}{n-1}}\, F_t^{-1}\!\left(\frac{1+\alpha}{2};\, n-1\right).$$

4.5.3 Standard deviation

For the case of the standard deviation, we have seen that the distribution of $(\sigma \mid X)\sqrt{\frac{2}{nC}}$ is generalized gamma with $c = -2$ and $a = \frac{n-1}{2}$. Therefore,
$$F_\sigma^{-1}(q) = \sqrt{\frac{nC}{2}}\, F_1^{-1}\!\left(q;\, \frac{n-1}{2}\right),$$
where $F_1^{-1}(q;a)$ is the inverse cumulative distribution function (CDF) of the generalized-gamma distribution with parameters $c = -2$ and $a$:
$$F_1^{-1}(q;\, a) = \left[\Gamma^{-1}\!\left(a,\, \Gamma(a)\, q\right)\right]^{-1/2},$$
where $\Gamma^{-1}(a,\cdot)$ is the inverse of the upper incomplete gamma function, i.e. $\Gamma\!\left(a, \Gamma^{-1}(a, y)\right) = y$. As a result,
$$a = \sqrt{\frac{nC}{2}}\, F_1^{-1}\!\left(F_1\!\left(\hat{\sigma}\sqrt{\frac{2}{nC}};\, \frac{n-1}{2}\right) - \frac{\alpha}{2};\, \frac{n-1}{2}\right), \qquad
b = \sqrt{\frac{nC}{2}}\, F_1^{-1}\!\left(F_1\!\left(\hat{\sigma}\sqrt{\frac{2}{nC}};\, \frac{n-1}{2}\right) + \frac{\alpha}{2};\, \frac{n-1}{2}\right),$$
where $F_1(x;a) = \Gamma\!\left(a,\, x^{-2}\right)/\Gamma(a)$ is the cumulative distribution function (CDF) of the generalized gamma with parameters $a$ and $c = -2$. When using the median as the center point, these expressions simplify to
$$a = \sqrt{\frac{nC}{2}}\, F_1^{-1}\!\left(\frac{1-\alpha}{2};\, \frac{n-1}{2}\right), \qquad
b = \sqrt{\frac{nC}{2}}\, F_1^{-1}\!\left(\frac{1+\alpha}{2};\, \frac{n-1}{2}\right).$$
If the peak of the distribution is used as the center point, then
$$a = \sqrt{\frac{nC}{2}}\, F_1^{-1}\!\left(F_1\!\left(\sqrt{\frac{2}{n}};\, \frac{n-1}{2}\right) - \frac{\alpha}{2};\, \frac{n-1}{2}\right), \qquad
b = \sqrt{\frac{nC}{2}}\, F_1^{-1}\!\left(F_1\!\left(\sqrt{\frac{2}{n}};\, \frac{n-1}{2}\right) + \frac{\alpha}{2};\, \frac{n-1}{2}\right).$$
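In practice these intervals reduce to a few quantile evaluations. The sketch below (illustrative values of $n$, $\bar{x}$, $C$, and $\alpha$; SciPy assumed) computes the median-centered intervals for the mean and the standard deviation, using the fact that $F_1(x;a)$ is the regularized upper incomplete gamma function evaluated at $x^{-2}$.

```python
import numpy as np
from scipy import stats
from scipy.special import gammainccinv

# Hypothetical sufficient statistics and coverage, chosen only for illustration
n, xbar, C, alpha = 10, 1.5, 4.0, 0.95

# Median-centered interval for the mean (Section 4.5.2): Student-t quantiles
half = np.sqrt(C / (n - 1)) * stats.t.ppf((1 + alpha) / 2, df=n - 1)
print("interval for mu:   ", (xbar - half, xbar + half))

# Median-centered interval for sigma (Section 4.5.3): the generalized-gamma CDF is
# F1(x; a) = gammaincc(a, x**-2), so its inverse is gammainccinv(a, q)**(-0.5).
a_shape = (n - 1) / 2
def F1_inv(q):
    return gammainccinv(a_shape, q) ** -0.5

lo = np.sqrt(n * C / 2) * F1_inv((1 - alpha) / 2)
hi = np.sqrt(n * C / 2) * F1_inv((1 + alpha) / 2)
print("interval for sigma:", (lo, hi))
```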
4.5.4 Variance

We have seen that the distribution of $(\sigma^2 \mid X)\frac{2}{nC}$ is inverted gamma with $a = \frac{n-1}{2}$. Therefore,
$$F_{\sigma^2}^{-1}(q) = \frac{nC}{2}\, F_I^{-1}\!\left(q;\, \frac{n-1}{2}\right),$$
where $F_I^{-1}(q;a)$ is the inverse CDF of the inverted-gamma distribution with parameter $a$:
$$F_I^{-1}(q;\, a) = \frac{1}{\Gamma^{-1}\!\left(a,\, \Gamma(a)\, q\right)},$$
where again $\Gamma\!\left(a, \Gamma^{-1}(a, y)\right) = y$. Therefore,
$$a = \frac{nC}{2}\, F_I^{-1}\!\left(F_I\!\left(\hat{\sigma}^2\frac{2}{nC};\, \frac{n-1}{2}\right) - \frac{\alpha}{2};\, \frac{n-1}{2}\right), \qquad
b = \frac{nC}{2}\, F_I^{-1}\!\left(F_I\!\left(\hat{\sigma}^2\frac{2}{nC};\, \frac{n-1}{2}\right) + \frac{\alpha}{2};\, \frac{n-1}{2}\right),$$
where $F_I(x;a) = \Gamma\!\left(a,\, x^{-1}\right)/\Gamma(a)$ is the CDF of the inverted gamma with parameter $a$. Again, if the median is used as the center point, then these expressions simplify to
$$a = \frac{nC}{2}\, F_I^{-1}\!\left(\frac{1-\alpha}{2};\, \frac{n-1}{2}\right), \qquad
b = \frac{nC}{2}\, F_I^{-1}\!\left(\frac{1+\alpha}{2};\, \frac{n-1}{2}\right).$$
If the peak of the marginal distribution is used as the center point, the equations are
$$a = \frac{nC}{2}\, F_I^{-1}\!\left(F_I\!\left(\frac{2}{n+1};\, \frac{n-1}{2}\right) - \frac{\alpha}{2};\, \frac{n-1}{2}\right), \qquad
b = \frac{nC}{2}\, F_I^{-1}\!\left(F_I\!\left(\frac{2}{n+1};\, \frac{n-1}{2}\right) + \frac{\alpha}{2};\, \frac{n-1}{2}\right).$$
Notice that the confidence interval for the variance is the square of the confidence interval for the standard deviation when the median of the distribution is used in both cases. Also, care must be taken for large $\alpha$ and small $n$ that none of the arguments to the inverse CDF are negative. Such a situation indicates that symmetry around the peak is impossible; in that case either the median should be used as the middle point, or the area should be taken from $0$ to an upper bound $b$.

5 Discussion

Having learned that the minimum mean-squared-error estimator for $\sigma^2$ from data $X$ is $E[\sigma^2 \mid X]$, one might be surprised by the fact that the expected value of the posterior marginal distribution does not result in the same estimator as the minimum mean-squared-error estimator over the class of estimators of the form $aC$, even though it has the same form. For example, the classical estimator for $\hat{\sigma}^2$ that gives minimum MSE is $\frac{n}{n+1}C$, but the Bayesian minimum mean-squared-error estimator is $\frac{n}{n-3}C$. Why are these different? The difference comes in the subtle distinction between the two estimators. The former finds the value of $a$ that minimizes the integral
$$\int \left(aC - \sigma^2\right)^2 f\!\left(C \mid \sigma^2\right) dC,$$
an average over the sampling distribution of the statistic with $\sigma^2$ held fixed, while the latter finds the function of $X$ (which happens to be of the form $a\,C(X)$) that minimizes the integral
$$\int\!\!\int \left(g(X) - \sigma^2\right)^2 f\!\left(X, \sigma^2\right) d\sigma^2\, dX.$$
This final integral includes an averaging integral against the noninformative prior as well as an integral over the data. These are two entirely different optimization problems and should not be expected to provide the same result.

It is important to understand the full probability distribution of $\mu$, $\sigma$, and $\sigma^2$, especially when the number of data samples is small. For example, the variance of the posterior probability distribution for $\sigma^2$ is not even defined if $n \le 5$. As a result, it can be impossible to just report the mean and variance. A confidence interval (or highest-density region) around the median of the distribution is always possible.

6 Summary and Conclusions

In this paper, a study of several estimators for the mean, variance, and standard deviation of data was presented. In particular, it was shown that the unbiased estimator for variance so commonly used is not typically a good choice (especially for small $n$) because using $n+1$ as a divisor rather than $n-1$ shrinks the mean-squared error of the estimator. In addition, a fully Bayesian perspective on the problem of estimating a common mean and variance from samples was presented using maximum entropy and noninformative priors. The results provide the posterior conditional PDFs of the mean, standard deviation, and variance, from which estimates and confidence intervals can be calculated. The results also emphasize the point that the Bayesian mean-squared error (MSE) is not necessarily the same as other, non-Bayesian definitions of MSE because it involves another averaging integral over the prior information on the quantity to be estimated. The mean of the conditional PDF minimizes the Bayesian MSE.
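As a small worked illustration of this reporting recipe (hypothetical sufficient statistics again; SciPy assumed), the variance posterior can be handled directly as an inverted-gamma distribution with shape $\frac{n-1}{2}$ and scale $\frac{nC}{2}$, and its median-centered interval is exactly the square of the standard-deviation interval, as noted in Section 4.5.4.

```python
import numpy as np
from scipy import stats
from scipy.special import gammainccinv

# Hypothetical sufficient statistics and coverage, chosen only for illustration
n, xbar, C, alpha = 10, 1.5, 4.0, 0.95

# Posterior of the variance: sigma^2 | X ~ InvGamma(a=(n-1)/2, scale=n*C/2)
var_post = stats.invgamma((n - 1) / 2, scale=n * C / 2)
v_lo, v_hi = var_post.ppf((1 - alpha) / 2), var_post.ppf((1 + alpha) / 2)
print("point estimate E[sigma^2|X]:", var_post.mean(), "(= n*C/(n-3))")
print("median-centered interval for sigma^2:", (v_lo, v_hi))

# The same interval via the sigma interval of Section 4.5.3: it squares exactly.
s_lo = np.sqrt(n * C / 2) * gammainccinv((n - 1) / 2, (1 - alpha) / 2) ** -0.5
s_hi = np.sqrt(n * C / 2) * gammainccinv((n - 1) / 2, (1 + alpha) / 2) ** -0.5
print("squared sigma interval:              ", (s_lo**2, s_hi**2))
```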
Table 1 summarizes the results for understanding $\mu$, $\sigma$, and $\sigma^2$ from data assumed to have a common mean and variance. The table requires only the sufficient statistics
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad C = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
It is also very useful to note that $\frac{\mu-\bar{x}}{\sqrt{C}}\sqrt{n-1}$ is Student-$t$ distributed with $n-1$ degrees of freedom, $\sigma\sqrt{\frac{2}{nC}}$ is generalized-gamma distributed with shape parameters $c = -2$ and $a = \frac{n-1}{2}$, and $\frac{2\sigma^2}{nC}$ is inverted-gamma distributed with shape parameter $a = \frac{n-1}{2}$. These final facts can be used to compute estimates and confidence intervals from standardized tables and distributions.

Table 1: Summary of posterior probability distributions for $\mu$, $\sigma$, and $\sigma^2$.
$$\begin{array}{lllll}
 & \text{PDF } f(\cdot \mid X) & \text{Mode} & \text{Mean} & \text{Variance} \\
\mu & \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)\sqrt{\pi C}}\left[1 + \frac{(\mu-\bar{x})^2}{C}\right]^{-n/2} & \bar{x} & \bar{x} & \frac{C}{n-3} \;\; (n>3) \\
\sigma & \frac{2\left(nC/2\right)^{(n-1)/2}}{\Gamma\left(\frac{n-1}{2}\right)\sigma^{n}}\exp\left(-\frac{nC}{2\sigma^2}\right), \;\; \sigma > 0 & \sqrt{C} & \sqrt{\frac{n}{2}}\,\frac{\Gamma\left(\frac{n-2}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)}\sqrt{C} \;\; (n>2) & \frac{n}{2}\left[\frac{\Gamma\left(\frac{n-3}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)} - \frac{\Gamma^2\left(\frac{n-2}{2}\right)}{\Gamma^2\left(\frac{n-1}{2}\right)}\right] C \;\; (n>3) \\
\sigma^2 & \frac{\left(nC/2\right)^{(n-1)/2}}{\Gamma\left(\frac{n-1}{2}\right)\left(\sigma^2\right)^{(n+1)/2}}\exp\left(-\frac{nC}{2\sigma^2}\right), \;\; \sigma^2 > 0 & \frac{n}{n+1}C & \frac{n}{n-3}C \;\; (n>3) & \frac{2n^2C^2}{(n-3)^2(n-5)} \;\; (n>5)
\end{array}$$

References

[1] Jaynes, E. T., Probability Theory: The Logic of Science, Cambridge University Press, 2003.

[2] Jeffreys, Sir Harold, Theory of Probability, 3rd edition, Oxford University Press, 1961.