## 9.1.5 Mean Squared Error (MSE)

Suppose that we would like to estimate the value of an unobserved random variable $X$ given that we have observed $Y=y$. In general, our estimate $\hat{x}$ is a function of $y$: \begin{align} \hat{x}=g(y). \end{align} The error in our estimate is given by \begin{align} \tilde{X}&=X-\hat{x}\\ &=X-g(y). \end{align} Often, we are interested in the mean squared error (MSE) given by \begin{align} E[(X-\hat{x})^2 | Y=y]=E[(X-g(y))^2 | Y=y]. \end{align} One way of finding a point estimate $\hat{x}=g(y)$ is to find a function $g(Y)$ that minimizes the mean squared error (MSE). Here, we show that $g(y)=E[X|Y=y]$ has the lowest MSE among all possible estimators. That is why it is called the minimum mean squared error (MMSE) estimate.

For simplicity, let us first consider the case that we would like to estimate $X$ without observing anything. What would be our best estimate of $X$ in that case? Let $a$ be our estimate of $X$. Then, the MSE is given by

\begin{align} h(a)&=E[(X-a)^2]\\ &=EX^2-2aEX+a^2. \end{align} This is a quadratic function of $a$, and we can find the minimizing value of $a$ by differentiation: \begin{align} h'(a)=-2EX+2a. \end{align} Setting $h'(a)=0$, and noting that $h''(a)=2>0$ so that the stationary point is a minimum, we conclude that the minimizing value of $a$ is \begin{align} a=EX. \end{align} Now, if we have observed $Y=y$, we can repeat the above argument. The only difference is that everything is conditioned on $Y=y$. More specifically, the MSE is given by \begin{align} h(a)&=E[(X-a)^2|Y=y]\\ &=E[X^2|Y=y]-2aE[X|Y=y]+a^2. \end{align} Again, we obtain a quadratic function of $a$, and by differentiation we obtain the MMSE estimate of $X$ given $Y=y$ as \begin{align} \hat{x}_{M}=E[X|Y=y]. \end{align}
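
The claim that $a=EX$ minimizes $h(a)=E[(X-a)^2]$ can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes a hypothetical sample from an exponential distribution with mean 2 (any distribution with finite variance would do), replaces the expectation by a sample average, and searches a grid of candidate values of $a$ around the sample mean. The names `empirical_mse`, `candidates`, and `best` are illustrative.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical data: 10,000 draws of X from an exponential distribution with mean 2.
samples = [random.expovariate(0.5) for _ in range(10_000)]


def empirical_mse(a, xs):
    """Sample average of (x - a)^2, an estimate of h(a) = E[(X - a)^2]."""
    return sum((x - a) ** 2 for x in xs) / len(xs)


sample_mean = sum(samples) / len(samples)

# Candidate estimates a on a grid of spacing 0.1 centred on the sample mean.
candidates = [sample_mean + k / 10 for k in range(-20, 21)]
best = min(candidates, key=lambda a: empirical_mse(a, samples))

print(f"sample mean    = {sample_mean:.4f}")
print(f"grid minimizer = {best:.4f}")
```

Because the empirical MSE is itself a quadratic in $a$ with its vertex at the sample mean, the grid minimizer should coincide with the sample mean, mirroring the calculation above.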

Suppose now that we would like to estimate the value of an unobserved random variable $X$ by observing the value of a random variable $Y$. For each observed value $Y=y$, our estimate $\hat{x}$ is a function of $y$, so we can write \begin{align} \hat{X}=g(Y). \end{align} Note that, since $Y$ is a random variable, the estimator $\hat{X}=g(Y)$ is also a random variable. The error in our estimate is given by \begin{align} \tilde{X}&=X-\hat{X}\\ &=X-g(Y), \end{align} which is also a random variable. We can then define the mean squared error (MSE) of this estimator by \begin{align} E[(X-\hat{X})^2]=E[(X-g(Y))^2]. \end{align} By the law of iterated expectations, $E[(X-g(Y))^2]=E\big[E[(X-g(Y))^2|Y]\big]$, and from our discussion above, $g(y)=E[X|Y=y]$ minimizes the inner conditional MSE for every value of $y$. We conclude that the conditional expectation $\hat{X}_M=E[X|Y]$ has the lowest MSE among all estimators $g(Y)$.
Mean Squared Error (MSE) of an Estimator

Let $\hat{X}=g(Y)$ be an estimator of the random variable $X$, given that we have observed the random variable $Y$. The mean squared error (MSE) of this estimator is defined as \begin{align} E[(X-\hat{X})^2]=E[(X-g(Y))^2]. \end{align} The MMSE estimator of $X$, \begin{align} \hat{X}_{M}=E[X|Y], \end{align} has the lowest MSE among all possible estimators.
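
To make the definition concrete, the sketch below computes the MSE exactly for a small discrete example. The joint PMF is hypothetical and chosen only for illustration, and the helper names `cond_mean` and `mse` are not from the text. It compares $E[X|Y]$ with a few other estimators $g(Y)$ using exact rational arithmetic.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P(X = x, Y = y); the numbers are for illustration only.
joint = {
    (0, 0): F(3, 10), (1, 0): F(1, 10),
    (0, 1): F(1, 10), (1, 1): F(2, 10), (2, 1): F(3, 10),
}


def cond_mean(y):
    """Exact E[X | Y = y] computed from the joint PMF."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return sum(p * x for (x, yy), p in joint.items() if yy == y) / p_y


def mse(g):
    """Exact E[(X - g(Y))^2] for an estimator g."""
    return sum(p * (x - g(y)) ** 2 for (x, y), p in joint.items())


mean_x = sum(p * x for (x, _), p in joint.items())

estimators = {
    "E[X|Y]": cond_mean,
    "g(Y) = Y": lambda y: y,
    "g(Y) = E[X]": lambda y: mean_x,
    "g(Y) = 2Y": lambda y: 2 * y,
}
results = {name: mse(g) for name, g in estimators.items()}
for name, value in results.items():
    print(f"{name:12s} MSE = {value}")
```

For this PMF, the conditional expectation attains the smallest MSE of the four candidates, as the boxed statement predicts. Comparing against a handful of alternatives is only an illustration, not a proof.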

### Properties of the Estimation Error

Here, we would like to study the MSE of the conditional expectation. First, note that \begin{align} E[\hat{X}_M]&=E[E[X|Y]]\\ &=E[X] \quad \textrm{(by the law of iterated expectations)}. \end{align} Therefore, $\hat{X}_M=E[X|Y]$ has the same expected value as $X$. In other words, for $\hat{X}_M=E[X|Y]$, the estimation error, $\tilde{X}$, is a zero-mean random variable \begin{align} E[\tilde{X}]=EX-E[\hat{X}_M]=0. \end{align} Before going any further, let us state and prove a useful lemma.

Lemma 9.1
Let $\hat{X}_M=E[X|Y]$ be the MMSE estimator of $X$ given $Y$, and let $\tilde{X}=X-\hat{X}_M$ be the estimation error. Define the random variable $W=E[\tilde{X}|Y]$. Then, we have
1. $W=0$.
2. For any function $g(Y)$, we have $E[\tilde{X} \cdot g(Y)]=0$.
Proof:
1. We can write \begin{align} W&=E[\tilde{X}|Y]\\ &=E[X-\hat{X}_M|Y]\\ &=E[X|Y]-E[\hat{X}_M|Y]\\ &=\hat{X}_M-E[\hat{X}_M|Y]\\ &=\hat{X}_M-\hat{X}_M=0. \end{align} The last line resulted because $\hat{X}_M$ is a function of $Y$, so $E[\hat{X}_M|Y]=\hat{X}_M$.
2. First, note that \begin{align} E[\tilde{X} \cdot g(Y)|Y]&=g(Y) E[\tilde{X}|Y]\\ &=g(Y) \cdot W=0. \end{align} Next, by the law of iterated expectations, we have \begin{align} E[\tilde{X} \cdot g(Y)]=E\big[E[\tilde{X} \cdot g(Y)|Y]\big]=0. \end{align}
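
Both parts of the lemma can be verified exactly on a small discrete example. The sketch below assumes a hypothetical joint PMF and a few arbitrary integer-valued test functions $g$. The helper names `expect`, `cond_expect`, `x_hat`, and `error` are illustrative.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P(X = x, Y = y), chosen only to illustrate the lemma.
joint = {
    (1, -1): F(1, 8), (3, -1): F(1, 8),
    (0, 0): F(1, 4),
    (2, 2): F(1, 8), (5, 2): F(3, 8),
}


def expect(f):
    """Exact E[f(X, Y)] under the joint PMF."""
    return sum(p * f(x, y) for (x, y), p in joint.items())


def cond_expect(f, y0):
    """Exact E[f(X, Y) | Y = y0]."""
    p_y = sum(p for (_, y), p in joint.items() if y == y0)
    return sum(p * f(x, y) for (x, y), p in joint.items() if y == y0) / p_y


def x_hat(y):
    """MMSE estimate E[X | Y = y]."""
    return cond_expect(lambda x, _: x, y)


def error(x, y):
    """Estimation error X - E[X | Y]."""
    return x - x_hat(y)


# Part 1: W = E[error | Y = y] for each value y that Y can take.
w = {y: cond_expect(error, y) for y in sorted({y for _, y in joint})}

# Part 2: E[error * g(Y)] for a few arbitrary test functions g.
test_functions = [lambda y: y, lambda y: y ** 3, lambda y: abs(y) + 1,
                  lambda y: 1 if y > 0 else 0]
inner_products = [expect(lambda x, y, g=g: error(x, y) * g(y))
                  for g in test_functions]

print("W by value of y:", w)
print("E[error * g(Y)]:", inner_products)
```

Every entry printed should be exactly zero, since the arithmetic is done with rational numbers rather than floating point.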

We are now ready to state a very interesting property of the estimation error for the MMSE estimator. Namely, we show that the estimation error, $\tilde{X}$, and $\hat{X}_M$ are uncorrelated. To see this, note that \begin{align} \textrm{Cov}(\tilde{X},\hat{X}_M)&=E[\tilde{X}\cdot \hat{X}_M]-E[\tilde{X}] E[\hat{X}_M]\\ &=E[\tilde{X} \cdot\hat{X}_M] \quad (\textrm{since $E[\tilde{X}]=0$})\\ &=E[\tilde{X} \cdot g(Y)] \quad (\textrm{since $\hat{X}_M$ is a function of }Y)\\ &=0 \quad (\textrm{by Lemma 9.1}). \end{align} Now, let us look at $\textrm{Var}(X)$. The estimation error is $\tilde{X}=X-\hat{X}_M$, so \begin{align} X=\tilde{X}+\hat{X}_M. \end{align} Since $\textrm{Cov}(\tilde{X},\hat{X}_M)=0$, we conclude \begin{align}\label{eq:var-MSE} \textrm{Var}(X)=\textrm{Var}(\hat{X}_M)+\textrm{Var}(\tilde{X}). \hspace{30pt} (9.3) \end{align} The above formula can be interpreted as follows. Part of the variance of $X$ is explained by the variance in $\hat{X}_M$. The remaining part is the variance in estimation error. In other words, if $\hat{X}_M$ captures most of the variation in $X$, then the error will be small. Note also that we can rewrite Equation 9.3 as \begin{align} E[X^2]-E[X]^2=E[\hat{X}^2_M]-E[\hat{X}_M]^2+E[\tilde{X}^2]-E[\tilde{X}]^2. \end{align} Note that \begin{align} E[\hat{X}_M]=E[X], \quad E[\tilde{X}]=0. \end{align} We conclude \begin{align} E[X^2]=E[\hat{X}^2_M]+E[\tilde{X}^2]. \end{align}
Some Additional Properties of the MMSE Estimator

- The MMSE estimator, $\hat{X}_{M}=E[X|Y]$, has the same expectation as $X$, i.e., \begin{align} E[\hat{X}_{M}]=EX, \quad E[\tilde{X}]=0. \end{align}
- The estimation error, $\tilde{X}$, and $\hat{X}_{M}$ are uncorrelated \begin{align} \textrm{Cov}(\tilde{X},\hat{X}_M)=0. \end{align}
- We have \begin{align} \textrm{Var}(X)&=\textrm{Var}(\hat{X}_M)+\textrm{Var}(\tilde{X}),\\ E[X^2]&=E[\hat{X}^2_M]+E[\tilde{X}^2]. \end{align}
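
These properties can also be checked by simulation. The sketch below uses an assumed model that is not from the text: $Y \sim Uniform(-1,1)$ and $X=Y^2+N$, where $N \sim N(0, 0.25)$ is independent of $Y$, so that $E[X|Y]=Y^2$. Sample averages stand in for expectations, so the identities hold only up to sampling error.

```python
import random

random.seed(1)  # fixed seed for reproducibility
n = 50_000

# Assumed model (illustrative): Y ~ Uniform(-1, 1), X = Y^2 + N with
# N ~ N(0, 0.5^2) independent of Y, so that E[X | Y] = Y^2.
ys = [random.uniform(-1.0, 1.0) for _ in range(n)]
xs = [y * y + random.gauss(0.0, 0.5) for y in ys]
x_hat = [y * y for y in ys]                    # MMSE estimator
err = [x - xh for x, xh in zip(xs, x_hat)]     # estimation error


def mean(v):
    """Sample mean."""
    return sum(v) / len(v)


def var(v):
    """Sample variance (dividing by n)."""
    m = mean(v)
    return mean([(t - m) ** 2 for t in v])


def cov(u, v):
    """Sample covariance (dividing by n)."""
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])


print(f"E[X] = {mean(xs):.4f}, E[X_hat] = {mean(x_hat):.4f}, E[err] = {mean(err):.4f}")
print(f"Cov(err, X_hat)       = {cov(err, x_hat):.4f}")
print(f"Var(X)                = {var(xs):.4f}")
print(f"Var(X_hat) + Var(err) = {var(x_hat) + var(err):.4f}")
```

The sample covariance should be close to zero and the two variance figures should nearly agree. The gap between them equals twice the sample covariance.
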

Let us look at an example to practice the above concepts. This is an example involving jointly normal random variables. Thus, before solving the example, it is useful to remember the properties of jointly normal random variables. Remember that two random variables $X$ and $Y$ are jointly normal if $aX+bY$ has a normal distribution for all $a,b \in \mathbb{R}$. As we have seen before, if $X$ and $Y$ are jointly normal random variables with parameters $\mu_X$, $\sigma^2_X$, $\mu_Y$, $\sigma^2_Y$, and $\rho$, then, given $Y=y$, $X$ is normally distributed with \begin{align} \nonumber E[X|Y=y]&=\mu_X+ \rho \sigma_X \frac{y-\mu_Y}{\sigma_Y},\\ \nonumber \textrm{Var}(X|Y=y)&=(1-\rho^2)\sigma^2_X. \end{align}

Example
Let $X \sim N(0, 1)$ and \begin{align} Y=X+W, \end{align} where $W \sim N(0, 1)$ is independent of $X$.
1. Find the MMSE estimator of $X$ given $Y$ ($\hat{X}_M$).
2. Find the MSE of this estimator, using $MSE=E[(X-\hat{X}_M)^2]$.
3. Check that $E[X^2]=E[\hat{X}^2_M]+E[\tilde{X}^2]$.
Solution
Since $X$ and $W$ are independent and normal, $Y$ is also normal. Moreover, $X$ and $Y$ are also jointly normal, since for all $a,b \in \mathbb{R}$, we have \begin{align} aX+bY=(a+b)X+bW, \end{align} which is also a normal random variable. Note also, \begin{align} \textrm{Cov}(X,Y)&=\textrm{Cov}(X,X+W)\\ &=\textrm{Cov}(X,X)+\textrm{Cov}(X,W)\\ &=\textrm{Var}(X)=1. \end{align} Therefore, \begin{align} \rho(X,Y)&=\frac{\textrm{Cov}(X,Y)}{\sigma_X \sigma_Y}\\ &=\frac{1}{1 \cdot \sqrt{2}}=\frac{1}{\sqrt{2}}. \end{align}
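1. Since $\mu_X=\mu_Y=0$, $\sigma_X=1$, and $\sigma_Y=\sqrt{\textrm{Var}(X)+\textrm{Var}(W)}=\sqrt{2}$, the MMSE estimator of $X$ given $Y$ is \begin{align} \hat{X}_M&=E[X|Y]\\ &=\mu_X+ \rho \sigma_X \frac{Y-\mu_Y}{\sigma_Y}\\ &=0+\frac{1}{\sqrt{2}} \cdot 1 \cdot \frac{Y-0}{\sqrt{2}}\\ &=\frac{Y}{2}. \end{align}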
2. The MSE of this estimator is given by \begin{align} E[(X-\hat{X}_M)^2]&=E\left[\left(X-\frac{Y}{2}\right)^2\right]\\ &=E\left[X^2-XY+\frac{Y^2}{4}\right]\\ &=EX^2-E[X(X+W)]+\frac{EY^2}{4}\\ &=EX^2-EX^2-EXEW+\frac{EY^2}{4} \quad (\textrm{since $X$ and $W$ are independent})\\ &=\frac{\textrm{Var}(Y)+(EY)^2}{4} \quad (\textrm{since $EX=EW=0$})\\ &=\frac{2+0}{4}=\frac{1}{2}. \end{align}
3. Note that $E[X^2]=1$. Also, \begin{align} E[\hat{X}^2_M]=\frac{EY^2}{4}=\frac{1}{2}. \end{align} In the above, we also found $MSE=E[\tilde{X}^2]=\frac{1}{2}$. Therefore, we have \begin{align} E[X^2]=E[\hat{X}^2_M]+E[\tilde{X}^2]. \end{align}
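
The results of this example can be checked with a short Monte Carlo simulation. This is a sketch under the example's assumptions ($X$ and $W$ independent standard normal, $Y=X+W$). The sample size, seed, and the two comparison estimators ($g(Y)=Y$ and the constant estimate $0$) are illustrative choices, not part of the example.

```python
import random

random.seed(2)  # fixed seed for reproducibility
n = 100_000

xs = [random.gauss(0.0, 1.0) for _ in range(n)]   # X ~ N(0, 1)
ws = [random.gauss(0.0, 1.0) for _ in range(n)]   # W ~ N(0, 1), independent of X
ys = [x + w for x, w in zip(xs, ws)]              # Y = X + W


def mse(estimates):
    """Sample estimate of E[(X - X_hat)^2]."""
    return sum((x - e) ** 2 for x, e in zip(xs, estimates)) / n


def second_moment(v):
    """Sample estimate of E[V^2]."""
    return sum(t * t for t in v) / n


x_hat = [y / 2 for y in ys]                       # MMSE estimator Y/2
err = [x - e for x, e in zip(xs, x_hat)]          # estimation error

mse_mmse = mse(x_hat)
mse_raw = mse(ys)                                 # alternative estimator g(Y) = Y
mse_zero = mse([0.0] * n)                         # ignore Y and estimate X by E[X] = 0

print(f"MSE of Y/2          = {mse_mmse:.3f}")
print(f"MSE of g(Y) = Y     = {mse_raw:.3f}")
print(f"MSE of constant 0   = {mse_zero:.3f}")
print(f"E[X^2]              = {second_moment(xs):.3f}")
print(f"E[X_hat^2]+E[err^2] = {second_moment(x_hat) + second_moment(err):.3f}")
```

Up to sampling error, the MSE of $Y/2$ should be near $\frac{1}{2}$, while both alternative estimators have MSE near $1$, and the last two printed lines should nearly agree.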
