Portmanteau Theorem

Statement: $X_n \overset{D}{\to} X$ iff $E[g(X_n)] \to E[g(X)]$ for every bounded, Lipschitz-continuous $g$. (Part 1 below actually proves the forward direction for every bounded continuous $g$, which is stronger.) We prove this in two parts:

Part 1

Here we show $X_n \overset{D}{\to} X$ implies $E[g(X_n)] \to E[g(X)]$ for any continuous bounded g. Let $F_n$ denote the cdf for $X_n$, and likewise let $F$ be the cdf for $X$.

Pick $\epsilon > 0$ and a continuous $g$ bounded by $M$. Since $X_n \overset{D}{\to} X$, the family $\{X, X_1, X_2, \ldots\}$ is tight, i.e. $X_n = O_P(1)$: there is an $R > 0$ (which we may take to be a continuity point of $F$, since $F$ has at most countably many discontinuities) such that $P(|X_n| > R) < \epsilon$ for every $n$ and also $P(|X| > R) < \epsilon$. Thus $\displaystyle \left| \int_{|x| > R} g(x)\, dF_n(x) \right| \leq M\, P(|X_n| > R) < \epsilon M$, and likewise for $X$. Since this contribution can be made arbitrarily small, we need only look at convergence on the interval $[-R, R]$.

Let’s approximate g on $[-R, R]$. Since g is continuous, it is uniformly continuous on this compact interval, so we can pick $N+1$ points $t_i$ with $-R = t_0 < t_1 < \ldots < t_N = R$ such that $x, y \in [t_k, t_{k+1}]$ implies $|g(x) - g(y)| < \epsilon$. Furthermore, since F is continuous outside a countable set, we can choose the interior points so that F is continuous at each $t_i$ (and $\pm R$ are continuity points already). Finally, let’s define h, a step-function approximation to g, by $h(x) = g(t_k)$ for $x \in [t_k, t_{k+1})$, $k = 0, \ldots, N-1$. (And of course $h(t_N) = g(t_N)$, and $h(x) = 0$ everywhere else.)

We need two lemmas:

Lemma 1: $|E[g(Y)] - E[h(Y)]| < \epsilon + \epsilon M$ where $Y \in \{X, X_1, X_2, \ldots\}$.

Proof:

$$
\begin{eqnarray}
|E[g(Y)] - E[h(Y)]| &=& |E[g(Y)1_{|Y|>R}] + (E[g(Y)1_{|Y|\leq R}] - E[h(Y)1_{|Y|\leq R}])| \\
&\leq& |E[g(Y)1_{|Y|>R}]| + \left| E[(g(Y) - h(Y))1_{|Y|\leq R}] \right| \\
&<& \epsilon M + \epsilon P(|Y| \leq R)\tag{1}\\
&\leq& \epsilon M + \epsilon
\end{eqnarray}
$$

where (1) follows since we’ve constructed h such that $|g(x) - h(x)| < \epsilon$ for $x \in [-R, R]$. $\blacksquare$

Lemma 2: $E[h(X_n)] \to E[h(X)]$.

Proof: It’s really easy to integrate h, since it’s a step function. We have:

$$
\begin{eqnarray}
|E[h(X_n)] - E[h(X)]| &=& \left| \int_{[-R, R]} h(x) dF_n - \int_{[-R, R]} h(x) dF \right| \\
&=& \left| \left( \sum_{k=0}^{N-1} g(t_k)(F_n(t_{k+1}) - F_n(t_k)) \right) - \left(  \sum_{k=0}^{N-1} g(t_k)(F(t_{k+1}) - F(t_k)) \right)\right| \\
&\leq& \sum_{k=0}^{N-1} |g(t_k)| \left( |F_n(t_{k+1}) - F(t_{k+1})| + |F_n(t_k) - F(t_k)| \right).
\end{eqnarray}
$$

Since there are only finitely many $t_i$’s, and since each $t_i$ is a continuity point of F, and since $F_n \to F$ at each continuity point, it follows that this last quantity tends to 0 as n approaches infinity. $\blacksquare$

Finally, note that

$$
|E[g(X_n)] - E[g(X)]| \leq |E[g(X_n)] - E[h(X_n)]| + |E[h(X_n)] - E[h(X)]| \\ + |E[h(X)] - E[g(X)]|
$$

and our two lemmas show that each of the three terms on the right can be made small: the first and third are at most $\epsilon(M + 1)$ by Lemma 1, and the second tends to $0$ as $n \to \infty$ by Lemma 2. Since $\epsilon$ was arbitrary, $E[g(X_n)] \to E[g(X)]$. $\blacksquare$
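
As a quick numerical sanity check of Part 1 (not part of the proof), here is a short Python sketch. It assumes numpy; the CLT supplies a sequence converging in distribution, and the bounded continuous test function $g(x) = 1/(1+x^2)$ is an arbitrary choice.

```python
import numpy as np

# Numerical sketch of Part 1: if X_n -> X in distribution, then
# E[g(X_n)] -> E[g(X)] for bounded continuous g.  Here X_n is the
# standardized mean of n Uniform(0,1) draws, which tends in distribution
# to X ~ N(0,1) by the CLT, and g(x) = 1/(1+x^2) is bounded and
# continuous.  All choices are arbitrary.
rng = np.random.default_rng(0)
g = lambda z: 1.0 / (1.0 + z**2)
reps = 100_000

def E_g_Xn(n):
    u = rng.uniform(size=(reps, n))
    x_n = (u.mean(axis=1) - 0.5) * np.sqrt(12 * n)   # standardized sample mean
    return g(x_n).mean()

E_g_X = g(rng.standard_normal(reps)).mean()          # Monte Carlo estimate of E[g(X)]
for n in (1, 4, 16, 64):
    print(f"n={n:3d}  E[g(X_n)] ~ {E_g_Xn(n):.4f}   E[g(X)] ~ {E_g_X:.4f}")
```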

Part 2

Let $F_n$ and $F$ be the cdf’s as above. Suppose $E[g(X_n)] \to E[g(X)]$ for any bounded, Lipschitz g. We must show that $F_n \to F$ at any continuity point of F. Let x be such a continuity point.

Pick $\epsilon > 0$. Let’s define two Lipschitz continuous functions $a$ and $b$ as follows:

  • Let $a(y) = 1$ for $y \leq x – \epsilon$, $a(y) = 0$ for $y \geq x$, and $a$ is linear on $[x – \epsilon, x]$.
  • Let $b(y) = 1$ for $y \leq x$, $b(y) = 0$ for $y \geq x + \epsilon$, and $b$ is linear on $[x, x + \epsilon]$.

So we have

$$
\int a \, dF_n \leq \int 1_{y \leq x} \, dF_n(y) = F_n(x) \leq \int b \, dF_n.
$$

From the right-hand inequality and the assumed convergence $\int b \, dF_n \to \int b \, dF$ (note $b$ is bounded and Lipschitz), we get $\displaystyle \limsup_{n\to\infty} F_n(x) \leq \int b \, dF \leq F(x+\epsilon)$, i.e. $\displaystyle \limsup_{n\to\infty} F_n(x) - F(x) \leq F(x+\epsilon) - F(x)$. Likewise, using $a$, we get $\displaystyle \liminf_{n\to\infty} F_n(x) - F(x) \geq F(x - \epsilon) - F(x)$. But since $F$ is continuous at $x$, both right-hand sides can be made arbitrarily small as $\epsilon$ decreases. Hence $F_n(x) \to F(x)$. $\blacksquare$
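
Here is a small Python sketch of the sandwich $a \leq 1_{(-\infty, x]} \leq b$ used above (illustration only; numpy assumed, and the sample, $x$, and $\epsilon$ are arbitrary choices).

```python
import numpy as np

# The Lipschitz functions a and b from the proof satisfy
#   a(y) <= 1_{y <= x} <= b(y),
# so E[a(X_n)] <= F_n(x) <= E[b(X_n)] for any random variable X_n.
# Checked here on an arbitrary sample, with arbitrary x and eps.
def a(y, x, eps):   # 1 below x - eps, 0 above x, linear in between
    return np.clip((x - y) / eps, 0.0, 1.0)

def b(y, x, eps):   # 1 below x, 0 above x + eps, linear in between
    return np.clip((x + eps - y) / eps, 0.0, 1.0)

rng = np.random.default_rng(1)
sample = rng.standard_normal(100_000)     # stand-in for draws of X_n
x, eps = 0.3, 0.1
F_n_x = (sample <= x).mean()              # empirical cdf at x
print(a(sample, x, eps).mean(), "<=", F_n_x, "<=", b(sample, x, eps).mean())
```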

Notes: Why is this called the Portmanteau Theorem? I can’t find any references.


Uniqueness of Laplace transform

Statement: For $f : [0, \infty) \to \mathbb{R}$, let $L_f(t) = \int_0^\infty f(x)e^{-tx} dx$. If $f$ and $g$ are continuous with $f(x) = o(e^{ax})$ and $g(x) = o(e^{ax})$ for some $a > 0$, then $L_f = L_g$ implies $f = g$.

Proof: By linearity of the integral, it is enough to show that $L_f = 0$ implies $f = 0$ (then apply this to $f - g$). Perform the substitution $x = -\log(u)$ to obtain

$$
\begin{eqnarray}
L_f(t) &=& \int_0^\infty f(x)e^{-tx} dx \\
&=& \int_0^1 u^{t-1}f(-\log(u)) du
\end{eqnarray}
$$

where for $t \geq a+1$, $\displaystyle \lim_{u\to 0^+} u^{t-1}f(-\log(u)) = \lim_{x\to\infty} e^{-(t-1)x}f(x) = 0$, so the integrand extends continuously to $u = 0$ and the integral above makes sense. Thus, if $L_f = 0$, then $\int_0^1 u^{n}f(-\log(u)) du = 0$ for every integer $n \geq N$, where $N$ is any fixed integer with $N \geq a$. Hence, by linearity, $\int_0^1 p(u)f(-\log(u)) du = 0$ whenever $p$ is a polynomial all of whose terms have degree $N$ or greater.
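
A quick numerical check of the substitution (illustration only; it assumes numpy and scipy, and the choice $f(x) = e^{-x}$, for which $L_f(t) = 1/(t+1)$, is arbitrary):

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of the substitution x = -log(u):
#   L_f(t) = int_0^inf f(x) e^{-tx} dx = int_0^1 u^{t-1} f(-log u) du.
# Here f(x) = exp(-x), for which L_f(t) = 1/(t+1) exactly; f and the
# values of t are arbitrary choices.
f = lambda x: np.exp(-x)
for t in (1.0, 2.5, 7.0):
    lhs, _ = quad(lambda x: f(x) * np.exp(-t * x), 0, np.inf)
    rhs, _ = quad(lambda u: u ** (t - 1) * f(-np.log(u)), 0, 1)
    print(f"t={t}:  lhs={lhs:.6f}  rhs={rhs:.6f}  exact={1 / (t + 1):.6f}")
```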

Since $f(-\log(u))$ is continuous on $[0, 1]$, by Stone-Weierstrass we can choose a sequence of polynomials $p_k$ (having terms only of degree $N$ or greater; see the note at the end of this post) which converges uniformly to $f(-\log(u))$. That is, for any $\epsilon > 0$, $|f(-\log(u)) - p_k(u)| \leq \epsilon$ for all $u$ once $k$ is sufficiently large. (Strictly speaking, continuity at $u = 0$ requires $f(x)$ to have a limit as $x \to \infty$; if it does not, run the same argument with $u^{\lceil a \rceil + 1} f(-\log(u))$ in place of $f(-\log(u))$, which is continuous on $[0,1]$ and vanishes at $0$.) Thus, on the one hand, $k$ sufficiently large implies

$$
\begin{eqnarray}
\left| \int_0^1 (f(-\log(u)) - p_k(u))f(-\log(u)) du \right|
&\leq& \int_0^1 \epsilon |f(-\log(u))| du \\
&=& \epsilon \int_0^1 |f(-\log(u))| du
\end{eqnarray}
$$

which can be made arbitrarily small. But on the other hand,

$$
\begin{eqnarray}
\int_0^1 (f(-\log(u)) - p_k(u))f(-\log(u)) du
&=& \int_0^1 f(-\log(u))^2 du - \int_0^1 p_k(u)f(-\log(u)) du \\
&=& \int_0^1 f(-\log(u))^2 du - 0.
\end{eqnarray}
$$

Comparing the two computations, $\int_0^1 f(-\log(u))^2 du \leq \epsilon \int_0^1 |f(-\log(u))| du$ for every $\epsilon > 0$, so the left-hand side, which does not depend on $k$ or $\epsilon$, must be $0$. Since $f(-\log(u))^2$ is continuous and non-negative, the only way this is possible is if $f(-\log(u)) = 0$ for all $u$ in $(0, 1]$, i.e., $f(x) = 0$ for all $x$ in $[0, \infty)$. $\blacksquare$

Notes: This can also be used to prove the uniqueness of the Fourier transform.

We needed to be able to approximate an arbitrary continuous function $f$ on $[0, 1]$ with $f(0) = 0$ uniformly by polynomials whose terms all have degree greater than or equal to some $K$. To do this, it suffices to approximate the curve $y = x$ uniformly by such polynomials: substituting that approximation for $x$ in an ordinary Weierstrass approximation of $f$ (which we may take to have zero constant term, since $f(0) = 0$) then does the job. To approximate $y = x$, set $f_n(x) = x(1 - (1 - x)^n)^K$: every term of $f_n$ has degree at least $K + 1$, and $f_n \to x$ uniformly on $[0, 1]$.
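
A short numpy sketch checking both claims about $f_n$ for one arbitrary choice of $K$ (illustration only):

```python
import numpy as np
from numpy.polynomial import Polynomial

# f_n(x) = x * (1 - (1-x)^n)^K: check that it converges uniformly to x on
# [0,1] and that every term has degree at least K + 1.  K and the values
# of n are arbitrary choices.
K = 3
grid = np.linspace(0.0, 1.0, 2001)

# Uniform convergence, evaluated in the (numerically stable) factored form.
for n in (5, 20, 80, 320):
    f_n = grid * (1 - (1 - grid) ** n) ** K
    print(f"n={n:4d}  sup|f_n(x) - x| = {np.max(np.abs(f_n - grid)):.4f}")

# Minimal degree, checked by expanding one small case exactly.
x = Polynomial([0.0, 1.0])                       # the polynomial "x"
p = x * (1 - (1 - x) ** 5) ** K                  # f_5 in the monomial basis
print("coefficients of degrees 0..K:", p.coef[: K + 1])   # all zero
```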


Strong Law of Large Numbers

Truncation Lemma

Statement: Let $X_n$ be a sequence of IID random variables with finite mean, and let $Y_n = X_n I_{|X_n| \leq n}$. Then $\sum \frac{Var(Y_n)}{n^2} < \infty$.

Proof:

Lemma: If $c \geq 0$, then $c^2 \sum_{n=1}^\infty \frac{I_{c \leq n}}{n^2} \leq 1 + c$.

Proof: The lemma clearly holds for $c = 0$. Otherwise, let $d = \lceil c \rceil$. Then $d \geq 1$, and so

$$
\begin{eqnarray}
\sum_{n=1}^\infty \frac{I_{c \leq n}}{n^2}
& = & \sum_{n=1}^\infty \frac{I_{d \leq n}}{n^2} \\
& = & \sum_{n=d}^\infty \frac{1}{n^2} \\
& \leq & \frac{1}{d^2} + \int_{d}^\infty \frac{1}{x^2} dx \\
& = & \frac{1}{d^2} + \frac{1}{d}.
\end{eqnarray}
$$

Multiplying through by $c^2$ and noting that $\frac{c}{d} \leq 1$ (and hence also $\frac{c^2}{d^2} \leq 1$) gives us the result. $\blacksquare$
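
A quick numerical check of the lemma (illustration only; numpy assumed, and the truncation point of the infinite sum and the test values of $c$ are arbitrary):

```python
import numpy as np

# Numerical check of the lemma: for c >= 0,
#   c^2 * sum_{n >= c} 1/n^2  <=  1 + c.
# The infinite sum is truncated at N terms (the neglected tail is at most
# c^2/N); N and the test values of c are arbitrary.
N = 1_000_000
n = np.arange(1, N + 1, dtype=np.float64)
for c in (0.0, 0.7, 1.0, 3.2, 50.0):
    lhs = c**2 * np.sum(1.0 / n[n >= c] ** 2)
    print(f"c={c:5.1f}  lhs={lhs:7.3f}  1 + c={1 + c:7.3f}")
```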

Thus we have

$$
\begin{eqnarray}
\sum \frac{Var(Y_n)}{n^2}
& \leq & \sum \frac{E(Y_n^2)}{n^2} \\
& = & \sum \frac{E(X_n^2 I_{|X_n| \leq n})}{n^2} \\
& = & \sum \frac{E(X_1^2 I_{|X_1| \leq n})}{n^2} \\
& = & \sum E\left(\frac{X_1^2 I_{|X_1| \leq n}}{n^2}\right) \\
& = & E\left(\sum \frac{X_1^2 I_{|X_1| \leq n}}{n^2}\right) \tag{1}\\
& = & E\left(|X_1|^2 \sum \frac{I_{|X_1| \leq n}}{n^2}\right) \\
& \leq & E( 1 + |X_1| ) < \infty \tag{2}
\end{eqnarray}
$$

where (1) follows by the Monotone Convergence Theorem and (2) follows by the above Lemma. $\blacksquare$

Kolmogorov’s Strong Law

Statement: Suppose the $X_i$ are independent, each with finite mean $\mu$, and that $\sum \frac{Var(X_n)}{n^2} < \infty$. Define $S_n = \frac{X_1 + \ldots + X_n}{n}$. Then $S_n \to \mu$ almost surely.

Proof: We may assume WLOG that $\mu = 0$ by subtracting it from the $X_i$’s. We must thus show $S_n \to 0$ almost surely.

Lemma: If $a_1, a_2, \ldots$ is a sequence of non-negative real numbers and $\sum a_n/n < \infty$, then the average $\frac{a_1 + \ldots + a_n}{n}$ converges to 0.

Proof: Pick $\epsilon > 0$. Let $c_n = \sum_{k=1}^n a_k/k$ and $A_n = \frac{a_1 + \ldots + a_n}{n}$. Also, let $T_n = \sum_{k=n}^\infty a_k/k$. Since $c_n$ converges to a finite number, $T_n$ converges to 0, hence we can choose some number $K$ such that $k > K$ implies $T_k < \frac{\epsilon}{2}$. Now, observe that $n > K$ implies

$$
\begin{eqnarray}
A_n &=& \frac{a_1 + \ldots + a_K}{n} + \sum_{i=K+1}^n \frac{a_i}{n} \\
&\leq& \frac{a_1 + \ldots + a_K}{n} + \sum_{i=K+1}^n \frac{a_i}{i} \tag{1} \\
&\leq& \frac{a_1 + \ldots + a_K}{n} + T_{K+1} \\
&<& \frac{a_1 + \ldots + a_K}{n} + \frac{\epsilon}{2}
\end{eqnarray}
$$

where (1) follows by the non-negativity of the $a_i$’s (since $a_i/n \leq a_i/i$ for $i \leq n$). But the first term also converges to zero as $n \to \infty$, since its numerator is a fixed finite sum; hence it too is bounded by $\frac{\epsilon}{2}$ for sufficiently large $n$. Thus $A_n < \epsilon$ for large $n$. $\blacksquare$
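
Here is a small numpy illustration of the lemma, using the (arbitrary) choice $a_n = 1$ when $n$ is a perfect square and $a_n = 0$ otherwise, so that $\sum a_n/n = \sum 1/k^2 < \infty$:

```python
import numpy as np

# a_n = 1 if n is a perfect square, 0 otherwise: then sum a_n/n < infinity,
# and the lemma says the running averages (a_1 + ... + a_n)/n tend to 0.
N = 1_000_000
n = np.arange(1, N + 1)
a = (np.sqrt(n) % 1 == 0).astype(float)        # indicator of perfect squares
averages = np.cumsum(a) / n
print("sum a_n/n up to N:", np.sum(a / n))     # partial sums of sum 1/k^2, about 1.64
for m in (100, 10_000, 1_000_000):
    print(f"average of first {m:>9,d} terms: {averages[m - 1]:.5f}")
```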

Now, let $B_n = \sum_{i=1}^n \frac{X_i}{i}$. By our assumption, $Var(B_n) = \sum_{i=1}^n \frac{Var(X_i)}{i^2}$ converges to a finite limit, hence by Kolmogorov’s Two-Series Theorem (applied to the variables $X_i/i$, which have mean 0 and summable variances), $B_n$ converges almost surely. On any path where $B_n$ converges, the averages $S_n$ converge to zero: the lemma above covers non-negative sequences, and the same conclusion for arbitrary real sequences is Kronecker’s lemma, which follows from a summation-by-parts version of the same argument. $\blacksquare$

 

Strong Law of Large Numbers

Statement: Suppose $X_i$ are IID with finite mean $\mu$. Then $S_n = \frac{X_1 + \ldots + X_n}{n}$ converges to $\mu$ almost surely.

Proof: Let $Y_i$ be the truncation of $X_i$ as in the first theorem. Define $Y_n^* = Y_n - E[Y_n]$, $T_n = \frac{Y_1 + \ldots + Y_n}{n}$, and $T_n^* = \frac{Y_1^* + \ldots + Y_n^*}{n}$. Since $Var(Y_n^*) = Var(Y_n)$, the truncation lemma tells us that $\sum \frac{Var(Y_i^*)}{i^2}$ is finite, and since the $Y_i^*$’s all have mean 0, Kolmogorov’s strong law tells us $T_n^*$ converges almost surely to 0. Since $E[Y_n] \to \mu$ (by dominated convergence), it follows that $\frac{E[Y_1] + \ldots + E[Y_n]}{n}$ also converges to $\mu$. Hence $T_n = T_n^* + \frac{E[Y_1] + \ldots + E[Y_n]}{n}$ converges almost surely to $\mu$.

Now, $P(Y_n \neq X_n) = P(|X_n| > n) \leq P(|X_n| \geq n) = P(|X_1| \geq n)$. But $\sum_i P(|X_1| \geq i) \leq E[|X_1|] < \infty$. Thus by Borel-Cantelli, almost surely $X_n = Y_n$ for all sufficiently large $n$. On such a path, $S_n - T_n = \frac{1}{n}\sum_{i \leq n}(X_i - Y_i)$ has a numerator that stops changing beyond some (random) index, so $S_n - T_n \overset{a.s.}{\to} 0$. Hence $S_n \overset{a.s.}{\to} \mu$. $\blacksquare$
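
A simulation sketch of the theorem (illustration only; numpy assumed, and the heavy-tailed Pareto distribution with finite mean is an arbitrary choice):

```python
import numpy as np

# Running averages of IID draws with finite mean converge to that mean
# along (almost) every simulated path.  The Pareto(alpha = 1.5) samples
# below have mean alpha/(alpha - 1) = 3 but infinite variance; the
# distribution and path length are arbitrary choices.
rng = np.random.default_rng(2)
alpha, n = 1.5, 1_000_000
mu = alpha / (alpha - 1)
checkpoints = (1_000, 100_000, 1_000_000)
for path in range(3):
    x = (1 - rng.uniform(size=n)) ** (-1 / alpha)        # inverse-cdf Pareto samples
    running_mean = np.cumsum(x) / np.arange(1, n + 1)
    vals = ", ".join(f"{running_mean[k - 1]:.3f}" for k in checkpoints)
    print(f"path {path}: running mean at n = 1e3, 1e5, 1e6 -> {vals}   (mu = {mu})")
```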


Convergence theorems

Monotone Convergence Theorem

Statement: Suppose we have a nondecreasing sequence of non-negative measurable functions $f_n$ that converges almost everywhere to $f$ on some set $D$ in a measure space. Then $\lim \int_D f_n d\mu = \int_D f d\mu$.

Proof: Since the sequence is nondecreasing, $f_n \leq f$ for every $n$, so any simple function dominated by some $f_n$ is also dominated by $f$. This immediately gets us $\lim \int_D f_n d\mu \leq \int_D f d\mu$ (the limit exists in $[0, \infty]$ because the integrals are nondecreasing).

Now, suppose $\phi$ is a simple function dominated by $f$. Pick any $0 < \alpha < 1$ and define $A_i = \{ x \in D : f_i(x) \geq \alpha \phi(x) \}$. Since the $f_n$’s are nondecreasing, the sets $A_n$ are increasing. Also, for each $x$, either $\phi(x) = 0$ (so $x \in A_1$) or $\alpha \phi(x) < \phi(x) \leq f(x)$; since $f_n(x) \to f(x)$ from below, it follows that each $x$ is in some $A_i$. Hence $\cup A_i = D$.

We therefore have $\int_D f_n d\mu \geq \int_{A_n} f_n d\mu$ since the functions are non-negative. On the one hand, $\int_{A_n} f_n d\mu \geq \alpha \int_{A_n} \phi d\mu$. On the other hand, define $\nu(E) = \int_E \phi d\mu$; then $\nu$ is a measure, and since the $A_n$ increase to $D$, continuity from below gives $\lim \nu(A_n) = \nu(D) = \int_D \phi d\mu$. Thus,

$$\lim \int_D f_n d\mu \geq \alpha \int_D \phi d\mu.$$

Since $\alpha < 1$ was arbitrary, it follows that $\lim \int_D f_n d\mu \geq \int_D \phi d\mu$. Since $\phi$ was any simple function dominated by $f$, it therefore follows that $\lim \int_D f_n d\mu \geq \int_D f d\mu$. The theorem follows.
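
A small numerical illustration of the theorem (not part of the proof; it assumes numpy and scipy, and the particular $f$ and $f_n$ are arbitrary choices):

```python
import numpy as np
from scipy.integrate import quad

# Monotone convergence in action: f_n(x) = min(1/sqrt(x), n) increases
# pointwise to f(x) = 1/sqrt(x) on (0, 1], and int_0^1 f_n dx = 2 - 1/n
# increases to int_0^1 f dx = 2.  The particular f is an arbitrary choice.
f = lambda x: 1.0 / np.sqrt(x)
for n in (2, 4, 16, 64, 256):
    integral, _ = quad(lambda x: min(f(x), n), 0, 1, points=[1 / n**2])
    print(f"n={n:3d}  int f_n = {integral:.4f}   (exact: 2 - 1/n = {2 - 1/n:.4f})")
print("int f  =", round(quad(f, 0, 1)[0], 4))
```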

Fatou’s Lemma

Statement: Let $f_n$ be non-negative as before. Then $\int_D \liminf f_n d\mu \leq \liminf \int_D f_n d\mu$.

Proof: For each $x$, $\inf_{n \geq N} f_n(x)$ is nondecreasing in $N$ and converges to $\liminf_n f_n(x)$ as $N \to \infty$. Thus by the Monotone Convergence Theorem above,

$$\int_D \liminf_{n\to\infty} f_n \, d\mu =
\lim_{N\to\infty} \int_D \inf_{n \geq N} f_n \, d\mu \leq
\lim_{N\to\infty} \inf_{m \geq N} \int_D f_m \, d\mu =
\liminf_{n\to\infty} \int_D f_n \, d\mu,$$

where the inequality holds because $\inf_{n \geq N} f_n \leq f_m$ for every $m \geq N$, so $\int_D \inf_{n \geq N} f_n \, d\mu \leq \inf_{m \geq N} \int_D f_m \, d\mu$.
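
A quick numerical illustration showing the inequality can be strict (numpy assumed; the alternating-indicator example is an arbitrary choice):

```python
import numpy as np

# Strict inequality in Fatou's lemma: on [0,1], let f_n be the indicator of
# [0, 1/2] for even n and of [1/2, 1] for odd n.  Then liminf f_n = 0 a.e.,
# so int liminf f_n = 0, while int f_n = 1/2 for every n.  (Crude
# grid-based Riemann sums; the example itself is an arbitrary choice.)
grid = np.linspace(0, 1, 10_001)
dx = grid[1] - grid[0]

def f(n, x):
    return np.where((x <= 0.5) if n % 2 == 0 else (x >= 0.5), 1.0, 0.0)

fs = np.array([f(n, grid) for n in range(200)])
pointwise_liminf = fs.min(axis=0)          # equals liminf here: values 0/1 recur forever
integrals = fs.sum(axis=1) * dx            # Riemann sums of int f_n
print("int liminf f_n ~", pointwise_liminf.sum() * dx)
print("liminf int f_n ~", integrals.min())
```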

Reverse Fatou’s Lemma

Statement: If $f_n$ is nonnegative and bounded by an integrable function $g$, then $\limsup \int f_n \leq \int \limsup f_n$.

Proof: Note that $g - f_n \geq 0$ and $\liminf (g - f_n) = g - \limsup f_n$. By Fatou’s lemma, $\int \liminf (g - f_n) \leq \liminf \int (g - f_n) = \int g - \limsup \int f_n$. Since $g$ is integrable, $\int g$ is finite and can be cancelled from both sides, leaving the desired inequality.

Dominated Convergence Theorem

Statement: Suppose $f_n$ converges to $f$ pointwise and $|f_n| \leq g$ for some integrable $g$. Then $\lim \int f_n = \int f$.

Proof: Clearly $\left| \int f_n - \int f \right| \leq \int |f_n - f|$, where $|f_n - f|$ converges to 0 pointwise, is non-negative, and is bounded by the integrable function $2g$; so it suffices to prove the theorem under the assumptions that $f = 0$ and $f_n \geq 0$.

But then, since the new functions are still dominated by the integrable function $2g$, we can use the reverse Fatou lemma to obtain

$$
0 \leq \liminf \int f_n \leq \limsup \int f_n \leq \int \limsup f_n = \int 0 = 0.
$$

Thus these are all equal and $\lim \int f_n = \int f$.
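
A small numerical illustration of why the dominating function matters (scipy assumed; the examples are arbitrary choices):

```python
from scipy.integrate import quad

# Why the dominating function matters.  On [0,1]:
#   f_n(x) = n * 1_{(0, 1/n)}(x) -> 0 pointwise, yet int f_n = 1 for all n;
#     no single integrable g dominates every f_n, and lim int != int lim.
#   g_n(x) = x^n -> 0 pointwise with g_n <= 1 (integrable on [0,1]), and
#     int g_n = 1/(n+1) -> 0, as the theorem predicts.
for n in (10, 100, 1000):
    int_f, _ = quad(lambda x: n * (0 < x < 1 / n), 0, 1, points=[1 / n])
    int_g, _ = quad(lambda x: x**n, 0, 1)
    print(f"n={n:5d}  int f_n = {int_f:.4f}   int g_n = {int_g:.6f}")
```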


Kolmogorov’s Inequality

Statement: Let $X_i$ be independent random variables with mean 0, finite variance and $S_n = X_1 + \ldots + X_n$. Then $P\left( \displaystyle \max_{1 \leq i \leq n} |S_i| \geq a \right) \leq \frac{1}{a^2} Var(S_n)$.

Note that Chebyshev’s inequality already gives us $P\left( |S_n| \geq a \right) \leq \frac{1}{a^2} Var(S_n)$, so this is quite a strengthening.

Proof:

Let $A_i$ be the event that $S_i$ is the first of the partial sums with $|S_i| \geq a$. Clearly these events are disjoint, and $P\left( \displaystyle \max_{1 \leq i \leq n} |S_i| \geq a \right) = \sum P(A_i)$.

Obviously $E[I_{A_i} S_i^2] \geq a^2 P(A_i)$, since $|S_i| \geq a$ on $A_i$. So, summing over $i$ and recalling that $E[S_n^2] = Var(S_n)$, it suffices to prove that $E[S_n^2] \geq E[\sum I_{A_i}S_i^2]$.

First, $S_n^2 \geq \sum I_{A_i} S_n^2$ since $0 \leq \sum I_{A_i} \leq 1$. Second,

$$\begin{eqnarray}
E[I_{A_i} S_n^2] & = & E[I_{A_i} (S_i + (S_n - S_i))^2 ] \\
& = & E[I_{A_i} S_i^2] + 2E[I_{A_i} S_i (S_n - S_i)] + E[I_{A_i} (S_n - S_i)^2] \\
& = & E[I_{A_i} S_i^2] + 0 + E[I_{A_i} (S_n - S_i)^2] \\
& \geq & E[I_{A_i} S_i^2]
\end{eqnarray}$$

where the third equality holds because $I_{A_i} S_i$ and $S_n - S_i$ are independent (they are functions of disjoint sets of the $X$’s) and $E[S_n - S_i] = 0$. So $E[S_n^2] \geq E[\sum I_{A_i} S_n^2] \geq E[\sum I_{A_i} S_i^2]$ and the theorem follows.
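
A simulation sketch of the inequality (numpy assumed; the Rademacher steps, $n$, and thresholds are arbitrary choices):

```python
import numpy as np

# Simulation of Kolmogorov's inequality: for independent mean-zero steps,
#   P(max_i |S_i| >= a) <= Var(S_n) / a^2.
# Rademacher steps (Var = 1, so Var(S_n) = n), n = 100, and the thresholds
# a are arbitrary choices.
rng = np.random.default_rng(3)
n, reps = 100, 50_000
steps = rng.choice([-1.0, 1.0], size=(reps, n))
S = np.cumsum(steps, axis=1)
max_abs_S = np.max(np.abs(S), axis=1)
for a in (15.0, 20.0, 30.0):
    p_max = np.mean(max_abs_S >= a)
    p_end = np.mean(np.abs(S[:, -1]) >= a)    # what Chebyshev bounds directly
    print(f"a={a:4.0f}  P(max|S_i| >= a) = {p_max:.4f}  "
          f"P(|S_n| >= a) = {p_end:.4f}  bound n/a^2 = {n / a**2:.4f}")
```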

Kolmogorov’s Two-Series Theorem

Statement: Suppose the $X_i$ are independent with means $\mu_i$ and variances $\sigma_i^2$, such that $\sum \mu_i$ and $\sum \sigma_i^2$ converge. Then $\sum X_i$ converges almost surely.

Proof: This follows quickly from Kolmogorov’s inequality. We may assume WLOG that $\mu_i = 0$, just by subtracting it from $X_i$ (this changes $\sum X_i$ by the convergent deterministic series $\sum \mu_i$).

$S_n = \displaystyle \sum_{i=1}^n X_i$ converges iff it is Cauchy, i.e., iff for each $\epsilon > 0$, $|S_m - S_n| < \epsilon$ whenever $n, m$ are sufficiently large. Up to a factor of 2 in $\epsilon$, this is equivalent to saying $\sup_{m > n} |S_m - S_n| \leq \epsilon$ when $n > N$ (for some sufficiently large $N$). But $S_m - S_n = X_{n+1} + \ldots + X_m$, so by Kolmogorov’s inequality,

$$\begin{eqnarray}
P\left( \sup_{m > n} |S_m - S_n| \geq \epsilon \right) & = & \lim_{m \to \infty} P\left( \max_{n < i \leq m} |S_i - S_n| \geq \epsilon \right) \\
& \leq & \lim_{m \to \infty} \frac{1}{\epsilon^2} Var(X_{n+1} + \ldots + X_m) \\
& = & \frac{1}{\epsilon^2} \sum_{i=n+1}^\infty \sigma_i^2.
\end{eqnarray}$$

By assumption, this last sum is the tail of a convergent series, hence it tends to $0$ as $n \to \infty$. (The first equality above is continuity of probability from below: the events increase with $m$.) For fixed $\epsilon > 0$, the events $A_n = \left\{ \sup_{m > n} |S_m - S_n| \geq \epsilon \right\}$ therefore satisfy

$$
P\left( \bigcap_{n} A_n \right) \leq \inf_n P(A_n) = 0,
$$

and taking the union over rational $\epsilon > 0$ shows that, almost surely, for every $\epsilon > 0$ there is an $n$ with $\sup_{m > n} |S_m - S_n| < \epsilon$; that is, the sequence $(S_n)$ is Cauchy.

So $S_n$ converges almost surely.
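
A simulation sketch of the theorem (numpy assumed; the random-sign harmonic series is an arbitrary choice of $X_i$ with summable means and variances):

```python
import numpy as np

# The random-sign harmonic series: X_i = eps_i / i with independent signs
# eps_i has mean 0 and variance 1/i^2, both summable, so the two-series
# theorem says the partial sums settle down on (almost) every path.
rng = np.random.default_rng(4)
N = 1_000_000
i = np.arange(1, N + 1)
for path in range(3):
    partial = np.cumsum(rng.choice([-1.0, 1.0], size=N) / i)
    vals = ", ".join(f"{partial[k - 1]:+.4f}" for k in (100, 10_000, 1_000_000))
    print(f"path {path}: partial sums at n = 1e2, 1e4, 1e6 -> {vals}")
```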


Mahalanobis Distance

Let $\mathbf x$ be a vector, and assume we have a distribution with mean $\mathbf \mu$ and covariance matrix $S$. The Mahalanobis distance with respect to the distribution is defined as

$$D(\mathbf x) = \sqrt{(\mathbf x - \mathbf \mu)^TS^{-1}(\mathbf x - \mathbf \mu)}.$$

Why? Well, $S$ is symmetric and positive-definite, so we can decompose $\mathbf x - \mathbf \mu$ into $\sum a_i\mathbf e_i$, where the $\mathbf e_i$ are orthonormal eigenvectors of $S$ (with corresponding eigenvalues $\lambda_i$). Then $S^{-1} a_i \mathbf e_i = \frac{a_i \mathbf e_i}{\lambda_i}$ and so

$$D(\mathbf x) = \sqrt{\sum \frac{a_i^2}{\lambda_i}}.$$

So the Mahalanobis distance takes into account how far out from the mean we are, where the meaning of “far out” depends on the direction you’ve traveled from the mean.
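
Here is a small numpy sketch checking that the eigen-decomposition formula above agrees with the direct definition (the covariance matrix, mean, and point are arbitrary choices):

```python
import numpy as np

# Check that sqrt((x - mu)^T S^{-1} (x - mu)) equals sqrt(sum_i a_i^2 / lambda_i),
# where a_i are the coordinates of x - mu in the orthonormal eigenbasis of S.
# The covariance matrix, mean, and point are arbitrary choices.
rng = np.random.default_rng(5)
A = rng.standard_normal((3, 3))
S = A @ A.T + 3 * np.eye(3)                 # an arbitrary positive-definite covariance
mu = np.array([1.0, -2.0, 0.5])
x = np.array([2.0, 0.0, -1.0])

d_direct = np.sqrt((x - mu) @ np.linalg.inv(S) @ (x - mu))

lam, E = np.linalg.eigh(S)                  # columns of E: orthonormal eigenvectors
a = E.T @ (x - mu)                          # coordinates a_i in the eigenbasis
d_eigen = np.sqrt(np.sum(a**2 / lam))

print(d_direct, d_eigen)                    # the two agree
```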
