Tuesday, November 10, 2015

Random subdivisions of an interval

This is a classical problem, but I find it very appealing.

Drop $n$  points $X_1,\dotsc, X_n$  at random in the unit interval, uniformly and independently. These  points     divide the interval into $n+1$  sub-intervals.  What is the expected length of the longest of these intervals? What is its asymptotic behavior for $n$ large?

I was surprised to find out that the  problem  of finding  the expectation of the largest segment   was solved   in late 19th century.    What follows   is a more detailed presentation  of the arguments in Sec. 6.3 of the book  Order Statistics   by H.A. David and  H.N. Nagaraja.

If you read French, you can click here to see what P. Levy had to say about this problem back in 1939.

Relabel  by $Y_1,\dotsc, Y_n$, the points $\{X_1,\dotsc, X_n\}\subset [0,1]$ so that  $Y_1\leq \cdots \leq Y_n$.  We set

$$ Y_0=0,\;\;Y_{n+1}=1,\;\;\;\;L_n= \max_{1\leq i\leq n+1} Y_i-Y_{i-1},.$$

We want to determine  the expectation $\newcommand{\bE}{\mathbb{E}}$ $E_n$ of $L_n$.

The probability distribution  of $\vec{X}=(X_1,\dotsc, X_n)$  $\newcommand{\bone}{\boldsymbol{1}}$   is

$$ p_{\vec{X}}(dx)=\bone_{C_n}(x_1,\dotsc, x_n) dx_1\cdots dx_n, \;\;C_n=[0,1]^n, $$

where $\bone_A$ denotes the indicator function of a subset $A$ of a given set $S$.

The probability distribution of $\vec{Y}=(Y_1,\dotsc, Y_n)$ is

$$p_{\vec{Y}}(dy)= n! \bone_{S_n}(y_1,\dotsc, y_n) dy_1\cdots dy_n, $$

where $S_n$ is the simplex $\newcommand{\bR}{\mathbb{R}}$

$$ T_n=\bigl\{ y\in\bR^n;\;\;0\leq y_1\leq \cdots \leq  y_n\leq 1\,\bigr\}. $$

The   points $Y_1,\dotsc, Y_n$ determine $n+1$ segments on $[0,1]$ of lengths
$$ S_i= Y_{i}-Y_{i-1},\;\; i=1,\dotsc, n+1. $$
The resulting map $\vec{Y}\to\vec{S}=\vec{S}(\vec{Y})$ induces a linear bijection

$$ T_n\to \Delta_n=\bigl\{\, (s_1,\dotsc, s_{n+1})\in\bR_{\geq 0}^{n+1};\;\;s_1+\cdots+s_{n+1}=1\,\bigr\}. $$

This  shows that the probability density is $\DeclareMathOperator{\vol}{vol}$

$$p_{\vec{S}}(ds) =\frac{1}{\vol(\Delta_n)} dV_{\Delta_n}, $$

where $dV_{\Delta_n}$ denote the usual  Euclidean volume density on  the standard simplex $\Delta_n$. This description  shows that  the random vector $\vec{S}$ is exchangeable, i.e., the probability  density  of $\vec{S}$ is invariant under permutation of the components. Thus the  joint probability distribution of any group  of $k$ components of $\vec{S}$,   $k\leq n$, is the same as the joint distribution of the first $k$-components $S_1,\dotsc, S_k$.

The joint  probability distribution of the first $n$ components is

$$ p_n(ds)= n! \bone_{D_n}(s_1,\dotsc,s_n) ds, $$

where $D_n$ is the simplex

$$ D_n=\{\, (s_1,\dotsc, s_n)\in\bR^n_{\geq 0};\;\;s_1+\cdots +s_n\leq 1\,\}. $$

Indeed, $(S_1,\dotsc, S_n)\in D_n$ and

$$ ds=|ds_1\cdots  ds_n|=|d y_1\wedge d(y_2-y_{1})\wedge  \cdots \wedge d(y_{n}-y_{n-1})|=|dy_1\cdots  d y_n|. $$

The  joint density of the first $n-1$ components $(S_1, \dotsc, S_{n-1})$ is

$$  p_{n-1}(s_1,\dotsc, s_{n-1}) =  n! \int_0^\infty \bone_{D_n}(s_1,\dotsc,  s_n)  ds_n $$

$$=  n! \bone_{D_{n-1}}(s_1,\dotsc,  s_{n-1})\int_0^{1-s+1-\cdots-s_{n-1}}  ds_n= n! \bone_{D_{n-1}}(s_1,\dotsc,  s_{n-1})(1-s_1-\cdots-s_{n-1}) $$

The joint density of the first $(n-2)$ components is

$$ p_{n-2}(s_1,\dots, s_{n-2})=\int_0^\infty p_{n-1}(s_1,\dotsc, s_{n-1}) ds_{n-1} $$

$$ =  n!\bone_{D_{n-2}}(s_1,\dotsc, s_{n-2})\int_0^{1-s_1-\cdots-s_{n-2}} (1-s_1-\cdots -s_{n-1}) d s_{n-1} =  \frac{n!}{2!} \bone_{D_{n-2}}(s_1,\dotsc, s_{n-2})(1-s_1-\cdots -s_{n-2})^2. $$

Iterating, we deduce that the joint probability density of the the first $k$ variables is

$$ p_k(s_1,\dotsc, s_k) =\frac{n!}{(n-k)!}p_{D_k}(s_1,\dotsc, s_k) (1-s_1-\cdots-s_k)^{n-k}. $$

If $1\leq k\leq n$,  then$\newcommand{\bP}{\mathbb{P}}$

$$ \bP[S_1>s,\dotsc, S_k>s]=\frac{n!}{(n-k)!}\int_s^\infty\cdots\int_s^\infty p_{D_k}(s_1,\dotsc, s_k) (1-s_1-\cdots-s_k)^{n-k} ds_1\dotsc ds_k $$

$$=\frac{n!}{(n-k)!}\int_s^1ds_1\int_s^{1-s_1}ds_2\cdots\int_s^{1-s_1-\cdots-s_{k-1}}  (1-s_1-\cdots-s_k)_+^{n-k} ds_k =(1-ks)_+^n,\;\;x_+:=\max(x,0).$$


$$  \bP[S_1>s,\dotsc, S_n>s, S_{n+1}>s] = \bP[S_1>s,\dotsc, S_n>s, S_1+S_2+\cdots +S_n\leq 1-s]$$

$$=n!\int_{\substack{s_1,\dotsc s_{n}>s\\ s_1+\cdots +s_{n}\leq 1-s}}ds_1\cdots d s_{n} $$

$$=n!\int_{\substack{s_1,\dotsc s_{n-1}>s\\ s_1+\cdots +s_{n-1}\leq 1-2s}}ds_1\cdots d s_{n-1}\int_s^{1-s-s_1-\cdots -s_{n-1}} d s_n $$

$$= n!\int_{\substack{s_1,\dotsc s_{n-1}>s\\ s_1+\cdots +s_{n-1}\leq 1-2s}} (1-2s-s_1-\cdots -s_{n-1}) ds_1\cdots  ds_{n-1} $$

$$=n!\int_{\substack{s_1,\dotsc s_{n-2}>s\\ s_1+\cdots +s_{n-2}\leq 1-3s}}ds_1\cdots ds_{n-2}\int_s^{1-2s-s_1-\cdots -s_{n-2}} (1-2s-s_1-\cdots -s_{n-1})  d s_{n-1}  $$

$$=\frac{n!}{2!} \int_{\substack{s_1,\dotsc s_{n-2}>s\\ s_1+\cdots +s_{n-2}\leq 1-3s}} (1-3s-s_1-\cdots -s_{n-2})^2 ds_1\cdots ds_{n-2} $$

$$ =\frac{n!}{2!} \int_{\substack{s_1,\dotsc s_{n-3}>s\\ s_1+\cdots +s_{n-3}\leq 1-4s}}  ds_1\cdots ds_{n-3} \int_s^{1-3s -s_1-\cdots -s_{n-3}} (1-3s-s_1-\cdots -s_{n-2})^2ds_{n-2}$$

$$=\frac{n!}{3!} \int_{\substack{s_1,\dotsc s_{n-3}>s\\ s_1+\cdots +s_{n-3}\leq 1-4s}}  (1-4s-s_1-\cdots -s_{n-3})^3ds_1\cdots ds_{n-3} $$

$$=\cdots = \bigl(\, 1-(n+1)s\,\bigr)_+^n. $$


$$ L_n:=\max_{1\leq i\leq n+1} S_i $$

 we have

$$\bP[L_n>s]=\bP\left[\,\bigcup_{i=1}^{n+1}\bigl\lbrace\, S_i>s\,\bigr\rbrace\,\right] $$

(use inclusion-exclusion  exchangeability)

$$ = \sum_{k=1}^{n+1}(-1)^{k+1}\binom{n+1}{k} \bP\bigl[\, S_1>s,\dotsc, S_k>s\,\bigr] =\sum_{k=1}^{n+1} \binom{n+1}{k}(1-ks)_+^n .$$

The last equality is sometime known as Whitworth  formula and it goes all the way back to 1897. We have

$$ E_n=\bE[L_n]=\int_0^\infty \bP[L_n>s] ds =\sum_{k=1}^{n+1}(-1)^{k+1} \binom{n+1}{k}\int_0^\infty (1-ks)_+^n  ds = \sum_{k=1}^{n+1}(-1)^{k+1} \binom{n+1}{k}\int_0^{1/k}(1-ks)^n  ds $$

$$=\frac{1}{n+1} \sum_{k=1}^{n+1}(-1)^{k+1} \frac{1}{k} \binom{n+1}{k}=\sum_{k=1}^{n+1}(-1)^{k+1} \frac{1}{k^2} \binom{n}{k-1}. $$

Here  is  $E_n$ answers for  small $n$ (thank you MAPLE)

$$ E_1=\frac{3}{4},\;\;E_2=\frac{11}{18},\;\;E_3=\frac{25}{48},\;\;E_4=\frac{137}{300},\;\; E_5=\frac{49}{120}. $$

It turn out that $L_n$ is highly concentrated around its mean. In the work of P. Levy I mentioned earlier he showed  that the random variables   $nL_n-\log n$ converges in probability   to a random  variable $X_\infty$   with probability  distribution  $\mu$ given by

$$ \mu\bigl(\,(-\infty,x])= \exp\bigl(\,-e^{-x}\,\bigr),\;\;x\geq 0.$$


$$\lim_{n\to \infty} \bE[\, nL_n-\log n\,]=\gamma=\bE[X_\infty],\;\;\lim_{n\to \infty}\mathrm{Var}[ nL_n\,]=\frac{\pi^2}{6}=\mathrm{Var}[X_\infty]. $$

$$\frac{n}{\log n} L_n \to 1\;\; \mbox{a.s.} $$

Note that this shows that

$$E_n\sim\frac{\log n}{n}\;\; \mbox{as $n\to \infty$}. $$

I'll stop here, but you can learn more from this paper of Luc Devroye and its copious   list of references.

Post a Comment