Introduction to Hierarchical Models


Data Mining Center, Renmin University of China

Introduction to Bayesian Framework

Prior, Likelihood and Posterior

• Bayes Formula $p(\theta\mid y) = \frac{p(\theta)p(y\mid\theta)}{p(y)}$
• Prior $\theta \sim$ prior distribution $p(\theta)$: determined from past information or subjective assessment.
• Observations $y\mid \theta \sim$ likelihood $p(y \mid \theta)$: the distribution of the observed data $y$ given the parameter $\theta$.
• Posterior $\theta \mid y \sim$ posterior distribution $p(\theta \mid y)$: the updated distribution of $\theta$ based on its prior and the observed data.
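As a quick numerical illustration of the Bayes formula (the prior, data, and grid below are my own choices, not from the slides), the posterior can be approximated on a grid of $\theta$ values, since $p(y)$ is just a normalizing constant:

```python
import numpy as np

# Hypothetical example: prior theta ~ Beta(2, 2), data y = 7 successes in n = 10 trials.
theta = np.linspace(0.001, 0.999, 999)           # grid over the parameter space
prior = theta**(2 - 1) * (1 - theta)**(2 - 1)    # Beta(2, 2) density, up to a constant
likelihood = theta**7 * (1 - theta)**(10 - 7)    # Binomial(10, theta) likelihood at y = 7
posterior = prior * likelihood                   # Bayes formula, unnormalized

dx = theta[1] - theta[0]
posterior /= posterior.sum() * dx                # normalize: p(y) is just a constant

# Conjugacy (see below) gives the exact answer Beta(2 + 7, 2 + 3), with mean 9/14.
post_mean = (theta * posterior).sum() * dx
print(round(post_mean, 3))                       # close to 9/14 ≈ 0.643
```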

Types of Prior

What exactly is a prior when we talk about it?

• Past experience
• Historical Research
• Subjective Beliefs

We can define three types of priors according to the information they contain:

• Informative Priors Prior distributions giving numerical information that is crucial to the estimation of the model.
• Non-informative Priors Uniform or nearly so; they basically allow the information from the likelihood to be interpreted probabilistically.
• Weakly Informative Priors Do not supply any controversial information, but are strong enough to pull the inference away from inappropriate values that would otherwise be consistent with the likelihood.

Common Types of Priors

What kinds of priors do we usually use?

• Experts’ Prior Prior distributions obtained by consulting experts.
• Conjugate Priors The prior distribution and the posterior distribution are from the same distribution family. For example, if $\theta \sim$ Beta distribution, then $\theta \mid y \sim$ Beta distribution.
• Non-informative Priors Uniform prior or Jeffreys prior.

Conjugate Prior

• The prior distribution and the posterior distribution are from the same distribution family.
• Example : $\theta \sim \text{Beta}(\alpha,\beta) \quad p(\theta) \propto \theta^{\alpha -1}(1 - \theta)^{\beta - 1} \quad \theta \in (0,1)$ $y\mid \theta \sim \text{Binomial}(n,\theta) \quad \quad p(y\mid \theta) \propto \theta^{y}(1 - \theta)^{n-y}$
• Hence we can derive the posterior $p(\theta\mid y) \propto p(y\mid \theta)\cdot p(\theta) \propto \theta^{\alpha + y -1}(1 - \theta)^{\beta +n - y - 1}$ Therefore, $\theta \mid y \sim \text{Beta}(\alpha + y,\beta +n - y) \quad \theta \in (0,1)$
• $\theta$’s prior and posterior are both Beta distribution.
Why do we use conjugate priors?
• They simplify the computation! We can easily derive the posterior distribution if we use conjugate prior.
• Common Conjugate Families
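The Beta-Binomial update derived above amounts to simple parameter arithmetic; a minimal sketch (with made-up numbers for illustration):

```python
# Prior: theta ~ Beta(alpha, beta); data: y successes in n trials.
alpha, beta = 2.0, 2.0
n, y = 20, 14

# Conjugate update from the derivation above: theta | y ~ Beta(alpha + y, beta + n - y).
post_alpha, post_beta = alpha + y, beta + n - y   # Beta(16, 8)
post_mean = post_alpha / (post_alpha + post_beta)
print(post_mean)  # (alpha + y) / (alpha + beta + n) = 16/24 ≈ 0.667
```

No numerical integration is needed: this is exactly the computational simplification that conjugacy buys.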

Non-informative Prior

• Uniform $p(\theta) \propto 1$
• Example 1: $p(\theta) = \frac{1}{2} \quad \quad \theta \in (0,2)$
• Example 2: $p(\theta) \propto 1 \quad \quad \theta \in (-\infty,\infty)$ Is that correct?
• This prior is not a proper distribution! Its density cannot be integrated to 1. We call such a prior improper.
• An improper prior can sometimes still lead to a proper posterior.
• As long as it leads to a proper posterior, the prior can be useful.
• Example 3: $p(\theta) \propto 1 \quad \quad \theta \in (-\infty,\infty)$ $y\mid \theta \sim N(\theta, 1)$
• Hence we can derive the posterior $p(\theta\mid y) \propto p(y\mid \theta)\cdot 1 = \frac{1}{\sqrt{2\pi}}e^{-\frac{(\theta - y)^2}{2}}$ Therefore, $\theta \mid y \sim N(y,1) \quad \theta \in (-\infty,\infty)$
• It’s a proper posterior!
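A quick numerical sanity check of Example 3 (the observation $y$ is chosen arbitrarily): the flat improper prior still yields a posterior that integrates to 1.

```python
import math

y = 1.5                      # a single observed data point (illustrative)
# Flat improper prior p(theta) ∝ 1, so p(theta | y) ∝ N(theta; y, 1).
dx = 0.001
grid = [y - 10 + i * dx for i in range(20001)]   # theta in [y - 10, y + 10]
density = [math.exp(-(t - y)**2 / 2) / math.sqrt(2 * math.pi) for t in grid]
total = sum(density) * dx
print(total)  # ≈ 1: the posterior is proper even though the prior is not
```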
Jeffreys Prior
• Do we have any other choice for a non-informative prior?
• Yes! That is the Jeffreys prior.
• $p(\theta) \propto [J(\theta)]^{\frac{1}{2}}$ where $J(\theta)$ is the *Fisher information* for $\theta$: $J(\theta) = E\left[\left(\frac{d \log p(y\mid \theta)}{d\theta}\right)^{2} \,\middle|\, \theta\right] = -E\left[\frac{d^{2} \log p(y\mid \theta)}{d\theta^{2}} \,\middle|\, \theta\right]$
• Jeffreys’ Invariance Principle: no matter how we parametrize $\theta$, the resulting prior density is equivalent.
• $p(\theta) \propto [J(\theta)]^{\frac{1}{2}}$. Let $\phi = h(\theta)$ be a one-to-one mapping; we can prove that $p(\phi) \propto [J(\phi)]^{\frac{1}{2}}$.
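As a standard worked example (not on the slides), the Jeffreys prior for a Binomial likelihood follows directly from the definition of the Fisher information:

```latex
% For y | theta ~ Binomial(n, theta):
\log p(y \mid \theta) = y \log\theta + (n - y)\log(1 - \theta) + \text{const}
\frac{d^{2} \log p(y \mid \theta)}{d\theta^{2}} = -\frac{y}{\theta^{2}} - \frac{n - y}{(1 - \theta)^{2}}
% Taking expectations with E(y \mid \theta) = n\theta:
J(\theta) = \frac{n\theta}{\theta^{2}} + \frac{n(1 - \theta)}{(1 - \theta)^{2}} = \frac{n}{\theta(1 - \theta)}
% Hence
p(\theta) \propto [J(\theta)]^{1/2} \propto \theta^{-1/2}(1 - \theta)^{-1/2},
\quad \text{i.e. } \theta \sim \text{Beta}\bigl(\tfrac{1}{2}, \tfrac{1}{2}\bigr).
```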

Bayesian Hierarchical Model

How to set the Hyperparameters?

figure missing

• The table displays the values of $\frac{y_{i}}{n_{i}}$, $i = 1,2,3,\ldots,70$: (number of rats with tumor) / (total number of rats).
• Tumor Incidence of rats in historical control groups and current group of rats, from Tarone (1982).

Model Initialization

• Suppose $\theta$ is the probability that a rat has a tumor.
• Suppose $y \mid \theta \sim \text{Binomial}(n,\theta)$.
• Since the Beta prior is conjugate to the Binomial likelihood, we can derive the posterior of $\theta$ easily.

Toy Example

• How to set $\alpha$ and $\beta$?
• We call the parameters of a prior distribution hyperparameters.

How to set the priors?

Fixed Prior Distribution

Informative Prior

• Suppose we know that $\theta \sim$ Beta distribution with known mean and variance.
• $\theta$ vary due to differences in rats and experimental conditions.
• Find the corresponding $\alpha$, $\beta$.
• $\theta \sim$ Beta$(\alpha,\beta)$ as its prior distribution.
Approximate estimate using Historical Data
• Use Historical Data’s Mean and Variance to estimate $\alpha$ and $\beta$.
• $y_i \mid \theta \sim \text{Binomial}(n_i, \theta)$ $\quad \theta \sim \text{Beta}(\hat{\alpha},\hat{\beta})$
• $\theta\mid y_1, y_2,\ldots,y_{71} \sim \text{Beta}(\hat{\alpha} + \sum_{i = 1}^{71} y_i ,\hat{\beta} + \sum_{i = 1}^{71}n_{i} - \sum_{i = 1}^{71} y_i)$

• Bayes Estimate
• Is that Correct?
• NO!
• It overestimates the precision of the posterior (the data are used twice).
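The "approximate estimate using historical data" step above is typically done by moment matching; a sketch (the rates below are made-up stand-ins for the $y_i/n_i$ column of Tarone's table, not the actual data):

```python
# Match a Beta(alpha, beta) prior to the sample mean/variance of historical rates.
rates = [0.00, 0.05, 0.10, 0.04, 0.16, 0.12, 0.08, 0.20, 0.14, 0.07]  # illustrative

m = sum(rates) / len(rates)                              # sample mean
v = sum((r - m)**2 for r in rates) / (len(rates) - 1)    # sample variance

# Beta moments: m = a/(a+b), v = ab / ((a+b)^2 (a+b+1))  =>  a + b = m(1-m)/v - 1
s = m * (1 - m) / v - 1
alpha_hat, beta_hat = m * s, (1 - m) * s
print(alpha_hat, beta_hat)   # hyperparameters matching the historical moments
```

This is exactly where the "data used twice" problem enters: the same observations inform both the prior and the likelihood.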
Set the Hyperparameters without Data

Do we have to use data to set the hyperparameters?

• In most real cases, we are not sure how to set the priors scientifically.
• However, the hyperparameters of the prior may not be that important.
• If lacking information, use a non-informative prior such as $\text{Uniform}(0,1) = \text{Beta}(1,1)$
• In this case, $y_i \mid \theta_i \sim \text{Binomial}(n_i,\theta_i)$ $\theta_i \sim \text{Uniform}(0,1)$ for $i = 1,2,…,70,71$
Can we regard hyperparameters in prior as random variables?

Set one more level of Hierarchical Model

Regard $\alpha$ and $\beta$ as Random Variables

• If we want to model the uncertainty of $\alpha$ and $\beta$,
• We can assign a prior distribution to $\alpha$ and $\beta$ respectively.
• Just add one more level of Hierarchical Model.
• For example, $y_i \mid \theta_i \sim \text{Binomial}(n_i,\theta_i)$, $\theta_i \mid \alpha, \beta \sim \text{Beta}(\alpha,\beta)$, $\alpha \sim \text{Gamma}(1,2)$, $\beta \sim \text{Gamma}(3,4)$, for $i = 1,2,\ldots,70,71$
• The number of levels in the model increased from 2 to 3.
• This is a hierarchical model.
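The three-level hierarchy can be simulated forward, level by level (the group sizes below are invented for illustration; the Gamma hyperpriors are the ones from the example above, written with rate parameters):

```python
import random

random.seed(0)

# Level 3: hyperpriors. random.gammavariate takes (shape, scale), so rate -> 1/rate.
alpha = random.gammavariate(1, 1 / 2)    # alpha ~ Gamma(shape=1, rate=2)
beta = random.gammavariate(3, 1 / 4)     # beta  ~ Gamma(shape=3, rate=4)

n = [20, 20, 19, 18, 25]                 # rats per group (made up)

# Level 2: group-specific tumor probabilities share the same (alpha, beta).
theta = [random.betavariate(alpha, beta) for _ in n]

# Level 1: observed tumor counts, y_i | theta_i ~ Binomial(n_i, theta_i).
y = [sum(random.random() < t for _ in range(m)) for t, m in zip(theta, n)]
print(y)   # one simulated draw of tumor counts per group
```

Sampling top-down like this is how the joint distribution of a hierarchical model factorizes.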

Latent Dirichlet Allocation

• A classic example of a hierarchical model
• A model for analyzing text data

Model Initialization

From Beta Distribution to Dirichlet

• Beta-Binomial is a conjugate pair.
• $f(x\mid \alpha, \beta) = \frac{1}{B(\alpha,\beta)}x^{\alpha - 1}(1-x)^{\beta - 1}$ for $x \in (0,1)$
• Dirichlet-Multinomial is also a conjugate pair.
• $x_1,x_2,\ldots,x_{n-1} \in (0,1)$, $x_1+x_2+\cdots+x_{n-1} < 1$, $x_n = 1 - (x_1 + \cdots + x_{n-1})$
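For completeness (standard form, not written out on the slide), the Dirichlet density on this simplex generalizes the Beta density:

```latex
f(x_1, \ldots, x_n \mid \alpha_1, \ldots, \alpha_n)
  = \frac{\Gamma\!\left(\sum_{i=1}^{n}\alpha_i\right)}{\prod_{i=1}^{n}\Gamma(\alpha_i)}
    \prod_{i=1}^{n} x_i^{\alpha_i - 1}
% For n = 2 this reduces to the Beta(alpha_1, alpha_2) density above,
% with B(alpha_1, alpha_2) = Gamma(alpha_1)Gamma(alpha_2) / Gamma(alpha_1 + alpha_2).
```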

Notation and Assumption

• A vocabulary indexed by $\{1,2,\ldots,V\}$
• A word is the basic unit of discrete data and is represented by a $V$-vector s.t. $w^v = 1$ and $w^u = 0$ for $u \neq v$
• For example, $w_i = (0,0,1,0,0,\ldots,0)$ if the $i$th word matches the 3rd word in the vocabulary
• A document is a sequence of $N$ words denoted by ${\bf w} = (w_1,w_2,\ldots,w_N)$
• A corpus is a collection of $M$ documents denoted by ${\bf D} = \{{\bf w}_1,{\bf w}_2,\ldots,{\bf w}_M\}$
• There are $k$ topics in total.
• Bag-of-words Assumption (words are exchangeable)

Where is the “Latent” in LDA?

figure missing

• $w \mid \beta, z \sim \text{Multinomial}$ $z \mid \theta \sim \text{Multinomial}(\theta)$ $\theta \sim \text{Dirichlet}(\alpha)$
• So $\alpha$ and $\beta$ are the hyperparameters in this model ($k + kV$ parameters in total).

• where ${\beta_{ij}} = p(w^{j} = 1 \mid z^i = 1)$
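The generative process above can be run forward in a few lines; the values of $k$, $V$, $N$, $\alpha$, and the topic-word matrix $\beta$ below are all made up for illustration:

```python
import random

random.seed(1)

k, V, N = 2, 5, 8                        # topics, vocabulary size, words per document
alpha = [0.5] * k
beta = [[0.40, 0.30, 0.10, 0.10, 0.10],  # beta[i][j] = p(w^j = 1 | z^i = 1)
        [0.05, 0.05, 0.10, 0.30, 0.50]]

# theta ~ Dirichlet(alpha), sampled via normalized Gamma draws
g = [random.gammavariate(a, 1.0) for a in alpha]
theta = [x / sum(g) for x in g]

doc = []
for _ in range(N):
    z = random.choices(range(k), weights=theta)[0]    # topic:  z | theta ~ Multinomial
    w = random.choices(range(V), weights=beta[z])[0]  # word:   w | z, beta ~ Multinomial
    doc.append(w)
print(doc)   # one simulated document, as word indices into the vocabulary
```

Only the words `doc` are observed; the topic proportions `theta` and assignments `z` are the latent variables that give LDA its name.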

Posterior Inference

Intractable Posterior

• We want to find the posterior distribution of $\theta$ and ${\bf z}$: $p(\theta,{\bf z} \mid {\bf w}, \alpha, \beta) = \frac{p(\theta,{\bf z},{\bf w}\mid \alpha, \beta)}{p({\bf w} \mid \alpha, \beta)}$
• However, the posterior distribution is intractable: the denominator $p({\bf w} \mid \alpha, \beta)$ cannot be computed in closed form.
• How to get the posterior?

References

• D. Blei, A. Ng, and M. Jordan (2003) Latent Dirichlet Allocation, Journal of Machine Learning Research 3:993-1022.
• K. Nigam, A. McCallum, S. Thrun, and T. Mitchell (2000) Text classification from labeled and unlabeled documents using EM, Machine Learning 39(2/3):103-134.
• A. Gelman, J.B. Carlin, H.S. Stern, D.B. Dunson, A. Vehtari, and D.B. Rubin (2013) Bayesian Data Analysis, 3rd ed., CRC Press.