Approximation of KL distance between mixtures of Gaussians

So I was thinking about how I should estimate the KL distance between two mixtures of Gaussians.  For discussion purposes, assume that the first mixture has pdf $f(x)$ and the second has pdf $g(x)$.

My first thought was to randomly generate samples from $f(x)$ and to compute the average value of $\log(g(x))$ under the other mixture.  That would give me $E_f[\log(g(X))] = -H(f,g)$  and  $KL(f,g) = H(f,g) - H(f)$ where $H(f,g)$ is the cross entropy and $H(f)$ is the entropy.  (The entropy $H(f) = -E_f[\log(f(X))]$ can be estimated from the same samples.)  The error of this estimate shrinks at order $1/\sqrt{n}$ and it is easy to program.
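Here is a minimal sketch of that first idea. It assumes one-dimensional mixtures stored as a `(weights, means, sigmas)` tuple of NumPy arrays; the function names are my own, and the entropy term is estimated from the same samples by averaging $\log(f(x)) - \log(g(x))$.

```python
import numpy as np

def gmm_logpdf(x, weights, means, sigmas):
    """Log-density of a 1-D Gaussian mixture evaluated at the points x."""
    x = np.asarray(x, dtype=float)[:, None]              # shape (n, 1)
    log_comp = (np.log(weights)
                - 0.5 * np.log(2.0 * np.pi * sigmas**2)
                - 0.5 * ((x - means) / sigmas) ** 2)     # shape (n, k)
    m = log_comp.max(axis=1, keepdims=True)              # log-sum-exp trick
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True)))[:, 0]

def gmm_sample(n, weights, means, sigmas, rng):
    """Draw n samples: pick a component, then sample from that Gaussian."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comp], sigmas[comp])

def kl_monte_carlo(f, g, n=100_000, seed=0):
    """Estimate KL(f || g) as the sample mean of log f(x) - log g(x), x ~ f."""
    rng = np.random.default_rng(seed)
    x = gmm_sample(n, *f, rng)
    return np.mean(gmm_logpdf(x, *f) - gmm_logpdf(x, *g))

# Example: two-component mixtures (made-up parameters for illustration).
f = (np.array([0.3, 0.7]), np.array([0.0, 5.0]), np.array([1.0, 1.0]))
g = (np.array([0.5, 0.5]), np.array([0.5, 4.0]), np.array([1.5, 1.0]))
print(kl_monte_carlo(f, g))
```

Quadrupling `n` should roughly halve the standard error, which is the $1/\sqrt{n}$ behaviour mentioned above.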

My second thought was to borrow an idea from Unscented Kalman filters.  I thought about creating a non-random sample from one distribution at specific points and estimating $E_f[\log(g(X))]$ from those points.
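One possible reading of this idea, sketched for one-dimensional mixtures: place two sigma points per component of $f$ at $\mu_a \pm \sigma_a$, give each half of that component's weight, and average the integrand over those points. The `(weights, means, sigmas)` layout and the function names are assumptions of mine, not something taken from a library.

```python
import numpy as np

def gmm_logpdf(x, weights, means, sigmas):
    """Log-density of a 1-D Gaussian mixture evaluated at the points x."""
    x = np.asarray(x, dtype=float)[:, None]
    log_comp = (np.log(weights)
                - 0.5 * np.log(2.0 * np.pi * sigmas**2)
                - 0.5 * ((x - means) / sigmas) ** 2)
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True)))[:, 0]

def kl_unscented(f, g):
    """Deterministic sigma-point approximation of KL(f || g) in 1-D."""
    weights, means, sigmas = f
    # Two sigma points per component of f: mean - sigma and mean + sigma.
    points = np.concatenate([means - sigmas, means + sigmas])
    point_weights = np.concatenate([weights, weights]) / 2.0
    integrand = gmm_logpdf(points, *f) - gmm_logpdf(points, *g)
    return float(np.sum(point_weights * integrand))
```

When $f$ and $g$ are single Gaussians the integrand is quadratic, so these two points reproduce the exact KL distance; for real mixtures it is only an approximation.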

My third thought was to try Google.  Google suggested “Lower and Upper Bounds for Approximation of the Kullback-Leibler Divergence Between Gaussian Mixture Models” by Durrieu, Thiran, and Kelly (2012) and “Approximating the Kullback Leibler divergence between Gaussian Mixture Models” by Hershey and Olsen (2007).  Here are some notes from their papers:

1.  The KL distance between two Gaussians $f$ and $g$ is

$D_{KL}( f || g ) = \frac{1}{2}\left( \log\left( \frac{\det(\Sigma_g)}{\det(\Sigma_f)} \right) + Tr( \Sigma_g^{-1} \Sigma_f) + ||\mu_f - \mu_g||_g^2 - d \right)$ where $d$ is the dimension of the space, $\Sigma$ is the covariance matrix, $\mu$ is the mean, $Tr$ is the trace, and

$|| x ||^2_g = x^T (\Sigma_g^{-1}) x$.
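The formula translates almost term by term into NumPy. This is a small sketch with a function name of my own choosing; it uses `slogdet` and `solve` rather than explicit determinants and inverses, which is a numerical-stability choice on my part.

```python
import numpy as np

def kl_gaussians(mu_f, cov_f, mu_g, cov_g):
    """Closed-form KL(f || g) between two d-dimensional Gaussians."""
    mu_f = np.asarray(mu_f, dtype=float)
    mu_g = np.asarray(mu_g, dtype=float)
    cov_f = np.atleast_2d(np.asarray(cov_f, dtype=float))
    cov_g = np.atleast_2d(np.asarray(cov_g, dtype=float))
    d = mu_f.shape[0]
    diff = mu_f - mu_g
    # log(det(cov_g) / det(cov_f)) computed via log-determinants.
    _, logdet_f = np.linalg.slogdet(cov_f)
    _, logdet_g = np.linalg.slogdet(cov_g)
    # Tr(cov_g^{-1} cov_f) without forming the inverse.
    trace_term = np.trace(np.linalg.solve(cov_g, cov_f))
    # Squared Mahalanobis norm of the mean difference with respect to cov_g.
    maha = diff @ np.linalg.solve(cov_g, diff)
    return 0.5 * (logdet_g - logdet_f + trace_term + maha - d)

# Example: N(0, 1) versus N(1, 4) in one dimension.
print(kl_gaussians([0.0], [[1.0]], [1.0], [[4.0]]))
```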

2.   Hershey and Olsen review several methods for estimating the divergence:

• Monte-Carlo methods,
• Unscented methods (unscented methods are simple and an unscented approximation of $\int f(x) g(x) dx$ is exact if $f$ is a Gaussian and $g$ is quadratic),
• Gaussian Approximation (this is bad, don’t do it, if you do do it, “I told you so”),
• Product of Gaussian approximations using Jensen’s inequality (this is cute, I like it, but I’m not sure how accurate it is), and
• Match Bound approximation by Do (2003) and Goldberger et al. (2003) (just match each Gaussian with another Gaussian in the other mixture and compute those KL distances).
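To make the matching idea in the last bullet concrete, here is a rough sketch for one-dimensional mixtures. It follows my understanding of the Goldberger-style matching rule: each component $f_a$ is matched to the component $g_b$ that minimizes $D_{KL}(f_a || g_b) - \log \omega_b$, and the matched distances are summed with a weight correction. The function names and the `(weights, means, sigmas)` layout are assumptions, so check the details against the papers before relying on it.

```python
import numpy as np

def pairwise_kl(means_f, sigmas_f, means_g, sigmas_g):
    """kl[a, b] = KL(f_a || g_b) for 1-D Gaussian components."""
    mf, sf = means_f[:, None], sigmas_f[:, None]
    mg, sg = means_g[None, :], sigmas_g[None, :]
    return np.log(sg / sf) + (sf**2 + (mf - mg) ** 2) / (2.0 * sg**2) - 0.5

def kl_matched(f, g):
    """Matching approximation: pair each f_a with its closest g_b."""
    (pi, mu_f, s_f), (omega, mu_g, s_g) = f, g
    kl = pairwise_kl(mu_f, s_f, mu_g, s_g)
    # Matching cost penalizes low-weight components of g.
    match = np.argmin(kl - np.log(omega)[None, :], axis=1)
    a = np.arange(len(pi))
    return float(np.sum(pi * (kl[a, match] + np.log(pi / omega[match]))))
```

The approximation is cheap, since it only needs the pairwise closed-form KL distances, but it can be poor when a component of $f$ overlaps several components of $g$.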

3.  Hershey and Olsen introduce a delightful improvement over Match Bound approximation using variational methods.  They have the same idea as Match Bound, but they significantly reduce the error in Jensen’s inequality by introducing weighted averages.  Since Jensen’s inequality produces a lower bound using the weighted average, they maximize the lower bound under all possible weightings. The maximizer happens to have a very simple form, so the bound is also very simple to compute.  Very nice.  (Numerical results are given at the end of the paper.)  I’ve got to try this one out.
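As I understand the closed form from the paper, with $f = \sum_a \pi_a f_a$ and $g = \sum_b \omega_b g_b$ the variational approximation is

$D_{var}(f || g) = \sum_a \pi_a \log \frac{\sum_{a'} \pi_{a'} e^{-D_{KL}(f_a || f_{a'})}}{\sum_b \omega_b e^{-D_{KL}(f_a || g_b)}}$.

Here is a minimal sketch for one-dimensional mixtures. The function names and the `(weights, means, sigmas)` layout are my own, and the formula should be checked against the paper before trusting the numbers.

```python
import numpy as np

def pairwise_kl(means_f, sigmas_f, means_g, sigmas_g):
    """kl[a, b] = KL(f_a || g_b) for 1-D Gaussian components."""
    mf, sf = means_f[:, None], sigmas_f[:, None]
    mg, sg = means_g[None, :], sigmas_g[None, :]
    return np.log(sg / sf) + (sf**2 + (mf - mg) ** 2) / (2.0 * sg**2) - 0.5

def kl_variational(f, g):
    """Variational approximation of KL(f || g) from pairwise Gaussian KLs."""
    (pi, mu_f, s_f), (omega, mu_g, s_g) = f, g
    kl_ff = pairwise_kl(mu_f, s_f, mu_f, s_f)   # components of f against f
    kl_fg = pairwise_kl(mu_f, s_f, mu_g, s_g)   # components of f against g
    numer = np.exp(-kl_ff) @ pi       # sum over a' of pi_a' exp(-KL(f_a||f_a'))
    denom = np.exp(-kl_fg) @ omega    # sum over b of omega_b exp(-KL(f_a||g_b))
    return float(np.sum(pi * np.log(numer / denom)))

# Example with made-up parameters.
f = (np.array([0.3, 0.7]), np.array([0.0, 5.0]), np.array([1.0, 1.0]))
g = (np.array([0.5, 0.5]), np.array([0.5, 4.0]), np.array([1.5, 1.0]))
print(kl_variational(f, g))
```

Two sanity checks: it returns zero when $f = g$, and it reduces to the exact Gaussian KL distance when both mixtures have a single component.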

4.  Durrieu, Thiran, and Kelly improve on the Hershey and Olsen method, but I’m not sure how much better the new method is.  More research required.
