This tutorial was generated from an IPython notebook that can be downloaded here.

Scaling Gaussian Processes to big datasets

This notebook was made with the following version of george:

import george

One of the biggest technical challenges faced when using Gaussian Processes to model big datasets is that the computational cost naïvely scales as \(\mathcal{O}(N^3)\) where \(N\) is the number of points in you dataset. This cost can be prohibitive even for moderately sized datasets. There are a lot of methods for making these types of problems tractable by exploiting structure or making approximations. George comes equipped with one approximate method with controllable precision that works well with one-dimensional inputs (time series, for example). The method comes from this paper and it can help speed up many—but not all—Gaussian Process models.

To demonstrate this method, in this tutorial, we’ll benchmark the two Gaussian Process “solvers” included with george. For comparison, we’ll also measure the computational cost of the same operations using the popular GPy library and the new scikit-learn interface. Note that GPy is designed a Gaussian Process toolkit and it comes with a huge number state-of-the-art algorithms for the application of Gaussian Processes and it is not meant for efficiently computing marginalized likelihoods so the comparison isn’t totally fair.

As usual, we’ll start by generating a large fake dataset:

import numpy as np
import matplotlib.pyplot as pl

x = np.sort(np.random.uniform(0, 10, 50000))
yerr = 0.1 * np.ones_like(x)
y = np.sin(x)

The standard method for computing the marginalized likelihood of this dataset under a GP model is:

from george import kernels
kernel = np.var(y) * kernels.ExpSquaredKernel(1.0)

gp_basic = george.GP(kernel)
gp_basic.compute(x[:100], yerr[:100])

When using only 100 data points, this computation is very fast but we could also use the approximate solver as follows:

gp_hodlr = george.GP(kernel, solver=george.HODLRSolver, seed=42)
gp_hodlr.compute(x[:100], yerr[:100])

The new scikit-learn interface is quite similar (you’ll need to install a recent version of scikit-learn to execute this cell):

import sklearn
print("sklearn version: {0}".format(sklearn.__version__))
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process import GaussianProcessRegressor

kernel_skl = np.var(y) * RBF(length_scale=1.0)
gp_skl = GaussianProcessRegressor(kernel_skl,
                                  copy_X_train=False)[:100, None], y[:100])
sklearn version: 0.19.1

To implement this same model in GPy, you would do something like (I’ve never been able to get the heteroscedastic regression to work in GPy):

import GPy

print("GPy version: {0}".format(GPy.__version__))
kernel_gpy = GPy.kern.RBF(input_dim=1, variance=np.var(y), lengthscale=1.)
gp_gpy = GPy.models.GPRegression(x[:100, None], y[:100, None], kernel_gpy)
gp_gpy['.*Gaussian_noise'] = yerr[0]**2
GPy version: 1.8.4

Now that we have working implementations of this model using all of the different methods and modules, let’s run a benchmark to look at the computational cost and scaling of each option. The code here doesn’t matter too much but we’ll compute the best-of-“K” runtime for each method where “K” depends on how long I’m willing to wait. This cell takes a few minutes to run.

import time

ns = np.array([50, 100, 200, 500, 1000, 5000, 10000, 50000], dtype=int)
t_basic = np.nan + np.zeros(len(ns))
t_hodlr = np.nan + np.zeros(len(ns))
t_gpy = np.nan + np.zeros(len(ns))
t_skl = np.nan + np.zeros(len(ns))
for i, n in enumerate(ns):
    # Time the HODLR solver.
    best = np.inf
    for _ in range(100000 // n):
        strt = time.time()
        gp_hodlr.compute(x[:n], yerr[:n])
        dt = time.time() - strt
        if dt < best:
            best = dt
    t_hodlr[i] = best

    # Time the basic solver.
    best = np.inf
    for _ in range(10000 // n):
        strt = time.time()
        gp_basic.compute(x[:n], yerr[:n])
        dt = time.time() - strt
        if dt < best:
            best = dt
    t_basic[i] = best

    # Compare to the proposed scikit-learn interface.
    best = np.inf
    if n <= 10000:
        gp_skl = GaussianProcessRegressor(kernel_skl,
                                          copy_X_train=False)[:n, None], y[:n])
        for _ in range(10000 // n):
            strt = time.time()
            dt = time.time() - strt
            if dt < best:
                best = dt
    t_skl[i] = best

    # Compare to GPy.
    best = np.inf
    for _ in range(5000 // n):
        kernel_gpy = GPy.kern.RBF(input_dim=1, variance=np.var(y), lengthscale=1.)
        strt = time.time()
        gp_gpy = GPy.models.GPRegression(x[:n, None], y[:n, None], kernel_gpy)
        gp_gpy['.*Gaussian_noise'] = yerr[0]**2
        dt = time.time() - strt
        if dt < best:
            best = dt
    t_gpy[i] = best

Finally, here are the results of the benchmark plotted on a logarithmic scale:

pl.loglog(ns, t_gpy, "-o", label="GPy")
pl.loglog(ns, t_skl, "-o", label="sklearn")
pl.loglog(ns, t_basic, "-o", label="basic")
pl.loglog(ns, t_hodlr, "-o", label="HODLR")
pl.xlim(30, 80000)
pl.ylim(1.1e-4, 50.)
pl.xlabel("number of datapoints")
pl.ylabel("time [seconds]")
pl.legend(loc=2, fontsize=16);

The sklearn and basic solver perform similarly with george being consistently slightly faster. This is not surprising because they both use LAPACK (via numpy/scipy) to naïvely compute the likelihood. GPy is consistently slower (probably because of Python overheads) even for small datasets but, like I mentioned previously, this wasn’t really what GPy was designed to do and it comes with a lot of other features. For large datasets (\(N \gtrsim 1000\)), the HODLRSolver really shines. In practice, this gain is less significant for multidimensional inputs and some other kernels but for reasonably well-behaved time-series models, it might solve all of your problems!