Performance Comparison of Dimension Reduction Implementations

Different dimension reduction techniques can have quite different computational complexity. Beyond the algorithm itself there is also the question of how exactly it is implemented. These two factors can play a significant role in how long it actually takes to run a given dimension reduction. Furthermore, the nature of the data you are trying to reduce can also matter – mostly this involves the dimensionality of the original data. Here we will take a brief look at the performance characteristics of a number of dimension reduction implementations.

To start let’s get the basic tools we’ll need loaded up – numpy and pandas obviously, but also tools to get and resample the data, and the time module so we can perform some basic benchmarking.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.utils import resample
import time

Next we’ll need the actual dimension reduction implementations. For the purposes of this explanation we’ll mostly stick with scikit-learn, but for the sake of comparison we’ll also include the MulticoreTSNE implementation of t-SNE, which has significantly better performance than the current scikit-learn t-SNE.

from sklearn.manifold import TSNE, LocallyLinearEmbedding, Isomap, MDS, SpectralEmbedding
from sklearn.decomposition import PCA
from MulticoreTSNE import MulticoreTSNE
from umap import UMAP

Next we’ll need our plotting tools, and, of course, some data to work with. For this performance comparison we’ll default to the now standard benchmark of manifold learning: the MNIST digits dataset. We can use scikit-learn’s fetch_openml to grab it for us.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(context='notebook',
        rc={'figure.figsize':(12,10)},
        palette=sns.color_palette('tab10', 10))
# fetch the MNIST digits dataset from OpenML as a plain numpy array
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

Now it is time to start looking at performance. To begin with, let’s look at how performance scales with increasing dataset size.

Performance scaling by dataset size

As the size of a dataset increases the runtime of a given dimension reduction algorithm will increase at varying rates. If you ever want to run your algorithm on larger datasets you will care not just about the comparative runtime on a single small dataset, but how the performance scales out as you move to larger datasets. We can simulate this by subsampling from MNIST digits (via scikit-learn’s convenient resample utility) and looking at the runtime for varying sized subsamples. Since there is some randomness involved here (both in the subsample selection, and in some of the algorithms, which have stochastic aspects) we will want to run a few examples for each dataset size. We can easily package all of this up in a simple function that will return a convenient pandas dataframe of dataset sizes and runtimes given an algorithm.

def data_size_scaling(algorithm, data, sizes=[100, 200, 400, 800, 1600], n_runs=5):
    result = []
    for size in sizes:
        for run in range(n_runs):
            # draw a fresh random subsample for each run
            subsample = resample(data, n_samples=size)
            # time only the fit itself
            start_time = time.time()
            algorithm.fit(subsample)
            elapsed_time = time.time() - start_time
            del subsample
            result.append((size, elapsed_time))
    return pd.DataFrame(result, columns=('dataset size', 'runtime (s)'))

Now we just want to run this for each of the various dimension reduction implementations so we can look at the results. Since we don’t know how long these runs might take we’ll start off with a very small set of samples, scaling up to only 1600 samples.

all_algorithms = [
    PCA(),
    UMAP(),
    MulticoreTSNE(),
    LocallyLinearEmbedding(),
    SpectralEmbedding(),
    Isomap(),
    TSNE(),
    MDS(),
]
performance_data = {}
for algorithm in all_algorithms:
    alg_name = str(algorithm)
    if 'MulticoreTSNE' in alg_name:
        alg_name = 'MulticoreTSNE'
    else:
        alg_name = alg_name.split('(')[0]

    performance_data[alg_name] = data_size_scaling(algorithm, mnist.data, n_runs=3)

Now let’s plot the results so we can see what is going on. We’ll use seaborn’s regression plot to interpolate the effective scaling.

for alg_name, perf_data in performance_data.items():
    sns.regplot(x='dataset size', y='runtime (s)', data=perf_data, order=2, label=alg_name)
plt.legend();
[Plot: runtime against dataset size for each implementation, on subsamples of up to 1,600 points]

We can see straight away that there are some outliers here. The scikit-learn t-SNE is clearly much slower than most of the other algorithms. It does not scale as poorly as MDS, however; for larger dataset sizes MDS is going to quickly become completely unmanageable. At the same time MulticoreTSNE demonstrates that t-SNE can run fairly efficiently. It is hard to tell much about the other implementations other than the fact that PCA is far and away the fastest option. To see more we’ll have to look at runtimes on larger dataset sizes. Both MDS and scikit-learn’s t-SNE are going to take too long to run, so let’s restrict ourselves to the faster implementations and see what happens as we extend out to larger dataset sizes.

fast_algorithms = [
    PCA(),
    UMAP(),
    MulticoreTSNE(),
    LocallyLinearEmbedding(),
    SpectralEmbedding(),
    Isomap(),
]
fast_performance_data = {}
for algorithm in fast_algorithms:
    alg_name = str(algorithm)
    if 'MulticoreTSNE' in alg_name:
        alg_name = 'MulticoreTSNE'
    else:
        alg_name = alg_name.split('(')[0]

    fast_performance_data[alg_name] = data_size_scaling(
        algorithm, mnist.data, sizes=[800, 1600, 3200, 6400, 12800], n_runs=3
    )
for alg_name, perf_data in fast_performance_data.items():
    sns.regplot(x='dataset size', y='runtime (s)', data=perf_data, order=2, label=alg_name)
plt.legend();
[Plot: runtime against dataset size for the faster implementations, on subsamples of up to 12,800 points]

At this point we begin to see some significant differentiation among the different implementations. In the earlier plot MulticoreTSNE looked to be slower than some of the other algorithms, but as we scale out to larger datasets we see that its relative scaling performance is far superior to the scikit-learn implementations of Isomap, spectral embedding, and locally linear embedding.
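To put rough numbers on this, one option is to estimate an empirical scaling exponent for each implementation by fitting a straight line to the runtimes we have already collected on a log-log scale. This is only a sketch – the exponents will vary with hardware, parameter settings, and the range of dataset sizes sampled – but it gives a more quantitative sense of the scaling than eyeballing the curves.

for alg_name, perf_data in fast_performance_data.items():
    # fit log(runtime) = exponent * log(size) + constant
    log_size = np.log(perf_data['dataset size'])
    log_time = np.log(perf_data['runtime (s)'])
    exponent, _ = np.polyfit(log_size, log_time, 1)
    print(f'{alg_name}: runtime grows roughly as n^{exponent:.2f}')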

It is probably worth extending out further – up to the full MNIST digits dataset. To do that in any reasonable amount of time we’ll have to restrict our attention to an even smaller subset of implementations. We will pare things down to just MulticoreTSNE, PCA and UMAP.

very_fast_algorithms = [
    PCA(),
    UMAP(),
    MulticoreTSNE(),
]
vfast_performance_data = {}
for algorithm in very_fast_algorithms:
    alg_name = str(algorithm)
    if 'MulticoreTSNE' in alg_name:
        alg_name = 'MulticoreTSNE'
    else:
        alg_name = alg_name.split('(')[0]

    vfast_performance_data[alg_name] = data_size_scaling(
        algorithm, mnist.data, sizes=[3200, 6400, 12800, 25600, 51200, 70000], n_runs=2
    )
for alg_name, perf_data in vfast_performance_data.items():
    sns.regplot(x='dataset size', y='runtime (s)', data=perf_data, order=2, label=alg_name)
plt.legend();
[Plot: runtime against dataset size for PCA, UMAP and MulticoreTSNE, up to the full 70,000 points]

Here we see UMAP’s advantages over t-SNE really coming to the forefront. While UMAP is clearly slower than PCA, its scaling performance is dramatically better than MulticoreTSNE’s, and for even larger datasets the difference is only going to grow.
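If you want all of the timings gathered so far in one place, a quick sketch is to concatenate the dataframes we have already collected and look at the median runtime per implementation and dataset size. This is just a convenience summary of the data gathered above; the absolute numbers will of course vary with your hardware.

all_runs = []
for results in (performance_data, fast_performance_data, vfast_performance_data):
    for name, df in results.items():
        all_runs.append(df.assign(implementation=name))
combined = pd.concat(all_runs, ignore_index=True)
# median runtime for each implementation at each dataset size it was run on
summary = combined.groupby(['implementation', 'dataset size'])['runtime (s)'].median()
print(summary.unstack('dataset size'))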

This concludes our look at scaling by dataset size. The short summary is that PCA is far and away the fastest option, but you are potentially giving up a lot for that speed. UMAP, while not competitive with PCA, is clearly the next best option in terms of performance among the implementations explored here. Given the quality of results that UMAP can provide we feel it is clearly a good option for dimension reduction.