UMAP Reproducibility ==================== UMAP is a stochastic algorithm -- it makes use of randomness both to speed up approximation steps, and to aid in solving hard optimization problems. This means that different runs of UMAP can produce different results. UMAP is relatively stable -- thus the variance between runs should ideally be relatively small -- but different runs may have variations none the less. To ensure that results can be reproduced exactly UMAP allows the user to set a random seed state. Since version 0.4 UMAP also support multi-threading for faster performance; when performing optimization this exploits the fact that race conditions between the threads are acceptable within certain optimization phases. Unfortunately this means that the randomness in UMAP outputs for the multi-threaded case depends not only on the random seed input, but also on race conditions between threads during optimization, over which no control can be had. This means that multi-threaded UMAP results cannot be explicitly reproduced. In this tutorial we'll look at how UMAP can be used in multi-threaded mode for performance purposes, and alternatively how we can fix random states to ensure exact reproducibility at the cost of some performance. First let's load the relevant libraries and get some data; in this case the MNIST digits dataset. .. code:: python3 import numpy as np import sklearn.datasets import umap import umap.plot .. code:: python3 data, labels = sklearn.datasets.fetch_openml( 'mnist_784', version=1, return_X_y=True ) With data in hand let's run UMAP on it, and note how long it takes to run: .. code:: python3 %%time mapper1 = umap.UMAP().fit(data) .. parsed-literal:: CPU times: user 3min 18s, sys: 3.84 s, total: 3min 22s Wall time: 1min 29s The thing to note here is that the "Wall time" is significantly smaller than the CPU time -- this means that multiple CPU cores were used. For this particular demonstration I am making use of the latest version of PyNNDescent for nearest neighbor search (UMAP will use it if you have it installed) which supports multi-threading as well. The result is a very fast fitting to the data that does an effective job of using several cores. If you are on a large server with many cores available and don't wish to use them *all* (which is the default situation) you can currently control the number of cores used by setting the numba environment variable ``NUMBA_NUM_THREADS``; see the `numba documentation `__ for more details. Now let's plot our result to see what the embedding looks like: .. code:: python3 umap.plot.points(mapper1, labels=labels) .. image:: images/reproducibility_6_1.png Now, let's run UMAP again and compare the results to that of our first run. .. code:: python3 %%time mapper2 = umap.UMAP().fit(data) .. parsed-literal:: CPU times: user 2min 53s, sys: 4.16 s, total: 2min 57s Wall time: 1min 5s You will note that this time we ran *even faster*. This is because during the first run numba was still JIT compiling some of the code in the background. In contrast, this time that work has already been done, so it no longer takes up any of our run-time. We see that we are still making use of mutliple cores well. Now let's plot the results of this second run and compare to the first: .. code:: python3 umap.plot.points(mapper2, labels=labels) .. image:: images/reproducibility_10_1.png Qualitatively this looks very similar, but a little closer inspection will quickly show that the results are actually different between the runs. Note that even in versions of UMAP prior to 0.4 this would have been the case -- since we fixed no specific random seed, and were thus using the current random state of the system which will naturally differ between runs. This is the default behaviour, as is standard with sklearn estimators that are stochastic. Rather than having a default random seed the user is required to explicitly provide one should they want a reproducible result. As noted by Vito Zanotelli ... setting a random seed is like signing a waiver "I am aware that this is a stochastic algorithm and I have done sufficient tests to confirm that my main conclusions are not affected by this randomness". With that in mind, let's see what happens if we set an explicit ``random_state`` value: .. code:: python3 %%time mapper3 = umap.UMAP(random_state=42).fit(data) .. parsed-literal:: CPU times: user 2min 27s, sys: 4.16 s, total: 2min 31s Wall time: 1min 56s The first thing to note that that this run took significantly longer (despite having all the functions JIT compiled by numba already). Then note that the Wall time and CPU times are now much closer to each other -- we are no longer exploiting multiple cores to anywhere near the same degree. This is because by setting a ``random_state`` we are effectively turning off any of the multi-threading that does not support explicit reproducibility. Let's plot the results: .. code:: python3 umap.plot.points(mapper3, labels=labels) .. image:: images/reproducibility_14_1.png We arrive at much the same results as before from a qualitative point of view, but again inspection will show that there are some differences. More importantly this result should now be reproducible. Thus we can run UMAP again, with the same ``random_state`` set ... .. code:: python3 %%time mapper4 = umap.UMAP(random_state=42).fit(data) .. parsed-literal:: CPU times: user 2min 26s, sys: 4.13 s, total: 2min 30s Wall time: 1min 54s Again, this takes longer than the earlier runs with no ``random_state`` set. However when we plot the results of the second run we see that they look not merely qualitatively similar, but instead appear to be almost identical: .. code:: python3 umap.plot.points(mapper4, labels=labels) .. image:: images/reproducibility_18_1.png We can, in fact, check that the results are identical by verifying that each and every coordinate of the resulting embeddings match perfectly: .. code:: python3 np.all(mapper3.embedding_ == mapper4.embedding_) .. parsed-literal:: True So we have, in fact, reproduced the embedding exactly.