UMAP on sparse data
Sometimes datasets get very large, and potentially very, very high dimensional. In many such cases, however, the data itself is sparse – that is, while there are a great many features, any given sample has only a small number of non-zero features observed. In such cases the data can be represented much more efficiently, in terms of memory usage, by a sparse matrix data structure. It can be hard to find dimension reduction techniques that work directly on such sparse data – often one applies a basic linear technique such as TruncatedSVD from sklearn (which does accept sparse matrix input) to get the data into a format amenable to other, more advanced dimension reduction techniques. In the case of UMAP this is not necessary – UMAP can run directly on sparse matrix input. This tutorial will walk through a couple of examples of doing this.
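To make the contrast concrete, here is a minimal sketch of the two routes, using a small random sparse matrix purely as a stand-in for real data (the matrix X and the choice of 50 components are arbitrary, for illustration only):

import scipy.sparse
import sklearn.decomposition
import umap

# A toy sparse dataset: 1000 samples, 5000 features, ~1% of entries non-zero
X = scipy.sparse.random(1000, 5000, density=0.01, format='csr', random_state=42)

# The usual two-step route: a linear reduction that accepts sparse input first ...
X_reduced = sklearn.decomposition.TruncatedSVD(n_components=50).fit_transform(X)

# ... versus UMAP, which can consume the sparse matrix directly
embedding = umap.UMAP().fit_transform(X)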
First we’ll need some libraries loaded. We obviously need numpy, but we’ll also make use of scipy.sparse, which provides sparse matrix data structures. One of our examples will be purely mathematical, and we’ll make use of sympy for that; the other example is text based and we’ll use sklearn for that (specifically sklearn.feature_extraction.text). Beyond that we’ll need umap, and plotting tools.
import numpy as np
import scipy.sparse
import sympy
import sklearn.datasets
import sklearn.feature_extraction.text
import umap
import umap.plot
import matplotlib.pyplot as plt
%matplotlib inline
A mathematical example
Our first example constructs a sparse matrix of data out of pure math. This example is inspired by the work of John Williamson, and if you haven’t looked at that work you are strongly encouraged to do so. The dataset under consideration will be the integers. We will represent each integer by a vector of its divisibility by distinct primes. Thus our feature space is the space of prime numbers (less than or equal to the largest integer we will be considering) – potentially very high dimensional. In practice a given integer is divisible by only a small number of distinct primes, so each sample will be mostly made up of zeros (all the primes that the number is not divisible by), and thus we will have a very sparse dataset.
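To make the representation concrete before we build anything: the row for an integer has ones exactly in the columns of its distinct prime factors. For example:

import sympy

sympy.primefactors(12)   # [2, 3]    since 12 = 2**2 * 3
sympy.primefactors(105)  # [3, 5, 7] since 105 = 3 * 5 * 7

So the row for 12 is non-zero only in the columns for the primes 2 and 3, and the row for 105 only in the columns for 3, 5 and 7; every other entry is zero.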
To get started we’ll need a list of all the primes up to our bound. Fortunately we have sympy at our disposal, and we can quickly get that information with a single call to primerange. We’ll also need a dictionary mapping the different primes to the column number they correspond to in our data structure; effectively we’ll just be enumerating the primes.
primes = list(sympy.primerange(2, 110000))
prime_to_column = {p:i for i, p in enumerate(primes)}
Now we need to construct our data in a format we can put into a sparse matrix easily. At this point a little background on sparse matrix data structures is useful. For this purpose we’ll be using the so-called “LIL” format. LIL is short for “List of Lists”, since that is how the data is internally stored: there is a list of all the rows, and each row is stored as a list giving the column indices of the non-zero entries. To store the data values there is a parallel structure containing the value of the entry corresponding to a given row and column.
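If the LIL layout is unfamiliar, a tiny hand-built example (not part of the dataset we are constructing) may help:

import numpy as np
import scipy.sparse

m = scipy.sparse.lil_matrix((3, 5), dtype=np.float32)
m[0, 1] = 1.0
m[0, 4] = 1.0
m[2, 3] = 1.0

m.rows  # per-row lists of column indices, e.g. [list([1, 4]) list([]) list([3])]
m.data  # the parallel per-row lists of values, e.g. [list([1.0, 1.0]) list([]) list([1.0])]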
To put the data together in this sort of format we need to construct such a list of lists. We can do that by iterating over all the integers up to a fixed bound, and for each integer (i.e. each row in our dataset) generating the list of column indices which will be non-zero. The column indices will simply be the indices corresponding to the primes that divide the number. Since sympy has a function primefactors which returns a list of the unique prime factors of any integer, we simply need to map those through our dictionary to convert the primes into column numbers.
Parallel to that we’ll construct the corresponding structure of values to insert into a matrix. Since we are only concerned with divisibility this will simply be a one in every non-zero entry, so we can just add a list of ones of the appropriate length for each row.
%%time
lil_matrix_rows = []
lil_matrix_data = []
for n in range(100000):
    prime_factors = sympy.primefactors(n)
    lil_matrix_rows.append([prime_to_column[p] for p in prime_factors])
    lil_matrix_data.append([1] * len(prime_factors))
CPU times: user 2.07 s, sys: 26.4 ms, total: 2.1 s
Wall time: 2.1 s
Now we need to get that into a sparse matrix. Fortunately the scipy.sparse package makes this easy, and we’ve already built the data in a fairly useful structure. First we create a sparse matrix of the correct format (LIL) and the right shape (as many rows as we have generated, and as many columns as there are primes). This is essentially just an empty matrix, however. We can fix that by setting the rows attribute to be the rows we have generated, and the data attribute to be the corresponding structure of values (all ones). The result is a sparse matrix data structure which can then be easily manipulated and converted into other sparse matrix formats.
factor_matrix = scipy.sparse.lil_matrix((len(lil_matrix_rows), len(primes)), dtype=np.float32)
factor_matrix.rows = np.array(lil_matrix_rows, dtype=object)
factor_matrix.data = np.array(lil_matrix_data, dtype=object)
factor_matrix
<100000x10453 sparse matrix of type '<class 'numpy.float32'>'
with 266398 stored elements in LInked List format>
As you can see we have a matrix with 100,000 rows and over 10,000 columns. If we were storing that as a dense numpy array it would take a great deal of memory. In practice, however, there are only 266,000 or so entries that are not zero, and that’s all we really need to store, making it much more compact.
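A rough back-of-the-envelope calculation (ignoring the index overhead of the sparse format, so the numbers are only indicative) shows just how large the gap is:

dense_bytes = 100000 * 10453 * 4      # a dense float32 array: roughly 4.2 GB
sparse_bytes = factor_matrix.nnz * 4  # just the ~266,000 stored float32 values: roughly 1 MB
print(f"dense: {dense_bytes / 1e9:.1f} GB, stored values alone: {sparse_bytes / 1e6:.1f} MB")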
The question now is how we can feed that sparse matrix structure into UMAP to have it learn an embedding. The answer is surprisingly straightforward – we just hand it directly to the fit method. Just like other sklearn estimators that can handle sparse input, UMAP will detect the sparse matrix and just do the right thing.
%%time
mapper = umap.UMAP(metric='cosine', random_state=42, low_memory=True).fit(factor_matrix)
CPU times: user 9min 36s, sys: 6.76 s, total: 9min 43s
Wall time: 9min 7s
That was easy! But is it really working? We can easily plot the results:
umap.plot.points(mapper, values=np.arange(100000), theme='viridis')

And this looks very much in line with the results John Williamson got, with the proviso that we only used 100,000 integers instead of 1,000,000 to ensure that most users should be able to run this example (the full million may require a large memory compute node). So it seems like this is working well. The next question is whether we can use the transform functionality to map new data into this space. To test that we’ll need some more data. Fortunately there are more integers. We’ll grab the next 10,000 and put them in a sparse matrix, much as we did for the first 100,000.
%%time
lil_matrix_rows = []
lil_matrix_data = []
for n in range(100000, 110000):
    prime_factors = sympy.primefactors(n)
    lil_matrix_rows.append([prime_to_column[p] for p in prime_factors])
    lil_matrix_data.append([1] * len(prime_factors))
CPU times: user 214 ms, sys: 1.99 ms, total: 216 ms
Wall time: 222 ms
new_data = scipy.sparse.lil_matrix((len(lil_matrix_rows), len(primes)), dtype=np.float32)
new_data.rows = np.array(lil_matrix_rows, dtype=object)
new_data.data = np.array(lil_matrix_data, dtype=object)
new_data
<10000x10453 sparse matrix of type '<class 'numpy.float32'>'
with 27592 stored elements in LInked List format>
To map the new data we generated we can simply hand it to the
transform
method of our trained model. This is a little slow, but it
does work.
new_data_embedding = mapper.transform(new_data)
And we can plot the results. Since we just got the locations of the points this time (rather than a model) we’ll have to resort to matplotlib for plotting.
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(111)
plt.scatter(new_data_embedding[:, 0], new_data_embedding[:, 1], s=0.1, c=np.arange(10000), cmap='viridis')
ax.set(xticks=[], yticks=[], facecolor='black');

The color scale is different in this case, but you can see that the data has been mapped into locations corresponding to the various structures seen in the original embedding. Thus, even with large sparse data we can embed the data, and even add new data to the embedding.
A text analysis example
Let’s look at a more classical machine learning example of working with
sparse high dimensional data – working with text documents. Machine
learning on text is hard, and there is a great deal of literature on the
subject, but for now we’ll just consider a basic approach. Part of the
difficulty of machine learning with text is turning language into
numbers, since numbers are really all most machine learning algorithms
understand (at heart anyway). One of the most straightforward ways to do
this for documents is what is known as the “bag-of-words”
model. In this
model we view a document as simply a multi-set of the words contained in
it – we completely ignore word order. The result can be viewed as a
matrix of data by setting the feature space to be the set of all words
that appear in any document, and a document is represented by a vector
where the value of the ith entry is the number of times the ith
word occurs in that document. This is a very common approach, and is what you will get if you apply sklearn’s CountVectorizer to a text dataset, for example. The catch with this approach is that the feature space is often very large, since we have a feature for each and every word that ever occurs in the entire corpus of documents. The data is sparse, however, since most documents only use a small portion of the total possible vocabulary. Thus the default output format of CountVectorizer (and other similar feature extraction tools in sklearn) is a scipy.sparse format matrix.
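As a quick illustration of what CountVectorizer produces, here is a toy corpus, entirely separate from the dataset used below (get_feature_names_out requires a reasonably recent version of sklearn):

import sklearn.feature_extraction.text

toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
counts = vectorizer.fit_transform(toy_corpus)  # a scipy.sparse CSR matrix
vectorizer.get_feature_names_out()             # the vocabulary, one column per word
counts.toarray()                               # per-document word counts (dense, for inspection only)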
For this example we’ll make use of the classic 20newsgroups dataset, a sampling of newsgroup messages from the old NNTP newsgroup system covering 20 different newsgroups. The sklearn.datasets module can easily fetch the data, and, in fact, we can fetch a pre-vectorized version to save us the trouble of running CountVectorizer ourselves. We’ll grab both the training set and the test set for later use.
news_train = sklearn.datasets.fetch_20newsgroups_vectorized(subset='train')
news_test = sklearn.datasets.fetch_20newsgroups_vectorized(subset='test')
If we look at the actual data we have pulled back, we’ll see that
sklearn has run a CountVectorizer
and produced the data in the sparse
matrix format.
news_train.data
<11314x130107 sparse matrix of type '<class 'numpy.float64'>'
with 1787565 stored elements in Compressed Sparse Row format>
The value of the sparse matrix format is immediately obvious in this case; while there are only 11,000 or so samples, there are over 130,000 features! If the data were stored in a standard numpy array we would be using up over 10GB of memory! And most of that memory would simply be storing the number zero, over and over again. In the sparse matrix format it easily fits in memory on most machines. This sort of dimensionality of data is very common with text workloads.
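The arithmetic behind that claim is straightforward (again ignoring the index overhead of the sparse format):

n_samples, n_features = news_train.data.shape
dense_bytes = n_samples * n_features * 8  # float64: 11314 * 130107 * 8 bytes, roughly 11.8 GB
sparse_bytes = news_train.data.nnz * 8    # ~1.8 million stored values, roughly 14 MB
print(f"dense: {dense_bytes / 1e9:.1f} GB, stored values alone: {sparse_bytes / 1e6:.0f} MB")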
The raw counts are, however, not ideal since common words such as “the” and
“and” will dominate the counts for most documents, while contributing
very little information about the actual content of the document. We can
correct for this by using a TfidfTransformer
from sklearn, which
will convert the data into TF-IDF
format. There are lots
of ways to think about the transformation done by TF-IDF, but I like to
think of it intuitively as follows. The information content of a word
can be thought of as (roughly) proportional to the negative log of the
frequency of the word; the more often a word is used, the less
information it tends to carry, and infrequent words carry more
information. What TF-IDF does can be thought of as akin to re-weighting the columns according to the information content of the word associated with that column. Thus common words like “the” and “and” will get down-weighted, as carrying less information about the document, while infrequent words will be deemed more important and have their associated columns up-weighted. We can apply this transformation to both the train and test sets (using the same transformer trained on the training set).
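For reference, here is a rough hand-rolled sketch of the weighting TfidfTransformer applies with its defaults (use_idf=True, smooth_idf=True); this is for intuition only, and the actual transformer is used below:

import numpy as np

def toy_tfidf(counts, document_frequencies, n_documents):
    # idf(t) = ln((1 + n_documents) / (1 + df(t))) + 1  (sklearn's smoothed formulation)
    idf = np.log((1 + n_documents) / (1 + document_frequencies)) + 1
    weighted = counts * idf           # term counts re-weighted by information content
    return weighted / weighted.sum()  # l1 normalization, matching norm='l1' below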
tfidf = sklearn.feature_extraction.text.TfidfTransformer(norm='l1').fit(news_train.data)
train_data = tfidf.transform(news_train.data)
test_data = tfidf.transform(news_test.data)
The result is still a sparse matrix, since TF-IDF doesn’t change the zero elements at all, nor the number of features.
train_data
<11314x130107 sparse matrix of type '<class 'numpy.float64'>'
with 1787565 stored elements in Compressed Sparse Row format>
Now we need to pass this very high dimensional data to UMAP. Unlike some other non-linear dimension reduction techniques we don’t need to apply PCA first to get the data down to a reasonable dimensionality, nor do we need to use other techniques to reduce the data so that it can be represented as a dense numpy array; we can work directly on the 130,000 dimensional sparse matrix.
%%time
mapper = umap.UMAP(metric='hellinger', random_state=42).fit(train_data)
CPU times: user 8min 40s, sys: 3.07 s, total: 8min 44s
Wall time: 8min 43s
Now we can plot the results, with labels according to the target variable of the data – which newsgroup the posting was drawn from.
umap.plot.points(mapper, labels=news_train.target)

We can see that, even going directly from a 130,000 dimensional space down to only 2 dimensions, UMAP has done a decent job of separating out many of the different newsgroups.
We can now attempt to add the test data to the same space using the
transform
method.
test_embedding = mapper.transform(test_data)
While this is somewhat expensive computationally, it does work, and we can plot the end result:
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(111)
plt.scatter(test_embedding[:, 0], test_embedding[:, 1], s=1, c=news_test.target, cmap='Spectral')
ax.set(xticks=[], yticks=[]);
