# Combining multiple UMAP models

It is possible to combine together multiple UMAP models, assuming that they are operating on the same underlying data. To get an idea of how this works recall that UMAP uses an intermediate fuzzy topological representation (see How UMAP Works). Given different views of the same underlying data this will generate different fuzzy topological representations. We can apply intersections or unions to these representations to get a new composite fuzzy topological representation which we can then embed into low dimensional space in the standard UMAP way. The key is that, to be able to sensibly intersect or union these representations, there must be one-to-one correspondences between the data samples from the two different views.
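As a rough illustration of what intersecting or unioning fuzzy representations means, the sketch below applies two standard fuzzy set operations (the product and the probabilistic sum) to a pair of toy edge-weight matrices. The matrices and the function names `fuzzy_intersection` and `fuzzy_union` are made up for this example, and the combination UMAP performs internally may differ in detail (for instance in how views are weighted and renormalized).

```python
import numpy as np

# Toy fuzzy graphs over the same three samples: entry [i, j] is the
# membership strength (between 0 and 1) of the edge between i and j.
view_a = np.array([[0.0, 0.9, 0.1],
                   [0.9, 0.0, 0.0],
                   [0.1, 0.0, 0.0]])
view_b = np.array([[0.0, 0.5, 0.0],
                   [0.5, 0.0, 0.8],
                   [0.0, 0.8, 0.0]])


def fuzzy_intersection(a, b):
    """Product: the edge has to be present in both views."""
    return a * b


def fuzzy_union(a, b):
    """Probabilistic sum: the edge may be present in either view."""
    return a + b - a * b


both = fuzzy_intersection(view_a, view_b)
either = fuzzy_union(view_a, view_b)
print(both)
print(either)
```

Both operations work entry by entry, which is why row `i` of one view has to describe the same sample as row `i` of the other.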

To get an idea of how this might work it is useful to see it in practice. Let’s load some libraries and get started.

```
import sklearn.datasets
from sklearn.preprocessing import RobustScaler
import seaborn as sns
import pandas as pd
import numpy as np
import umap
import umap.plot
```

## MNIST digits example

To begin with let’s use a relatively familiar dataset – the MNIST digits dataset that we’ve used in other sections of this tutorial. The data is (grayscale) 28x28 pixel images of handwritten digits (0 through 9); in total there are 70,000 such images, and each image is unrolled into a 784 element vector.

```
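# as_frame=False returns NumPy arrays, which the column slicing used later relies on
mnist = sklearn.datasets.fetch_openml("mnist_784", as_frame=False)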
```

To ensure we have an idea of what this dataset looks like through the lens of UMAP we can run UMAP on the full dataset.

```
mapper = umap.UMAP(random_state=42).fit(mnist.data)
```

```
umap.plot.points(mapper, labels=mnist.target, width=500, height=500)
```

To make the problem more interesting let’s carve the dataset in two – not into two sets of 35,000 samples, but instead carve each image in half. That is, we’ll end up with 70,000 samples each of which is the top half of the image of the handwritten digit, and another 70,000 samples each of which is the bottom half of the image of the handwritten digit.

```
top = mnist.data[:, :28 * 14]
bottom = mnist.data[:, 28 * 14:]
```

This is a little artificial, but it provides us with an example dataset where we have two distinct views of the data which we can still well understand. In practice this situation would be more likely to arise when there are two different data collection processes sampling from the same underlying population. In our case we could simply glue the data back together (hstack the numpy arrays for example), but potentially this isn’t feasible as the different data views may have different scales or modalities. So, despite the fact that we could glue things back together in this case, we will proceed as if we can’t – as may be the case for many real world problems.
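To make the split concrete without downloading MNIST, here is a small sketch on random stand-in data. It assumes, as the slicing above does, that each image is unrolled row by row, so the first `28 * 14` columns are the top 14 pixel rows. The array `images` is fabricated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for mnist.data: 5 fake "images", each unrolled row by row to 784 values.
images = rng.integers(0, 256, size=(5, 28 * 28))

top = images[:, :28 * 14]      # first 14 pixel rows of every image
bottom = images[:, 28 * 14:]   # last 14 pixel rows of every image

# The same split expressed on the 28x28 pixel grid.
grid = images.reshape(5, 28, 28)
top_matches_grid = np.array_equal(top, grid[:, :14, :].reshape(5, -1))
bottom_matches_grid = np.array_equal(bottom, grid[:, 14:, :].reshape(5, -1))

# Row i of `top` and row i of `bottom` describe the same sample, so gluing
# them side by side recovers the original data.
glued = np.hstack([top, bottom])
print(top.shape, bottom.shape, np.array_equal(glued, images))
```

The important property is the row alignment: both views keep the samples in the same order, which is the one-to-one correspondence that combining models depends on.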

Let’s first look at what UMAP does individually on each dataset. We’ll start with the top halves of the digits:

```
top_mapper = umap.UMAP(random_state=42).fit(top)
```

```
umap.plot.points(top_mapper, labels=mnist.target, width=500, height=500)
```

While UMAP still manages to mostly separate the different digit classes we can see the results are quite different from UMAP on the full standard MNIST dataset. The twos and threes are blurred together (as we would expect given that we don’t have the bottom half of the image which would let us tell them apart). The twos and threes are also in a large grouping that pulls together all of the eights, sevens and nines (again, what we would expect given only the top half of the digit), while the fives and sixes are somewhat distinct, but clearly are similar to each other. It is only the ones, fours and zeros that are very clearly discernible.

Now let’s see what sorts of results we get with the bottom halves of the digits:

```
bot_mapper = umap.UMAP(random_state=42).fit(bottom)
```

```
umap.plot.points(bot_mapper, labels=mnist.target, width=500, height=500)
```

This is clearly a very different view of the data. Now it is the fours
and nines that blur together (presumably many of the nines are drawn
with straight rather than curved stems), with sevens nearby. The twos
and the threes are very distinct from each other, but the threes and the
fives are combined (as one might expect given that the bottom halves
*should* look similar). Zeros and sixes are distinct, but close to each
other. Ones, eights and twos are the most distinctive digits in this
view.

So, assuming we can’t just glue the raw data together and stick a
reasonable metric on it, what can we do? We can perform intersections or
unions on the fuzzy topological representations. There is also some work
to be done re-asserting UMAP’s theoretical assumptions (local
connectivity, approximately uniform distributions). Fortunately UMAP
makes this relatively easy as long as you have a copy of fitted UMAP
models on hand (which we do in this case). To intersect two models
simply use the `*` operator; to union them use the `+` operator.
Note that this will actually take some time since we need to compute the
2D embedding of the combined model.

```
intersection_mapper = top_mapper * bot_mapper
union_mapper = top_mapper + bot_mapper
```

With that complete we can visualize the results. First let’s look at the intersection:

```
umap.plot.points(intersection_mapper, labels=mnist.target, width=500, height=500)
```

As you can see, while this isn’t as good as a UMAP plot for the full MNIST dataset it has recovered the individual digits quite well. The worst of the remaining overlap is between the threes and fives in the center, which it is still struggling to fully distinguish. But note, also, that we have recovered more of the overall structure than either of the two different individual views, with the layout of different digit classes more closely resembling that of the UMAP run on the full dataset.

Now let’s look at the union.

```
umap.plot.points(union_mapper, labels=mnist.target, width=500, height=500)
```

Given that UMAP is agnostic to rotation or reflection of the final layout, this is essentially the same result as the intersection, since it is almost a reflection of it in the y-axis. This sort of result (intersection and union being similar) is not always the case (in fact it is not that common), but since the underlying structure of the digits dataset is so clear we find that either way of piecing it together from the two half datasets manages to find the same core underlying structure.
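One way to see why a reflected or rotated layout counts as the same result: the distances between points, which are what the layout tries to get right, do not change under either transformation. The sketch below checks this on a handful of random 2D points. The helper `pairwise_distances` and the data are illustrative and are not part of UMAP.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))


rng = np.random.default_rng(0)
embedding = rng.normal(size=(6, 2))          # stand-in for a 2D layout

reflected = embedding * np.array([-1.0, 1.0])  # mirror in the y-axis

theta = np.pi / 3
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = embedding @ rotation.T

original = pairwise_distances(embedding)
same_after_reflection = np.allclose(original, pairwise_distances(reflected))
same_after_rotation = np.allclose(original, pairwise_distances(rotated))
print(same_after_reflection, same_after_rotation)
```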

If you are willing to try something a little more experimental there is
also a third option using the `-` operator, which effectively
intersects with the fuzzy set complement (and is thus not commutative,
just as `-` implies). The goal here is to try to provide a sense of
what the data looks like when we contrast it against a second view.

```
contrast_mapper = top_mapper - bot_mapper
```

```
umap.plot.points(contrast_mapper, labels=mnist.target, width=500, height=500)
```

In this case the result is not overly dissimilar from the embedding of just the top half, so the contrast has perhaps not shown us as much as we might have hoped.
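The idea of intersecting with the fuzzy set complement can be sketched on toy edge weights as follows. This is a simplified reading of the description above, with invented weights and a hypothetical helper `fuzzy_contrast`; it is not a copy of what the `-` operator does internally.

```python
import numpy as np

# Membership strengths of the same three edges in two views (made-up numbers).
top_weights = np.array([0.9, 0.9, 0.2])
bottom_weights = np.array([0.1, 0.8, 0.1])


def fuzzy_contrast(a, b):
    """Keep an edge when it is strong in `a` and weak in `b`: a AND (NOT b)."""
    return a * (1.0 - b)


top_minus_bottom = fuzzy_contrast(top_weights, bottom_weights)
bottom_minus_top = fuzzy_contrast(bottom_weights, top_weights)
print(top_minus_bottom)   # approximately [0.81 0.18 0.18]
print(bottom_minus_top)   # approximately [0.01 0.08 0.08]
```

Swapping the operands gives a different answer, which is the non-commutativity mentioned above. Edges on which the second view is weak pass through nearly unchanged, which may be part of why a contrast can end up resembling the first view.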

## Diamonds dataset example

Now let’s try the same approach on a different dataset where the option of just running UMAP on the full dataset is not available. For this we’ll use the diamonds dataset. In this dataset each row represents a different diamond and provides details on the weight (carat), cut, color, clarity, size (depth, table, x, y, z) and price of the given diamond. How these different factors interplay is somewhat complicated.

```
diamonds = sns.load_dataset('diamonds')
diamonds.head()
```

|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |

For our purposes let’s take “price” as a “target” variable (as is often the case when the dataset is used in machine learning contexts). What we would like to do is provide a UMAP embedding of the data using the remaining features. This is tricky since we can’t exactly use a euclidean metric over the whole thing. What we can do, however, is split the data into two distinct types: the purely numeric features relating to size and weight, and the categorical features of color, cut and clarity. Let’s pull each of those feature sets out so we can work with them independently.

```
numeric = diamonds[["carat", "table", "x", "y", "z"]].copy()
ordinal = diamonds[["cut", "color", "clarity"]].copy()
```

Now we have a new problem: the numeric features are not at all on the
same scales, so any sort of standard distance metric across them will be
dominated by those features with the largest ranges. We can correct for
that by performing feature scaling. To do that we’ll make use of
sklearn’s `RobustScaler`, which uses robust statistics (such as the
median and interquartile range) to center and rescale the data feature
by feature. If we look at the results on the first five rows we see that
the different features are all now reasonably comparable, and it is
reasonable to apply something like euclidean distance across them.

```
scaled_numeric = RobustScaler().fit_transform(numeric)
scaled_numeric[:5]
```

```
array([[-0.734375  , -0.66666667, -0.95628415, -0.95054945, -0.97345133],
       [-0.765625  ,  1.33333333, -0.98907104, -1.02747253, -1.07964602],
       [-0.734375  ,  2.66666667, -0.90163934, -0.9010989 , -1.07964602],
       [-0.640625  ,  0.33333333, -0.81967213, -0.81318681, -0.79646018],
       [-0.609375  ,  0.33333333, -0.7431694 , -0.74725275, -0.69026549]])
```
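For intuition about what this kind of scaling does, here is a hand-rolled sketch that subtracts each feature's median and divides by its interquartile range. This mirrors the idea behind `RobustScaler` with its default settings, but the helper `robust_scale` and the toy data are invented for this example, and the library's exact behaviour in edge cases may differ.

```python
import numpy as np


def robust_scale(X):
    """Center each column on its median and divide by its interquartile range."""
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    return (X - median) / (q3 - q1)


# Two features on very different scales; the first has an extreme outlier.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0],
                [100.0, 50.0]])

scaled = robust_scale(toy)
print(scaled)
```

The outlier stays far from the rest, but because the median and quartiles ignore it, the remaining rows of both columns land on the same -1 to 0.5 range.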

What is the best way to handle the categorical features? If they are
purely categorical it would make sense to one-hot encode the categories
and use “dice” distance between them. A downside of that is that, with
so few categories, it is a very coarse metric which will fail to provide
much differentiation. For the diamonds dataset, however, the categories
come with a strict order: Ideal cut is better than Premium cut, which is
better than Very Good cut and so on. Color grades work similarly, and
there is a distinct grading scheme for clarity as well. We can use an
ordinal encoding on these categories. Now, while the *ranges* of values
may vary, the differences between them are all comparable – a difference
of 1 for each grade level. That means we don’t need to rescale this data
after the ordinal coding.

```
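# Map each grade to an integer rank; astype(int) keeps the columns plain integers even if they load as categoricals.
ordinal["cut"] = ordinal.cut.map({"Fair":0, "Good":1, "Very Good":2, "Premium":3, "Ideal":4}).astype(int)
ordinal["color"] = ordinal.color.map({"D":0, "E":1, "F":2, "G":3, "H":4, "I":5, "J":6}).astype(int)
ordinal["clarity"] = ordinal.clarity.map({"I1":0, "SI2":1, "SI1":2, "VS2":3, "VS1":4, "VVS2":5, "VVS1":6, "IF":7}).astype(int)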
```

```
ordinal
```

|   | cut | color | clarity |
|---|---|---|---|
| 0 | 4 | 1 | 1 |
| 1 | 3 | 1 | 2 |
| 2 | 1 | 1 | 4 |
| 3 | 3 | 5 | 3 |
| 4 | 1 | 6 | 1 |
| ... | ... | ... | ... |
| 53935 | 4 | 0 | 2 |
| 53936 | 1 | 0 | 2 |
| 53937 | 2 | 0 | 2 |
| 53938 | 3 | 4 | 1 |
| 53939 | 4 | 0 | 1 |

53940 rows × 3 columns
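To illustrate the earlier remark that one-hot encoding with dice distance would be a coarse metric, the sketch below one-hot encodes every possible (cut, color, clarity) combination and measures its dice distance to the first diamond in the table. The helpers `one_hot` and `dice` are written out by hand for this example using the usual definition of dice dissimilarity on boolean vectors.

```python
import numpy as np
from itertools import product


def one_hot(codes, sizes):
    """Concatenate one boolean indicator block per categorical feature."""
    blocks = []
    for code, size in zip(codes, sizes):
        block = np.zeros(size, dtype=bool)
        block[code] = True
        blocks.append(block)
    return np.concatenate(blocks)


def dice(u, v):
    """Dice dissimilarity: mismatches / (2 * shared trues + mismatches)."""
    shared = np.sum(u & v)
    mismatched = np.sum(u ^ v)
    return mismatched / (2 * shared + mismatched)


sizes = (5, 7, 8)                      # number of cut, color and clarity grades
reference = one_hot((4, 1, 1), sizes)  # row 0 above: Ideal, E, SI2

distances = sorted({
    round(float(dice(reference, one_hot(codes, sizes))), 4)
    for codes in product(range(5), range(7), range(8))
})
print(distances)  # [0.0, 0.3333, 0.6667, 1.0]
```

Only four distinct values appear, one for each possible number of matching features, whereas the ordinal coding can also express how many grades apart two diamonds are.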

As noted we can use euclidean as a sensible distance on the rescaled
numeric data. On the other hand since the different ordinal categories
are entirely independent of each other, and we have a strict ordinal
coding, the so-called “manhattan” metric makes more sense here – it is
simply the sum of the absolute differences in each category. As before
we can now train UMAP models on each dataset – this time, however, since
the datasets are different we need different metrics and even different
values of `n_neighbors`.

```
numeric_mapper = umap.UMAP(n_neighbors=15, random_state=42).fit(scaled_numeric)
ordinal_mapper = umap.UMAP(metric="manhattan", n_neighbors=150, random_state=42).fit(ordinal.values)
```
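As a small worked example of the manhattan metric on the ordinal codes, the sketch below compares the first three rows of the encoded table. The helper functions are written out by hand for illustration; UMAP computes these distances itself when given `metric="manhattan"`.

```python
import numpy as np

# First three rows of the encoded (cut, color, clarity) table above.
codes = np.array([[4, 1, 1],
                  [3, 1, 2],
                  [1, 1, 4]])


def manhattan(u, v):
    """Total number of grade steps separating two diamonds."""
    return int(np.abs(u - v).sum())


def euclidean(u, v):
    return float(np.sqrt(((u - v) ** 2).sum()))


print(manhattan(codes[0], codes[1]), manhattan(codes[0], codes[2]))  # 2 6
print(round(euclidean(codes[0], codes[1]), 3),
      round(euclidean(codes[0], codes[2]), 3))                       # 1.414 4.243
```

Under manhattan distance a diamond that is one grade off in two categories is exactly as far away as one that is two grades off in a single category, which matches the view that the categories are independent.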

We can look at the results of each of these independent views of the dataset reduced to 2D using UMAP. Let’s first look at the numeric data on size and weight of the diamonds. We can colour by the price to get some idea of how the dataset fits together.

```
umap.plot.points(numeric_mapper, values=diamonds["price"], cmap="viridis")
```

We see that while the data generally correlates somewhat with the price of the diamonds there are distinctly different threads in the data, presumably corresponding to different styles of cut, and how that results in different sizing of diamonds in the various dimensions, depending on the weight.

In contrast we can look at the ordinal data. In this case we’ll colour it by the different categories as well as by price.

```
fig, ax = umap.plot.plt.subplots(2, 2, figsize=(12,12))
umap.plot.points(ordinal_mapper, labels=diamonds["color"], ax=ax[0,0])
umap.plot.points(ordinal_mapper, labels=diamonds["clarity"], ax=ax[0,1])
umap.plot.points(ordinal_mapper, labels=diamonds["cut"], ax=ax[1,0])
umap.plot.points(ordinal_mapper, values=diamonds["price"], cmap="viridis", ax=ax[1,1])
```

As you can see this is a markedly different result! The ordinal data has
a relatively coarse metric, since the different categories can only take
on a small range of discrete values. This means that, with respect to
the trio of color, cut, and clarity, diamonds are largely either almost
identical, or quite distinct. The result is very tight groupings which
have very high density. You can see a gradient of color from left to
right in the plot; colouring by cut or clarity shows different
stratifications. The combination of these very distinct stratifications
results in this highly clustered embedding. It is exactly for this
reason that we need such a high `n_neighbors` value: the very local
structure of the data is merely clusters of identical categories; we
need to see wider to learn more structure.
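A back-of-the-envelope calculation suggests how many exact duplicates to expect in the ordinal view. It uses only the row count and the number of grades per category shown earlier; how the rows are actually distributed over the combinations is not checked here.

```python
n_rows = 53940                      # rows in the diamonds table
n_cut, n_color, n_clarity = 5, 7, 8  # grades in each ordinal mapping

# Upper bound on the number of distinct (cut, color, clarity) triples.
max_combinations = n_cut * n_color * n_clarity

# Even if every combination occurred, each would hold this many rows on average.
min_avg_group_size = n_rows / max_combinations

print(max_combinations)               # 280
print(round(min_avg_group_size, 1))   # 192.6
```

With groups of identical rows this large on average, a small neighbourhood such as `n_neighbors=15` would often consist entirely of copies of the same point, so a much wider neighbourhood is needed before the graph can link different combinations.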

Given these radically different views of the data, what do we get if we try to integrate them together? As before we can use the intersection and union operators to simply combine the models. As noted before this is a somewhat time-consuming operation as a new 2D representation for the combined models needs to be optimized.

```
intersection_mapper = numeric_mapper * ordinal_mapper
union_mapper = numeric_mapper + ordinal_mapper
```

Let’s start by looking at the intersection; here we are only really
decreasing connectivity since edges are assigned the probability of
existing in *both* data views (before re-asserting local connectivity
and uniform distribution assumptions).

```
umap.plot.points(intersection_mapper, values=diamonds["price"], cmap="viridis")
```

What we get most closely represents the numeric data view. Why is this? Because the categorical data view has points either connected with certainty (because they are, or are nearly, identical) or very loosely. The points connected with near certainty are very dense clusters – almost points in the plot – and mostly what we are doing with the intersection is breaking up those clusters with the more fine-grained and variable connectivity provided by the numerical data. At the same time we have shifted the result significantly from the numerical data view on its own; the categorical information has made each cluster more uniform (rather than being a gradient) in its price.

Given this result, what would you expect of the union?

```
umap.plot.points(union_mapper, labels=diamonds["color"])
```

What we get in practice looks a lot more like the categorical view of
the data. This time we are only *increasing* the connectivity (prior to
re-asserting local connectivity and uniform distribution assumptions);
thus we retain most of the structure of the high-connectivity
categorical view. Note, however, that we have created more connected and
coherent clusters in the center of the plot, showing a range of diamond
colors, and the introduction of the numerical size and weight
information has induced a rearrangement of the individual clusters
around the fringes.
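A toy calculation is consistent with both observations. It idealizes the categorical view as having edge weights of exactly 1 (identical diamonds) or 0 (unrelated diamonds), combines it with graded numeric weights using the product and the probabilistic sum, and ignores the renormalization steps that follow in practice. All numbers are invented.

```python
import numpy as np

# Weights of four edges in each view. The first two edges join diamonds
# with identical categories; the last two join unrelated diamonds.
numeric_weights = np.array([0.9, 0.5, 0.2, 0.7])
categorical_weights = np.array([1.0, 1.0, 0.0, 0.0])

intersection = numeric_weights * categorical_weights
union = (numeric_weights + categorical_weights
         - numeric_weights * categorical_weights)

print(intersection)  # graded numeric values inside clusters, nothing between them
print(union)         # every within-cluster edge saturates at 1
```

In the intersection the within-cluster edges take on the graded numeric weights, so the tight categorical clusters get broken up by numeric structure. In the union those same edges saturate at 1, so the categorical clusters survive, and the numeric view only contributes the weaker edges between them.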

We can go a step further and experiment with the contrast composition method.

```
contrast_mapper = numeric_mapper - ordinal_mapper
```

```
umap.plot.points(contrast_mapper, values=diamonds["price"], cmap="viridis")
```

Here we see that we’ve retained a lot of the structure of the numeric data view, but have refined and broken it down further into clear clusters with price gradients running through each of them.

To further demonstrate the power of this approach we can intersect an
embedding of the numeric data view that uses a higher `n_neighbors`
value with our existing union of numeric and categorical
data – providing a model that is a composition of three simpler models.

```
# Fit on the scaled features, matching the preprocessing used for numeric_mapper.
intersect_union_mapper = umap.UMAP(random_state=42, n_neighbors=60).fit(scaled_numeric) * union_mapper
```

```
umap.plot.points(intersect_union_mapper, values=diamonds["price"], cmap="viridis")
```

Here the greater global structure from the larger `n_neighbors` value
glues together longer strands and we get an interesting result out. In
this case it is not necessarily particularly informative, but it is
included as a demonstration that even composed models can be composed
with each other, stacking together potentially many different views.
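The reason composition can be stacked like this is that combining two models yields another object of the same kind. The toy class below sketches that design using operator overloading on plain weight matrices. `ToyFuzzyGraph` is a hypothetical stand-in written for this example and is not UMAP's implementation; it also skips the re-embedding step that makes the real operations slow.

```python
import numpy as np


class ToyFuzzyGraph:
    """Hypothetical stand-in for a fitted model: just an edge-weight matrix."""

    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)

    def _check(self, other):
        # Views can only be combined if they cover the same samples.
        if self.weights.shape != other.weights.shape:
            raise ValueError("views must describe the same samples in the same order")

    def __mul__(self, other):   # intersection
        self._check(other)
        return ToyFuzzyGraph(self.weights * other.weights)

    def __add__(self, other):   # union
        self._check(other)
        return ToyFuzzyGraph(self.weights + other.weights
                             - self.weights * other.weights)


numeric_view = ToyFuzzyGraph([[0.0, 0.6], [0.6, 0.0]])
ordinal_view = ToyFuzzyGraph([[0.0, 1.0], [1.0, 0.0]])
wide_numeric_view = ToyFuzzyGraph([[0.0, 0.5], [0.5, 0.0]])

# The union is itself a ToyFuzzyGraph, so it can be intersected again.
composed = wide_numeric_view * (numeric_view + ordinal_view)
print(composed.weights)
```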