UMAP API Guide
UMAP has only a single class, UMAP.
UMAP

class umap.umap_.UMAP(n_neighbors=15, n_components=2, metric='euclidean', metric_kwds=None, output_metric='euclidean', output_metric_kwds=None, n_epochs=None, learning_rate=1.0, init='spectral', min_dist=0.1, spread=1.0, low_memory=True, n_jobs=-1, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, a=None, b=None, random_state=None, angular_rp_forest=False, target_n_neighbors=-1, target_metric='categorical', target_metric_kwds=None, target_weight=0.5, transform_seed=42, transform_mode='embedding', force_approximation_algorithm=False, verbose=False, tqdm_kwds=None, unique=False, densmap=False, dens_lambda=2.0, dens_frac=0.3, dens_var_shift=0.1, output_dens=False, disconnection_distance=None)
Uniform Manifold Approximation and Projection
Finds a low dimensional embedding of the data that approximates an underlying manifold.
 n_neighbors: float (optional, default 15)
 The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.
 n_components: int (optional, default 2)
 The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.
 metric: string or function (optional, default ‘euclidean’)
The metric to use to compute distances in high dimensional space. If a string is passed it must match a valid predefined metric. If a general metric is required a function that takes two 1d arrays and returns a float can be provided. For performance purposes it is required that this be a numba jit’d function. Valid string metrics include:
 euclidean
 manhattan
 chebyshev
 minkowski
 canberra
 braycurtis
 mahalanobis
 wminkowski
 seuclidean
 cosine
 correlation
 haversine
 hamming
 jaccard
 dice
 russellrao
 kulsinski
 ll_dirichlet
 hellinger
 rogerstanimoto
 sokalmichener
 sokalsneath
 yule
Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.
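As a rough illustration of how metric_kwds reaches a metric, the sketch below defines a plain-Python Minkowski distance with the signature described above (two 1d arrays in, a float out) and forwards a keyword dictionary to it. The name minkowski_distance and the sample vectors are made up for this example. A callable actually passed to UMAP would also need to be numba jit'd, which is omitted here.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance between two equal-length 1d sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Keyword arguments for the metric, in the shape metric_kwds expects.
metric_kwds = {"p": 1.0}

x = [0.0, 0.0]
y = [3.0, 4.0]

euclidean_like = minkowski_distance(x, y)                 # default p=2
manhattan_like = minkowski_distance(x, y, **metric_kwds)  # p=1 taken from the dict
print(euclidean_like, manhattan_like)
```

Passing metric='minkowski' together with the same dictionary as metric_kwds is the string-metric counterpart of this pattern.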
 n_epochs: int (optional, default None)
 The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).
 learning_rate: float (optional, default 1.0)
 The initial learning rate for the embedding optimization.
 init: string (optional, default ‘spectral’)
 How to initialize the low dimensional embedding. Options are:
 ‘spectral’: use a spectral embedding of the fuzzy 1-skeleton
 ‘random’: assign initial embedding positions at random.
 A numpy array of initial embedding positions.
 min_dist: float (optional, default 0.1)
 The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
 spread: float (optional, default 1.0)
 The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.
 low_memory: bool (optional, default True)
 For some datasets the nearest neighbor computation can consume a lot of memory. If you find that UMAP is failing due to memory constraints consider setting this option to True. This approach is more computationally expensive, but avoids excessive memory use.
 set_op_mix_ratio: float (optional, default 1.0)
 Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
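The interpolation can be sketched for a single edge as follows. This is a minimal illustration, assuming the fuzzy union is the probabilistic sum a + b - ab that pairs with the product t-norm. The function name mix_memberships and the sample weights are hypothetical and not part of the UMAP API.

```python
def mix_memberships(w_ij, w_ji, set_op_mix_ratio=1.0):
    """Combine the two directed membership strengths of one edge."""
    product = w_ij * w_ji          # fuzzy intersection (product t-norm)
    union = w_ij + w_ji - product  # fuzzy union (assumed probabilistic sum)
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * product


pure_union = mix_memberships(0.5, 0.4, set_op_mix_ratio=1.0)
pure_intersection = mix_memberships(0.5, 0.4, set_op_mix_ratio=0.0)
halfway = mix_memberships(0.5, 0.4, set_op_mix_ratio=0.5)
print(pure_union, pure_intersection, halfway)
```

With a ratio of 1.0 an edge survives if either point considers the other a neighbor; with 0.0 both must agree, which tends to prune one-sided edges.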
 local_connectivity: int (optional, default 1)
 The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.
 repulsion_strength: float (optional, default 1.0)
 Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.
 negative_sample_rate: int (optional, default 5)
 The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
 transform_queue_size: float (optional, default 4.0)
 For transform operations (embedding new points using a trained model) this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.
 a: float (optional, default None)
 More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
 b: float (optional, default None)
 More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
 random_state: int, RandomState instance or None, optional (default: None)
 If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
 metric_kwds: dict (optional, default None)
 Arguments to pass on to the metric, such as the p value for Minkowski distance. If None then no arguments are passed on.
 angular_rp_forest: bool (optional, default False)
 Whether to use an angular random projection forest to initialise the approximate nearest neighbor search. This can be faster, but is mostly only useful for metrics that use an angular style distance such as cosine, correlation etc. In the case of those metrics angular forests will be chosen automatically.
 target_n_neighbors: int (optional, default -1)
 The number of nearest neighbors to use to construct the target simplicial set. If set to -1 use the n_neighbors value.
 target_metric: string or callable (optional, default ‘categorical’)
 The metric used to measure distance for a target array when using supervised dimension reduction. By default this is ‘categorical’ which will measure distance in terms of whether categories match or are different. Furthermore, if semi-supervised is required target values of -1 will be treated as unlabelled under the ‘categorical’ metric. If the target array takes continuous values (e.g. for a regression problem) then metric of ‘l1’ or ‘l2’ is probably more appropriate.
 target_metric_kwds: dict (optional, default None)
 Keyword arguments to pass to the target metric when performing supervised dimension reduction. If None then no arguments are passed on.
 target_weight: float (optional, default 0.5)
 Weighting factor between data topology and target topology. A value of 0.0 weights entirely on data, a value of 1.0 weights entirely on target. The default of 0.5 balances the weighting equally between data and target.
 transform_seed: int (optional, default 42)
 Random seed used for the stochastic aspects of the transform operation. This ensures consistency in transform operations.
 verbose: bool (optional, default False)
 Controls verbosity of logging.
 tqdm_kwds: dict (optional, default None)
 Keyword arguments to be used by the tqdm progress bar.
 unique: bool (optional, default False)
 Controls if the rows of your data should be uniqued before being embedded. If you have more duplicates than you have n_neighbors you can have the identical data points lying in different regions of your space. It also violates the definition of a metric. To map from internal structures back to your data use the variable _unique_inverse_.
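The role of an inverse index such as _unique_inverse_ can be sketched in plain Python. The helper unique_rows_with_inverse and the toy coordinates below are hypothetical; the ordering of unique rows here is first-occurrence, which need not match what the library stores internally.

```python
def unique_rows_with_inverse(rows):
    """Return the distinct rows and, for each input row, its index among them."""
    index_of = {}
    unique_rows = []
    inverse = []
    for row in rows:
        key = tuple(row)
        if key not in index_of:
            index_of[key] = len(unique_rows)
            unique_rows.append(list(row))
        inverse.append(index_of[key])
    return unique_rows, inverse


data = [[1.0, 2.0], [3.0, 4.0], [1.0, 2.0], [1.0, 2.0]]
unique_rows, inverse = unique_rows_with_inverse(data)

# Pretend these are embedding coordinates computed for the unique rows only.
unique_embedding = [[0.0, 0.0], [5.0, 5.0]]
# Indexing with the inverse gives one embedding row per original data row.
full_embedding = [unique_embedding[i] for i in inverse]
print(inverse, full_embedding)
```

Because duplicates share one embedded position, identical data points can no longer land in different regions of the space.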
 densmap: bool (optional, default False)
 Specifies whether the density-augmented objective of densMAP should be used for optimization. Turning on this option generates an embedding where the local densities are encouraged to be correlated with those in the original space. Parameters below with the prefix ‘dens’ further control the behavior of this extension.
 dens_lambda: float (optional, default 2.0)
 Controls the regularization weight of the density correlation term in densMAP. Higher values prioritize density preservation over the UMAP objective, and vice versa for values closer to zero. Setting this parameter to zero is equivalent to running the original UMAP algorithm.
 dens_frac: float (optional, default 0.3)
 Controls the fraction of epochs (between 0 and 1) where the density-augmented objective is used in densMAP. The first (1 - dens_frac) fraction of epochs optimize the original UMAP objective before introducing the density correlation term.
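A small sketch of the resulting schedule, assuming a fixed epoch count. The helper split_epochs is hypothetical, and the rounding at the boundary epoch is an assumption rather than the library's exact rule.

```python
def split_epochs(n_epochs, dens_frac=0.3):
    """Return (plain UMAP epochs, density-augmented epochs)."""
    if not 0.0 <= dens_frac <= 1.0:
        raise ValueError("dens_frac must be between 0 and 1")
    density_epochs = round(dens_frac * n_epochs)
    return n_epochs - density_epochs, density_epochs


plain_epochs, density_epochs = split_epochs(200, dens_frac=0.3)
print(plain_epochs, density_epochs)
```

With 200 epochs and the default dens_frac, the density correlation term would only influence the final 60 epochs.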
 dens_var_shift: float (optional, default 0.1)
 A small constant added to the variance of local radii in the embedding when calculating the density correlation objective to prevent numerical instability from dividing by a small number.
 output_dens: bool (optional, default False)
 Determines whether the local radii of the final embedding (an inverse measure of local density) are computed and returned in addition to the embedding. If set to True, local radii of the original data are also included in the output for comparison; the output is a tuple (embedding, original local radii, embedding local radii). This option can also be used when densmap=False to calculate the densities for UMAP embeddings.
 disconnection_distance: float (optional, default np.inf or maximal value for bounded distances)
 Disconnect any vertices of distance greater than or equal to disconnection_distance when approximating the manifold via our knn graph. This is particularly useful in the case that you have a bounded metric. The UMAP assumption that we have a connected manifold can be problematic when you have points that are maximally different from all the rest of your data. The connected manifold assumption will make such points have perfect similarity to a random set of other points. Too many such points will artificially connect your space.
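The effect on the knn graph can be sketched as a filter over neighbor lists. This is only an illustration of the rule stated above (edges at or beyond the threshold are dropped); the helper disconnect_far_neighbors and the toy distances are hypothetical.

```python
def disconnect_far_neighbors(knn_indices, knn_dists, disconnection_distance):
    """Keep only neighbor edges strictly closer than disconnection_distance."""
    edges = []
    for i, (neighbors, dists) in enumerate(zip(knn_indices, knn_dists)):
        for j, d in zip(neighbors, dists):
            if d < disconnection_distance:
                edges.append((i, j, d))
    return edges


# Jaccard distance is bounded by 1.0, so 1.0 means "shares nothing".
knn_indices = [[1, 2], [0, 2], [0, 1]]
knn_dists = [[0.2, 1.0], [0.2, 1.0], [1.0, 1.0]]

edges = disconnect_far_neighbors(knn_indices, knn_dists, disconnection_distance=1.0)
print(edges)
```

Point 2 is maximally distant from everything, so it ends up with no edges instead of being attached to arbitrary neighbors.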

fit(X, y=None)
Fit X into an embedded space.
Optionally use y for supervised dimension reduction.
 X : array, shape (n_samples, n_features) or (n_samples, n_samples)
 If the metric is ‘precomputed’ X must be a square distance matrix. Otherwise it contains a sample per row. If the method is ‘exact’, X may be a sparse matrix of type ‘csr’, ‘csc’ or ‘coo’.
 y : array, shape (n_samples)
 A target array for supervised dimension reduction. How this is handled is determined by parameters UMAP was instantiated with. The relevant attributes are target_metric and target_metric_kwds.

fit_transform(X, y=None)
Fit X into an embedded space and return that transformed output.
 X : array, shape (n_samples, n_features) or (n_samples, n_samples)
 If the metric is ‘precomputed’ X must be a square distance matrix. Otherwise it contains a sample per row.
 y : array, shape (n_samples)
 A target array for supervised dimension reduction. How this is handled is determined by parameters UMAP was instantiated with. The relevant attributes are target_metric and target_metric_kwds.
 X_new : array, shape (n_samples, n_components)
 Embedding of the training data in low-dimensional space.
or a tuple (X_new, r_orig, r_emb) if the output_dens flag is set, which additionally includes:
 r_orig: array, shape (n_samples)
 Local radii of data points in the original data space (log-transformed).
 r_emb: array, shape (n_samples)
 Local radii of data points in the embedding (log-transformed).

inverse_transform(X)
Transform X in the existing embedded space back into the input data space and return that transformed output.
 X : array, shape (n_samples, n_components)
 New points to be inverse transformed.
 X_new : array, shape (n_samples, n_features)
 Generated data points in data space.
A number of internal functions can also be accessed separately for more fine-tuned work.
Useful Functions

transform(X)
Transform X into the existing embedded space and return that transformed output.
 X : array, shape (n_samples, n_features)
 New data to be transformed.
 X_new : array, shape (n_samples, n_components)
 Embedding of the new data in low-dimensional space.

umap.umap_.compute_membership_strengths
Construct the membership strength data for the 1-skeleton of each local fuzzy simplicial set; this is formed as a sparse matrix where each row is a local fuzzy simplicial set, with a membership strength for the 1-simplex to each other data point.
 knn_indices: array of shape (n_samples, n_neighbors)
 The indices of the n_neighbors closest points in the dataset.
 knn_dists: array of shape (n_samples, n_neighbors)
 The distances to the n_neighbors closest points in the dataset.
 sigmas: array of shape (n_samples)
 The normalization factor derived from the metric tensor approximation.
 rhos: array of shape (n_samples)
 The local connectivity adjustment.
 return_dists: bool (optional, default False)
 Whether to return the pairwise distance associated with each edge.
 bipartite: bool (optional, default False)
 Does the nearest neighbour set represent a bipartite graph? That is, are the nearest neighbour indices from the same point set as the row indices?
 rows: array of shape (n_samples * n_neighbors)
 Row data for the resulting sparse matrix (coo format)
 cols: array of shape (n_samples * n_neighbors)
 Column data for the resulting sparse matrix (coo format)
 vals: array of shape (n_samples * n_neighbors)
 Entries for the resulting sparse matrix (coo format)
 dists: array of shape (n_samples * n_neighbors)
 Distance associated with each entry in the resulting sparse matrix
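The shape of this computation can be sketched in plain Python. The exponential form exp(-(d - rho) / sigma) is the membership function from the UMAP paper and is an assumption as far as this page is concerned; the helper membership_strengths and the toy inputs are hypothetical, and the sketch omits the return_dists and bipartite options.

```python
import math


def membership_strengths(knn_indices, knn_dists, sigmas, rhos):
    """Return COO-style (rows, cols, vals) lists for the local fuzzy sets."""
    rows, cols, vals = [], [], []
    for i, (neighbors, dists) in enumerate(zip(knn_indices, knn_dists)):
        for j, d in zip(neighbors, dists):
            if j == i:
                val = 0.0  # no self-loop
            elif d - rhos[i] <= 0.0 or sigmas[i] == 0.0:
                val = 1.0  # within the local connectivity radius
            else:
                val = math.exp(-(d - rhos[i]) / sigmas[i])
            rows.append(i)
            cols.append(j)
            vals.append(val)
    return rows, cols, vals


knn_indices = [[0, 1, 2], [1, 0, 2]]
knn_dists = [[0.0, 0.5, 1.5], [0.0, 0.5, 2.5]]
rows, cols, vals = membership_strengths(
    knn_indices, knn_dists, sigmas=[1.0, 2.0], rhos=[0.5, 0.5]
)
print(list(zip(rows, cols, vals)))
```

Each output list has n_samples * n_neighbors entries, matching the shapes documented above.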

umap.umap_.discrete_metric_simplicial_set_intersection(simplicial_set, discrete_space, unknown_dist=1.0, far_dist=5.0, metric=None, metric_kws={}, metric_scale=1.0)
Combine a fuzzy simplicial set with another fuzzy simplicial set generated from discrete metric data using discrete distances. The target data is assumed to be categorical label data (a vector of labels), and this will update the fuzzy simplicial set to respect that label data.
TODO: optional category cardinality based weighting of distance
 simplicial_set: sparse matrix
 The input fuzzy simplicial set.
 discrete_space: array of shape (n_samples)
 The categorical labels to use in the intersection.
 unknown_dist: float (optional, default 1.0)
 The distance an unknown label (-1) is assumed to be from any point.
 far_dist: float (optional, default 5.0)
 The distance between unmatched labels.
 metric: str (optional, default None)
 If not None, then use this metric to determine the distance between values.
 metric_scale: float (optional, default 1.0)
 If using a custom metric, scale the distance values by this value; this controls the weighting of the intersection. Larger values weight more toward target.
 simplicial_set: sparse matrix
 The resulting intersected fuzzy simplicial set.

umap.umap_.fast_intersection
Under the assumption of categorical distance for the intersecting simplicial set perform a fast intersection.
 rows: array
 An array of the row of each nonzero in the sparse matrix representation.
 cols: array
 An array of the column of each nonzero in the sparse matrix representation.
 values: array
 An array of the value of each nonzero in the sparse matrix representation.
 target: array of shape (n_samples)
 The categorical labels to use in the intersection.
 unknown_dist: float (optional, default 1.0)
 The distance an unknown label (-1) is assumed to be from any point.
 far_dist: float (optional, default 5.0)
 The distance between unmatched labels.
None
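A plain-Python sketch of such a categorical intersection is given below. It assumes each distance is turned into a multiplicative weight exp(-distance), which is not stated on this page; the function name categorical_intersection and the toy edges are hypothetical. Like the documented function, it updates the values in place and returns None.

```python
import math


def categorical_intersection(rows, cols, values, target, unknown_dist=1.0, far_dist=5.0):
    """Down-weight edges in place according to categorical labels; returns None."""
    for nz, (i, j) in enumerate(zip(rows, cols)):
        if target[i] == -1 or target[j] == -1:
            values[nz] *= math.exp(-unknown_dist)  # at least one label unknown
        elif target[i] != target[j]:
            values[nz] *= math.exp(-far_dist)      # labels disagree
        # matching labels: the edge weight is left unchanged


rows = [0, 0, 1]
cols = [1, 2, 3]
values = [1.0, 1.0, 1.0]
target = [0, 0, 1, -1]  # -1 marks an unlabelled point

categorical_intersection(rows, cols, values, target)
print(values)
```

Edges between differently labelled points are penalised far more heavily than edges touching an unlabelled point, which is what makes semi-supervised use workable.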

umap.umap_.fast_metric_intersection
Under the assumption of categorical distance for the intersecting simplicial set perform a fast intersection.
 rows: array
 An array of the row of each nonzero in the sparse matrix representation.
 cols: array
 An array of the column of each nonzero in the sparse matrix representation.
 values: array of shape
 An array of the values of each nonzero in the sparse matrix representation.
 discrete_space: array of shape (n_samples, n_features)
 The vectors of categorical labels to use in the intersection.
 metric: numba function
 The function used to calculate distance over the target array.
 scale: float
 A scaling to apply to the metric.
None

umap.umap_.find_ab_params(spread, min_dist)
Fit a, b params for the differentiable curve used in lower dimensional fuzzy simplicial complex construction. We want the smooth curve (from a predefined family with simple gradient) that best matches an offset exponential decay.
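The fit can be illustrated with a coarse grid search in plain Python. Two assumptions are made here: the smooth family is taken to be 1 / (1 + a x^(2b)), as in the UMAP paper, and the offset exponential decay is taken to be 1 up to min_dist and exp(-(x - min_dist) / spread) beyond it. The grid search merely stands in for a proper least-squares optimizer, and all function names are hypothetical.

```python
import math


def target_curve(x, spread, min_dist):
    """Offset exponential decay: flat at 1 up to min_dist, then decaying."""
    if x < min_dist:
        return 1.0
    return math.exp(-(x - min_dist) / spread)


def smooth_curve(x, a, b):
    """Assumed differentiable family: 1 / (1 + a * x^(2b))."""
    return 1.0 / (1.0 + a * x ** (2.0 * b))


def fit_ab_on_grid(spread=1.0, min_dist=0.1, n_points=300):
    """Coarse least-squares grid search for (a, b) over x in [0, 3 * spread]."""
    xs = [3.0 * spread * k / (n_points - 1) for k in range(n_points)]
    ys = [target_curve(x, spread, min_dist) for x in xs]
    best = None
    for a_step in range(5, 31):        # a in 0.5 .. 3.0
        for b_step in range(10, 31):   # b in 0.50 .. 1.50
            a, b = a_step / 10.0, b_step / 20.0
            error = sum((smooth_curve(x, a, b) - y) ** 2 for x, y in zip(xs, ys))
            if best is None or error < best[0]:
                best = (error, a, b)
    return best[1], best[2]


a, b = fit_ab_on_grid(spread=1.0, min_dist=0.1)
print(a, b)
```

Smaller min_dist values push the target curve's plateau toward zero, so the fitted curve falls off sooner and embedded points can pack more tightly.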

umap.umap_.fuzzy_simplicial_set(X, n_neighbors, random_state, metric, metric_kwds={}, knn_indices=None, knn_dists=None, angular=False, set_op_mix_ratio=1.0, local_connectivity=1.0, apply_set_operations=True, verbose=False, return_dists=None)
Given a set of data X, a neighborhood size, and a measure of distance compute the fuzzy simplicial set (here represented as a fuzzy graph in the form of a sparse matrix) associated to the data. This is done by locally approximating geodesic distance at each point, creating a fuzzy simplicial set for each such point, and then combining all the local fuzzy simplicial sets into a global one via a fuzzy union.
 X: array of shape (n_samples, n_features)
 The data to be modelled as a fuzzy simplicial set.
 n_neighbors: int
 The number of neighbors to use to approximate geodesic distance. Larger numbers induce more global estimates of the manifold that can miss finer detail, while smaller values will focus on fine manifold structure to the detriment of the larger picture.
 random_state: numpy RandomState or equivalent
 A state capable of being used as a numpy random state.
 metric: string or function (optional, default ‘euclidean’)
The metric to use to compute distances in high dimensional space. If a string is passed it must match a valid predefined metric. If a general metric is required a function that takes two 1d arrays and returns a float can be provided. For performance purposes it is required that this be a numba jit’d function. Valid string metrics include:
 euclidean (or l2)
 manhattan (or l1)
 cityblock
 braycurtis
 canberra
 chebyshev
 correlation
 cosine
 dice
 hamming
 jaccard
 kulsinski
 ll_dirichlet
 mahalanobis
 matching
 minkowski
 rogerstanimoto
 russellrao
 seuclidean
 sokalmichener
 sokalsneath
 sqeuclidean
 yule
 wminkowski
Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.
 metric_kwds: dict (optional, default {})
 Arguments to pass on to the metric, such as the p value for Minkowski distance.
 knn_indices: array of shape (n_samples, n_neighbors) (optional)
 If the k-nearest neighbors of each point have already been calculated you can pass them in here to save computation time. This should be an array with the indices of the k-nearest neighbors as a row for each data point.
 knn_dists: array of shape (n_samples, n_neighbors) (optional)
 If the k-nearest neighbors of each point have already been calculated you can pass them in here to save computation time. This should be an array with the distances of the k-nearest neighbors as a row for each data point.
 angular: bool (optional, default False)
 Whether to use angular/cosine distance for the random projection forest for seeding NN-descent to determine approximate nearest neighbors.
 set_op_mix_ratio: float (optional, default 1.0)
 Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
 local_connectivity: int (optional, default 1)
 The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should not be more than the local intrinsic dimension of the manifold.
 verbose: bool (optional, default False)
 Whether to report information on the current progress of the algorithm.
 return_dists: bool or None (optional, default None)
 Whether to return the pairwise distance associated with each edge.
 fuzzy_simplicial_set: coo_matrix
 A fuzzy simplicial set represented as a sparse matrix. The (i, j) entry of the matrix represents the membership strength of the 1-simplex between the ith and jth sample points.
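
The effect of set_op_mix_ratio can be illustrated with a small sketch. It assumes the local fuzzy sets have been collected into one directed membership matrix, held here as a plain list of lists rather than a sparse matrix, and it uses the product t-norm for the intersection and the matching probabilistic sum for the union. The function name combine_fuzzy_sets is hypothetical and is not part of umap.

```python
def combine_fuzzy_sets(a, set_op_mix_ratio=1.0):
    """Symmetrise a square membership matrix given as a list of lists.

    Hypothetical helper for illustration only; umap works on sparse matrices.
    """
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            prod = a[i][j] * a[j][i]           # fuzzy intersection (product t-norm)
            union = a[i][j] + a[j][i] - prod   # fuzzy union (probabilistic sum)
            out[i][j] = set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * prod
    return out


# Directed memberships: point 0 is confident about 1, point 1 less so about 0.
directed = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.2],
    [0.0, 0.0, 0.0],
]
pure_union = combine_fuzzy_sets(directed, set_op_mix_ratio=1.0)
pure_intersection = combine_fuzzy_sets(directed, set_op_mix_ratio=0.0)
```

With a pure union, an edge survives if either endpoint lists the other as a neighbor; with a pure intersection, it survives only if both do.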

umap.umap_.
init_graph_transform
(graph, embedding)[source]¶ Given a bipartite graph representing the 1-simplices and strengths between the new points and the original data set, along with an embedding of the original points, initialize the positions of new points relative to the strengths (of their neighbors in the source data).
If a point is in our original data set it embeds at the original point's coordinates. If a point has no neighbours in our original dataset it embeds as the np.nan vector. Otherwise a point is the weighted average of its neighbours' embedding locations.
 graph: csr_matrix (n_new_samples, n_samples)
 A matrix indicating the 1-simplices and their associated strengths. These strengths should be values between zero and one and not normalized. A value of one indicates that the new point is identical to one of our original points.
 embedding: array of shape (n_samples, dim)
 The original embedding of the source data.
 new_embedding: array of shape (n_new_samples, dim)
 An initial embedding of the new sample points.
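
The three cases described above (exact match, no neighbours, weighted average) can be sketched in plain Python. This is an illustration under stated assumptions rather than the library code: each row of the bipartite graph is held as a dict mapping an original-sample index to a strength, and the name init_graph_transform_sketch is hypothetical.

```python
import math


def init_graph_transform_sketch(graph_rows, embedding):
    """graph_rows[i] maps original-sample index -> strength for new point i."""
    dim = len(embedding[0])
    result = []
    for row in graph_rows:
        if not row:
            # No neighbours in the original data: embed as the NaN vector.
            result.append([math.nan] * dim)
            continue
        exact = [j for j, w in row.items() if w == 1.0]
        if exact:
            # Strength one means the point coincides with an original point.
            result.append(list(embedding[exact[0]]))
            continue
        total = sum(row.values())
        # Otherwise take the strength-weighted average of neighbour positions.
        result.append(
            [sum(w * embedding[j][d] for j, w in row.items()) / total
             for d in range(dim)]
        )
    return result


embedding = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
graph_rows = [{0: 1.0, 1: 0.3}, {}, {1: 0.5, 2: 0.5}]
new_embedding = init_graph_transform_sketch(graph_rows, embedding)
```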

umap.umap_.
init_transform
[source]¶ Given indices and weights and an original embedding, initialize the positions of new points relative to the indices and weights (of their neighbors in the source data).
 indices: array of shape (n_new_samples, n_neighbors)
 The indices of the neighbors of each new sample
 weights: array of shape (n_new_samples, n_neighbors)
 The membership strengths of associated 1-simplices for each of the new samples.
 embedding: array of shape (n_samples, dim)
 The original embedding of the source data.
 new_embedding: array of shape (n_new_samples, dim)
 An initial embedding of the new sample points.
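
For the indices/weights layout, a minimal sketch of the initialization follows. It assumes each row of weights has already been normalised to sum to one, so a weighted sum is also a weighted average; that normalisation is an assumption of this example, and init_transform_sketch is a hypothetical name.

```python
def init_transform_sketch(indices, weights, embedding):
    """Place each new point at the weighted sum of its neighbours' positions."""
    dim = len(embedding[0])
    return [
        [sum(w * embedding[j][d] for j, w in zip(idx_row, w_row))
         for d in range(dim)]
        for idx_row, w_row in zip(indices, weights)
    ]


source_embedding = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
indices = [[0, 1], [1, 2]]              # neighbours of each new sample
weights = [[0.5, 0.5], [0.25, 0.75]]    # assumed to sum to one per row
initial_positions = init_transform_sketch(indices, weights, source_embedding)
```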

umap.umap_.
make_epochs_per_sample
(weights, n_epochs)[source]¶ Given a set of weights and number of epochs generate the number of epochs per sample for each weight.
 weights: array of shape (n_1_simplices)
 The weights of how much we wish to sample each 1-simplex.
 n_epochs: int
 The total number of epochs we want to train for.
An array of the number of epochs per sample, one for each 1-simplex.
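
One way to read "epochs per sample" is as the gap, in epochs, between successive samplings of an edge. The sketch below assumes the strongest edge is sampled every epoch, weaker edges proportionally less often, and zero-weight edges are marked with -1 so they are never sampled. The name make_epochs_per_sample_sketch and the -1 convention are assumptions of this example.

```python
def make_epochs_per_sample_sketch(weights, n_epochs):
    """Return, per edge, how many epochs pass between successive samples."""
    w_max = max(weights)
    result = []
    for w in weights:
        # Number of times this edge would be sampled over the whole run.
        n_samples = n_epochs * (w / w_max)
        result.append(n_epochs / n_samples if n_samples > 0 else -1.0)
    return result


epochs_per_sample = make_epochs_per_sample_sketch([1.0, 0.5, 0.25, 0.0], 200)
```

Here the heaviest edge gets a gap of one epoch, while an edge with a quarter of that weight is visited every fourth epoch.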

umap.umap_.
nearest_neighbors
(X, n_neighbors, metric, metric_kwds, angular, random_state, low_memory=True, use_pynndescent=True, n_jobs=1, verbose=False)[source]¶ Compute the n_neighbors nearest points for each data point in X under metric. This may be exact, but more likely is approximated via nearest neighbor descent.
 X: array of shape (n_samples, n_features)
 The input data to compute the k-neighbor graph of.
 n_neighbors: int
 The number of nearest neighbors to compute for each sample in X.
 metric: string or callable
 The metric to use for the computation.
 metric_kwds: dict
 Any arguments to pass to the metric computation function.
 angular: bool
 Whether to use angular rp trees in NN approximation.
 random_state: np.random state
 The random state to use for approximate NN computations.
 low_memory: bool (optional, default True)
 Whether to pursue lower memory NN-descent.
 verbose: bool (optional, default False)
 Whether to print status data during the computation.
 knn_indices: array of shape (n_samples, n_neighbors)
 The indices of the n_neighbors closest points in the dataset.
 knn_dists: array of shape (n_samples, n_neighbors)
 The distances to the n_neighbors closest points in the dataset.
 rp_forest: list of trees
 The random projection forest used for searching (if used, None otherwise).
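
To make the shape and ordering of knn_indices and knn_dists concrete, here is an exact brute-force sketch for the Euclidean metric. It is not the approximate nearest neighbor descent the function normally uses, and it assumes each point counts as its own nearest neighbor at distance zero. brute_force_knn is a hypothetical name.

```python
import math


def brute_force_knn(X, n_neighbors):
    """Exact Euclidean k-nearest neighbours; O(n^2), for illustration only."""
    knn_indices, knn_dists = [], []
    for x in X:
        dists = [math.dist(x, y) for y in X]
        # Sort sample indices by distance and keep the closest n_neighbors.
        order = sorted(range(len(X)), key=dists.__getitem__)[:n_neighbors]
        knn_indices.append(order)
        knn_dists.append([dists[j] for j in order])
    return knn_indices, knn_dists


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
knn_indices, knn_dists = brute_force_knn(points, n_neighbors=2)
```

Each row is sorted by increasing distance, which is the layout the later smoothing step expects.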

umap.umap_.
raise_disconnected_warning
(edges_removed, vertices_disconnected, disconnection_distance, total_rows, threshold=0.1, verbose=False)[source]¶ A simple wrapper function to avoid large amounts of code repetition.

umap.umap_.
reset_local_connectivity
(simplicial_set, reset_local_metric=False)[source]¶ Reset the local connectivity requirement – each data sample should have complete confidence in at least one 1-simplex in the simplicial set. We can enforce this by locally rescaling confidences, and then remerging the different local simplicial sets together.
 simplicial_set: sparse matrix
 The simplicial set for which to recalculate with respect to local connectivity.
 simplicial_set: sparse_matrix
 The recalculated simplicial set, now with the local connectivity assumption restored.
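
The rescale-then-remerge idea can be sketched on a dense matrix. The example assumes "locally rescaling" means dividing each row by its maximum, so every sample has at least one membership of exactly one, and that "remerging" is a fuzzy union with the transpose. Both choices, and the name reset_local_connectivity_sketch, are assumptions of this illustration.

```python
def reset_local_connectivity_sketch(s):
    """Row-rescale a dense membership matrix, then fuzzy-union with its transpose."""
    n = len(s)
    scaled = []
    for row in s:
        row_max = max(row)
        # Rescale so the strongest edge of each sample has full confidence.
        scaled.append([v / row_max if row_max > 0 else 0.0 for v in row])
    return [
        [scaled[i][j] + scaled[j][i] - scaled[i][j] * scaled[j][i]
         for j in range(n)]
        for i in range(n)
    ]


weakened = [
    [0.0, 0.5, 0.25],
    [0.5, 0.0, 0.0],
    [0.25, 0.0, 0.0],
]
restored = reset_local_connectivity_sketch(weakened)
```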

umap.umap_.
simplicial_set_embedding
(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric=CPUDispatcher(<function euclidean_grad>), output_metric_kwds={}, euclidean_output=True, parallel=False, verbose=False, tqdm_kwds=None)[source]¶ Perform a fuzzy simplicial set embedding, using a specified initialisation method and then minimizing the fuzzy set cross entropy between the 1-skeletons of the high and low dimensional fuzzy simplicial sets.
 data: array of shape (n_samples, n_features)
 The source data to be embedded by UMAP.
 graph: sparse matrix
 The 1-skeleton of the high dimensional fuzzy simplicial set as represented by a graph for which we require a sparse matrix for the (weighted) adjacency matrix.
 n_components: int
 The dimensionality of the euclidean space into which to embed the data.
 initial_alpha: float
 Initial learning rate for the SGD.
 a: float
 Parameter of differentiable approximation of right adjoint functor
 b: float
 Parameter of differentiable approximation of right adjoint functor
 gamma: float
 Weight to apply to negative samples.
 negative_sample_rate: int (optional, default 5)
 The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
 n_epochs: int (optional, default 0)
 The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If 0 is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).
 init: string
 How to initialize the low dimensional embedding. Options are:
 ‘spectral’: use a spectral embedding of the fuzzy 1-skeleton
 ‘random’: assign initial embedding positions at random.
 A numpy array of initial embedding positions.
 random_state: numpy RandomState or equivalent
 A state capable of being used as a numpy random state.
 metric: string or callable
 The metric used to measure distance in high dimensional space; used if multiple connected components need to be laid out.
 metric_kwds: dict
 Key word arguments to be passed to the metric function; used if multiple connected components need to be laid out.
 densmap: bool
 Whether to use the density-augmented objective function to optimize the embedding according to the densMAP algorithm.
 densmap_kwds: dict
 Key word arguments to be used by the densMAP optimization.
 output_dens: bool
 Whether to output local radii in the original data and the embedding.
 output_metric: function
 Function returning the distance between two points in embedding space and the gradient of the distance wrt the first argument.
 output_metric_kwds: dict
 Key word arguments to be passed to the output_metric function.
 euclidean_output: bool
 Whether to use the faster code specialised for euclidean output metrics
 parallel: bool (optional, default False)
 Whether to run the computation using numba parallel. Running in parallel is nondeterministic, and is not used if a random seed has been set, to ensure reproducibility.
 verbose: bool (optional, default False)
 Whether to report information on the current progress of the algorithm.
 tqdm_kwds: dict
 Key word arguments to be used by the tqdm progress bar.
 embedding: array of shape (n_samples, n_components)
 The optimized embedding of graph into an n_components dimensional euclidean space.
 aux_data: dict
 Auxiliary output returned with the embedding. When the densMAP extension is turned on, this dictionary includes local radii in the original data (rad_orig) and in the embedding (rad_emb).
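
The objective being minimized can be sketched for a single edge. The example assumes the low dimensional membership curve 1 / (1 + a * d^(2b)) from the UMAP paper, which is where the parameters a and b enter, and it uses a = b = 1 purely for illustration rather than fitted values. The function names are hypothetical, and the real optimizer works by sampled stochastic gradient descent rather than by evaluating this sum directly.

```python
import math


def low_dim_membership(dist, a, b):
    """Membership strength of an edge whose endpoints are dist apart in the embedding."""
    return 1.0 / (1.0 + a * dist ** (2.0 * b))


def edge_cross_entropy(v, w, eps=1e-12):
    """Fuzzy set cross entropy between high-dim strength v and low-dim strength w."""
    w = min(max(w, eps), 1.0 - eps)  # keep the logarithms finite
    attract = v * math.log(v / w) if v > 0 else 0.0
    repel = (1.0 - v) * math.log((1.0 - v) / (1.0 - w)) if v < 1 else 0.0
    return attract + repel


# A strong edge (v = 0.9) costs little when embedded close, more when far apart.
cost_near = edge_cross_entropy(0.9, low_dim_membership(0.3, a=1.0, b=1.0))
cost_far = edge_cross_entropy(0.9, low_dim_membership(3.0, a=1.0, b=1.0))
```

The first term pulls connected points together and the second pushes unconnected points apart, which is the role negative sampling and gamma play in the optimization.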

umap.umap_.
smooth_knn_dist
[source]¶ Compute a continuous version of the distance to the k-th nearest neighbor. That is, this is similar to knn-distance but allows continuous k values rather than requiring an integral k. In essence we are simply computing the distance such that the cardinality of the fuzzy set we generate is k.
 distances: array of shape (n_samples, n_neighbors)
 Distances to nearest neighbors for each sample. Each row should be a sorted list of distances to a given sample's nearest neighbors.
 k: float
 The number of nearest neighbors to approximate for.
 n_iter: int (optional, default 64)
 We need to binary search for the correct distance value. This is the max number of iterations to use in such a search.
 local_connectivity: int (optional, default 1)
 The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should not be more than the local intrinsic dimension of the manifold.
 bandwidth: float (optional, default 1)
 The target bandwidth of the kernel, larger values will produce larger return values.
 knn_dist: array of shape (n_samples,)
 The distance to the k-th nearest neighbor, as suitably approximated.
 nn_dist: array of shape (n_samples,)
 The distance to the 1st nearest neighbor for each point.
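
The binary search can be sketched for a single row of sorted distances. This sketch makes several assumptions that should be checked against the source: the first entry of the row is the point itself, local_connectivity is 1 (so the offset rho is the first non-zero distance), and the target cardinality is log2(k) * bandwidth as in the UMAP paper rather than k itself. smooth_knn_sigma is a hypothetical name.

```python
import math


def smooth_knn_sigma(row, k, n_iter=64, bandwidth=1.0, tol=1e-5):
    """Find sigma so that sum_j exp(-max(d_j - rho, 0) / sigma) matches the target."""
    target = math.log2(k) * bandwidth
    nonzero = [d for d in row if d > 0.0]
    rho = nonzero[0] if nonzero else 0.0   # distance to the nearest distinct point
    lo, hi, sigma = 0.0, math.inf, 1.0
    for _ in range(n_iter):
        # Skip row[0], assumed to be the point itself at distance zero.
        psum = sum(math.exp(-max(d - rho, 0.0) / sigma) for d in row[1:])
        if abs(psum - target) < tol:
            break
        if psum > target:
            hi = sigma                     # sigma too large: shrink the bracket
            sigma = (lo + hi) / 2.0
        else:
            lo = sigma                     # sigma too small: grow or bisect
            sigma = sigma * 2.0 if math.isinf(hi) else (lo + hi) / 2.0
    return sigma, rho


sorted_row = [0.0, 1.0, 2.0, 3.0]
sigma, rho = smooth_knn_sigma(sorted_row, k=4)
achieved = sum(math.exp(-max(d - rho, 0.0) / sigma) for d in sorted_row[1:])
```

The sum grows monotonically with sigma, which is what makes a bracketing search appropriate here.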