I fit a single UMAP model on the full dataset and then embed only the female samples.
I expect umap.transform(X_female) to be identical (up to numerical noise) to umap.transform(X_all)[female_idx].
Instead, I consistently get large differences between the two, despite using the same fitted model, fixed random_state, and effectively single-threaded execution (UMAP prints the warning that random_state disables parallelism).
Expected behavior
umap.transform(X_subset) should produce the same embedding as umap.transform(X_all)[subset_idx], up to numerical noise.
Actual behavior
Significant coordinate shifts appear even with fixed seeds, single-threaded execution, and identical preprocessing.
Measured differences (same fitted model; transforming the females only vs. transforming all samples and then subsetting):
max |Δ| = 2.443e+00
RMSE = 4.176e-01
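For anyone who wants to run the same comparison, here is a minimal sketch of how the two summary numbers can be computed from the two embeddings. The helper name `embedding_diff` is hypothetical, and the definitions are assumptions: max |Δ| is taken as the largest absolute per-coordinate difference, and RMSE is taken over all coordinates. The per-sample norm matches what the ‖Δ‖ histogram plots.

```python
import numpy as np


def embedding_diff(emb_a, emb_b):
    """Summarize pointwise differences between two embeddings of the same samples.

    Both inputs must have shape (n_samples, n_components), with rows in the same order.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    if emb_a.shape != emb_b.shape:
        raise ValueError(f"shape mismatch: {emb_a.shape} vs {emb_b.shape}")

    delta = emb_a - emb_b
    return {
        # Assumed definition: largest absolute difference in any single coordinate.
        "max_abs": float(np.max(np.abs(delta))),
        # Assumed definition: root mean squared difference over all coordinates.
        "rmse": float(np.sqrt(np.mean(delta ** 2))),
        # Euclidean length of the shift for each sample.
        "per_sample_norm": np.linalg.norm(delta, axis=1),
    }


# Usage with the fitted model from this report (named `umap` here):
#   emb_subset = umap.transform(X_female)
#   emb_full = umap.transform(X_all)[female_idx]
#   stats = embedding_diff(emb_subset, emb_full)
#   print(f"max |Δ| = {stats['max_abs']:.3e}, RMSE = {stats['rmse']:.3e}")
```

If the two transform paths agreed up to numerical noise, both numbers would be many orders of magnitude smaller than the values reported here.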
Attached are three figures:
case1_overlay.png — overlay of both embeddings (colored by transform path)
case1_delta_scatter.png — 2D scatter of pointwise Δ
case1_delta_hist.png — histogram of ‖Δ‖ per sample
These figures show that transform(X_subset) and transform(X_all)[subset_idx] yield different embeddings for the same samples.
Why does the result of UMAP's transform for a given sample depend on which other samples are passed in the same call?
