matmols

The Illustrated AlphaFold

2024-07-10T00:00:00+00:00

Introduction

Who should read this

Do you want to understand exactly how AlphaFold3 works? The architecture is quite complicated and the description in the paper can be overwhelming, so we made a much more friendly (but just as detailed!) visual walkthrough.

This is mostly written for an ML audience and multiple points assume familiarity with the steps of attention. If you’re rusty, see Jay Alammar’s The Illustrated Transformer for a thorough visual explanation. That post is one of the best explanations of a model architecture at the level of individual matrix operations and also the inspiration for the diagrams and naming.

There are already many great explanations of the motivation for protein structure prediction, the CASP competition, model failure modes, debates about evaluations, implications for biotech, etc. so we don’t focus on any of that. Instead we explore the how.

How are these molecules represented in the model and what are all of the operations that convert them into a predicted structure?

This is probably more exhaustive than most people are looking for, but if you want to understand all the details and you like learning via diagrams, this should help :)

Architecture Overview

We’ll start by pointing out that the goals of the model are a bit different from those of previous AlphaFold models: instead of just predicting the structure of individual protein sequences (AF2) or protein complexes (AF-Multimer), it predicts the structure of a protein, optionally complexed with other proteins, nucleic acids, or small molecules, all from sequence alone. So while previous AF models only had to represent sequences of standard amino acids, AF3 has to represent more complex input types, and thus there is a more complex featurization/tokenization scheme. Tokenization is described in its own section, but for now just know that when we say “token” it represents either a single amino acid (for proteins), a single nucleotide (for DNA/RNA), or an individual atom if that atom is not part of a standard amino acid/nucleotide.

Interactive Table of Contents

Full architecture. If you click on any part of the architecture, it will take you to that section of the post. If you resize the page, you might need to refresh to keep the interactive part working. (Diagram modified from AF3 paper)

Throughout the post, we highlight where you are in this diagram so you don't get lost!

The model can be broken down into 3 main sections:

1. Input Preparation: The user provides sequences of some molecules to predict structures for and these need to be embedded into numerical tensors. Furthermore, the model retrieves a collection of other molecules that are presumed to have similar structures to the user-provided molecules. The input preparation step identifies these molecules and also embeds these as their own tensors.
2. Representation Learning: Given the Single and Pair tensors created in section 1, we use many variants of attention to update these representations.
3. Structure Prediction: We use these improved representations, and the original inputs created in section 1, to predict the structure using conditional diffusion.

Skip to a specific section via its name here or by clicking the relevant part of the architecture in the diagram above.

We also have additional sections describing 4. the loss function, confidence heads, and other relevant training details and 5. some thoughts on the model from an ML trends perspective.

Notes on the variables and diagrams

Throughout the model a protein complex is represented in two primary forms: the “single” representation which represents all the tokens in our protein complex, and a “pair” representation which represents the relationships (e.g. distance, potential interactions) between all pairs of amino acids / atoms in the complex. Each of these can be represented at an atom-level or a token-level, and will always be shown with these names (as established in the AF3 paper) and colors:

The diagrams abstract away the model weights and only visualize how the shapes of activations change
The activation tensors are always labeled with the dimension names used in the paper and the sizes of the diagrams vaguely aim to follow when these dimensions grow/shrink. The hidden dimension names usually start with "c" for "channel". For reference the main dimensions used are c_z=128, c_m=64, c_atom=128, c_atompair=16, c_token=768, c_s=384.
Whenever possible, the names above the tensors in this (and every) diagram match the names of the tensors used in the AF3 supplement. Typically, a tensor maintains its name as it goes through the model. However, in some cases, we use different names to distinguish between versions of a tensor at different stages of processing. For example, in the atom-level single representation, c represents the initial atom-level single representation while q represents the updated version of this representation as it progresses through the Atom Transformer.
We also ignore most of the LayerNorms for simplicity but they are used everywhere.

1. Input Preparation

The actual input a user provides to AF3 is the sequence of one protein and optionally additional molecules. The goal of this section is to convert these sequences into a series of 6 tensors that will be used as the input to the main trunk of the model as outlined in this diagram. These tensors are s, our token-level single representation, z, our token-level pair representation, q, our atom-level single representation, p, our atom-level pair representation, m, our MSA representation, and t, our template representation.
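As a rough orientation, the shapes of five of these tensors can be sketched as below. This is a sketch under assumptions: the pairing of each channel size with a tensor follows the dimension names listed in the notes above (c_s, c_z, c_atom, c_atompair, c_m), the counts n_token, n_atom, and n_msa are placeholder names, and the template representation t is left out because its shape is not stated here.

```python
# Channel sizes quoted in the notes on variables and diagrams.
C_S, C_Z, C_ATOM, C_ATOMPAIR, C_M = 384, 128, 128, 16, 64


def trunk_input_shapes(n_token: int, n_atom: int, n_msa: int) -> dict:
    """Return the assumed shapes of the s, z, q, p and m tensors."""
    return {
        "s": (n_token, C_S),                # token-level single
        "z": (n_token, n_token, C_Z),       # token-level pair
        "q": (n_atom, C_ATOM),              # atom-level single
        "p": (n_atom, n_atom, C_ATOMPAIR),  # atom-level pair
        "m": (n_msa, n_token, C_M),         # MSA representation
    }
```

The pair tensors grow quadratically with the number of tokens or atoms, which is worth keeping in mind when reading the diagrams.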

This section contains:

Tokenization describes how molecules are tokenized and clarifies the difference between atom-level and token-level
Retrieval (Create MSA and Templates) explains why and how we include additional inputs to the model. It creates our MSA (m) and structure templates (t).
Create Atom-Level Representations creates our first atom-level representations q (single) and p (pair) and includes information about generated conformers of the molecules.
Update Atom-Level Representations (Atom Transformer) is the main “Input Embedder” block, also called the “Atom Transformer”, which gets repeated 3 times and updates the atom-level single representation (q). The building blocks introduced here (Adaptive LayerNorm, Attention with Pair Bias, Conditioned Gating, and Conditioned Transition) are also relevant later in the model.
Aggregate Atom-Level -> Token-Level takes our atom-level representations (q, p) and aggregates all the atoms that are part of multi-atom tokens to create token-level representations s (single) and z (pair) and includes information from the MSA (m) and any user-provided information about known bonds that involve ligands.

Tokenization

See where this fits into the full architecture

In AF2, as the model only represented proteins with a fixed set of amino acids, each amino acid was represented with its own token. This is maintained in AF3, but additional tokens are also introduced for the additional molecule types that AF3 can handle:

Standard amino acid: 1 token (as per AF2)
Standard nucleotide: 1 token
Non-standard amino acids or nucleotides (methylated nucleotide, amino acid with post-translational modification, etc.): 1 token per atom
Other molecules: 1 token per atom

As a result, we can think of some tokens (like those for amino acids) as being associated with multiple atoms, while other tokens (like those for an atom in a ligand) are associated with only a single atom. So, while a protein with 35 standard amino acids (likely > 600 atoms) would be represented by 35 tokens, a ligand with 35 atoms would also be represented by 35 tokens.
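The counting rule above can be written down in a few lines. This is only an illustrative sketch: the Residue class and count_tokens function are hypothetical names, not part of AF3, and the atom counts in the usage example are made up.

```python
from dataclasses import dataclass


@dataclass
class Residue:
    name: str
    num_atoms: int
    is_standard: bool  # True for a standard amino acid or standard nucleotide


def count_tokens(residues) -> int:
    """Standard residues contribute 1 token; everything else 1 token per atom."""
    return sum(1 if r.is_standard else r.num_atoms for r in residues)
```

With this rule, a chain of 35 standard amino acids and a single 35-atom ligand (passed as one non-standard entry) both come out at 35 tokens, matching the example in the text.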

Retrieval (Create MSA and Templates)

See where this fits into the full architecture

One of the key early steps in AF3 is something akin to Retrieval Augmented Generation (RAG) in language models. We find similar sequences to our protein and RNA sequences of interest (collected into a multiple sequence alignment, “MSA”), and any structures related to those (called the “templates”), then include them as additional inputs to the model called m and t, respectively.

(Image from AF2)

Why do we want to include MSA and templates?

Mapping Chemical Space with UMAP

2021-04-06T11:46:00+00:00

Note: this was originally written for the Reverie Labs blog which got taken down after acquisition so now it’s re-posted here

This blog post discusses why it is important to visualize the latent space of chemical datasets, what makes UMAP a useful tool for this purpose, and how we use UMAP at Reverie Labs. As an example, we use the Blood Brain Barrier Permeability (BBBP) dataset from MoleculeNet for our visualizations and the code tutorial. This dataset has measurements from over 2000 unique compounds, many of which are approved drugs, each labeled as “permeable” or “not permeable”. We look into the details of how this dataset gets embedded by various dimensionality reduction methods and reveal some fascinating properties of UMAP.

For a walkthrough using this dataset of how to use UMAP to visualize chemical space, see this Colab notebook:

Motivation

A fundamental assumption behind most machine learning methods is that data are independent and identically distributed (IID). However, in drug discovery datasets, compounds are almost never sampled independently, as they are typically extracted from experiments for specific therapeutic programs. Measurements often follow the patterns of the drug development efforts that generate them. Any biases in the data-generation process can also sneak into the training and evaluation of models. In practice, this means that open source and industry datasets are often “clumpy”, consisting of measurements for compounds that are very similar to one another and non-uniformly cover chemical space. Visualizing the chemical space of a dataset does not solve these issues, but it helps us better understand them within the context of our datasets.

At Reverie Labs, we work with dozens of datasets spanning a range of different properties. When ingesting a new dataset, we begin with a systematized analysis that identifies biases and helps determine how best to prepare and use the dataset for modeling. As with most machine learning problems, we want to understand the distribution of the measured property we are modeling (What is the class balance within a classification dataset? Does our regression dataset have truncated measurements due to assay limits? etc.) We want to make sure the measurements look reasonable and compare their distributions in our training and test data. However, we cannot stop there. We must also look at the distribution of the chemical structures that the measurements are taken from because the compounds themselves are typically selected from a biased generation process.

Visualizing the distribution of our compounds in chemical space allows us to gauge how much we expect models trained on a given dataset to generalize to new chemical matter. These visualizations can help with manual inspection of Structure-Activity / Structure-Property Relationships (SAR/SPR), expose potential quirks or biases in the dataset, and reveal insights for how we might want to split the dataset into training and evaluation sets. There are many methods we can use for visualizing chemical space, but at Reverie, we have selected a default procedure involving UMAP that optimizes accuracy, speed, and ease-of-use.

Embeddings

That sounds great, but how do we actually visualize these distributions? The key here is that we need to embed our compounds into a low-dimensional vector form that can be easily interpreted (in this post we stick to 2D for simplicity’s sake but 3D would also work). Representing our molecules as 2048-bit Extended-Connectivity Fingerprints (ECFPs) gives us high dimensional vectors that we can then project into a 2-dimensional space for visualization. PCA and t-SNE are commonly used for this kind of dimensionality reduction, and have been used for many biology and chemistry purposes. Since usage of these tools has been thoroughly documented, here we focus on the utility of a more recent addition to the dimensionality reduction repertoire: UMAP.
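A minimal sketch of the fingerprinting step, assuming RDKit is installed. The post does not say which toolkit or fingerprint radius was used, so RDKit, the radius of 2, and the helper names ecfp_on_bits and to_dense are all assumptions made for illustration. Newer RDKit releases also offer a fingerprint-generator API and may emit a deprecation warning for the call used here.

```python
N_BITS = 2048  # fingerprint length mentioned in the text
RADIUS = 2     # assumption: ECFP4-style radius; the post does not state one


def ecfp_on_bits(smiles: str) -> set:
    """Return the indices of the set bits of a Morgan (ECFP-like) fingerprint."""
    # RDKit is imported lazily so that to_dense below works without it.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, RADIUS, nBits=N_BITS)
    return set(fp.GetOnBits())


def to_dense(on_bits, n_bits: int = N_BITS) -> list:
    """Expand a collection of on-bit indices into a 0/1 vector of length n_bits."""
    dense = [0] * n_bits
    for i in on_bits:
        dense[i] = 1
    return dense
```

Stacking the dense vectors for every compound gives the high-dimensional matrix that PCA, t-SNE, or UMAP then projects down to 2D.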

Uniform Manifold Approximation and Projection (UMAP) constructs a high-dimensional graph representation of the entire dataset then tries to re-create a low dimensional version of this initial graph that maintains as much of the local and global structure as possible. This method is somewhat similar to t-SNE, but with a few key differences that lead to important advantages for our purposes. The technical differences between UMAP and t-SNE would take up a full blog post, so we will not detail them here but more details can be found in these resources: the original paper, Understanding UMAP or How Exactly UMAP Works.

The most relevant benefits UMAP provides us are speed, the ability to maintain some of the local / global structure of the data, and an easy interface for applying an embedding from one dataset to a different dataset.

Speed

To evaluate the speed of these methods, we compute ECFPs of various sizes and embed each using PCA, t-SNE, and UMAP:

We can see that PCA is the most efficient, UMAP is slightly less efficient, and t-SNE is by far the least efficient. For small datasets, these time differences are not particularly relevant, but they really add up as the number of compounds grows. To be fair, we used the most generic implementations (PCA and t-SNE from sklearn) and there are more efficient variants of t-SNE. However, the UMAP Docs contain a similar performance analysis on MNIST, including a wider variety of the performant methods, and find similar results.

Additionally, the creators of UMAP recently released a new version of the algorithm, ParametricUMAP, that uses a neural network to reduce the dimensions of the graphical embedding, giving it even greater speed improvements. For this post we stick to the original, non-parametric UMAP, but if you want to optimize the speed of embedding new compounds into a pre-fit UMAP model, ParametricUMAP can be quite useful.

Local / Global Structure

If PCA is so fast, why don’t we just use that? Our goal is to understand both the local structure (organization of similar compounds) and global structure (organization of groups of different compounds) of complex datasets. Speed is important, but we also evaluate the methods according to how informative they are for those tasks. We can use the example BBBP dataset to explore the local and global structure of the embeddings created by the various methods:

Click these links for interactive versions of the plots [PCA, t-SNE, UMAP]

At a first glance, we notice some high-level differences in the overall structures of the embeddings. PCA looks like the intersection of two orthogonal lines, and the compounds within them are somewhat uniformly distributed. t-SNE has a flowery shape that resembles a 2D gaussian, with a variety of isolated clusters around the edges. UMAP has many disjoint, tight clusters that do not follow a specific pattern. Beyond surface-level observations, we cannot ascertain much more about these embeddings from these plots alone. To do that, we need to look more into the actual compounds that are represented in these images.

These links [PCA, t-SNE, UMAP] take you to pages with interactive views of these plots, which is how we typically examine our datasets. Exploring the data in an interactive form helps build intuition for how the different algorithms organize the chemical space of the dataset. To help guide this interactive exploration, we’ve selected a few clusters from the UMAP plot and highlighted where each of their compounds is embedded in t-SNE and PCA:

We see that local areas of the embedding contain compounds that not only look similar, but also belong to the same drug class. The clusters we’ve selected here contain steroids, tetracycline antibiotics and β-lactam antibiotics. Diving into the details of these clusters and their respective drug types serves as a great case study through which we can better understand and compare the structure of the embeddings.

Case study

In the steroid cluster there are many compounds with the 4-ring system that is characteristic of steroids, and even some non-steroid compounds with a similar structure. Each method separates these compounds out from the rest, although UMAP appears to isolate them out and group them together the most strongly.

In a nearby but fully isolated cluster we find a collection of tetracycline antibiotics. These compounds also contain a fused 4-ring system; however, the tetracycline rings differ from the steroid rings in their shape, relative positioning, and bond orders.

Comparing the embedding of the steroid and tetracycline clusters we see: 1) Steroids appear more isolated than the tetracyclines. If we assume the global distances between clusters are meaningful, this implies that the steroids are more distinct from the rest of the dataset than the tetracyclines are. 2) While the steroid compounds are spread out within their cluster, the tetracyclines are all embedded practically on top of each other. If we assume the spread of a cluster reflects the local diversity, this implies that the steroids are more diverse than the tetracyclines. 3) The two clusters are relatively near each other. If we assume the relationships between clusters are meaningful, this implies that these clusters are more similar to each other than they are to the rest of the dataset.

To evaluate whether these assumptions about the global and local structure are true, we can examine the validity of the claims they imply. Should steroids be placed farther away from the rest of the compounds than the tetracyclines are? Both groups appear to stand out from the rest of the dataset, but it is difficult to manually compare their levels of uniqueness. Should tetracyclines and steroids actually be placed near each other? Given that both groups contain fused 4-ring systems, it seems reasonable for them to be located near each other but this similarity could just be superficial. To more quantitatively address these questions we look at the measured similarities between steroids, tetracyclines, and the rest of the compounds in the dataset. For each steroid compound, we calculate its average similarity to the other steroids, the tetracyclines, and every other compound. This is repeated with the tetracyclines. We define chemical similarity between compounds using Tanimoto similarity between ECFPs for the compounds, the same metric used to create the embeddings. The higher the Tanimoto similarity, the more similar two compounds are to one another.
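The per-compound comparison described above can be sketched as follows. Fingerprints are represented here as Python sets of on-bit indices, which is a simplification chosen for readability; the function names are hypothetical and this is not the code used to produce the plots.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity: shared on-bits / all on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_similarity(query: set, others: list) -> float:
    """Average Tanimoto similarity of one fingerprint to a list of others."""
    if not others:
        return 0.0
    return sum(tanimoto(query, o) for o in others) / len(others)


def similarity_profile(index: int, group: list, other_group: list, rest: list):
    """For one compound, return its mean similarity to (own group, other group, rest).

    The compound itself is excluded from its own group so that its
    self-similarity of 1.0 does not inflate the intra-group average.
    """
    query = group[index]
    same = [fp for j, fp in enumerate(group) if j != index]
    return (
        mean_similarity(query, same),
        mean_similarity(query, other_group),
        mean_similarity(query, rest),
    )
```

Calling similarity_profile for every steroid (with the tetracyclines as other_group), and then again with the roles swapped, gives the three distributions per group that the discussion compares.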

These plots contextualize some of our observations about the steroid and tetracycline clusters:

The steroids and tetracyclines have roughly equivalent levels of similarity to the rest of the dataset (average similarity of 0.09 and 0.1 respectively). The distributions are not identical but their difference is not large enough to explain the extra isolation of the steroid cluster. This supports the general belief that these global distances are not always interpretable.
Tetracyclines are indeed more homogeneous than steroids. Tetracyclines have, on average, 0.15 higher Tanimoto similarity with other tetracyclines than steroids do with other steroids. This means that the spread of the intra-group distributions actually reflects the local chemical diversity and is not just a strange quirk of the UMAP embedding.
Tetracyclines and steroids have higher similarity to each other than they do to the rest of the dataset. This means that the visual relationship between these embedded clusters actually reflects a real relationship between the compounds that the clusters represent.

This reveals that the global structure of this dataset is not maintained through exact distances between groups of compounds, but rather the relationships between them. Local structure is expressed by maintaining the local diversity of groups through their intra-group distribution. There actually is an even more detailed level of local structure hidden in these embeddings that we’ll examine later, but given the information we have so far these are the main conclusions. The embedding of two groups of compounds does not prove anything about UMAP that we expect to hold up for all embeddings. But their examples highlight important patterns in how chemical datasets get embedded.

We can compare UMAP’s arrangement of the steroids and tetracyclines with PCA and t-SNE’s to look for differences in the local and global structures of these embeddings:

Method	Global: steroid cluster identifiable	Global: tetracycline cluster identifiable	Global: relationship between groups	Local: steroid intra-group distribution	Local: tetracycline intra-group distribution
PCA	yes	no	nearby	disperse	disperse
t-SNE	yes	yes	far apart	disperse	tight
UMAP	yes	yes	nearby	disperse	tight

Based on these observations, PCA is not as successful at maintaining the structure of these groups within the dataset. Specifically, t-SNE and UMAP highlight the uniqueness and homogeneity of tetracyclines, whereas PCA spreads the tetracyclines out amidst various other scaffolds in an unidentifiable way. This again supports that, although PCA maintains a few key elements of the global structure, t-SNE and UMAP preserve the global and local structure more consistently throughout the dataset.

Differences between the embeddings are less noticeable when examining the steroids. Each method embeds the steroids in a clearly identifiable, yet disperse cluster. UMAP’s steroid cluster is the most isolated but, as discussed earlier, this extra separation is not particularly meaningful. Both PCA and UMAP place the chemically similar steroids and tetracyclines near each other while t-SNE does not. This seemingly implies t-SNE’s global structure is not as informative. However, t-SNE places the tetracyclines near the β-lactam antibiotics, which, as we will see below, actually makes sense.

The differences between the methods are most apparent when we examine the β-lactam antibiotics. We see the namesake β-lactam ring present in every compound and, in terms of global structure, all three methods separate out the β-lactam antibiotic compounds as significantly unique from the rest. Again, the placement in PCA is not as isolated as it is in the other methods but the cluster still stands out.

The β-lactam antibiotics are interesting because they give us a view into the level of local organization that t-SNE and UMAP have within the subclasses of this drug-type:

We have labelled each of the various subclasses of β-lactam antibiotics. They vary based on the particular details of the β-lactam ring system in a given compound. You don’t need to actually understand the differences between the various β-lactam subclasses but know they are fairly small. The importance of visualizing them is to highlight their placements in each of the embeddings.

PCA has all of the subtypes mixed together, which makes sense given that in PCA, the principal components we are visualizing are meant to have maximal variance. The nuances within a class of drugs are not particularly high variance so a 2-D PCA plot loses this local structure. On the other hand, t-SNE and UMAP both maintain local structures in the dataset by embedding the β-lactam antibiotics in a way that separates the compounds based on their subclass.

Zooming into the bottom of our UMAP plot where the β-lactam compounds are embedded, we can better examine the details of each subclass:

Not only are the β-lactam antibiotics contained in this section of the UMAP embedding, but the individual subclasses are each grouped together. If we look at the interactive versions of the plots (PCA, t-SNE, UMAP), we see a similar phenomenon in the t-SNE embedding. This ability to maintain both global structure and such specificity in the local structure is what makes these methods useful for easily exploring a dataset.

If we look even closer at the placement of the β-lactam antibiotics, we discover a fascinating property of our embedding. In the annotated plot above there are two main clusters of β-lactam antibiotics and one outlier compound, in the upper left corner, that has the substructure of a Penicillin (penam), yet appears to be located far away; in fact, it is placed in the tetracycline cluster.

The tetracycline antibiotics are structurally quite different from the β-lactam antibiotics, and yet they are embedded relatively nearby in both UMAP and t-SNE. We do not expect the distances between clusters to necessarily mean anything, but in this case, their relative positioning actually does.

The outlier compound, Penimocycline, contains both the substructure of the penicillins (penam) and the substructure of the tetracyclines. It is actually classified as both types of antibiotic. When constructing a graph representation of this dataset, this compound is likely connected to both of these well-connected subgraphs. This would link the two groups together, ultimately leading to their respective clusters of compounds being located near each other in the final embedding.

To investigate this assumption, we can remove Penimocycline from our dataset and generate a new embedding. If this compound is functioning as a link between the tetracycline antibiotics and β-lactam antibiotics, embedding the dataset without it would break the connection between the groups and their positions would no longer be close to each other.

As hypothesized, in the new embedding on the right, we see that the tetracycline antibiotics are no longer placed near the β-lactam antibiotics. Tetracycline antibiotics are still near the steroids in each embedding variant, yet the induced separation between the tetracycline and β-lactam antibiotics has shifted much of the embedding. UMAP is non-deterministic so we re-ran these embeddings (both the original on the left and the modified dataset on the right) multiple times and this phenomenon held up. This implies that the single compound actually is the influential node linking the β-lactam and tetracycline antibiotic clusters. It also reveals how the structure of UMAP is greatly influenced by individual compounds with strong connections between otherwise disconnected subgraphs of the dataset.
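The linking effect can be illustrated with a toy graph. This sketch does not reproduce UMAP's graph construction (which builds a weighted nearest-neighbor graph); it assumes a much simpler rule where two compounds are connected if their Tanimoto similarity passes a threshold, and all names and fingerprints used with it are made up for illustration.

```python
from collections import deque


def tanimoto(a: set, b: set) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_graph(fps: dict, threshold: float) -> dict:
    """Connect every pair of compounds whose similarity is >= threshold."""
    names = list(fps)
    graph = {name: set() for name in names}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if tanimoto(fps[u], fps[v]) >= threshold:
                graph[u].add(v)
                graph[v].add(u)
    return graph


def connected(graph: dict, start: str, goal: str) -> bool:
    """Breadth-first search: is there any path from start to goal?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False
```

If a hybrid compound shares enough bits with both groups, the two groups sit in one connected component; dropping that compound from fps and rebuilding the graph separates them again, which mirrors the experiment above in miniature.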

If you continue exploring the other areas of this dataset in the interactive links or Colab notebook provided you will also find collections of narcotics, sedatives, NSAIDs etc. Examining where specific compounds get placed in each of the embeddings helps explain the structural differences between the embeddings.

Hyperparameters

UMAP has several hyperparameters that give the user a bit more control over the structure of the final embedding based on their particular priorities:

n_components is the dimensionality of the final embedding. To create visualizations in 2 dimensions we keep this fixed at 2 components.
metric is the metric used to determine distance between points. Because we are comparing ECFPs, we use Jaccard distance (typically referred to as Tanimoto distance in cheminformatics).
n_neighbors determines the prioritization of local versus global structure in the embedding. This value constrains the number of neighbors that a given compound has in the graph representation of the dataset. If n_neighbors is small then the embedding focuses on optimizing the distances between similar compounds to ensure the small differences between them are well represented. If n_neighbors is larger, then the distances between less similar compounds are prioritized.
min_dist is the minimum distance allowed between any two points in the low-dimensional embedding. This affects the tightness of the embedding. The larger min_dist, the more spread out the compounds will be.
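These four settings map directly onto keyword arguments of the UMAP class in the umap-learn package. The sketch below is a starting point rather than our production configuration: the n_neighbors and min_dist values shown are the library defaults, not tuned choices, and embed_fingerprints is a hypothetical helper name.

```python
UMAP_KWARGS = {
    "n_components": 2,    # 2D output for plotting
    "metric": "jaccard",  # Jaccard (Tanimoto) distance on binary fingerprints
    "n_neighbors": 15,    # umap-learn default; smaller favors local structure
    "min_dist": 0.1,      # umap-learn default; larger spreads points out
}


def embed_fingerprints(fingerprints, **overrides):
    """Fit UMAP on a binary (n_compounds, n_bits) array and return 2D coordinates.

    Any keyword in `overrides` replaces the corresponding entry of UMAP_KWARGS.
    """
    import umap  # provided by the umap-learn package, imported lazily

    reducer = umap.UMAP(**{**UMAP_KWARGS, **overrides})
    return reducer.fit_transform(fingerprints)
```

A call such as embed_fingerprints(fps, n_neighbors=5, min_dist=0.5) is one way to produce a single panel of a hyperparameter grid.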

n_neighbors and min_dist can be tuned based on the dataset’s properties and the user’s preferences. The ideal settings may vary based on the dataset. These plots show how varying n_neighbors (along the rows) and min_dist (along the columns) can influence the embedding:

As these plots reveal, the spread of the embedding is quite dependent on the relationship between these two parameters. When min_dist is very small, compounds that are very similar to each other are placed almost directly on top of each other, which makes it very easy to identify unique clusters. However, it is more difficult to decipher the actual quantity of compounds and the nuanced differences between them. When min_dist is large, it is easier to gauge the full spread of the compounds but more difficult to isolate specific clusters. When n_neighbors is small there are meaningful patterns within clusters but the global structure is less interpretable. As this value increases we see the relationships between the clusters becoming more noticeable as the clusters become less sparse. As with the t-SNE / UMAP comparison, there is no clear answer that one particular set of hyperparameters is always best. Depending on what the user is looking for, there are many great options.

Other Works

For a non-comprehensive list of examples of others using UMAP for chemistry and biology purposes see:

This year, yet another alternative to t-SNE, TMAP, has been developed. We haven’t extensively investigated that method yet, but it does seem promising.

Caveat

It is impossible to distill all of the complexities of chemical space into 2-dimensions and a lot of information gets lost in the process. Our low-dimensional embeddings can only be as good as their high-dimensional predecessors, the ECFPs. ECFPs are an imperfect, yet important, method for vectorizing molecules and using the distances between these sparse vectors as the basis for generating the underlying UMAP graphs means that our embeddings can only be as good as those distance metrics. Despite these flaws, we still find these plots to provide significant value to our design efforts.

Practical Uses for UMAP at Reverie Labs

We use UMAP in two main ways when examining a dataset:

Dataset Specific Embeddings: Examine the particular distribution of compounds within a specific dataset
Dataset-Agnostic Embeddings: Examine where the compounds of a dataset fit into our general embedding of global chemical space

We can also use these two embeddings as new lenses to view the distribution of measured properties and other physicochemical properties.

Dataset Specific Embeddings

To create a dataset-specific embedding, we fit a UMAP model on the molecules of the particular dataset we are investigating. All of the visualizations we have used so far are based on dataset-specific embeddings of our example BBBP dataset. As we saw when examining the local and global structure of the UMAP embedding, a dataset-specific plot provides a good understanding of the nuances within the dataset.

To add an extra dimension, we can color the plots based on any other relevant data we have on the compounds. Earlier we did this using the drug-types of certain compounds, but we can just as easily visualize the compounds colored by the date-of-synthesis, the measured property (in our case blood brain barrier permeability), physicochemical properties, or the dataset-split.

This can help answer questions such as: Are there clusters of compounds that have not been actively developed in years? Are all of the most potent compounds in one area of chemical space? If splitting the dataset for model training and evaluation, do certain splitting methods lead to any particular artifacts?

Measured Properties

Here we color the points on the plot by the measured property of our dataset: whether the compounds can permeate the blood brain barrier or not. Many of the areas in this space contain exclusively permeable or exclusively impermeable compounds. This reveals that, within this dataset, there are certain types of compounds that are consistently permeable or not and the general scaffold of the compound is sufficient to determine the permeability of many compounds.

If we were to use machine learning to model this dataset, we might want to ensure that individual homogeneously-labeled scaffolds are not split between the training and test sets as that could misleadingly inflate performance metrics.
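One way to keep such groups intact is a group-aware split, sketched below using only the standard library. The group labels are hypothetical placeholders: in practice they might be scaffold identifiers or cluster assignments, and the `group_split` helper is an illustration rather than part of any particular library.

```python
import random
from collections import defaultdict


def group_split(group_keys, test_fraction=0.2, seed=0):
    """Split indices so every group lands wholly in train or wholly in test."""
    members = defaultdict(list)
    for idx, key in enumerate(group_keys):
        members[key].append(idx)

    keys = sorted(members)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(group_keys)
    train, test = [], []
    for key in keys:
        # Fill the test set one whole group at a time until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(members[key])
    return sorted(train), sorted(test)


# Toy group labels (hypothetical scaffold or cluster names, one per compound).
groups = ["steroid", "steroid", "macrocycle", "phenothiazine",
          "macrocycle", "steroid", "benzodiazepine", "phenothiazine"]
train_idx, test_idx = group_split(groups, test_fraction=0.25)
```

Because whole groups are assigned at once, the test set can end up somewhat larger than the requested fraction; that trade-off is usually acceptable when the goal is to avoid leakage between near-identical compounds.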

Physicochemical properties

Using Molecular Weight as our example physicochemical property, we observe that UMAP groups the heaviest compounds all together as part of one cluster, which contains many large macrocyclic compounds that are all impermeable.

Visualizing the Molecular Weight (MW) of the permeable and impermeable compounds, we see a similar phenomenon: the majority of the heavy compounds (MW > 500 Da) are impermeable. This aligns with the traditional assumption that compounds heavier than 500 Da will struggle to permeate the blood-brain barrier.
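The same comparison can be tabulated directly. The sketch below assumes each compound is a record with a molecular weight and a permeability label; the field names and the six toy records are invented purely for illustration and are not taken from the BBBP dataset.

```python
def impermeable_fraction_by_weight(records, cutoff=500.0):
    """Return the impermeable fraction above and at-or-below the MW cutoff."""
    heavy = [r for r in records if r["mw"] > cutoff]
    light = [r for r in records if r["mw"] <= cutoff]

    def fraction(rows):
        if not rows:
            return None  # no compounds on this side of the cutoff
        return sum(not r["permeable"] for r in rows) / len(rows)

    return {"heavy": fraction(heavy), "light": fraction(light)}


# Toy records (made-up values, for illustration only).
toy = [
    {"mw": 180.2, "permeable": True},
    {"mw": 310.4, "permeable": True},
    {"mw": 455.0, "permeable": False},
    {"mw": 620.8, "permeable": False},
    {"mw": 734.9, "permeable": False},
    {"mw": 812.1, "permeable": True},
]
summary = impermeable_fraction_by_weight(toy)
```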

The UMAP plot is useful for quickly identifying where the compounds of a given molecular weight lie, how diverse molecular weight is within a given cluster, and how these properties interact with blood-brain barrier permeability.

Dataset-Agnostic Embeddings

To create a dataset-agnostic embedding, we take advantage of the way that UMAP treats fitting and transforming a dataset as two separate steps. We start by fitting a UMAP model on a large in-house corpus of drug-like compounds that we consider to be representative of ‘drug-like chemical space’. Then, when we want to examine a new dataset, we load this cached model and use it to transform the compounds of our dataset into this universal embedding space. With dataset-specific embeddings, the dimensionality reduction itself depends on the relationships between all the compounds in the dataset we seek to visualize. With dataset-agnostic embeddings, the dimensionality reduction is treated as fixed, so the location of a given compound does not depend on the other compounds in the dataset.

This allows us to examine multiple datasets with a consistent frame of reference. We can see whether the data is diverse and covers a wide area of this drug-like space or whether it is confined to a few specific areas.

Here we visualize both the original embedding of our global chemical space compounds used to fit the general UMAP model, and a Dataset-Agnostic embedding of the BBBP dataset created with this fixed model. The compounds of each are colored by their cluster assignment in this embedded space. To get these cluster assignments we have pre-fit a clustering model on the original global chemical space compounds in their UMAP embedding. We then use this pre-trained clustering model on new datasets to quickly determine which of the global UMAP clusters our new compounds fit into. We can even calculate the percentage of global clusters covered by our new dataset to create a quick, quantitative heuristic of chemical space covered by the dataset.
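As a rough sketch of this coverage heuristic, the example below uses scikit-learn's `KMeans` as the clustering model; that choice, the number of clusters, and the synthetic 2-D points standing in for UMAP coordinates are all assumptions made for illustration. The `cluster_coverage` helper is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy 2-D "reference embedding": four well-separated blobs of 50 points each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
reference_embedding = np.vstack(
    [c + rng.normal(scale=0.5, size=(50, 2)) for c in centers]
)

# Toy "new dataset" embedding that occupies only two of the blobs.
new_embedding = np.vstack([
    centers[0] + rng.normal(scale=0.5, size=(20, 2)),
    centers[3] + rng.normal(scale=0.5, size=(20, 2)),
])

# Fit the clustering once on the reference embedding, then reuse it.
clusterer = KMeans(n_clusters=4, n_init=10, random_state=0).fit(reference_embedding)
new_labels = clusterer.predict(new_embedding)


def cluster_coverage(labels, n_clusters):
    """Fraction of reference clusters containing at least one new compound."""
    return len(set(labels.tolist())) / n_clusters


coverage = cluster_coverage(new_labels, clusterer.n_clusters)
```

In this toy setup the new points fall into two of the four reference clusters, so the coverage is 0.5.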

As we saw in our case study, distances in UMAP space aren’t necessarily meaningful and global structure can be greatly influenced by individual compounds. Thus, clusters based on these embedded distances should be interpreted with caution. However, they can help with the visual examination of the dataset and can provide a quick, quantitative approximation of chemical space coverage. When combining these cluster heuristics with the visuals of the global embedding, we can start to understand the diversity of a given chemical dataset.

Final Thoughts

UMAP is useful because it makes it quick and easy to create both dataset-specific and dataset-agnostic embeddings. This allows us to treat this analysis as a standard piece of our internal pipeline for cleaning, examining, and preparing new datasets for machine learning. Overall, UMAP seems to be a great alternative to the more popular methods for dataset embedding, and it would be exciting to see more examples of groups using it for chemical data.

This post highlighted UMAP’s value for exploring the chemical space of datasets relevant to drug design. The t-SNE vs. UMAP debate is still hotly contested, and we don’t aim to use this example to prove that UMAP is fundamentally better than t-SNE (although many have argued this, some have refuted it, and others have even refuted the refutation!). Similarly, even though PCA is less useful for our particular goals, it has many advantages over its nonlinear alternatives that make it very useful for other purposes. Ultimately, there are many great dimensionality reduction tools to choose from, but we hope that this post has helped to put UMAP on your map.