ChemFIE-BED (ChemSELFIES Embedding)

gbyuvd
Sentence Similarity

ChemFIE-BED is a sentence transformer based on gbyuvd/chemselfies-base-bertmlm, fine-tuned on (for now) about 2 million pairs of valid molecules' SELFIES (Krenn et al. 2020) taken from COCONUTDB (Sorokina et al. 2021) and ChEMBL34 (Zdrazil et al. 2023). It maps compounds' Self-Referencing Embedded Strings (SELFIES) into a 320-dimensional dense vector space and can potentially be used for chemical similarity, similarity search, classification, clustering, and more. Although more data is available for training the model, the test metrics on unseen data combining natural products and bioactive compounds are already adequate for now.

How to use

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:
pip install -U sentence-transformers

Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer

# Specify preferred dimensions
# 320, 160, or 80
dimensions = 320

# Download the model from the 🤗 Hub
model = SentenceTransformer("gbyuvd/chembed-chemselfies-bed", truncate_dim=dimensions)

# Run inference
sentences = [
  '[C]  [C]  [=C]  [C]  [=C]  [Branch2]  [Ring2]  [S]  [C]  [C]  [N]  [C]  [=Branch1]  [C]  [=O]  [C]  [=C]  [N]  [Branch1]  [C]  [C]  [C]  [=C]  [C]  [=C]  [Branch2]  [Ring1]  [Ring1]  [S]  [=Branch1]  [C]  [=O]  [=Branch1]  [C]  [=O]  [N]  [C]  [C]  [C]  [Branch1]  [C]  [C]  [C]  [C]  [Ring1]  [#Branch1]  [C]  [=C]  [Ring1]  [S]  [C]  [Ring2]  [Ring1]  [Branch1]  [=O]  [C]  [=C]  [Ring2]  [Ring1]  [P]',
  '[O]  [=C]  [Branch1]  [C]  [O]  [C]  [C]  [C]  [C]  [=C]  [C]  [=C]  [C]  [=C]  [C]  [=C]  [C]  [=C]',
  '[C]  [N]  [C]  [=N]  [C]  [Branch2]  [Branch1]  [C]  [S]  [=Branch1]  [C]  [=O]  [=Branch1]  [C]  [=O]  [N]  [Branch1]  [#Branch2]  [C]  [C]  [C]  [C]  [N]  [C]  [C]  [Ring1]  [=Branch1]  [C]  [C]  [C]  [=C]  [C]  [Branch1]  [Ring1]  [C]  [#N]  [=C]  [C]  [=C]  [Ring1]  [Branch2]  [N]  [Branch1]  [#Branch2]  [C]  [C]  [=C]  [N]  [=C]  [N]  [Ring1]  [Branch1]  [C]  [C]  [Ring2]  [Ring1]  [Ring1]  [=C]  [Ring2]  [Ring2]  [Ring1]',
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 320]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
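
The example inputs above are SELFIES strings whose bracketed symbols are separated by whitespace. As a minimal sketch, assuming your own SELFIES strings are written without separators and that the model expects the spaced form shown above, a small helper can insert the separators. The name `space_selfies` and the `sep` parameter are hypothetical and not part of the model's API, and the exact separator width should be checked against the model's tokenizer.

```python
import re

# Matches one bracketed SELFIES symbol such as [C], [=Branch1] or [#N].
_SELFIES_SYMBOL = re.compile(r"\[[^\[\]]+\]")


def space_selfies(selfies: str, sep: str = " ") -> str:
    """Return the SELFIES string with its symbols joined by `sep`.

    Hypothetical helper for illustration; `sep` is an assumption.
    Dot-separated fragments are rejected here for simplicity.
    """
    symbols = _SELFIES_SYMBOL.findall(selfies)
    # Reject input with characters outside brackets instead of dropping them.
    if "".join(symbols) != re.sub(r"\s+", "", selfies):
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return sep.join(symbols)


print(space_selfies("[C][C][=O]"))
# [C] [C] [=O]
```

The returned strings can then be passed to `model.encode` in the same way as the `sentences` list above.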

Features

Sentence transformer
Base model: gbyuvd/chemselfies-base-bertmlm
Maximum sequence length: 512 tokens
Output dimensionality: 320 dimensions
Similarity function: Cosine similarity
Pooling: Mean pooling
Training dataset: SELFIES pairs generated from COCONUTDB and ChEMBL34
Language: SELFIES
License: CC-BY-NC-SA 4.0
Model architecture: SentenceTransformer with a transformer and mean pooling
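
To make the pooling and similarity entries above concrete, here is a toy sketch of mean pooling over token vectors followed by cosine similarity, written in plain Python. It illustrates the general technique only and is not the library's implementation; the function names and the 2-dimensional toy vectors are assumptions made for the example.

```python
import math
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding has mask 0)."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional token vectors; the real model outputs 320 dimensions.
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]  # the last position is padding and is ignored
pooled = mean_pool(tokens, mask)
print(pooled)
# [2.0, 1.0]
print(cosine_similarity(pooled, [4.0, 2.0]))  # parallel vectors, so close to 1
```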

Use cases

Chemical similarity search
Classification of chemical compounds
Clustering of molecules
Docking-based virtual screening
Compound validation using similarity metrics
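
As a rough illustration of the similarity-search use case, the sketch below ranks a small library of precomputed embeddings against a query by cosine similarity, optionally using only the first `dims` components. It assumes the leading components stay meaningful after truncation, as the `truncate_dim` option in the usage example suggests. The vectors, the labels `mol_a` to `mol_c`, and the helper names are made up for the example; in practice the vectors would come from `model.encode`.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(
    query: Sequence[float],
    library: Dict[str, Sequence[float]],
    k: int = 2,
    dims: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Return the k library entries most similar to the query.

    If `dims` is given, only the first `dims` components are compared.
    """
    q = normalize(query[:dims])
    scored = []
    for name, vec in library.items():
        v = normalize(vec[:dims])
        scored.append((name, sum(a * b for a, b in zip(q, v))))
    # Highest cosine similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 4-dimensional vectors with made-up values and hypothetical labels.
library = {
    "mol_a": [1.0, 0.0, 0.0, 0.0],
    "mol_b": [0.9, 0.1, 0.0, 0.0],
    "mol_c": [0.0, 1.0, 0.0, 0.0],
}
query = [1.0, 0.05, 0.0, 0.0]

for name, score in top_k(query, library, k=2):
    print(name, round(score, 3))
```

For larger libraries a vectorized or approximate nearest-neighbor index would be the usual choice; the loop here is kept explicit for readability.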