ChemFIE-BED (ChemSELFIES Embedding)
gbyuvd
Sentence Similarity
ChemFIE-BED is a sentence transformer based on gbyuvd/chemselfies-base-bertmlm, fine-tuned on (for now) about 2 million pairs of SELFIES (Krenn et al. 2020) of valid molecules taken from COCONUTDB (Sorokina et al. 2021) and ChemBL34 (Zdrazil et al. 2023). It maps a compound's Self-Referencing Embedded Strings (SELFIES) into a 320-dimensional dense vector space and can potentially be used for chemical similarity, similarity search, classification, clustering, and more. Although more data is available for training, the test metrics on unseen data combining natural products and bioactive compounds are sufficient for now.
How to use
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Specify preferred dimensions
# 320, 160, or 80
dimensions = 320
# Download the model from the 🤗 Hub
model = SentenceTransformer("gbyuvd/chembed-chemselfies-bed", truncate_dim=dimensions)
# Run inference
sentences = [
'[C] [C] [=C] [C] [=C] [Branch2] [Ring2] [S] [C] [C] [N] [C] [=Branch1] [C] [=O] [C] [=C] [N] [Branch1] [C] [C] [C] [=C] [C] [=C] [Branch2] [Ring1] [Ring1] [S] [=Branch1] [C] [=O] [=Branch1] [C] [=O] [N] [C] [C] [C] [Branch1] [C] [C] [C] [C] [Ring1] [#Branch1] [C] [=C] [Ring1] [S] [C] [Ring2] [Ring1] [Branch1] [=O] [C] [=C] [Ring2] [Ring1] [P]',
'[O] [=C] [Branch1] [C] [O] [C] [C] [C] [C] [=C] [C] [=C] [C] [=C] [C] [=C] [C] [=C]',
'[C] [N] [C] [=N] [C] [Branch2] [Branch1] [C] [S] [=Branch1] [C] [=O] [=Branch1] [C] [=O] [N] [Branch1] [#Branch2] [C] [C] [C] [C] [N] [C] [C] [Ring1] [=Branch1] [C] [C] [C] [=C] [C] [Branch1] [Ring1] [C] [#N] [=C] [C] [=C] [Ring1] [Branch2] [N] [Branch1] [#Branch2] [C] [C] [=C] [N] [=C] [N] [Ring1] [Branch1] [C] [C] [Ring2] [Ring1] [Ring1] [=C] [Ring2] [Ring2] [Ring1]',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 320]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
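The input strings in the example above are SELFIES with a single space between symbols. If your SELFIES strings come without spaces, a small helper along these lines could reformat them before encoding. This is a minimal sketch: the function name `space_selfies` is hypothetical, the spacing requirement is inferred only from the example inputs above, and dot-separated fragments are deliberately rejected rather than handled.

```python
import re

# Matches one bracketed SELFIES symbol such as [C], [=O], or [Branch1].
_SELFIES_SYMBOL = re.compile(r"\[[^\[\]]+\]")


def space_selfies(selfies: str) -> str:
    """Return the SELFIES string with exactly one space between symbols."""
    symbols = _SELFIES_SYMBOL.findall(selfies)
    # Reject input containing anything other than bracketed symbols and spaces.
    if not symbols or "".join(symbols) != selfies.replace(" ", ""):
        raise ValueError(f"not a plain bracketed SELFIES string: {selfies!r}")
    return " ".join(symbols)


print(space_selfies("[C][C][=O]"))  # [C] [C] [=O]
```

Strings that are already spaced pass through unchanged, so the helper can be applied to a mixed collection before calling `model.encode`.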
Features
- Sentence transformer
- Base model: gbyuvd/chemselfies-base-bertmlm
- Maximum sequence length: 512 tokens
- Output dimensionality: 320 dimensions
- Similarity function: Cosine similarity
- Pooling: Mean pooling
- Training dataset: SELFIES pairs generated from COCONUTDB and ChemBL34
- Language: SELFIES
- License: CC-BY-NC-SA 4.0
- Model architecture: SentenceTransformer with a transformer and mean pooling
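To make the pooling and similarity entries above concrete, here is a minimal NumPy sketch of mean pooling over token embeddings followed by cosine similarity. It illustrates the general technique, not this model's internal code; the function names and the toy arrays are assumptions made for the example.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Toy batch: 2 sequences, 3 token positions, 4 dimensions.
# The last position of the second sequence is padding and must be ignored.
tokens = np.array(
    [
        [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
        [[2, 0, 0, 0], [0, 2, 0, 0], [9, 9, 9, 9]],
    ],
    dtype=np.float64,
)
mask = np.array([[1, 1, 1], [1, 1, 0]])

pooled = mean_pool(tokens, mask)
similarities = cosine_similarity(pooled, pooled)
print(pooled.shape)        # (2, 4)
print(similarities.shape)  # (2, 2)
```

Because the padded position is masked out, the second pooled vector is the average of its two real tokens only.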
Use cases
- Chemical similarity search
- Classification of chemical compounds
- Clustering of molecules
- Docking-based virtual screening
- Compound validation using similarity metrics
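As an illustration of the similarity search use case, the sketch below ranks a library of embeddings against a query by cosine similarity. It assumes the embeddings have already been computed (for example with `model.encode`); random vectors stand in for them here, and the function name `top_k_similar` is hypothetical.

```python
import numpy as np


def top_k_similar(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (index, cosine score) pairs for the k corpus rows closest to query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Placeholder data: random 320-dimensional vectors instead of real embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(5, 320))
# A query that is a slightly perturbed copy of corpus row 2.
query = corpus[2] + 0.01 * rng.normal(size=320)

hits = top_k_similar(query, corpus, k=2)
print(hits[0][0])  # 2
```

For large compound libraries a brute-force scan like this may become slow, in which case an approximate nearest-neighbour index would be the usual replacement.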