moco-sentencedistilbertV2.1
bongsoo
Sentence Similarity
This is a sentence-transformers model: it maps sentences and paragraphs to a dense 768-dimensional vector space and can be used for tasks such as clustering or semantic search.
How to use
Usage (Sentence-Transformers)
Install sentence-transformers:
pip install -U sentence-transformers
Use the model:
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import paired_cosine_distances

sentences = ["서울은 한국이 수도이다", "The capital of Korea is Seoul"]

# Load the model and encode both sentences into 768-dimensional vectors
model = SentenceTransformer('bongsoo/moco-sentencedistilbertV2.1')
embeddings = model.encode(sentences)
print(embeddings)

# Cosine similarity = 1 - cosine distance
cosine_scores = 1 - paired_cosine_distances(embeddings[0].reshape(1, -1), embeddings[1].reshape(1, -1))
print(f'*cosine_score:{cosine_scores[0]}')
```
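The `paired_cosine_distances` call returns the cosine *distance*, so subtracting it from 1 yields the similarity. As a minimal sketch of the underlying formula, assuming toy 3-dimensional vectors in place of the real 768-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the model's sentence embeddings
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, b))  # identical vectors -> 1.0
```

Scores near 1 indicate near-identical meaning; orthogonal embeddings score 0.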
Usage (HuggingFace Transformers)
Install transformers:
pip install transformers[torch]
Use the model:
```python
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import paired_cosine_distances
import torch

# Mean pooling: average token embeddings, using the attention mask to ignore padding
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["서울은 한국이 수도이다", "The capital of Korea is Seoul"]

tokenizer = AutoTokenizer.from_pretrained('bongsoo/moco-sentencedistilbertV2.1')
model = AutoModel.from_pretrained('bongsoo/moco-sentencedistilbertV2.1')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)

# Cosine similarity = 1 - cosine distance
cosine_scores = 1 - paired_cosine_distances(sentence_embeddings[0].reshape(1, -1), sentence_embeddings[1].reshape(1, -1))
print(f'*cosine_score:{cosine_scores[0]}')
```
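To see what `mean_pooling` does in isolation, here is a self-contained sketch on a toy batch (1 sentence, 3 tokens, 2-dimensional embeddings instead of the model's 768), where the third token is padding and should not contribute to the average:

```python
import torch

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Toy token embeddings; the last token (mask = 0) is padding
token_embs = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = mean_pooling((token_embs,), mask)
print(pooled)  # averages only the unmasked tokens -> tensor([[2., 3.]])
```

The padding values never leak into the sentence embedding, which is why the attention mask must be passed through alongside the model output.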
Results
```
# Embedded results
[[ 0.27124503 -0.5836643   0.00736023 ... -0.0038319   0.01802095 -0.09652182]
 [ 0.2765149  -0.5754248   0.00788184 ...  0.07659392 -0.07825544 -0.06120609]]
*cosine_score:0.9513546228408813

# Sentence embeddings results
Sentence embeddings:
tensor([ 0.2712, -0.5837,  0.0074,  ..., -0.0038,  0.0180, -0.0965])
*cosine_score:0.9513546228408813
```
Features
- Transformers
- distilbert
- feature extraction
- ko
- en
- text embeddings
- inference
Use cases
- Semantic search
- Sentence and paragraph clustering
- Semantic similarity analysis
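The semantic-search use case amounts to encoding a corpus once, then ranking entries by cosine similarity against a query embedding. A minimal sketch, using toy 2-dimensional vectors in place of real `model.encode` outputs:

```python
import numpy as np

def search(query_emb, corpus_embs, top_k=2):
    # Normalize so that the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q
    # Indices of the top_k highest-scoring corpus entries, best first
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy embeddings standing in for model.encode(corpus) / model.encode(query)
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.9, 0.1])
print(search(query, corpus))  # corpus entry 0 ranks first
```

With the real model, `corpus` and `query` would come from `model.encode(...)`; sentence-transformers also ships a ready-made `util.semantic_search` helper for this pattern.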