How to create a keyword embedding model?
I am currently working as a data scientist at a tech company, and my task is to improve a text-based search engine. My goal is to enhance the relevancy of the search results by better understanding the context of users' queries. My team has decided to use keyword embedding technology to capture the semantic meaning of words and phrases. Describe the steps for creating a keyword embedding model that can handle user queries in multiple languages.
In the context of data science, here are the detailed steps for building and training a keyword embedding model for your scenario:
Data collection and preprocessing
First, gather a diverse corpus of text in multiple languages relevant to the search queries. Ensure that it covers a variety of domains and is large enough to train the embeddings effectively.
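For illustration, here is a minimal sketch of assembling such a corpus, assuming (hypothetically) that your raw text is stored in one folder per language, e.g. corpus/en/, corpus/fr/:

import pathlib

# Hypothetical layout: corpus/en/*.txt, corpus/fr/*.txt, etc.
corpus = {}
for lang_dir in pathlib.Path("corpus").iterdir():
    if lang_dir.is_dir():
        corpus[lang_dir.name] = [
            p.read_text(encoding="utf-8") for p in lang_dir.glob("*.txt")
        ]

# Quick sanity check on corpus size per language
print({lang: len(texts) for lang, texts in corpus.items()})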
Text cleaning
Perform text preprocessing such as tokenization, normalization, and removal of stop words and special characters for each language.
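As a rough sketch, a cleaning function might look like the following; the stop-word lists here are tiny placeholders, and in practice you would use a proper per-language resource:

import re

# Placeholder stop-word lists; substitute real per-language lists in practice
STOP_WORDS = {"en": {"the", "a", "is"}, "fr": {"le", "la", "est"}}

def clean_text(text, lang="en"):
    text = text.lower()                    # case normalization
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation/special characters
    tokens = text.split()                  # naive whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS.get(lang, set())]

print(clean_text("Bonjour! Comment ça va?", lang="fr"))
# ['bonjour', 'comment', 'ça', 'va']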
Handling multilingual data
Use language detection tools to separate the corpus by language. Libraries such as langdetect can be used for this:
from langdetect import detect
from transformers import BertTokenizer

# Example of language detection, text normalization, and tokenization
text = "Bonjour! Comment ça va?"
print(detect(text))  # 'fr'

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
tokens = tokenizer.tokenize(text.lower())
print(tokens)
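Building on that, one way to split a corpus by detected language might look like this sketch (the documents list is toy data):

from collections import defaultdict
from langdetect.lang_detect_exception import LangDetectException

documents = ["Bonjour tout le monde", "Hello world", "Hola mundo"]  # toy data

by_language = defaultdict(list)
for doc in documents:
    try:
        by_language[detect(doc)].append(doc)
    except LangDetectException:
        by_language["unknown"].append(doc)  # too short or ambiguous to detect

print(dict(by_language))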
Choosing an embedding model
Now you can use a pre-trained multilingual model such as mBERT or XLM-RoBERTa. These models can handle multiple languages and have been pre-trained on vast corpora of multilingual text.
Embedding layer
Use the embedding layer from the pre-trained model, which provides dense vector representations of words in multiple languages. You should also fine-tune the model on your specific task to adapt the embeddings to the search context.
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
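To turn the model's token-level outputs into one vector per text, one common choice (an assumption here; CLS pooling is another option) is masked mean pooling. A minimal inference helper, reusing the tokenizer and model loaded above:

import torch

def embed_texts(texts):
    # Tokenize a batch with padding so variable-length texts align
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():  # inference only
        hidden = model(**batch).last_hidden_state   # (batch, seq_len, dim)
    # Masked mean pooling: average real tokens, ignore padding positions
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed_texts(["reset my password", "réinitialiser mon mot de passe"])
print(vectors.shape)  # torch.Size([2, 768]) for xlm-roberta-base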
Training the model
Create pairs of queries and relevant documents or keywords, and ensure the data is annotated with language information for better coverage.
Use a contrastive loss or a triplet loss to train the embeddings. Implement a training loop with mini-batches and backpropagation to update the model's weights:
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

# Define a simple dataset and data loader
class QueryDataset(torch.utils.data.Dataset):
    def __init__(self, queries, documents):
        self.queries = queries
        self.documents = documents

    def __len__(self):
        return len(self.queries)

    def __getitem__(self, idx):
        return self.queries[idx], self.documents[idx]

def encode(texts):
    # Masked mean pooling over token embeddings -> one vector per text
    # (gradients must flow here, so no torch.no_grad())
    batch = tokenizer(list(texts), return_tensors='pt', padding=True, truncation=True)
    hidden = model(**batch).last_hidden_state
    mask = batch['attention_mask'].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Example training loop; queries and documents are your labeled pairs
dataset = QueryDataset(queries, documents)
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)
optimizer = optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.TripletMarginLoss()
num_epochs = 3

for epoch in range(num_epochs):
    for query, doc in data_loader:
        optimizer.zero_grad()
        query_embeddings = encode(query)
        doc_embeddings = encode(doc)
        # In-batch negatives: pair each query with another query's document
        negative_embeddings = doc_embeddings.roll(1, dims=0)
        loss = criterion(query_embeddings, doc_embeddings, negative_embeddings)
        loss.backward()
        optimizer.step()
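Once training finishes, you will likely want to persist the fine-tuned encoder so the serving layer can reload it; a brief sketch using the standard save_pretrained/from_pretrained round trip (the output path is hypothetical):

output_dir = "models/search-encoder-v1"  # hypothetical path
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Later, e.g. in the serving process:
# model = AutoModel.from_pretrained(output_dir)
# tokenizer = AutoTokenizer.from_pretrained(output_dir)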
Cross-lingual performance
Evaluate the fine-tuned model to ensure that it maintains performance across languages, for example with macro-averaged precision, recall, and F1:
from sklearn.metrics import precision_score, recall_score, f1_score

# Example evaluation metrics calculation
# (y_true / y_pred are gold and predicted relevant-document labels per query)
precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='macro')
f1 = f1_score(y_true, y_pred, average='macro')
print(f"Precision: {precision}, Recall: {recall}, F1 Score: {f1}")
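The snippet above assumes y_true and y_pred already exist. For a retrieval task, one way to produce them is to treat "which document ranks first for each query" as the prediction, reusing the embed_texts helper from earlier (the labeled examples here are toy data):

import torch

# Toy labeled set: each query's gold relevant-document index
queries = ["reset password", "réinitialiser mot de passe"]
docs = ["Password reset guide", "Guide de réinitialisation", "Billing FAQ"]
y_true = [0, 1]

q_vecs = torch.nn.functional.normalize(embed_texts(queries), dim=1)
d_vecs = torch.nn.functional.normalize(embed_texts(docs), dim=1)

# Predicted label = index of the most similar document (cosine similarity)
y_pred = (q_vecs @ d_vecs.T).argmax(dim=1).tolist()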
Scalability
Ensure that the deployment can scale to handle a high volume of search queries, for example by exposing the model behind a lightweight API:
import torch
from fastapi import FastAPI

app = FastAPI()

@app.post("/embed/")
async def embed_query(query: str):
    tokens = tokenizer(query, return_tensors='pt')
    with torch.no_grad():  # inference only
        hidden = model(**tokens).last_hidden_state
    # Mean-pool to a single fixed-size vector for the query
    embedding = hidden.mean(dim=1).squeeze(0)
    return {"embedding": embedding.tolist()}
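A quick client-side usage example: since FastAPI binds a bare str parameter as a query parameter, the query text goes in the URL rather than the request body (the URL and port are assumptions about your deployment):

import requests

resp = requests.post(
    "http://localhost:8000/embed/",          # assumed local deployment
    params={"query": "how do I reset my password?"},
)
print(len(resp.json()["embedding"]))  # embedding dimensionality, e.g. 768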