Please explain and share a clear picture of Noise Contrastive Estimation (NCE) loss

I have been reading about NCE from several sources, mostly TensorFlow’s writeup and the original paper, but I find it very hard to understand, so I need some help.

1. What is NCE loss and how does it work?

2. What is the difference between NCE and negative sampling? I have used word2vec, and I want to understand NCE’s formula, because it is not clear to me how it differs from negative sampling.

3. When should I use NCE and when should I use negative sampling?
4. Which one is better, NCE or negative sampling?

Please share a clear response.

Answered by avocetyeast

Let’s start with the meaning of NCE and NCE loss. NCE stands for “Noise Contrastive Estimation” and is a general method for approximating a probability distribution, typically used when dealing with large datasets or large output spaces such as vocabularies. NCE loss is the specific loss function used in NCE.


Noise Contrastive Estimation: instead of predicting the actual probability of a word from a huge vocabulary, NCE turns a large classification task, such as predicting the next word in a sentence, into a binary classification task that distinguishes between "real" data (pairs of words that occur together in the correct context) and "noise" or "fake" data (random word pairs).

Here’s how it works:

Real Data: a pair of words that appear together in context (e.g., "mobile" and "network").
Noise Data: a pair of words that do not belong together (e.g., "mobile" and a random word like "fruit").
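
Below is a minimal sketch of how such pairs could be generated. The toy corpus, window size, and the naive random.choice negative sampler are made-up illustrative choices, not Word2Vec's actual pipeline:

```python
import random

# Hypothetical toy corpus; a real setup would use a large text corpus.
corpus = "she checked the mobile network signal then ate a fruit".split()
vocab = sorted(set(corpus))
window = 2   # how many words on each side count as "context"
k = 2        # noise pairs generated per real pair

def make_pairs(corpus, window, k):
    """Yield (center, other, label) triples: label 1 = real pair, 0 = noise pair."""
    for i, center in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            yield center, corpus[j], 1                 # real pair: words seen together
            for _ in range(k):
                yield center, random.choice(vocab), 0  # noise pair: center + random word

for center, other, label in list(make_pairs(corpus, window, k))[:9]:
    print(center, other, label)
```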

The Issue

There are some issues with learning word vectors using a "standard" neural network. In this approach, the word vectors are learned while the network learns to predict the next word given a window of words (the network's input).

Predicting the next word is analogous to predicting a class in a classification problem. That is, this network is simply a "standard" multinomial (multi-class) classifier. And the number of output neurons in this network must equal the number of classes. When classes are actual words, the number of neurons increases dramatically.

A "standard" neural network is typically trained with a cross-entropy cost function, which requires the output neurons' values to represent probabilities. This means that the output "scores" computed by the network for each class must be normalized and converted into actual probabilities for each class. This normalization step is carried out using the softmax function. Softmax is extremely expensive when applied to a large output layer.

The Solution

To deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced by [A] (reformulated by [B]) and then used in [C], [D], [E] to learn word embeddings from unlabelled natural language text.

The basic idea is to convert a multinomial classification problem (such as predicting the next word) into a binary classification problem. That is, instead of using softmax to estimate the output word's true probability distribution, binary logistic regression (binary classification) is used.

For each training sample, the optimized classifier receives a true pair (a center word and another word that appears in its context) as well as k randomly corrupted pairs (the center word paired with a randomly selected word from the vocabulary). By learning to distinguish the true pairs from the corrupted ones, the classifier ultimately learns the word vectors.
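
In symbols (the notation here is mine, not taken from any of the references): writing $\sigma(x) = 1/(1+e^{-x})$, $v_c$ for the center-word vector, $u_o$ for the true context-word vector, and $u_{w_1}, \dots, u_{w_k}$ for the $k$ corrupted (noise) words, the per-example negative-sampling objective minimized by Word2Vec is

$$
L = -\log \sigma\!\left(u_o^{\top} v_c\right) \;-\; \sum_{i=1}^{k} \log \sigma\!\left(-u_{w_i}^{\top} v_c\right),
$$

which only requires $k + 1$ dot products per example instead of a normalization over the whole vocabulary. Roughly speaking, full NCE uses the same binary setup but scores each pair with $\sigma\!\left(u_w^{\top} v_c - \log\big(k\,P_n(w)\big)\right)$, where $P_n$ is the noise distribution; negative sampling drops that $\log\big(k\,P_n(w)\big)$ correction term, which is the "slight customization" mentioned below.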

Note: Instead of predicting the next word (the "standard" training method), the optimized classifier predicts whether a pair of words is good or bad.
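
Here is a minimal NumPy sketch of one such training step. The embedding matrices, learning rate, uniform negative sampling, and vector sizes are all made-up illustrative choices, not Word2Vec's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k, lr = 10_000, 50, 5, 0.05           # vocab size, embedding dim, noise pairs, learning rate
W_in = rng.normal(scale=0.1, size=(V, d))   # "input" (center word) vectors
W_out = rng.normal(scale=0.1, size=(V, d))  # "output" (context word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context):
    """One binary-classification step: 1 true pair + k corrupted pairs."""
    noise = rng.integers(0, V, size=k)               # randomly selected "bad" words
    words = np.concatenate(([context], noise))       # the k + 1 output-side words
    labels = np.concatenate(([1.0], np.zeros(k)))    # 1 = real pair, 0 = noise pair
    v_c = W_in[center]
    scores = W_out[words] @ v_c                      # k + 1 dot products, not |V|
    preds = sigmoid(scores)
    grad = preds - labels                            # gradient of binary cross-entropy
    W_in[center] -= lr * (grad @ W_out[words])       # update the center-word vector
    W_out[words] -= lr * np.outer(grad, v_c)         # update the k + 1 output vectors
    return -np.log(preds[0]) - np.log(1.0 - preds[1:]).sum()

print(train_pair(center=42, context=7))              # prints the per-example loss
```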

After a slight customization, Word2Vec calls this process "negative sampling". In Word2Vec, the words for the negative samples (used for the corrupted pairs) are drawn from a specially designed distribution that draws less frequent words more often than their raw frequency would suggest.
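
A small sketch of that noise distribution: the Word2Vec paper raises the unigram counts to the 3/4 power, which boosts the chance of drawing rarer words relative to plain frequency (the counts and the tiny vocabulary below are made up):

```python
import numpy as np

# Hypothetical unigram counts for a tiny four-word vocabulary.
counts = np.array([1000.0, 100.0, 10.0, 1.0])    # e.g. "the", "mobile", "network", "fruit"

unigram = counts / counts.sum()                  # plain relative frequency
noise = counts ** 0.75                           # Word2Vec's 3/4-power customization
noise /= noise.sum()

# Rare words get a larger share under the 3/4-power distribution than under raw frequency.
for u, n in zip(unigram, noise):
    print(f"unigram={u:.4f}  noise={n:.4f}")

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=noise)   # indices of negative-sample words
print(negatives)
```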

References

[A] Smith and Eisner (2005) - Contrastive estimation: Training log-linear models on unlabeled data
[B] Gutmann and Hyvärinen (2010) - Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
[C] Collobert and Weston (2008) - A unified architecture for natural language processing: Deep neural networks with multitask learning
[D] Mnih and Teh (2012) - A fast and simple algorithm for training neural probabilistic language models
[E] Mnih and Kavukcuoglu (2013) - Learning word embeddings efficiently with noise-contrastive estimation
