How to calculate the entropy of terms in a term-document matrix in R?

Asked by NiharikaDeshpande in Data Science on Nov 9, 2019

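To do this, we can define the following function in R, which uses the quanteda package. It takes a document-feature matrix x and returns one entropy-based weight per feature (column):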

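library("quanteda")

textstat_entropy <- function(x, base = exp(1), k = 1) {
  # x is a dfm (documents in rows, features in columns).
  # Divide each column by its total to get p_ij, the share of feature j's
  # occurrences that fall in document i. This works because of R's recycling
  # and column-major order, but requires t().
  p_ij <- t(t(x) / colSums(x))
  log_p_ij <- log(p_ij, base = base)
  # Zero counts give 0 * log(0) = NaN in R, so na.rm = TRUE drops them from the sum.
  k - colSums(p_ij * log_p_ij / log(ndoc(x), base = base), na.rm = TRUE)
}

textstat_entropy(data_dfm_lbgexample, base = 2)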

#        A        B        C        D        E        F        G        H        I        J        K
# 1.000000 1.000000 1.000000 1.000000 1.000000 1.045226 1.045825 1.117210 1.173655 1.277210 1.378934
#        L        M        N        O        P        Q        R        S        T        U        V
# 1.420161 1.428939 1.419813 1.423840 1.436201 1.440159 1.429964 1.417279 1.410566 1.401663 1.366412
#        W        X        Y        Z       ZA       ZB       ZC       ZD       ZE       ZF       ZG
# 1.302785 1.279927 1.277210 1.287621 1.280435 1.211205 1.143650 1.092113 1.045825 1.045226 1.000000
#       ZH       ZI       ZJ       ZK
# 1.000000 1.000000 1.000000 1.000000
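
Written out, the function appears to compute the following weight for each feature j. Here x_ij is the count of feature j in document i, N is ndoc(x), and b is base. The symbols are my own notation for illustration, not taken from either package:

```latex
p_{ij} = \frac{x_{ij}}{\sum_{i'=1}^{N} x_{i'j}}, \qquad
w_j = k - \sum_{i=1}^{N} \frac{p_{ij} \log_b p_{ij}}{\log_b N}
```

Terms with p_ij = 0 contribute nothing to the sum. A feature that occurs in only one document therefore gets w_j = k (the 1.000000 values in the output, with k = 1), while a feature spread evenly over all N documents gets k + 1.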

This matches the global entropy weight gw_entropy() in the lsa package. The check below uses base e, although the base cancels between the numerator and the denominator:

library("lsa")

# gw_entropy() expects terms in rows, so transpose the dfm first.
all.equal(
  gw_entropy(as.matrix(t(data_dfm_lbgexample))),
  textstat_entropy(data_dfm_lbgexample, base = exp(1))
)
# [1] TRUE
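
If you want to check the arithmetic without installing quanteda or lsa, here is a minimal base-R sketch of the same calculation on a made-up count matrix. The matrix, its dimnames, and the name entropy_weight are hypothetical and only for illustration; nrow() stands in for ndoc() on the assumption that documents are in rows:

```r
# Hypothetical toy counts: 3 documents (rows) x 3 terms (columns).
x <- matrix(
  c(4, 2, 1,
    0, 2, 1,
    0, 2, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("only_d1", "even", "skewed"))
)

# Same arithmetic as the function above, using nrow() instead of ndoc().
entropy_weight <- function(x, base = exp(1), k = 1) {
  p_ij <- t(t(x) / colSums(x))        # share of each term's count per document
  log_p_ij <- log(p_ij, base = base)  # log(0) is -Inf, so 0 * -Inf gives NaN
  # na.rm = TRUE drops the NaN entries produced by zero counts
  k - colSums(p_ij * log_p_ij / log(nrow(x), base = base), na.rm = TRUE)
}

w_e <- entropy_weight(x)            # natural log
w_2 <- entropy_weight(x, base = 2)  # log base 2

print(w_e)
print(isTRUE(all.equal(w_e, w_2)))  # checks that the base makes no difference
```

The term that appears in a single document (only_d1) should come out as 1, the evenly spread term (even) as 2, and skewed somewhere in between. The two bases should agree up to floating-point error.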
