Interpretable Topic Extraction and Word Embedding Learning Using Non-Negative Tensor DEDICOM

Unsupervised topic extraction is a vital step in automatically distilling concise thematic information from large text corpora. Existing topic extraction methods cannot model the relations between topics, which would further aid text understanding. We therefore propose utilizing the Decomposition into Directional Components (DEDICOM) algorithm, which provides a uniquely interpretable matrix factorization for symmetric and asymmetric square matrices and tensors. We constrain DEDICOM to row-stochasticity and non-negativity in order to factorize pointwise mutual information matrices and tensors of text corpora. We identify latent topic clusters and their relations within the vocabulary and simultaneously learn interpretable word embeddings. Further, we introduce multiple methods based on alternating gradient descent to efficiently train constrained DEDICOM algorithms. We evaluate the qualitative topic modeling and word embedding performance of our proposed methods on several datasets, including a novel New York Times news dataset, and demonstrate how the DEDICOM algorithm provides deeper text analysis than competing matrix factorization approaches.
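The constrained factorization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a squared Frobenius reconstruction loss for S ≈ A R Aᵀ and enforces the non-negativity and row-stochasticity constraints on A by projection (clipping, then row normalization) after each alternating gradient step; the function name `dedicom` and all hyperparameters are illustrative.

```python
import numpy as np

def dedicom(S, k, steps=2000, lr=1e-3, seed=0):
    """Sketch of row-stochastic non-negative DEDICOM: S ≈ A @ R @ A.T.

    S : (n, n) square matrix (e.g. a pointwise mutual information matrix)
    k : number of latent topics
    Returns A (n, k) with non-negative rows summing to 1, and R (k, k),
    the topic-affinity matrix encoding relations between topics.
    """
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    A = rng.random((n, k))
    A /= A.sum(axis=1, keepdims=True)      # row-stochastic initialization
    R = rng.random((k, k))

    for _ in range(steps):
        E = A @ R @ A.T - S                # reconstruction residual
        # Gradients of ||A R A^T - S||_F^2 w.r.t. A and R
        grad_A = 2 * (E @ A @ R.T + E.T @ A @ R)
        grad_R = 2 * (A.T @ E @ A)
        # Alternating projected gradient steps:
        A = np.clip(A - lr * grad_A, 1e-12, None)  # project onto non-negativity
        A /= A.sum(axis=1, keepdims=True)          # re-project onto row-stochasticity
        R -= lr * grad_R                           # R stays unconstrained
    return A, R
```

Each row of A can then be read as an interpretable word embedding (a distribution over topics), while R captures the directed relations between topics.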

  • Published in:
    Machine Learning and Knowledge Extraction
  • Type:
    Article
  • Authors:
    L. Hillebrand, D. Biesner, C. Bauckhage, R. Sifa
  • Year:
    2021

Citation information

L. Hillebrand, D. Biesner, C. Bauckhage, R. Sifa: Interpretable Topic Extraction and Word Embedding Learning Using Non-Negative Tensor DEDICOM, Machine Learning and Knowledge Extraction, 2021, 3, 19, 123-167, https://doi.org/10.3390/make3010007