Automatic categorization of online store items from product descriptions


When shopping online, users expect a range of products so wide that it can no longer be managed manually. Standard online shops assign each item to one or more categories to simplify product search. How can this process be automated?

One potential solution is text categorization. Product descriptions and their respective categories from a catalog can be used to train ML models that predict suitable categories from a short text. Today, such problems are typically solved with pre-trained contextual embedding vectors, particularly from transformer-based models like BERT (Bidirectional Encoder Representations from Transformers). These are deep neural networks trained on extensive text corpora that can model multiple languages, which enables the rapid training of a text classifier on product descriptions (a minimal sketch follows the list below). While the results from these approaches can be surprisingly good, certain characteristics make their practical implementation challenging:

  1. The most powerful current models generally require specialized hardware (usually GPUs or TPUs) to deliver results within a reasonable timeframe.
  2. Even if training time is disregarded, computational complexity during inference can cause additional latency in web applications, ultimately diminishing user satisfaction.
  3. Black-box models like BERT are challenging to verify once problems arise (e.g., “Why was this unexpected category assigned to this product?”).
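For illustration, here is a minimal sketch of how such a transformer-based classifier might be set up with the Hugging Face transformers library. The model name, toy data, and labels are assumptions for the example, not the exact setup of the study:

```python
# Minimal sketch: fine-tuning a pre-trained transformer for product
# categorization. The model name and the toy data are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

descriptions = ["Wooden building blocks for toddlers",
                "Stainless steel chef's knife, 20 cm"]
labels = [0, 1]  # e.g. 0 = toys, 1 = kitchen

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased", num_labels=2)

batch = tokenizer(descriptions, padding=True, truncation=True,
                  return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
outputs = model(**batch, labels=torch.tensor(labels))
outputs.loss.backward()  # one illustrative gradient step
optimizer.step()
```

Even this small sketch hints at the drawbacks listed above: without a GPU, both fine-tuning and inference of such a model can be slow.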

The problem of the trade-off triangle

A specific use case arises when a resource-efficient solution is needed and certain categories must not be confused (e.g., “sex toys” must not be mistakenly classified as “toys”), making fully black-box models unsuitable. At the same time, the highest possible classification quality is desired. This gives rise to a “trade-off triangle” of three aspects that cannot all be fully achieved simultaneously when selecting a classification model architecture: performance, explainability, and low resource requirements. The goal is therefore to quantify the performance gap between black-box BERT-based methods and lighter, more explainable models. For this purpose, DistilBERT is compared with a k-nearest neighbors (k-NN) classification pipeline. While DistilBERT generally performs slightly worse than full BERT models, it is significantly more compact. The k-NN classification pipeline is trained on a range of text representations with varying computational complexity and explainability, including topic models and neural language models.

This framework is explainable because the neighbors (the most similar product descriptions) can be shown directly as explanations for each categorized item.
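A minimal sketch of this explanation mechanism, assuming a sentence-embedding encoder and scikit-learn's k-NN classifier; the model name and the toy catalog are illustrative:

```python
# Hedged sketch: categories of the most similar known descriptions
# justify the prediction. Encoder model and data are example choices.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import KNeighborsClassifier

catalog = ["Wooden building blocks for toddlers",
           "Stainless steel chef's knife, 20 cm",
           "Plush teddy bear, 30 cm"]
categories = ["toys", "kitchen", "toys"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(catalog)

knn = KNeighborsClassifier(n_neighbors=2, metric="cosine")
knn.fit(X, categories)

query = encoder.encode(["Soft stuffed animal for kids"])
print(knn.predict(query))           # predicted category
dist, idx = knn.kneighbors(query)   # the neighbors double as explanation
print([catalog[i] for i in idx[0]]) # most similar known items
```

The returned neighbors are the explanation: the predicted category can be traced back to concrete catalog items a human can inspect.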

A resource-efficient solution with optimal classification

Although a performance gap is observed, this difference may hardly be worthwhile once the much larger CPU times measured for DistilBERT are taken into account. Furthermore, due to its lack of interpretability, the latest available neural network is not necessarily the best option, especially if the model needs to run in a production environment.

The explainable pipeline consists of a language model that generates vector representations of the text descriptions and a nearest neighbor classifier. This means that items whose descriptions are “similar” according to the respective language model adopt the categories of the most similar known items. The tested language models include:

  • Latent Dirichlet Allocation (LDA): The classic topic model, for which only nouns from the texts are used (see the sketch after this list).
  • Anchored CorEx: Also a topic model, which allows anchor words to influence the topic modeling. As with LDA, only nouns are used, and two distinct models are trained: one “unsupervised” and one with 10 anchor words per product category, specifically those with the highest mutual information.
  • FastText: A neural language model that can efficiently run on CPUs or “common” machines.
  • SentenceBERT (SBERT): A BERT-based model that specifically encodes semantic similarity.
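As an illustration of the lighter end of this spectrum, here is a sketch of how LDA topic distributions could serve as features for the k-NN classifier. It uses gensim; the noun lists and the topic count are toy assumptions:

```python
# Sketch: LDA topic distributions as lightweight k-NN features.
# Documents are reduced to nouns, as described in the post.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["blocks", "toddlers", "wood"],
        ["knife", "steel", "chef"],
        ["bear", "plush", "kids"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

def topic_vector(doc):
    """Dense topic distribution of a noun list, usable as k-NN features."""
    bow = dictionary.doc2bow(doc)
    vec = np.zeros(lda.num_topics)
    for topic_id, weight in lda.get_document_topics(bow,
                                                    minimum_probability=0.0):
        vec[topic_id] = weight
    return vec

print(topic_vector(["plush", "kids"]))
```

Such representations are cheap to compute on a CPU, and the topics themselves can be inspected, which is where the explainability of these simpler models comes from.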

The different models are evaluated on the data, resulting in the following metrics:

[Table: evaluation metrics of the compared models]

The results appear to confirm the stated “trade-off triangle”: the most complex model (DistilBERT) significantly outperforms the simpler interpretable models (LDA and CorEx), while the intermediate models achieve middling results in terms of either interpretability (SBERT is slightly more interpretable than DistilBERT due to its similarity-based nature) or computational complexity (FastText). However, the relevance of the performance gap between DistilBERT and SBERT on this dataset is questionable. Combining SBERT with a k-NN classifier could be a good compromise: its performance is close to the best model while retaining explainable predictions (an item receives a category because similar items belong to that category). FastText also seems to be a good option in setups where no GPU is available.

Conclusion

Since many online shops have datasets of similar complexity, or datasets where the superiority of the most complex transformer models is barely noticeable, the question arises whether the most powerful model available should be applied by default, disregarding the significant computational and implementation costs.

All technical details are available in the associated publication:

Brito, Eduardo, et al. “Assessing the Performance Gain on Retail Article Categorization at the Expense of Explainability and Resource Efficiency.” German Conference on Artificial Intelligence (Künstliche Intelligenz), 2022. Link

Eduardo Alfredo Brito Chacón

Eduardo Brito is a data scientist and project manager at Fraunhofer IAIS in the Natural Language Understanding team. He is currently researching explainable semantic text similarity functions for various information retrieval use cases, e.g. to categorize retail products or to identify relevant passages in legal documents. He is particularly interested in developing Informed Machine Learning models that are to some extent explainable and resource-aware, but at the same time remain […]
