Evaluation of Document Deduplication Algorithms for Large Text Corpora

The performance of large language models (LLMs) correlates with the diversity and quality of their training data. One aspect of data quality is the presence of duplicate documents in the training data. In this paper, we evaluate five algorithms for deduplicating large data sets, namely MinHash/LSH, Exact Hashes, SimHash, Scalable Bloom filter, and Suffix Array. We report their precision, recall, memory requirements, and runtime when deduplicating OpenSubtitles and OSCAR data for five languages (EN, DE, ES, FR, IT). We find that the best overall performance is achieved by MinHash/LSH, but other options such as the Scalable Bloom filter can still be attractive in resource-critical situations. While precision varies between 0.833 and 0.985 across algorithms, recall varies between 0.247 and 0.989, indicating different levels of aggressiveness. We conclude that MinHash/LSH is the most suitable algorithm for deduplicating pretraining data for LLMs.
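To illustrate the best-performing approach named in the abstract, the following is a minimal, self-contained sketch of MinHash/LSH deduplication. The shingle size, number of permutations, and band count are illustrative defaults, not the parameters used in the paper, and the salted-MD5 hash functions stand in for the random permutations of a production implementation.

```python
import hashlib

def shingles(text, k=5):
    """Represent a document as its set of character k-shingles."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(doc_shingles, num_perm=32):
    """MinHash signature: for each of num_perm salted hash functions,
    keep the minimum hash value over the document's shingles."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in doc_shingles))
    return sig

def lsh_duplicate_groups(signatures, bands=8):
    """LSH banding: documents whose signatures agree on at least one
    band land in the same bucket and become duplicate candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

docs = {
    "a": "the quick brown fox jumps over the lazy dog " * 3,
    "b": "the quick brown fox jumps over the lazy dog " * 3,
    "c": "a completely unrelated document about suffix arrays",
}
sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
groups = lsh_duplicate_groups(sigs)
# "a" and "b" are duplicates, so they share every band and co-occur in a group.
```

The banding step is what makes the approach sub-quadratic: only documents that collide in some bucket are compared, rather than all pairs.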

  • Published in:
    International Conference on Machine Learning, Optimization, and Data Science
  • Type:
    Article
  • Authors:
    Leveling, Johannes; Helmer, Lennard; Stein, Benny; Wegener, Dennis; Sheikh, Zoha; Fernandes, Elanton; Abdelwahab, Hammam; Jha, Nishikant Gaurikant
  • Year:
    2024

Citation information

Leveling, Johannes; Helmer, Lennard; Stein, Benny; Wegener, Dennis; Sheikh, Zoha; Fernandes, Elanton; Abdelwahab, Hammam; Jha, Nishikant Gaurikant: Evaluation of Document Deduplication Algorithms for Large Text Corpora, International Conference on Machine Learning, Optimization, and Data Science, 2024.

Associated Lamarr Researchers

Dennis Wegener - Lamarr Institute for Machine Learning (ML) and Artificial Intelligence (AI)