Cross-lingual text clustering in a large system

Nicole R. Schneider
Hanan Samet
2023 7th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2023) (2023) (to appear)

Abstract

The multilingual world needs systems that can cluster text written
in multiple languages into the same thread or topic. A practical
approach for clustering text in different languages is to first translate
it into a common language, such as English, and then cluster it
post-translation. While this approach seems sensible, its performance
and pitfalls have not been well studied. The
reference architecture used for the study is a news system that
has continuously indexed news articles in more than 19 languages
over many years. Through analysis of these documents and their
clusters, clustering quality is shown to depend on the
translator’s ability to normalize proper noun usage, the geographic
focus of the text, and the document topic.