In text analysis, spherical k-means (SKM) is a specialized k-means clustering algorithm widely used for grouping documents represented in high-dimensional, sparse term-document matrices, often normalized using techniques like TF-IDF. Researchers frequently seek to cluster not only documents but also the terms associated with them into coherent groups. To address this dual clustering requirement, we introduce spherical double k-means (SDKM), a novel methodology that simultaneously clusters documents and terms. This methodology offers several advantages: more effective topic identification and keyword extraction, improved interpretability, computational efficiency, and the ability to capture dynamic changes in thematic content over time. It also facilitates the uncovering of nuanced patterns and structures in textual data. We apply SDKM to simulated and real data. The real-data applications use the corpus of US presidential inaugural addresses, spanning from George Washington in 1789 to Joe Biden in 2021, and the 20 Newsgroups corpus. Our analysis reveals distinct clusters of words and documents that correspond to significant themes and periods, showcasing the method’s ability to facilitate a deeper understanding of the data. Our findings demonstrate the efficacy of SDKM in uncovering underlying patterns in textual data.
Bombelli, I., Iezzi, D.F., Seri, E., Vichi, M. (2026). Spherical double k-means: a co-clustering approach for textual data analysis. Journal of Classification [10.1007/s00357-026-09544-7].
Spherical double k-means: a co-clustering approach for textual data analysis
IEZZI D. F.; SERI E.
Methodology
2026-03-01
| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| Bombelli_et_al-2026-Journal_of_Classification.pdf | Open access | Publisher's version (PDF) | Creative Commons | 2.14 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


