Spherical double k-means: a co-clustering approach for textual data analysis


In text analysis, spherical k-means (SKM) is a specialized k-means clustering algorithm widely used to group documents represented as high-dimensional, sparse term-document matrices, often normalized using techniques such as TF-IDF. Researchers frequently seek to cluster not only documents but also the terms associated with them into coherent groups. To address this dual clustering requirement, we introduce spherical double k-means (SDKM), a novel methodology that simultaneously clusters documents and terms. This methodology offers several advantages: more effective topic identification and keyword extraction, enhanced interpretability, computational efficiency, and the ability to capture dynamic changes in thematic content over time. It also facilitates the uncovering of nuanced patterns and structures in textual data. We apply SDKM to simulated and real data. The real-data applications use the corpus of US presidential inaugural addresses, spanning from George Washington in 1789 to Joe Biden in 2021, and the 20 Newsgroups corpus. Our analysis reveals distinct clusters of words and documents that correspond to significant themes and periods, showcasing the method’s ability to facilitate a deeper understanding of the data. Our findings demonstrate the efficacy of SDKM in uncovering underlying patterns in textual data.
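As a rough illustration of the spherical k-means step the abstract builds on, the sketch below clusters the rows of a small document-term matrix by cosine similarity. It is a minimal sketch of standard SKM only, not the authors' SDKM algorithm, which additionally clusters the terms. The function name `spherical_kmeans`, its parameters, and the toy matrix are assumptions made for this example, and every row is assumed to be nonzero.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Hypothetical helper: standard spherical k-means on the rows of X."""
    rng = np.random.default_rng(seed)
    # Project every document vector onto the unit sphere (rows assumed nonzero).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Start from k distinct documents chosen at random.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]
    labels = np.full(X.shape[0], -1)
    for _ in range(n_iter):
        # Assign each document to the centroid with the largest cosine similarity.
        new_labels = np.argmax(X @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                # New centroid: the summed members, rescaled back to unit length.
                total = members.sum(axis=0)
                norm = np.linalg.norm(total)
                if norm > 0:
                    centroids[j] = total / norm
    return labels, centroids

# Toy document-term matrix (made up): documents 0-1 share terms 0-1,
# documents 2-3 share terms 2-3.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]], dtype=float)
labels, centroids = spherical_kmeans(X, k=2)
```

On this toy matrix the two document pairs end up in separate clusters. A co-clustering method such as the SDKM described above would also partition the columns (terms), which this sketch does not attempt.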

Bombelli, I., Iezzi, D.F., Seri, E., Vichi, M. (2026). Spherical double k-means: a co-clustering approach for textual data analysis. Journal of Classification [10.1007/s00357-026-09544-7].


BOMBELLI I.; IEZZI D. F.; SERI E. (Methodology); VICHI M.

2026-03-01



Publication date: March 2026
Publication status: Online ahead of print
Article DOI: https://dx.doi.org/10.1007/s00357-026-09544-7
Relevance: International
Type: Article
Referees: Anonymous experts
Disciplinary sector of the article (valid until 24/06/2024): SECS-S/05
Disciplinary sector of the article (valid from 09/05/2024): STAT-03/B - Statistica sociale (Social Statistics)
Content language: English
ISI Impact Factor: With ISI Impact Factor
Keywords: Textual data; Co-clustering; Topic modeling; Spherical double k-means
Citation: Bombelli, I., Iezzi, D.F., Seri, E., Vichi, M. (2026). Spherical double k-means: a co-clustering approach for textual data analysis. Journal of Classification [10.1007/s00357-026-09544-7].
All authors: Bombelli, I; Iezzi, DF; Seri, E; Vichi, M
Typology: Journal article
Appears in types: 01 - Journal article

Files in this item:

File	Access	Type	License	Size	Format
Bombelli_et_al-2026-Journal_of_Classification.pdf	Open access	Publisher's version (PDF)	Creative Commons	2.14 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2108/452743
