Most of existing cardinality estimation algorithms do not support natively interval queries under a sliding window model and are thereby insensitive to data recency. We present Staggered-HyperLogLog (ST-HLL), a probabilistic data structure that takes inspiration from HyperLogLog (HLL) and provides nearly continuous-time estimation of cardinality rates, rather than absolute counts. Our solution has zero-bit overhead with respect to vanilla HLL and negligible additional computational complexity. It is based on a periodic staggered reset of HLL registers and a register equalization operation at query times to compensate for staggered counting. We tested ST-HLL over both synthetic and real Internet traffic traces, showing its ability to track variations of the flow cardinality, quickly adapting to variations under non-stationary flow arrival processes. We show that for the same amount of memory footprint, our algorithm improves the accuracy up to a factor 2x with respect to the state-of-the-art solution, Sliding HLL.
Cornacchia, A., Bianchi, G., Bianco, A., Giaccone, P. (2022). Staggered HLL: near-continuous-time cardinality estimation with no overhead. COMPUTER COMMUNICATIONS, 193, 168-175 [10.1016/j.comcom.2022.06.038].
Staggered HLL: near-continuous-time cardinality estimation with no overhead
Bianchi G.;
2022-01-01
Abstract
Most of existing cardinality estimation algorithms do not support natively interval queries under a sliding window model and are thereby insensitive to data recency. We present Staggered-HyperLogLog (ST-HLL), a probabilistic data structure that takes inspiration from HyperLogLog (HLL) and provides nearly continuous-time estimation of cardinality rates, rather than absolute counts. Our solution has zero-bit overhead with respect to vanilla HLL and negligible additional computational complexity. It is based on a periodic staggered reset of HLL registers and a register equalization operation at query times to compensate for staggered counting. We tested ST-HLL over both synthetic and real Internet traffic traces, showing its ability to track variations of the flow cardinality, quickly adapting to variations under non-stationary flow arrival processes. We show that for the same amount of memory footprint, our algorithm improves the accuracy up to a factor 2x with respect to the state-of-the-art solution, Sliding HLL.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.