We propose a highly efficient and effective algorithm to estimate metrics on very large graphs that are based on the neighborhood function: examples of such metrics are the (effective) diameter and (effective) radius or the average distance. In order to efficiently provide good approximations to the size of the neighborhood set of any node, we refer to the MinHash Signatures approach to derive compressed representations of large sparse datasets but preserving similarity. The technique introduced here, named MinHash Signature Estimation (MHSE), exploits the similarity between signatures to estimate the size of the neighborhood function. We show that MHSE is as effective as HyperANF, which is considered the state of art approach for the estimation of the effective diameter of a very large graph. Indeed, performing both parametric (t-test) and non-parametric (Wilcoxon) statistical tests on residuals for average distance, effective diameter and number of connected pairs, the p-values show that MHSE tends to be more statistically significant than HyperANF. On the other hand, we show that MHSE is a very simple and easily distributable algorithm. In addition, by the property of the signatures to preserve similarity between neighborhoods of nodes, the algorithm can be suitably applied to allow to search and estimate the overlapping size of the most similar neighborhood at different distances.
Amati, G., Angelini, S., Gambosi, G., Rossi, G., Vocca, P. (2017). Estimation of distance-based metrics for very large graphs with MinHash Signatures. In 2017 IEEE International Conference on Big Data, Big Data 2017 (pp.536-545). IEEE [10.1109/BigData.2017.8257969].
Estimation of distance-based metrics for very large graphs with MinHash Signatures
Giambattista Amati;Giorgio Gambosi;Gianluca Rossi;Paola Vocca
2017-01-01
Abstract
We propose a highly efficient and effective algorithm to estimate metrics on very large graphs that are based on the neighborhood function: examples of such metrics are the (effective) diameter and (effective) radius or the average distance. In order to efficiently provide good approximations to the size of the neighborhood set of any node, we refer to the MinHash Signatures approach to derive compressed representations of large sparse datasets but preserving similarity. The technique introduced here, named MinHash Signature Estimation (MHSE), exploits the similarity between signatures to estimate the size of the neighborhood function. We show that MHSE is as effective as HyperANF, which is considered the state of art approach for the estimation of the effective diameter of a very large graph. Indeed, performing both parametric (t-test) and non-parametric (Wilcoxon) statistical tests on residuals for average distance, effective diameter and number of connected pairs, the p-values show that MHSE tends to be more statistically significant than HyperANF. On the other hand, we show that MHSE is a very simple and easily distributable algorithm. In addition, by the property of the signatures to preserve similarity between neighborhoods of nodes, the algorithm can be suitably applied to allow to search and estimate the overlapping size of the most similar neighborhood at different distances.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.