Jiri Skala

University of West Bohemia, Czech Republic

Ivana Kolingerova

University of West Bohemia, Czech Republic

Linköping Electronic Conference Proceedings 28:5, p. 17–23

Published: 2007-12-20

ISBN: 978-91-7393-990-4

ISSN: 1650-3686 (print), 1650-3740 (online)

Using recent knowledge in data stream clustering we present a modified approach to the facility location problem in the context of geometric data streams. We give insight to the existing algorithm from a less mathematical point of view; focusing on understanding and practical use; namely by computer graphics experts. We propose a modification of the original data stream k-median clustering to solve facility location which is the case when we a priori do not know the number of clusters in the input data. Like the original; the modified version is capable of processing millions of points while using rather small amount of memory. Based on our experiments with clustering geometric data we present suggestions on how to set processing parameters. We also describe how the algorithm handles various distributions of input data within the stream. These findings may be applied back to the original algorithm.

Data stream; clustering; facility location; geometric data

CHARIKAR; M.; AND GUHA; S. 1999. Improved combinatorialalgorithms for the facility location and k-median problems. In IEEE Symposium on Foundations of Computer Science; 378– 388.

CHARIKAR; M.; KHULLER; S.; MOUNT; D. M.; AND NARASIMHAN; G. 2001. Algorithms for facility location problems with outliers. In Symposium on Discrete Algorithms; 642– 651.

CHARIKAR; M.; GUHA; S.; ´EVA TARDOS; AND SHMOYS; D. B. 2002. A constant-factor approximation algorithm for the kmedian problem. Journal of Computer System Sciences 65; 1; 129–149.

CHARIKAR; M.; O’CALLAGHAN; L.; AND PANIGRAHY; R. 2003. Better streaming algorithms for clustering problems. In Proc. of 35th ACM Symposium on Theory of Computing (STOC); 30–39.

CHUDAK; F. A.; AND SHMOYS; D. B. 2004. Improved approximation algorithms for the uncapacitated facility location problem. SIAM Journal on Comp. 33; 1; 1–25.

FRAHLING; G.; AND SOHLER; C. 2005. Coresets in dynamic geometric data streams. In Proceedings of the 37th annual ACM symposium on Theory of computing (STOC); 209–217.

GUHA; S.; AND KHULLER; S. 1998. Greedy strikes back: Improved facility location algorithms. In ACM-SIAM Symposium on Discrete Algorithms (SODA); 649–657.

GUHA; S.; RASTOGI; R.; AND SHIM; K. 1998. CURE: An efficient clustering algorithm for large databases. In Proceedings of ACM SIGMOD International Conference on Management of Data; 73–84.

GUHA; S.; MISHRA; N.; MOTWANI; R.; AND O’CALLAGHAN; L. 2000. Clustering data streams. In IEEE Symposium on Foundations of Computer Science; 359–366.

GUHA; S.; MEYERSON; A.; MISHRA; N.; MOTWANI; R.; AND O’CALLAGHAN; L. 2003. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering 15; 3; 515–528.

ISENBURG; M.; AND GUMHOLD; S. 2003. Out-of-core compressionfor gigantic polygon meshes. In SIGGRAPH’03 Conference Proceedings; 935–942.

ISENBURG; M.; AND LINDSTROM; P. 2005. Streaming meshes. In Proceedings of Visualization’05; 231–238.

ISENBURG; M.; LINDSTROM; P.; AND SNOEYINK; J. 2005. Streaming compression of triangle meshes. In Proceedings of the 3rd Eurographics symposium on Geometry processing (SGP); 111.

ISENBURG; M.; LINDSTROM; P.; GUMHOLD; S.; AND SHEWCHUK; J. 2006. Streaming compression of tetrahedral volume meshes. In Graphics Interface; 115–121.

ISENBURG; M.; LIU; Y.; SHEWCHUK; J.; AND SNOEYINK; J. 2006. Streaming computation of delaunay triangulations. ACM Trans. Graph. 25; 3; 1049–1056.

JAIN; K.; AND VAZIRANI; V. V. 1999. Primal-dual approximation algorithms for metric facility location and k-median problems. In IEEE Symposium on Foundations of Computer Science; 2–13.

JAIN; A. K.; MURTY; M. N.; AND FLYNN; P. J. 1999. Data clustering: A review. ACM Computing Surveys 31; 3; 264–323. KAUFMAN; L.; AND ROUSSEEUW; P. J. 1990. Finding groups in data: An introduction to cluster analysis. John Wiley; New York.

KORUPOLU; M. R.; PLAXTON; C. G.; AND RAJARAMAN; R. 1998. Analysis of a local search heuristic for facility location problems. In 9th ACM-SIAM Symposium on Discrete Algorithms (SODA); 1–10.

LINDSTROM; P. 2000. Out-of-core simplification of large polygonal models. In Siggraph 20000; Computer Graphics Proceedings; 259–262.

MAHDIAN; M.; MARKAKIS; E.; SABERI; A.; AND VAZIRANI; V. V. 2001. A greedy facility location algorithm analyzed using dual fitting. In 4th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX); 127–137.

MEYERSON; A. 2001. Online facility location. In Proceedings of the 42nd IEEE symposium on Foundations of Computer Science (FOCS); 426.

MUTHUKRISHNAN; S. 2003. Data streams: Algorithms and applications. In Proceedings of the 14th annual ACM-SIAM symposium on discrete algorithms.

NG; R. T.; AND HAN; J. 1994. Efficient and effective clustering methods for spatial data mining. In 20th Intl. Conference on Very Large Data Bases; 144–155.

O’CALLAGHAN; L.; MISHRA; N.; MEYERSON; A.; GUHA; S.; AND MOTWANI; R. 2002. Streaming-data algorithms forhighquality clustering. In 18th International Conference on Data Engineering (ICDE); 685.

PAJAROLA; R. 2005. Stream-processing points. In Proceedings IEEE Visualization; 239–246.

ROSSIGNAC; J. R.; AND BORREL; P. 1993. Multi-resolution 3D approximations for rendering complex scenes. In Geometric Modeling in Comp. Graphics; 455–465.

SHARIFZADEH; M.; AND SHAHABI; C. 2004. Approximate voronoi cell computation on geometric data streams. Tech. Rep. 04-835; University of Southern California; Computer Science Department.

SHMOYS; D. B. 2000. Approximation algorithms for facility location problems. In Proceedings of International Workshop on Approximation Algorithms for Combinatorial Optimization (APPROX); 27–33.

STANFORD; 2007. Stanford 3D scanning repository. http://graphics.stanford.edu/data/3Dscanrep/.

ZHANG; T.; RAMAKRISHNAN; R.; AND LIVNY; M. 1996. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD International Conference on Management of Data; 103–114.