How should one evaluate the time complexity of mainstream clustering algorithms, such as k
Suppose there are k cluster centers, n data points, and d dimensions per data point. Compare their time complexities. Additional question: what is the fastest clustering algorithm in the world besides the three above, regardless of its accuracy? Can the fastest algorithm achieve ~O(log n)?
Replies (1)
Thanks for the invitation. It’s an interesting question.
First of all, these are not strictly “algorithms”. They all represent different optimization problems, and there are virtually infinitely many ways to solve each of them. The fact that EM happens to be an efficient method for solving K-means doesn’t make K-means an algorithm in its own right.
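To make the distinction concrete, here is a minimal NumPy sketch of that EM-style alternation for the K-means objective, usually credited to Lloyd (the name lloyd_kmeans and its parameters are mine, for illustration only). Incidentally, it also answers part of the original question: each assignment pass costs O(nkd) for n points, k centers, and d dimensions.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """One way to optimize the K-means objective: EM-style alternation."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Assignment step: nearest center for each point, O(n*k*d).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its points, O(n*d).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # reached a local optimum of the objective
        centers = new_centers
    return centers, labels
```

Swap in a different optimizer (mini-batch updates, k-means++ seeding, coresets) and the runtime changes, while the problem being solved stays exactly the same.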
Secondly, there is no such thing as “accuracy” for clustering algorithms. Leaving aside the fact that most clustering algorithms suffer from issues ranging from initialization to local optima, there is no single right answer to a clustering problem. Even when there appears to be a “correct solution” (e.g. on the “crescent” dataset and other toy datasets), it is really just an illustration of the kind of structure in the data the algorithm tries to capture, which previous methods usually cannot (e.g. K-means doesn’t give an intuitive solution to the crescent structure, but spectral methods do).
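As an illustration, here is a quick scikit-learn sketch on make_moons (sklearn’s version of the crescent-shaped toy data); the exact scores depend on the noise level and seed, so treat the printed numbers as indicative only:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: a clustering structure that is
# connectivity-based rather than centroid-based.
X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit(X)

# Adjusted Rand index against the generating crescents (1.0 = exact match).
print("k-means :", adjusted_rand_score(y, km.labels_))  # mediocre: convex cells
print("spectral:", adjusted_rand_score(y, sc.labels_))  # typically near 1.0
```

Note that this is not “accuracy” in any absolute sense; it only measures agreement with one particular generative story, which is exactly the point above.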
Last but not least, since there are virtually infinitely many algorithms that solve the same (usually non-convex) optimization problem, it is difficult to define the “fastest” algorithm for a problem regardless of the actual data distribution. Even if the problem were actually convex, optimization algorithms could still yield drastically different runtimes on different data before the loss function converges to within a reasonable error margin. For instance, Newton methods are great when the problem is quadratic-ish (the function value then converges quadratically fast to the optimum), but before that phase there is a linearly convergent phase whose length is hard to predict and varies from problem to problem. Furthermore, for most problems a specialized algorithm often works better than a generic one, and it is usually easier to prove runtime upper bounds for (e.g. the simplex algorithm for linear programming). So for convex problems, without further assumptions, the unpredictable behavior of Newton methods is about the best you can get; and for more general problems, even specialized algorithms can suffer from very pathological cases. Tight mathematical bounds on runtime are often intractable, if not irrelevant, since people have invented countless heuristics to make algorithms converge faster on “normal”, real-world datasets.
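To see the two Newton phases concretely, here is a toy example of my own (pure Newton on the strictly convex f(x) = exp(x) - 2x, with minimizer x* = ln 2; nothing to do with clustering):

```python
import math

# Minimize f(x) = exp(x) - 2x with pure Newton steps, starting far away.
x, x_star = 10.0, math.log(2.0)
for i in range(16):
    grad = math.exp(x) - 2.0   # f'(x)
    hess = math.exp(x)         # f''(x)
    x -= grad / hess           # Newton step
    print(f"iter {i:2d}  error = {abs(x - x_star):.3e}")
# For roughly the first nine iterations the error only shrinks by about 1
# per step (the slow phase, whose length depends on the starting point);
# after that it squares at every step and hits machine precision in 2-3 steps.
```

Start it at x = 100 instead and the slow phase is about ten times longer; that is the “varies from problem to problem” part.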
In conclusion, this is an interesting question, but I’m afraid the answer is “It’s largely inconclusive, but no one really cares that much”.