Thursday, February 28, 2008

medoids

Suppose I have a set of points. The mean of this set of points is the point (not necessarily in the set) which minimizes the sum of its squared distances to all the points in the set. A related concept is the medoid, which is like the mean but is restricted to being one of the points in the set.

Interesting fact: the sum of squared distances from the medoid to all points in the set cannot be more than twice the sum of squared distances from the mean to all points in the set.
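Here's a quick numerical sanity check of that fact (just a sketch using numpy and made-up random data, not a proof):

import numpy as np

rng = np.random.default_rng(0)
for trial in range(1000):
    points = rng.normal(size=(20, 3))           # 20 random points in R^3
    mean = points.mean(axis=0)
    mean_cost = ((points - mean) ** 2).sum()    # sum of squared distances to the mean

    # cost of using each point as the medoid: sum of squared distances to all points
    diffs = points[:, None, :] - points[None, :, :]
    medoid_cost = (diffs ** 2).sum(axis=2).sum(axis=1).min()

    assert medoid_cost <= 2 * mean_cost + 1e-9  # the factor-of-2 bound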

This implies that, for the k-medoids problem, if you have an algorithm that achieves some constant c times the optimal solution cost, the same algorithm also achieves a constant factor (namely 2c) approximation for the k-means problem.
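
To spell out the reasoning (writing OPT_medoids and OPT_means for the optimal costs of the two problems, and c for the algorithm's approximation factor): applying the fact above to each cluster of the optimal k-means solution shows that OPT_medoids ≤ 2·OPT_means, and any k-medoids solution is also a valid k-means solution with the same cost, so

(algorithm's cost as a k-means solution) = (algorithm's cost as a k-medoids solution) ≤ c·OPT_medoids ≤ 2c·OPT_means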

Again, this is probably trivial if you're a theory person, but I think it's fun.

Tuesday, February 26, 2008

double triangle inequality

The triangle inequality is one of the properties of a distance metric d. It states that, for any x, y, and z:

d(x,y) + d(y,z) ≥ d(x,z)
A somewhat lesser-known (but easy to derive) extension of this is the double triangle inequality for squared distances:
2(d²(x,y) + d²(y,z)) ≥ d²(x,z)
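(To derive it, square both sides of the triangle inequality and use the fact that 2ab ≤ a² + b²:
d²(x,z) ≤ (d(x,y) + d(y,z))² = d²(x,y) + 2d(x,y)d(y,z) + d²(y,z) ≤ 2(d²(x,y) + d²(y,z)).)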
I have never seen the double triangle inequality used for anything. However, I would expect it to be useful, given how often squared distances appear in objective functions. The k-means objective, for instance, is to minimize a sum of squared distances. For problems whose objective functions involve non-squared distances, the ordinary triangle inequality is often used to prove bounds on an algorithm's performance. I'm not sure why I haven't seen similar arguments made with the double triangle inequality and squared distances.

Also, I should point out that I'm not really a theory person, so my not having seen something provides very little evidence for the nonexistence of said thing.