Monday, July 14, 2008

"real-world" object recognition

A paper from earlier this year (actual paper here) argues that Caltech 101 and similar benchmarks used to test object recognition systems fail to capture what is truly difficult about visual object recognition, and that good performance on them can be achieved by local models with no explicit higher-level processing. This, the authors claim, is not because these models are actually effective, but because the datasets used contain very little "real-world" variation. For instance, in Caltech 101 the objects are centered, similarly oriented, mostly unoccluded and free from clutter, and are correlated with the image backgrounds.

Instead, the authors argue, we should use synthetic datasets where the amount of variation is controlled. They test their simple model on such a dataset, in which object size, position, and orientation vary, and find that as the variation increases, performance quickly degrades to no better than chance.

I think the paper raises an important point, in that we should take care that our datasets are capturing the essence of the problem to be solved. A few comments:

  1. The problem of object recognition has multiple essences, and Caltech 101 captures some of them. Our "texture recognition as object recognition" models work pretty well because they do capture some of the appearance variation across different object instances, illumination, and very small changes in viewpoint.
  2. It's not clear that a dataset in which objects have large amounts of variation in viewing angle better captures object recognition in the "real world". In particular, cars and airplanes, the two objects used in the synthetic dataset, are rarely seen from certain angles in practice (a car viewed from directly underneath, say).
  3. I would like to see how humans perform on the synthetic data set with lots of variation.
Incidentally, here's another argument from the paper about the standard recognition datasets:
The majority of images are also ‘‘composed’’ photographs, in that a human decided how the shot should be framed, and thus the placement of objects within the image is not random and the set may not properly reflect the variation found in the real world.
My ECCV 2008 paper uses this observation as a cue for segmenting 3D scenes into objects of interest using photographs from Flickr.

Saturday, July 05, 2008

transformations between point sets with no correspondence

Another problem that arises frequently in computer vision applications is finding a transformation that maps one set of points to another, where we have no information about point correspondences.

For instance, suppose I have a corner detector, and I want to align the set of detected corners to a floor plan, perhaps for an application like this one.

In addition, as will nearly always be the case, many of my detected corners are spurious. What I want to do is find a transformation that maps as many detected corners as possible to corners on the floor plan.
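In practice, one common way to attack this kind of problem (my own illustration, not something discussed in the post) is a RANSAC-style randomized search: hypothesize a single correspondence, compute the transformation it implies, and count how many other detected corners it explains. Here is a minimal sketch, restricted to pure translation for brevity; `detected` and `plan` are assumed to be arrays of 2D corner coordinates, and a real system would also estimate rotation and scale.

```python
import random
import numpy as np

def ransac_translation(detected, plan, iters=1000, tol=2.0, rng=None):
    """Sketch: find a translation mapping as many detected corners as
    possible onto floor-plan corners. Translation-only for brevity."""
    rng = rng or random.Random(0)
    detected = np.asarray(detected, float)   # shape (n, 2)
    plan = np.asarray(plan, float)           # shape (m, 2)
    best_t, best_inliers = None, -1
    for _ in range(iters):
        # hypothesize one correspondence and the translation it implies
        d = detected[rng.randrange(len(detected))]
        p = plan[rng.randrange(len(plan))]
        t = p - d
        # count detected corners that land within tol of some plan corner
        shifted = detected + t
        dists = np.linalg.norm(shifted[:, None, :] - plan[None, :, :], axis=2)
        inliers = int((dists.min(axis=1) < tol).sum())
        if inliers > best_inliers:
            best_t, best_inliers = t, inliers
    return best_t, best_inliers
```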

Unfortunately, even an extremely simplified version of this problem is believed to be hard, where hard means that no algorithm substantially faster than quadratic time is expected to exist. In practical terms, to find the exact solution we essentially have to examine every pair consisting of a detected corner and a floor plan corner.

Here's the simplified version: given two sets of numbers A and B, find the constant offset c that we can add to all elements of A to maximize the number of elements that the shifted A and B have in common. In other words, treating the numbers as points in 1D, what is the translation that maximizes the number of correspondences?
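To make that concrete, here is a sketch (my own, not from any paper) of the obvious exact algorithm, which is quadratic: any offset that aligns at least one element of A with an element of B must be of the form b - a, so we can enumerate all |A|·|B| such pairs and vote. This assumes exact matches (e.g., integer values) and non-empty sets.

```python
from collections import Counter

def best_offset(A, B):
    """Exact solution to the 1-D version: try every offset of the
    form b - a and return the one with the most matches."""
    votes = Counter(b - a for a in A for b in B)
    best_c, best_count = max(votes.items(), key=lambda kv: kv[1])
    return best_c, best_count

# e.g. best_offset([1, 2, 5], [11, 12, 30]) returns (10, 2):
# shifting A by 10 matches 1 -> 11 and 2 -> 12.
```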

It turns out that this problem is hard even if we are merely trying to detect whether or not there are two correspondences that agree on a single translation. This is shown via a reduction from the 3SUM problem in this paper.
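For reference (my gloss, not the paper's wording): 3SUM asks whether a set of n numbers contains three elements that sum to zero, and no substantially subquadratic algorithm for it is known; many geometric problems inherit that conjectured quadratic barrier through reductions like the one above. A sketch of the standard quadratic algorithm:

```python
def has_3sum(nums):
    """Return True if some three elements of nums sum to zero.
    Standard O(n^2) approach: fix the smallest element of the triple,
    then scan the rest of the sorted list with two pointers."""
    xs = sorted(nums)
    n = len(xs)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            s = xs[i] + xs[lo] + xs[hi]
            if s == 0:
                return True
            if s < 0:
                lo += 1
            else:
                hi -= 1
    return False
```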