A paper from earlier this year (actual paper here) argues that Caltech 101 and similar benchmarks used to test object recognition systems fail to capture what is truly difficult about visual object recognition, and good performance can be achieved by local models with no explicit higher-level processing. This, the authors claim, is not because these models are actually effective, but because the datasets used contain very little "real-world" variation. For instance, in Caltech 101 the objects are centered, similarly oriented, mostly unoccluded and free from clutter, and are correlated with the image backgrounds.
Instead, the authors argue, we should use synthetic datasets where the amount of variation is controlled. They test their simple model on such a dataset, where object size, position, and orientation vary, and find that as the variation increases, performance quickly degrades to no better than chance.
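To make the idea concrete, here is a minimal sketch (my own illustration, not the paper's actual rendering pipeline, which uses 3D models of cars and planes) of what "controlling the amount of variation" can look like: a fixed object template is placed on a blank canvas, and the ranges of translation, scale, and in-plane rotation widen with a single variation-level parameter. The function name and parameter choices are hypothetical.

```python
# Sketch: generate synthetic test images whose pose variation is controlled
# by a single "level" parameter (0 = centered and canonical, 1 = maximal).
import numpy as np
from PIL import Image

def render_example(template: Image.Image, level: float,
                   canvas_size: int = 128, rng=np.random) -> Image.Image:
    """Place `template` on a blank canvas with pose variation scaled by `level` in [0, 1]."""
    canvas = Image.new("L", (canvas_size, canvas_size), color=0)

    # Ranges widen linearly with the variation level (illustrative choices).
    max_shift = int(level * canvas_size * 0.3)        # translation in pixels
    scale = 1.0 + level * rng.uniform(-0.5, 0.5)      # relative size change
    angle = level * rng.uniform(-90.0, 90.0)          # in-plane rotation, degrees

    obj = template.rotate(angle, expand=True)
    new_size = (max(1, int(obj.width * scale)), max(1, int(obj.height * scale)))
    obj = obj.resize(new_size)

    # Offset the object from the centered position by a bounded random amount.
    cx = (canvas_size - obj.width) // 2 + rng.randint(-max_shift, max_shift + 1)
    cy = (canvas_size - obj.height) // 2 + rng.randint(-max_shift, max_shift + 1)
    canvas.paste(obj, (cx, cy))
    return canvas

# Usage: at level 0 every image looks like a centered, Caltech-101-style example;
# as the level grows, so does the pose variation a recognizer must handle.
template = Image.fromarray((np.random.rand(32, 32) * 255).astype(np.uint8))
images = [render_example(template, level) for level in (0.0, 0.5, 1.0)]
```

The point of the parameterization is simply that the difficulty of the benchmark becomes an explicit knob rather than an accident of how the photographs were collected.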
I think the paper raises an important point, in that we should take care that our datasets capture the essence of the problem to be solved. A few comments:
- The problem of object recognition has multiple essences, and Caltech 101 captures some of them. Our "texture recognition as object recognition" models work pretty well because they do capture some of the appearance variation across different object instances, illumination, and very small changes in viewpoint.
- It's not clear that a dataset in which objects have large amounts of variation in viewing angle better captures object recognition in the "real world". In particular, cars and airplanes, the two object classes used in the synthetic dataset, are rarely seen from certain viewpoints (say, directly overhead or from below) in everyday images.
- I would like to see how humans perform on the synthetic dataset with lots of variation.
The majority of images are also "composed" photographs, in that a human decided how the shot should be framed, and thus the placement of objects within the image is not random and the dataset may not properly reflect the variation found in the real world.

My ECCV 2008 paper uses this observation as a cue for segmenting 3D scenes into objects of interest using photographs from Flickr.