Tuesday, October 26, 2010

Datasets

Let's say you have come up with an algorithm that tackles a computer vision problem; for the sake of argument, assume it is a novel chair detector. To prove that the detector actually works, you need to show quantitative and qualitative results, and testing and evaluating any such computer vision algorithm requires a dataset (a small sketch of what such a quantitative evaluation might look like follows the list below). There are two options you could take at this point:

  1. Compile a completely new dataset
    • Pros
      • Existing datasets may not represent or cover the scenarios in which the algorithm is applicable, so a new dataset expands this horizon.
    • Cons
      • A lot of work is needed to compile a dataset.
      • Even more work is needed to compile a good dataset. Ideally it should improve on the existing datasets, eliminating some (or all) of their shortcomings while expanding into new areas, and it should not suffer from latent biases that skew the results.
      • It is hard to compare the algorithm against state-of-the-art or earlier results from other groups, since those results were not obtained on the new dataset.
  2. Use an existing dataset
    • Pros
      • Other results to compare against: other groups will have used the dataset, so there will be published results against which the new algorithm can be compared.
      • Being an established, widely used dataset, it will have fewer problems; many wrinkles in data annotation and the like will already have been ironed out, making it more reliable.
    • Cons
      • It may not cover the new scenarios for which the new algorithm is applicable, or may even be biased against them.

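To make the idea of quantitative results concrete, here is a minimal sketch of how a detector's output could be scored against a dataset's ground-truth annotations using intersection-over-union (IoU) matching, in the spirit of PASCAL VOC-style evaluation. The function names, the (x1, y1, x2, y2) box format, the greedy matching rule, and the 0.5 IoU threshold are my own assumptions for illustration, not something prescribed by any particular dataset.

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def precision_recall(detections, ground_truth, iou_threshold=0.5):
        """Greedily match detections to ground-truth boxes for one image.

        detections   -- list of (box, score) pairs, highest score matched first
        ground_truth -- list of boxes
        Returns (precision, recall).
        """
        detections = sorted(detections, key=lambda d: d[1], reverse=True)
        matched = [False] * len(ground_truth)
        true_positives = 0
        for box, _score in detections:
            best_iou, best_idx = 0.0, -1
            for i, gt_box in enumerate(ground_truth):
                if matched[i]:
                    continue
                overlap = iou(box, gt_box)
                if overlap > best_iou:
                    best_iou, best_idx = overlap, i
            if best_iou >= iou_threshold:
                matched[best_idx] = True
                true_positives += 1
        precision = true_positives / len(detections) if detections else 0.0
        recall = true_positives / len(ground_truth) if ground_truth else 0.0
        return precision, recall

For a full benchmark you would aggregate these counts over the whole test set and sweep the detector's score threshold to trace out a precision-recall curve; the matching step above is the core of most detection metrics.
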
Whichever option you choose, you need to produce reproducible results. To ensure the usefulness of your work, it also helps to provide code that makes it easy for others to reproduce those results; a small sketch of what such code might include follows below. I would recommend referring to Dataset Issues in Object Recognition by Ponce et al. for a comprehensive study of the issues involved.
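
As a purely hypothetical example of what such accompanying code might do, the sketch below writes the raw detections together with the evaluation settings and summary metrics to a JSON file, so that anyone can recompute or verify the reported numbers without rerunning the detector. The field names, file name, and dataset identifier are placeholders of my own choosing.

    import json

    def save_results(detections_per_image, metrics, path="results.json"):
        """Write raw detections plus evaluation settings to disk so that
        others can verify or recompute the reported numbers.

        detections_per_image -- dict mapping image id to a list of
                                [x1, y1, x2, y2, score] entries
        metrics              -- dict of summary numbers, e.g. precision/recall
        """
        payload = {
            "dataset": "chair-test-set-v1",   # placeholder dataset identifier
            "iou_threshold": 0.5,             # evaluation setting used above
            "metrics": metrics,
            "detections": detections_per_image,
        }
        with open(path, "w") as f:
            json.dump(payload, f, indent=2)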
