Large-Scale Visual Semantic Extraction
Samy Bengio
Google Research
ABSTRACT
Image annotation is the task of assigning textual semantics to new images by ranking a large set of possible annotations according to how well they correspond to a given image. In the large-scale setting, there can be millions of images to process and hundreds of thousands of distinct potential annotations. To achieve this task, we propose to build a so-called embedding space, into which both images and annotations can be automatically projected. In such a space, one can find the nearest annotations to a given image, or annotations similar to a given annotation. One can even build a visio-semantic tree from these annotations that reflects how similar concepts (annotations) are to each other with respect to their visual characteristics. Such a tree differs from semantic-only trees, such as WordNet, which do not take the visual appearance of concepts into account.
INTRODUCTION
The emergence of the Web as a tool for sharing information has caused a massive increase in the size of potential data sets available for machines to learn from. Millions of images on web pages have tens of thousands of possible annotations, in the form of HTML tags that can be conveniently collected by querying search engines (Torralba et al., 2008), user-provided tags such as those on www.flickr.com, or human-curated labels such as those on www.image-net.org (Deng et al., 2009). We therefore need machine learning algorithms for image annotation that can scale to learn from and annotate such data. This includes (i) scalable training and testing times and (ii) scalable memory usage. In the ideal case, we would like a fast algorithm that fits on a laptop, at least at annotation time. For many recently proposed models tested on small data sets, it is unclear whether they satisfy these constraints.
In the first part of this work, we study feasible methods for just such a goal. We consider models that learn to represent images and annotations jointly in a low-dimensional embedding space. Such embeddings are fast at testing time, because the low dimension implies fast computations for ranking annotations, and they require little memory for the same reason. To obtain good performance from such a model, we propose to train its parameters by learning to rank, optimizing for the top annotations in the list, for example, optimizing precision at k (p@k). In the second part of this work, we propose a novel algorithm to improve testing time in multiclass classification tasks where the number of classes (or labels) is very large and where even an algorithm linear in the number of classes can become computationally infeasible. We propose an algorithm for learning a tree structure over the labels in the previously proposed joint embedding space, which, by optimizing the overall tree loss, provides superior accuracy to existing tree-labeling methods.
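To make the first part concrete, the following is a minimal sketch of how annotations can be ranked for an image in a joint embedding space. The matrices V and W, the dimensions, and the random initialization are illustrative assumptions; in the model itself, these parameters would be learned with the ranking objective described above.

```python
import numpy as np

rng = np.random.default_rng(0)

d_img, d_emb, n_labels = 1000, 100, 10000  # feature dim, embedding dim, #annotations

# Stand-ins for learned parameters: V maps image features into the
# embedding space, and each row of W is the embedding of one annotation.
V = rng.normal(scale=0.01, size=(d_emb, d_img))
W = rng.normal(scale=0.01, size=(n_labels, d_emb))

def rank_annotations(x, k=10):
    """Project image features x into the embedding space and return the
    indices of the k highest-scoring annotations."""
    phi = V @ x                      # image embedding, shape (d_emb,)
    scores = W @ phi                 # one dot-product score per annotation
    return np.argsort(-scores)[:k]   # top-k annotation indices

def precision_at_k(predicted, relevant, k=10):
    """Fraction of the top-k predicted annotations that are relevant."""
    return len(set(predicted[:k]) & set(relevant)) / k

x = rng.normal(size=d_img)           # a synthetic image feature vector
print(rank_annotations(x, k=5))
```

Note that ranking all annotations this way still costs one dot product per label, which motivates the tree of the second part.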
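For the second part, the following hypothetical sketch illustrates why a tree over the labels reduces annotation time: only the children of each visited node are scored, so the cost grows with the tree depth and branching factor rather than linearly with the number of labels. The dict-based tree, the per-node weight vectors, and the toy labels are all assumptions made for illustration; the actual tree structure and node classifiers would be learned as described above.

```python
import numpy as np

rng = np.random.default_rng(1)
d_emb = 100  # same embedding dimension as in the sketch above

def leaf(labels):
    # A leaf holds a small set of labels and a scoring vector used by its parent.
    return {"w": rng.normal(size=d_emb), "labels": labels, "children": []}

def node(children):
    # An internal node holds a scoring vector and a list of child nodes.
    return {"w": rng.normal(size=d_emb), "labels": None, "children": children}

root = node([
    node([leaf(["cat", "tiger"]), leaf(["dog", "wolf"])]),
    node([leaf(["car", "truck"]), leaf(["boat", "ship"])]),
])

def predict_with_tree(tree, phi):
    """Walk from the root to a leaf, scoring only the children of each
    visited node, and return the small label set stored at that leaf."""
    while tree["children"]:
        tree = max(tree["children"], key=lambda c: c["w"] @ phi)
    return tree["labels"]

phi = rng.normal(size=d_emb)  # an image embedding, e.g. V @ x from above
print(predict_with_tree(root, phi))
```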