I am going to claim, with no justification whatsoever, that this image of AlexNet was as common at CVPR this year as the Lena image. Given that AlexNet is a line drawing and Lena is a 1972 Playboy model, this would be quite an achievement.
One of the key developments in CNNs this year was the increasing use of feature representations that include low and mid-level convolutional layer activations, not just output layer activations. The intuition behind this approach seems to be that while the output layers encode the ‘what’ of the problem, the lower layers encode the ‘where’. Certainly, for tasks that include a localisation component, such as image segmentation, this now looks like the dominant approach. I counted at least four papers that, at least at first glance, appeared to contain variations on this common idea, which was variously described as ‘hypercolumns’, ‘deep jets’, ‘zoom-out features’ and ‘the treasure beneath convolutional layers’.
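To make the idea concrete, here is a minimal sketch (not taken from any of the papers above) of extracting ‘hypercolumn’-style features with a pre-trained VGG-16 in PyTorch: activations from several convolutional layers are upsampled to the input resolution and stacked per pixel. The particular layer indices are my own illustrative choices.

```python
# A minimal sketch of the 'hypercolumn' idea: for each pixel, stack upsampled
# activations from several convolutional layers of a pre-trained CNN.
# The layer indices are illustrative choices for torchvision's VGG-16,
# not the exact layers used in any of the papers mentioned above.
import torch
import torch.nn.functional as F
from torchvision import models

def hypercolumn_features(image, layer_indices=(3, 8, 15, 22)):
    """image: (1, 3, H, W) tensor; returns (1, C_total, H, W) hypercolumns."""
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
    feats = []
    x = image
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in layer_indices:
                # Upsample each selected feature map back to the input resolution
                feats.append(F.interpolate(x, size=image.shape[-2:],
                                           mode='bilinear', align_corners=False))
    return torch.cat(feats, dim=1)  # per-pixel stack of multi-level activations

# Example: a 224x224 RGB image yields a dense per-pixel descriptor map
hc = hypercolumn_features(torch.randn(1, 3, 224, 224))
print(hc.shape)
```

The papers differ in which layers they pool, how they normalise the activations and whether they fine-tune the base network, but the core idea is this same stacking of multi-level features.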
Looking at these papers side-by-side, the most fascinating observation for me is that the best performance on Pascal VOC 2012 semantic segmentation (achieved by ‘zoom-out features’, by a significant margin) was obtained using a network pre-trained on ImageNet, as-is. The base network was not even fine-tuned on the task at hand. This certainly illustrates the power of using activations from every layer of a CNN.
This year again featured an oral on reconstructing 3D from a single image (which was awarded best student paper), as well as a full-day workshop on the same task. It certainly seems that increasing attention is being focused on a challenge that only a few years ago would have been considered too difficult. Although current approaches require ground-truth segmentations and keypoints for the training images, no doubt many researchers are working on relaxing these requirements. It will certainly be interesting to follow progress in this area.
Some interesting work is being done on how to visualise feature activations within deep networks. This paper (Understanding Deep Image Representations by Inverting Them) proposed an optimisation method to sample the space of possible reconstructions for a given set of feature activations. This approach reveals some of the photometric and geometric invariances captured at each layer of the net, as illustrated in the image below. I also caught a fascinating invited workshop talk by Antonio Torralba that touched on this topic. Check out this really nice interactive visualization of a deep network by his group at MIT.
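For intuition, here is a hedged sketch of the feature-inversion idea: start from noise and optimise the image so that its activations at some layer match those of a target image, with a small total-variation regulariser to keep the reconstruction natural-looking. The layer choice, learning rate, step count and regularisation weight below are placeholder assumptions, not the paper’s exact settings.

```python
# A rough sketch of feature inversion: optimise a random image so that its
# activations at one layer of a pre-trained VGG-16 match a target image's
# activations. All hyperparameters here are illustrative placeholders.
import torch
import torch.nn.functional as F
from torchvision import models

def invert_features(target_image, layer_index=15, steps=200, tv_weight=1e-4):
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    net = vgg.features[:layer_index + 1].eval()
    for p in net.parameters():
        p.requires_grad_(False)
    with torch.no_grad():
        target_feats = net(target_image)

    x = torch.randn_like(target_image, requires_grad=True)  # start from noise
    opt = torch.optim.Adam([x], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        feat_loss = F.mse_loss(net(x), target_feats)
        # Total-variation regulariser nudges the reconstruction towards
        # piecewise-smooth, natural-looking images
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
             (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
        (feat_loss + tv_weight * tv).backward()
        opt.step()
    return x.detach()

# Example: reconstruct an image from mid-level activations of a random target
reconstruction = invert_features(torch.randn(1, 3, 224, 224))
```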
It seems like rapid progress is being made on the problem of generating natural language image descriptions (aka image captioning); see, for example, this paper (Deep Visual-Semantic Alignments for Generating Image Descriptions). I’ve heard it argued around the lunch table that this isn’t an important problem. Whether that’s true or not, it’s certainly a good yardstick for progress. I’m looking forward to downloading Andrej Karpathy’s code from this paper and testing it out.
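As a point of reference (and emphatically not Karpathy’s implementation), a bare-bones captioning model is just a pre-trained CNN encoder feeding an LSTM decoder over words. The sketch below uses placeholder vocabulary and layer sizes of my own choosing.

```python
# A generic CNN-encoder / LSTM-decoder captioner, just to make the task
# concrete. Vocabulary size, embedding size and hidden size are arbitrary
# placeholder values, not those of any published model.
import torch
import torch.nn as nn
from torchvision import models

class SimpleCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # Drop the final classification layer to keep 4096-d fc7 features
        cnn.classifier = nn.Sequential(*list(cnn.classifier.children())[:-1])
        self.encoder = cnn
        self.img_proj = nn.Linear(4096, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        img_feat = self.encoder(images)                   # (B, 4096)
        img_embed = self.img_proj(img_feat).unsqueeze(1)  # (B, 1, E)
        word_embeds = self.embed(captions)                # (B, T, E)
        # Condition the LSTM on the image by feeding it as the first "word"
        inputs = torch.cat([img_embed, word_embeds], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                           # (B, T+1, vocab)

model = SimpleCaptioner()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```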
Yoshua Bengio’s invited talk at the deep learning workshop is chock-full of great references to some of the recent theory papers on deep learning.
Raquel Urtasun and Marc Pollefeys gave some really interesting talks on graphical models (including combining with deep networks). If I can find links I’ll post them, and then try and understand them!
There were many other great talks and papers as well; this is just a sample. What else did I miss?