During GTC 2015, we were given a demonstration of automated image captioning with convolutional and recurrent neural networks. Using nothing but a database of images, the platform was able to describe what it saw onscreen. Sometimes the results were astonishingly accurate. Other times, not so much. In other words, the technology can turn your computer into something brilliant yet handicapped — a bit like Rain Man.
New developments in deep learning are allowing computers to accurately perceive what they “see” in photos and translate it into text-based descriptions — just like a human would. For example, a 19-layer-deep network trained on the ImageNet database is capable of identifying and naming a subspecies of animal by filtering through millions of similar photos.
However, the degree of perceptiveness is entirely dependent on the breadth of images in the training data. This can lead to some unusual interpretations. (For example, if the platform has only been trained on images of cats and dogs, it will attempt to classify all animal photos as one or the other.)
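To see why that happens, here's a minimal, hypothetical sketch (not the actual demo code) of how such a classifier picks a label: it scores every class it knows, converts the scores to probabilities, and answers with the most likely one — so a photo of a bird still comes back as "cat" or "dog" if those are the only labels it has ever seen.

```python
import math

# Hypothetical two-class model: it only knows "cat" and "dog",
# so every input is forced into one of those labels.
CLASSES = ["cat", "dog"]

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return the most probable known class and its probability."""
    probs = softmax(logits)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best], probs[best]

# A bird photo might score weakly on both classes, but the
# classifier still answers with one of the two labels it knows.
label, confidence = classify([0.3, 0.1])
print(label, round(confidence, 2))  # -> cat 0.55
```

The point isn't the math — it's that the output vocabulary is fixed at training time, which is why a network trained on millions of ImageNet photos seems so much more perceptive than one trained on a narrow set.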
We witnessed an unsimulated demonstration of automated image captioning during Nvidia’s GTC 2015 keynote. The results were both astonishing and kind of hilarious. Check out the above video for some interesting examples.
Gizmodo travelled to GTC 2015 in San Jose, California as a guest of Nvidia.