Advances in artificial intelligence have led to the development of machine-learning algorithms that can recognize cats in photos and translate websites into different languages. Now, Facebook’s researchers are working out how to teach computers to ‘see.’
Larry Zitnick, lead researcher on the COCO (Common Objects in Context) project at Facebook, says people should not worry “about AI taking over the world” in the near future, as computers still struggle to tell whether a picture has an actual person in it or shows someone standing in front of a mirror taking a selfie.
However, the computer gets it right when an image contains a man surfing a wave or a giraffe standing in the grass next to a tree, Zitnick said. This means that users can search for images of “surfing” men, which advertisers may use to promote beach holidays.
“You need a far deeper understanding about the world, you need to understand how the image was actually taken,” said Zitnick, who added that he thinks a lot about “what is the role of computer vision in AI.” He made the comments to an audience at the LDV Vision Summit in New York on Tuesday.
Zitnick said that back in 1984, the thinking around AI was focused on how to develop recognition techniques. Researchers understood that they needed to collect huge amounts of data and find really fast computers. They worked out the algorithms for machine learning and collected huge amounts of data.
“[That] took a lot longer…about 30 years to collect enough data to begin learning,” said Zitnick. “In 2016 we find ourselves asking the question ‘How are we going to solve AI, so which direction do we need to go in to solve AI?’”
Facebook researchers started out with an image caption generator.
“We realized while this was exciting but it didn’t work when you don’t have the same images in the dataset, and if the images are too unique, some of the algorithms start falling apart,” Zitnick said.
He said more data was not the solution, so they looked at making the recognition problems harder, which is when they tried images with mirrors. With a simple image of a room with a mirror in the background, the image caption generator works: the computer can identify a mirror in the room. However, if someone is standing in front of the mirror and taking a selfie, the computer cannot recognize that there is a mirror present.
Zitnick said they are exploring the best approach for AI to learn context, but they are just at the beginning stages. They still have to determine which learning methods will work best: supervised, semi-supervised, reinforcement learning (with rewards) or unsupervised.
With supervised learning, there have been advances so far using ImageNet and COCO, but creating the dataset is incredibly difficult and frustrating. Zitnick said that if a computer is asked to recognize and caption ‘yellow,’ it might be able to identify yellow in an image of an isolated object, but not in an image of a bunch of bananas. In that case, the computer would forget to mention the color ‘yellow.’
He said a recent paper proposed a question by giving three statements: “Mary went into the hallway,” “John moved to the bathroom” and “Mary traveled to the kitchen.” The computer was asked, “Where is Mary?”
“Computers have a really hard time of answering that question. This is really trivial and yet [computers] cannot do it because [computers] can’t understand what these statements are actually saying,” said Zitnick. “And people are worried about AI taking over the world!”
LDV Vision Summit is an annual event organized by Evan Nisselson, investor and entrepreneur. The summit brings together technologists, visionaries, startups, executives and investors with the purpose of exploring how imaging and video technologies will empower or disrupt businesses and society.
R.Myles @RebMyles