For her 38th birthday, Chela Robles and her family made a trek to One House, her favorite bakery in Benicia, California, for a brisket sandwich and brownies. On the car ride home, she tapped a small touchscreen on her temple and asked for a description of the world outside. “A cloudy sky,” the response came back through her Google Glass.
Robles lost the ability to see in her left eye when she was 28, and in her right eye a year later. Blindness, she says, denies you small details that help people connect with one another, like facial cues and expressions. Her dad, for example, tells a lot of dry jokes, so she can’t always be sure when he’s being serious. “If a picture can tell 1,000 words, just imagine how many words an expression can tell,” she says.
In the past, Robles has tried services that connect her to sighted people for help. But in April, she signed up for a trial with Ask Envision, an AI assistant that uses OpenAI’s GPT-4, a multimodal model that can take in images and text and output conversational responses. The system is one of several assistive products for visually impaired people that are beginning to integrate language models, promising to give users far more visual detail about the world around them, and much more independence.
Envision launched in 2018 as a smartphone app for reading text in photos, and came to Google Glass in early 2021. Earlier this year, the company began testing an open source conversational model that could answer basic questions. Then Envision incorporated OpenAI’s GPT-4 for image-to-text descriptions.
Be My Eyes, a 12-year-old app that helps users identify objects around them, adopted GPT-4 in March. Microsoft, which is a major investor in OpenAI, has begun integration testing of GPT-4 for its Seeing AI service, which offers similar functions, according to Sarah Bird, Microsoft’s responsible AI lead.
In its earlier iteration, Envision read out text in an image from start to finish. Now it can summarize text in a photo and answer follow-up questions. That means Ask Envision can now read a menu and answer questions about things like prices, dietary restrictions, and dessert options.
Another Ask Envision early tester, Richard Beardsley, says he typically uses the service to do things like find contact information on a bill or read ingredients lists on boxes of food. Having a hands-free option through Google Glass means he can use it while holding his guide dog’s leash and a cane. “Before, you couldn’t jump to a specific part of the text,” he says. “Having this really makes life a lot easier because you can jump to exactly what you’re looking for.”
Integrating AI into seeing-eye products could have a profound impact on users, says Sina Bahram, a blind computer scientist and head of a consultancy that advises museums, theme parks, and tech companies like Google and Microsoft on accessibility and inclusion.
Bahram has been using Be My Eyes with GPT-4 and says the large language model makes an “orders of magnitude” difference over previous generations of tech because of its capabilities, and because products can be used effortlessly and don’t require technical skills. Two weeks ago, he says, he was walking down the street in New York City when his business partner stopped to take a closer look at something. Bahram used Be My Eyes with GPT-4 to learn that it was a collection of stickers, some cartoonish, along with some text and some graffiti. This level of information is “something that didn’t exist a year ago outside the lab,” he says. “It just wasn’t possible.”