On Monday, ChatGPT-maker OpenAI announced it was starting to roll out voice and image recognition in ChatGPT. Essentially, the AI can recognize a picture for what it is and communicate with users about it. Plus, the AI now has speech-to-text and text-to-speech capabilities. All the new features are supposed to make the chatbot seem more, ahem, “human-like” than it did in previous iterations.
OpenAI shared a promo video that’s supposed to offer users an idea of what the image recognition capabilities will look like. In it, a user asks ChatGPT to help him lower his bike seat, to which the chatbot responds with some general (and, if we were being uncharitable, extremely obvious) advice for lowering any kind of seat.
The first-time bike seat user then drew a circle around the bike seat catch and asked for more detailed help, at which point ChatGPT supposedly recognized the type of bolt and told the user he needed an Allen wrench. The system is also supposedly able to look at a picture of the user manual and toolbox to see if the user has the right-sized wrench.
While image recognition is not something many chatbot services have experimented with, we’re already well acquainted with speech recognition systems, as well as voice synthesis. OpenAI teased the chatbot’s new voice services with a video of a mother who asks ChatGPT to read her kids a bedtime story about a particular forest hedgehog (she could just read from an actual picture book, but I guess that’s one way to parent).
Samples included in OpenAI’s blog post do have a natural-ish sounding cadence, though it’s not like the “Juniper,” “Sky,” or “Breeze” voice packs will create unique voices for little Larry the Hedgehog or any of his forest friends. Each voice is based on a voice actor who licensed their sounds to the system, according to OpenAI.
It’s similar to other AI voice synthesis tools from companies like ElevenLabs. That service has been dragged for initially being used for deepfakes and harassment. OpenAI said its first voice services were only being implemented in the ChatGPT voice chat. The company is also licensing its voice systems to Spotify, which on Monday announced new podcast voice translation capabilities. The system should be able to mimic popular podcasters’ voices speaking in Spanish, French, and German to start.
Of course, the new features are only available to users who pay for the Plus or Enterprise service, and both capabilities should be available on iOS and Android within the next two weeks. Users on the web version of ChatGPT should also have image capabilities soon enough. The system also won’t be nearly as fast or as capable as any of those promo videos suggest. Wired reported, based on a pre-release version, that the voice recognition took several seconds to respond, and that the image system won’t try to identify people in photos (we’ll have to wait and see how well the system tries to protect people’s privacy in photos).
In an email to Gizmodo, a spokesperson for OpenAI said they were trying to roll out new features “gradually to allow for improvements and refinement of risk mitigations over time,” something that is even more “crucial” with voice and image recognition.
The other issue with vision-based models is that the chatbot has a whole new arena where it can misinterpret or fail to accurately gauge users’ prompts. OpenAI claimed the company red-teamed this new feature to try to reduce risks, but it will only be a matter of time before users push the ethical boundaries of the chatbot once again.
ChatGPT has watched its total users decline since it first saw massive popularity back in November 2022. Part of the issue is that some users feel like the company has hindered the chatbot’s capabilities as OpenAI has struggled to find some kind of ethical balance between mitigating harms and letting its chatbot users run buck wild.
OpenAI is also facing stiff competition for its chatbot from major tech players such as Meta as well as startups like Anthropic. Google is reportedly set to release its own GPT-4 competitor called “Gemini,” which could also include image and voice recognition capabilities. Last week, OpenAI unveiled its DALL-E 3 AI image generator, which also includes ChatGPT integration. Really, it’s just another company drinking the “natural language” Kool-Aid, thinking that the ability to operate a system using natural language is somehow a replacement for a better-functioning user interface.