
AI Glasses + Multimodal AI: A New Industry Frontier

Recent tech demos by OpenAI and Google highlight why smart glasses are the ideal platform for AI chatbots. OpenAI showcased its GPT-4o multimodal AI model, followed by Google’s demonstration of Project Astra, which will later integrate into Google’s Gemini. These technologies use video and audio inputs to prompt sophisticated AI chatbot responses.

OpenAI’s demonstration appeared bolder than Google’s, promising public availability within weeks, compared with Google’s vaguer “later this year” timeline. OpenAI also claims its new model is twice as fast and half the cost of GPT-4 Turbo, whereas Google offered no performance or cost comparisons for Astra.

Previously, Meta popularized the term “multimodal” through features in its Ray-Ban Meta glasses, which let users prompt the Meta AI assistant to take a picture and describe what it sees. However, the Ray-Ban Meta’s reliance on still images falls short of the video capabilities demonstrated by OpenAI and Google.

The Impact of Video in Multimodal AI

Multimodal AI integrates text, audio, photos, and video, enabling more human-like interaction. For example, in their respective demos, both Project Astra and OpenAI’s new model analyzed and interpreted what was displayed on a computer screen in real time. Another demo showed GPT-4o using a smartphone camera to describe its surroundings in response to comments and questions from a second GPT-4o instance.
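To make the idea of “multimodal” concrete, here is a minimal sketch of sending a single image alongside a text question to a multimodal chat model via the OpenAI Python SDK. The image URL is a placeholder, and a single still frame is a simplification: the demos described above stream live video and audio rather than one photo at a time.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One request mixing two modalities: a text question and an image.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown on this screen, and what is the person working on?"},
                # Placeholder URL; in practice this could be a camera frame or screenshot.
                {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

Video-capable assistants like Astra or GPT-4o’s live mode effectively extend this pattern to a continuous stream of frames and audio, which is what makes hands-free glasses such a natural fit.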

This technology mirrors human capability, allowing AI to answer questions about objects and people in the physical world. Advertisers are particularly interested in using video in multimodal AI to gauge the emotional impact of ads, as noted by Laurie Sullivan in MediaPost.

The Future of Multimodal AI Points to AI Glasses

The demos by OpenAI and Google indicate a future where multimodal AI with video allows us to interact naturally with our surroundings through AI chatbots. However, using smartphones to show AI what we want it to see is cumbersome. The logical evolution is toward video-enabled AI glasses.

A notable moment in Google’s demo involved a prototype pair of AI glasses taking over a chat session from a smartphone. This transition made the interaction more natural, as the user could look at objects instead of pointing a phone at them.

Despite this progress, it’s unlikely that consumer-ready AI glasses, like hypothetical “Pixel Glasses,” will be available soon. Google’s previous research into translation glasses, which appeared to be shelved, now seems to have been an early prototype for Astra’s features. The translation glasses showcased real-time sign language translation, hinting at video-enhanced multimodal AI capabilities.

Recent developments, including a patent granted to Google for integrating laser projectors into AI glasses, suggest ongoing advancements in AI glasses technology. Companies like Luxottica or Avegant could partner with AI firms to produce branded AI glasses, potentially leading to products like OpenAI Glasses, Perplexity Glasses, or even Hugging Face Glasses.

A massive AI glasses industry is on the horizon, likely to emerge next year. The integration of video into multimodal AI underscores the potential scale of this market.
