Multimodal Magic
A couple weeks ago, I was flying to St. Louis for a three-day client workshop and, like I often do, got to wondering what I was flying over. I've always had to guess, but on this trip, I decided to try out Gemini, Google’s AI-powered multimodal tool, to see if it could help me figure out where I was.
I fired up the app, took a photo out the window of the plane, and gave it a simple prompt: “Based on this image, where do you think I am? Talk me through how you came to this conclusion.” I wanted to see what Gemini was capable of, but I also wanted to understand the process it took to get there.
Here is Gemini's response:
“The image you sent shows a view from an airplane window, with a clear blue sky and a landscape of farmland below. Here's how I can use this information to make an educated guess about your location.”
From there, Gemini walked through the clues it was working with (the landscape, the river, the terrain, and the time of day), suggested a few regions that came to mind, and asked me for a bit more context.
The second response narrowed my location significantly, walking through a new set of observations, quickly determining that the river was likely the Mississippi, and then turning to possible locations:
Based on these observations, the most likely location for the image is the Southern United States, specifically the region around Mississippi, Louisiana, or Arkansas. The Mississippi River is prominent in these states, and the climate in the South is generally milder, which would explain the later foliage change.
Once again, Gemini asked me for a bit more context, so I took a third photo, this one showing a large bend in the river below. I also told Gemini that it was 3:52 in the afternoon and asked it to tell me exactly where I was.
“The third image shows a distinctive bend in a river, which is a strong clue. This bend is likely the "Great Bend" of the Mississippi River, located near Memphis, Tennessee.”
A quick look at the flight map on the back of the seat in front of me confirmed that I was, in fact, flying over Memphis at that moment.
My jaw dropped. Using only three (not very good) photos and a little bit of additional context, Gemini pegged exactly where I was in less than two minutes.
I revisited this experience yesterday when I saw that Google had released Gemini 2.0, a further step forward in multimodal capabilities. If you’re not familiar with the term, multimodal means that an AI tool can both take in and produce many different kinds of information.
When ChatGPT launched, it was simply a text-to-text tool: you typed in a text prompt, it spit out a text response. But with multimodal, you can now do text-to-code, voice-to-image, image-to-movie, spreadsheet-to-graph, or, as in my case, “image-to-where-in-the-world-is-Matt?” The list goes on…
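If you’re curious what that looks like under the hood, here’s a rough sketch of the kind of request my airplane experiment boils down to: one image plus one text prompt sent to a multimodal model in a single call. This is just an illustration using Google’s generative AI Python SDK; the model name, API key, and photo filename are placeholders I made up, not what the Gemini app actually uses.

```python
# Minimal sketch: send an image and a text question to a multimodal model
# in one request, using the google-generativeai Python SDK.
# The API key, model name, and filename below are illustrative assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name
window_photo = Image.open("window_seat.jpg")       # hypothetical photo

# One request carries both modalities: the photo and the question about it.
response = model.generate_content([
    window_photo,
    "Based on this image, where do you think I am? "
    "Talk me through how you came to this conclusion.",
])
print(response.text)
```

The point isn’t the specific library; it’s that the prompt is no longer just text. The model reasons over the pixels and the words together, which is exactly what made the window-seat guessing game possible.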
So what does this mean? As we saw on 60 Minutes last weekend, multimodal can help Anderson Cooper draw a more accurate human body. It can teach the OpenAI team how to make pour-over coffee. It can even teach you how to make text “wiggle” in Adobe After Effects.
The same 60 Minutes segment that showed Anderson Cooper learning which side of the body the liver is on also highlighted Khanmigo, the new AI-driven tutoring tool from Khan Academy, the education nonprofit that has revolutionized online learning. The goal of Khanmigo is to provide one-on-one, real-time tutoring, freeing teachers to focus on each student’s specific learning needs rather than on the time-consuming process of reviewing and grading dozens or even hundreds of papers and tests.
Rather than making teachers obsolete, it puts them back into the role of truly teaching.
Imagine a multimodal AI that can look at a student’s work and provide real-time feedback that helps them learn more effectively. Or an AI that can be used in an operating room to avoid surgical complications. Or an AI that can look at an airplane engine and determine if it needs maintenance.
There’s been a lot of talk this year about AI agents, but the truth is that agents aren’t ready for primetime and may not be anytime soon. Multimodal AI, on the other hand, is here, and I don’t think it’s an exaggeration to say that its potential is nearly limitless.