Loop Insights

Multimodal Magic

Written by Matt Cyr | Dec 13, 2024 8:56:12 PM

A couple of weeks ago, I was flying to St. Louis for a three-day client workshop and, as I often do, got to wondering what I was flying over. I've always had to guess, but on this trip I decided to try Gemini, Google's multimodal AI tool, to see if it could help me figure out where I was.

I fired up the app, took a photo out the window of the plane, and gave it a simple prompt: "Based on this image, where do you think I am? Talk me through how you came to this conclusion." I wanted to see what Gemini was capable of, but I also wanted to understand the process it used to get there.

Here is Gemini's response:

The image you sent shows a view from an airplane window, with a clear blue sky and a landscape of farmland below. Here's how I can use this information to make an educated guess about your location:

Landscape:

  • The landscape is primarily farmland, with a few patches of trees and water bodies visible. This suggests you are likely flying over a region with a significant agricultural sector.

River:

  • There's a prominent river winding through the landscape. The presence of a major river can help narrow down the possibilities.

Terrain:

  • The terrain appears relatively flat, with gentle slopes. This rules out mountainous regions.

Time of Day:

  • The sun's position suggests it's early morning or late afternoon.

Based on these clues, here are a few regions that come to mind:

  • Midwestern United States: Known for its vast agricultural plains and major rivers like the Mississippi.
  • Canadian Prairies: Similar to the Midwest in terms of landscape and agricultural focus.
  • European Plains: Countries like France, Germany, and Poland have extensive agricultural areas and river systems.

As you can see, Gemini quickly narrowed down a few parts of the world where I might be, but told me it needed more information to pinpoint my location more accurately. So I took a second photo, this one showing a river with a bridge crossing it, and told Gemini that it was mid-November.

The second response narrowed my location significantly, quickly determining that the river was likely the Mississippi based on the following observations:

  • River: The river visible in the image is likely the Mississippi River. Its distinctive bend and the surrounding terrain match the Mississippi's characteristics.
  • Time of Year: It's mid-November, which means the landscape would be transitioning towards autumn colors. However, the image shows mostly green fields, suggesting a region with a milder climate where foliage changes later in the season. 

Possible Locations:

Based on these observations, the most likely location for the image is the Southern United States, specifically the region around Mississippi, Louisiana, or Arkansas. The Mississippi River is prominent in these states, and the climate in the South is generally milder, which would explain the later foliage change.

Once again, Gemini asked me for a bit more context, so I took a third photo, this one showing a large bend in the river below. I also told Gemini that it was 3:52 in the afternoon and asked it to tell me exactly where I was.

“The third image shows a distinctive bend in a river, which is a strong clue. This bend is likely the ‘Great Bend’ of the Mississippi River, located near Memphis, Tennessee.”

A quick look at the flight map on the back of the seat in front of me confirmed that I was, in fact, flying over Memphis at that moment.

My jaw dropped. Using only three (not very good) photos and a little bit of additional context, Gemini pegged exactly where I was in less than two minutes.

I revisited this experience yesterday when I saw that Google had released Gemini 2.0, a further step forward in multimodal capabilities. If you're not familiar with the term, multimodal means that an AI tool can both take in and produce many different kinds of information.

When ChatGPT launched, it was simply a text-to-text tool: you typed in a text prompt, and it spit out a text response. But with multimodal models, you can now do text-to-code, voice-to-image, image-to-video, spreadsheet-to-graph, or, as in my case, “image-to-where-in-the-world-is-Matt?” The list goes on…
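To make that concrete, here's a minimal sketch of the kind of image-plus-text request I made from my seat, written against Google's google-generativeai Python SDK. The model name, file path, and API key are placeholders, and this is an illustration of the multimodal pattern rather than the exact code behind the Gemini app:

```python
# A minimal sketch of a multimodal (image + text) prompt to Gemini.
# Assumes the google-generativeai Python SDK; key, model name, and
# file path below are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder; use your own key

# Any Gemini model that accepts images will do; the name is a placeholder.
model = genai.GenerativeModel("gemini-1.5-flash")

photo = Image.open("window_seat.jpg")  # photo taken out the plane window
prompt = (
    "Based on this image, where do you think I am? "
    "Talk me through how you came to this conclusion."
)

# One call, two modalities: the image and the text prompt travel together.
response = model.generate_content([photo, prompt])
print(response.text)
```

The point is simply that the image and the text ride in the same request, and the model reasons over both at once.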

So what does this mean? As we saw on 60 Minutes last weekend, multimodal AI can help Anderson Cooper draw a more accurate human body. It can teach the OpenAI team how to make pour-over coffee. It can even teach you how to make text “wiggle” in Adobe After Effects.

The same 60 Minutes segment that showed Anderson Cooper learning which side of the body the liver is on also highlighted Khanmigo, the new AI-driven tutoring tool from Khan Academy, the education nonprofit that has revolutionized online learning. The goal of Khanmigo is to provide one-on-one, real-time tutoring, freeing teachers to focus on each student's specific learning needs rather than on the time-consuming process of reviewing and grading dozens or even hundreds of papers and tests.

Rather than making teachers obsolete, it puts them back into the role of truly teaching.

Imagine a multimodal AI that can look at a student’s work and provide real-time feedback that helps them learn more effectively. Or an AI that can be used in an operating room to avoid surgical complications. Or an AI that can look at an airplane engine and determine if it needs maintenance.

There’s been a lot of talk this year about AI agents, but the truth is that agents aren’t ready for prime time and may not be for a while. Multimodal AI, on the other hand, is here now, and I don’t think it’s an exaggeration to say its potential is nearly limitless.