> Rear window decal clearly reads “www.taxilinder.at”. A quick lookup shows Taxi Linder GmbH is based in Dornbirn, Vorarlberg.
That's cheating. If it can use web search, it isn't playing fair. Obviously you can get a perfect score on any urban GeoGuessr round by looking up a couple businesses, but that isn't the point.
I encourage everyone to try GeoGuessr! I love it.
I'm seeing a lot of comments saying that the fact that the o3 model used web search in 2 of 5 rounds made this unfair, and the results invalid.
To determine if that's true, I re-ran the two rounds where o3 used search, and I've updated the post with the results.
Bottom line: It changed nothing. The guesses were nearly identical. You can verify the GPS coordinates in the post.
Here's an example of why it didn't matter. In the Austria round, check out how the model identifies the city based on the mountain in the background:
https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
It already has so much information that it doesn't need the search.
Would search ever be useful? Of course it would. But in this particular case, it was irrelevant.
Masters is about 800-1200 Elo, whereas the pros are around 1900-2000. I'll know the country straight away on 95% of rounds, but I can still have no idea where I am in Russia or Brazil sometimes if there's no info. Scripters can definitely beat me!
> I’m sure there are areas where the location guessing can be scary accurate, like the article managed to guess the exact town as its backup guess. But seeing the chain of thought, I’m confident there are many areas that it will be far less precise. Show it a picture of a trailer park somewhere in Kansas (exclude any signs with the trailer park name and location) and I’ll bet the model only manages to guess the state correctly.
This post, while not a big sample size, reflects how I would expect these models to perform. The model managed to be reliable with guessing the right country, even in pictures without a lot of visual information (I'll claim that getting the country correct in Europe is roughly equivalent to guessing the right state in the USA). It does sometimes manage to get the correct town, but this is not a reliable level of accuracy. The previous article only tested on one picture and it happened to get the correct town as its second guess and the author called it "scary accurate." I suppose that's a judgement call. To me, I've grown to expect that people can identify what country I'm in from a variety of things (IP address, my manner of speech, name, etc.), so I don't think that is "scary."
I will acknowledge that o3 with web search enabled seems capable of playing GeoGuessr at a high level, because that is less of a judgement call. What I want to see now is an o3 GeoGuessr bot playing many matches so we can see what its Elo is.
However, when there are not many photos of the place online, it gets close but then stops digging deeper and instead tries to pattern-match against its corpus / the internet.
One example was a once-popular trail on an island that no longer exists; it has been overgrown since 2020. It first said that the rocks were typical of an island and that the vegetation looked Brazilian, but then it ignored its own hunch and went looking for places in Rio de Janeiro instead.
Another one was a popular beach known for its natural pools during low tides. I took a photo during high tide, when no one posts pictures. It captured the vegetation and the state correctly. But then it started to search for more popular places elsewhere again.
>> I wonder what would happen if you put in fake EXIF information and asked it to do the same. (We are deliberately misleading the LLM)
Yay, that was me [1], and my comment was actually downvoted for most of its time. But thank you for testing out my theory.
What I've realised over the years is that comments do get read by people and do shape other people's thoughts.
I honestly don't think looking things up online is cheating. Maybe in terms of the game, but in a real-life situation, which is most of the time, it is absolutely the right thing to do. The chain of thought is scary. I still don't know anything about how AI works other than the old "garbage in, garbage out." But CoT is definitely something else. Even though the author said it sometimes does needless work, in terms of computing resources I'm not even sure it matters as long as it's accurate. And it's more evidence that maybe, just maybe, AI taking over the world is much closer than I imagined.
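For anyone who wants to run that fake-EXIF experiment themselves: GPS coordinates in EXIF are stored as degrees/minutes/seconds rational pairs, so the first step is converting the decimal coordinates you want to plant. A minimal stdlib-only sketch of that conversion (the coordinates and the centi-second precision here are my own illustrative choices, not anything from the thread):

```python
def to_exif_dms(decimal_deg):
    """Convert a positive decimal coordinate to the EXIF-style
    degrees/minutes/seconds rational triple ((num, den), ...).
    The hemisphere goes in a separate tag (GPSLatitudeRef / GPSLongitudeRef)."""
    deg = int(decimal_deg)
    minutes_full = (decimal_deg - deg) * 60
    minutes = int(minutes_full)
    # store seconds with two decimal places as a rational over 100
    seconds = round((minutes_full - minutes) * 60 * 100)
    return ((deg, 1), (minutes, 1), (seconds, 100))

# Dornbirn (the taxi-decal town above) is roughly 47.4125 N, 9.7417 E;
# planting, say, a Kansas coordinate instead would test whether the
# model trusts the tags over the visual evidence.
print(to_exif_dms(47.4125))  # ((47, 1), (24, 1), (4500, 100))
```

The resulting triples can then be written into a test image with a real tool such as exiftool or the piexif library before handing the photo to the model.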
I gave it a (stacked) shot of M13, with date & time. It immediately recognized M13 (no search), figured out the shot also included NGC 6207 (which is already pretty impressive).
It further figured out the shot was rotated. (True, I was going for a specific field of view.)
That was enough to pinpoint 37-38 degrees latitude.
From there, it inferred from the light pollution that it's probably the Bay Area. (Yes, yes it is.)
Furthermore, still based on light pollution, pinpointed I'm in a Bortle 4-5 area (also correct) and narrowed it down to "South Bay/Palo Alto" (still correct)
Given that this was a stacked, post-processed/color-corrected image, that inference is still pretty damn impressive.
And, fwiw, 4o gets all the way to "huh, 35-40 deg latitude", so that's a good improvement.
[Image link](https://photos.app.goo.gl/2P7NFKn8ZycNhrXn7) here if you want to try
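For the curious, the latitude step plausibly falls out of basic spherical astronomy: an object of declination δ culminates at altitude 90° − |φ − δ|, so knowing M13's declination (about +36.5°) plus a rough idea of how high it sat in the frame brackets the observer's latitude φ. A hedged sketch of that arithmetic (the altitude figure is an illustrative assumption, not something stated in the comment above):

```python
def latitudes_from_culmination(dec_deg, h_max_deg):
    """Invert h_max = 90 - |lat - dec| (all in degrees) to get the two
    candidate latitudes consistent with an observed culmination altitude."""
    offset = 90.0 - h_max_deg
    return (dec_deg - offset, dec_deg + offset)

# M13's declination is about +36.46 deg. If the framing implies it
# culminated within ~3 deg of the zenith, latitude is pinned to a
# narrow band around +36.5 -- consistent with the "37-38 deg" estimate.
lo, hi = latitudes_from_culmination(36.46, 87.0)
print(lo, hi)  # roughly 33.46 and 39.46
```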
If the experiment was premised on that option isolating the question, it may have been flawed. I found o3's accuracy in my ChatGPT went way down when I cleared personalization and deleted all past chats (turning off extended memory would have been equivalent, I think).
Importantly, only once did the o3 reasoning mention it was fishing from my past chats (that's what clued me in that I'd messed up the isolation), but the guess rate was still radically different from all the times before once I cleaned house. That suggests to me that it was quietly looking before, and it just didn't make the cut for explicitly saying so.
That being said I noticed two things that probably hamper its performance - or make its current performance even more amazing - depending how you look at it:
- It often tries to zoom in to decipher even minuscule text. This works brilliantly. Sometimes it tries to enhance contrast by turning the image into black and white at various threshold levels to improve the results, but in my examples it always went in the wrong direction. For example, the text was blown out white, it failed, it turned the image even lighter instead of darker, failed again, ended up with a white rectangle, and gave up on the approach.
- It seems not to have any access to Google Maps or even OpenStreetMap and therefore fails to recognize street patterns. This is even more baffling than the first point, because it is so unlike how I imagine human geo guessers work.
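The thresholding failure in the first point is easy to reproduce: for faint text blown out toward white, a binarization threshold has to move toward white (up), not away from it, or every pixel lands on the background side. A minimal pure-Python sketch (the pixel values are made up for illustration):

```python
def binarize(pixels, threshold):
    """Map grayscale values (0-255) to pure black/white:
    below the threshold -> 0, otherwise -> 255."""
    return [0 if p < threshold else 255 for p in pixels]

# Blown-out text: background ~255, glyph pixels only slightly darker.
strip = [255, 255, 241, 238, 255, 244, 255]

# A default mid threshold erases the text entirely -- the model's
# "turn it even lighter" attempts have the same effect.
print(binarize(strip, 128))  # [255, 255, 255, 255, 255, 255, 255]

# Raising the threshold toward white (i.e. effectively darkening)
# recovers the glyphs as black pixels.
print(binarize(strip, 250))  # [255, 255, 0, 0, 255, 0, 255]
```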
Machine learning could index millions of faces, and then identify members of that set from pictures. Could you memorize millions of people, to be able to put a name to a face?
Why not also compete against `grep -r` to see who can find matches for a regex faster across your filesystem?
But unlike a GeoGuessr player, it uses web search [1].

[1] https://youtu.be/P2QB-fpZlFk?si=7dwlTHsV_a0kHyMl
>"I also notice Cyrillic text on a sign"
Am I missing this somewhere? Is the model hallucinating this?
I'd also be very interested to see a comparison against 4o. 4o was already quite good at GeoGuessr-style tasks. How big of a jump is o3?
or Dubai in 1997 https://www.youtube.com/watch?v=JMNXXiiDRhM
There could even be GeoGuessr-style competitions, which could significantly help move the needle, at least as a copilot if not an outright mass-identification tool.
If yes, that means we can employ AI to find faked images?
1) o3 cheated by using Google search. This is against the rules of the game, and OP didn't use search either.
2) OP was much quicker. They didn't record their time, but if their final summary is accurate then they were much faster.
It's an apples to oranges comparison. They're both fruit and round, but you're ignoring obvious differences. You're cherry picking.
The title is fraudulent as you can't make a claim like that when one party cheats.
I would find it surprising if OP didn't know these rules considering their credentials. Doing this kind of clickbait completely undermines a playful study like this.
Certainly o3 is impressive, but by exaggerating its capabilities you taint any impressive feats with deception. It's far better to undersell than oversell. If something is better than expected, people are happier, even if the thing is crap. But if you oversell, people are angry and feel cheated, even if the thing is revolutionary. I don't know why we insist on doing this in tech, but if you're wondering why so many people hate "tech bros", this is one of the reasons. There's no reason to lie here either! Come on! We can't just normalize this behavior. It just creates a reasonable expectation for people to distrust technology and anything tech people say. It's pretty fucked up. And no, I don't think "it's just a blog post" makes it any better. It makes it worse, because it normalizes the behavior. There are other reasons to distrust big corporations; I don't want to live in a world where we have to have our guards up all the time.
...so what? Is memorization considered intelligence? Calculators have similar properties.
GeoGuessr is the modern nerd's Rubik's Cube. The latest in "explore the world without risk of a sunburn".
feels terrifying, especially for women.