> Rear window decal clearly reads “www.taxilinder.at”. A quick lookup shows Taxi Linder GmbH is based in Dornbirn, Vorarlberg.
That's cheating. If it can use web search, it isn't playing fair. Obviously you can get a perfect score on any urban GeoGuessr round by looking up a couple businesses, but that isn't the point.
I encourage everyone to try GeoGuessr! I love it.
I'm seeing a lot of comments saying that the fact that the o3 model used web search in 2 of 5 rounds made this unfair, and the results invalid.
To determine if that's true, I re-ran the two rounds where o3 used search, and I've updated the post with the results.
Bottom line: It changed nothing. The guesses were nearly identical. You can verify the GPS coordinates in the post.
Here's an example of why it didn't matter. In the Austria round, check out how the model identifies the city based on the mountain in the background:
https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
It already has so much information that it doesn't need the search.
Would search ever be useful? Of course it would. But in this particular case, it was irrelevant.
Masters is about 800-1200 Elo, whereas the pros are around 1900-2000. I'll know the country straight away on 95% of rounds, but I can still have no idea where I am in Russia or Brazil sometimes if there's no info. Scripters can definitely beat me!
> I’m sure there are areas where the location guessing can be scary accurate, like the article managed to guess the exact town as its backup guess. But seeing the chain of thought, I’m confident there are many areas that it will be far less precise. Show it a picture of a trailer park somewhere in Kansas (exclude any signs with the trailer park name and location) and I’ll bet the model only manages to guess the state correctly.
This post, while not a big sample size, reflects how I would expect these models to perform. The model managed to be reliable with guessing the right country, even in pictures without a lot of visual information (I'll claim that getting the country correct in Europe is roughly equivalent to guessing the right state in the USA). It does sometimes manage to get the correct town, but this is not a reliable level of accuracy. The previous article only tested on one picture and it happened to get the correct town as its second guess and the author called it "scary accurate." I suppose that's a judgement call. To me, I've grown to expect that people can identify what country I'm in from a variety of things (IP address, my manner of speech, name, etc.), so I don't think that is "scary."
I will acknowledge that o3 with web search enabled seems capable of playing GeoGuessr at a high level, because that is less of a judgement call. What I want to see now is an o3 GeoGuessr bot playing many matches so we can see what its Elo is.
However, when there are not many photos of the place online, it gets close but then stops digging deeper and instead tries to pattern-match against its corpus / the internet.
One example was a once-popular trail on an island that no longer exists; it has been overgrown since 2020. It first said that the rocks were typical of an island and that the vegetation looked Brazilian, but then it ignored its own hunch and went looking for places in Rio de Janeiro instead.
Another one was a popular beach known for its natural pools during low tides. I took a photo during high tide, when no one posts pictures. It captured the vegetation and the state correctly. But then it started to search for more popular places elsewhere again.
>> I wonder what would happen if you put in fake EXIF information and asked it to do the same. (We are deliberately misleading the LLM)
Yay, that was me [1], and my comment was actually downvoted for most of its time. But thank you for testing out my theory.
What I've realised over the years is that comments do get read by people and do shape other people's thoughts.
I honestly don't think looking things up online is cheating. Maybe in terms of the game, but in a real-life situation, which is most of the time, it is absolutely the right thing to do. The chain of thought is scary. I still don't know anything about how AI works other than the old "garbage in, garbage out." But CoT is definitely something else. Even though the author said it sometimes does needless work, in terms of computing resources I'm not even sure it matters as long as it's accurate. And it's more evidence that maybe, just maybe, AI taking over the world is much closer than I imagined.
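For anyone who wants to run that fake-EXIF experiment themselves: GPS coordinates in EXIF are stored as degrees/minutes/seconds rational pairs, so the first step is converting the decimal coordinates you want to plant. A minimal stdlib-only sketch of that conversion (the coordinates and the centi-second precision here are my own illustrative choices, not anything from the thread):

```python
def to_exif_dms(decimal_deg):
    """Convert a positive decimal coordinate to the EXIF-style
    degrees/minutes/seconds rational triple ((num, den), ...).
    The hemisphere goes in a separate tag (GPSLatitudeRef / GPSLongitudeRef)."""
    deg = int(decimal_deg)
    minutes_full = (decimal_deg - deg) * 60
    minutes = int(minutes_full)
    # store seconds with two decimal places as a rational over 100
    seconds = round((minutes_full - minutes) * 60 * 100)
    return ((deg, 1), (minutes, 1), (seconds, 100))

# Dornbirn (the taxi-decal town above) is roughly 47.4125 N, 9.7417 E;
# planting, say, a Kansas coordinate instead would test whether the
# model trusts the tags over the visual evidence.
print(to_exif_dms(47.4125))  # ((47, 1), (24, 1), (4500, 100))
```

The resulting triples can then be written into a test image with a real tool such as exiftool or the piexif library before handing the photo to the model.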
I gave it a (stacked) shot of M13, with date & time. It immediately recognized M13 (no search), figured out the shot also included NGC 6207 (which is already pretty impressive).
It further figured out the shot was rotated. (True, I was going for a specific field of view.)
That was enough to pinpoint 37-38 degrees latitude.
From there, it inferred from the light pollution that it's probably the Bay Area. (Yes, yes it is.)
Furthermore, still based on light pollution, pinpointed I'm in a Bortle 4-5 area (also correct) and narrowed it down to "South Bay/Palo Alto" (still correct)
Given that this was a stacked, post-processed/color-corrected image, that inference is still pretty damn impressive.
And, fwiw, 4o gets all the way to "huh, 35-40 deg latitude", so that's a good improvement.
[Image link](https://photos.app.goo.gl/2P7NFKn8ZycNhrXn7) here if you want to try
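For the curious, the latitude step plausibly falls out of basic spherical astronomy: an object of declination δ culminates at altitude 90° − |φ − δ|, so knowing M13's declination (about +36.5°) plus a rough idea of how high it sat in the frame brackets the observer's latitude φ. A hedged sketch of that arithmetic (the altitude figure is an illustrative assumption, not something stated in the comment above):

```python
def latitudes_from_culmination(dec_deg, h_max_deg):
    """Invert h_max = 90 - |lat - dec| (all in degrees) to get the two
    candidate latitudes consistent with an observed culmination altitude."""
    offset = 90.0 - h_max_deg
    return (dec_deg - offset, dec_deg + offset)

# M13's declination is about +36.46 deg. If the framing implies it
# culminated within ~3 deg of the zenith, latitude is pinned to a
# narrow band around +36.5 -- consistent with the "37-38 deg" estimate.
lo, hi = latitudes_from_culmination(36.46, 87.0)
print(lo, hi)  # roughly 33.46 and 39.46
```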
If the experiment was premised on that option isolating the question, it may have been flawed. I found o3's accuracy in my ChatGPT went way down when I cleared personalization and deleted all past chats (turning off extended memory would have been equivalent, I think).
Importantly, only once did the o3 reasoning mention it was fishing from my past chats (that's what clued me in that I'd messed up the isolation), but the guess rate was still radically different from all the times before once I cleaned house. That suggests to me that it was quietly looking before, and it just didn't make the cut for explicitly saying so.
That being said I noticed two things that probably hamper its performance - or make its current performance even more amazing - depending how you look at it:
- It often tries to zoom in to decipher even minuscule text. This works brilliantly. Sometimes it tries to enhance contrast by turning the image into black and white at various threshold levels to improve the results, but in my examples it always went in the wrong direction. For example, the text was blown out white, it failed, it turned the image even lighter instead of darker, failed again, ended up with a white rectangle, and gave up on the approach.
- It seems not to have any access to Google Maps or even OpenStreetMap and therefore fails to recognize street patterns. This is even more baffling than the first point, because it is so unlike how I imagine human geo guessers work.
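The thresholding failure in the first point is easy to reproduce: for faint text blown out toward white, a binarization threshold has to move toward white (up), not away from it, or every pixel lands on the background side. A minimal pure-Python sketch (the pixel values are made up for illustration):

```python
def binarize(pixels, threshold):
    """Map grayscale values (0-255) to pure black/white:
    below the threshold -> 0, otherwise -> 255."""
    return [0 if p < threshold else 255 for p in pixels]

# Blown-out text: background ~255, glyph pixels only slightly darker.
strip = [255, 255, 241, 238, 255, 244, 255]

# A default mid threshold erases the text entirely -- the model's
# "turn it even lighter" attempts have the same effect.
print(binarize(strip, 128))  # [255, 255, 255, 255, 255, 255, 255]

# Raising the threshold toward white (i.e. effectively darkening)
# recovers the glyphs as black pixels.
print(binarize(strip, 250))  # [255, 255, 0, 0, 255, 0, 255]
```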
Machine learning could index millions of faces, and then identify members of that set from pictures. Could you memorize millions of people, to be able to put a name to a face?
Why not also compete against `grep -r` to see who can find matches for a regex faster across your filesystem?
But unlike a GeoGuessr player, it uses web search [1].

[1] https://youtu.be/P2QB-fpZlFk?si=7dwlTHsV_a0kHyMl
>"I also notice Cyrillic text on a sign"
Am I missing this somewhere? Is the model hallucinating this?
I'd also be very interested to see a comparison against 4o. 4o was already quite good at GeoGuessr-style tasks. How big of a jump is o3?
or Dubai in 1997 https://www.youtube.com/watch?v=JMNXXiiDRhM
There could even be GeoGuessr-style competitions, which could significantly help move the needle, at least as a copilot if not an outright mass-identification tool.
If yes, that means we can employ AI to find faked images?
1) o3 cheated by using Google search. This is against the rules of the game, and OP didn't use search either.
2) OP was much quicker. They didn't record their time, but if their final summary is accurate then they were much faster.
It's an apples to oranges comparison. They're both fruit and round, but you're ignoring obvious differences. You're cherry picking.
The title is fraudulent as you can't make a claim like that when one party cheats.
I would find it surprising if OP didn't know these rules considering their credentials. Doing this kind of clickbait completely undermines a playful study like this.
Certainly o3 is impressive, but by exaggerating its capabilities you taint any impressive feats with deception. It's far better to undersell than oversell. If something is better than expected, people are happier, even if the thing is crap. But if you oversell, people are angry and feel cheated, even if the thing is revolutionary. I don't know why we insist on doing this in tech, but if you're wondering why so many people hate "tech bros", this is one of the reasons. There's no reason to lie here either! Come on! We can't just normalize this behavior. It just creates a reasonable expectation for people to distrust technology and anything tech people say. It's pretty fucked up. And no, I don't think "it's just a blog post" makes it any better. It makes it worse, because it normalizes the behavior. There are other reasons to distrust big corporations; I don't want to live in a world where we have to have our guards up all the time.
...so what? Is memorization considered intelligence? Calculators have similar properties.
GeoGuessr is the modern nerd's Rubik's Cube. The latest in "explore the world without risk of a sunburn".
feels terrifying, especially for women.