Just Zoom In

Cross-View Geo-Localization via Autoregressive Zooming

Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz

Photogrammetric Computer Vision Lab, The Ohio State University

[Figure: a street-view query (the ground image input) followed by four satellite zoom levels. Level 1: coarse search area. Level 2: selected region. Level 3: finer candidate cell. Level 4: terminal localization.]
Given a street-view query, Just Zoom In predicts a sequence of coarse-to-fine satellite cells instead of searching a flat retrieval database.

Abstract

Cross-view geo-localization estimates a camera’s location by matching a street-view image to geo-referenced overhead imagery. Existing approaches usually treat this as contrastive image retrieval over a dense satellite database. That formulation can depend on large batches, hard negative mining, and exhaustive nearest-neighbor search, while also ignoring the geographic hierarchy of the map.

Just Zoom In reformulates cross-view geo-localization as autoregressive zooming. Starting from a coarse satellite view, the model predicts a short sequence of zoom actions until it selects a terminal overhead cell. This enables coarse-to-fine spatial reasoning without contrastive losses or hard negative mining.

Method

The model uses a shared DINOv2 vision encoder for street-view and satellite imagery. A causal transformer decoder then predicts the next zoom action conditioned on the ground query, previous decisions, and the current satellite context.

Each action chooses one of the child cells in a fixed grid. After several steps, the final selected cell becomes the location estimate.
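The mapping from a sequence of cell choices to a final map region can be sketched as follows. This is a minimal illustration, not the authors' implementation: the grid size (`grid=2`), the row-major numbering of child cells, and the unit-square coordinates of the root tile are all assumptions made for the example, and `cell_bounds` and `cell_center` are hypothetical helper names.

```python
from typing import Sequence, Tuple


def cell_bounds(actions: Sequence[int], grid: int = 2) -> Tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) of the selected cell in unit coordinates of the root tile.

    Assumptions (not from the source): children are numbered row-major,
    and every zoom level splits the current cell into grid x grid children.
    """
    x0, y0, size = 0.0, 0.0, 1.0
    for action in actions:
        if not 0 <= action < grid * grid:
            raise ValueError(f"action {action} is outside a {grid}x{grid} grid")
        row, col = divmod(action, grid)
        size /= grid          # each zoom step shrinks the cell side by the grid factor
        x0 += col * size      # shift to the chosen column
        y0 += row * size      # shift to the chosen row
    return x0, y0, x0 + size, y0 + size


def cell_center(actions: Sequence[int], grid: int = 2) -> Tuple[float, float]:
    """Center of the terminal cell, usable as a point estimate of the location."""
    x0, y0, x1, y1 = cell_bounds(actions, grid)
    return (x0 + x1) / 2, (y0 + y1) / 2
```

Under these assumptions, `cell_bounds([3, 0], grid=2)` returns `(0.5, 0.5, 0.75, 0.75)`: the first action picks the bottom-right quadrant, and the second picks the top-left child of that quadrant.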

  1. Encode query: Street-view image features condition the search.
  2. Choose cell: The decoder predicts the next satellite grid cell.
  3. Zoom again: The selected cell becomes the next map context.
  4. Localize: The terminal cell gives the final position estimate.
[Figure: Just Zoom In architecture overview. Shared image encoding followed by autoregressive next-action prediction.]
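The decoding loop described above can be sketched with a greedy rollout. This is only a sketch under stated assumptions: `greedy_zoom`, `score_children`, and `toy_scorer` are hypothetical names, the scorer stands in for the decoder (which in the real model would also see the query features and the current satellite context), and greedy argmax selection is an assumption, since the page does not say how actions are chosen at inference time.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given the actions chosen so far, return one score per child cell.
ScoreFn = Callable[[Sequence[int]], Sequence[float]]


def greedy_zoom(score_children: ScoreFn, num_steps: int) -> List[int]:
    """Roll out num_steps zoom actions, feeding each choice back as history."""
    actions: List[int] = []
    for _ in range(num_steps):
        scores = score_children(actions)  # conditioned on previous decisions
        best = max(range(len(scores)), key=scores.__getitem__)  # greedy argmax
        actions.append(best)
    return actions


def toy_scorer(history: Sequence[int]) -> List[float]:
    """Toy stand-in for the decoder: favors child (len(history) mod 4) of 4 children."""
    target = len(history) % 4
    return [1.0 if i == target else 0.0 for i in range(4)]
```

With the toy scorer, `greedy_zoom(toy_scorer, 3)` returns `[0, 1, 2]`. The point of the sketch is the control flow: each step's output is appended to the history that conditions the next step, which is what makes the procedure autoregressive.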

Zoom-in localization demo

The model replaces flat contrastive retrieval with a sequence of zoom-in decisions over a multi-scale overhead map. Each satellite frame below is centered on the ground-truth location, showing the same place at progressively finer spatial context.

[Demo, example 1: a street-view query shown alongside satellite zoom level 1, a coarse satellite context centered on the marked ground-truth (GT) location.]

Street-view examples

Representative full-resolution street-view samples from the benchmark show the diversity of viewpoints, road geometry, occlusions, lighting, and urban context.

Results

On the proposed benchmark, Just Zoom In improves distance-based Recall@1 while avoiding hard negative mining. Compared with Sample4Geo, the strongest contrastive baseline shown here, it improves R@50m by 5.50 percentage points (60.81 to 66.31) and R@100m by 9.63 percentage points (71.30 to 80.93).

Method       R@40m ↑   R@50m ↑   R@100m ↑
SAIG-D       39.36     47.52     64.17
TransGeo     45.97     54.55     67.61
Sample4Geo   52.85     60.81     71.30
Ours         55.74     66.31     80.93
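For readers who want to reproduce a metric of this kind, here is a minimal sketch of a distance-thresholded recall. It assumes a query counts as a hit when its single predicted location lies within the threshold of the ground truth, that locations are given as (latitude, longitude) pairs in degrees, and that great-circle distance is an acceptable distance measure. The page does not spell out these details, so `haversine_m` and `recall_at` are hypothetical helpers rather than the benchmark's evaluation code.

```python
import math
from typing import Sequence, Tuple

LatLon = Tuple[float, float]
EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def recall_at(pred: Sequence[LatLon], gt: Sequence[LatLon], threshold_m: float) -> float:
    """Fraction of queries whose prediction is within threshold_m of the ground truth."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    hits = sum(haversine_m(*p, *g) <= threshold_m for p, g in zip(pred, gt))
    return hits / len(pred)
```

A value such as R@50m in the table would then correspond to `100 * recall_at(pred, gt, 50.0)` under these assumptions.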

Citation

@article{erzurumlu2026justzoomin,
  title   = {Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming},
  author  = {Erzurumlu, Yunus Talha and Kwag, Jiyong and Yilmaz, Alper},
  journal = {arXiv preprint arXiv:2603.25686},
  year    = {2026}
}