Research
My research focuses on multimodal self-supervised learning and generative AI. I build models that integrate satellite imagery, audio, and text, and curate large-scale datasets to support applications in geospatial AI, medicine, and ecology.
Selected Publications
For a complete list, see my Google Scholar profile.
PRUE: A Practical Recipe for Field Boundary Segmentation at Scale
Muhawenayo Gedeon*, Robinson Caleb*, Khanal Subash*, Fang Zhanpei*, Corley Isaac, Wollam Alexander, Gao Tianyi, Strnad Leonard, Avery Ryan, Estes Lyndon, Tárano Ana M., Jacobs Nathan and Kerner Hannah
CVPR, 2026 (*Equal Contribution)
arxiv / bibtex / code
We conduct the first systematic evaluation of segmentation and geospatial foundation models for global field boundary delineation using the Fields of The World (FTW) benchmark. We propose PRUE, which combines a U-Net backbone, composite loss functions, and targeted data augmentations to achieve 76% IoU and 47% object-F1 on FTW, surpassing the previous baseline by 6% and 9%, respectively.
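To make "composite loss functions" concrete, here is a minimal sketch of one common combination for boundary segmentation, cross-entropy plus a Dice term; the specific terms and the 50/50 weighting are assumptions for illustration, not necessarily the exact recipe used in PRUE.

import torch.nn.functional as F

def composite_loss(logits, targets, dice_weight=0.5, eps=1e-6):
    # logits: (B, C, H, W) raw class scores; targets: (B, H, W) integer class labels.
    ce = F.cross_entropy(logits, targets)
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(targets, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    intersection = (probs * one_hot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
    dice = 1.0 - ((2 * intersection + eps) / (union + eps)).mean()
    # Weighted sum of the pixel-wise term and the region-overlap term.
    return (1 - dice_weight) * ce + dice_weight * dice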
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
Khanal Subash, Sastry Srikumar, Dhakal Aayush, Ahmad Adeel, Stylianou Abby and Jacobs Nathan
CVPRW (EarthVision), 2026
arxiv / demo video / bibtex / code
Sat2Sound is a state-of-the-art soundscape mapping framework that leverages a Vision-Language Model (VLM) to enrich the semantic understanding of a location's soundscape, learns a shared codebook for fine-grained alignment, and enables retrieval-based, location-conditioned soundscape generation.
ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology
Sastry Srikumar, Khanal Subash, Dhakal Aayush, Lin Jiayu, Cher Dan, Jarosz Phoenix and Jacobs Nathan
CVPR, 2026
arxiv / bibtex / code / project page
ProM3E is a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. It learns to infer missing modalities from a few context modalities, and we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities, achieving superior performance across all retrieval tasks.
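A minimal sketch of the masked-modality idea, under the assumption that each modality contributes one embedding token, missing modalities are replaced by a learnable mask token, and a small transformer predicts the missing embeddings from the observed ones; the names, dimensions, and deterministic (non-probabilistic) formulation are illustrative simplifications, not ProM3E's actual architecture.

import torch
import torch.nn as nn

class MaskedModalityPredictor(nn.Module):
    def __init__(self, dim=512, num_modalities=6, num_layers=4, num_heads=8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.modality_pos = nn.Parameter(torch.zeros(1, num_modalities, dim))  # which-modality embedding
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, modality_embs, missing):
        # modality_embs: (B, M, dim) per-modality embeddings; missing: (B, M) bool, True = modality absent.
        x = torch.where(missing.unsqueeze(-1), self.mask_token.expand_as(modality_embs), modality_embs)
        # The encoder reconstructs embeddings for all slots, including the masked ones.
        return self.encoder(x + self.modality_pos)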
SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images
Dhakal Aayush, Khanal Subash, Sastry Srikumar, Arndt Jacob, Dias Philipe Ambrozio, Lunga Dalton and Jacobs Nathan
CVPR, 2026
arxiv / bibtex
SimLBR is a simple and efficient framework for fake image detection using Latent Blending Regularization (LBR). By learning a tight decision boundary around the real image distribution and treating the fake category as a sink class, it significantly improves cross-generator generalization, with gains of up to +24.85% in accuracy and +69.62% in recall on the challenging Chameleon benchmark.
Global and Local Entailment Learning for Natural World Imagery
Sastry Srikumar, Dhakal Aayush, Xing Eric, Khanal Subash and Jacobs Nathan
ICCV, 2025
arxiv / project page / bibtex
We introduce Radial Cross-Modal Embeddings (RCME), a framework for learning hierarchical vision-language representations that explicitly models transitivity-enforced entailment. Applied to the Tree of Life taxonomy, our model achieves state-of-the-art performance on hierarchical species classification and image-text retrieval tasks.
Mixed-View Panorama Synthesis using Geospatially Guided Diffusion
Xiong Zhexiao, Xing Xin, Workman Scott, Khanal Subash and Jacobs Nathan
TMLR, 2025
arxiv / project page / bibtex
We introduce mixed-view panorama synthesis: generating a novel ground-level panorama conditioned on a satellite image and a set of nearby panoramas. Our diffusion-based model with geospatially guided attention handles sparse or distant panorama inputs and consistently outperforms cross-view and same-view synthesis baselines.
RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings
Dhakal Aayush, Sastry Srikumar, Khanal Subash, Ahmad Adeel, Xing Eric and Jacobs Nathan
CVPR, 2025
arxiv / bibtex
We propose a novel retrieval-augmented strategy for multi-resolution geo-embeddings, called RANGE. Our method is based on the intuition that the visual features of a location can be estimated by aggregating visual features from multiple similar-looking locations.
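This intuition can be sketched as a simple retrieval step, assuming a database of precomputed location embeddings paired with visual features; the cosine-similarity retrieval and softmax weighting below are illustrative choices rather than RANGE's exact formulation.

import torch.nn.functional as F

def retrieval_augmented_features(query_loc_emb, db_loc_embs, db_visual_feats, k=16):
    # query_loc_emb: (D,); db_loc_embs: (N, D); db_visual_feats: (N, F_dim).
    sims = F.cosine_similarity(query_loc_emb.unsqueeze(0), db_loc_embs, dim=-1)  # (N,)
    topk = sims.topk(k)
    weights = topk.values.softmax(dim=0)  # similarity-weighted average over the k neighbors
    return (weights.unsqueeze(-1) * db_visual_feats[topk.indices]).sum(dim=0)  # (F_dim,)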
TaxaBind: A Unified Embedding Space for Ecological Applications
Sastry Srikumar, Khanal Subash, Dhakal Aayush, Ahmad Adeel and Jacobs Nathan
WACV, 2025 (Oral Presentation)
arxiv / bibtex / code / project page
TaxaBind is a suite of multimodal models covering six modalities: ground-level image, geographic location, satellite image, text, audio, and environmental features. It supports a range of downstream ecological tasks.
LD-SDM: Language-Driven Hierarchical Species Distribution Modeling
Sastry Srikumar, Xing Xin, Dhakal Aayush, Khanal Subash, Ahmad Adeel and Jacobs Nathan
CV4E Workshop, ICCV, 2025
arxiv / bibtex
We introduce a language-driven approach for hierarchical species distribution modeling (SDM) that uses an LLM to encode species representations from taxonomic descriptions. This enables range map prediction at multiple taxonomic levels and zero-shot generalization to unseen species.
PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping
Khanal Subash, Xing Eric, Sastry Srikumar, Dhakal Aayush, Xiong Zhexiao, Ahmad Adeel and Jacobs Nathan
ACM Multimedia, 2024
arxiv / bibtex / code / project page
We develop a probabilistic, multi-scale, and metadata-aware embedding space that connects audio, text, and overhead imagery. This enables the creation of dynamic, multi-scale soundscape maps for any geographic region, along with uncertainty estimates for the mapping.
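As a minimal sketch of what a probabilistic embedding head can look like, the encoder output is mapped to a mean and a log-variance, so each input becomes a Gaussian in the shared space and the predicted spread doubles as an uncertainty estimate; layer names and sizes here are assumptions, not PSM's exact design.

import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    def __init__(self, in_dim=768, embed_dim=512):
        super().__init__()
        self.mu = nn.Linear(in_dim, embed_dim)       # mean of the embedding distribution
        self.log_var = nn.Linear(in_dim, embed_dim)  # log-variance, i.e. the uncertainty

    def forward(self, features):
        mu, log_var = self.mu(features), self.log_var(features)
        std = torch.exp(0.5 * log_var)
        sample = mu + std * torch.randn_like(std)    # reparameterized sample from N(mu, std^2)
        return mu, std, sample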
Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images
Dhakal Aayush, Ahmad Adeel, Khanal Subash, Sastry Srikumar, Kerner Hannah and Jacobs Nathan
CVPRW (EarthVision), 2024, Best Paper Award
arxiv / bibtex / code
We propose Sat2Cap, a contrastive learning framework that aligns satellite imagery with fine-grained textual descriptions of locations. Trained on a novel large-scale dataset, it enables zero-shot geospatial mapping driven by free-form text queries.
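A minimal sketch of the general loss family this builds on, a CLIP-style symmetric contrastive objective between paired satellite-image and text embeddings; the function name and temperature value are illustrative, not taken from the paper.

import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (B, D) embeddings of paired samples; row i of each is a positive pair.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) cosine similarities
    labels = torch.arange(logits.shape[0], device=logits.device)
    # Cross-entropy in both directions: image-to-text and text-to-image.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))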
GeoSynth: Contextually-Aware High-Resolution Satellite Image Synthesis
Sastry Srikumar, Khanal Subash, Dhakal Aayush and Jacobs Nathan
CVPRW (EarthVision), 2024
arxiv / bibtex / code / project page
GeoSynth is a diffusion-based model for synthesizing high-resolution satellite images with global style control via text prompts or geographic location, and image-driven layout control via OpenStreetMap data. It exhibits strong zero-shot generalization and produces diverse, geographically consistent satellite imagery.
BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping
Sastry Srikumar, Khanal Subash, Dhakal Aayush, Huang Di and Jacobs Nathan
WACV, 2024
arxiv / bibtex / code
BirdSAT unifies cross-view contrastive learning and masked autoencoders to jointly learn from paired ground-level bird images and satellite imagery. The resulting model supports both fine-grained bird species classification and geographic species distribution mapping.
Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
Khanal Subash, Sastry Srikumar, Dhakal Aayush and Jacobs Nathan
BMVC, 2023
arxiv / supplementary / bibtex / code
We learn a tri-modal embedding space connecting audio, text, and overhead imagery. This enables us to create soundscape maps over any geographic region using either audio or textual queries.
Causality for inherently explainable transformers: CAT-XPLAIN
Khanal Subash, Brodie Benjamin, Xing Xin, Lin Ai-Ling and Jacobs Nathan
CVPR Workshop, 2022
arxiv / bibtex / code
We introduce CAT-XPLAIN, an inherently interpretable Vision Transformer that adds a causal selection token trained to identify the most causally significant image patches for the classification decision, eliminating the need for post-hoc explainers while maintaining strong task performance.
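A simplified sketch of the extra-token idea: a learnable selection token attends over the patch tokens, and its attention weights are read out as per-patch importance scores. This illustrates the mechanism only; the paper's actual architecture and causal training objective are not reproduced here.

import torch
import torch.nn as nn

class SelectionTokenScorer(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.select_token = nn.Parameter(torch.zeros(1, 1, dim))  # extra learnable token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, dim) from a ViT backbone; returns (B, N) importance scores.
        query = self.select_token.expand(patch_tokens.shape[0], -1, -1)
        _, attn_weights = self.attn(query, patch_tokens, patch_tokens,
                                    need_weights=True, average_attn_weights=True)
        return attn_weights.squeeze(1)  # attention mass the selection token places on each patch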