Subash Khanal

I have a PhD in Computer Science from Washington University in St. Louis, where I worked in the Multimodal Vision Research Laboratory led by Dr. Nathan Jacobs.

Broadly, I am interested in building scalable AI systems that bridge multiple modalities to address real-world challenges.

Email / CV / Google Scholar / LinkedIn / GitHub

Research

My research focuses on multimodal self-supervised learning and generative AI. I build models that integrate satellite imagery, audio, and text, and curate large-scale datasets to support applications in geospatial AI, medicine, and ecology.

Publications

Global and Local Entailment Learning for Natural World Imagery
Sastry Srikumar, Dhakal Aayush, Xing Eric, Khanal Subash and Jacobs Nathan
ICCV, 2025
arxiv / project page / bibtex

We introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models.

Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
Khanal Subash, Sastry Srikumar, Dhakal Aayush, Ahmad Adeel and Jacobs Nathan
preprint, 2025
arxiv / demo video / bibtex

A state-of-the-art soundscape mapping framework that leverages a Vision-Language Model (VLM) to enrich the semantic understanding of a location’s soundscape, learns a shared codebook for fine-grained alignment, and enables retrieval-based, location-conditioned soundscape generation.

Mixed-View Panorama Synthesis using Geospatially Guided Diffusion
Xiong Zhexiao, Xing Xin, Workman Scott, Khanal Subash, Jacobs Nathan
TMLR, 2025
arxiv / project page / bibtex

This work introduces the task of mixed-view panorama synthesis, where the goal is to synthesize a novel panorama given a small set of input panoramas and a satellite image of the area.

RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings
Dhakal Aayush, Sastry Srikumar, Khanal Subash, Ahmad Adeel, Xing Eric and Jacobs Nathan
CVPR, 2025
arxiv / bibtex

We propose a novel retrieval-augmented strategy for multi-resolution geo-embeddings, called RANGE. Our method is based on the intuition that the visual features of a location can be estimated by aggregating visual features from multiple similar-looking locations.

TaxaBind: A Unified Embedding Space for Ecological Applications
Sastry Srikumar, Khanal Subash, Dhakal Aayush, Ahmad Adeel and Jacobs Nathan
WACV, 2025
arxiv / bibtex / code / project page

TaxaBind is a suite of multimodal models useful for downstream ecological tasks covering six modalities: ground-level image, geographic location, satellite image, text, audio, and environmental features.

PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping
Khanal Subash, Xing Eric, Sastry Srikumar, Dhakal Aayush, Xiong Zhexiao, Ahmad Adeel and Jacobs Nathan
ACM Multimedia, 2024
arxiv / bibtex / code / project page

We develop a probabilistic, multi-scale, and metadata-aware embedding space that connects audio, text, and overhead imagery. This enables the creation of dynamic, multi-scale soundscape maps for any geographic region, along with uncertainty estimates for the mapping.

Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images
Dhakal Aayush, Ahmad Adeel, Khanal Subash, Sastry Srikumar, Kerner Hannah and Jacobs Nathan
CVPRW (EarthVision), 2024, Best Paper Award
arxiv / bibtex / code

We train a contrastive learning framework, Sat2Cap, on a novel large-scale dataset. This enables us to create maps using free-form textual descriptions.

GeoSynth: Contextually-Aware High-Resolution Satellite Image Synthesis
Sastry Srikumar, Khanal Subash, Dhakal Aayush, and Jacobs Nathan
CVPRW (EarthVision), 2024
arxiv / bibtex / code / project page

This work presents GeoSynth, a diffusion-based model for synthesizing satellite images with global style and image-driven layout control.

GeoBind: Binding text, image, and audio through satellite images
Dhakal Aayush, Khanal Subash, Sastry Srikumar, Ahmad Adeel, Jacobs Nathan
IGARSS, 2024, Oral Presentation
arxiv / bibtex

This work presents a general framework that can be used to create an embedding space with any number of modalities by using satellite images as the binding element.

LD-SDM: Language-Driven Hierarchical Species Distribution Modeling
Sastry Srikumar, Xing Xin, Dhakal Aayush, Khanal Subash, Ahmad Adeel, and Jacobs Nathan
preprint, 2024
arxiv / bibtex

We introduce a novel approach to species distribution modeling that uses a large language model to generate a representation of each species. This provides the flexibility to generate range maps at different levels of the taxonomic hierarchy and for unseen species.

BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping
Sastry Srikumar, Khanal Subash, Huang Di, Dhakal Aayush and Jacobs Nathan
WACV, 2024
arxiv / bibtex / code

This work presents a flexible framework, with vector embedding and metric learning variants, that supports both species distribution mapping and fine-grained visual classification.

Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
Khanal Subash, Sastry Srikumar, Dhakal Aayush and Jacobs Nathan
BMVC, 2023
arxiv / supplementary / bibtex / code

We learn a tri-modal embedding space between audio, text and overhead imagery. This enables us to create soundscape maps over any geographic region, using either audio or textual queries.

Causality for inherently explainable transformers: CAT-XPLAIN
Khanal Subash, Brodie Benjamin, Xing Xin, Lin Ai-Ling and Jacobs Nathan
CVPR Workshop, 2022
arxiv / bibtex / code

We add an extra special token (an explainable token) to the Vision Transformer (ViT) and train it to select the most important patches in the input image.

Advit: Vision transformer on multi-modality PET images for Alzheimer disease diagnosis
Xing Xin, Liang Gongbo, Zhang Yu, Khanal Subash, Lin Ai-Ling and Jacobs Nathan
ISBI, 2022
paper / bibtex

Training ViT on 3D-to-2D converted multi-modal PET images achieves better Alzheimer's disease prediction.

Alzheimer's Disease Classification Using Genetic Data
Khanal Subash, Chen Jin, Jacobs Nathan and Lin Ai-Ling
BIBM Workshop, 2021
paper / bibtex / code

Machine learning on different types of genetic data helps to identify candidate genes for Alzheimer's disease progression.

Hierarchical Probabilistic Embeddings for Multi-View Image Classification
Brodie Benjamin, Khanal Subash, Rafique Muhammad Usman, Greenwell Connor and Jacobs Nathan
IGARSS, 2021
paper / bibtex

Learning a hierarchical, probabilistic embedding space makes it possible to obtain uncertainty estimates for feature distributions coming from sources with variable bands of information.

Articulatory Comparison of L1 and L2 Speech for Mispronunciation Diagnosis
Khanal Subash, Johnson Michael T. and Bozorg Narjes
SLT, 2021
paper / bibtex

This paper compares the articulatory patterns of native (L1) English speakers with those of non-native (L2) Mandarin speakers of English, in order to better understand the mispronunciation behaviors of L2 learners.

Mispronunciation Detection and Diagnosis for Mandarin Accented English Speech
Khanal Subash, Johnson Michael T., Soleymanpour Mohammad and Bozorg Narjes
SpeD, 2021
paper / bibtex

Articulatory features improve the performance of Automatic Speech Recognition (ASR)-based Mispronunciation Detection and Diagnosis (MDD) systems.

Mispronunciation Detection and Diagnosis in Mandarin Accented English Speech
Khanal Subash
Theses and Dissertations--Electrical and Computer Engineering, 2020
Thesis / bibtex

This work focuses on analyzing articulatory patterns of mispronunciation and designing an ASR-based MDD system.

This website is modified from the source code of Jon Barron's website.