LERF

📎 Language Embedded Radiance Fields 🚜

UC Berkeley

*Denotes Equal Contribution

ICCV 2023 (Oral)


TL;DR: Grounding CLIP vectors volumetrically inside a NeRF allows flexible natural language queries in 3D

Overview

LERF optimizes a dense, multi-scale 3D language field by volume rendering CLIP embeddings along training rays and supervising them with multi-scale CLIP features across multi-view training images. After optimization, LERF can extract 3D relevancy maps for language queries interactively, in real time. Because the distilled 3D CLIP embeddings can be queried per pixel without region proposals, masks, or fine-tuning, LERF supports long-tail, open-vocabulary queries hierarchically across the volume.
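As a rough illustration of the idea, here is a minimal NumPy sketch of rendering an embedding along one ray and scoring it against a query. It assumes the standard NeRF rendering weights and uses plain cosine similarity as the relevancy score; both are simplifications for illustration, and the relevancy score in the paper may be defined differently. The function names (`render_weights`, `render_embedding`, `relevancy`) are hypothetical, and random vectors stand in for real CLIP embeddings.

```python
import numpy as np

def render_weights(density, delta):
    """Standard NeRF volume-rendering weights for the samples along one ray."""
    alpha = 1.0 - np.exp(-density * delta)  # per-sample opacity
    # Transmittance: fraction of the ray that reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha

def render_embedding(density, delta, sample_embeddings):
    """Weighted sum of per-sample embeddings, renormalized to unit length."""
    w = render_weights(density, delta)                    # shape (S,)
    emb = (w[:, None] * sample_embeddings).sum(axis=0)    # shape (D,)
    return emb / (np.linalg.norm(emb) + 1e-8)

def relevancy(rendered, query):
    """Assumed score: cosine similarity with a text embedding."""
    query = query / np.linalg.norm(query)
    return float(rendered @ query)

# Toy data: random vectors stand in for CLIP embeddings.
rng = np.random.default_rng(0)
S, D = 16, 8                                   # samples per ray, embedding size
density = rng.uniform(0.0, 5.0, size=S)        # hypothetical field densities
delta = np.full(S, 0.1)                        # spacing between samples
sample_embeddings = rng.normal(size=(S, D))    # hypothetical per-sample outputs
rendered = render_embedding(density, delta, sample_embeddings)
score = relevancy(rendered, rng.normal(size=D))
```

Evaluating `relevancy` for the rendered embedding of every pixel in a view would give a 2D relevancy map for that view under these assumptions.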


Why 3D CLIP embeddings?

With multi-view supervision, 3D CLIP embeddings are more robust to occlusion and viewpoint changes than 2D CLIP embeddings. 3D CLIP embeddings also conform better to the 3D scene structure, giving them a crisper appearance.

Multi-scale supervision

To supervise language embeddings, we pre-compute an image pyramid of CLIP features for each training view. During optimization, each sampled ray is then supervised with a CLIP embedding interpolated from this pyramid.
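The pyramid lookup can be sketched as follows. This is a minimal example under stated assumptions, not the released implementation: each pyramid level is assumed to be stored as an (H, W, D) grid of features, coarser scales may have coarser grids, pixel coordinates are normalized to [0, 1], and the target is formed by bilinear interpolation within a level followed by linear interpolation between the two levels that bracket the requested scale. The names `bilinear` and `sample_pyramid` and the scale values are hypothetical.

```python
import numpy as np

def bilinear(grid, u, v):
    """Bilinearly sample an (H, W, D) feature grid at normalized (u, v) in [0, 1]."""
    H, W, _ = grid.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * grid[y0, x0] + fx * grid[y0, x1]
    bot = (1 - fx) * grid[y1, x0] + fx * grid[y1, x1]
    return (1 - fy) * top + fy * bot

def sample_pyramid(scales, grids, u, v, s):
    """Interpolate between the two pyramid levels whose scales bracket s."""
    s = float(np.clip(s, scales[0], scales[-1]))   # clamp to the covered range
    hi = int(np.searchsorted(scales, s))           # first level with scale >= s
    hi = min(max(hi, 1), len(scales) - 1)
    lo = hi - 1
    t = (s - scales[lo]) / (scales[hi] - scales[lo])
    feat = (1 - t) * bilinear(grids[lo], u, v) + t * bilinear(grids[hi], u, v)
    return feat / (np.linalg.norm(feat) + 1e-8)    # unit-norm supervision target

# Toy pyramid: three levels with hypothetical scales and random features.
rng = np.random.default_rng(0)
scales = np.array([0.1, 0.3, 0.5])                 # ascending, assumed units
grids = [rng.normal(size=(n, n, 4)) for n in (8, 4, 2)]
target = sample_pyramid(scales, grids, u=0.4, v=0.7, s=0.2)
```

Interpolating across scale as well as position lets a ray be supervised at any scale inside the pyramid's range, not only at the precomputed levels.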

DINO Regularization

Inspired by Distilled Feature Fields (DFF), we use DINO features to regularize CLIP features. This leads to qualitative improvements in object boundaries, as CLIP embeddings in 3D can be sensitive to floaters and regions with few views.
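One way such a regularizer could enter the objective is sketched below. This is an assumption-laden illustration rather than the actual training code: it assumes the field renders a DINO feature alongside the CLIP embedding for each ray, that the CLIP term is a negative cosine similarity, that the DINO term is a mean squared error, and that the two are combined with the hypothetical weights `clip_weight` and `dino_weight`.

```python
import numpy as np

def clip_loss(pred, target):
    """Negative cosine similarity between rendered and target CLIP embeddings."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(-(pred * target).sum(axis=-1).mean())

def dino_loss(pred, target):
    """Mean squared error between rendered and target DINO features."""
    return float(((pred - target) ** 2).mean())

def total_loss(clip_pred, clip_gt, dino_pred, dino_gt,
               clip_weight=1.0, dino_weight=1.0):
    """Weighted sum of both terms; the weights are hypothetical."""
    return (clip_weight * clip_loss(clip_pred, clip_gt)
            + dino_weight * dino_loss(dino_pred, dino_gt))

# Toy batch of 4 rays; random vectors stand in for real CLIP and DINO features.
rng = np.random.default_rng(0)
clip_pred, clip_gt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
dino_pred, dino_gt = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))
loss = total_loss(clip_pred, clip_gt, dino_pred, dino_gt)
```

In this sketch the DINO term only regularizes the CLIP output if both are predicted from shared field features, which is also an assumption here.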

LERF 🤝 ChatGPT

Natural language interaction could allow LLMs to interrogate the 3D world. We show an illustrative example in which ChatGPT proposes language queries for objects to look for when cleaning up a coffee spill.

Prompt: Imagine you're a robot in a kitchen tasked to clean, and somebody knocked over a pour-over full of grinds on the counter. Give a list of things to search for to clean it up, in the form "search: action".

ChatGPT: As a robot in a kitchen tasked to clean and faced with a pour-over full of grinds spilled on the counter, here is a list of things to search for to clean it up, in the form of "search: action":

Paper towel: to blot up as much of the coffee grinds as possible
Dustpan and brush: to sweep up the coffee grinds that can't be absorbed with paper towels
Baking soda: to sprinkle over the stain and absorb any leftover coffee liquid
Cleaning spray: to clean and disinfect the counter after the coffee grinds have been removed
Sponge or cloth: to wipe down the counter with the cleaning spray and remove any stains
Scrub brush: to clean any remaining stains or residue left by the coffee grinds
Vacuum cleaner: if there are any coffee grinds that have spilled on the floor, a vacuum cleaner could be used to suck them up
Trash can: to dispose of the coffee grinds collected by the dustpan or vacuum cleaner
Sink: to rinse out the pour-over and dispose of the used coffee grinds in the compost or trash
Cabinet or drawer: to store the cleaning supplies after the spill has been cleaned up.


Integration

LERF is integrated into the popular research codebase Nerfstudio! With this integration, a user can type in queries and visualize results from a LERF in real time. We're excited about what people will create with natural language NeRF interaction :)

Citation

If you use this work or find it helpful, please consider citing:

@inproceedings{lerf2023,
 author = {Kerr, Justin* and Kim, Chung Min* and Goldberg, Ken and Kanazawa, Angjoo and Tancik, Matthew},
 title = {LERF: Language Embedded Radiance Fields},
 booktitle = {International Conference on Computer Vision (ICCV)},
 year = {2023},
}