LERF optimizes a dense, multi-scale 3D language field by volume rendering CLIP embeddings along training rays, supervising these embeddings with multi-scale CLIP features across multi-view training images. After optimization, LERF can extract 3D relevancy maps for language queries interactively in real time. LERF enables pixel-aligned queries of the distilled 3D CLIP embeddings without relying on region proposals, masks, or fine-tuning, supporting long-tail open-vocabulary queries hierarchically across the volume.
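As a rough illustration of what "volume rendering CLIP embeddings along a ray" means, here is a minimal sketch in plain Python. It composites per-sample embeddings with the standard NeRF alpha-compositing weights. The function name `render_embedding`, its arguments, and the final unit-length normalization are assumptions made for illustration, not the LERF codebase's API.

```python
import math


def render_embedding(densities, deltas, embeddings):
    """Composite per-sample embeddings along one ray (hypothetical helper).

    densities:  volume density at each sample along the ray
    deltas:     distance between consecutive samples
    embeddings: one embedding (list of floats) per sample
    """
    dim = len(embeddings[0])
    rendered = [0.0] * dim
    transmittance = 1.0  # fraction of light that has not yet been absorbed

    for sigma, delta, emb in zip(densities, deltas, embeddings):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # standard volume-rendering weight
        for i in range(dim):
            rendered[i] += weight * emb[i]
        transmittance *= 1.0 - alpha

    # Assumption: rescale to unit length so the result can be compared
    # with CLIP embeddings by cosine similarity.
    norm = math.sqrt(sum(v * v for v in rendered))
    if norm > 0.0:
        rendered = [v / norm for v in rendered]
    return rendered
```

In this sketch, a sample in empty space (zero density) contributes nothing, while the first opaque sample dominates the rendered embedding, mirroring how color is composited in a radiance field.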
With multi-view supervision, 3D CLIP embeddings are more robust to occlusion and viewpoint changes than 2D CLIP embeddings. 3D CLIP embeddings also conform better to the 3D scene structure, giving them a crisper appearance.
To supervise language embeddings, we pre-compute an image pyramid of CLIP features for each training view. Then, each sampled ray during optimization is supervised by interpolating the CLIP embedding within this pyramid.
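The pyramid lookup above can be sketched as follows. This is a minimal illustration, assuming each pyramid level stores a regular grid of embeddings and that the target for a ray is obtained by bilinear interpolation within the two nearest levels, followed by linear interpolation across scale. The names `bilinear` and `sample_pyramid`, the normalized pixel coordinates, and the grid layout are all hypothetical.

```python
def bilinear(grid, u, v):
    """Bilinearly interpolate grid[row][col] embeddings at fractional (u, v)."""
    rows, cols = len(grid), len(grid[0])
    u = min(max(u, 0.0), cols - 1)
    v = min(max(v, 0.0), rows - 1)
    c0, r0 = int(u), int(v)
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    fu, fv = u - c0, v - r0
    dim = len(grid[0][0])
    return [
        (1 - fu) * (1 - fv) * grid[r0][c0][i]
        + fu * (1 - fv) * grid[r0][c1][i]
        + (1 - fu) * fv * grid[r1][c0][i]
        + fu * fv * grid[r1][c1][i]
        for i in range(dim)
    ]


def sample_pyramid(scales, grids, x, y, scale):
    """Interpolate a supervision embedding for one ray (hypothetical helper).

    scales: ascending list of crop scales, one per pyramid level
    grids:  grids[k][row][col] is the embedding at level k
    x, y:   pixel position normalized to [0, 1]
    scale:  scale sampled for this ray
    """
    scale = min(max(scale, scales[0]), scales[-1])
    hi = next(k for k, s in enumerate(scales) if s >= scale)
    lo = max(hi - 1, 0)
    t = 0.0 if hi == lo else (scale - scales[lo]) / (scales[hi] - scales[lo])

    def at(level):
        g = grids[level]
        return bilinear(g, x * (len(g[0]) - 1), y * (len(g) - 1))

    below, above = at(lo), at(hi)
    # Linear blend between the two neighbouring pyramid levels.
    return [(1 - t) * a + t * b for a, b in zip(below, above)]
```

Pre-computing the grids once per training view keeps the CLIP image encoder out of the optimization loop; each ray then only needs a cheap lookup of this kind.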
Inspired by Distilled Feature Fields (DFF), we use DINO features to regularize CLIP features. This leads to qualitative improvements in object boundaries, as CLIP embeddings in 3D can be sensitive to floaters and regions with few views.
Imagine you're a robot in a kitchen tasked to clean, and somebody knocked over a pour-over full of grinds on the counter. Give a list of things to search for to clean it up, in the form "search: action".
As a robot in a kitchen tasked to clean and faced with a pour-over full of grinds spilled on the counter, here is a list of things to search for to clean it up, in the form of "search: action":
LERF is integrated into the popular research codebase Nerfstudio! Find the code here. This video shows a user typing in queries and visualizing results from a LERF in real time. We're excited about what people will create with natural language NeRF interaction :)
If you use this work or find it helpful, please consider citing: (bibtex)
@inproceedings{lerf2023,
  author    = {Kerr, Justin* and Kim, Chung Min* and Goldberg, Ken and Kanazawa, Angjoo and Tancik, Matthew},
  title     = {LERF: Language Embedded Radiance Fields},
  booktitle = {International Conference on Computer Vision (ICCV)},
  year      = {2023},
}