I think there are a large number of use cases for reality capture: translating 2D images and video into 3D scenes (digital twins). A few examples: creating 3D assets for movies, games, and virtual/augmented reality; virtual visits to landmarks across the world; a Google Street View that is not constrained to the position of the camera; documenting construction progress; virtual tours; robotics navigation and simulation; understanding of surroundings; and generating simulation environments.
This is primarily done today using photogrammetry or fused lidar + photo. It seems very likely that learning-based approaches will outperform the previous physics/optics + heuristics based methods, and NERFs seem to be the most impressive learning-based approach at the moment.
Currently both NERFs and photogrammetry face similar issues. For example, neither method works well when some photos are taken in the day and some at night; blurry photos and photos with shallow or varying depths of field do not work well together; moving objects within the scene (people, or a flag blowing in the wind) cause issues; and photos from different camera models sometimes don't work well together. Many of these issues are now starting to be addressed with NERFs. This video does a great job of demonstrating some of these issues as well as techniques to address them.
Another problem with some of the early work with NERFs was that they were extremely slow to train. Many of the early examples took several days to train. Then instant-ngp came along and offered a speedup of multiple orders of magnitude, allowing you to train a NERF in seconds to minutes rather than hours to days. These results totally blew me away: training a NERF using instant-ngp is about as fast as training the fast.ai dogs vs cats model. instant-ngp is also dramatically faster than most photogrammetry techniques that I've seen. Unfortunately, at least some of these performance improvements come from the majority of instant-ngp being implemented directly in CUDA rather than in PyTorch, which seems to significantly raise the barrier to entry for building upon these results.
Here's a list of some of the things that I believe need to be incorporated into NERFs. Many of these are being worked on currently and will need to be brought together once solved:
- Camera localization done via learning.
- ‘Floater’ removal or elimination.
- Camera/lens pixel mappings (intrinsics) - per image preferred
- Rolling shutter correction
- Ability to export high quality meshes and textures. Mesh results are currently quite poor, at least in instant-ngp.
- Work well in more unconstrained environments - currently NERFs work best when photos circle an object of interest.
- Ability to separate geometry from textures - the video I referenced addresses this.
- Ability to ignore/filter out moving objects
- Improve inference speed - for instant-ngp, inference at higher resolutions seems slower than training
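To make the "per image intrinsics" item above concrete, here is a minimal sketch of what a pixel-to-ray mapping looks like when every photo carries its own parameters. It assumes a simple pinhole model with no lens distortion and no rolling shutter; the `Intrinsics` class, `pixel_to_ray` function, file names, and numeric values are all hypothetical, made up for illustration and not taken from instant-ngp or any other library.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Intrinsics:
    """Pinhole intrinsics for a single image (hypothetical container)."""
    fx: float  # focal length in pixels along x
    fy: float  # focal length in pixels along y
    cx: float  # principal point x, in pixels
    cy: float  # principal point y, in pixels


def pixel_to_ray(k: Intrinsics, u: float, v: float) -> Tuple[float, float, float]:
    """Unit ray direction in camera coordinates for pixel (u, v).

    Assumes the camera looks along +z and ignores lens distortion.
    """
    x = (u - k.cx) / k.fx
    y = (v - k.cy) / k.fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


# Per-image intrinsics: each photo gets its own parameters, so images from
# different cameras (or zoom levels) can be mixed. Values are made up.
per_image = {
    "phone.jpg": Intrinsics(fx=1500.0, fy=1500.0, cx=960.0, cy=540.0),
    "dslr.jpg": Intrinsics(fx=4000.0, fy=4000.0, cx=3000.0, cy=2000.0),
}

center_ray = pixel_to_ray(per_image["phone.jpg"], 960.0, 540.0)
corner_ray = pixel_to_ray(per_image["dslr.jpg"], 0.0, 0.0)
```

In a learned setup these four numbers (plus distortion terms) would presumably become per-image parameters optimized alongside the scene, rather than fixed values from calibration.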
Other ideas - may be quite out there/bad ideas:
- As NERFs store a compressed representation of a 3D scene, maybe they would be somehow useful as a mechanism for storing/retrieving temporal residuals for previous video frames for video predictions
- Ability to segment and temporally encode moving objects. This would serve both to reduce artifacts from current NERF implementations for static scenes and to enable moving objects to be encoded and referenced.
- Unlimited, more realistic and more diverse data augmentation for image classifiers
As for ideas for the class - I think that NERFs are generally fun and accessible. All you need is a camera and you can create a 3D scene of whatever you want, whether it's a toy, your home, your workplace, a local street or landmark, etc.
- Can a PyTorch implementation get close enough to the speed of instant-ngp to allow for fast.ai style fast iteration while being more accessible/hackable than the current CUDA implementation?
- Are there pieces of instant-ngp that can be incorporated into a PyTorch implementation, whether for NERFs specifically or more generally? Are there useful things to be learned from what makes instant-ngp so much faster than previous NERF implementations?
- Do you have any intuition-based ideas for how NERFs can be improved, similar to your intuition on optimizers for stable diffusion?
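On the question of which pieces of instant-ngp might carry over: one candidate is the multiresolution hash encoding that its paper is built around, since the indexing logic itself is small. Below is a rough plain-Python sketch of just the index computation, not the trainable feature tables or the interpolation. The function names, `base_res`, `growth`, and `table_size` defaults are my own illustrative choices; the prime constants are the ones I understand the paper to use, so treat them as an assumption. Python integers do not wrap at 32 bits, so the exact indices will differ from the CUDA implementation.

```python
# Per-dimension hash constants (assumed to match the instant-ngp paper).
PRIMES = (1, 2654435761, 805459861)


def hash_index(ix: int, iy: int, iz: int, table_size: int) -> int:
    """Spatial hash: XOR the scaled integer grid coordinates, then wrap."""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size


def level_resolution(level: int, base_res: int = 16, growth: float = 1.5) -> int:
    """Grid resolution at a level; grows geometrically from coarse to fine."""
    return int(base_res * growth ** level)


def corner_indices(x: float, y: float, z: float, level: int,
                   table_size: int = 2 ** 14) -> list:
    """Hash-table slots for the 8 voxel corners around a point in [0, 1)^3."""
    res = level_resolution(level)
    gx, gy, gz = int(x * res), int(y * res), int(z * res)
    return [
        hash_index(gx + dx, gy + dy, gz + dz, table_size)
        for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)
    ]


# Slots touched at the coarsest and a finer level for the same point.
coarse = corner_indices(0.5, 0.5, 0.5, level=0)
fine = corner_indices(0.5, 0.5, 0.5, level=4)
```

In a PyTorch version, each level would presumably own a learnable table of feature vectors indexed by these slots, with the eight corner features trilinearly interpolated and concatenated across levels before going into a small MLP. Whether that can get near the CUDA speed is exactly the open question.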
EDIT: Oops, I forgot to make this a reply to Jeremy…