SceneScape: Text-Driven Consistent Scene Generation

Supplementary Material

 


 


Our Results

Sample results, some of which are shown in Fig. 4 of the paper.


POV, cave, pools, water, dark cavern, inside a cave, beautiful scenery, best quality, indoor scene
POV, haunted house, dark, wooden door, spider webs, skeletons, indoor scene
walkthrough, inside a medieval forge, metal, fire, beautiful photo, masterpiece, indoor scene
Our result
Our result
Our result


walkthrough, a medieval dungeon with damp, stone corridors and flickering torches lining the walls, beautiful photo, masterpiece, indoor scene
A grand, marble staircase spirals up to a vaulted ceiling in a grand entrance hall of a palace
inside a castle made of ice, beautiful photo, masterpiece
Our result
Our result
Our result


walkthrough, an opulent hotel with long, carpeted hallways, beautiful photo, masterpiece, indoor scene
walkthrough, spaceship interiors, corridors,amazing quality, masterpiece, beautiful scenery, best quality, indoor scene
A dimly lit library, with rows upon rows of leather-bound books and dark wooden shelves
Our result
Our result
Our result


walkthrough, inside a medieval castle, metal, beautiful photo, masterpiece, indoor scene
walkthrough, a medieval dungeon with damp, stone corridors and flickering torches lining the walls, beautiful photo, masterpiece, indoor scene
POV, haunted house, dark, wooden door, spider webs, skeletons, indoor scene
Our result
Our result
Our result


POV, cave, pools, water, dark cavern, inside a cave, beautiful scenery, best quality, indoor scene
walkthrough, an opulent hotel with long, carpeted hallways, beautiful photo, masterpiece, indoor scene
walkthrough, underground bunker with narrow, dimly lit corridors and reinforced metal doors, beautiful photo, masterpiece, indoor scene
Our result
Our result
Our result


A dimly lit library, with rows upon rows of leather-bound books and dark wooden shelves
inside a castle made of ice, beautiful photo, masterpiece
walkthrough, spaceship interiors, corridors,amazing quality, masterpiece, beautiful scenery, best quality, indoor scene
Our result
Our result
Our result


walkthrough, abandoned hospital with empty, sterile corridors, fluorescent lights, cracked linoleum floors, best quality, eerie atmosphere.
walkthrough, inside a medieval castle, metal, beautiful photo, masterpiece, indoor scene
Simple museum, pictures, paintings, artistic, best quality, dimly lit
Our result
Our result
Our result


A grand, marble staircase spirals up to a vaulted ceiling in a grand entrance hall of a palace
indoor scene, interior, candy house, fantasy, beautiful, masterpiece, best quality
Our result
Our result


 


Ablations

We ablate the key components of our framework: (i) depth fine-tuning, (ii) decoder fine-tuning, and (iii) the mesh representation. We also show results of a naive warp-inpaint baseline. Removing any one of these components introduces visual artifacts in the produced videos (e.g., blur without the mesh representation, stretching without depth fine-tuning, and high-frequency flickering without decoder fine-tuning). For discussion and numerical evaluations, please see Sec. 4.1 and 4.2 in the paper.

POV, cave, pools, water, dark cavern, inside a cave, beautiful scenery, best quality, indoor scene
Our result
w/o depth fine-tuning
w/o decoder fine-tuning
w/o mesh representation
naive warp-inpaint


A dimly lit library, with rows upon rows of leather-bound books and dark wooden shelves
Our result
w/o depth fine-tuning
w/o decoder fine-tuning
w/o mesh representation
naive warp-inpaint


inside a castle made of ice, beautiful photo, masterpiece
Our result
w/o depth fine-tuning
w/o decoder fine-tuning
w/o mesh representation
naive warp-inpaint


walkthrough, an opulent hotel with long, carpeted hallways, beautiful photo, masterpiece, indoor scene
Our result
w/o depth fine-tuning
w/o decoder fine-tuning
w/o mesh representation
naive warp-inpaint


POV, haunted house, dark, wooden door, spider webs, skeletons, indoor scene
Our result
w/o depth fine-tuning
w/o decoder fine-tuning
w/o mesh representation
naive warp-inpaint





Splatting

We compare our method to a splatting baseline under horizontal camera translations. This further demonstrates the need for a unified 3D representation:


POV, beautiful room, wardrobe, bed, table, side view
a wall in a museum with paintings, beautiful photo, masterpiece
Our result
"Splatting"
Our result
"Splatting"
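To make the splatting comparison concrete, the following is a minimal sketch of depth-based forward splatting for a purely horizontal camera translation. It is a generic illustration and not the baseline implementation used in the paper, which is not described on this page. The function name `splat_horizontal` and the parameters `focal` and `tx` are assumptions introduced for this example. Under a pinhole model, each pixel moves horizontally by `focal * tx / depth`; when several pixels land on the same target, the nearest one is kept, and targets that receive no pixel remain holes.

```python
import numpy as np

def splat_horizontal(image, depth, focal, tx):
    """Forward-splat `image` into a camera translated by `tx` along the x-axis.

    image: (H, W, C) array; depth: (H, W) array of positive depths;
    focal: focal length in pixels (assumed); tx: translation in depth units.
    Returns the warped image and a boolean mask of pixels that received a sample.
    """
    h, w = depth.shape
    c = image.shape[2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Moving the camera to the right shifts scene content to the left,
    # and nearer pixels (small depth) shift more than distant ones.
    new_xs = np.rint(xs - focal * tx / depth).astype(int)
    valid = (new_xs >= 0) & (new_xs < w)

    src = (ys * w + xs)[valid]        # flat source indices
    dst = (ys * w + new_xs)[valid]    # flat destination indices
    d = depth[valid]

    # Sort by destination first, then by depth, so the nearest source pixel
    # comes first within each group of pixels sharing a destination.
    order = np.lexsort((d, dst))
    _, first = np.unique(dst[order], return_index=True)
    winners = order[first]

    out = np.zeros((h * w, c), dtype=image.dtype)
    mask = np.zeros(h * w, dtype=bool)
    out[dst[winners]] = image.reshape(-1, c)[src[winners]]
    mask[dst[winners]] = True
    return out.reshape(h, w, c), mask.reshape(h, w)
```

Because every frame is warped independently from a single image and its depth map, disoccluded regions appear as holes in `mask` and have to be filled separately for each frame.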

 


Comparisons

The following videos compare our results to the following baselines (Sec. 4.3 in the paper): GEN-1 [1] and VideoFusion [2].

In addition, we include a qualitative comparison to StableDreamFusion [4], an open-source implementation of DreamFusion [3], a text-to-3D method.

Sample comparisons to GEN-1

indoor scene, interior, candy house, fantasy, beautiful, masterpiece, best quality
POV, haunted house, dark, wooden door, spider webs, skeletons, indoor scene
walkthrough, an opulent hotel with long, carpeted hallways, beautiful photo, masterpiece, indoor scene
Our result
GEN-1 result
Our result
GEN-1 result
Our result
GEN-1 result


A dimly lit library, with rows upon rows of leather-bound books and dark wooden shelves.
walkthrough, inside a medieval castle, metal, beautiful photo, masterpiece, indoor scene
Our result
GEN-1 result
Our result
GEN-1 result


Note that GEN-1 videos appear to be "lagging" since GEN-1 occasionally produces near-duplicate frames.
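One simple way to check a video for such near-duplicate frames is to measure the mean absolute difference between consecutive frames. The sketch below is an illustrative check only, not a procedure from the paper; the function name `near_duplicate_frames` and the default `threshold` are assumptions chosen for this example.

```python
import numpy as np

def near_duplicate_frames(frames, threshold=1.0):
    """Return indices i where frame i is nearly identical to frame i - 1.

    frames: (T, H, W, C) array of pixel values.
    threshold: mean absolute difference (in pixel units) below which two
    consecutive frames are treated as near-duplicates (assumed value).
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Mean absolute difference between each frame and its predecessor.
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2, 3))
    return (np.nonzero(diffs < threshold)[0] + 1).tolist()
```

A suitable threshold depends on the pixel range and on compression noise, so it should be tuned per video.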

We show the point clouds produced by COLMAP reconstruction from our videos and from GEN-1 videos, given the camera path. The cameras are visualized as boxes with alternating colors.


indoor scene, interior, candy house, fantasy, beautiful, masterpiece, best quality

Our result
GEN-1 result



Sample comparisons to VideoFusion

Note that VideoFusion produces shorter videos and offers no explicit control over the camera motion.

walkthrough, inside a medieval castle, metal, beautiful photo, masterpiece, indoor scene
walkthrough, inside a medieval forge, metal, fire, beautiful photo, masterpiece, indoor scene
inside a castle made of ice, beautiful photo, masterpiece
Our result
VideoFusion result
Our result
VideoFusion result
Our result
VideoFusion result


walkthrough, spaceship interiors, corridors,amazing quality, masterpiece, beautiful scenery, best quality, indoor scene
Simple museum, pictures, paintings, artistic, best quality, dimly lit
POV, cave, pools, water, dark cavern, inside a cave, beautiful scenery, best quality, indoor scene
Our result
VideoFusion result
Our result
VideoFusion result
Our result
VideoFusion result


indoor scene, interior, candy house, fantasy, beautiful, masterpiece, best quality
A dimly lit library, with rows upon rows of leather-bound books and dark wooden shelves.
POV, haunted house, dark, wooden door, spider webs, skeletons, indoor scene
Our result
VideoFusion result
Our result
VideoFusion result
Our result
VideoFusion result



Comparison to StableDreamFusion

We compare our method to StableDreamFusion, an open-source text-to-3D model that creates an implicit representation of a scene from our camera trajectory and a prompt. The generated scenes contain blur and unrealistic artifacts, which demonstrates that current text-to-3D methods fail to create such videos: to achieve good quality, NeRF requires multiple viewpoints of the scene from different angles.

a DSLR photo of the inside of a hotel
a DSLR photo of the inside of a haunted house
a DSLR photo of the inside of a medieval castle
a DSLR photo of the inside of a candy house
a DSLR photo of the inside of a library
a DSLR photo of the inside of a cave


Mesh visualization

We show the resulting meshes produced by our method and post-processed with Poisson surface reconstruction.


POV, cave, pools, water, dark cavern, inside a cave, beautiful scenery, best quality, indoor scene
indoor scene, interior, candy house, fantasy, beautiful, masterpiece, best quality



 

 

 

[1] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011, 2023.

[2] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

[3] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv, 2022.

[4] Jiaxiang Tang. Stable-dreamfusion: Text-to-3D with Stable-diffusion. https://github.com/ashawkey/stable-dreamfusion, 2022.