A new US/China research collaboration has proposed using Generative Adversarial Networks (GANs) to increase the realism of driving simulators.
In a novel approach to the challenge of generating photorealistic POV driving scenarios, the researchers have developed a hybrid method that plays to the strengths of each approach, mixing the more photorealistic output of CycleGAN-based systems with conventionally generated elements that require a higher level of detail and consistency, such as road markings and the actual vehicles seen from the driver’s perspective.
The system, called Hybrid Generative Neural Graphics (HGNG), injects highly limited outputs of a traditional CGI-based driving simulator into a GAN pipeline, where the NVIDIA SPADE framework does the environment generation work.
According to the authors, the benefit is that the driving environments become potentially more diverse, creating a more immersive experience. As it stands, even converting CGI output to photorealistic neural rendering cannot solve the repetition problem, since the original material entering the neural pipeline is constrained by the limitations of the model environments, which tend to repeat textures and meshes.
The paper states*:
“The accuracy of a traditional driving simulator depends on the quality of its computer graphics pipeline, which consists of 3D models, textures and a rendering engine. High-quality 3D models and textures require craftsmanship, while the rendering engine must perform complicated physical calculations for lighting and shading to appear realistic.”
The new paper is titled Photorealism in Driving Simulations: Blending Generative Adversarial Image Synthesis with Rendering, and comes from researchers at Ohio State University’s Department of Electrical and Computer Engineering and Chongqing Changan Automobile Co Ltd in Chongqing, China.
HGNG transforms the semantic layout of an input CGI-generated scene by blending partially rendered foreground material with GAN-generated environments. Although the researchers experimented with different datasets to train the models, the KITTI Vision Benchmark Suite proved the most effective, consisting mostly of driver-POV footage captured in the German city of Karlsruhe.
The researchers experimented with both Conditional GAN (cGAN) and CycleGAN (CyGAN) as generative networks, and ultimately found that each has strengths and weaknesses: cGAN requires paired datasets, while CyGAN does not. However, CyGAN cannot currently outperform the state of the art in traditional simulators, pending further improvements in domain matching and cycle consistency. Therefore, cGAN currently performs best, despite its additional paired-data requirement.
In HGNG’s neural graphics pipeline, 2D representations are formed from CGI-synthesized scenes. The objects passed from the CGI rendering to the GAN flow are limited to “essential” items, including road markings and vehicles, which a GAN itself cannot currently render with reasonable temporal consistency and integrity for a driving simulator. The cGAN synthesized image is then blended with the partial physics-based rendering.
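The blending step described above can be illustrated with a minimal sketch: composite the partially rendered foreground elements (vehicles, lane markings) over a GAN-generated background using a foreground mask. This is only an illustrative mask-compositing example, not the paper's actual GP-GAN-based blending; all names here are hypothetical.

```python
import numpy as np

def composite(foreground, background, mask):
    """Blend a rendered foreground (vehicles, lane markings) over a
    GAN-generated background. `foreground` and `background` are
    H x W x 3 float arrays in [0, 1]; `mask` is H x W in [0, 1],
    with 1 marking foreground pixels."""
    m = mask[..., None]  # broadcast the mask over the colour channels
    return m * foreground + (1.0 - m) * background

# Tiny usage example with synthetic 2x2 images.
fg = np.ones((2, 2, 3))        # white "rendered" foreground
bg = np.zeros((2, 2, 3))       # black "GAN" background
mask = np.array([[1.0, 0.0],
                 [0.0, 1.0]])  # diagonal foreground pixels
blended = composite(fg, bg, mask)
```

In the paper's pipeline, a trained GP-GAN instance performs this blending in a learned fashion rather than with a hard mask, which avoids visible seams at the compositing boundary.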
To test the system, the researchers used SPADE, trained on Cityscapes, to transform the scene’s semantic layout into photorealistic output. The CGI source comes from the open-source driving simulator CARLA, which uses Unreal Engine 4 (UE4).
UE4’s shading and lighting engine provided the semantic layout and the partially rendered images, outputting only vehicles and lane markings. Blending was achieved using a GP-GAN instance trained on the Transient Attributes Database, and all experiments were run on an NVIDIA RTX 2080 with 8GB of GDDR6 VRAM.
The researchers tested on semantic retention – the ability of the output image to conform to the initial semantic segmentation mask intended as a template for the scene.
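One common way to quantify this kind of agreement is to re-segment the output image (the paper uses DeepLabV3 for segmentation) and measure overlap with the input mask. The sketch below uses mean per-class IoU as the agreement score; the paper's exact retention metric may differ, so treat this as an illustrative assumption.

```python
import numpy as np

def semantic_retention(input_mask, output_mask, num_classes):
    """Mean per-class intersection-over-union between the input
    semantic layout and the segmentation of the generated image.
    Both masks are H x W integer arrays of class IDs; classes absent
    from both masks are skipped."""
    ious = []
    for c in range(num_classes):
        gt = input_mask == c
        pred = output_mask == c
        union = np.logical_or(gt, pred).sum()
        if union == 0:
            continue  # class not present in either mask
        inter = np.logical_and(gt, pred).sum()
        ious.append(inter / union)
    return float(np.mean(ious))
```

A perfectly retained scene scores 1.0; misclassified regions (such as the tree shadows discussed below) pull the score down.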
In the test images above, we can see that in the render-only image (bottom left), the full render fails to produce plausible shadows. The researchers note that here (yellow circle) shadows cast by trees onto the sidewalk were incorrectly classified as “street” content by DeepLabV3 (the semantic segmentation framework used in these experiments).
In the middle-column flow, we see that vehicles created by cGAN lack the consistent definition needed for use in a driving simulator (red circle). In the far-right column, the blended image conforms to the original semantic definition while retaining the essential CGI-based elements.
To assess realism, the researchers used Fréchet Inception Distance (FID) as a performance metric, because it can work with paired or unpaired data.
Three datasets were used as ground truth: Cityscapes, KITTI and ADE20K.
The output images were compared to each other and to the physics-based (i.e., CGI) pipeline using FID scores, while also assessing semantic retention.
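FID fits Gaussians to Inception-network activations of real and generated images and measures the Fréchet distance between them. A minimal sketch of the closing formula, given precomputed activation means and covariances (feature extraction with an Inception network is omitted here):

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians fitted to Inception
    activations:  ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 @ S2)).
    `mu*` are mean vectors; `sigma*` are covariance matrices."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Lower is better: identical activation statistics give a score of zero, so a generated set whose statistics match the ground-truth dataset scores well.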
In the semantic-retention results above, higher scores are better, with the pyramid-based cGAN approach (one of several pipelines tested by the researchers) performing best.
The results shown directly above refer to FID scores, with HGNG rated best when using the KITTI dataset.
The render-only method refers to the output of CARLA, a CGI stream that is not expected to be photorealistic.
Qualitative results from the traditional rendering engine (‘c’ in the image directly above) show unrealistic distant background information such as trees and vegetation, while requiring detailed models, just-in-time mesh loading, and other processor-intensive techniques. In the middle (b), we see that cGAN does not achieve sufficient definition for the essential elements, cars and road markings. In the proposed blended output (a), vehicle and road definition is good, while the environment remains diverse and photorealistic.
The paper concludes by suggesting that the temporal consistency of the GAN-generated portion of the rendering pipeline could be increased by using larger urban datasets, and that future work in this direction could offer a viable alternative to costly neural transformations based on CGI streams while offering more realism and variety.
* My conversion of authors’ inline citations to hyperlinks.
First published on July 23, 2022.