Maintaining clear impressions of the environment for autonomous systems even in adverse weather conditions

Abstract

Autonomous systems require continuous and dependable environment perception for navigation and decision making, which is best achieved by combining different sensor types. Radar continues to function robustly in compromised circumstances in which cameras become impaired, guaranteeing a steady inflow of information. Yet camera images provide a more intuitive and readily applicable impression of the world. This work combines the complementary strengths of both sensor types in a unique self-learning fusion approach for probabilistic scene reconstruction in adverse surrounding conditions. After reducing the memory requirements of the synchronized measurements through a decoupled stochastic self-supervised compression technique, the proposed algorithm exploits similarities and establishes correspondences between both domains at different feature levels during training. At inference time, relying exclusively on radio frequencies, the model then successively predicts camera constituents in an autoregressive and self-contained process. These discrete tokens are finally transformed into an instructive view of the respective surroundings, allowing potential dangers to be perceived visually for important downstream tasks.

Without any explicit annotation, the model relies exclusively on radar-based environment sensing to construct intuitive camera views of the surroundings

Camera view generation based solely on radar-frequency information. The synchronized camera ground truth is supplied for visual reference only. The model generally succeeds in inferring the essential characteristics and captures the key features of the underlying real-world scenery. It is less confident about the exact localization of dynamic objects in rapidly changing environments, particularly when they are visible to only one of the two sensors and therefore lack cross-modal correspondence.

Approach


This work addresses two fundamental aspects of applied modern deep learning research: compressing memory-intensive low-level sensor streams into compact representations, and modeling cross-modal correspondences between heterogeneous sensors without manual annotations.

To tackle both problems, the proposed algorithm comprises two stages, both trained end-to-end in a self-supervised fashion on low-level radar data without the need for expensive and time-consuming annotations. Once both stages have been trained, probabilistic predictions about the environment are performed.

Stage 1: Probabilistic Measurement Compression

Both memory-intensive sensor streams are compressed through categorical variational autoencoders into stochastic integer sequences. Each contained token takes on one of 256 categories, representing a square input patch of either domain. The reconstruction quality of these quantized representations is a measure of the model's discretization capabilities and is used as part of the training objective. The animations show how the networks assign different regions of the sensor outputs to distinct latent categories.
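As a rough illustration of this first stage, the sketch below shows a minimal categorical VAE that maps square input patches to one of 256 latent categories via a Gumbel-softmax draw and reconstructs the input from the looked-up code vectors. Layer sizes, the 8×8 patch granularity implied by three stride-2 convolutions, and the plain reconstruction loss are illustrative assumptions, not the exact architecture or objective of this work.

```python
# Minimal sketch of a per-modality categorical VAE for measurement compression.
# All layer sizes are assumptions; a KL/regularization term toward a uniform
# categorical prior would typically accompany the reconstruction loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoricalVAE(nn.Module):
    def __init__(self, in_ch=3, num_categories=256, hidden=64):
        super().__init__()
        # Three stride-2 convolutions: each spatial position of the latent map
        # corresponds to an 8x8 input patch and receives logits over 256 categories.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, num_categories, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(num_categories, hidden)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, in_ch, 4, stride=2, padding=1),
        )

    def forward(self, x, tau=1.0):
        logits = self.encoder(x)                                       # (B, 256, H/8, W/8)
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=1)  # stochastic category choice
        z = torch.einsum("bkhw,kd->bdhw", one_hot, self.codebook.weight)
        recon = self.decoder(z)                                        # reconstruction drives the objective
        tokens = one_hot.argmax(dim=1).flatten(1)                      # compressed integer sequence
        return recon, tokens

# Reconstruction error serves as the self-supervised training signal, e.g.:
# vae = CategoricalVAE(); recon, tokens = vae(batch); loss = F.mse_loss(recon, batch)
```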





Stage 2: Cross-Modal Modeling of Sensor Constituents

Using the memory-reduced domain representations, an autoregressive transformer model finds links between radar and camera measurements in latent space and learns to recognize correlations between both modalities. The incorporated attention mechanism is used to condition camera tokens on discretized radar information. The animation below shows the inter-modal attention span for every head in every layer of the model; each matrix denotes the strength with which camera tokens attend to radar tokens.
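A minimal sketch of how such cross-modal conditioning can be realized is given below: radar and camera token sequences are concatenated and passed through a causally masked, decoder-only transformer, and the next-token loss is evaluated on camera positions only so that camera predictions must draw on radar evidence. Model dimensions and the use of PyTorch's built-in transformer layers are assumptions for illustration.

```python
# Sketch of the cross-modal autoregressive stage (dimensions are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalTransformer(nn.Module):
    def __init__(self, vocab=256, d_model=512, n_head=8, n_layer=8, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, radar_tokens, camera_tokens):
        # Radar tokens come first, so every camera position can attend to the
        # full radar sequence while remaining causal among camera tokens.
        seq = torch.cat([radar_tokens, camera_tokens], dim=1)
        pos = torch.arange(seq.size(1), device=seq.device)
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        h = self.blocks(self.tok(seq) + self.pos(pos), mask=causal)
        return self.head(h)                          # logits over 256 categories per position

def camera_token_loss(model, radar_tokens, camera_tokens):
    # Next-token cross-entropy restricted to camera positions: the model must
    # explain camera content from radar context plus previously seen camera tokens.
    logits = model(radar_tokens, camera_tokens[:, :-1])
    cam_logits = logits[:, radar_tokens.size(1) - 1:, :]   # predictions for c_1 ... c_C
    return F.cross_entropy(cam_logits.reshape(-1, cam_logits.size(-1)),
                           camera_tokens.reshape(-1))
```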


Range-Doppler Conditioned Synthesis of Camera Views


The trained model successively outputs probability mass functions over camera constituents. Sampling from these distributions then predicts camera content based exclusively on robust radar sequences, regardless of weather conditions. Appending the sampled tokens to the radar sequence makes the model increasingly confident about the composition of the environment in latent space. Once the prediction is complete, the constructed camera sequence is decompressed by the categorical decoder into an instructive view of the surroundings.
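The sampling loop itself might look like the following sketch, which builds on the hypothetical classes from the sketches above: starting from the radar sequence alone, the predicted probability mass function for the next camera token is sampled, the draw is appended, and the loop repeats until the latent camera sequence is complete and can be decoded.

```python
# Sketch of radar-conditioned camera synthesis (names are illustrative and
# build on the hypothetical CrossModalTransformer / CategoricalVAE above).
import torch

@torch.no_grad()
def synthesize_camera_view(transformer, decode_tokens, radar_tokens,
                           num_camera_tokens, temperature=1.0):
    camera_tokens = torch.empty(radar_tokens.size(0), 0,
                                dtype=torch.long, device=radar_tokens.device)
    for _ in range(num_camera_tokens):
        logits = transformer(radar_tokens, camera_tokens)            # (B, L, 256)
        pmf = torch.softmax(logits[:, -1, :] / temperature, dim=-1)  # PMF over next camera token
        nxt = torch.multinomial(pmf, num_samples=1)                  # stochastic draw
        camera_tokens = torch.cat([camera_tokens, nxt], dim=1)       # grow latent camera sequence
    # `decode_tokens` is assumed to look up codebook vectors, restore the
    # spatial layout, and run the categorical decoder back to pixel space.
    return decode_tokens(camera_tokens)
```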








Stepwise camera view generation based solely on radar-frequency information. The synchronized camera ground truth is supplied for visual reference only. At times, the model is thrown off track by suboptimal first camera samples and dreams up completely artificial environments. Most often, though, it succeeds in reproducing the global structure of the surroundings and for the most part manages to compile a realistic rendering of its central components.



Exploring the conditional sample space with temperature sweeps

Constraining the sample space while varying the sampling temperature allows the quality of the camera samples to be controlled. The animations below show probabilistic results and contrast the generated views for improved visual intuition.









Nucleus inference with only a limited number of categories, K̂ = 25, to sample camera constituents from. The temporal context underlines the differences in synthesis quality as the sampling temperature varies. Unconstrained category selection over the entire camera sample space and top-1 sampling with K̂ = 1 serve as basic visual references.
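The constrained sampling can be approximated by the small helper sketched below, which keeps only the K̂ most probable categories, rescales them with a temperature, and draws from the renormalized distribution; K̂ = 1 collapses to top-1 decoding, while K̂ = 256 leaves the sample space unconstrained. Function and argument names are illustrative.

```python
# Sketch of temperature-controlled sampling from a restricted category set.
import torch

def sample_constrained(logits, k_hat=25, temperature=1.0):
    """logits: (B, 256) unnormalized scores for the next camera token."""
    topk_vals, topk_idx = (logits / temperature).topk(k_hat, dim=-1)  # keep K̂ categories
    pmf = torch.softmax(topk_vals, dim=-1)                            # renormalize survivors
    choice = torch.multinomial(pmf, num_samples=1)                    # draw in reduced space
    return topk_idx.gather(-1, choice)                                # map back to token ids
```

Sweeping the temperature then trades diversity against fidelity of the generated views, which is what the contrasted animations above visualize.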



The designed method succeeds in capturing the integral objects of a scene and reconstructs crucial entities in the sensors' vicinity.

Acknowledgment

The author would like to thank the EleutherAI community and the members of the EleutherAI Discord channels for fruitful and interesting discussions along the way of composing this paper. Additional thanks go to Phil Wang (lucidrains) for his tireless efforts in making attention-based algorithms accessible to the humble deep learning research community.