4D panoptic segmentation in a streaming setting is critical for highly dynamic environments, such as crowd evacuation and autonomous driving in complex scenarios, where real-time, fine-grained perception within a constrained time budget is essential. In this paper, we introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. The framework is general and can be seamlessly integrated into existing 3D and 4D segmentation methods to enable real-time capability. It also demonstrates superior robustness compared to existing streaming perception approaches, particularly under high-FPS conditions. The system consists of a predictive thread and an inference thread. The predictive thread leverages historical motion and geometric information to extract features and forecast future dynamics. The inference thread ensures timely predictions for incoming frames by aligning them with the latest memory and compensating for ego-motion and the movements of dynamic objects. We evaluate 4DSegStreamer on the indoor HOI4D dataset and the outdoor SemanticKITTI and nuScenes datasets. Comprehensive experiments demonstrate the effectiveness of our approach, particularly in accurately predicting dynamic objects in complex scenes.
We propose the new task of streaming 4D panoptic segmentation. Following the traditional streaming perception paradigm, streaming 4D panoptic segmentation performs panoptic segmentation in an online manner. The key challenge is ensuring that each incoming frame receives a prediction within a strict, small time budget, even if processing of the current frame is not yet complete.
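To make the time-budget constraint concrete, the sketch below shows one way a streaming query loop could be wired up; the `stream`, `model.submit`, and `model.query` names are illustrative assumptions, not the actual benchmark code.

```python
# Illustrative streaming loop, assuming a hypothetical `model` that exposes a
# non-blocking submit() and a deadline-bounded query(); not the benchmark code.
import time

def run_streaming(stream, model, budget_s=0.1):
    """Ask for a prediction on every incoming frame within a fixed time budget."""
    results = []
    for frame in stream:                          # frames arrive at the sensor FPS
        deadline = time.monotonic() + budget_s
        model.submit(frame)                       # hand the frame off without blocking
        # The answer must come back by the deadline, even if feature extraction
        # for this frame has not finished; the system then falls back to a
        # forecast built from previously processed frames.
        results.append(model.query(frame, deadline=deadline))
    return results
```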
The dual-thread system consists of a predictive thread and an inference thread, enabling real-time queries for unseen future frames. The predictive thread updates the geometric and motion memories with the latest extracted features and leverages the historical information to forecast future dynamics. The inference thread retrieves per-point predictions for the querying frame by geometrically aligning it with the current memory through ego-pose and dynamic object alignment.
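As a rough illustration of this design, a dual-thread layout could be organized as below; the backbone interface (`extract`, `forecast`, `predict`) and the memory format are assumptions made for the sketch, not the paper's actual implementation.

```python
# A minimal dual-thread sketch with a shared, lock-protected memory. The
# backbone methods extract/forecast/predict are hypothetical placeholders.
import threading
import queue

class DualThreadSegmenter:
    def __init__(self, backbone):
        self.backbone = backbone
        self.memory = None                        # latest geometric + motion memory
        self.lock = threading.Lock()
        self.frames = queue.Queue()
        threading.Thread(target=self._predictive_loop, daemon=True).start()

    def submit(self, frame):
        """Non-blocking hand-off of a new frame to the predictive thread."""
        self.frames.put(frame)

    def _predictive_loop(self):
        while True:
            frame = self.frames.get()
            feats = self.backbone.extract(frame)            # slow feature extraction
            with self.lock:                                 # update memory and forecast
                self.memory = self.backbone.forecast(self.memory, feats)

    def query(self, frame, deadline=None):
        """Inference thread: answer from the latest memory without waiting."""
        with self.lock:
            memory = self.memory
        return self.backbone.predict(memory, frame)         # ego-pose + dynamic alignment
```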
In the inference thread, we perform ego-pose alignment and dynamic object alignment to align the current querying frame with the latest memory, which has been updated by previously processed frames. The green points represent the previously processed frame used to update the memory, and the blue points represent the current querying frame. The yellow box highlights static objects, which can be aligned through ego-pose alignment alone. The red box indicates dynamic objects, which additionally require dynamic object alignment.
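For intuition, ego-pose alignment can be sketched as a rigid transform of the querying points into the memorized frame's coordinates, followed by a label lookup. The 4x4 sensor-to-world pose convention and the brute-force nearest-neighbor lookup below are simplifying assumptions; dynamic objects would additionally be shifted by their forecast motion before the lookup.

```python
# Ego-pose alignment sketch (numpy), assuming 4x4 sensor-to-world pose matrices.
import numpy as np

def ego_pose_align(query_pts, query_pose, memory_pose):
    """Map (N, 3) query-frame points into the coordinate frame of the memorized frame."""
    rel = np.linalg.inv(memory_pose) @ query_pose           # query frame -> memory frame
    pts_h = np.concatenate([query_pts, np.ones((len(query_pts), 1))], axis=1)
    return (pts_h @ rel.T)[:, :3]

def transfer_labels(aligned_pts, memory_pts, memory_labels):
    """Copy labels from the memorized points to the aligned query points.

    Brute-force nearest neighbor, kept simple for clarity; a KD-tree would be
    used in practice. Dynamic objects need motion compensation before this step.
    """
    d = np.linalg.norm(aligned_pts[:, None, :] - memory_pts[None, :, :], axis=-1)
    return memory_labels[d.argmin(axis=1)]
```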
Comparison of streaming performance at different FPS settings on the SemanticKITTI dataset. Our 4DSegStreamer demonstrates significant performance gains and exhibits a slower performance decline as the FPS increases, indicating its robustness as a more advanced 4D streaming system for panoptic segmentation tasks, particularly in high-FPS scenarios.
Our method is designed as a plug-and-play module that equips 3D and 4D segmentation backbones with real-time capability in streaming scenarios: existing 3D and 4D segmentation methods can be integrated with our dual-thread system with minimal modification. The results demonstrate that our method significantly improves the performance of these backbones in the streaming setting.
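As a hypothetical usage example (the `load_backbone` and `lidar_stream` helpers and the frame rate are assumptions), an existing backbone could be wrapped as follows, reusing the `DualThreadSegmenter` sketch above.

```python
# Hypothetical integration example; load_backbone and lidar_stream are assumed
# helpers standing in for an existing segmentation network and a sensor stream.
backbone = load_backbone("existing_4d_panoptic_model")
streamer = DualThreadSegmenter(backbone)

for frame in lidar_stream(fps=10):            # frames arrive at the sensor rate
    streamer.submit(frame)                    # predictive thread updates the memory
    pred = streamer.query(frame)              # inference thread answers within the budget
```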
Depending on whether the camera pose is available, we define two settings: