Abstract: To address the inaccurate estimation caused by dynamic objects, occlusions, and other complex real-world scenarios in the autonomous driving domain, an unsupervised depth and optical flow estimation method based on a multi-mask technique was proposed. The approach extracts target depth, camera pose, and optical flow information from monocular video sequences through unsupervised learning. First, several specialized masks were designed for different types of outliers; these masks effectively suppress the interference of outliers on the photometric consistency loss and play a key role in outlier removal during both pose and optical flow estimation. Second, a pre-trained optical flow estimation network was introduced to assist the depth and pose estimation networks in fully exploiting the geometric constraints of the 3D scene, thereby improving joint training performance. Finally, the optical flow estimation network was further trained using the depth and pose information obtained during training, together with the computed masks. Experimental results on the KITTI dataset show that this strategy significantly improves model performance and outperforms other methods of the same type.
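To illustrate the core idea of suppressing outliers in the photometric consistency term, the following is a minimal sketch, not the authors' implementation: a per-pixel mask (covering, for example, dynamic objects or occlusions) is applied to the photometric error between the target frame and a view synthesized from a source frame. All names here (`masked_photometric_loss`, `mask`, etc.) are assumptions introduced for illustration only.

```python
import torch

def masked_photometric_loss(target, warped, mask, eps=1e-7):
    """Photometric consistency loss with outlier masking (illustrative sketch).

    target, warped: (B, 3, H, W) image tensors; `warped` is the source frame
    synthesized into the target view using predicted depth/pose or optical flow.
    mask: (B, 1, H, W) tensor in [0, 1]; values of 0 mark pixels judged to be
    outliers (e.g. dynamic objects, occlusions) so they do not contribute.
    """
    # Per-pixel L1 photometric error between the target and the warped frame.
    error = (target - warped).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    # Zero out masked (outlier) pixels and normalize by the number of valid pixels.
    return (error * mask).sum() / (mask.sum() + eps)

if __name__ == "__main__":
    # Toy usage with random tensors standing in for real frames and masks.
    target = torch.rand(2, 3, 64, 64)
    warped = torch.rand(2, 3, 64, 64)
    mask = (torch.rand(2, 1, 64, 64) > 0.2).float()  # keep roughly 80% of pixels
    print(masked_photometric_loss(target, warped, mask).item())
```

In practice, several such masks (one per outlier type, as described in the abstract) could be combined multiplicatively before being applied to the loss, so that a pixel flagged by any mask is excluded from the photometric term.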