ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization

Nanjing University of Science and Technology

Abstract

How to effectively explore spatial-temporal features is important for video colorization. Instead of stacking multiple frames along the temporal dimension or recurrently propagating estimated features, which accumulates errors or cannot exploit information from far-apart frames, we develop a memory-based feature propagation module that establishes reliable connections with features from far-apart frames and alleviates the influence of inaccurately estimated features. To extract better features from each frame for this feature propagation, we exploit features from large-pretrained visual models to guide the feature estimation of each frame so that the estimated features can model complex scenarios. In addition, we note that adjacent frames usually contain similar content. To exploit this property for better spatial and temporal feature utilization, we develop a local attention module that aggregates features from adjacent frames within a spatial-temporal neighborhood. We formulate the memory-based feature propagation module, the large-pretrained visual model guided feature estimation module, and the local attention module into an end-to-end trainable network (named ColorMNet) and show that it performs favorably against state-of-the-art methods on both benchmark datasets and real-world scenarios.
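To make the memory-based propagation idea concrete, the sketch below shows one way such a readout could work: query features from the current frame attend over keys and values stored from previously colorized, possibly far-apart frames. The function name, tensor shapes, and single-head dot-product attention form are assumptions for exposition only, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def memory_readout(query, mem_keys, mem_values):
    """Attend from the current frame's query features to a memory bank
    built from far-apart past frames (a minimal sketch; the real MFP
    module additionally selects which features to memorize).

    query:      (C, H*W)   features of the current frame
    mem_keys:   (C, T*H*W) keys stored from previously colorized frames
    mem_values: (D, T*H*W) corresponding value (color) features
    """
    # Similarity between every current-frame location and every memory location.
    affinity = torch.einsum('ck,cm->km', query, mem_keys) / query.shape[0] ** 0.5
    weights = F.softmax(affinity, dim=1)                       # (H*W, T*H*W)
    readout = torch.einsum('km,dm->dk', weights, mem_values)   # (D, H*W)
    return readout

# Toy usage with hypothetical dimensions.
C, D, HW, T = 64, 128, 256, 3
out = memory_readout(torch.randn(C, HW),
                     torch.randn(C, T * HW),
                     torch.randn(D, T * HW))
print(out.shape)  # torch.Size([128, 256])
```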

Framework

Our goal is to develop an effective and efficient video colorization method that restores high-quality videos with low GPU memory requirements. The proposed ColorMNet contains a large-pretrained visual model guided feature estimation (PVGFE) module that extracts spatial features from each frame, a memory-based feature propagation (MFP) module that adaptively explores temporal features from far-apart frames, and a local attention (LA) module that explores similar content in adjacent frames for better spatial and temporal feature utilization.
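As a rough illustration of how the LA module's adjacent-frame aggregation could be realized, the sketch below restricts attention from each current-frame location to a local spatial window of the previous frame's features. The window size, single-previous-frame setting, and function name are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def local_attention(cur_feat, prev_feat, window=7):
    """Aggregate features from an adjacent frame within a local
    spatial-temporal neighborhood (simplified sketch).

    cur_feat, prev_feat: (B, C, H, W)
    """
    B, C, H, W = cur_feat.shape
    pad = window // 2
    # Gather, for every location, the window x window neighborhood of the previous frame.
    neigh = F.unfold(prev_feat, kernel_size=window, padding=pad)    # (B, C*window*window, H*W)
    neigh = neigh.view(B, C, window * window, H * W)                # (B, C, K, H*W)
    query = cur_feat.view(B, C, 1, H * W)                           # (B, C, 1, H*W)
    attn = F.softmax((query * neigh).sum(dim=1) / C ** 0.5, dim=1)  # (B, K, H*W)
    out = (attn.unsqueeze(1) * neigh).sum(dim=2)                    # (B, C, H*W)
    return out.view(B, C, H, W)

# Toy usage with hypothetical feature maps.
x_t = torch.randn(1, 32, 24, 24)
x_prev = torch.randn(1, 32, 24, 24)
print(local_attention(x_t, x_prev).shape)  # torch.Size([1, 32, 24, 24])
```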

An overview of the proposed ColorMNet. The core components of our method include: (a) the large-pretrained visual model guided feature estimation (PVGFE) module, (b) the memory-based feature propagation (MFP) module, and (c) the local attention (LA) module.

Input Videos (left column) and Colorized Videos (right column)

More Results on Synthetic Datasets