Image Denoising with Autoencoder
1 Introduction
Artificial Neural Networks (ANNs) are a class of machine learning algorithms that learn from data and specialize in pattern recognition.1 Deep neural networks are now used in many fields and have shown success in a variety of artificial intelligence tasks such as computer vision, natural language processing, and even computational finance.
Deep learning can be viewed as many simple classifiers working together, each one a linear model followed by an activation function. Its basis is the same affine map \(W^{T}X+b\) used in traditional statistical linear regression. The difference is that deep learning uses many such nodes, whereas traditional statistical learning uses a single one. Together these nodes form a neural network, and a single classifier node is known as a neural unit or perceptron. Another point of contrast is that deep learning places many layers between the input and the output. A layer can have hundreds or even thousands of neural units. The layers between the input and the output are known as hidden layers, and their nodes are known as hidden nodes.2
The basic ingredient of an ANN is the feedforward deep network, or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function mapping a set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input.4
Machine learning algorithms can be broadly categorized as unsupervised or supervised according to the kind of “experience” they are allowed to have during the learning process. In this case, the “experience” is a dataset, that is, a collection of examples (sometimes called data points). Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.
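As a rough illustration of the idea that an MLP is a composition of simpler functions, the following minimal NumPy sketch stacks two layers, each computing \(W^{T}x+b\) followed by an optional activation. The layer sizes, the ReLU activation, and the helper names `dense` and `mlp_forward` are assumptions chosen for the example, not part of the text above.

```python
import numpy as np

def relu(z):
    """Element-wise rectified linear activation."""
    return np.maximum(z, 0.0)

def dense(x, W, b, activation=None):
    """One layer: the affine map W^T x + b followed by an optional activation."""
    z = W.T @ x + b
    return activation(z) if activation is not None else z

def mlp_forward(x, layers):
    """Compose the layers; every intermediate h is a new representation of x."""
    h = x
    for W, b, activation in layers:
        h = dense(h, W, b, activation)
    return h

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 8)), np.zeros(8), relu),  # input (4) -> hidden (8)
    (rng.normal(size=(8, 3)), np.zeros(3), None),  # hidden (8) -> output (3)
]
x = rng.normal(size=4)
y = mlp_forward(x, layers)
```

Here the hidden layer has eight units and the output has three; adding more tuples to `layers` adds more hidden layers.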
2 Image Denoising
Image denoising aims to remove noise from a noisy image so as to restore the true image. However, since noise, edges, and texture are all high-frequency components, it is difficult to distinguish them during denoising, and the denoised image may inevitably lose some details.5
The purpose of noise reduction is to decrease the noise in natural images while minimizing the loss of original features and improving the signal-to-noise ratio (SNR). The major challenges for image denoising are as follows:
- flat areas should be smooth,
- edges should be protected without blurring,
- textures should be preserved, and
- new artifacts should not be generated.
2.1 Classical denoising method
- Spatial domain filtering: aims to remove noise by calculating the gray value of each pixel based on the correlation between pixels or image patches in the original image. This is usually done by applying linear or non-linear filters (e.g. mean filtering, median filtering). Spatial filters normally eliminate noise to a reasonable extent but suffer from image blurring, which in turn loses sharp edges.5
- Transform domain filtering: since the characteristics of image information and noise differ in the “transform space”, the noisy image is first transformed to another domain, and a denoising procedure is then applied to the transformed image according to the different characteristics of the image and its noise.
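To make the spatial filtering idea concrete, here is a small sketch of 3×3 mean and median filtering in NumPy. The window size, the edge-replicating padding, and the helper name `filter3x3` are assumptions for illustration only.

```python
import numpy as np

def filter3x3(image, reducer):
    """Apply `reducer` (e.g. np.mean or np.median) over each 3x3 neighbourhood."""
    padded = np.pad(image.astype(np.float64), 1, mode="edge")
    h, w = image.shape
    # Stack the nine shifted views so that axis 0 indexes the neighbourhood.
    windows = np.stack(
        [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    )
    return reducer(windows, axis=0)

# A flat image with a single impulse ("salt") pixel in the centre.
image = np.zeros((5, 5))
image[2, 2] = 255.0

mean_filtered = filter3x3(image, np.mean)      # spreads the impulse over 3x3
median_filtered = filter3x3(image, np.median)  # removes the impulse entirely
```

On this toy input the mean filter smears the outlier into its neighbours, which is the blurring effect mentioned above, while the median filter discards it.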
2.2 Machine Learning method
Denoising methods in machine learning (ML) are usually based on convolutional neural networks (CNNs). A loss function is used to measure how close the denoised image \(\hat{x}\) is to the ground truth \(x\). Deep neural networks have become the tool of choice for image denoising owing to their ability to learn natural image priors from image datasets.
2.3 Additive White Gaussian Noise
Additive white Gaussian noise (AWGN) is one of the most common types of noise, and in the image denoising literature noise is often assumed to be zero-mean AWGN. The model simply adds a random number to each pixel, drawn independently from a Gaussian distribution with mean \(\mu = 0\) and a certain standard deviation \(\sigma\).
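A minimal sketch of this noise model in NumPy is shown below. It assumes 8-bit intensities in \([0, 255]\) and clips the result back to that range; both the clipping and the function name `add_awgn` are choices made for the example rather than something the text prescribes.

```python
import numpy as np

def add_awgn(image, sigma, seed=None):
    """Return the image plus zero-mean Gaussian noise of standard deviation sigma.

    The image is assumed to hold intensities in [0, 255]; the noisy result is
    clipped to the same range (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=sigma, size=image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(noisy, 0.0, 255.0)

clean = np.full((64, 64), 128.0)          # hypothetical flat gray image
noisy = add_awgn(clean, sigma=25.0, seed=0)
```

Note that clipping slightly distorts the Gaussian statistics near pure black or pure white; experiments that need exact AWGN often skip the clipping step.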
2.4 Denoising performance
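To evaluate the performance of image denoising methods, the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) are used as representative quantitative measurements.
Given a ground truth image \(x\) with \(N\) pixels and intensities in \([0, 255]\), the PSNR of a denoised image \(\hat{x}\) is defined by:
\[ \mathrm{PSNR}(x,\hat{x}) = 10 \cdot \log_{10}\left(\frac{255^{2}}{\frac{1}{N}\|x - \hat{x}\|_{2}^{2}}\right) \]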
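The PSNR can be computed in a few lines. This sketch assumes the usual convention that the denominator is the mean squared error over all pixels and that the peak intensity is 255; the function name `psnr` and the `peak` parameter are illustrative.

```python
import numpy as np

def psnr(x, x_hat, peak=255.0):
    """Peak signal-to-noise ratio in dB between ground truth x and estimate x_hat."""
    x = np.asarray(x, dtype=np.float64)
    x_hat = np.asarray(x_hat, dtype=np.float64)
    mse = np.mean((x - x_hat) ** 2)   # mean squared error over all pixels
    if mse == 0.0:
        return float("inf")           # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

reference = np.zeros((4, 4))
estimate = reference + 25.5           # constant error of 25.5 gray levels
value = psnr(reference, estimate)     # 10 * log10(255^2 / 25.5^2) = 20 dB
```

Higher values indicate a reconstruction closer to the ground truth.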
Because quantitative measurements cannot reflect visual quality perfectly, visual quality comparisons on a set of images are also necessary. Besides the noise removal effect, edge and texture preservation is vital when evaluating a denoising method.
3 Autoencoder
An autoencoder is a neural network that is trained to attempt to copy its input to its output.4 This type of ANN is becoming increasingly popular due to its ability to learn complex representations of data. Its main purpose is to learn an “informative” representation of the data in an unsupervised manner.
The autoencoder first encodes the data into a lower-dimensional representation, then reconstructs it back to its original form. Autoencoders can be used for a variety of tasks, such as denoising, anomaly detection, and feature extraction. Because they can learn features from unlabeled data, they have become popular for unsupervised learning tasks.
How does the decoder know the original data in the first place?
Consider first a multilayer perceptron of the form shown in Figure 3, having \(D\) inputs, \(D\) output units, and \(M\) hidden units, with \(M < D\). The targets used to train the network are simply the input vectors themselves, so that the network is attempting to map each input vector onto itself. Such a network is said to form an autoassociative mapping. Since the number of hidden units is smaller than the number of inputs, a perfect reconstruction of all input vectors is not in general possible. The reconstruction error can be lowered by increasing the number of neurons in the hidden layer, so that the network can fit more patterns.
Overall, during training, the network parameters \(w\) must be chosen so that the reconstruction error (the error function), which captures the degree of mismatch between the input vectors and their reconstructions, is minimized:
\[ E(w) = \frac{1}{2} \sum_{n=1}^{N} ||y(x_{n},w)-x_{n}||^{2} \]
If the hidden units have linear activation functions, the error function has a unique global minimum, and at this minimum the network performs a projection onto the \(M\)-dimensional subspace spanned by the first \(M\) principal components of the data.7,8 Thus, the vectors of weights which lead into the hidden units in Figure 3 form a basis set for the “latent space representation”. This “reduction projection” keeps the maximum amount of information when encoding and therefore has the minimum reconstruction error when decoding. In this sense, an autoencoder is a generalization of Principal Component Analysis (PCA).
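The link to PCA can be illustrated numerically. The sketch below does not train a network; it builds the encoder and decoder directly from the first \(M\) principal directions (obtained with an SVD) and evaluates the error function \(E\) defined above. The synthetic data and the helper names are assumptions for the example.

```python
import numpy as np

def pca_autoencoder(X, M):
    """Encoder/decoder pair using the first M principal components of X (rows = samples)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    V = Vt[:M].T                                  # D x M basis of the latent subspace

    def encode(x):
        return (x - mean) @ V                     # project onto the subspace

    def decode(z):
        return z @ V.T + mean                     # map back to input space

    return encode, decode

def reconstruction_error(X, encode, decode):
    """E = 1/2 * sum_n ||y(x_n) - x_n||^2."""
    Y = decode(encode(X))
    return 0.5 * np.sum((Y - X) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # hypothetical data, D = 5
errors = [reconstruction_error(X, *pca_autoencoder(X, M)) for M in range(1, 6)]
```

The error shrinks as \(M\) grows and vanishes (up to rounding) when \(M = D\), which mirrors the observation that a larger hidden layer lowers the reconstruction error.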
4 Types of Autoencoder
Following reference 4, there exists a variety of autoencoders:
- undercomplete autoencoder
- sparse autoencoder
- denoising autoencoder
- variational autoencoder
4.1 Undercomplete Autoencoder
An autoencoder whose bottleneck has a smaller dimension than its input is called undercomplete. In other words, the hidden layer of an undercomplete autoencoder is smaller than the input layer.
Learning an undercomplete representation forces the autoencoder to capture the most notable features of the training data.
4.2 Sparse Autoencoder
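A sparse autoencoder is an autoencoder whose training criterion adds a sparsity penalty on the hidden-layer activations. This allows it to have more hidden nodes than input nodes while still learning useful features. A more in-depth discussion on sparse autoencoders is presented by Goodfellow and Andrew Ng.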
4.3 Denoising Autoencoder (DAE)
The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point \(\tilde{x}\) as input and is trained to recover the original, uncorrupted data point \(x\) as its output. A deeper discussion on denoising autoencoders is presented by Goodfellow.
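The training setup of a denoising autoencoder can be sketched with a deliberately tiny model: a linear encoder and decoder trained by plain gradient descent, with the corrupted data as input and the clean data as target. Everything specific here (synthetic data, Gaussian corruption, linear layers, learning rate, number of steps) is an assumption for illustration; practical DAEs use non-linear, usually convolutional, networks.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 500, 8, 2

# Hypothetical clean data lying in an M-dimensional subspace of R^D.
X = rng.normal(size=(N, M)) @ rng.normal(scale=0.7, size=(M, D))
# Corrupted inputs: clean data plus zero-mean Gaussian noise.
X_noisy = X + rng.normal(scale=0.3, size=X.shape)

# Linear encoder (D -> M) and decoder (M -> D) with small random initial weights.
W_enc = rng.normal(scale=0.1, size=(D, M))
W_dec = rng.normal(scale=0.1, size=(M, D))

def loss(W_enc, W_dec):
    """Mean squared error between the reconstruction of the noisy input and the clean target."""
    return np.mean(np.sum((X_noisy @ W_enc @ W_dec - X) ** 2, axis=1))

initial_loss = loss(W_enc, W_dec)
lr = 0.01
for _ in range(2000):
    H = X_noisy @ W_enc                    # latent code of the corrupted input
    Y = H @ W_dec                          # reconstruction
    dY = 2.0 * (Y - X) / N                 # gradient of the loss with respect to Y
    grad_dec = H.T @ dY
    grad_enc = X_noisy.T @ (dY @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = loss(W_enc, W_dec)
noisy_error = np.mean(np.sum((X_noisy - X) ** 2, axis=1))  # error of doing nothing
```

The key point is the mismatch between input and target: the network sees `X_noisy` but is scored against `X`, so copying the input is no longer an optimal strategy.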
4.4 Variational Autoencoder (VAE)
A variational autoencoder (VAE) is an autoencoder that works with a distribution over the latent variables rather than a single code. Instead of encoding an input as a single point, we encode it as a distribution over the latent space. This distribution of encodings is regularised during training (a way to avoid overfitting) so that the latent space has good properties, allowing us to generate new data as well as “good” reconstructions.
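One common way to realise "encoding as a distribution" and its regularisation is sketched below. It assumes a diagonal Gaussian encoder output (mean `mu` and log-variance `log_var`) and a standard normal prior, neither of which is specified in the text; under those assumptions, sampling uses the reparameterization trick and the regulariser is a closed-form KL divergence.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0])        # hypothetical encoder mean for one input
log_var = np.zeros(2)            # unit variance in both latent dimensions
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)   # 0.5 * ||mu||^2 when the variance is 1
```

In training, this KL term would be added to the reconstruction loss, pulling every encoded distribution towards the prior so that nearby latent points decode to similar outputs.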
4.5 Application of Autoencoder
5 Image Denoising with Autoencoder
5.1 Simple Convolutional Autoencoder
Figure 4 illustrates the architecture of the simple convolutional autoencoder (CAE) that will be used. A convolutional autoencoder extends the basic structure of the simple (vanilla) autoencoder by replacing the fully connected layers with convolutional layers.
5.2 Deep Convolutional Autoencoder
Figure 5, on the other hand, shows the deeper CAE architecture from the medical domain,12 which will be used for comparison.
5.3 DnCNN
5.3.1 Overview
- treats image denoising as a plain discriminative learning problem, i.e., separating the noise from a noisy image with a feed-forward CNN
- uses a CNN because it is effective in increasing the capacity and flexibility for exploiting image characteristics
- leverages batch normalization and residual learning to capture image features and to make training faster
- the trained network can handle three tasks: image denoising, single-image super-resolution, and JPEG deblocking
5.3.2 Methodology
- The size of the convolutional filters is set to \(3 \times 3\) and all pooling layers are removed. Therefore, the receptive field of a DnCNN of depth \(d\) is \((2d+1) \times (2d+1)\).
- For Gaussian denoising with a certain noise level, the receptive field size of DnCNN is set to \(35 \times 35\), which corresponds to a depth of \(17\). For other general image denoising tasks, a larger receptive field is adopted by setting the depth to \(20\).
- A residual learning formulation is adopted: the network is trained to predict the residual (noise) \(R(y)\) from the noisy image \(y\), and the clean image is recovered as \(x = y - R(y)\).
- Three types of layers are used:
  - Conv+ReLU: for the first layer, 64 filters of size \(3 \times 3 \times c\) are used to generate 64 feature maps, with \(c = 1\) for gray images and \(c = 3\) for color images.
  - Conv+BN+ReLU: for layers 2 to \((d-1)\), 64 filters of size \(3 \times 3 \times 64\) are used, and batch normalization is added between convolution and ReLU.
  - Conv: for the last layer, \(c\) filters of size \(3 \times 3 \times 64\) are used to reconstruct the output.
- A simple zero-padding strategy is used before convolution, which does not result in any boundary artifacts.
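The relationship between depth and receptive field, and the three-part layer layout, can be checked with a short sketch. It only does bookkeeping (no actual convolutions), and the helper names are hypothetical.

```python
def receptive_field(depth, kernel_size=3):
    """Side length of the receptive field of `depth` stacked stride-1 convolutions."""
    return depth * (kernel_size - 1) + 1

def dncnn_layers(depth, channels=1, features=64):
    """List (layer_type, in_channels, out_channels) for the three layer types."""
    layers = [("Conv+ReLU", channels, features)]
    layers += [("Conv+BN+ReLU", features, features)] * (depth - 2)
    layers.append(("Conv", features, channels))
    return layers

def denoise(y, residual_fn):
    """Residual learning: the network output R(y) is subtracted from the noisy input y."""
    return y - residual_fn(y)

rf_gaussian = receptive_field(17)   # depth used for Gaussian denoising
rf_general = receptive_field(20)    # depth used for general denoising tasks
layout = dncnn_layers(17, channels=1)
```

With \(3 \times 3\) kernels each layer adds two pixels to the receptive field, which gives \(2d+1\) for depth \(d\): 35 for depth 17 and 41 for depth 20.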
5.4 FFDNet
5.4.1 Overview
- fast and flexible denoising convolutional neural network (FFDNet)
- premise: existing discriminative denoising methods (e.g. DnCNN) are limited in flexibility, and the learned model is usually tailored to a specific noise level
- the noise level is modeled as an input and the tunable model parameters are invariant to noise level
- removes the spatially variant noise by specifying a non-uniform noise level map
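The idea of feeding the noise level to the network can be sketched as an extra input channel. The code below only builds that input tensor; stacking the map as a last channel and the helper names are assumptions of this sketch, and the published network may prepare its input differently in detail.

```python
import numpy as np

def make_noise_level_map(shape, sigma):
    """Build an H x W map from a scalar sigma or from a broadcastable per-pixel array."""
    return np.broadcast_to(np.asarray(sigma, dtype=np.float64), shape).copy()

def build_network_input(noisy, sigma):
    """Stack the noisy image (H x W x C) with its noise level map as an extra channel."""
    noise_map = make_noise_level_map(noisy.shape[:2], sigma)
    return np.concatenate([noisy, noise_map[..., None]], axis=-1)

noisy = np.zeros((8, 8, 3))                                   # hypothetical color image

# Uniform noise level: the same sigma everywhere.
uniform_input = build_network_input(noisy, 25.0)

# Spatially variant noise: sigma increases from left to right.
sigma_map = np.linspace(5.0, 50.0, 8)[None, :] * np.ones((8, 1))
variant_input = build_network_input(noisy, sigma_map)
```

Because the noise level is an input rather than something baked into the weights, a single trained model can be steered towards stronger or weaker denoising at test time.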
5.4.2 Methodology
5.5 BRDNet
5.5.1 Overview
- batch-renormalization denoising network (BRDNet)
- BRDNet combines two networks to increase its width and obtain more features for image denoising.
- uses batch renormalization to address the small mini-batch problem, and applies residual learning (RL) with skip connections to obtain clean images.
- to reduce the computational cost, dilated convolutions are used to capture more features.
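Why dilated convolutions capture more context at no extra cost can be seen in one dimension. In this sketch (pure Python, with hypothetical helper names), a kernel with three weights covers three samples without dilation and five samples with a dilation of 2, while the number of weights stays the same.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (dilation - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1-D cross-correlation whose kernel taps are `dilation` samples apart."""
    k = len(kernel)
    span = effective_kernel_size(k, dilation)
    return [
        sum(kernel[j] * signal[start + j * dilation] for j in range(k))
        for start in range(len(signal) - span + 1)
    ]

signal = [1, 2, 3, 4, 5]
kernel = [1, 1, 1]                                   # three weights in both cases
dense_out = dilated_conv1d(signal, kernel, dilation=1)
dilated_out = dilated_conv1d(signal, kernel, dilation=2)
```

With dilation 2 the single output sums samples 1, 3, and 5, so each output sees a wider neighbourhood for the same number of multiplications.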
5.5.2 Methodology
Figure 10 shows the implementation strategy of BRDNet.
5.6 RIDNet
The model in Figure 11 is composed of three modules: feature extraction, feature learning with a residual-on-the-residual module, and reconstruction.
5.6.1 Overview
- single-stage blind real image denoising network (RIDNet)
- enhancement attention modules (EAM) are used to capture essential features using an attention mechanism
5.7 Zero-Shot Noise2Noise
5.7.1 Overview
- Zero-Shot Noise2Noise (ZS-N2N)
- motivation: preparing a dataset of clean-noisy image pairs is expensive and time-consuming
- main idea is to generate a pair of noisy images from a single noisy image (a dataset-free method) and train a small network only on this pair
- extends Noise2Noise and Neighbor2Neighbor by enabling training on only one single noisy image
- zero-shot: only noisy image is given
- blind-denoising: no information of noise level
5.7.2 Methodology
- decompose the noisy image into a pair of downsampled images
- train a lightweight network with regularization to map one downsampled image to the other
- denoising is then applied to the noisy test image
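The first step, decomposing one noisy image into a pair of downsampled images, can be sketched as follows. Averaging the two diagonals of every non-overlapping 2×2 block is an assumption about how the pair is formed; the text above only states that a pair of downsampled images is produced. The function name `pair_downsample` is hypothetical.

```python
import numpy as np

def pair_downsample(noisy):
    """Split an H x W image (H and W even) into two half-resolution images.

    Each output pixel averages one diagonal of a non-overlapping 2x2 block,
    so the two outputs share the same content but use disjoint noisy pixels.
    """
    top_left = noisy[0::2, 0::2]
    top_right = noisy[0::2, 1::2]
    bottom_left = noisy[1::2, 0::2]
    bottom_right = noisy[1::2, 1::2]
    first = 0.5 * (top_right + bottom_left)    # anti-diagonal average
    second = 0.5 * (top_left + bottom_right)   # main-diagonal average
    return first, second

noisy = np.array([[1.0, 0.0, 2.0, 4.0],
                  [0.0, 3.0, 6.0, 8.0]])
first, second = pair_downsample(noisy)
```

A lightweight network would then be trained to map `first` to `second` (and vice versa); since the noise in the two images comes from different pixels, the network cannot simply memorise it.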
5.8 Selected SOTA architectures
In Table 1 we select five autoencoder-based denoising methods that are highly cited, including the latest approach.
No | Year | Architecture | Objective | Methodology | Result | Weakness | Github |
---|---|---|---|---|---|---|---|
1 | 2018 | FFDNet | fast and flexible denoising CAE; deals with spatially variant noise | the noise level is modeled as an input and the tunable model parameters are invariant to noise level; Adam is used as the optimizer; rotation- and flip-based data augmentation is also adopted during training | 17 | requires manual intervention to select high noise level (ref. 16) | FFDNet |
2 | 2020 | BRDNet | increases the width rather than the depth to enhance the learning ability of the denoising network | combines two networks to increase the width of BRDNet and extract more features; uses batch renormalization to address the small mini-batch problem, and applies residual learning (RL) with skip connections to obtain clean images; dilated convolutions are used to capture more features | 15 | TBD | BRDNet |
3 | 2017 | DnCNN | treats image denoising as a plain discriminative learning problem, i.e., separating the noise from a noisy image with a feed-forward CNN; leverages batch normalization and residual learning to capture image features | based on a modified VGG network; adopts the residual learning formulation and combines it with batch normalization for fast training and improved denoising performance | 13 | tailored to a specific noise level | DnCNN |
4 | 2019 | RIDNet | incorporates feature attention in denoising | modular network comprising three main modules: feature extraction, feature learning residual module, and reconstruction | 16 | TBD | RIDNet |
5 | 2023 | ZS-N2N | zero-shot learning with a simple network; denoises images without any training data, noise model, or noise level as input | extends Noise2Noise and Neighbor2Neighbor by enabling training on only one single noisy image | 18 | TBD | ZS-N2N |