MossNet

MossNet is a neural network for single-image super-resolution (SR) designed to preserve memory efficiency.

Abstract

Current state-of-the-art super-resolution methods are based on diffusion models, which are computationally expensive. At the same time, many models rely on text-to-image backbones, which makes inference even more complex. MossNet aims to remove this complexity by using a U-Net architecture without a text-to-image model.

Architecture

MossNet proposes a context-aware, patch-local upscaling mechanism that uses a single embedding "describing" the image for the whole upscaling process. At the same time, operating on smaller, fixed-size patches reduces the memory footprint.

The pipeline (reconstructed from the diagram, whose SVG did not export):

1. Cut the input image into fixed NxN patches.
2. Feed the image to ContextNet to obtain a single embedding ("description") of the whole image, then decode this context.
3. For each patch:
   - concatenate the decoded context to the patch;
   - apply an initial convolution, expanding the NxN patch to 2Nx2N;
   - refine the 2Nx2N patch with a U-Net;
   - add the result to the reconstructed image.
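To make the data flow concrete, here is a minimal numpy sketch of the patch-local pipeline described in the architecture section. The shapes and the context size are illustrative assumptions, and a nearest-neighbour upsample stands in for the initial convolution plus per-patch U-Net; it is not the actual MossNet implementation.

```python
import numpy as np

def extract_patches(img, n):
    """Cut an (H, W, C) image into non-overlapping n x n patches, row-major."""
    h, w, c = img.shape
    assert h % n == 0 and w % n == 0
    patches = img.reshape(h // n, n, w // n, n, c).swapaxes(1, 2)
    return patches.reshape(-1, n, n, c)  # (num_patches, n, n, c)

def upscale(img, context, n=8):
    """Patch-local 2x upscaling with a single shared context embedding.

    `context` is a 1-D vector standing in for the decoded ContextNet
    output; the nearest-neighbour upsample below is a placeholder for
    the initial convolution + per-patch U-Net.
    """
    h, w, c = img.shape
    patches = extract_patches(img, n)
    out = np.zeros((2 * h, 2 * w, c))
    # Broadcast the shared embedding to per-pixel context planes.
    ctx_planes = np.broadcast_to(context.reshape(1, 1, -1), (n, n, context.size))
    for i, patch in enumerate(patches):
        x = np.concatenate([patch, ctx_planes], axis=-1)  # concat context to patch
        # Placeholder "U-Net": NxN -> 2Nx2N nearest-neighbour upsample.
        up = x[:, :, :c].repeat(2, axis=0).repeat(2, axis=1)
        r, col = divmod(i, w // n)
        out[2 * n * r:2 * n * (r + 1), 2 * n * col:2 * n * (col + 1)] = up
    return out

img = np.random.rand(32, 32, 3)
ctx = np.random.rand(16)  # hypothetical context embedding size
sr = upscale(img, ctx)
print(sr.shape)  # (64, 64, 3)
```

Note that because the context is computed once per image and each patch has a fixed size, peak memory is bounded by a single 2Nx2N patch rather than the full output image.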

Preliminary results

Experiment setup

Model training did not include denoising or deblurring targets. The reason behind this choice is that the model is aimed at handsets and embedded devices, whose users may want blurred images displayed as-is, and denoising algorithms are usually applied separately. Metrics are therefore limited to LPIPS and PSNR. The model was trained for 50,000 iterations on the OpenImages dataset and was then distilled into an XS version.

The model was evaluated on the OpenImages dataset, which was also used for training. To assess out-of-sample generalization, arbitrary images from DIV2K and Unsplash were additionally used during visual testing.

GFLOPS figures were measured for super-resolving a 256x256-pixel image.
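Of the two reported metrics, PSNR is straightforward to compute directly (LPIPS requires a pretrained perceptual network, so it is omitted here). A minimal sketch, assuming images normalized to [0, 1]:

```python
import numpy as np

def psnr(reference, distorted, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    mse = np.mean((reference - distorted) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * np.log10(max_val ** 2 / mse)

ref = np.zeros((8, 8))
noisy = ref + 0.1  # uniform error of 0.1 -> MSE = 0.01
print(round(psnr(ref, noisy), 2))  # 20.0
```

Higher PSNR is better, which is why the results table marks it with an upward arrow.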

Quantitative results

Model               LPIPS↓  PSNR↑  GFLOPS  Param count
MossNet-XS          0.132   27.64  3.32    64.2K
SRGAN ⚠️            0.126   26.63  -       5.949M
StableSR (200 steps) 0.311  23.26  -       approx. 200M
GuideSR             0.265   24.76  -       approx. 200M
EDSR ⚠️             0.133   34.64  -       1.37M

Important notice: the quantitative results above may be biased, since MossNet was never trained on the DIV2K dataset, unlike models from the papers it is compared against. Models whose numbers are likely affected by this bias are marked with a warning ⚠️ icon.

Resolving this issue requires re-training this model (among others) specifically on DIV2K, but problems with my GPU setup will likely take time to fix. For the same reason, GFLOPS figures are currently unavailable for most models in the table above.

Visual results

DIV2K

OpenImages

Unsplash

A random image from Unsplash.

Notice

As this research is still in progress, this page is published under a restrictive CC license.

Copyright

This page is licensed under the CC BY-NC-ND license.

DOI: 10.5281/zenodo.15665470

Author

Rust-loving, Python-purring Rubyist with a taste for clean UI and warm naps in the sun.