
Semantic representation learning for a mask-modulated lensless camera by contrastive cross-modal transferring

Open Access

Abstract

Lensless computational imaging, a technique that combines optically modulated measurements with task-specific algorithms, has recently benefited from the application of artificial neural networks. Conventionally, lensless imaging techniques rely on prior knowledge to deal with the ill-posed nature of unstructured measurements, which requires costly supervised approaches. To address this issue, we present a self-supervised learning method that learns semantic representations for the modulated scenes from implicitly provided priors. A contrastive loss function is designed to train the target extractor (for measurements) from a source extractor (for structured natural scenes), transferring cross-modal priors in the latent space. The effectiveness of the new extractor was validated by classifying mask-modulated scenes from unseen datasets, showing accuracy comparable to that of the source modality (the contrastive language-image pre-trained [CLIP] network). The proposed multimodal representation learning method avoids costly data annotation, adapts more readily to unseen data, and is usable in a variety of downstream vision tasks with unconventional imaging settings.

© 2024 Optica Publishing Group
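
To make the training scheme described in the abstract concrete, the following is a minimal PyTorch sketch of one contrastive transfer step, assuming a frozen source extractor f_k (e.g., a pre-trained CLIP image encoder) applied to raw scenes and a trainable target extractor f_q applied to the corresponding lensless measurements. The function names, the use of in-batch negatives (the paper's schematic instead draws O keys from a dictionary; see Fig. 3), the temperature value, and the stand-in encoders are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k, temperature=0.07):
    """InfoNCE over a mini-batch: the i-th query and i-th key form the
    positive pair; the other keys in the batch serve as negatives."""
    q = F.normalize(q, dim=1)           # L2-normalize representations
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature    # (N, N) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def train_step(f_q, f_k, measurements, scenes, optimizer):
    """One cross-modal transfer step: the frozen source extractor f_k embeds
    raw scenes, the trainable target extractor f_q embeds measurements."""
    with torch.no_grad():               # source extractor stays fixed
        k = f_k(scenes)
    q = f_q(measurements)
    loss = info_nce_loss(q, k)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with stand-in linear encoders; in practice f_k would be a frozen
# CLIP image encoder and f_q a CNN/ViT over the raw sensor measurements.
f_k = torch.nn.Linear(1024, 128).eval()
f_q = torch.nn.Linear(4096, 128)
opt = torch.optim.SGD(f_q.parameters(), lr=1e-3)
scenes = torch.randn(16, 1024)          # flattened raw scenes (toy data)
measurements = torch.randn(16, 4096)    # flattened lensless measurements (toy)
print(train_step(f_q, f_k, measurements, scenes, opt))
```
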


Supplementary Material (1)

Supplement 1

Data availability

The raw scenes from LFW, JAFFE, and color FERET datasets are publicly available in Refs. [12], [28], and [30], respectively.

12. G. B. Huang, M. Ramesh, T. Berg, et al., “Labeled faces in the wild: a database for studying face recognition in unconstrained environments,” in Workshop on Faces in Real-Life Images: Detection, Alignment, and Recognition (University of Massachusetts, 2007).

28. M. J. Lyons, “‘Excavating AI’ re-excavated: debunking a fallacious account of the JAFFE dataset,” arXiv, arXiv:2107.13998 (2021).

30. P. J. Phillips, H. Wechsler, J. Huang, et al., “The FERET database and evaluation procedure for face-recognition algorithms,” Image Vis. Comput. 16, 295–306 (1998).



Figures (7)

Fig. 1.
Fig. 1. General lensless computational imaging system. A lensless imager $\Phi$ modulates a scene $x$ [12], and an image sensor captures the measurement $y$. A purpose-built algorithm $f$ generates an estimate $\hat t$ for a downstream task.
Fig. 2.
Fig. 2. Proposed contrastive representation learning. An instance-level pretext task makes the different modalities of the same semantic instance, e.g., (${y_1},{x_1}$), have close representations in the latent space while pushing representations of different scenes apart.
Fig. 3.
Fig. 3. Schematic diagram. The InfoNCE loss for a mini-batch is calculated by taking $N$ queries from the mini-batch and $O$ keys from a dictionary, and the result is used to update ${f_q}$ via gradient-based optimization. The queries $\{{{q_i}, i = 1, 2, \ldots ,N} \}$ and the keys $\{{{k_i}, i = 1, 2, \ldots ,O} \}$ are extracted by ${f_q}$ and ${f_k}$, respectively. The inner products ${q_i} \cdot {k_i}$ in orange represent the positive pairs, and the rest represent the negative pairs. In most cases, the number of keys is much larger than the number of queries $({N \ll O})$. (A minimal code sketch of this dictionary-based loss is given after the figure list.)
Fig. 4.
Fig. 4. (a) Optical configuration of the lensless imaging system. A binary coded mask was positioned in front of the sensor at an image distance ${d_i}$, and a monitor for rendering the scenes was placed at an object distance ${d_o}$ from the coded mask. (b) Examples of training data. The first row shows the raw scenes, and the second row shows the corresponding measurements.
Fig. 5.
Fig. 5. Examples of the raw scene and the corresponding measurement from (a) JAFFE and (b) color FERET.
Fig. 6.
Fig. 6. Representation visualization for modulated color FERET. The top five classes by quantity from color FERET are selected for visualization, where (a), (b), (c), and (d) indicate the representations generated by ${f_q}$(RN50), ${f_k}$(RN50), ${f_q}$(ViT-B/32), and ${f_k}$(ViT-B/32), respectively.
Fig. 7.
Fig. 7. Measurements versus poses. Each column illustrates a scene with a person in a pose and the measurement of the modulated scene.
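
Complementing the sketch after the abstract, a minimal version of the dictionary-based loss illustrated in Fig. 3, in which $N$ queries are scored against $O$ keys with $N \ll O$ and the diagonal products ${q_i} \cdot {k_i}$ act as positives, could look as follows. The dictionary size, representation dimension, and positive-index convention are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def dictionary_info_nce(q, keys, pos_idx, temperature=0.07):
    """Score N queries against a dictionary of O keys (N << O), as in Fig. 3;
    pos_idx[i] is the dictionary index of query i's positive key."""
    q = F.normalize(q, dim=1)              # (N, d) query representations
    keys = F.normalize(keys, dim=1)        # (O, d) key representations
    logits = q @ keys.t() / temperature    # (N, O) inner-product matrix
    return F.cross_entropy(logits, pos_idx)

# Toy usage: 8 queries, a dictionary of 1024 keys, 128-dimensional features.
q = torch.randn(8, 128)
keys = torch.randn(1024, 128)
pos_idx = torch.arange(8)                  # assume keys 0..7 are the positives
print(dictionary_info_nce(q, keys, pos_idx).item())
```
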

Tables (2)

Table 1. Classification Performance

Table 2. Misclassification Rate Versus Pose (${f_q}$ with “RN50”)

Equations (7)


(1) $q = f_q(y).$

(2) $k = f_k(x).$

(3) $\mathcal{L} = -\log \dfrac{\exp (q \cdot k_+ / \tau)}{\sum_{j=1}^{O+1} \exp (q \cdot k_j / \tau)},$

(4) $y = \Phi x.$

(5) $Y = \Phi_L X \Phi_R^T,$

(6) $\Phi_L = \Phi_R = \begin{bmatrix} \varphi_1 \cdots \varphi_i & 0 & \cdots & 0 & 0 \\ 0 & \varphi_1 \cdots \varphi_i & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \end{bmatrix}.$

(7) $\varphi_i = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1].$
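
As a worked illustration of the separable measurement model in Eqs. (4)–(7), the NumPy sketch below stacks shifted copies of the binary mask pattern into a Toeplitz-like modulation matrix and simulates a measurement $Y = \Phi_L X \Phi_R^T$. The number of row shifts, the zero-padding convention, and the toy scene are assumptions for illustration; the actual sensor size and mask geometry used in the experiments may differ.

```python
import numpy as np

# 31-element binary mask pattern from Eq. (7)
phi = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0,
                0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1])

def modulation_matrix(pattern, n_shifts):
    """Build a Toeplitz-like matrix as in Eq. (6): each row is the mask
    pattern placed at a successive offset, with zeros elsewhere."""
    n_cols = len(pattern) + n_shifts - 1
    Phi = np.zeros((n_shifts, n_cols))
    for i in range(n_shifts):
        Phi[i, i:i + len(pattern)] = pattern
    return Phi

# Simulate a separable measurement Y = Phi_L @ X @ Phi_R.T, Eq. (5).
n = 64                                                # assumed sensor rows/columns
Phi_L = Phi_R = modulation_matrix(phi, n)
X = np.random.rand(Phi_L.shape[1], Phi_R.shape[1])    # toy scene
Y = Phi_L @ X @ Phi_R.T
print(Y.shape)                                        # (64, 64) measurement
```
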