
Semantic representation learning for a mask-modulated lensless camera by contrastive cross-modal transferring

Open Access

Abstract

Lensless computational imaging, a technique that combines optically modulated measurements with task-specific algorithms, has recently benefited from the application of artificial neural networks. Conventionally, lensless imaging techniques rely on prior knowledge to deal with the ill-posed nature of unstructured measurements, which requires costly supervised approaches. To address this issue, we present a self-supervised learning method that learns semantic representations for the modulated scenes from implicitly provided priors. A contrastive loss function is designed to train the target extractor (for measurements) from a source extractor (for structured natural scenes), transferring cross-modal priors in the latent space. The effectiveness of the new extractor was validated by classifying mask-modulated scenes from unseen datasets, showing accuracy comparable to that of the source modality (the contrastive language-image pre-trained [CLIP] network). The proposed multimodal representation learning method avoids costly data annotation, adapts more readily to unseen data, and is usable in a variety of downstream vision tasks with unconventional imaging settings.

© 2024 Optica Publishing Group
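
To make the training scheme described in the abstract concrete, the following is a minimal PyTorch sketch of one contrastive transfer step, assuming a frozen source extractor f_k (e.g., a pre-trained CLIP image encoder) applied to raw scenes and a trainable target extractor f_q applied to the corresponding lensless measurements. The function names, the use of in-batch negatives (the paper's schematic instead draws O keys from a dictionary; see Fig. 3), the temperature value, and the stand-in encoders are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k, temperature=0.07):
    """InfoNCE over a mini-batch: the i-th query and i-th key form the
    positive pair; the other keys in the batch serve as negatives."""
    q = F.normalize(q, dim=1)           # L2-normalize representations
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature    # (N, N) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def train_step(f_q, f_k, measurements, scenes, optimizer):
    """One cross-modal transfer step: the frozen source extractor f_k embeds
    raw scenes, the trainable target extractor f_q embeds measurements."""
    with torch.no_grad():               # source extractor stays fixed
        k = f_k(scenes)
    q = f_q(measurements)
    loss = info_nce_loss(q, k)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with stand-in linear encoders; in practice f_k would be a frozen
# CLIP image encoder and f_q a CNN/ViT over the raw sensor measurements.
f_k = torch.nn.Linear(1024, 128).eval()
f_q = torch.nn.Linear(4096, 128)
opt = torch.optim.SGD(f_q.parameters(), lr=1e-3)
scenes = torch.randn(16, 1024)          # flattened raw scenes (toy data)
measurements = torch.randn(16, 4096)    # flattened lensless measurements (toy)
print(train_step(f_q, f_k, measurements, scenes, opt))
```
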


Supplementary Material (1)

Supplement 1

Data availability

The raw scenes from LFW, JAFFE, and color FERET datasets are publicly available in Refs. [12], [28], and [30], respectively.

12. G. B. Huang, M. Ramesh, T. Berg, et al., “Labeled faces in the wild: a database for studying face recognition in unconstrained environments,” in Workshop on Faces in Real-Life Images: Detection, Alignment, and Recognition (University of Massachusetts, 2007).

28. M. J. Lyons, “‘Excavating AI’ re-excavated: debunking a fallacious account of the JAFFE dataset,” arXiv, arXiv:2107.13998 (2021).

30. P. J. Phillips, H. Wechsler, J. Huang, et al., “The FERET database and evaluation procedure for face-recognition algorithms,” Image Vis. Comput. 16, 295–306 (1998).



Figures (7)

Fig. 1.
Fig. 1. General lensless computational imaging system. A lensless imager $\Phi$ modulates a scene $x$ [12], and an image sensor captures the measurement $y$. A purpose-built algorithm $f$ generates an estimate $\hat t$ for a downstream task.
Fig. 2.
Fig. 2. Proposed contrastive representation learning. An instance-level pretext task makes the different modalities of the same semantic instance, e.g., (${y_1},{x_1}$), have close representations in the latent space while pushing representations of different scenes apart.
Fig. 3.
Fig. 3. Schematic diagram. The InfoNCE loss for a mini-batch is calculated by taking $N$ queries from the mini-batch and $O$ keys from a dictionary, and the result is used to update ${f_q}$ via gradient-based optimization. The queries $\{{{q_i}, i = 1, 2, \ldots ,N} \}$ and the keys $\{{{k_i}, i = 1, 2, \ldots ,O} \}$ are extracted by ${f_q}$ and ${f_k}$, respectively. The inner products ${q_i} \cdot {k_i}$ in orange represent the positive pairs, and the rest represent the negative pairs. In most cases, the number of keys is much larger than the number of queries $({N \ll O})$. (A minimal code sketch of this dictionary-based loss is given after the figure list.)
Fig. 4.
Fig. 4. (a) Optical configuration of the lensless imaging system. A binary coded mask was positioned in front of the sensor at an image distance ${d_i}$, and a monitor for rendering the scenes was placed at an object distance ${d_o}$ from the coded mask. (b) Examples of training data. The first row shows the raw scenes, and the second row shows the corresponding measurements.
Fig. 5.
Fig. 5. Examples of the raw scene and the corresponding measurement from (a) JAFFE and (b) color FERET.
Fig. 6.
Fig. 6. Representation visualization for modulated color FERET. The top five classes by quantity from color FERET are selected for visualization, where (a), (b), (c), and (d) indicate the representations generated by ${f_q}$(RN50), ${f_k}$(RN50), ${f_q}$(ViT-B/32), and ${f_k}$(ViT-B/32), respectively.
Fig. 7.
Fig. 7. Measurements versus poses. Each column illustrates a scene with a person in a pose and the measurement of the modulated scene.
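
Complementing the sketch after the abstract, a minimal version of the dictionary-based loss illustrated in Fig. 3, in which $N$ queries are scored against $O$ keys with $N \ll O$ and the diagonal products ${q_i} \cdot {k_i}$ act as positives, could look as follows. The dictionary size, representation dimension, and positive-index convention are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def dictionary_info_nce(q, keys, pos_idx, temperature=0.07):
    """Score N queries against a dictionary of O keys (N << O), as in Fig. 3;
    pos_idx[i] is the dictionary index of query i's positive key."""
    q = F.normalize(q, dim=1)              # (N, d) query representations
    keys = F.normalize(keys, dim=1)        # (O, d) key representations
    logits = q @ keys.t() / temperature    # (N, O) inner-product matrix
    return F.cross_entropy(logits, pos_idx)

# Toy usage: 8 queries, a dictionary of 1024 keys, 128-dimensional features.
q = torch.randn(8, 128)
keys = torch.randn(1024, 128)
pos_idx = torch.arange(8)                  # assume keys 0..7 are the positives
print(dictionary_info_nce(q, keys, pos_idx).item())
```
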

Tables (2)

Table 1. Classification Performance

Table 2. Misclassification Rate Versus Pose (${f_q}$ with “RN50”)

Equations (7)


(1) $q = f_q(y).$

(2) $k = f_k(x).$

(3) $\mathcal{L} = -\log \dfrac{\exp (q \cdot k_+ / \tau)}{\sum_{j=1}^{O+1} \exp (q \cdot k_j / \tau)},$

(4) $y = \Phi x.$

(5) $Y = \Phi_L X \Phi_R^T,$

(6) $\Phi_L = \Phi_R = \begin{bmatrix} \varphi_1 \cdots \varphi_i & 0 & \cdots & 0 & 0 \\ 0 & \varphi_1 \cdots \varphi_i & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \end{bmatrix}.$

(7) $\varphi_i = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1].$
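
As a worked illustration of the separable measurement model in Eqs. (4)–(7), the NumPy sketch below stacks shifted copies of the binary mask pattern into a Toeplitz-like modulation matrix and simulates a measurement $Y = \Phi_L X \Phi_R^T$. The number of row shifts, the zero-padding convention, and the toy scene are assumptions for illustration; the actual sensor size and mask geometry used in the experiments may differ.

```python
import numpy as np

# 31-element binary mask pattern from Eq. (7)
phi = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0,
                0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1])

def modulation_matrix(pattern, n_shifts):
    """Build a Toeplitz-like matrix as in Eq. (6): each row is the mask
    pattern placed at a successive offset, with zeros elsewhere."""
    n_cols = len(pattern) + n_shifts - 1
    Phi = np.zeros((n_shifts, n_cols))
    for i in range(n_shifts):
        Phi[i, i:i + len(pattern)] = pattern
    return Phi

# Simulate a separable measurement Y = Phi_L @ X @ Phi_R.T, Eq. (5).
n = 64                                                # assumed sensor rows/columns
Phi_L = Phi_R = modulation_matrix(phi, n)
X = np.random.rand(Phi_L.shape[1], Phi_R.shape[1])    # toy scene
Y = Phi_L @ X @ Phi_R.T
print(Y.shape)                                        # (64, 64) measurement
```
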