Coupled deep learning coded aperture design for compressive image classification

Jorge Bacca; Laura Galvis; Henry Arguello

doi:10.1364/OE.381479

1. Introduction

Compressed sensing (CS) has emerged as a sensing paradigm that instead of acquiring $N$ samples of a given signal $\mathbf {x} \in \mathbb {R}^{N}$, captures $M \ll N$ linear projected measurements ($\mathbf {y}=\mathbf {\Phi }\mathbf {x} \in \mathbb {R}^{M}$), resulting in hardware compression [1]. The single-pixel camera (SPC) has been lately exploited as part of research advances on the CS theory [2]. This low-cost camera acquires a multiplexed version of a scene with a single-pixel detector computing random linear measurements of the scene using a binary coded aperture [2]. This architecture is especially useful in applications where multiple-pixel sensors are expensive or infeasible, such as at shortwave-infrared and terahertz wavelengths [3]. SPC is an efficient sensing system at the expense of a slow computational recovery, that leads with the expensive task of finding a solution of an under-determined system [1]. To this end, CS provides theoretical guarantees for image recovery, assuming that the underlying image is sparse in some basis [4]. More recently, deep learning approaches have been proposed for CS recovery, addressing the speed and sparsity limitations [5–7]. However, high computation workloads come as a trade-off for the fast signal recovery from its compressed measurements under these approaches [6,8].

From a hardware point of view, recent works have focused on designing the binary coded aperture in order to improve the storage space, speed, and accuracy of the image reconstruction [3,9–11]. However, when the objective is an inference task such as segmentation, detection, or classification, the two-step process of reconstruction and task-solving is suboptimal in terms of efficiency; indeed, some compressed learning (CL) approaches have shown that it is possible to perform inference tasks directly in the compressive domain without the need to restore the scene [12,13]. Specifically in [12], theoretical and simulation results have shown that it is possible to learn features from the CS measurements, and these can be used in classifiers such as support vector machines [14], and sparse subspace clustering [15]. Also, in [16] a convolutional neural network (CNN) for compressed image classification using random sensing matrices was developed. In particular, the input of the CNN is a re-projected measurement vector $\mathbf {\hat {x}=\boldsymbol {\Phi }}^T\mathbf {y}$, where $\mathbf {\hat {x}}$ has the same size of the target image. More recently, [17] used a deep learning approach to simultaneously learn the linear projections and the non-linear classification net. Different from [16], in [17] the re-projection to the image size ($\boldsymbol {\Phi }^{T}$) is learned in the second fully connected layer of the network. Although [17] learned the linear projection operator, it cannot provide linear projections that fit with specific structures and properties of implementable CS systems. For instance, in optical imaging, elements such as spatial light modulators (SLM) or digital micro-mirror devices (DMD) are used in the acquisition as coded apertures, whose patterns are mathematically modeled by a binary sensing matrix with a specific structure determined by the optical configuration [18].

In contrast to state-of-the-art methods, this work proposes a coupled deep learning approach for the sensing matrix design, accounting for real/implementable SPC, and image classification directly from the compressed measurements. In the proposed approach, a neural network (NN) is trained to simultaneously learn the linear sensing matrix and the parameters of the non-linear classification network, considering the constraints imposed by the SPC. The first layer learns the binary sensing matrix, and the subsequent layers learn the classification parameters. After that, the optimized sensing matrix is used in a real SPC to acquire the projected data. These measurements are classified by using the already trained subsequent part of the network. In particular, for the classification task, two different NN architectures are proposed. The first, similar to [17], learns a fully connected layer to re-project the measurements to the image size so that any known CNN classifier can be applied. The second NN architecture proposes to extract small features directly from the compressive measurements without learning an image size re-projection operator. The performance of the proposed approach is demonstrated using MNIST and CIFAR-10 data sets, for which it provided better average classification accuracy for different sensing ratios compared to the works in [16] and [17]; and in some cases, it obtained similar results compared to non-implementable (non-binary) learned sensing matrices [17] for sensing ratios below $0.05$. Additionally, an optical setup was built to validate the classification results from SPC compressed measurements using the learned coded apertures.

2. Single-pixel camera model

The single-pixel camera (SPC) spatially encodes the full image before a single-pixel detector acquires projections of the image. The optical architecture consists of an objective lens, a coded aperture, a collimator lens, and a single-pixel sensor, as illustrated in Fig. 1. Precisely, the scene $f(x,y),$ where $(x,y)$ index the spatial dimensions of the scene, is modulated by a binary coded aperture $\phi ^{k}(x,y)$, for $k=1,\ldots ,K$ different number of snapshots. The coded aperture in Fig. 1, blocks-unblocks some pixels of the scene. Then, the encoded scene passes through the collimator lens, which concentrates the light into a single spatial point. This operation is expressed as

(1)$$g^{k}(x,y) = \iint \phi^{k}(x,y) f(x,y) dxdy,$$

where a single-pixel detector with a pixel size $\Delta _g$ captures the incoming light intensity. In particular, the single discrete measurement is written as

(2)$$\tilde{g}^{k} = \iint g^{k}(x,y) \textrm{ rect} \left(\frac{x}{\Delta_g}-1,\frac{y}{\Delta_g}-1\right) dxdy,$$

where $\textrm {rect(}\cdot {)}$ represents the rectangular step function accounting for the 2D sampling function. On the other hand, assuming that the coded aperture is a two-dimensional array of squared pixels of size $\Delta _t$, it can be discretely described as

(3)$$\phi^{k}(x,y) = \sum_{m=1}^{M} \sum_{n=1}^{N} \phi^{k}_{m,n} \textrm{rect}\left(\frac{x}{\Delta_t}-m,\frac{y}{\Delta_t}-n\right),$$

where $\phi ^{k}_{m,n}$ is the coding performed on the $(m,n)^{th}$ pixel at the $k^{th}$ snapshot, and $MN$ is the number of pixels in the coded aperture. Note that the distribution of the pattern in $\phi _{m,n}$ is the only physical property that can be optimized in the SPC system, and different patterns will be referred to as a set of binary weights in the next section. Using Eq. (3) in Eq. (2), the discrete sensing model of the SPC can be then rewritten as

(4)$$\tilde{g}^{k} = \sum_{m=1}^{M} \sum_{n=1}^{N} \phi^{k}_{m,n} f_{m,n} + \omega^{k},$$

where $f_{m,n}$ is the $(m,n)^{th}$ pixel of the scene, which is given by the pixel size of the coded aperture ($\Delta _{t}$), similarly to $\phi _{n,m}$ in Eq. (3) and $\omega$ stands for the noise. It is worth to highlight that, modifying the pixel values of the coded aperture, different projections of the same scene can be obtained. In particular, stacking the measurements in a single vector $\mathbf {g} = [\tilde {g}^{1},\ldots ,\tilde {g}^{K}]^T$, the forward matrix model can be expressed as

(5)$$\mathbf{g} = \boldsymbol{\Phi} \mathbf{f} + \boldsymbol{\omega},$$

where $\mathbf {f} \in \mathbb {R}^{MN}$ is the vectorization of the scene, and $\boldsymbol {\Phi } \in \mathbb {R}^{K \times MN}$ is the sensing matrix of the single-pixel camera, whose rows contain the vectorization of the coded apertures. A conventional relation which defines the compression achieved by these systems is the sensing ratio, calculated as the number of shots over the dimension of the image as

(6)$$\gamma= \frac{K}{MN}.$$

To recover the image from the measurements $\mathbf {g}$, CS theory assumes that $\mathbf {f}$ is sparse on some basis $\boldsymbol {\Psi }$; then, the underlying image can be recovered by solving an optimization problem of the form $\hat {\mathbf {x}} = \boldsymbol {\Psi } \left \lbrace \textrm{arg min} _{\boldsymbol {\alpha }} \|\boldsymbol {\Phi }\boldsymbol {\Psi \alpha } - \mathbf {g} \|_2^2 + \tau \|\boldsymbol {\alpha }\|_1 \right \rbrace , \label {eq:traditional_method}$ where the $\ell _1$-norm promotes the sparsity of $\boldsymbol {\alpha }$, and $\tau$ is a regularization parameter. The reconstruction quality depends mainly on the computational recovery algorithm [1] and the quality of the sensing matrix ($\boldsymbol {\Phi }$), which is directly related to the hardware distribution of the coded aperture [19]. However, for some specific tasks, such as segmentation, detection, or classification, this reconstruction step can be avoided solving the problem directly in the compressive domain [15]. Indeed, [15] has shown that the quality of the classification results over compressed measurements can be improved if the set of coded apertures is appropriately designed.

Fig. 1. Schematic of the single-pixel camera acquisition.

Download Full Size | PDF

3. Coded aperture design and classification: a compressed learning framework

The proposed approach aims to design a binary sensing matrix $(\boldsymbol {\Phi })$ accounting for real and implementable SPC coding patterns such that the obtained measurements can be effectively used for classification employing a deep learning (DL) scheme. In particular, a coupled neural network approach which simultaneously learns the sensing matrix and the parameters of the classification network is summarized in the top of Fig. 2. Once the sensing matrix is optimized, new SPC measurements can be acquired, and the trained inference network, can be directly applied over the compressive measurements to obtain the classification results as shown at the bottom of Fig. 2.

Fig. 2. Proposed deep learning scheme where the colors are only for illustrative purposes and represent different shots. In the training step the binary sensing matrix and CNN parameters for classification are learned. In the testing step, the learned sensing matrix is hardware implemented in a real DMD to acquire SPC compressed measurements, and in terms of software, those measurements are classified with the learned CCN.

Download Full Size | PDF

3.1 Training stage

The training step consists of two main blocks: the first one related to the binary sensing matrix optimization, and the second related to the non-linear classification operator $(\mathcal {M}_{\boldsymbol {\theta }})$ as summarized in the top of Fig. 2. Specifically, with a set of $L$ images $\{\mathbf {x}_{\ell }\}_{\ell = 1}^{L}$ and its respective labels $\{\mathbf {d}_\ell \}_{\ell = 1}^{L}$, the joint learning problem can be formulated as follows

(7)$$\begin{aligned} & \{\boldsymbol{\Phi},\boldsymbol{\theta}\} = & & \mathop{\arg \min}\limits_{\boldsymbol{\Phi}, \boldsymbol{\theta}} & & & \frac{1}{L} \sum_{\ell = 1}^{L} \mathcal{L}\left(\mathcal{M}_{\boldsymbol{\theta}}(\boldsymbol{\Phi}\mathbf{x}_\ell ), \mathbf{d}_\ell\right) & & & & \\ & & & \textrm{subject to} & & & \boldsymbol{\Phi} \in \{0,1\}_{k,n}, \;\;\; k = 1, \ldots & & & & \hspace{-1.6em}, K, \; and \; n = 1, \ldots,MN, \end{aligned}$$

where $\mathcal {L}(\cdot ,\cdot )$ stands for the loss function and $\boldsymbol {\theta }$ represents the parameters of the classification network. Notice that the constraints in Eq. (7) are imposed by the SPC system, and model the on (1) and off (0) states of the coded aperture [3]. We propose a single network under a deep compressive learning scheme to solve Eq. (7). In particular, the network is composed of a fully connected layer as the first layer (sensing layer), which is directly connected to the classification network. To take into account the constraint imposed in Eq. (7) over the sensing matrix $\boldsymbol {\Phi }$, pointed out as the weights of the first layer, a penalty term is included in the loss function to promote binary values in $\boldsymbol {\Phi }$. Besides, a bias $\mathbf {b}_1$ and an activation function $f_1$ are proposed for this layer. The activation function is intended to learn the non-linear properties, and the bias is used to delay the triggering of the activation function [20]. The proposed method can be expressed mathematically as including a penalty term to the loss function in Eq. (7) as follows

(8)$$\{\boldsymbol{\Phi},\boldsymbol{\theta}\} = \mathop{\arg \min}\limits_{\boldsymbol{\Phi}, \boldsymbol{\theta}} \frac{1}{L} \sum_{\ell = 1}^{L} \mathcal{L}(\mathcal{M}_{\boldsymbol{\theta}}(f_1(\boldsymbol{\Phi}\mathbf{x}_\ell + \mathbf{b}_1)), \mathbf{d}_\ell) + \mu \sum_{k=1}^{K} \sum_{n=1}^{MN}(1+\boldsymbol{\Phi}_{k,n})^2(1-\boldsymbol{\Phi}_{k,n})^2 ,$$

where $\mu$ is a regularization parameter which controls the trade-off between the loss function and the binary values regularizer. Notice that minimizing the second term in Eq. (8) induces $\boldsymbol {\Phi }_{k,n}$ to have values of $1$ or $-1$ (the codification with negative values can be achieved in a real experimental setup, as explained in the Appendix). Once the structure of the classification operator is established, which will be presented below for the classification task, Eq. (8) can be solved with state-of-the-art deep learning algorithms such as stochastic gradient descent (sgd) [21], mini-batch gradient descent, gradient descent with momentum [22], or a method for stochastic optimization derived from adaptive moment estimation (Adam) [23].

3.1.1 Structure of the classification operator

The inference task considered in this paper is classification; however, any other inference task can be solved using the proposed approach. In particular, defining the structure of the inference operator as a neural network, the classification loss function is usually defined as

(9)$$\mathcal{L}(\mathbf{z}_{\ell},\mathbf{d}_{\ell}) = -\left[ \mathbf{d}_{\ell} \log(\mathbf{z}_{\ell}) + (\mathbf{1}-\mathbf{d}_{\ell})\log (\mathbf{1}-\mathbf{z}_{\ell}))\right],$$

where $\mathbf {z}_{\ell }$ is the output of the classification operator applied to the $\ell$-th image, i.e., $\mathbf {z}_{\ell } = \mathcal {M}_{\boldsymbol {\theta }}(f_1(\boldsymbol {\Phi }\mathbf {x}_{\ell } + \mathbf {b}_1)$. In particular, for the classification net, which begins in the second layer of the whole scheme, we propose two different approaches. The first, takes into account the large number of state-of-the-art CNNs classification methods that require the image size, i.e., the input of this networks is the whole image [17]. For this approach, we propose a fully connected second layer to re-project the first layer output to the image size; thus, any well-known classification CNN can be concatenated after the second layer output. Usually the structure of a CCN for image classification consists of convolution operations with a set of filters ($\boldsymbol {\theta }$), followed by an element-wise non-linear operator such as a sigmoid, ReLU or tanh function [24]; between convolutional layers, maxpooling layers are usually employed to reduce dimensionality. Finally, fully connected layers with a soft-max function are applied to the last layers in order to obtain the probability for each class.

The second approach is a direct way to extract the features from the compressive measurements, without the need to return to the image size. This can be achieved with smaller fully connected layers followed by an element-wise non-linear function. This new approach has advantages because the re-projection to the image size implies a higher number of parameters to train which therefore, is prone to overfitting and requires more training time as it will be shown in the results section [25].

3.2 Testing stage

The testing stage splits in hardware and software sub-stages. In terms of hardware, once the training stage is complete, each resulting row of $\boldsymbol {\Phi }$ represents a distribution of a coded aperture, where the number of rows represent the number of shots. Thus, the trained $\boldsymbol {\Phi }$ provides the physical pattern used to acquire new compressed measurements $\mathbf {g}$ employing the SPC through the digital micromirror devices (DMD) as explained in section 2. Additionally, in the software counterpart, the classification operator $(\mathcal {M}_{\boldsymbol {\theta }})$ and its learned parameters, are used as inference operator, which can be applied directly over the compressed measurements as

(10)$$\mathbf{z} = \mathcal{M}_{\boldsymbol{\theta}}(f_1(\underbrace{\mathbf{g}}_{\boldsymbol{\Phi}\mathbf{x}} + \mathbf{b}_1)),$$

where $\mathbf {z}$ is the classification result. It is worth to highlight that $\mathcal {M}_{\boldsymbol {\theta }}(\cdot )$ is the second block of the whole learned network, since the first block was employed to obtain $\mathbf {g}$ using the previously trained $\boldsymbol {\Phi }$. This testing stage is summarized at the bottom of Fig. 2, where it is shown as hardware and software blocks; the colors in the first layer of the classification represent the number of shots used.

4. Simulations and results

This section evaluates the performance of the proposed coupled SPC binary sensing matrix design for compressive image classification. Specifically, the first approach, which preserves the image size when the classification stage starts, is denoted as Binary-Pres-Net. The second approach does not require to preserve the image size, and will be referred as Binary-NoP-Net. The proposed methods are compared with two state-of-the-art methods that perform classification over compressed measurements: Random + CNN [16] and End-to-End [17]. It should be noted that the sensing matrix used in Rand + CNN can be directly implemented as a coded aperture in the SPC architecture, since it can use binary sensing matrices, while the opposite happens with the End-to-End method, since it provides a real sensing matrix; however, End-to-End is considered in this paper for comparison purposes. The four methods were evaluated over two different data sets, whose images were divided into training and testing subsets; details of the size for each database are presented in the following subsection. The training data was used to simultaneously train the binary sensing matrix $\boldsymbol {\Phi }$ and the parameters of the classification network $\boldsymbol {\theta }$. After that, the designed sensing matrix was used to acquire compressed SPC measurements from the testing data set. These measurements were contaminated with Gaussian Noise of $30$ dB signal-to-noise ratio (SNR). Then, the resulting SPC measurements were used as input data in the trained network to classify them and to obtain the test results. It is worth mentioning that the noise is only applied in the testing step. All the methods were trained with the Adam algorithm [23], using a learning rate of $0.001$, over $100$ epochs. For the proposed method, the hyper-parameter $\mu$, which promotes binary weights, was fixed as $0.01$. These values were determined using a cross-validation strategy such that each simulation uses the value that results in the best classification accuracy. All simulations were implemented in Matlab 2018a on an Intel Xeon E5-2697 2.6GHz CPU with 192GB RAM, coupled with a Nvidia Quadro K6000 12GB GPU.

4.1 MNIST data set

The MNIST data set (available at http://yann.lecun.com/exdb/mnist/) contains 60,000 hand written images of the numbers from $0$ to $9$, each with $28 \times 28$ pixels. All results for this database are the average of $5$ trial runs, where $50,000$ and $10,000$ images were randomly selected for the training and testing sets, respectively. For the MNIST data set, the Binary-Pres-Net is a modification or extension of the LeNet-5 model [26], i.e., once the second layer output is reshaped as an image, the LeNet-5 model is concatenated to the first two layers, as described in Section 3.1.1. Figure 3(a) summarizes the layer description of the Binary-Pres-Net. Similarly, End-to-End and Random+CNN employed the same network configuration after the re-projection layer. For this data set, the proposed Binary-NoP-Net uses the binary layer followed by two fully connected layers with a ReLU as a non-linear operator and a 10-class softmax classifer as it is summarized in Fig. 3(b). Notice the difference in the number of layers used for each net, this difference is remarkable in terms of computing time, as it will be reported in the results tables.

Fig. 3. Layers description of the proposed neural networks used for the MNIST data set. a) Binary-Pres-Net and b) Binary-NoP-Net. conv denote convolutional layer and they are shown in green, fc is a fully-connected layer shown in orange color, and st stands for the stride of the max-polling. The main proposed layers are presented in blue color.

Download Full Size | PDF

Table 1 presents a comparison of the classification accuracy results for the two proposed approaches (Binary-Pres-Net, Binary-NoP-Net), the End-to-End, and the Random+CNN methods, for different sensing ratios. Boldface indicates the best result for each case, and the second best result is underlined. It is worth noting that each sensing ratio entails a different sensing matrix, which requires weights training into the networks for each case. The evaluated sensing ratios include $\gamma = \{0.01, 0.05, 0.01, 0.25\}$ which are equivalent to $K=\{8,39,78,196\}$ SPC shots calculated as in Eq. (6), respectively.

Table 1. Average classification accuracy for different sensing ratios - MNIST data set.

View Table | View all tables in this article

From Table 1, It can be observed that the proposed methods obtain better results as the sensing ratio increases. Specifically, for more than $39$ shots, the Binary-Pres-Net performs similar to the End-To-End approach. Moreover, the performance decreases for very low sensing ratios, i.e., $1\%$ and $5\%$ of the data, but these results still outperform those from the Random+CNN, which is the most comparable method because of its implementation. Notice that the simple configuration chosen for the Binary-NoP-Net needs less number of parameters to train and shows that it is possible to extract features directly in the compressed domain, nevertheless, it obtains results comparable with those achieved by the Random + CNN. Additionally, Table 2 shows, the average training time (in seconds) for each epoch using a mini-batch size of 256. Notice that the Binary-NoP-Net needs less training time as expected, since it has fewer parameters compared with the other approaches.

Table 2. Average training time in seconds per epoch - MNIST data set.

View Table | View all tables in this article

4.2 CIFAR-10 data set

The CIFAR-10 data set (available at https://www.cs.toronto.edu/~kriz/cifar.html) [27] consists of $60,000$ RGB images with $32\times 32$ pixels, where each image belongs to one of $10$ different classes. The data set is divided into $50,000$ training images and $10,000$ test images. As in the previous experiment, all the results are the average of $5$ trial runs. For the Binary-Pres-Net, End-to-End, and Random+CNN methods, the classification block is a variant of AlexNet [28] as it is shown in Fig. 4(a). Due to the complexity of the network and the database, the weights of the layers $3$ to $11$ were initialized independently of the first 2 layers with the weights learned from a training process, with $50$ epochs and with the same hyper-parameters used with the MNIST data set. For the Binary-NoP-Net, we extract features from the compressed measurements using a small fully connected layer and then we reshape the outputs of this layer into a 3D structure, then, a convolutional layer is employed as shown in Fig. 4(b). The same sensing ratios of the previous experiment were used for the CIFAR-10 data set. The obtained average classification accuracy results are summarized in Table 3. It can be observed that for this deeper network, the relative performance of the proposed Binary-Pres-Net method for lower ratios is better than for the previous experiment with the MNIST data set. Specifically, it provides the best test results for sensing ratios as low as $0.05$. Even though End-to-End provides the best accuracy for $0.01$, it yields to non-binary sensing matrices. Finally, for this data set, the training time for each epoch is summarized in Table 4. Note that Binary-NoP-Net needs less training time compared to the other methods and also has comparable accuracy as observed in Table 3.

Fig. 4. Layers description of the proposed neural networks used for the CIFAR data set. a) Binary-Pres-Net and b) Binary-NoP-Net.

Download Full Size | PDF

Table 3. Average classification accuracy for four sensing ratios - CIFAR-10 data set.

View Table | View all tables in this article

Table 4. Average training time in seconds per epoch - CIFAR data set.

View Table | View all tables in this article

5. Experimental setup

To evaluate the effectiveness of the designed sensing matrices and the proposed CCNs to classify single-pixel measurements, a SPC testbed was implemented to acquire real measurements. The experimental setup is shown in Fig. 5. It is composed of a 100-mn objective lens, a high-speed digital micro-mirror device (DMD), Texas Instruments, DLi4130 .7" VIS XGA, with a pixel size of $13,6 \mu m$ placed at the image plane; a 100-mm relay lens; a Thorlabs F220SMA-A as a condenser lens that projects the scene at a single point, where the incoming light passes through the fiber optic; and an Ocean Optics Flame S-VIS-NIR-ES spectrometer used as a detector.

Fig. 5. Test-bed implementation of the single-pixel camera.

Download Full Size | PDF

For the experiments, four randomly selected images per digit from the MNIST testing data set were printed with a size of $25 \times 25 mm$ and used as a target in the test-bed implementation. Figure 6 shows the printed numbers, they were illuminated by a lamp for the visible spectrum as illustrated in Fig. 5. As the scenes are gray-scale images, and the proposed coded apertures were trained for a single band, the sum of the spectral bands captured by the spectrometer in the range of 470 and 620 mn was used as the SPC measurement for each shot to emulate a photodiode.

Fig. 6. Printed digits used to evaluate the proposed method.

Download Full Size | PDF

Similarly, as in section 4.1, the designed coded apertures were trained using the MNIST training data set, generating binary masks of $28\times 28$ pixels that were implemented into a DMD with pixel size of $13.5 \mu m$, and using a ratio of $1$ to $25$ as shown in Visualization 1, i.e., each binary pixel has a size of $340 \mu m$. To guarantee the quality of the measurements, a black scene was subtracted to each measurement shot. For the Binary-Pres-Net and the Binary-NoP-Net methods, different number of shots were evaluated, specifically 5, 10, 30, 50 and 100 shots, which are equivalent to a compression factor of 0.64, 1.27, 3.83, 6.38 and 12.76%, respectively. Figure 7 shows the distribution of the coded apertures for the set of 5 and 10 shots, obtained with the proposed design schemes. Additionally, the two proposed network configurations were evaluated using random coded apertures (which is the equivalent to the Random+CNN methodology), and they were denoted as Rand+CNN(Pres-Net) and Rand+CNN(NoP-Net).

Fig. 7. Designed Coded aperture employed in the DMD for 5 and 10 shots with the two proposed schemes.

Download Full Size | PDF

Table 5 presents the overall accuracy obtained for the experiments. It can be observed that for low sensing ratios, the proposed method that preserves the image size achieves better results, and the behavior is similar for large sensing ratios in Table 5; nevertheless, for the $0.0383$ and $0.0638$ sensing ratios, the proposed methods have similar results. For both cases, the use of the designed coded apertures shows better performance compared with random coded apertures, and it can be appreciated that as more shots are acquired, the quality of the classification increases.

Table 5. General accuracy of the proposed methods for the experimental results.

View Table | View all tables in this article

To analyze in more detail the results, the confusion matrix for the set of 50 shots (6.38%) with the two proposed schemes is presented in Fig. 8. These matrices show the behavior per class, where the rows and columns correspond to the predicted and correct class, respectively. The values and respective percentages in the diagonal correspond to the observations that were correctly classified, and the values in the off-diagonal correspond to incorrectly classified observations. Notice that both classifiers correctly matched in all cases for the numbers 0,2,4, and 6. Additionally, the last column of each table shows the precision achieved by the classification, and the false discovery rate. The precision for both classifiers is 100% for six of the ten digits. In the bottom row, the recall and false negative rates are shown, the recall metric is higher in most of the digit cases and for both classifiers schemes. Finally, the overall accuracy reported in the cell in the bottom right of the matrix is for both classifiers 82.5%.

Fig. 8. Confusion matrix for all target digits used. (Left) Binary-Pres-Net. (Right) Binary-NoP-Net.

Download Full Size | PDF

6. Conclusions

Two coupled deep learning approaches to simultaneously learn the binary SPC sensing operator and extract non-linear features directly from SPC measurements have been proposed. To demonstrate the capabilities of the approaches, they were successfully applied in a classification task. After the training stage, the trained sensing matrix is employed to acquire the SPC measurements, and the trained classification network is used as an inference operator applied directly to these measurements. In particular, the effectiveness of the proposed approaches has been demonstrated for two well-known data sets: the MNIST and CIFAR-10. For lower sensing ratios, the proposed Binary-Pres-Net provides comparable results to the End-to-End method, however the latter is not able to provide binary sensing matrices. On the contrary, when the sensing ratios increases, the Binary-Pres-Net provides the higher classification accuracy. In terms of computing time, the proposed Binary-NoP-Net method, which is a simpler configuration, outperforms all the other compared methods.

Appendix

Notice that the second term of (8), induces the sensing matrix to have $\{-1,1\}$ values; however, the DMD can only model $\{0,1\}$ values. An intuitive post-process to get negative values is to acquire two sets of complementary measurements and subtract them [3]. However, this means taking twice the number of shots. On the other hand, a more efficient process is to first acquire a set of measurements $g_0 = \mathbf {d}^T\mathbf {f}+ \omega _0$, where $\mathbf {d} \in \{1\}^{MN}$ represents a sensing matrix with all the elements in on, i.e, all the information of the scene passes in this measurement; then, the coded measurements obtained with $\{1,-1\}$ are calculated using these measurements as follows

(11)$$\begin{aligned} \mathbf{g} = 2\tilde{\mathbf{g}}-g_0 &= 2( \tilde{\boldsymbol{\Phi}}\mathbf{f} + \tilde{ \boldsymbol{\omega}}) - \underbrace{(\mathbf{1}_{K} \otimes\mathbf{d}^T)}_{\mathbf{D}}\mathbf{f} + \underbrace{\mathbf{1}_{K} \otimes \omega_o}_{\tilde{\boldsymbol{\omega}_0}}\\ & =(2\tilde{\boldsymbol{\Phi}} - \mathbf{D})\mathbf{f} + 2\tilde{ \boldsymbol{\omega}} + \tilde{\boldsymbol{\omega}_0} \\ &=\boldsymbol{\Phi} \mathbf{f} + \boldsymbol{\omega}, \end{aligned}$$

where $\tilde {\mathbf {g}}$ stands for the measurement matrix obtained with the binary sensing matrix $\tilde {\boldsymbol {\Phi }} \in \{0,1\}^{K \times MN}$. It is worth noting that using this strategy, only one extra shot is necessary [29].

Acknowledgments

Universidad Industrial de Santander under VIE-project 2467 and Colciencias-grant No. 811-2018 titled: "Agricultura de precisión a través de la fusión de imágenes multiespectrales e hiperespectrales adquiridas bajo un sistema de muesttreo compresivo, empleando sensores de bajo costo para ser utilizado en un sistema de detección y clasificación de plagas y enfermedades en cítricos y análisis de los requerimientos mínimos para su aplicación en los procesos agrícolas colombianos”.

Disclosures

The authors declare no conflicts of interest.

References

1. E. J. Candès, “Compressive sampling,” in Proceedings of the international congress of mathematicians, (2006), pp. 1433–1452.

2. M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. E. Kelly, and R. G. Baraniuk, “Single-pixel imaging via compressive sampling,” IEEE Signal Process. Mag. 25(2), 83–91 (2008). [CrossRef]

3. C. F. Higham, R. Murray-Smith, M. J. Padgett, and M. P. Edgar, “Deep learning for real-time single-pixel video,” Sci. Rep. 8(1), 2369 (2018). [CrossRef]

4. E. Candès and J. Romberg, “Sparsity and incoherence in compressive sampling,” Inverse Problems 23(3), 969–985 (2007). [CrossRef]

5. A. Mousavi and R. G. Baraniuk, “Learning to invert: Signal recovery via deep convolutional networks,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, 2017), pp. 2272–2276.

6. K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, “Reconnet: Non-iterative reconstruction of images from compressively sensed measurements,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 449–458.

7. K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Transactions on Image Process. 26(9), 4509–4522 (2017). [CrossRef]

8. J. Bacca, C. V. Correa, and H. Arguello, “Noniterative hyperspectral image reconstruction from compressive fused measurements,” IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 12(4), 1231–1239 (2019). [CrossRef]

9. L. Galvis, H. Arguello, and G. R. Arce, “Coded aperture design in mismatched compressive spectral imaging,” Appl. Opt. 54(33), 9875–9882 (2015). [CrossRef]

10. H. Garcia, C. V. Correa, and H. Arguello, “Multi-resolution compressive spectral imaging reconstruction from single pixel measurements,” IEEE Transactions on Image Process. 27(12), 6174–6184 (2018). [CrossRef]

11. L. Galvis, D. Lau, X. Ma, H. Arguello, and G. R. Arce, “Coded aperture design in compressive spectral imaging based on side information,” Appl. Opt. 56(22), 6332–6340 (2017). [CrossRef]

12. M. A. Davenport, P. T. Boufounos, M. B. Wakin, and R. G. Baraniuk, “Signal processing with compressive measurements,” IEEE J. Sel. Top. Signal Process. 4(2), 445–460 (2010). [CrossRef]

13. A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, “Deep learning for computer vision: A brief review,” Comput. intelligence neuroscience 2018, 1–13 (2018). [CrossRef]

14. R. Calderbank and S. Jafarpour, “Finding needles in compressed haystacks,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, 2012), pp. 3441–3444.

15. C. Hinojosa, J. Bacca, and H. Arguello, “Coded aperture design for compressive spectral subspace clustering,” IEEE J. Sel. Top. Signal Process. 12(6), 1589–1600 (2018). [CrossRef]

16. S. Lohit, K. Kulkarni, and P. Turaga, “Direct inference on compressive measurements using convolutional neural networks,” in 2016 IEEE International Conference on Image Processing (ICIP), (IEEE, 2016), pp. 1913–1917.

17. E. Zisselman, A. Adler, and M. Elad, “Compressed learning for image classification: A deep neural network approach,” Process. Anal. Learn. Images, Shapes, Forms 19, 3–17 (2018). [CrossRef]

18. G. R. Arce, D. J. Brady, L. Carin, H. Arguello, and D. S. Kittle, “Compressive coded aperture spectral imaging: An introduction,” IEEE Signal Process. Mag. 31(1), 105–115 (2014). [CrossRef]

19. H. Arguello, H. Rueda, Y. Wu, D. W. Prather, and G. R. Arce, “Higher-order computational model for coded aperture spectral imaging,” Appl. Opt. 52(10), D12–D21 (2013). [CrossRef]

20. S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and the bias/variance dilemma,” Neural Computation 4(1), 1–58 (1992). [CrossRef]

21. L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, (Springer, 2010), pp. 177–186.

22. I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, (2013), pp. 1139–1147.

23. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, (2015), pp. 1–15.

24. C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530 pp. 1–15 (2016).

25. V. N. Xuan and O. Loffeld, “A deep learning framework for compressed learning and signal reconstruction,” in 5th International Workshop on Compressed Sensing applied to Radar, Multimodal Sensing, and Imaging (CoSeRa), (2018), pp. 1–5.

26. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE 86(11), 2278–2324 (1998). [CrossRef]

27. A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Tech. rep., Citeseer (2009).

28. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, (2012), pp. 1097–1105.

29. J. Bacca, H. Vargas-García, D. Molina-Velasco, and H. Arguello, “Single pixel compressive spectral polarization imaging using a movable micro-polarizer array,” Revista Fac. de Ing. Universidad de Antioquia pp. 91–99 (2018).

Sensing	Shots	Data	Methods
Ratio ( $γ$ )	( $L$ )	set	Random+CNN	End-to-End	Binary-Pres-Net	Binary-NoP-Net
0.25	196	Training	99.99 %	99.99 %	99.99 %	99.99 %
		Testing	98.32 ± 0.06 %	98.48 ± 0.04 %	98.51 ± 0.12%	97.37 ± 0.13 %
0.1	78	Training	99.95%	99.99 %	99.99 %	99.24 %
		Testing	97.01 ± 0.14 %	98.29 ± 0.02 %	98.30 ± 0.32%	95.88 ± 0.09%
0.05	39	Training	99.99 %	98.78%	99.98 %	98.97 %
		Testing	94.82 ± 0.09 %	98.09 ± 0.04 %	97.99 ± 0.24 %	94.95 ± 0.14 %
0.01	8	Training	89.98%	97.19 %	96.09 %	90.63 %
		Testing	58.94 ± 0.20 %	95.18 ± 0.04 %	89.78 ± 0.14 %	87.75 ± 0.35 %

Sensing	Shots	Average training time per method $(S e c)$
Ratio ( $γ$ )	( $L$ )	Random+CNN	End-to-End	Binary-Pres-Net	Binary-NoP-Net
0.25	196	6.67	7.87	7.93	4.33
0.1	78	6.60	7.80	7.60	4.29
0.05	39	6.55	7.68	7.47	4.25
0.01	8	6.53	7.67	7.45	4.20

Sensing	Shots	Data	Methods
Ratio ( $γ$ )	( $L$ )	set	Random+CNN	End-to-End	Binary-Pres-Net	Binary-NoP-Net
0.25	768	Training	99.82%	100 %	90.12 %	98.24 %
		Testing	45.12 ± 0.24 %	57.12 ± 0.40 %	65.17 ± 1.99 %	63.45 ± 2.41 %
0.1	307	Training	92.55%	99.87 %	89.12 %	88.15 %
		Testing	40.84 ± 0.52 %	55.59 ± 0.55 %	59.75 ± 2.20 %	58.14 ± 1.11 %
0.05	154	Training	89.78%	91.89 %	88.79 %	85.47 %
		Testing	34.25 ± 0.54 %	54.34 ± 0.73 %	55.78 ± 1.20 %	50.94 ± 0.98 %
0.01	31	Training	88.54%	90.64 %	79.45 %	82.14 %
		Testing	30.47 ± 0.54 %	50.89 ± 0.52 %	47.98 ± 2.24 %	40.59 ± 3.05 %

Sensing	Shots	Average training time per method $(S e c)$
Ratio ( $γ$ )	( $L$ )	Random+CNN	End-to-End	Binary-Pres-Net	Binary-NoP-Net
0.25	196	158.5	190.2	185.5	82.6
0.1	78	157.5	165.2	168.7	80
0.05	39	156.8	158.3	162.4	78.9
0.01	8	156.3	157.2	158.4	78.2

	Sensing Ratio ( $γ$ )
Nets	0.0064	0.0127	0.0383	0.0638	0.1276
Binary-Pres-Net	70 %	72.50 %	75 %	82.50 %	95 %
Rand+CNN(Pres-Net)	40 %	57.75 %	62.5 %	80 %	85 %
Binary-NoP-Net	50 %	70 %	77.50 %	82.50 %	87.50 %
Rand+CNN(NoP-Net)	42.5%	55 %	67.75 %	77.5 %	82.50 %

Coupled deep learning coded aperture design for compressive image classification

Abstract

1. Introduction

2. Single-pixel camera model

3. Coded aperture design and classification: a compressed learning framework

3.1 Training stage

3.1.1 Structure of the classification operator

3.2 Testing stage

4. Simulations and results

4.1 MNIST data set

4.2 CIFAR-10 data set

5. Experimental setup

6. Conclusions

Appendix

Acknowledgments

Disclosures

References

Supplementary Material (1)

Cited By

Figures (8)

Tables (5)

Equations (11)

Optics Express