## Abstract

In this paper, we develop a deep neural network based joint classification-regression approach to identify microglia, a resident central nervous system macrophage, in the brain using fluorescence lifetime imaging microscopy (FLIM) data. Microglia are responsible for several key aspects of brain development and neurodegenerative diseases. Accurate detection of microglia is key to understanding their role and function in the CNS, and has been studied extensively in recent years. In this paper, we propose a joint classification-regression scheme that can incorporate fluorescence lifetime data from two different autofluorescent metabolic co-enzymes, FAD and NADH, in the same model. This approach not only represents the lifetime data more accurately but also provides the classification engine a more diverse data source. Furthermore, the two components of model can be trained jointly which combines the strengths of the regression and classification methods. We demonstrate the efficacy of our method using datasets generated using mouse brain tissue which show that our joint learning model outperforms results on the coenzymes taken independently, providing an efficient way to classify microglia from other cells.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

Fluorescence lifetime image microscopy (FLIM) has shown great promise in visualization of physiological parameters in fixed and living cells in biology and medicine [1–3]. FLIM measures the average time a molecule (excited by a photon) remains in its excited state before returning to its ground state and emitting a photon simultaneously. This *lifetime* estimation of the molecule is measured based on an exponential decay function and is used to localize specific fluorophores. One advantage of fluorescence lifetime imaging is that, it can probe local cellular microenvironments around the fluorophores [4]. FLIM can also be used to identify local metabolic signatures leading to improved understanding of specific cell types [5]. These metabolic signatures can be intrinsically detected because of FLIM’s ability to probe autofluorescent metabolic coenzyme such as NADH (Nicotinamide Adenine Dinucleotide) and FAD (Flavin Adenine Dinucleotide). By quantitatively identifying lifetime and binding status, FLIM analysis of NADH and FAD can reveal metabolic alteration and indicate whether a cell is in a more glycolytic or oxidative state [6–9]. FLIM can be used as an imaging technique with a wide range of applications such as confocal microscopy, two-photon excitation microscopy, as well as multiphoton tomography [10–12].

In this study, we focused on microglia, the resident macrophages of the Central Nervous System(CNS) which play a role in virtually all neuropathologies [13–16]. They influence brain development, maintain neural environment as well as respond to injury and infection. Due to their involvement in CNS processes and neurodegenerative diseases, understanding the morphology and function of microglia is extremely critical in a number of scenarios. In this respect, accurate imaging and computational tools are needed to identify metabolic signatures specific to microglia and their activation status. FLIM is used as an imaging method for this purpose, given its ability to probe cellular microenvironment in a label-free manner which does not introduce alteration in the cellular signature. Furthermore, utilizing recent advances in Machine Learning, it has been shown that FLIM based monitoring of microglial metabolic alterations can be used to both differentiate microglia from other CNS cell-types using artificial neural network based approach [17] as well as identify their activation status [18]. These studies show promising results in the utility of computational tools for analyzing FLIM to understand the functional role of microglia in the CNS.

In this paper, we develop a neural network based approach to analyze FLIM data obtained from two metabolic coenzymes NADH and FAD jointly for the purposes of studying microglia. Most previous studies of microglia used FLIM obtained from NADH alone [18] or have been focused on network architecture for estimating and visualizing the lifetimes [17,19]. To the best of our knowledge, FLIM data obtained from two metabolic coenzymes NADH and FAD have not been used together in the same machine learning framework. . But, such an idea could potentially have value, since NADH and FAD sometimes offer unique information, in terms of relative change with metabolic alterations. For example, enzyme-bound NADH has higher lifetime than free NADH while this trend is reversed for FAD [20]. Based on the state of metabolism and relative shift, FAD lifetime and their relative bound/free status can follow different trends than NADH. Furthermore, the fluorescence lifetime of protein-bound NADH/FAD is sensitive to changes in the binding status. Therefore studying the effect of these two coenzymes together, can lead to a heterogeneous data source that can provide complementary information on how to characterize the microglial footprint and distinguish it from other CNS cell-types.

**Contributions** We propose an efficient approach to solve the above problem in a computational setting and address the key challenges that arise in this context. We formulate a customized deep-learning framework specifically designed to work with FLIM data, which inputs the bimodal fluorescent lifetime decays corresponding to each coenzyme and learns both regression and classification to separate microglia from other cell types (Fig. 1). The regression component represents the time-resolved fluorescence decay data using the non-parametric regression approach, by expanding it using a set of discrete Laguerre basis which is solved as a layer of the neural network. This is followed by a set of layers that performs the classification on the intensity image based on the similarity of the corresponding fitted exponential decay curves of the pixels. This part of the deep network is used to learn parameters of Conditional Random Fields(CRF) which are widely used for semantic segmentation. The primary advantage of CRF is three-fold: a) the objective/ maximum likely loss function can be explicitly constructed in CRF, therefore interpretable, b) as noted by [21] deep semantic segmentation neural networks are usually not perfect, especially at object borders since these have convolutional filters with large receptive fields and hence produce coarse outputs [22]. In this scenario, CRF is often used in a post processing framework to improve the final classification c) CNN based segmentation models such as U-net [23] lack smoothness constraints that encourage label agreement between similar pixels, and spatial and appearance consistency of the labelling output, a property which is explicitly captured by CRFs. By utilizing the regression and CRF components together, our deep learning architecture provides a joint regression-classification framework for learning to distinguish microglia using bimodal FLIM decay data. To the best of our knowledge, such a joint regression-classification deep learning framework for FLIM has not been demonstrated, though there have been some efforts to estimate lifetime decays using neural networks [24–26] (see next section for details).

## 2. Previous work

Our review of previous work covers three main subareas, which are directly related to the methodology in this paper. These are as follows:

**Computational approaches in FLIM:** The starting point of analysis of time-resolved fluorescent lifetime data involves the determination of the fluorescence intensity decay ($Y$) by identifying of a set of fitting parameters that best describe the characteristics of the fluorescence decay, followed by classification of the fluorescent system based on those parameters. Mathematically, the measured fluorescence intensity decay data ($Y$) is given by the convolution of the instrument response($F$) from excitation light pulse with the fluorescence response of the tissue($I$). Thus, to estimate the fluorescence of a compound, the excitation light pulse must be deconvolved from the measured fluorescence intensity pulse [27]. Several deconvolution approaches are used in this regard, the most popular being nonlinear least-square iterative reconstructions which assume the functional form of $Y$. However, there are methods which generate $Y$ directly without any assumption, such as the Fourier [28] and Laplace transform methods [29] and the exponential series method [30], among others. Under the category of Least squares approaches, the most popular method is the nonlinear least-squares iterative reconvolution (LSIR) approach [30,31]. Laguerre based deconvolution was also used in this regard [27,32], in which the fluorescence signal $Y$ is expressed as an expansion on the discrete time Laguerre basis instead of a weighted sum of exponential functions. Such papers and others [33,34] have shown that Laguerre expansion coefficients are highly correlated with the lifetime value, suggesting such an approach may be very useful in direct characterization of biochemical compounds in terms of their fluorescence emission temporal properties. In this regard, Smith et. al. [24] recently proposed a fit-free approach in fluorescent lifetime image formation using deep learning (DL) and Walsh et. al. [35] who performed auto-fluorescence imaging of NAD(P)H and FAD followed by classification (individually) of T cell activation.

**Joint Regression-Classification methods in Machine Learning** Joint regression classification approaches have been used frequently in biomedical literature, though not in the context of FLIM. Zhu et. al. [36] used this type of joint training for clinical score regression and clinical label classification in context of Alzheimer’s disease (AD). Wang et al. [37] adopted a joint regression-classification approach for identifying imaging biomarkers that are associated with both cognitive measures and AD. A deep multi-task multi-channel learning framework for simultaneous classification and regression for brain disease diagnosis, using MRI data and personal information was proposed by Liu et. al. [38]. In addition to biomedical literature, several papers in the fields of vision and machine learning use such joint learning approaches for a variety of problems such as time series classification [39] and age prediction [40].

**Learning CRFs** Conditional random fields (CRFs) is a machine learning approach adapted from statistical modelling, where the inference problem is formulated on a graphical model that implements dependencies between the different nodes of the graph. As a result, it gives a classifier a context of "neighboring samples" while classifying a given sample. CRFs are widely used in image segmentation. However, here we only focus on those methods which learn the parameters of CRF. Barbu et. al. [41] proposed a joint training of CRF together with an inference algorithm in their Active Random Field approach. Domke et. al. [42] formulated a back-propagation based parameter optimization scheme in graphical models using approximate inference. In the same vein, it has been shown [43,44] how back-propagation approaches through belief propagation can be used to optimize model parameters in CRF. Krahenbuhl et. al. [45] showed how to learn parameters of a dense CRF using a modified mean-field algorithm for inference. Zheng et. al. [22] formulated a dense CRF as a Recurrent Neural Network(RNN) which lead to an end-to-end trainable system for semantic image segmentation.

## 3. Methods

Our deep learning architecture for the joint regression-classification problem on bi-modal lifetime decay data consists of two main stages. In the first step, the lifetime decay curve (one for each pixel) is expressed as a linear combination of predefined Laguerre bases subject to additional smoothness constraints. This is solved by a quadratic programming layer, recently proposed by [46]. The coefficients obtained at this step, provide an indication of similar activation patterns among pixels. This information, in conjunction with spatial similarity can be used to distinguish similar tissue types. Therefore, in the second step, we combine the coefficients from each of the modalities (NADH and FAD) and pass this on as features to the CRF subnetwork for classification. A schematic diagram of our network can be seen in Fig. 2. We describe the details for each step next.

#### 3.1 Representation and modeling

We first describe the mathematical representation of the fluorescent decay curve. As mentioned earlier, the measured fluorescence intensity decay data($Y$) is given by the convolution of the tissue fluorescence response signal($I$) with the excitation light pulse which is a part of the instrument response($F$) plus some additive noise($\epsilon$). Each of these variables are functions of time. This relationship can be written as

where $\otimes$ represents convolution of the two signals. The function $I(t)$ can be best approximated as a multi-exponential decay function. Here we use bi-exponential decay function to represent the signal, written as where $a_1$, $a_2$ and $\tau _1$, $\tau _2$ represent the amplitude and the lifetimes for each exponential component. Such exponential functions can be expanded a number of ways, but it has been shown that using Laguerre polynomials or other orthogonal polynomials can lead to better approximations than a Taylor series expansions. Using this expansion, signal $I(t)$ can be expressed as where $b_j^\alpha (t)$ is the discrete set of Laguerre bases (of scale $\alpha$), $d$ is the number of bases and $c_j$ is the unknown coefficients of combination. Plugging this in Eq. (1), it can be rewritten in matrix form as where $V$ is the $t \times d$ matrix derived from the convolution of $F$ and $I$ and basis matrix $B=\{b_1^\alpha \cdots b_L^\alpha \}$ is also $t \times d$ (see [32] for details). Here we assume $t$ is the number of time points, $n$ is the number of pixels and $d$ is the number of Laguerre bases. The matrix $Y$ is of size $t \times n$ whereas the matrix $C$ is of size $d \times n$. In the above model (4), if the number of bases are large, it can tend to avoid overfit which does not generalize well to unseen data. To avoid this, a constraint is added on the third derivative of $BC$, to ensure a smooth reconstruction [47]. This can be written in a discrete form $DBC \leq 0$, where $D$ is the third order forward finite difference matrix. Taken together, the optimization problem can be written as The objective can be further expanded as $\mathrm {Tr}(C^T V^TVC -2Y^TVC )$. Typically the above problem is solved as a dual, formulating it as a non-linear least squares, where the coefficient matrix $C$ (whose column vectors are represented as $C_i$) are recovered after the model is solved.In our problem, we extend the model (5) to include an additional regularization term, to ensure pixels with similar decay profile is classified similarly. The details of this extension is discussed in Section 3.1.2. We give a brief overview here. We use an energy term, commonly known as Conditional Random Fields (CRF) of the form $E(C)= \sum _i\psi (i) + \sum _{i\neq j}\psi (i,j)$, where the unary energy components $\psi (i)$ measure the inverse likelihood (and therefore, the cost) of the pixel $i$ taking a given label, and pairwise energy components $\psi (i,j)$ measure the cost of assigning different labels to pixels $i$ and $j$. We assume the probability of two pixels belonging to the same label is given apriori (such as in presence of training data). This is denoted by $\mu (i,j)$ which is set to $1$ if a nearby pair of pixels are assigned different labels, $0$ otherwise. In addition, Gaussian energy function is used for representing the pairwise term, written as $K(i,j)=\exp (-\frac {(C_i-C_j)^2}{2\sigma ^2})$. This CRF energy term, when solved in conjunction with the objective in (5), can be written as follows:

### 3.1.1 Regression as a neural network layer

The above model (6), is computationally challenging to solve: particularly for the CRF energy term, since it leads to a dense CRF formulation (all pairs of pixels considered) [22,45]. Therefore, we split out model into two parts - in the first step, we solve the model (5) as a layer of the neural network. This layer is repeated twice (can be solved in parallel) - once to learn $C_{NADH}$, pertaining to the NADH lifetimes and once to learn $C_{FAD}$ corresponding to the FAD lifetime data. One can think of the $C_{NADH}$ and $C_{FAD}$ as two sets of features describing same underlying phenomenon. In the second step (the classification part), these are concatenated as $\hat {C}$ (we also add the pixel RGB values and position as additional features) and is used as input to the next set of layers. In this layer, the network learns the parameters for a Dense CRF, using an efficient approach proposed by [48], which uses Permutohedral Filtering for significant computational speedup. The whole neural network is trained jointly using gradient descent.

Here we describe the approach used to solve the regression objective. We use the OptNet(QPTH) method proposed by Amos et. al. [46] who show how neural network layers can be designed to solve quadratic objective and encode constraints. Briefly, given a quadratic program of the form

this method first constructs the Lagrangian dual. This can be written as where $\nu$ and $\lambda$ are dual variables introduced. This is used to derive the Karush–Kuhn–Tucker (KKT) conditions [49] by derivating Eq (9), wrt to the variables $z$, $\nu$ and $\alpha$ and setting the resultant derivatives to $0$. This leads to the following system.The advantage of this approach is that it can be used in batch mode and outperforms standard Quadratic programming solvers like CPLEX [51] and Gurobi [52]. Since this layer optimizes for $z$ in the form of a vector, we perform some basic transformations to our inputs to be compatible with the inputs necessary for our model. Particularly, we perform Kronecker product [53] of $V^TV$ with a identity matrix of size $m$, $I_m$, that is $Q= I_m \otimes V^TV$ and vectorize $Y^TV$ to generate $q$. The constraints are handled similarly.

### 3.1.2 Classification of microglia as learning CRF

In this section, we describe the details of our CRF model. We define a dense CRF (all pairwise potentials considered) on a set of $n$ random variables corresponding to each of the pixels $X = \{x_1,\ldots , x_n\}$, each of which is represented by feature vectors $\hat {C}_1, \ldots \hat {C}_n$ respectively. Each random variable $x_i$ of takes one label from a set of binary labels $L = \{0, 1\}$ corresponding to background and microglia respectively. As described earlier, the CRF energy function can be described as a sum of the unary and pairwise terms, written as $E= \sum _i\psi (i) + \sum _{i\neq j}\psi (i,j)$, where the unary term describes the potential for the random variable $x_i$ taking the label $0$ or $1$ and the pairwise terms describes the potentials for random variables $x_i$ and $x_j$ taking the different labels. Specifically the pairwise term can be further expanded by using Gaussian function to express the similarity of $x_i$ and $x_j$ using the corresponding feature vectors $\hat {C}_i$ and $\hat {C}_j$. This can be written as

We can rewrite the CRF problem as an optimization problem, where the goal is to find a variable $X_i \in {0, 1}$ for pixel $i$. Let $\Psi$ be the matrix in $n \times 1$, that represents the unary costs. Then, the unary potential can be simply represented as $\Psi$. The pairwise terms can be written as $X(\mu \odot K)(1-X)^T + (1-X)(\mu \odot K)X^T$. This objective counts the pairwise terms when $X_i=0$ and $X_j=1$ (and vice versa). Taken together, the CRF problem can be written as the following optimization problem:

Here the $\odot$ operation stands for element-wise multiplication. Since the number of classes in our application in two, we do not require any constraints on $X$. This model when taken together with the Model in (7), gives us the non-convex problem shown in (6).There are many ways to solve the above model, But since the QPTH framework is implemented as neural network layer, we opt for a similar approach for the CRF since we can then connect the two components and optimize the neural network as a whole. Here, we briefly outline the Recurrent neural network(RNN) framework to solve the CRF problem. This approach was initially proposed by Krahenbuhl et. al. [45] and later implemented as an RNN by Zheng et. al. [22]. First, the initialization is done by applying a soft-max function over the unary potentials at each pixel. This is followed by a message passing layer, applying $m$ Gaussian filters on previously initialized values. Since the CRF is potentially fully connected, each filter’s receptive field spans the whole image, making it computationally challenging to calculate the filters. However, in the RNN extension of CRF, this is done using a Permutohedral lattice implementation [48], which can compute the filter response in $O(n)$ time, where $n$ is the number of pixels of the image. Next, the weighted sum of the $m$ filter outputs from the previous step is computed as a convolutional layer. In the next layer, the label compatibility function $\mu$ is incorporated as another convolution layer where the spatial receptive field of the filter is $1 \times 1$, and the number of input and output channels are both set to $2$. After this, the output of the compatibility layer is subtracted from the unary potentials. Finally we apply another softmax for normalization. At this point, the output is connected back to the input and the model is trained over a few iterations. Finally, a rectified linear unit (RELU), where $RELU(x)=\max (0,x)$ (for input $x$), is used to threshold the output into two classes and obtain the output $X$.

#### 3.2 Optimization and training

We propose an end-to-end deep learning method to combine the QPTH and CRF methods for a joint regression-classification model for identifying microglia from FLIM data. The CRF is incorporated as a last set of layers which takes as its input the $\hat {C}$ from the QPTH layer. We then train the whole integrated model together by sharing the representative weights. The shared weights are further re-trained to fine-tune the performance of the merged model. The model is trained via the Adam optimizer which calculates gradients to update shared weight parameters. The loss function that is minimized for the merged model, is pixel-wise binary cross entropy loss between the output from CRF to a ground truth segmentation for each image $Z$. This can be written as

## 4. Experiments

We performed our experiments using data collected from mouse brain tissue samples ($n=103$). For each sample, we used the TCSPC (Time correlated single photon counting) software to output various parameters, including the lifetime parameters and intensity images. For each sample, we obtain both the NADH and FAD lifetime data. In addition, we use a separate antibody (Iba1) which only binds to microglia in the CNS to generate the ground truth images. Note that this is independent from the lifetime data. We briefly elaborate on the acquisition details of the dataset in next section.

#### 4.1 Tissue preparation and imaging

### 4.1.1 Animals

All animals were maintained in an ALAAC-accredited animal facility with a 12-hr light/dark cycle regime and had access to food and water ad libitum. All experiments were performed in accordance with the University of Wisconsin-Madison Institutional Animal Care and Use Committee. For FLIM imaging, 100 $\mu$m thick coronal slices were prepared from the fixed brains of young adult C57BL/6J and CX3CR1-GFP mice (Jackson Labs), aged 6-8 weeks. Animals were euthanized by isoflurane overdose and transcardially perfused with 30 ml of ice-cold PBS, followed by a second perfusion with an ice-cold solution of 4% PFA in PBS. Brains were then dissected, post fixed for 24hrs in a solution of 4% PFA in PBS, and then moved to HBSS (all performed at 4C and protected from light).

### 4.1.2 Immunohistochemistry

100 $\mu$m thick coronal sections were prepared from the midbrain region of each brain using a Leica Vibratome. Two slices from each animal were used for immunohistochemical staining. Briefly, slices were washed at room temperature with 0.3% TritonX-100 in PBS, before incubating in blocking buffer (1 % BSA, 0.3 % TritonX-100/PBS) for 2 hours at room temperature. Slices were then incubated with anti-Iba1 antibodies (1:1000; Wako Catalog No. 019-19741) in blocking buffer in the dark at 4C overnight. Slices were washed again at room temperature with 0.3% TritonX-100 in PBS followed by incubation in the dark for 2 hours with AlexaFlour594 anti-rabbit IgG antibodies (1:200) in blocking buffer, at room temperature. Slices were washed with 0.3% TritonX-100 in PBS and mounted on 1mm slides using Cytoseal60 mounting medium. Mounted sections were stored at room temperature, protected from light until they could be imaged.

### 4.1.3 Multiphoton lifetime imaging

The lifetime and intensity imaging [54] was performed on a custom multiphoton laser scanning system built around an inverted Nikon Eclipse TE2000U at the Laboratory for Optical and Computational Instrumentation [55]. A 20x objective (Nikon Plan Apo VC, 0.75 NA) (Melville, NY, USA) was used for all imaging. For NAD(P)H imaging, data was collected using an excitation wavelength of 740 nm, and the emission was filtered at 457/50 nm (Semrock, Rochester, NY) for the spectral peak for NADH. For identification of microglia, AlexaFluor594 was used as secondary with iba1 as primary. For the fluorescence intensity imaging, excitation was set at 810 nm, and a 615/20 (Semrock, Rochester, NY) bandpass emission filter was used for emission. We used time domain FLIM imaging where the FLIM decays curves were built with TCSPC (Time Correlated Single Photon Counting) electronics. FLIM images of 256 x 256 pixels were collected with 120 second collection using SPC-150 Photon Counting Electronics (Becker and Hickl GmbH, Berlin, Germany) and Hamamatsu H7422P-40 GaAsP photomultiplier tube (Hamamatsu Photonics, Bridgewater, NJ). Urea crystals were used to determine the Instrumentation Response Function (IRF) with a 370/10 bandpass emission filter (Semrock, Rochester. NY). For each sample, around 20 neighboring FOVs were randomly selected, and the average value of lifetime and free NADH ratio was calculated based on masking. The instrument response function of the optical system was measured during each imaging session. A separate antibody that binds to microglia only was used to generate the images used as ground truth. This is the iba1 antibody which is widely used for microglia imaging. Using this, we are able to see the microglia accurately, which is why it is ideal for generation of ground truth images,

#### 4.2 Evaluations

We lay out our evaluations in the following order: 1) First we use a variety of classification metrics to quantitatively and qualitatively evaluate how well our model performs for the task of distinguishing microglia from other cell types. 2) Secondly, we study the contribution of each component (regression and classification) of bimodal scheme for the given task. To do this, we evaluate each component separately. First we quantify the quality of the fit using the regression approach alone. Next, we provide the output for regression to a generic clustering approach to see how it performs compared to our scheme. 3) Finally, we test the utility of the bimodal approach by comparing the results to only one modality at a time.

### 4.2.1 Evaluation of efficacy of bimodal model via segmentation metrics

Here, we train out model with $40\%$ of the data and the remaining data is used for testing. We use $10$ different classification/segmentation metrics to evaluate the quality of the classification with respect to ground truth. These include several indices computed based on the overlap of ground truth and obtained segmentation such as Classification Accuracy (Acc), Precision, Recall, Fvalue, Dice, Jaccard coefficients and Area Under the ROC Curve(AUC) as well as information theoretic ones such as Mutual Information (MI), Normalized Variation of Information (NVI) as well as Rand Index(RI) and Adjusted Rand Index (ARI). The details of these metrics and their properties are explained next and can also be found at [56].

**Metrics**: For a given datatset, let $Y$ denotes the output of a binary classifier in $\mathbf {R}^n$ and $\hat {Y}$ also in $\mathbf {R}^n$, is the ground truth. We use the notation $Y_i$ to indicate the $i$th element of the vector $Y$ (same for $\hat {Y}$). This can be used to calculate True Positives (TP=$Y^T\times \hat {Y}$, which counts the number of data points for which both $Y$ and $\hat {Y}$ is a $1$ or in the positive class), False Positives (FP=$Y^T\times (1-\hat {Y})$, which counts the number of points where $Y$ is a $1$, but ground truth is $0$ or in the negative class), True Negatives (TN=$(1-Y)^T\times (1-\hat {Y})$, counts the number of points where both are $0$) and False Negatives (FN=$(1-Y)^T\times \hat {Y}$ which counts the number of points where $Y$ is a 0, but ground truth is $1$). Here the operator $\times$ is the vector dot-product. Based on these, we can define the following metrics:

Accuracy computes the percentage of points correctly classified and is defined as

Precision $P$ computes the proportion /percentage of identified positive class points which is actually correct and is defined as Recall $R$ calculates what proportion of positive class points is identified correctly and is given as Combining Precision and Recall, Fvalue (also called F-measure) gives an intuition of how precise a classifier is (how many points it classifies correctly), as well as how robust it is and is defined as Jaccard Coefficient, also known as Intersection over Union, is a statistic used for gauging the similarity and diversity of two clusterings/sets and is defined asTo define the information theoretic measures, we define two probability distributions $P_Y$ and $P_{\hat {Y}}$ for discrete variable $Y$ and $\hat {Y}$ respectively. Let $P_{Y\hat {Y}}$ denote their joint distribution. Then Mutual Information can be defined as

- $a$ - number of pair of data points are placed in the same group in $Y$ and in the same group in $\hat {Y}$ ;
- $b$ - number of pairs of data items placed in the same group in $Y$ and in different groups in $\hat {Y}$ ;
- $c$ -number of pairs of data items placed in same group in $\hat {Y}$ and in different groups in $Y$ and;
- $d$ - number of pairs of data items placed in different groups in $Y$ and in different groups in $\hat {Y}$

As shown in Table 1, our method performs very well overall, with only a small decrease in values in the test set compared to training. Qualitative results are shown in Fig. 3, which shows the obtained segmentation is close to the ground truth in most cases.

### 4.2.2 Evaluation of the regression approach

To study the effect of regression, we first investigate how well the Laguerre basis approximation can estimate the input fluorescent decay curve. To do this, we use the bi-exponential decay function as an approximation of $I(t)$ (which is then convolved with the instrument response) to construct the corresponding signal $Y(t)$ in Eq. (1). Observe that Laguerre basis expansion is exact, only when $I(t)$ is a biexponential function. Figure 4(a) shows the input signal (in red) along with the signal reconstructed by the Optnet(QPTH) layer, when the number of bases are set to $10$. We see that overall the reconstructed signal nicely approximates $Y(t)$. An even better fit could be attained by increasing the number of bases used in the Laguerre expansion. But this comes at cost of increased computation time. To test this, we calculated average root mean square errors of reconstruction over $20$ samples, in each of the setting $d=\{5, 10, 20\}$ where $d$ is the number of bases used. We see in Fig. 4(b), that by increasing the bases from $5$ to $10$, we see a sizable improvement in reconstruction error, but from $10$ to $20$, the decrease in error is not as significant. This makes intuitive sense as in most reconstructions, the higher order terms play a more crucial role in reducing errors, which tapers off as more terms are included. Therefore, in all our experiments, we set $d$ to $10$.

### 4.2.3 Evaluation of CRF classification

To study the utility of the CRF segmentation module, we replace the CRF layer, with a implementation of Kmeans as a Neural Network. Kmeans is one of the most popular method for clustering related tasks. We use the KmeansNet implementation proposed by Marshland [58]. All other parameters such as training and test sets and number of epochs remain the same. Results can be seen in Table 2 and Fig. 5, demonstrating that our Bimodal model with CRF as the last set of layers, works much better than the model using KmeansNet . Generally, when KMeansNet is included, the data is over segmented which adversely affects the quality metrics. Notably, however, the KMeansNet is faster than the model using CRF, and may be considered as an alternative if quicker computation times are desirable.

### 4.2.4 Evaluation of bimodal data for purposes of classification

To evaluate if the bimodal model has any advantages over using a single modality such as either FAD or NADH, we run our model in the unimodal setting, where there is only one layer(instead of two layers) for the optnet, which this is then forwarded to the CRF layer. We repeat this experiment for both NADH and FAD separately and report on the same quality metrics as before on test data. Note that the training and test data is same for each of these experiments. Table 3 shows the quantitative results of this experiment.

As we can see, the bimodal model shows improvement over each of the coenzymes used separately, especially NADH, where the improvement is substantial over all the metrics. Qualitative results of segmentation shown in Fig. 6 shows that the bimodal approach leads to more accurate identification of microglia compared either NADH or FAD used alone.

### 4.2.5 Model parameters and time complexity

We set the number of iterations in each component to be $10$, but usually the output converges in less, particularly for the RNN part. The number of epochs is set to $5$. We found that it took $30-40$ seconds to analyze each sample in each layer (this value depends on the type of GPU used, in our case, we used a NVIDIA GTX 1080 GPU). With these settings, it took on an average about $3$ hours for the training the model. We found that increasing the size of the training improved evaluation metrics slightly, but as expected, the demands on computational cost and memory increased as well. As future work, we will study how to distribute the processing across a cluster of GPUs (the Optnet implementation as provided by the authors does not allow this at present).

## 5. Conclusion

We propose a bimodal regression-classification neural network for classifying Microglia using Fluorescent lifetime images. Our model inputs the lifetime data from two different coenzymes - NADH and FAD, and then learns the regression coefficients for accurate representation which are then used as features for the classification module using CRF. The bimodal data provides a diverse data-source for the classification model, which is evident in this approach outperforming the models with uni-modal data. In our future work, we plan to extend our model to include other forms of information about microglia such as cell morphology and other shape features.

## Funding

National Institutes of Health (R01CA199996, R01NS085226).

## Acknowledgments

This research is supported by UW Laboratory for Optical and Computational Instrumentation, the Morgridge Institute for Research, UW Carbone Cancer Center, NIH R01NS085226, NIH R01CA199996. The authors would also like to thank Jenu Chacko for his advice and input.

## Disclosures

The authors declare no conflicts of interest.

## References

**1. **P. I. Bastiaens and A. Squire, “Fluorescence lifetime imaging microscopy: spatial resolution of biochemical processes in the cell,” Trends Cell Biol. **9**(2), 48–52 (1999). [CrossRef]

**2. **P. J. Verveer, A. Squire, and P. I. Bastiaens, “Global analysis of fluorescence lifetime imaging microscopy data,” Biophys. J. **78**(4), 2127–2137 (2000). [CrossRef]

**3. **E. B. van Munster and T. W. Gadella, “Fluorescence lifetime imaging microscopy (flim),” in Microscopy Techniques (Springer, 2005), pp. 143–175.

**4. **K. Suhling, L. M. Hirvonen, J. A. Levitt, P.-H. Chung, C. Tregidgo, A. Le Marois, D. A. Rusakov, K. Zheng, S. Ameer-Beg, S. Poland, S. Coelho, R. Henderson, and N. Krstajic, “Fluorescence lifetime imaging (flim): Basic concepts and some recent developments,” Med. Photonics **27**, 3–40 (2015). [CrossRef]

**5. **J. V. Chacko and K. W. Eliceiri, “Autofluorescence lifetime imaging of cellular metabolism: sensitivity toward cell density, ph, intracellular, and intercellular heterogeneity,” Cytom. Part A **95**(1), 56–69 (2019). [CrossRef]

**6. **M. C. Skala, K. M. Riching, A. Gendron-Fitzpatrick, J. Eickhoff, K. W. Eliceiri, J. G. White, and N. Ramanujam, “In vivo multiphoton microscopy of nadh and fad redox states, fluorescence lifetimes, and cellular morphology in precancerous epithelia,” Proc. Natl. Acad. Sci. **104**(49), 19494–19499 (2007). [CrossRef]

**7. **R. Cao, H. Wallrabe, K. Siller, and A. Periasamy, “Optimization of flim imaging, fitting and analysis for auto-fluorescent nad (p) h and fad in cells and tissues,” Methods Appl. Fluoresc. **8**(2), 024001 (2020). [CrossRef]

**8. **M. A. Yaseen, S. Sakadžić, W. Wu, W. Becker, K. A. Kasischke, and D. A. Boas, “In vivo imaging of cerebral energy metabolism with two-photon fluorescence lifetime microscopy of nadh,” Biomed. Opt. Express **4**(2), 307–321 (2013). [CrossRef]

**9. **D. K. Bird, L. Yan, K. M. Vrotsos, K. W. Eliceiri, E. M. Vaughan, P. J. Keely, J. G. White, and N. Ramanujam, “Metabolic mapping of mcf10a human breast cells via multiphoton fluorescence lifetime imaging of the coenzyme nadh,” Cancer Res. **65**(19), 8766–8773 (2005). [CrossRef]

**10. **S. Trautmann, V. Buschmann, S. Orthaus, F. Koberling, U. Ortmann, and R. Erdmann, “Fluorescence lifetime imaging (flim) in confocal microscopy applications: an overview,” PicoQuant GmbH **29**, 12489 (2013).

**11. **Y. Chen and A. Periasamy, “Characterization of two-photon excitation fluorescence lifetime imaging microscopy for protein localization,” Microsc. Res. Tech. **63**(1), 72–80 (2004). [CrossRef]

**12. **M. J. Koehler, A. Preller, P. Elsner, K. König, U. C. Hipler, and M. Kaatz, “Non-invasive evaluation of dermal elastosis by in vivo multiphoton tomography with autofluorescence lifetime measurements,” Exp. Dermatol. **21**(1), 48–51 (2012). [CrossRef]

**13. **J. J. Watters, J. M. Schartner, and B. Badie, “Microglia function in brain tumors,” J. Neurosci. Res. **81**(3), 447–455 (2005). [CrossRef]

**14. **G. A. Garden and T. Möller, “Microglia biology in health and disease,” J. Neuroimmune Pharmacol. **1**(2), 127–137 (2006). [CrossRef]

**15. **B. R. Tambuyzer, P. Ponsaerts, and E. J. Nouwen, “Microglia: gatekeepers of central nervous system immunology,” J. Leukocyte Biol. **85**(3), 352–370 (2009). [CrossRef]

**16. **N. A. Charles, E. C. Holland, R. Gilbertson, R. Glass, and H. Kettenmann, “The brain tumor microenvironment,” Glia **59**(8), 1169–1180 (2011). [CrossRef]

**17. **M. A. K. Sagar, K. P. Cheng, J. N. Ouellette, J. C. Williams, J. J. Watters, and K. W. Eliceiri, “Machine learning methods for fluorescence lifetime imaging (flim) based label-free detection of microglia,” Front. Neurosci. **14**, 931 (2020). [CrossRef]

**18. **M. A. K. Sagar, J. N. Ouellette, K. P. Cheng, J. C. Williams, J. J. Watters, and K. W. Eliceiri, “Microglia activation visualization via fluorescence lifetime imaging microscopy of intrinsically fluorescent metabolic cofactors,” Neurophotonics **7**(03), 1–14 (2020). [CrossRef]

**19. **D. Gao, P. R. Barber, J. V. Chacko, M. A. Kader Sagar, C. T. Rueden, A. R. Grislis, M. C. Hiner, and K. W. Eliceiri, “Flimj: An open-source imagej toolkit for fluorescence lifetime image data analysis,” PLoS One **15**(12), e0238327 (2020). [CrossRef]

**20. **N. Nakashima, K. Yoshihara, F. Tanaka, and K. Yagi, “Picosecond fluorescence lifetime of the coenzyme of d-amino acid oxidase,” J. Biol. Chem. **255**(11), 5261–5263 (1980). [CrossRef]

**21. **X. Pan, J. Zhao, and J. Xu, “An end-to-end and localized post-processing method for correcting high-resolution remote sensing classification result images,” Remote Sens. **12**(5), 852 (2020). [CrossRef]

**22. **S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE international conference on computer vision, (2015), pp. 1529–1537.

**23. **O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, (Springer, 2015), pp. 234–241.

**24. **J. T. Smith, R. Yao, N. Sinsuebphon, A. Rudkouskaya, N. Un, J. Mazurkiewicz, M. Barroso, P. Yan, and X. Intes, “Fast fit-free analysis of fluorescence lifetime imaging via deep learning,” Proc. Natl. Acad. Sci. **116**(48), 24019–24030 (2019). [CrossRef]

**25. **G. Wu, T. Nowotny, Y. Zhang, H.-Q. Yu, and D. D.-U. Li, “Artificial neural network approaches for fluorescence lifetime imaging techniques,” Opt. Lett. **41**(11), 2561–2564 (2016). [CrossRef]

**26. **M. V. Qiang Wang and J. Hopgood, “Fluorescence lifetime endomicroscopic image-based ex-vivo human lung cancer differentiation using machine learning,” TechRxiv (2020).

**27. **J. A. Jo, Q. Fang, T. Papaioannou, and L. Marcu, “Fast model-free deconvolution of fluorescence decay for analysis of biological systems,” J. Biomed. Opt. **9**(4), 743–753 (2004). [CrossRef]

**28. **D. O’Connor, W. Ware, and J. Andre, “Deconvolution of fluorescence decay curves. a critical comparison of techniques,” J. Phys. Chem. **83**(10), 1333–1343 (1979). [CrossRef]

**29. **A. Gafni, R. L. Modlin, and L. Brand, “Analysis of fluorescence decay curves by means of the laplace transformation,” Biophys. J. **15**(3), 263–280 (1975). [CrossRef]

**30. **W. R. Ware, L. J. Doemeny, and T. L. Nemzek, “Deconvolution of fluorescence and phosphorescence decay curves. least-squares method,” J. Phys. Chem. **77**(17), 2038–2048 (1973). [CrossRef]

**31. **M. L. Johnson and S. G. Frasier, “[16] nonlinear least-squares analysis,” in Methods in Enzymology, vol. 117 (Elsevier, 1985), pp. 301–342.

**32. **Y. Zhang, Y. Chen, and D. D.-U. Li, “Optimizing laguerre expansion based deconvolution methods for analysing bi-exponential fluorescence lifetime images,” Opt. Express **24**(13), 13894–13905 (2016). [CrossRef]

**33. **J. A. Jo, Q. Fang, T. Papaioannou, J. D. Baker, A. Dorafshar, T. Reil, J. Qiao, M. C. Fishbein, J. A. Freischlag, and L. Marcu, “Laguerre-based method for analysis of time-resolved fluorescence data: application to in-vivo characterization and diagnosis of atherosclerotic lesions,” J. Biomed. Opt. **11**(2), 021004 (2006). [CrossRef]

**34. **A. S. Dabir, C. Trivedi, Y. Ryu, P. Pande, and J. A. Jo, “Fully automated deconvolution method for on-line analysis of time-resolved fluorescence spectroscopy data based on an iterative laguerre expansion technique,” J. Biomed. Opt. **14**(2), 024030 (2009). [CrossRef]

**35. **A. J. Walsh, K. Mueller, I. Jones, C. M. Walsh, N. Piscopo, N. N. Niemi, D. J. Pagliarini, K. Saha, and M. C. Skala, “Label-free method for classification of t cell activation,” bioRxiv (2019).

**36. **X. Zhu, H.-I. Suk, and D. Shen, “A novel matrix-similarity based loss function for joint regression and classification in ad diagnosis,” NeuroImage **100**, 91–105 (2014). [CrossRef]

**37. **H. Wang, F. Nie, H. Huang, S. Risacher, A. J. Saykin, and L. Shen, ADNI, “Identifying ad-sensitive and cognition-relevant imaging biomarkers via joint classification and regression,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2011), pp. 115–123.

**38. **M. Liu, J. Zhang, E. Adeli, and D. Shen, “Deep multi-task multi-channel learning for joint classification and regression of brain status,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2017), pp. 3–11.

**39. **Z. Abraham and P.-N. Tan, “An integrated framework for simultaneous classification and regression of time-series data,” in Proceedings of the 2010 SIAM International Conference on Data Mining, (SIAM, 2010), pp. 653–664.

**40. **J. Chen, L. Cheng, X. Yang, J. Liang, B. Quan, and S. Li, “Joint learning with both classification and regression models for age prediction,” in Journal of Physics: Conference Series, vol. 1168 (IOP Publishing, 2019), p. 032016.

**41. **A. Barbu, “Training an active random field for real-time image denoising,” IEEE Trans. on Image Process. **18**(11), 2451–2462 (2009). [CrossRef]

**42. **J. Domke, “Learning graphical model parameters with approximate marginal inference,” IEEE Trans. Pattern Anal. Mach. Intell. **35**(10), 2454–2467 (2013). [CrossRef]

**43. **S. Ross, D. Munoz, M. Hebert, and J. A. Bagnell, “Learning message-passing inference machines for structured prediction,” in CVPR 2011, (IEEE, 2011), pp. 2737–2744.

**44. **V. Stoyanov, A. Ropson, and J. Eisner, “Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, (2011), pp. 725–733.

**45. **P. Krähenbühl and V. Koltun, “Efficient inference in fully connected CRFS with gaussian edge potentials,” in Advances in Neural Information Processing Systems (2011), pp. 109–117.

**46. **B. Amos and J. Z. Kolter, “Optnet: differentiable optimization as a layer in neural networks,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, (JMLR. org, 2017), pp. 136–145.

**47. **J. Liu, Y. Sun, J. Qi, and L. Marcu, “A novel method for fast and robust estimation of fluorescence decay dynamics using constrained least-squares deconvolution with laguerre expansion,” Phys. Med. Biol. **57**(4), 843–865 (2012). [CrossRef]

**48. **A. Adams, J. Baek, and M. A. Davis, “Fast high-dimensional filtering using the permutohedral lattice,” in Computer Graphics Forum, vol. 29 (Wiley Online Library, 2010), pp. 753–762.

**49. **G. Gordon and R. Tibshirani, “Karush-Kuhn-Tucker conditions,” Optimization **10**, 725 (2012).

**50. **A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, eds. (Curran Associates, Inc., 2019), pp. 8024–8035.

**51. **I. I. Cplex, “V12. 1: User’s manual for cplex,” Int. Bus. Mach. Corp. **46**, 157 (2009).

**52. **L. Gurobi Optimization, “Gurobi optimizer reference manual,” (2021).

**53. **A. Graham, * Kronecker Products and Matrix Calculus with Applications* (Courier Dover Publications, 2018).

**54. **W. Denk, J. Strickler, and W. Webb, “Two-photon laser scanning fluorescence microscopy,” Science **248**(4951), 73–76 (1990). [CrossRef]

**55. **L. Yan, C. T. Rueden, J. G. White, and K. W. Eliceiri, “Applications of combined spectral lifetime microscopy for biology,” BioTechniques **41**(3), 249–257 (2006). [CrossRef]

**56. **A. A. Taha and A. Hanbury, “Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool,” BMC Med. Imaging **15**(1), 29 (2015). [CrossRef]

**57. **J. M. Santos and M. Embrechts, “On the use of the adjusted rand index as a metric for evaluating supervised classification,” in International conference on artificial neural networks, (Springer, 2009), pp. 175–184.

**58. **S. Marsland, * Machine Learning: An Algorithmic Perspective* (CRC Press, 2015).