Abstract
Real-time intraoperative delineation of cancerous and non-cancerous brain tissue, especially in the eloquent cortex, is critical for thorough cancer resection, lengthening survival, and improving quality of life. Prior studies have established that thresholding optical attenuation values reveals cancer regions with high sensitivity and specificity. However, thresholding a single value disregards local information that is important for making more robust predictions. Hence, we propose deep convolutional neural networks (CNNs) trained on labeled OCT images and on co-occurrence matrix features extracted from those images to synergize attenuation characteristics and texture features. Specifically, we adapt a deep ensemble model trained on 5,831 examples from a training dataset of 7 patients. We obtain 93.31% sensitivity and 97.04% specificity on a holdout set of 4 patients without the need for beam profile normalization using a reference phantom. The segmentation maps produced by parsing the OCT volume and tiling the outputs of our model are in excellent agreement with attenuation mapping-based methods. Our new approach for this important application has considerable implications for clinical translation.
© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement
1. Introduction
Optimal brain tumor resection, defined as maximal removal of cancerous tissue while sparing surrounding noncancerous tissue, is the key to delaying tumor recurrence and minimizing surgery-inflicted neurological damage, thus increasing patient quality of life and survival time [1,2]. This is a challenging problem, especially in the eloquent cortical and subcortical regions where high resection precision is required [3]. Hence, it is critical for a modern intraoperative brain cancer imaging system to provide an accurate segmentation map with high spatial resolution in real time. In this regard, optical coherence tomography (OCT) is advantageous for its rapid, non-contact, label-free, micron-resolution imaging capability. In contrast, other modalities such as CT and MRI, which have made significant impacts on diagnosis and pre-surgery planning, are less suitable for intraoperative, continuous, real-time guidance due to drawbacks or tradeoffs in speed, resolution, safety, and cost.
Previous studies identified tissue optical attenuation as a single parameter that can be used to distinguish between cancerous and non-cancerous brain tissue. In general, proliferation of cancer in white matter leads to myelin degradation, resulting in decreased optical attenuation [4,5]. Optical attenuation is also useful for detecting disease in other tissues, such as the colon, and can be coupled with nanoparticle contrast agents [6,7]. For a 3D OCT data set consisting of multiple B-frames, an en face attenuation map can be generated with each pixel color-coded as cancerous or non-cancerous depending on whether its value is above or below a threshold. However, the performance of thresholding is strictly limited by the overlap in the distributions of attenuation values for cancer and non-cancer tissues [8,9]. To increase robustness while achieving high performance, we propose a novel cancer classifier capable of integrating local and global features in an OCT image via both network-based multi-resolution analysis and classic texture analysis. Previous studies have employed machine learning methods on hand-crafted textures or deep neural networks for prediction on the B-frame, but our proposed method synergizes both [10–12]. For this purpose, deep convolutional neural networks (CNNs) are selected to perform end-to-end classification on OCT images and their textures. These two information domains are then linked by concatenating their feature embeddings via ensemble learning.
In the next section, the full data processing pipeline is described, including the dataset characteristics, sparse-sampling and texture extraction methods, and the ensemble learning architecture. In the third section, the results obtained using our ensemble model are reported in comparison to its two standalone subnetworks, and the network behavior is explained with gradient class activation maps (grad-CAM). In the last section, open issues are discussed and promising directions of related research are identified.
2. Methods and materials
The dataset in this study comprises 23 cancerous and 21 non-cancerous brain cortex samples from 10 patients averaging 62.6 years of age (min: 33 years, max: 78 years). Fresh human brain tissue samples were obtained at the Johns Hopkins Hospital under an approved Institutional Review Board protocol. The samples were kept in a cold saline solution, and ex vivo images were taken within one hour of surgical removal of the tissues. All the samples used in the study were evaluated by a pathologist to be mostly homogeneous and were imaged with a home-built SS-OCT system over a 2 mm × 2 mm area along a depth of ~4.3 mm in tissue. Each 3D OCT dataset consists of 256 B-frames, where one B-frame is a collection of 2,048 A-lines measuring 2,048 pixels per line. The dataset was allocated according to Fig. 1(A): seven patients were used for training the deep networks, four patients were used for testing, and a segmentation map was generated on the last patient’s sample containing mixed cancer and non-cancer regions. Note that the testing set comprised samples taken from different brain locations in two patients present in the training dataset as well as samples from two completely new patients. Hence, the high accuracy of the model on the testing data indicates high intra-patient and inter-patient performance.
2.1 Dataset sparse sampling
In this study, logarithmic (base 10) B-frame data is used. The transverse resolution of our OCT system is 25 µm, meaning that adjacent B-frames within this distance show no significant structural changes. Hence, we adopt sparse sampling to ensure that the examples included in the dataset are sufficiently representative of the volumetric dataset. For each OCT scan of a cortex sample, a step size of 80 µm is used to select ∼25 representative B-frames. A small proportion of B-frames with significant fragmenting at the tissue surface is manually removed. A 5 × 5 median filter is then applied to the remaining frames to suppress noise, followed by conventional OCT image processing. Each B-frame is evenly divided into seven slices of 100 A-lines separated by 290 µm. Finally, within each slice, the A-lines are truncated to cover the single-scattering region, determined to be approximately 200 pixels (418 µm in total) below the sample surface. In this way, 5,831 total examples are generated (34 samples × ∼25 B-frames/sample × 7 slices/B-frame). As shown in Fig. 1(B), non-cancerous B-frame slices exhibit a stronger white-to-black gradient from top to bottom in comparison to cancerous B-frame slices. This is a consequence of healthy white matter attenuating the intensity of the incident light more rapidly with depth. Prior studies utilizing statistical and machine learning methods have demonstrated that image texture is also discriminative for this task [10].
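The slicing step above can be sketched in a few lines of NumPy. This is a simplified illustration, not our processing code: surface detection is out of scope here (the surface is taken as the top row), and the slice offsets only approximate the 290 µm spacing described above.

```python
import numpy as np

def slice_bframe(bframe, n_slices=7, slice_width=100, depth_px=200, surface_row=0):
    """Divide one log-scale B-frame (depth x A-lines) into evenly spaced
    slices of `slice_width` A-lines, each truncated to `depth_px` pixels
    below the sample surface (the single-scattering region)."""
    n_alines = bframe.shape[1]
    # Evenly spaced starting columns; for 2,048 A-lines over 2 mm this
    # gives roughly the 290 um slice separation used in the paper.
    starts = np.arange(n_slices) * (n_alines // n_slices)
    slices = [bframe[surface_row:surface_row + depth_px, s:s + slice_width]
              for s in starts]
    return np.stack(slices)  # shape: (n_slices, depth_px, slice_width)

# One synthetic 2,048 x 2,048 B-frame.
frame = np.random.rand(2048, 2048)
out = slice_bframe(frame)
print(out.shape)  # (7, 200, 100)
```

Each resulting 200 × 100 slice is one training example for the B-frame CNN.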
2.2 Texture embedding
To capture global image texture, the grey level co-occurrence matrix (GLCM) is computed for each B-frame slice. The GLCM is a classic approach for quantifying the distribution of co-occurring pixel values that has been applied to extract discriminative features in many medical imaging applications, including OCT [13]. First, the global maximum and minimum of the training set are used to linearly transform the values in each B-frame slice to a new range of 0 to 99 inclusive. Then, a 100 × 100 GLCM texture map is computed for each slice using an adjacency condition of 15 pixels at an angle of 0 degrees, selected by trial and error. The row and column indices of the GLCM represent pixel intensities, and the matrix elements record how frequently the two given intensities are adjacent [14]. For example, if intermediate intensities (say 49 ± 10 on a 0–99 scale) are frequently adjacent to other intermediate intensities, then the resulting GLCM will exhibit a bright white spot near its center. Figure 1(B) shows typical textural features in the training set. Qualitatively, concentrated circles are characteristic of cancer slices, whereas textures of non-cancer slices are more oblong and smeared.
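A minimal pure-NumPy sketch of this computation follows (library implementations such as scikit-image's `graycomatrix` would serve equally well). One simplification relative to the paper: quantization here uses the image's own extrema rather than the training set's global min/max.

```python
import numpy as np

def glcm_0deg(img, levels=100, distance=15):
    """GLCM at 0 degrees: counts how often quantized intensities (i, j)
    co-occur `distance` pixels apart along a row."""
    # Quantize to integer bins 0..levels-1.
    q = (img - img.min()) / (np.ptp(img) + 1e-12) * levels
    q = np.clip(q.astype(int), 0, levels - 1)
    ref, nbr = q[:, :-distance], q[:, distance:]  # pixel and its right neighbor
    m = np.zeros((levels, levels), dtype=np.int64)
    np.add.at(m, (ref.ravel(), nbr.ravel()), 1)   # accumulate pair counts
    return m

# One 200 x 100 B-frame slice -> 100 x 100 texture map.
tex = glcm_0deg(np.random.rand(200, 100))
print(tex.shape, tex.sum())  # (100, 100) 17000
```

The total count equals rows × (columns − distance) = 200 × 85 = 17,000 pixel pairs; a bright region at index (49, 49) would correspond to the "intermediate next to intermediate" case described above.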
2.3 Ensemble learning architecture
In CNNs for classification, the dimension of the first layer is defined by the input dimensions, which are 200 × 100 for B-frame slices and 100 × 100 for textures. The final layer has two nodes, corresponding to the probabilities of cancer and non-cancer for the given input. Compared to the texture CNN, the B-frame CNN is wider in order to preserve visual information. Specifically, the number of feature maps increases as the size of each feature map decreases. By training separate CNNs on the textures and B-frame slices, their embeddings (i.e., the activations of the penultimate layer) can be extracted independently, concatenated together, and connected to a multilayer perceptron, which synergistically makes the final prediction. Figure 2 shows the layer-by-layer details of the networks. In particular, note that the penultimate fully connected layer is chosen as the embedding and the third-to-last convolutional layer is selected for grad-CAM visualization. Grad-CAM weights the feature maps in the given layer by the average of the derivative of the output with respect to each pixel in the feature map. Overlaying the grad-CAM output on the input image heuristically indicates which regions are more or less important.
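The embedding-concatenation design can be sketched in PyTorch as follows. The layer sizes and channel counts here are illustrative toys, not the exact architecture of Fig. 2; only the overall pattern (two subnetworks, penultimate-layer embeddings, concatenation, MLP head) reflects the described method.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy stand-in for the B-frame / texture subnetworks: a conv stack
    whose penultimate dense layer serves as the embedding."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)))
        self.embed = nn.Sequential(nn.Flatten(),
                                   nn.Linear(16 * 4 * 4, embed_dim), nn.ReLU())
        self.head = nn.Linear(embed_dim, 2)  # used when trained standalone

    def forward(self, x, return_embedding=False):
        z = self.embed(self.features(x))
        return z if return_embedding else self.head(z)

class Ensemble(nn.Module):
    """Concatenates the two embeddings and classifies with an MLP."""
    def __init__(self, bframe_cnn, texture_cnn, embed_dim=64):
        super().__init__()
        self.a, self.b = bframe_cnn, texture_cnn
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, 32), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(32, 2))

    def forward(self, bframe, texture):
        z = torch.cat([self.a(bframe, return_embedding=True),
                       self.b(texture, return_embedding=True)], dim=1)
        return self.mlp(z)

# Inputs: 200 x 100 B-frame slice and 100 x 100 GLCM texture (batch of 1).
model = Ensemble(SmallCNN(), SmallCNN())
model.eval()
logits = model(torch.randn(1, 1, 200, 100), torch.randn(1, 1, 100, 100))
probs = torch.softmax(logits, dim=1)  # two class probabilities
```

In practice the two subnetworks would be trained first, then the MLP head trained on their concatenated embeddings.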
The final layer of the model is subject to the softmax activation. Binary cross entropy is selected as the loss function with balanced class weights regularization used to proportionally re-weight the loss in order to prevent bias towards the majority class [15]:

$$L(\theta) = -\frac{1}{N}\sum_{i = 1}^{N}\sum_{j = 1}^{M} w_j\, y_{ij} \log ({p_{ij}}) \qquad (1)$$
In Eq. (1), the loss L is a function of the model parameters $\theta$. N = 8 is the batch size, M = 2 is the number of classes, and y and p are scalars corresponding to the label and prediction respectively; for example, $p_{ij}$ denotes the model prediction for class j on training example i, and $w_j$ is the balanced weight for class j. Stochastic gradient descent was used as the optimizer with a learning rate of 0.001 for the “B-frame slices CNN” and a learning rate of 0.01 for the “texture CNN” and the overall ensemble. Furthermore, all labels are one-hot encoded and smoothed according to a well-established protocol $y_i^{\mathrm{smooth}} = y_i(1 - \epsilon) + \epsilon/M$ with $\epsilon = 0.05$, preventing network overconfidence [16]. This is especially important in this application since the cortex samples may contain inhomogeneities, and mislabeled examples would impede training. Additionally, random up-down and left-right flipping was applied to the inputs to augment the data, and the dropout rate was set to 0.5 for each dense layer to discourage overfitting [17]. Global min-max normalization was applied to all training examples. All hyperparameter values (layer size, feature maps, optimizer, loss, learning rate, etc.) were selected through empirical experimentation following general good practices as briefly described in this section. All network experiments were conducted on a GeForce RTX 3060 Ti GPU and an AMD Ryzen 3700X CPU.
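As a concrete numeric check of the loss and label-smoothing scheme described above, a minimal NumPy implementation on a toy batch (the helper name and example values are illustrative only):

```python
import numpy as np

def smoothed_weighted_ce(y_onehot, p, class_weights, eps=0.05):
    """Cross entropy with per-class balanced weights w_j and label
    smoothing applied to the one-hot targets, averaged over the batch."""
    M = y_onehot.shape[1]
    y_s = y_onehot * (1 - eps) + eps / M                 # smoothed labels
    per_example = -np.sum(class_weights * y_s * np.log(p + 1e-12), axis=1)
    return float(per_example.mean())

# Toy batch (N = 2, M = 2) with equal class weights.
y = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[0.9, 0.1], [0.2, 0.8]])
loss = smoothed_weighted_ce(y, p, class_weights=np.array([1.0, 1.0]))
print(round(loss, 4))  # 0.209
```

With unequal class frequencies, `class_weights` would be set inversely proportional to each class's share of the training set, which is what prevents the majority class from dominating the gradient.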
3. Results and discussion
The ensemble model outperformed its component subnetworks on the testing set, achieving 94.90% ± 0.11% accuracy after 50 epochs of training. The model is also compact, capable of processing one B-frame every 7 milliseconds (143 fps) on a GeForce RTX 3060 Ti GPU and an AMD Ryzen 3700X CPU. The gradient class activation maps demonstrate that the model makes decisions based on relevant features and is therefore interpretable.
3.1 Network results
The training graphs in Fig. 3(A) exhibit smooth convergence and well-correlated training and validation curves. For many epochs, the validation accuracy even exceeded the training accuracy. This phenomenon is likely due to augmentation and dropout being deactivated during validation, making predictions more straightforward during the validation phase. The texture CNN stabilized more quickly and at a higher accuracy than the B-frame slices CNN, indicating that the problem is more easily solved in the texture domain. Additionally, the ensemble model reached high accuracy after just one epoch and achieved a slightly higher validation accuracy than either subnetwork. This is reasonable because the embeddings are sufficiently high-level that a good prediction can already be made from a simple linear transformation. As seen in Fig. 3(A), not all networks were trained for the full 50 epochs, as early stopping was triggered if there was no improvement in validation accuracy for 10 epochs.
A segmentation map was generated from an additional patient sample including both cancer and non-cancer regions. The sample contained 256 B-frames with 20 slice and texture pairs extracted from each frame. Each pair was input sequentially into the ensemble model, generating binary predictions that were then tiled into a 256 × 20 map. After resizing this map to 20 × 20 for better viewing, the width and height of each pixel are approximately 0.1 mm. Figures 3(B), 3(C), and 3(D) provide a side-by-side comparison of the network-generated segmentation map and the color-coded attenuation map generated using the traditional method, which are in good agreement.
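The tiling-and-resizing step above can be sketched as follows; the bin-averaging downsampler is one plausible choice (the paper does not specify the resizing method), and the random predictions stand in for real model outputs.

```python
import numpy as np

def to_seg_map(preds, out_rows=20):
    """Downsample a (n_frames x n_slices) grid of binary predictions to
    out_rows rows by averaging each bin of frames and thresholding."""
    bins = np.arange(preds.shape[0]) * out_rows // preds.shape[0]
    avg = np.array([preds[bins == r].mean(axis=0) for r in range(out_rows)])
    return (avg >= 0.5).astype(int)  # majority vote per output pixel

# 256 B-frames x 20 slice predictions -> 20 x 20 en face map, where each
# pixel covers ~0.1 mm x 0.1 mm of the 2 mm x 2 mm scan area.
preds = np.random.randint(0, 2, (256, 20))
seg_map = to_seg_map(preds)
print(seg_map.shape)  # (20, 20)
```

Averaging before thresholding also smooths isolated misclassified slices out of the final map.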
3.2 Network ablation analysis
To determine whether the performance gain is significant with respect to individual network components, we quantified model confidence by computing the standard deviation in accuracy over 50 evaluations on the testing set with dropout active. Because dropout randomly disconnects nodes, a different testing accuracy is generated over each iteration. Models with a low spread in testing accuracies and thus higher confidence indicate that the learned feature representations are more robust [17,18]. Hence, we calculate the t-statistic to compare model accuracies and F-ratio to compare model variances/confidences. At an alpha level of 0.05, the critical t-value is 1.661 and the critical F-ratio is 1.607.
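The two test statistics can be computed directly from the 50 Monte Carlo dropout accuracies; a short sketch with simulated accuracy samples (the means and spreads below are placeholders, not the paper's measurements):

```python
import numpy as np

def t_statistic(a, b):
    """Welch two-sample t-statistic comparing mean accuracies."""
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

def f_ratio(a, b):
    """F-ratio of sample variances, larger variance in the numerator,
    so values near 1 indicate similar spread (confidence)."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return max(va, vb) / min(va, vb)

# 50 dropout-active evaluations per model (simulated accuracy samples).
rng = np.random.default_rng(0)
acc_ensemble = rng.normal(0.949, 0.001, 50)
acc_texture = rng.normal(0.945, 0.002, 50)
t = t_statistic(acc_ensemble, acc_texture)   # compare against 1.661
f = f_ratio(acc_ensemble, acc_texture)       # compare against 1.607
```

Exceeding the critical t-value indicates a significant accuracy difference; exceeding the critical F-ratio indicates a significant difference in confidence.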
Table 1 reports that the CNN trained on texture is significantly more accurate than the CNN trained on B-frame slices. Due to this performance disparity, the accuracy of the ensemble is not significantly increased. However, the ensemble becomes more confident in its decision-making, as expected when linking two complementary information domains. In future work such as distinguishing between cancer grades, it is conceivable that the texture differences will be more subtle, leading to lower accuracy in the texture CNN. We simulated this scenario by saving the weights of a “sub-optimal texture” model at an earlier epoch. When the resulting “sub-optimal ensemble” is constructed in the same way, both its accuracy and confidence increase relative to its subnetwork counterparts. Hence, the utility of ensemble learning is further justified.
3.3 Network interpretability
Figure 4 displays representative grad-CAMs for both the B-frame slice and texture CNNs in each of the four possible prediction scenarios. Grad-CAM heatmaps offer a visual way to explain network behavior by highlighting the pixels that contribute most strongly to the final decision [19]. In the true positive and true negative cases of the texture CNN, the superimposed heat maps outline the relevant areas of high intensity. In other words, the network learned that the shape of these textures is an important feature. On the other hand, the B-frame slice CNN identifies the very bottom part of a slice as a key feature. This is consistent with the observation that lower optical attenuation in cancerous white matter leads to more uniform intensity throughout the slice. In non-cancerous tissues, the signal becomes significantly attenuated towards the bottom of the single-scattering region. Hence, the behavior of the network correlates with the underlying principles of OCT.
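A compact Grad-CAM implementation in PyTorch illustrates the mechanism: feature maps of a chosen convolutional layer are weighted by the mean gradient of the target-class score, summed, and rectified. The toy model below is a stand-in for the B-frame slice CNN, not the paper's network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(conv_layer, model, x, target_class):
    """Grad-CAM: weight each feature map of `conv_layer` by the mean
    gradient of the target-class score, sum over channels, then ReLU."""
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = conv_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    model(x)[0, target_class].backward()
    h1.remove(); h2.remove()
    w = grads['g'].mean(dim=(2, 3), keepdim=True)          # per-map weights
    cam = F.relu((w * acts['a']).sum(dim=1, keepdim=True)) # weighted sum
    # Upsample the heatmap to the input size for overlaying.
    return F.interpolate(cam, size=x.shape[2:], mode='bilinear',
                         align_corners=False).detach()

# Toy CNN; grad_cam is applied to its second convolutional layer.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))
x = torch.randn(1, 1, 200, 100)
heatmap = grad_cam(model[2], model, x, target_class=0)
```

Because of the final ReLU, the heatmap is non-negative, highlighting only regions that push the prediction toward the target class.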
4. Conclusion
A deep network cancer detector capable of producing accurate segmentation maps in real time has been presented in this paper. Specifically, the ensemble model comprising a texture CNN and a B-frame CNN has been designed and validated to be sensitive, specific, fast, stable, interpretable, and superior to its individual components alone. In terms of raw performance, the results are comparable to our previous attenuation-based methods, which are between 92–100% sensitive and 80–100% specific depending on the selected threshold [9]. For this reason, single-parameter attenuation thresholding methods remain attractive for their clear explainability. However, deep learning models are quickly approaching a point where they can be even more robust. For example, in this study, we demonstrated the ability to detect cancer without reference phantom normalization. Deep learning approaches for OCT-based brain cancer detection also have the potential to be invariant to surface position and focal depth, which would be highly advantageous and worthy of investigation in further studies. Additionally, differentiation between low- and high-grade cancers, as well as between cancers in white or grey matter, may be possible based on deep learning analysis of structure in the speckle noise, given that speckle may present in subtly different patterns in different tissues. Finally, deep networks may be designed to be robust to challenges in the clinical setting such as motion artifacts, image defocusing, and the presence of blood. These are all important directions of future work, which will need to utilize datasets from in vivo animal model studies or more patients.
Funding
National Institutes of Health (R01CA200399, P41EB032840).
Disclosures
The authors declare no conflicts of interest.
Data Availability
The processing codes developed for this paper are available upon request.
References
1. N. Sanai, M.-Y. Polley, M. W. McDermott, A. T. Parsa, and M. S. Berger, “An extent of resection threshold for newly diagnosed glioblastomas,” J. Neurosurg. 115(1), 3–8 (2011). [CrossRef]
2. M. J. McGirt, K. L. Chaichana, M. Gathinji, F. J. Attenello, K. Than, A. Olivi, J. D. Weingart, H. Brem, and A. R. Quiñones-Hinojosa, “Independent association of extent of resection with survival in patients with malignant brain astrocytoma,” J. Neurosurg. 110(1), 156–162 (2009). [CrossRef]
3. G. E. Keles, D. A. Lundin, K. R. Lamborn, E. F. Chang, G. Ojemann, and M. S. Berger, “Intraoperative subcortical stimulation mapping for hemispherical perirolandic gliomas located within or adjacent to the descending motor pathways: evaluation of morbidity and assessment of functional outcome in 294 patients,” J. Neurosurg. 100(3), 369–375 (2004). [CrossRef]
4. M. Alaminos, V. Dávalos, S. Ropero, F. Setién, M. F. Paz, M. Herranz, M. F. Fraga, J. Mora, N.-K. V. Cheung, W. L. Gerald, and M. Esteller, “EMP3, a myelin-related gene located in the critical 19q13.3 region, is epigenetically silenced and exhibits features of a candidate tumor suppressor in glioma and neuroblastoma,” Cancer Res. 65(7), 2565–2571 (2005). [CrossRef]
5. F. J. van der Meer, D. J. Faber, M. C. G. Aalders, A. A. Poot, I. Vermes, and T. G. van Leeuwen, “Apoptosis- and necrosis-induced changes in light attenuation measured by optical coherence tomography,” Lasers Med. Sci. 25(2), 259–267 (2010). [CrossRef]
6. J. Xi, Y. Chen, and X. Li, “Characterizing optical properties of nano contrast agents by using cross-referencing OCT imaging,” Biomed. Opt. Express 4(6), 842 (2013). [CrossRef]
7. W. Yuan, Y. Feng, D. Chen, P. Gharibani, J. D. Z. Chen, H. Yu, and X. Li, “In vivo assessment of inflammatory bowel disease in rats with ultrahigh-resolution colonoscopic OCT,” Biomed. Opt. Express 13(4), 2091 (2022). [CrossRef]
8. W. Yuan, C. Kut, W. Liang, and X. Li, “Robust and fast characterization of OCT-based optical attenuation using a novel frequency-domain algorithm for brain cancer detection,” Sci. Rep. 7(1), 44909 (2017). [CrossRef]
9. C. Kut, K. L. Chaichana, J. Xi, S. M. Raza, X. Ye, E. R. McVeigh, F. J. Rodriguez, A. Quinones-Hinojosa, and X. Li, “Detection of human brain cancer infiltration ex vivo and in vivo using quantitative optical coherence tomography,” Sci. Transl. Med. 7(292), 292ra100 (2015). [CrossRef]
10. J. Möller, A. Bartsch, M. Lenz, I. Tischoff, R. Krug, H. Welp, M. R. Hofmann, K. Schmieder, and D. Miller, “Applying machine learning to optical coherence tomography images for automated tissue classification in brain metastases,” Int. J. Comput. Assist. Radiol. Surg. 16(9), 1517–1526 (2021). [CrossRef]
11. R. M. Juarez-Chambi, C. Kut, J. J. Rico-Jimenez, K. L. Chaichana, J. Xi, D. U. Campos-Delgado, F. J. Rodriguez, A. Quinones-Hinojosa, X. Li, and J. Jo, “AI-assisted in situ detection of human glioma infiltration using a novel computational method for optical coherence tomography,” Clin. Cancer Res. 25(21), 6329–6338 (2019). [CrossRef]
12. P. Strenge, B. Lange, W. Draxinger, C. Grill, V. Danicke, D. Theisen-Kunde, C. Hagel, S. Spahr-Hess, M. M. Bonsanto, H. Handels, R. Huber, and R. Brinkmann, “Differentiation of different stages of brain tumor infiltration using optical coherence tomography: Comparison of two systems and histology,” Front. Oncol. 12, 896060 (2022). [CrossRef]
13. S. Adabi, M. Hosseinzadeh, S. Noei, S. Conforto, S. Daveluy, A. Clayton, D. Mehregan, and M. Nasiriavanaki, “Universal in vivo textural model for human skin based on optical coherence tomograms,” Sci. Rep. 7(1), 17912 (2017). [CrossRef]
14. B. S. V., “Grey level co-occurrence matrices: generalisation and some new features,” Int. J. Comput. Sci. Eng. Inf. Technol. 2(2), 151–157 (2012). [CrossRef]
15. Z. Xu, C. Dan, J. Khim, and P. Ravikumar, “Class-weighted classification: trade-offs and robust approaches,” in Proceedings of the 37th International Conference on Machine Learning (PMLR, 2020), pp. 10544–10554.
16. R. Müller, S. Kornblith, and G. Hinton, “When does label smoothing help?” arXiv, arXiv:1906.02629v3 (2019). [CrossRef]
17. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res. 15(56), 1929–1958 (2014). [CrossRef]
18. M. Sensoy, L. Kaplan, and M. Kandemir, “Evidential deep learning to quantify classification uncertainty,” in Advances in Neural Information Processing Systems 31 (NeurIPS 2018).
19. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: visual explanations from deep networks via gradient-based localization,” Int. J. Comput. Vis. 128(2), 336–359 (2020). [CrossRef]