Abstract
Real-time intraoperative delineation of cancerous and non-cancerous brain tissue, especially in the eloquent cortex, is critical for thorough cancer resection, lengthening survival, and improving quality of life. Prior studies have established that thresholding optical attenuation values reveals cancer regions with high sensitivity and specificity. However, thresholding a single value disregards local information that is important for making more robust predictions. Hence, we propose deep convolutional neural networks (CNNs) trained on labeled OCT images and on co-occurrence matrix features extracted from those images to synergize attenuation characteristics and texture features. Specifically, we adapt a deep ensemble model trained on 5,831 examples from a training dataset of 7 patients. We obtain 93.31% sensitivity and 97.04% specificity on a holdout set of 4 patients without the need for beam profile normalization using a reference phantom. The segmentation maps produced by parsing the OCT volume and tiling the outputs of our model are in excellent agreement with attenuation mapping-based methods. Our new approach for this important application has considerable implications for clinical translation.
© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement
1. Introduction
Optimal brain tumor resection, defined as maximal removal of cancerous tissue while sparing surrounding noncancerous tissue, is the key to delaying tumor recurrence and minimizing surgery-inflicted neurological damage, thus increasing patient quality of life and survival time [1,2]. This is a challenging problem, especially in the eloquent cortical and subcortical regions where high resection precision is required [3]. Hence, it is critical for a modern intraoperative brain cancer imaging system to provide an accurate segmentation map with high spatial resolution in real time. In this regard, optical coherence tomography (OCT) is advantageous for its rapid, non-contact, label-free, micron-resolution imaging capability. In contrast, other modalities such as CT and MRI, which have made significant impacts on diagnosis and pre-surgery planning, are less suitable for intraoperative, continuous, real-time guidance due to drawbacks or tradeoffs in speed, resolution, safety, and cost.
Previous studies identified tissue optical attenuation as a single parameter that can be used to distinguish between cancerous and non-cancerous brain tissue. In general, proliferation of cancer in white matter leads to myelin degradation, resulting in decreased optical attenuation [4,5]. Optical attenuation is also useful for detecting disease in other tissues, such as the colon, and can be coupled with nanoparticle contrast agents [6,7]. For a 3D OCT data set consisting of multiple B-frames, an en face attenuation map can be generated with each pixel color-coded as cancerous or non-cancerous depending on whether its value is above or below a threshold. However, the performance of thresholding is strictly limited by the overlap in the distributions of attenuation values for cancer and non-cancer tissues [8,9]. To increase robustness while achieving high performance, we propose a novel cancer classifier capable of integrating local and global features in an OCT image via both network-based multi-resolution analysis and classic texture analysis. Previous studies have employed machine learning methods on hand-crafted textures or deep neural networks for prediction on the B-frame, but our proposed method synergizes both [10–12]. For this purpose, deep convolutional neural networks (CNNs) are selected to perform end-to-end classification on OCT images and their textures. These two information domains are then linked by concatenating their feature embeddings via ensemble learning.
In the next section, the full data processing pipeline is described, including the dataset characteristics, sparse-sampling and texture extraction methods, and the ensemble learning architecture. In the third section, the results obtained using our ensemble model are reported in comparison to its two standalone subnetworks, and the network behavior is explained with gradient class activation maps (grad-CAM). In the last section, open issues are discussed and promising directions of related research are identified.
2. Methods and materials
The dataset in this study comprises 23 cancerous and 21 non-cancerous brain cortex samples from 10 patients averaging 62.6 years of age (min: 33 years, max: 78 years). Fresh human brain tissue samples were obtained at the Johns Hopkins Hospital under an approved Institutional Review Board protocol. The samples were kept in a cold saline solution, and ex vivo images were taken within one hour of surgical removal of the tissues. All the samples used in the study were evaluated by a pathologist to be mostly homogeneous and were imaged with a home-built SS-OCT system over a 2 mm × 2 mm area along a depth of ~4.3 mm in tissue. Each 3D OCT dataset consists of 256 B-frames, where one B-frame is a collection of 2,048 A-lines measuring 2,048 pixels per line. The dataset was allocated according to Fig. 1(A): seven patients were used for training the deep networks, four patients were used for testing, and a segmentation map was generated on the last patient’s sample containing mixed cancer and non-cancer regions. Note that the testing set comprised samples taken from different brain locations in two patients present in the training dataset as well as samples from two completely new patients. Hence, the high accuracy of the model on the testing data indicates high intra-patient and inter-patient performance.
2.1 Dataset sparse sampling
In this study, logarithmic (base 10) B-frame data is used. The transverse resolution of our OCT system is 25 µm, meaning that adjacent B-frames within this distance show no significant structural changes. Hence, we adopt sparse sampling to ensure that the examples included in the dataset are sufficiently representative of the volumetric dataset. For each OCT scan of a cortex sample, a step size of 80 µm is used to select ∼25 representative B-frames. A small proportion of B-frames with significant fragmenting at the tissue surface is manually removed. A 5 × 5 median filter is then applied to the remaining frames to suppress noise, followed by conventional OCT image processing. Each B-frame is evenly divided into seven slices of 100 A-lines separated by 290 µm. Finally, within each slice, the A-lines are truncated to cover the single-scattering region, determined to be approximately 200 pixels (418 µm in total) below the sample surface. In this way, 5,831 total examples are generated (34 samples × ∼25 B-frames/sample × 7 slices/B-frame). As shown in Fig. 1(B), non-cancerous B-frame slices exhibit a stronger white-to-black gradient from top to bottom in comparison to cancerous B-frame slices. This is a consequence of healthy white matter attenuating the intensity of the incident light more rapidly with depth. Prior studies utilizing statistical and machine learning methods have demonstrated that image texture is also discriminative for this task [10].
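The slicing step above can be sketched in a few lines of NumPy. This is a simplified illustration, not our processing code: surface detection is out of scope here (the surface is taken as the top row), and the slice offsets only approximate the 290 µm spacing described above.

```python
import numpy as np

def slice_bframe(bframe, n_slices=7, slice_width=100, depth_px=200, surface_row=0):
    """Divide one log-scale B-frame (depth x A-lines) into evenly spaced
    slices of `slice_width` A-lines, each truncated to `depth_px` pixels
    below the sample surface (the single-scattering region)."""
    n_alines = bframe.shape[1]
    # Evenly spaced starting columns; for 2,048 A-lines over 2 mm this
    # gives roughly the 290 um slice separation used in the paper.
    starts = np.arange(n_slices) * (n_alines // n_slices)
    slices = [bframe[surface_row:surface_row + depth_px, s:s + slice_width]
              for s in starts]
    return np.stack(slices)  # shape: (n_slices, depth_px, slice_width)

# One synthetic 2,048 x 2,048 B-frame.
frame = np.random.rand(2048, 2048)
out = slice_bframe(frame)
print(out.shape)  # (7, 200, 100)
```

Each resulting 200 × 100 slice is one training example for the B-frame CNN.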
2.2 Texture embedding
To capture global image texture, the grey level co-occurrence matrix (GLCM) is computed for each B-frame slice. The GLCM is a classic approach for quantifying the distribution of co-occurring pixel values that has been applied to extract discriminative features in many medical imaging applications, including OCT [13]. First, the global maximum and minimum of the training set are used to linearly transform the values in each B-frame slice to a new range of 0 to 99 inclusive. Then, a 100 × 100 GLCM texture map is computed for each slice using an adjacency condition of 15 pixels at an angle of 0 degrees, selected by trial and error. The row and column indices of the GLCM represent pixel intensities, and the matrix elements record how frequently the two given intensities are adjacent [14]. For example, if intermediate intensities (say 49 ± 10 on a 0–99 scale) are frequently adjacent to other intermediate intensities, then the resulting GLCM will exhibit a bright white spot near its center. Figure 1(B) shows typical textural features in the training set. Qualitatively, concentrated circles are characteristic of cancer slices, whereas textures of non-cancer slices are more oblong and smeared.
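A minimal pure-NumPy sketch of this computation follows (library implementations such as scikit-image's `graycomatrix` would serve equally well). One simplification relative to the paper: quantization here uses the image's own extrema rather than the training set's global min/max.

```python
import numpy as np

def glcm_0deg(img, levels=100, distance=15):
    """GLCM at 0 degrees: counts how often quantized intensities (i, j)
    co-occur `distance` pixels apart along a row."""
    # Quantize to integer bins 0..levels-1.
    q = (img - img.min()) / (np.ptp(img) + 1e-12) * levels
    q = np.clip(q.astype(int), 0, levels - 1)
    ref, nbr = q[:, :-distance], q[:, distance:]  # pixel and its right neighbor
    m = np.zeros((levels, levels), dtype=np.int64)
    np.add.at(m, (ref.ravel(), nbr.ravel()), 1)   # accumulate pair counts
    return m

# One 200 x 100 B-frame slice -> 100 x 100 texture map.
tex = glcm_0deg(np.random.rand(200, 100))
print(tex.shape, tex.sum())  # (100, 100) 17000
```

The total count equals rows × (columns − distance) = 200 × 85 = 17,000 pixel pairs; a bright region at index (49, 49) would correspond to the "intermediate next to intermediate" case described above.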
2.3 Ensemble learning architecture
In CNNs for classification, the dimension of the first layer is defined by the input dimensions, which are 200 × 100 for B-frame slices and 100 × 100 for textures. The final layer has two nodes, corresponding to the probabilities of cancer and non-cancer for the given input. Compared to the texture CNN, the B-frame CNN is wider in order to preserve visual information. Specifically, the number of feature maps increases as the size of each feature map decreases. By training separate CNNs on the textures and B-frame slices, their embeddings (i.e., the activations of the penultimate layer) can be extracted independently, concatenated together, and connected to a multilayer perceptron, which synergistically makes the final prediction. Figure 2 shows the layer-by-layer details of the networks. In particular, note that the penultimate fully connected layer is chosen as the embedding and the third-to-last convolutional layer is selected for grad-CAM visualization. Grad-CAM weights the feature maps in the given layer by the average of the derivative of the output with respect to each pixel in the feature map. Overlaying the grad-CAM output on the input image heuristically indicates which regions are more or less important.
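The embedding-concatenation design can be sketched in PyTorch as follows. The layer sizes and channel counts here are illustrative toys, not the exact architecture of Fig. 2; only the overall pattern (two subnetworks, penultimate-layer embeddings, concatenation, MLP head) reflects the described method.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy stand-in for the B-frame / texture subnetworks: a conv stack
    whose penultimate dense layer serves as the embedding."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)))
        self.embed = nn.Sequential(nn.Flatten(),
                                   nn.Linear(16 * 4 * 4, embed_dim), nn.ReLU())
        self.head = nn.Linear(embed_dim, 2)  # used when trained standalone

    def forward(self, x, return_embedding=False):
        z = self.embed(self.features(x))
        return z if return_embedding else self.head(z)

class Ensemble(nn.Module):
    """Concatenates the two embeddings and classifies with an MLP."""
    def __init__(self, bframe_cnn, texture_cnn, embed_dim=64):
        super().__init__()
        self.a, self.b = bframe_cnn, texture_cnn
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, 32), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(32, 2))

    def forward(self, bframe, texture):
        z = torch.cat([self.a(bframe, return_embedding=True),
                       self.b(texture, return_embedding=True)], dim=1)
        return self.mlp(z)

# Inputs: 200 x 100 B-frame slice and 100 x 100 GLCM texture (batch of 1).
model = Ensemble(SmallCNN(), SmallCNN())
model.eval()
logits = model(torch.randn(1, 1, 200, 100), torch.randn(1, 1, 100, 100))
probs = torch.softmax(logits, dim=1)  # two class probabilities
```

In practice the two subnetworks would be trained first, then the MLP head trained on their concatenated embeddings.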
The final layer of the model is subject to the softmax activation. Binary cross entropy is selected as the loss function with balanced class weights regularization used to proportionally re-weight the loss in order to prevent bias towards the majority class [15]:

$$L(\theta) = -\frac{1}{N}\sum_{i = 1}^{N}\sum_{j = 1}^{M} w_j\, y_{ij} \log ({p_{ij}}) \qquad (1)$$
In Eq. (1), the loss L is a function of the model parameters $\theta$. N = 8 is the batch size, M = 2 is the number of classes, and y and p are scalars corresponding to the label and prediction respectively; for example, $p_{ij}$ denotes the model prediction for class j on training example i, and $w_j$ is the balanced weight for class j. Stochastic gradient descent was used as the optimizer with a learning rate of 0.001 for the “B-frame slices CNN” and a learning rate of 0.01 for the “texture CNN” and the overall ensemble. Furthermore, all labels are one-hot encoded and smoothed according to a well-established protocol $y_i^{\mathrm{smooth}} = y_i(1 - \epsilon) + \epsilon/M$ with $\epsilon = 0.05$, preventing network overconfidence [16]. This is especially important in this application since the cortex samples may contain inhomogeneities, and mislabeled examples would impede training. Additionally, random up-down and left-right flipping was applied to the inputs to augment the data, and the dropout rate was set to 0.5 for each dense layer to discourage overfitting [17]. Global min-max normalization was applied to all training examples. All hyperparameter values (layer size, feature maps, optimizer, loss, learning rate, etc.) were selected through empirical experimentation following general good practices as briefly described in this section. All network experiments were conducted on a GeForce RTX 3060 Ti GPU and an AMD Ryzen 3700X CPU.
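As a concrete numeric check of the loss and label-smoothing scheme described above, a minimal NumPy implementation on a toy batch (the helper name and example values are illustrative only):

```python
import numpy as np

def smoothed_weighted_ce(y_onehot, p, class_weights, eps=0.05):
    """Cross entropy with per-class balanced weights w_j and label
    smoothing applied to the one-hot targets, averaged over the batch."""
    M = y_onehot.shape[1]
    y_s = y_onehot * (1 - eps) + eps / M                 # smoothed labels
    per_example = -np.sum(class_weights * y_s * np.log(p + 1e-12), axis=1)
    return float(per_example.mean())

# Toy batch (N = 2, M = 2) with equal class weights.
y = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[0.9, 0.1], [0.2, 0.8]])
loss = smoothed_weighted_ce(y, p, class_weights=np.array([1.0, 1.0]))
print(round(loss, 4))  # 0.209
```

With unequal class frequencies, `class_weights` would be set inversely proportional to each class's share of the training set, which is what prevents the majority class from dominating the gradient.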
3. Results and discussion
The ensemble model outperformed its component subnetworks on the testing set, achieving 94.90% ± 0.11% accuracy after 50 epochs of training. The model is also compact, capable of processing one B-frame every 7 milliseconds (143 fps) on a GeForce RTX 3060 Ti GPU and an AMD Ryzen 3700X CPU. The gradient class activation maps demonstrate that the model makes decisions based on relevant features and is therefore interpretable.
3.1 Network results
The training graphs in Fig. 3(A) exhibit smooth convergence and well-correlated training and validation curves. For many epochs, the validation accuracy even exceeded the training accuracy. This phenomenon is likely due to augmentation and dropout being deactivated during validation, making predictions more straightforward during the validation phase. The texture CNN stabilized more quickly and at a higher accuracy than the B-frame slices CNN, indicating that the problem is more easily solved in the texture domain. Additionally, the ensemble model reached high accuracy after just one epoch and achieved a slightly higher validation accuracy than either subnetwork. This is reasonable because the embeddings are sufficiently high-level that a good prediction can already be made from a simple linear transformation. As seen in Fig. 3(A), not all networks were trained for the full 50 epochs, as early stopping was triggered if there was no improvement in validation accuracy for 10 epochs.
A segmentation map was generated from an additional patient sample including both cancer and non-cancer regions. The sample contained 256 B-frames with 20 slice and texture pairs extracted from each frame. Each pair was input sequentially into the ensemble model, generating binary predictions that were then tiled into a 256 × 20 map. After resizing this map to 20 × 20 for better viewing, the width and height of each pixel are approximately 0.1 mm. Figures 3(B), 3(C), and 3(D) provide a side-by-side comparison of the network-generated segmentation map and the color-coded attenuation map generated using the traditional method, which are in good agreement.
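The tiling-and-resizing step above can be sketched as follows; the bin-averaging downsampler is one plausible choice (the paper does not specify the resizing method), and the random predictions stand in for real model outputs.

```python
import numpy as np

def to_seg_map(preds, out_rows=20):
    """Downsample a (n_frames x n_slices) grid of binary predictions to
    out_rows rows by averaging each bin of frames and thresholding."""
    bins = np.arange(preds.shape[0]) * out_rows // preds.shape[0]
    avg = np.array([preds[bins == r].mean(axis=0) for r in range(out_rows)])
    return (avg >= 0.5).astype(int)  # majority vote per output pixel

# 256 B-frames x 20 slice predictions -> 20 x 20 en face map, where each
# pixel covers ~0.1 mm x 0.1 mm of the 2 mm x 2 mm scan area.
preds = np.random.randint(0, 2, (256, 20))
seg_map = to_seg_map(preds)
print(seg_map.shape)  # (20, 20)
```

Averaging before thresholding also smooths isolated misclassified slices out of the final map.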
3.2 Network ablation analysis
To determine whether the performance gain is significant with respect to individual network components, we quantified model confidence by computing the standard deviation in accuracy over 50 evaluations on the testing set with dropout active. Because dropout randomly disconnects nodes, a different testing accuracy is generated over each iteration. Models with a low spread in testing accuracies and thus higher confidence indicate that the learned feature representations are more robust [17,18]. Hence, we calculate the t-statistic to compare model accuracies and F-ratio to compare model variances/confidences. At an alpha level of 0.05, the critical t-value is 1.661 and the critical F-ratio is 1.607.
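The two test statistics can be computed directly from the 50 Monte Carlo dropout accuracies; a short sketch with simulated accuracy samples (the means and spreads below are placeholders, not the paper's measurements):

```python
import numpy as np

def t_statistic(a, b):
    """Welch two-sample t-statistic comparing mean accuracies."""
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

def f_ratio(a, b):
    """F-ratio of sample variances, larger variance in the numerator,
    so values near 1 indicate similar spread (confidence)."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return max(va, vb) / min(va, vb)

# 50 dropout-active evaluations per model (simulated accuracy samples).
rng = np.random.default_rng(0)
acc_ensemble = rng.normal(0.949, 0.001, 50)
acc_texture = rng.normal(0.945, 0.002, 50)
t = t_statistic(acc_ensemble, acc_texture)   # compare against 1.661
f = f_ratio(acc_ensemble, acc_texture)       # compare against 1.607
```

Exceeding the critical t-value indicates a significant accuracy difference; exceeding the critical F-ratio indicates a significant difference in confidence.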
Table 1 reports that the CNN trained on texture is significantly more accurate than the CNN trained on B-frame slices. Due to this performance disparity, the accuracy of the ensemble is not significantly increased. However, the ensemble becomes more confident in its decision-making, as expected when linking two complementary information domains. In future work such as distinguishing between cancer grades, it is conceivable that the texture differences will be more subtle, leading to lower accuracy in the texture CNN. We simulated this scenario by saving the weights of a “sub-optimal texture” model at an earlier epoch. When the resulting “sub-optimal ensemble” is constructed in the same way, both its accuracy and confidence increase relative to its subnetwork counterparts. Hence, the utility of ensemble learning is further justified.
3.3 Network interpretability
Figure 4 displays representative grad-CAMs for both the B-frame slice and texture CNNs in each of the four possible prediction scenarios. Grad-CAM heatmaps offer a visual way to explain network behavior by highlighting the pixels that contribute most strongly to the final decision [19]. In the true positive and true negative cases of the texture CNN, the superimposed heat maps outline the relevant areas of high intensity. In other words, the network learned that the shape of these textures is an important feature. On the other hand, the B-frame slice CNN identifies the very bottom part of a slice as a key feature. This is consistent with the observation that lower optical attenuation in cancerous white matter leads to more uniform intensity throughout the slice. In non-cancerous tissues, the signal becomes significantly attenuated towards the bottom of the single-scattering region. Hence, the behavior of the network correlates with the underlying principles of OCT.
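A compact Grad-CAM implementation in PyTorch illustrates the mechanism: feature maps of a chosen convolutional layer are weighted by the mean gradient of the target-class score, summed, and rectified. The toy model below is a stand-in for the B-frame slice CNN, not the paper's network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(conv_layer, model, x, target_class):
    """Grad-CAM: weight each feature map of `conv_layer` by the mean
    gradient of the target-class score, sum over channels, then ReLU."""
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = conv_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    model(x)[0, target_class].backward()
    h1.remove(); h2.remove()
    w = grads['g'].mean(dim=(2, 3), keepdim=True)          # per-map weights
    cam = F.relu((w * acts['a']).sum(dim=1, keepdim=True)) # weighted sum
    # Upsample the heatmap to the input size for overlaying.
    return F.interpolate(cam, size=x.shape[2:], mode='bilinear',
                         align_corners=False).detach()

# Toy CNN; grad_cam is applied to its second convolutional layer.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))
x = torch.randn(1, 1, 200, 100)
heatmap = grad_cam(model[2], model, x, target_class=0)
```

Because of the final ReLU, the heatmap is non-negative, highlighting only regions that push the prediction toward the target class.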
4. Conclusion
A deep network cancer detector capable of producing accurate segmentation maps in real time has been presented in this paper. Specifically, the ensemble model comprising a texture CNN and a B-frame CNN has been designed and validated to be sensitive, specific, fast, stable, interpretable, and superior to its individual components alone. In terms of raw performance, the results are comparable to our previous attenuation-based methods, which are between 92–100% sensitive and 80–100% specific depending on the selected threshold [9]. For this reason, single-parameter attenuation thresholding methods remain attractive for their clear explainability. However, deep learning models are quickly approaching a point where they can be even more robust. For example, in this study, we demonstrated the ability to detect cancer without reference phantom normalization. Deep learning approaches for OCT-based brain cancer detection also have the potential to be invariant to surface position and focal depth, which would be highly advantageous and worthy of investigation in further studies. Additionally, differentiation between low- and high-grade cancers, as well as between cancers in white or grey matter, may be possible based on deep learning analysis of structure in the speckle noise, given that speckle may present in subtly different patterns in different tissues. Finally, deep networks may be designed to be robust to challenges in the clinical setting such as motion artifacts, image defocusing, and the presence of blood. These are all important directions of future work, which will need to utilize datasets from in vivo animal model studies or more patients.
Funding
National Institutes of Health (R01CA200399, P41EB032840).
Disclosures
The authors declare no conflicts of interest.
Data Availability
The processing codes developed for this paper are available upon request.
References
1. N. Sanai, M.-Y. Polley, M. W. McDermott, A. T. Parsa, and M. S. Berger, “An extent of resection threshold for newly diagnosed glioblastomas,” J. Neurosurg. 115(1), 3–8 (2011). [CrossRef]
2. M. J. McGirt, K. L. Chaichana, M. Gathinji, F. J. Attenello, K. Than, A. Olivi, J. D. Weingart, H. Brem, and A. R. Quiñones-Hinojosa, “Independent association of extent of resection with survival in patients with malignant brain astrocytoma,” J. Neurosurg. 110(1), 156–162 (2009). [CrossRef]
3. G. E. Keles, D. A. Lundin, K. R. Lamborn, E. F. Chang, G. Ojemann, and M. S. Berger, “Intraoperative subcortical stimulation mapping for hemispherical perirolandic gliomas located within or adjacent to the descending motor pathways: evaluation of morbidity and assessment of functional outcome in 294 patients,” J. Neurosurg. 100(3), 369–375 (2004). [CrossRef]
4. M. Alaminos, V. Dávalos, S. Ropero, F. Setién, M. F. Paz, M. Herranz, M. F. Fraga, J. Mora, N.-K. V. Cheung, W. L. Gerald, and M. Esteller, “EMP3, a myelin-related gene located in the critical 19q13.3 region, is epigenetically silenced and exhibits features of a candidate tumor suppressor in glioma and neuroblastoma,” Cancer Res. 65(7), 2565–2571 (2005). [CrossRef]
5. F. J. van der Meer, D. J. Faber, M. C. G. Aalders, A. A. Poot, I. Vermes, and T. G. van Leeuwen, “Apoptosis- and necrosis-induced changes in light attenuation measured by optical coherence tomography,” Lasers Med. Sci. 25(2), 259–267 (2010). [CrossRef]
6. J. Xi, Y. Chen, and X. Li, “Characterizing optical properties of nano contrast agents by using cross-referencing OCT imaging,” Biomed. Opt. Express 4(6), 842 (2013). [CrossRef]
7. W. Yuan, Y. Feng, D. Chen, P. Gharibani, J. D. Z. Chen, H. Yu, and X. Li, “In vivo assessment of inflammatory bowel disease in rats with ultrahigh-resolution colonoscopic OCT,” Biomed. Opt. Express 13(4), 2091 (2022). [CrossRef]
8. W. Yuan, C. Kut, W. Liang, and X. Li, “Robust and fast characterization of OCT-based optical attenuation using a novel frequency-domain algorithm for brain cancer detection,” Sci. Rep. 7(1), 44909 (2017). [CrossRef]
9. C. Kut, K. L. Chaichana, J. Xi, S. M. Raza, X. Ye, E. R. McVeigh, F. J. Rodriguez, A. Quinones-Hinojosa, and X. Li, “Detection of human brain cancer infiltration ex vivo and in vivo using quantitative optical coherence tomography,” Sci. Transl. Med. 7(292), 292ra100 (2015). [CrossRef]
10. J. Möller, A. Bartsch, M. Lenz, I. Tischoff, R. Krug, H. Welp, M. R. Hofmann, K. Schmieder, and D. Miller, “Applying machine learning to optical coherence tomography images for automated tissue classification in brain metastases,” Int. J. Comput. Assist. Radiol. Surg. 16(9), 1517–1526 (2021). [CrossRef]
11. R. M. Juarez-Chambi, C. Kut, J. J. Rico-Jimenez, K. L. Chaichana, J. Xi, D. U. Campos-Delgado, F. J. Rodriguez, A. Quinones-Hinojosa, X. Li, and J. Jo, “AI-assisted in situ detection of human glioma infiltration using a novel computational method for optical coherence tomography,” Clin. Cancer Res. 25(21), 6329–6338 (2019). [CrossRef]
12. P. Strenge, B. Lange, W. Draxinger, C. Grill, V. Danicke, D. Theisen-Kunde, C. Hagel, S. Spahr-Hess, M. M. Bonsanto, H. Handels, R. Huber, and R. Brinkmann, “Differentiation of different stages of brain tumor infiltration using optical coherence tomography: Comparison of two systems and histology,” Front. Oncol. 12, 896060 (2022). [CrossRef]
13. S. Adabi, M. Hosseinzadeh, S. Noei, S. Conforto, S. Daveluy, A. Clayton, D. Mehregan, and M. Nasiriavanaki, “Universal in vivo textural model for human skin based on optical coherence tomograms,” Sci. Rep. 7(1), 17912 (2017). [CrossRef]
14. B. S. V., “Grey level co-occurrence matrices: generalisation and some new features,” Int. J. Comput. Sci. Eng. Inf. Technol. 2(2), 151–157 (2012). [CrossRef]
15. Z. Xu, C. Dan, J. Khim, and P. Ravikumar, “Class-weighted classification: trade-offs and robust approaches,” in Proceedings of the 37th International Conference on Machine Learning (PMLR, 2020), pp. 10544–10554.
16. R. Müller, S. Kornblith, and G. Hinton, “When does label smoothing help?” arXiv, arXiv:1906.02629v3 (2019). [CrossRef]
17. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res. 15(56), 1929–1958 (2014). [CrossRef]
18. M. Sensoy, L. Kaplan, and M. Kandemir, “Evidential deep learning to quantify classification uncertainty,” in Advances in Neural Information Processing Systems 31 (NeurIPS 2018).
19. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: visual explanations from deep networks via gradient-based localization,” Int. J. Comput. Vis. 128(2), 336–359 (2020). [CrossRef]