
MEMO: dataset and methods for robust multimodal retinal image registration with large or small vessel density differences

Open Access

Abstract

The measurement of retinal blood flow (RBF) in capillaries can provide a powerful biomarker for the early diagnosis and treatment of ocular diseases. However, no single modality can determine capillary flowrates with high precision. Combining erythrocyte-mediated angiography (EMA) with optical coherence tomography angiography (OCTA) has the potential to achieve this goal, as EMA can measure the absolute RBF of retinal microvasculature and OCTA can provide the structural images of capillaries. However, multimodal retinal image registration between these two modalities remains largely unexplored. To fill this gap, we establish MEMO, the first public multimodal EMA and OCTA retinal image dataset. A unique challenge in multimodal retinal image registration between these modalities is the relatively large difference in vessel density (VD). To address this challenge, we propose a segmentation-based deep-learning framework (VDD-Reg), which provides robust results despite differences in vessel density. VDD-Reg consists of a vessel segmentation module and a registration module. To train the vessel segmentation module, we further designed a two-stage semi-supervised learning framework (LVD-Seg) combining supervised and unsupervised losses. We demonstrate that VDD-Reg outperforms existing methods quantitatively and qualitatively for cases of both small VD differences (using the CF-FA dataset) and large VD differences (using our MEMO dataset). Moreover, VDD-Reg requires as few as three annotated vessel segmentation masks to maintain its accuracy, demonstrating its feasibility.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Retinal blood flow (RBF) is a key functional biomarker, implicated in three of the four major causes of blindness worldwide (glaucoma [1], diabetic retinopathy [2], and age-related macular degeneration [3]), as well as in neurodegenerative diseases such as Alzheimer’s dementia [4,5]. Specifically, RBF in capillaries may provide sensitive biomarkers for the early diagnosis of ocular diseases and could aid in the development of novel therapies. Unfortunately, accurately measuring RBF in capillaries is challenging because it requires the precise measurement of both absolute erythrocyte velocities and capillary width, and it places high demands on sensor resolution and repeatability.

Current methods of measuring RBF are limited. For instance, laser Doppler imaging [6] is limited by the high variability of measured flowrates. Dynamic OCTA [7,8] and color Doppler imaging [9] can only measure relative flowrates, leading to poor intra-platform and cross-platform measurement repeatability. Adaptive optics scanning laser ophthalmoscopy (AO-SLO) [10,11] and AO-OCT [12,13] have a limited field of view. Erythrocyte-mediated angiography (EMA) [14], on the other hand, is a novel technique capable of determining absolute erythrocyte flowrates of arterioles and venules in vivo with high precision and a large field of view. EMA determines the flowrates by tracking the motion of individual fluorescently labeled erythrocyte ghosts in the retinal capillary circulation, which can be visualized in vivo [15–17]. Despite the aforementioned advantages, a major limitation of EMA is its inability to delineate the capillary structures through which the erythrocytes are flowing [18].

One potential solution to address this limitation of EMA is to combine it with another modality that provides high-resolution structural imaging of retinal capillaries. Optical coherence tomography angiography (OCTA) is an ideal candidate, as it can generate high-resolution images down to the capillary level in different layers of the retina [19–21]. Combining EMA and OCTA may enable absolute capillary RBF measurement for the diagnosis and treatment of ocular diseases. A key requirement for this combination is accurate registration between the two modalities. Manual approaches to registration are time-consuming, necessitating the development of an automated approach to registration of EMA and OCTA image pairs.

Multimodal retinal image registration has been extensively studied in recent years [22–29]. However, current approaches have primarily utilized the public CF-FA dataset [30] (color fundus and fluorescein angiography) or private datasets with modalities other than EMA and OCTA, such as CF and fundus autofluorescence (FAF) [22,23], and CF and infrared reflectance (IR) imaging [23,27,28]. The lack of new and publicly available multimodal retinal image datasets not only makes it difficult for researchers to fairly and thoroughly compare existing methods, but also hinders the development of novel methods for multimodal image registration.

To fill these gaps in knowledge, we conducted experiments on non-human primates (NHPs) and created a public dataset of EMA and OCTA pairs. NHPs are used extensively in ophthalmic research and provide some of the best models for glaucoma as well as other ocular diseases [31]. The homology between NHP and human eyes eases the translation of experimental findings and their applicability to human imaging and disease. Our dataset is well-controlled, is one of the few datasets that include OCTA images, and is the only dataset to include EMA sequences. We refer to this dataset as the multimodal EMA and OCTA (MEMO) retinal image dataset. MEMO contains EMA and OCTA image pairs with manually labeled matched points for studying multimodal retinal image registration. Additionally, MEMO includes OCTA projection images [32] from all three retinal vascular plexi (superficial vascular plexus (SVP), intermediate capillary plexus (ICP) and deep capillary plexus (DCP)) and EMA image sequences.

Using the MEMO dataset, we address a unique challenge in multimodal retinal image registration between EMA and OCTA images arising from the relatively large difference in vessel density between the two modalities. In this paper, the vessel density (VD) is defined as the image area occupied by vessels divided by the entire captured area. Compared to other modalities available in public datasets, such as CF-FA [30], EMA and OCTA have a VD difference of over 30% between the two modalities (Fig. 1). Through extensive experiments, we found that large VD differences dramatically decrease registration performance, as the majority of smaller vessels in OCTA cannot be visualized in EMA due to fundamental differences in image acquisition.
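As a concrete illustration of this definition, a minimal sketch of the VD computation for a binary vessel segmentation mask is given below (our own illustrative code using NumPy; the function name is ours, not part of the released dataset tools):

```python
import numpy as np

def vessel_density(mask: np.ndarray) -> float:
    """Vessel density: fraction of the captured area occupied by vessels.

    `mask` is a binary vessel segmentation mask (nonzero pixels = vessel).
    """
    return float(np.count_nonzero(mask)) / mask.size
```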


Fig. 1. Sample images of (a) CF, (b) FA, (c) EMA and (d) OCTA with vessel density (VD). (a) and (b) are taken from the CF-FA dataset [30]. In this example, the vessel density of OCTA (d) is five times greater than that of EMA (c), since most capillaries cannot be visualized in EMA images.


To overcome the challenges posed by large VD differences, we propose VDD-Reg, a segmentation-based deep-learning framework for multimodal retinal image registration that can robustly register two imaging modalities despite vessel density differences. VDD-Reg consists of a vessel segmentation module and a registration module. Here, instead of trying to extract every vessel as accurately as possible, the goal of the segmentation module is to extract vessels that are visible in both modalities so that the registration module can detect and match feature points more accurately. To achieve this goal, we designed a novel two-stage semi-supervised learning framework, LVD-Seg, which requires only a few (e.g., three) labeled vessel segmentation masks from the modality with lower vessel density (EMA in our case). Specifically, LVD-Seg first uses a supervised loss (i.e., MSE) to stabilize the training of the vessel segmentation module, and then uses an unsupervised loss (i.e., style loss [33]) with a unified style target image to guide the segmentation module to extract common vessels visible in both EMA and OCTA images, improving the registration accuracy.

The contributions of our work can be summarized as follows:

  • 1. We establish MEMO, the first public multimodal EMA and OCTA retinal image dataset. MEMO provides registration ground truth, OCTA projection images from all three retinal vascular plexi, and EMA image sequences containing moving erythrocytes. This also makes MEMO potentially useful for any research involving registration between OCTA and other modalities acquired with a scanning laser ophthalmoscope. MEMO is the first retinal image dataset containing EMA images and also the first multimodal retinal image registration dataset containing modalities with a large difference in vessel density (VD). MEMO is available at https://chiaoyiwang0424.github.io/MEMO/.
  • 2. We propose a segmentation-based deep-learning framework, VDD-Reg, for multimodal retinal image registration that is robust with respect to vessel density differences. To train the segmentation module in VDD-Reg, we further designed a two-stage semi-supervised learning framework, LVD-Seg, which requires as few as three labeled vessel segmentation masks.

The rest of the paper is organized as follows. Section 2 summarizes the existing public retinal image datasets with image pairs and multimodal retinal image registration methods. Section 3 details our MEMO dataset. In Section 4, we describe the proposed VDD-Reg framework. Section 5 describes our experimental settings. Sections 6 and 7 present the results and discussion, and Section 8 concludes the paper.

2. Related works

2.1 Retinal image datasets with image pairs

There are relatively few public retinal image datasets specifically curated for image registration. In Table 1, we summarize the public retinal image datasets with image pairs, as they could potentially be repurposed for image registration with proper ground truth annotations. The datasets listed in Table 1 can be divided into monomodal and multimodal datasets. The monomodal datasets, such as e-ophtha [34], VARIA [35], RODREP [36], FIRE [37] and FLORI21 [38], contain images from only one modality, limiting their utility for multimodal retinal image registration research. On the other hand, existing multimodal retinal image datasets provide images from various modalities, such as OCT-OCTA pairs [39], ultra-widefield fundus photography-angiography pairs [40], and CF-FA pairs [30]. Compared to the above datasets, MEMO has three major advantages. Firstly, MEMO is the first retinal dataset with EMA images and also the first multimodal retinal image registration dataset providing two modalities with relatively large VD differences. Secondly, MEMO officially provides six corresponding point pairs per image pair as the global registration ground truth, which is crucial for fair comparisons of methods. Finally, MEMO provides raw EMA sequences and OCTA projection images, which may be useful for multiple research fields such as automated erythrocyte tracking.


Table 1. Comparison of Public Retinal Image Datasets with Image Pairs

2.2 Multi-modal retinal image registration

Multimodal retinal image registration methods can be categorized into conventional and deep learning-based methods. The conventional methods can be further divided into two types: direct and indirect methods. The direct conventional methods try to detect and match features directly on the raw images by manually designing more powerful feature descriptors or more robust matching algorithms. For example, Chen et al. [41] proposed a partial intensity invariant feature descriptor (PIIFD) and designed an image registration framework called Harris-PIIFD based on the proposed descriptor. Ghassabi et al. [42] combined UR-SIFT and PIIFD for image registration with large content or scale changes. Wang et al. [43] presented an image registration framework combining SURF, PIIFD and robust point matching. Lee et al. [44] introduced a low-dimensional step pattern analysis method to align retinal image pairs that were poorly aligned with baseline methods. Hossein-Nejad et al. [45] adopted adaptive Random Sample Consensus (A-RANSAC) for feature matching. On the other hand, the indirect conventional methods attempt to first transfer the images from different modalities into a similar "style", such as the vessel mask or the phase image, before detecting and matching features. For instance, Hernandez et al. [46] proposed line structures segmentation with a tensor-voting approach to improve registration. Hervella et al. [47] combined feature-based and intensity-based registration methods and employed a domain-adapted similarity metric to detect vessel bifurcations and crossovers. Motta et al. [48] proposed a registration framework based on optimal transport theory for vessel extraction on retinal fundus images. Li et al. [49] proposed a two-step registration method which converted raw images into phase images and adopted log-Gabor filters for global registration.

Recently, many deep learning-based multimodal retinal image registration methods have been proposed, demonstrating performance comparable or superior to that of conventional methods. Similar to the conventional methods, deep learning-based methods can also be roughly divided into direct and indirect methods. The direct deep learning-based methods usually try to directly learn a feature matching network using raw image datasets. For example, De Silva et al. [23] proposed an end-to-end network following the conventional feature point-based registration steps, using a VGG-16 feature extractor [50] and a feature matching network for predicting patch displacements. Lee et al. [26] extracted pattern patches surrounding the intersection points and used a Convolutional Neural Network (CNN) to select matched patches. The indirect deep learning-based methods, on the other hand, try to learn a transformation network that first transforms the two modalities into the same domain, such as the vessel mask, instead of directly performing image registration. For instance, Arikan et al. [25] used a U-Net for vessel segmentation and a Mask R-CNN for vessel junction detection based on supervised learning before multimodal image registration. Luo et al. [24] proposed a two-stage affine registration framework. The first stage used two individual U-Nets to segment the optic discs in two modalities, and the second stage adopted ResNet for fine registration. Zhang et al. [27] proposed a vessel segmentation-based two-step registration method integrating global and deformable registration. Their vessel segmentation networks were trained with a deformable registration network using ground truth registration affine matrices. Wang et al. [28] proposed a content-adaptive multimodal retinal image registration method, which adopted pixel-adaptive convolution (PAC) [51] and style loss [33] in their vessel segmentation network. In addition to transforming images into vessel masks, Santarossa et al. [22] and Sindel et al. [29] applied CycleGAN [52] to transform the images from one modality to the other before extracting features.

Although many methods have been proposed for multimodal retinal image registration, none of them tackle the registration between EMA and OCTA. Compared to the modalities used in existing works, the vessel density (VD) difference between EMA and OCTA used in our MEMO dataset is relatively large, making image registration much more challenging.

3. MEMO dataset

3.1 Overview

A sample EMA and OCTA image pair from the MEMO dataset is shown in Fig. 2. The dataset contains 30 pairs of EMA and OCTA images. For each image pair, 6 corresponding point pairs were manually annotated. The annotated points were chosen from visually distinctive points in the EMA and OCTA images, such as vessel bifurcation points and vessel bending points. The procedure for EMA and OCTA image acquisition is shown in Fig. 3. All images were acquired following a protocol approved by the Institutional Animal Care and Use Committee of the University of Maryland, Baltimore. Four eyes from two healthy non-human primates (rhesus monkeys, i.e., Macaca mulatta), aged 14 and 20 years, were used to acquire paired EMA and OCTA images. Each pair was collected in the same session on the same date. All image pairs were captured by a Heidelberg Spectralis platform (Heidelberg Engineering, Heidelberg, Germany), which minimizes nonlinear effects between the two modalities or between different machines. Prior to the experimental session, the animal was sedated with ketamine and xylazine (5-10 and 0.2-0.4 mg/kg by intramuscular injection). The animal was intubated by trained veterinary technicians with an endotracheal tube and general anesthesia was maintained with 1.5% to 3% isoflurane with 100% oxygen. The animal was paralyzed with vecuronium (40-60 ug/kg, followed by 0.35-45 ug/kg/min), preventing eye movement during image acquisition. Body temperature was maintained at physiologic levels using a thermal blanket and blood pressure was monitored using a blood pressure cuff on the arm. The animal was laid in a prone position during the imaging session. A wire lid speculum was used to keep the eyelids open during imaging and tropicamide 1% was administered for pupillary dilation.


Fig. 2. A typical sample EMA and OCTA pair from our MEMO dataset. Images inside the orange boxes were used for ground truth labeling. (A-1, A-2 and A-3: frames 0, 10 and 20 in the sample EMA image sequence. A-4: the stacked image of the EMA sequence. C-1, C-2 and C-3: the sample OCTA projection images representing the DCP, ICP and SVP layers. C-4: the OCTA B-scan image. B and D: the six corresponding point pairs of the sample EMA and OCTA pair.)



Fig. 3. The procedure for image acquisition. The numbers shown in the figure indicate the order.


3.2 EMA

All EMA image sequences were captured by a Heidelberg Spectralis platform (Heidelberg Engineering, Heidelberg, Germany). Each image sequence represents a time sequence capturing the trajectory of a single moving erythrocyte as it travels from the artery through the capillary to the vein. Approximately 17 mL of blood was drawn for processing with 5,6-carboxyfluorescein diacetate succinimidyl ester (CFSE) (Molecular Probes, USA) reconstituted in anhydrous dimethyl sulfoxide, using a method similar to the human erythrocyte preparation described in our previous work [16]. Autologous erythrocytes were isolated from whole blood and loaded with 7.5 mM of CFSE using the osmotic shock method. Following cell preparation, up to 1.2 mL of CFSE-loaded cells were intravenously injected during image acquisition. After the cells were injected, ten-second angiograms centered on the macula were obtained with the Heidelberg Spectralis using a high-speed 15-degree horizontal x 15-degree vertical field of view taken at 15 frames per second. More details of the procedure can be found in our published protocols [16,53,54].

All image frames from the EMA image sequences were stored in TIF format. The number of frames in each sequence is different, since the speed of each erythrocyte is different. In addition, six image sequences had an image size of $512\times 512$ pixels, while the other 24 had an image size of $384\times 384$ pixels. The image sizes differ because some images were taken in high-speed mode, while others were taken in high-resolution mode, which has a higher pixel density. The pixel size of every EMA image sequence is provided. The stacked image of each EMA image sequence was used for registration ground truth labeling.

3.3 OCTA

OCTA scans centered on the fovea were taken using the same Heidelberg Spectralis with a $10\times 10$ degree protocol, consisting of 512 A-scans $\times$ 512 B-scans with 5-10 microns between B-scans and 5-7 frames averaged per B-scan location. Projection images of the superficial vascular plexus (SVP), intermediate capillary plexus (ICP), and deep capillary plexus (DCP) were generated using the segmentation algorithms and slab definitions provided by the Spectralis software (Heidelberg Eye Explorer, version 1.10.3.0, Heidelberg Engineering, Germany). The SVP slab was defined as extending from the internal limiting membrane to the anterior border of the inner plexiform layer, the ICP included the entire inner plexiform layer, and the DCP ranged from the posterior border of the inner plexiform layer to the anterior border of the outer plexiform layer. The projection images were processed using projection artifact removal (PAR).

All images from the OCTA image groups were stored in TIF format. Each of the OCTA image groups contained three images from the three layers (i.e., SVP, ICP and DCP). Fifteen OCTA image groups had an image size of $512\times 512$ pixels, while the other 15 had an image size of $768\times 768$ pixels. The SVP image from each OCTA image group was used for registration ground truth labeling.

3.4 Dataset analysis

Despite its modest size, our MEMO dataset has sufficient diversity. To showcase this, additional image pairs corresponding to each eye of each non-human primate (NHP) are shown in Fig. 4. The images exhibit noticeable differences from each other, even when captured from the same NHP or eye. These differences arise from varying physical conditions of the NHPs, different capture times, imaging variations, etc. In addition, we present further statistics of our MEMO dataset in Fig. 5, particularly focusing on the image registration aspect. We observe that our MEMO dataset covers a wide range of image transformations.


Fig. 4. Image samples corresponding to each eye of each NHP. The EMA image is placed on top of the OCTA image for each image pair.



Fig. 5. Statistics of our MEMO dataset. The number of image pairs (count) falling within different ranges of (a) translation in the x-axis (pixel), (b) translation in the y-axis (pixel), or (c) rotation (degree) is presented. The division of training and test data for the MEMO dataset is outlined in Sec. 5.1.1.


4. Proposed method

An overview of the proposed framework for multimodal retinal image registration, VDD-Reg, is shown in Fig. 6. VDD-Reg consists of a vessel segmentation module and a registration module. The multimodal images are first transformed into binary vessel masks by the vessel segmentation module, and the global registration matrix is then estimated by the registration module from the two binary vessel masks.


Fig. 6. The proposed VDD-Reg framework. VDD-Reg includes a vessel segmentation module and a registration module. The vessel segmentation module is trained with the proposed two-stage semi-supervised learning framework (LVD-Seg). DRIU [55] and SuperPoint [56] are adopted for our segmentation networks and registration network, respectively. $M_{reg}^{global}$ denotes the partial affine transformation matrix for global image registration.


4.1 Vessel segmentation module

4.1.1 LVD-Seg background

As discussed in Section 2.2, vessel segmentation has frequently been used as the first step for multimodal retinal image registration [25,27,28,46–48], primarily because features of vessels are considered to be more consistent across different modalities. Recently, deep learning-based vessel segmentation methods have shown superior performance. They can be categorized into two groups, supervised [25,55] and unsupervised methods [27,28], which present different limitations. The supervised vessel segmentation methods [25,55] usually require a large number of high-quality pixel-level vessel masks for training to ensure test performance, but such high-quality pixel-level vessel masks are often difficult and time-consuming to acquire. To avoid the need for pixel-level vessel masks, unsupervised vessel segmentation methods based on style transfer have been proposed [27,28]. However, due to the lack of direct supervision, the unsupervised vessel segmentation methods generally perform worse than the supervised ones in terms of segmentation quality.

Unlike general vessel segmentation, which aims to accurately extract every vessel, the goal of the vessel segmentation module in our VDD-Reg is to extract vessels that are visible in both modalities. This is particularly crucial for multimodal retinal image registration when a majority of vessels in one modality (e.g., OCTA) are not visible in the other modality (e.g., EMA) due to the fundamental differences between the two modalities. To this end, we designed a novel two-stage semi-supervised learning framework, LVD-Seg, to train our vessel segmentation module. LVD-Seg was designed based on two key insights. First, LVD-Seg combined supervised and unsupervised training so that the resulting segmentation masks could be effectively used by the following registration module despite having very few pixel-level vessel masks. Second, only the pixel-level vessel masks from the modality with lower vessel density (e.g., EMA) were used as the supervisory signal for both supervised and unsupervised training, guiding the segmentation module to extract vessels that are visible in both modalities. Details of the two stages in LVD-Seg are described as follows.

4.1.2 LVD-Seg stage 1: supervised loss

In this stage, we trained our vessel segmentation module on $n$ manually-annotated EMA vessel segmentation masks, where $n$ could be as few as three according to our experimental results (Sec. 7.2). We used EMA vessel segmentation masks because the vessel density of EMA is much lower than that of OCTA and vessels that can be captured by EMA are visible in both modalities. In other words, OCTA images contain a plethora of small capillaries which are not present in the corresponding EMA images and are not helpful for image registration. Moreover, labeling the less complex EMA vessel segmentation masks is far more efficient than labeling OCTA vessel segmentation masks.

Following [27,28], we adopted the DRIU [55] network for segmenting EMA images. The DRIU network used a pre-trained VGG-16 network [50] for feature extraction, followed by a segmentation prediction layer. The mean squared error (MSE) loss, denoted as $L_{v}$, was adopted to train the network and is defined as

$$L_{v} = \frac{1}{N} \sum_{i=1}^{N}(Pred(I)(i) - M(i))^2,$$
where $I$ represents the input EMA image and $M$ represents the ground truth EMA mask. $Pred(I)$ represents the predicted segmentation mask of $I$. $i$ represents the $i^{th}$ pixel of the predicted segmentation mask or the ground truth mask. $N$ denotes the total number of pixels. In addition to MSE, the self-comparison loss [27,28], denoted as $L_{sc}$, was also adopted to make the prediction robust against input image rotation. Specifically, $L_{sc}$ is defined as
$$L_{sc} = MSE(Rot_{{-}90}(Pred(Rot_{90}(I))), \ Pred(I)),$$
where $Rot_{\theta }(I)$ represents $I$ rotated by $\theta$ degree. Here, $L_{sc}$ can be seen as an alternative way to perform data augmentation. Overall, the training loss for stage 1, denoted as $L_{s1}$, can be written as
$$L_{s1}= w_{v} * L_{v} + w_{sc} * L_{sc},$$
where $w_{v}$ and $w_{sc}$ represent the weighting factors for the MSE and the self-comparison loss. In this paper, $w_{v}$ and $w_{sc}$ were set to 1 and 1e-3, respectively. Here, $w_{sc}$ was set to a relatively smaller value compared to $w_{v}$ to ensure that $L_{v}$, the primary supervisory signal in this stage, could effectively guide the learning process. The trained weights of the EMA vessel segmentation network in stage 1 were used to initialize both the EMA and OCTA vessel segmentation networks in stage 2.
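For readers implementing stage 1, the sketch below illustrates Eqs. (1)-(3) in PyTorch. It is a simplified rendition under our stated settings, not the released training code; `seg_net` stands in for the DRIU segmentation network.

```python
import torch
import torch.nn.functional as F

def stage1_loss(seg_net, image, gt_mask, w_v=1.0, w_sc=1e-3):
    """Stage-1 training loss L_s1 = w_v * L_v + w_sc * L_sc (Eqs. (1)-(3))."""
    pred = seg_net(image)                       # predicted vessel probability map
    l_v = F.mse_loss(pred, gt_mask)             # Eq. (1): MSE to the labeled EMA mask

    # Eq. (2): self-comparison loss, encouraging the prediction to be
    # equivariant to a 90-degree rotation of the input image.
    pred_rot = seg_net(torch.rot90(image, k=1, dims=(-2, -1)))
    l_sc = F.mse_loss(torch.rot90(pred_rot, k=-1, dims=(-2, -1)), pred)

    return w_v * l_v + w_sc * l_sc              # Eq. (3)
```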

4.1.3 LVD-Seg stage 2: unsupervised loss

To ensure the extraction of common vessels visible in both modalities, we further optimized the segmentation networks using style loss [27,28,33] with a joint style target mask, a stand-alone EMA ground truth segmentation mask. Using a joint style target mask with style loss guided the EMA and OCTA segmentation networks to segment shared vessels, improving the registration accuracy. Moreover, using the same trained weights to initialize both the EMA and OCTA vessel segmentation networks further stabilized the unsupervised training in this stage.

Style loss penalized the difference between the predicted segmentation mask and the style target mask using Gram matrices [33,57]. The Gram matrix was used to capture the style information but remove the spatial information, whose elements can be written as

$$G_{j}(I)_{c,c^{'}} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \phi_{j}(I)_{h, w, c} \phi_{j}(I)_{h, w, c^{'}}.$$

Here, $\phi _{j}(I)$ denotes the feature map with shape $C_j \times H_j \times W_j$ obtained from the $j^{th}$ layer of the pre-trained VGG-16 network [50] by feeding the network with input image $I$, and $c, c^{'} \in [1, \ C_j]$. The Gram matrix, a $C_j \times C_j$ matrix, is utilized to capture correlations among features, thus representing the ‘style’ of the input image $I$. Style loss ($L_{st}$) is then defined as the squared Frobenius norm of the difference between the Gram matrices of the predicted segmentation mask ($Pred(I)$) and the style target mask ($M_{t}$), which can be written as

$$L_{st_{j}}(Pred(I), \ M_{t}) ={\parallel} G_{j}(Pred(I)) - G_{j}(M_{t}) \parallel_{F}^{2},$$
$$L_{st} = \sum_{j \in J} L_{st_{j}}(Pred(I), \ M_{t}).$$

$I$ represents the input EMA or OCTA image. Style loss was computed at four different layers of the VGG-16 network. Overall, the training loss for stage 2, denoted as $L_{s2}$, can be written as

$$L_{s2}= w_{st}^{e} L_{st}^{e} + w_{st}^{o} L_{st}^{o} + w_{sc} (L_{sc}^{e} + L_{sc}^{o}).$$

Note that the two different modalities used different weighting factors for style loss. The self-comparison loss was also adopted as an additional constraint for the predicted segmentation masks. In this paper, $w_{st}^{e}$, $w_{st}^{o}$ and $w_{sc}$ were set to 100, 1 and 1e-3, respectively. Similar to Eq. (3), $w_{sc}$ was set to a relatively smaller value to ensure that the primary supervisory signals in this stage, $L_{st}^{e}$ and $L_{st}^{o}$, effectively guided the learning process. Additionally, we observed that $L_{st}^{o}$, representing the style loss for OCTA, tended to be larger than $L_{st}^{e}$ due to the use of the EMA vessel segmentation ground truth mask as the style target. To balance the impacts of $L_{st}^{o}$ and $L_{st}^{e}$, we assigned $w_{st}^{e}$ to a relatively larger value compared to $w_{st}^{o}$.
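A minimal PyTorch sketch of the Gram matrix and style loss in Eqs. (4)-(6) is shown below. It is our own illustration; `vgg_features` is assumed to return the feature maps of the four selected VGG-16 layers.

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a VGG feature map (Eq. (4)).

    `feat` has shape (B, C, H, W); returns (B, C, C), normalized by C*H*W.
    """
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(vgg_features, pred_mask, style_target):
    """Sum over layers of the squared Frobenius norm of the Gram-matrix
    difference between the predicted mask and the style target (Eqs. (5)-(6))."""
    loss = 0.0
    for f_pred, f_tgt in zip(vgg_features(pred_mask), vgg_features(style_target)):
        loss = loss + torch.sum((gram_matrix(f_pred) - gram_matrix(f_tgt)) ** 2)
    return loss
```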

The outputs of the segmentation module were EMA and OCTA pixel-wise probability maps, which represented the probability of each pixel belonging to a vessel. The probability maps were transformed into binary segmentation masks using a threshold of 0.5.

4.2 Registration module

4.2.1 Feature detection and description

We adopted pre-trained SuperPoint [56] as our feature detector and descriptor. It has demonstrated superior performance to many traditional feature detectors and descriptors, and has been widely adopted in many applications that require feature matching, including multimodal image registration [27,58,59], localization [60] and two-view geometry estimation [61]. The network structure of SuperPoint is shown in Fig. 7. It contains a shared encoder, an interest point decoder and a descriptor decoder. It was first trained on a synthetic dataset with labeled interest points to detect feature points. Next, Homographic Adaptation [56] was used to self-label a large unlabeled real-world image dataset. Finally, the model was jointly trained to extract feature points and their corresponding descriptors with self-supervision. We refer the readers to [56] for more details. In this paper, the non-maximum suppression distance was set to 4 and the detector confidence threshold was set to 0.015 for keypoint detection.


Fig. 7. The network structure of SuperPoint [56]. The green circles indicate the feature points detected by SuperPoint.


4.2.2 Feature matching and registration

We determined the matched feature points based on a bidirectional calculation of the minimum Euclidean distance. That is, a feature point X in an OCTA image is said to match a feature point Y in the corresponding EMA image only when the Euclidean distance between their feature descriptors is smaller than both (1) the distance between X and any other feature point in the EMA image and (2) the distance between Y and any other feature point in the OCTA image, i.e., X and Y are mutual nearest neighbors in descriptor space. The Random Sample Consensus (RANSAC) [62] method was applied to remove outliers and estimate the partial affine transformation matrix between the EMA and OCTA pair. Here, the partial affine transformation (i.e., 4 degrees of freedom) was adopted because the EMA and OCTA images in MEMO were captured using the same device and already have the same pixel density (i.e., scale factor).
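The sketch below illustrates this matching and estimation step (our own code, not the released implementation), combining mutual nearest-neighbor descriptor matching with OpenCV's RANSAC-based partial affine estimation using the settings reported in Sec. 5.3:

```python
import numpy as np
import cv2

def mutual_nearest_matches(desc_octa, desc_ema):
    """Bidirectional (mutual) nearest-neighbor matching on descriptor
    Euclidean distances; returns index pairs (i_octa, j_ema)."""
    d = np.linalg.norm(desc_octa[:, None, :] - desc_ema[None, :, :], axis=2)
    nn_12 = d.argmin(axis=1)            # best EMA match for each OCTA keypoint
    nn_21 = d.argmin(axis=0)            # best OCTA match for each EMA keypoint
    return [(i, j) for i, j in enumerate(nn_12) if nn_21[j] == i]

def estimate_partial_affine(pts_octa, pts_ema, matches):
    """RANSAC-based partial affine (4-DoF) estimation between matched keypoints."""
    src = np.float32([pts_octa[i] for i, _ in matches])
    dst = np.float32([pts_ema[j] for _, j in matches])
    M, inliers = cv2.estimateAffinePartial2D(
        src, dst, method=cv2.RANSAC,
        ransacReprojThreshold=5, maxIters=2000)
    return M, inliers
```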

5. Experimental settings

5.1 Dataset

We used our MEMO dataset and the CF-FA [30] dataset to conduct the experiments. The two datasets were chosen to examine how the proposed and the competing methods performed for both scenarios of small VD differences (using the CF-FA dataset) and large VD differences (using our MEMO dataset).

5.1.1 MEMO dataset

The MEMO dataset contains 30 pairs of images. Fifteen pairs (with even indices) were selected as the training set, and the rest of the pairs (with odd indices) were used as the test set. For OCTA, the SVP layer projection images were used in our experiments because they contained clearer arterioles and venules, and were free of projection artifact. For EMA, the stacked image of each EMA image sequence was used for the purpose of denoising. Furthermore, we annotated the vessel segmentation mask of each EMA stacked image for VDD-Reg. We also annotated one EMA stacked image that is not part of the MEMO dataset as the style target. Note that even though we annotated the vessel masks for all EMA images, our VDD-Reg actually required only three of those to maintain its performance.

For image pre-processing, the OCTA images were first resized to $256\times 256$ pixels. Then, the EMA images were resized using the same scaling factors as the corresponding OCTA images. Next, to meet the input requirements of our model, the resized EMA images were cropped so that their widths and heights were multiples of 8. Finally, the annotated EMA vessel segmentation masks and the registration ground truth were pre-processed accordingly to ensure their correct scale. To ensure the quality and consistency of annotation, all annotations were drawn by the same human annotator and checked by an experienced ophthalmologist (OJS).
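A minimal sketch of this pre-processing, assuming OpenCV and our own function name, is given below:

```python
import cv2

def preprocess_pair(ema, octa, target_size=256):
    """Resize the OCTA image to 256x256, resize the EMA image with the same
    scale factors, and crop the EMA image so its sides are multiples of 8."""
    sy = target_size / octa.shape[0]
    sx = target_size / octa.shape[1]
    octa_resized = cv2.resize(octa, (target_size, target_size))
    ema_resized = cv2.resize(
        ema, (round(ema.shape[1] * sx), round(ema.shape[0] * sy)))
    h, w = ema_resized.shape[:2]
    ema_cropped = ema_resized[: h - h % 8, : w - w % 8]
    return ema_cropped, octa_resized
```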

5.1.2 CF-FA dataset

The CF-FA dataset contains 59 pairs of color fundus (720 $\times$ 576, RGB) and fluorescein angiography (720 $\times$ 576, grayscale) images. Twenty-nine pairs of images are from healthy subjects, while the other 30 pairs are from patients with retinopathy. We manually labeled 6 pairs of corresponding points for all image pairs as the registration ground truth. We selected 29 image pairs (with odd indices) as the training set, 29 image pairs (with even indices) as the test set, and 1 image pair (normal/1-1) as the style target. Similar to the MEMO dataset, we manually annotated the vessel segmentation masks of the style target and three selected color fundus images from the training set for the proposed VDD-Reg. For image pre-processing, the color fundus (CF) images were converted to grayscale, with no additional pre-processing steps applied.

5.2 Comparison to existing methods

We compared VDD-Reg with the five existing methods listed in Table 2. These methods were selected to represent each type of method discussed in Sec. 2.2. SURF-PIIFD-RPM [43] is a well-known traditional method for multimodal retinal image registration that has demonstrated good performance. For the deep learning-based methods, two direct and two indirect methods were chosen for comparison. SuperGlue (SG) [64] and LoFTR [63] are two more recently proposed direct methods for feature detection, description and matching. For indirect methods, we selected two methods [28,29] that utilize different transfer approaches, namely CycleGAN [52] and vessel segmentation.


Table 2. Summary of Five Existing Methods

For fair comparison, we made all methods except for SURF-PIIFD-RPM estimate the partial affine transformation matrix and adopt RANSAC with the same hyperparameters. For SURF-PIIFD-RPM, as we directly adopted the official code, the affine transformation matrix was applied and RANSAC was not used. Moreover, for methods that used SuperPoint [56] for keypoint detection and description, including SG, CycleGAN-based, Content-Adaptive and VDD-Reg, the official pretrained network was adopted without fine-tuning. More details about these five existing methods are listed as follows:

  • SURF-PIIFD-RPM [43]: This method utilized SURF and PIIFD for more robust feature extraction and RPM for outlier rejection. The official MATLAB code was used.
  • LoFTR [63]: This method exploited Transformer [65] for processing and matching the dense local features extracted from the backbone. The official pretrained model was adopted with the default setting and was applied directly to the raw images without fine-tuning.
  • SuperGlue (SG) [64]: This method used a graph neural network (GNN) for finding correspondences and rejecting non-matchable points between two sets of local features. SuperPoint was used for feature detection and description. The official pretrained networks were adopted, where the SuperPoint detection threshold was set as 0.015 and the SuperGlue match threshold was set as 0.1.
  • CycleGAN-based [29]: This method combined a keypoint detection and description network designed for retinal images (i.e., RetinaCraquelureNet [66]) with SuperGlue. The networks were trained using self-supervised learning on synthetic multimodal images generated by CycleGAN [52]. As the code was unavailable, we implemented a simplified alternative of this approach by using CycleGAN to transfer images from one modality to another and adopting SuperPoint for feature detection and description.
  • Content-Adaptive [28]: This method designed a content-adaptive vessel segmentation network based on pixel-adaptive convolution (PAC) [51] guided by the phase images. The network was trained with style loss and the self-comparison loss. The image registration loss based on the ground truth transformation matrix was also used. We implemented this method based on the code from [67]. Unlike the original paper [28], we ignored the outlier rejection network and did not fine-tune SuperPoint, because these are general techniques applicable to all the other competing methods and were not within the scope of this paper.

5.3 Training and testing details

All networks in our method were implemented in PyTorch. The network architectures were based on the implementations provided in [27] and [56], which were adopted with default settings for the DRIU and SuperPoint models, respectively. For the vessel segmentation module, both stages took 1000 epochs for training. The trained networks in stage 1 were used to initialize the networks in stage 2. The Adam optimizer [68] with a learning rate of 1e-4 was used. A batch size of 1 was used due to GPU memory limitations. OpenCV was used for data pre-processing and the RANSAC algorithm; the cv2.estimateAffinePartial2D function, which estimates a partial affine transformation, was adopted with the maximum reprojection error set to 5 pixels and the maximum number of iterations set to 2000.

5.4 Evaluation metrics

5.4.1 RMSE/MAE

Based on the predicted registration matrices, the labeled points in all test OCTA images were reprojected to the corresponding test EMA images. Then, the root-mean-square error (RMSE) [42] between the reprojected points and the labeled points was computed. In addition, the maximum absolute error (MAE) [27], defined as the maximum reprojection error within each image pair and averaged over all test pairs, was also reported.
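For reproducibility, a minimal sketch of this computation is shown below (our own code; it assumes the estimated 2x3 partial affine matrix and the six labeled corresponding points per image pair as inputs):

```python
import numpy as np
import cv2

def reprojection_errors(M, pts_octa, pts_ema):
    """Reproject the labeled OCTA points with the estimated 2x3 partial affine
    matrix M and compare them against the labeled EMA points.

    Returns (RMSE, MAE) for one image pair, where MAE is the maximum
    reprojection error of that pair.
    """
    src = np.float32(pts_octa).reshape(-1, 1, 2)
    proj = cv2.transform(src, M).reshape(-1, 2)
    err = np.linalg.norm(proj - np.float32(pts_ema), axis=1)
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(err.max())
    return rmse, mae
```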

5.4.2 Success rate

The success rate was defined as the number of image pairs with successful registration over the total number of test image pairs. Since the definition of success varies, we defined a registration as successful when its RMSE < 10 pixels or its MAE < 10 pixels, based on the clinical tolerance.

5.4.3 Soft dice/masked soft dice

Dice [49] is widely used to evaluate the registration quality by calculating the pixel alignment between the warped source vessel masks and the target vessel masks. Soft Dice [27] has been proposed as an extension of Dice for assessing registration quality. For Soft Dice, CLAHE [69] was first applied to enhance the contrast of two input images and the Frangi vesselness filter [70] was then used to generate the vesselness probability masks of the two input images. We calculated Soft Dice by

$$Soft\ Dice = \frac{2 \sum_{i=1}^{N} min(F_{i}^{s'}, \ F_{i}^{t})}{\sum_{i=1}^{N} F_{i}^{s'} + \sum_{i=1}^{N} F_{i}^{t}},$$
where $F^{s'}$ and $F^{t}$ are the vesselness probability masks of the warped source images and the target images, and $N$ denotes the total number of pixels. In our experiments, FA and EMA images were viewed as the source images.

Additionally, we found that Soft Dice could not accurately represent performance when a relatively large VD difference was present between the two modalities, such as in our MEMO dataset. Moreover, it was particularly unreliable when the differences between the results of the competing methods were relatively small. As most vessels in OCTA images do not exist in the corresponding EMA images, calculating Soft Dice based on every pixel is not ideal. Hence, when evaluating on the MEMO dataset, we extended Soft Dice to Masked Soft Dice, which considers only the pixels within the ground truth segmentation masks of EMA images. Masked Soft Dice is defined as

$$Masked\ Soft\ Dice = \frac{2 \sum_{i=1}^{N} min(M_{i}^{e'} F_{i}^{e'}, \ M_{i}^{e'} F_{i}^{o})}{\sum_{i=1}^{N} M_{i}^{e'} F_{i}^{e'} + \sum_{i=1}^{N} M_{i}^{e'} F_{i}^{o}},$$
where $F^{e'}$ and $F^{o}$ are the vesselness probability masks of the warped EMA images and the OCTA images, and $M^{e'}$ represents the warped ground truth segmentation masks of EMA images. In Fig. 8, we demonstrated that Masked Soft Dice has better ability to assess the registration performance on our MEMO dataset, as it is less sensitive to the VD difference and noise.
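A minimal sketch of the Soft Dice and Masked Soft Dice computations is given below (our own illustration; the CLAHE and Frangi parameters and the bright-vessel assumption are ours, not necessarily the exact values used in our experiments):

```python
import numpy as np
import cv2
from skimage.filters import frangi

def vesselness(img_gray_uint8):
    """CLAHE contrast enhancement followed by the Frangi vesselness filter,
    giving a per-pixel vesselness map normalized to [0, 1]."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(img_gray_uint8)
    v = frangi(enhanced.astype(np.float64), black_ridges=False)  # vessels assumed bright
    return v / (v.max() + 1e-8)

def soft_dice(f_src_warped, f_tgt, mask=None):
    """Eq. (8); passing the warped EMA ground-truth mask as `mask`
    yields the Masked Soft Dice of Eq. (9)."""
    if mask is not None:
        f_src_warped = f_src_warped * mask
        f_tgt = f_tgt * mask
    overlap = np.minimum(f_src_warped, f_tgt).sum()
    return 2.0 * overlap / (f_src_warped.sum() + f_tgt.sum() + 1e-8)
```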


Fig. 8. The average (a) Soft Dice and (b) Masked Soft Dice values over image pairs in MEMO by adding different x and y shifts to the ground truth registration. The top-left value in each figure represents the average Soft Dice or Masked Soft Dice value obtained by ground truth registration. All values are color-coded.


6. Results

6.1 CF-FA dataset

Table 3 presents the quantitative results of our method and the existing methods on the CF-FA dataset. The Masked Soft Dice metric was not used as the VD difference of the CF-FA dataset is relatively small. From Table 3, we can observe that our VDD-Reg achieved a 100% success rate and the lowest RMSE among all the methods on the CF-FA dataset. Surprisingly, SURF-PIIFD-RPM, the only conventional multimodal registration method in Table 3, achieved decent performance (82.76%) compared to the other methods. This might indicate that a well-designed conventional method is still competitive if the target dataset is not too difficult. LoFTR and SG are the two direct deep learning-based registration methods in Table 3. LoFTR, despite performing well on a general homography estimation dataset [63,71], achieved the worst performance (41.28%) on the CF-FA dataset according to Table 3. Fine-tuning LoFTR on the CF-FA dataset might be helpful. However, as LoFTR was originally trained on the ground-truth labels obtained from a large-scale synthetic indoor scene dataset [72], it is unclear how to effectively fine-tune LoFTR on a multimodal retinal image registration dataset such as the CF-FA dataset. Compared to LoFTR, SG demonstrated better generalization to the CF-FA dataset (82.76%), even though it was trained on the same synthetic dataset [72] as LoFTR. This was possibly because SuperPoint (SP), the feature detection and description network used by SG, had good generalization capability. Compared to the direct methods, the indirect deep learning-based methods in Table 3 generally achieved better registration performance on the CF-FA dataset. CycleGAN-based (FA$\rightarrow$CF), which first transformed FA images to CF images with CycleGAN before registration, also achieved a 100% success rate, matching our VDD-Reg. Its counterpart, CycleGAN-based (CF$\rightarrow$FA), had a slightly lower success rate (86.21%), mainly because the transformation from FA to CF worked better than the opposite direction using CycleGAN. Content-Adaptive, a segmentation-based method similar to our VDD-Reg, performed slightly worse (89.67%) than our method. Thanks to our two-stage semi-supervised learning framework (LVD-Seg), our segmentation module could produce vessel segmentation masks more suitable for image registration.


Table 3. Results of different methods on the CF-FA test set (Best results are marked in bold)

6.2 MEMO dataset

Table 4 shows the quantitative registration results of our method and the existing methods on our MEMO dataset. We also demonstrate the qualitative registration results of our method and the existing methods on a selected image pair from our MEMO dataset in Fig. 9. Due to the relatively large VD difference, all the methods performed worse compared to the results on the CF-FA dataset. Still, our VDD-Reg outperformed the five existing methods by a large margin (86.67% success rate). From Table 4, we observed that SURF-PIIFD-RPM performed poorly (6.67%) on our MEMO dataset, which suggested that the hand-crafted features might be insufficiently powerful for this scenario. The two deep learning-based direct methods, LoFTR and SG, also produced unsatisfactory results (13.33% and 0%) due to the large distribution gap between their training dataset [72] and our MEMO dataset. CycleGAN-based with pretrained SuperPoint (SP) achieved relatively better performance compared to the other competing methods, demonstrating its potential for solving difficult multimodal registration problems. Content-Adaptive, which is also a segmentation-based method, performed much worse (33.33%) than our method on the MEMO dataset. We attributed this gap to our use of annotated vessel segmentation masks from a single modality (EMA in our case). Different from our two-stage semi-supervised learning framework (LVD-Seg), Content-Adaptive trained the segmentation networks naively with style loss, the self-comparison loss and the image registration loss. To improve the segmentation quality, Content-Adaptive additionally guided the segmentation networks with mean phase images of the input images using pixel-adaptive convolution (PAC) [51]. However, due to the high complexity of OCTA images, the OCTA mean phase images were usually too noisy to correctly guide the segmentation networks. On the other hand, our LVD-Seg framework used very few (e.g., three) annotated vessel segmentation masks from one modality to guide the segmentation networks to segment similar vessels in EMA and OCTA image pairs in both stages. In addition, the two-stage design also enhanced training stability when style loss was involved. These properties are particularly important for multimodal retinal image registration when a large VD difference exists between the two modalities.


Fig. 9. Registration results of our method and the existing methods on a selected image pair from our MEMO dataset. The top row shows the grid images where the EMA and OCTA images are interlaced as small grids. The bottom row shows the overlay images of the EMA (green) and OCTA (orange) extracted vessels registered by each method. The vessel segmentation masks were generated by our method. The RMSE, MAE and Masked Soft Dice of each method are listed below each overlay image.



Table 4. Results of different methods on the MEMO test set (Best results are marked in bold)

7. Discussion

7.1 Ablation study on the two-stage learning framework

In this section, we investigated the benefits of each stage in the proposed two-stage semi-supervised learning framework (LVD-Seg) by removing one of the stages from the framework. The results are shown in Table 5. For Stage 1 only, we trained the EMA vessel segmentation network following the procedure described in Section 4.1.2 and used the same EMA vessel segmentation network for segmenting OCTA images. For Stage 2 only, we trained the segmentation networks with style loss only. In general, the performance of both variants decreased significantly. Stage 1 only performed the worst due to the limited number (three in our case) of annotated vessel segmentation masks used for supervised training, which made the segmentation network generalize poorly to the test images and resulted in poor registration performance. Furthermore, Stage 1 only directly applied the trained EMA segmentation network to OCTA images, which did not work well due to the relatively large VD difference between the two modalities. Although Stage 2 only worked better than Stage 1 only, it still significantly lagged behind our LVD-Seg. This implies that relying purely on style loss can make the training unstable and result in unreliable registration. All these results emphasize the effectiveness of the proposed two-stage semi-supervised learning framework (LVD-Seg).


Table 5. Results of removing either stage of LVD-Seg when training the segmentation module on the MEMO Dataset (Best results are marked in bold)

7.2 Ablation study on number of required vessel masks

The major advantage of our method lies in the small number of manually annotated vessel segmentation masks required during stage 1 of the LVD-Seg framework. In this section, we further investigated the performance of our method on the MEMO dataset when using different numbers of labeled EMA vessel segmentation masks for supervised training. Specifically, during stage 1 of LVD-Seg, we trained our vessel segmentation module using 3, 5, 10 or 15 randomly sampled annotated EMA vessel segmentation masks. In stage 2, we used all 15 training pairs as our default setting. The results are shown in Table 6. We found that using more annotated EMA vessel segmentation masks during the supervised training (i.e., stage 1 of LVD-Seg) did not affect the performance significantly. For instance, there was only a 6.66% difference between the highest and the lowest success rates. In other words, the proposed method required very few (e.g., three) annotated vessel segmentation masks to maintain its accuracy, demonstrating its feasibility.


Table 6. Results of Using Different Number of Annotated Vessel Segmentation Masks in Stage 1 of LVD-Seg (Best results are marked in bold)

7.3 Ablation study on data used for supervised training

As mentioned in the previous section, the primary cost of our method lies in the requirement for a few manually annotated vessel segmentation masks. In this section, we further investigated whether existing retinal image segmentation datasets could potentially be used to train our segmentation module during stage 1 of LVD-Seg. We selected two datasets, HRF [73] and DRIVE [74], to conduct the experiments. Specifically, HRF and DRIVE are two retinal color fundus (CF) image datasets providing ground truth vessel segmentation masks. We randomly chose three images from each dataset to train our segmentation network during stage 1 of LVD-Seg. Other than that, the default settings were adopted. The results are shown in Table 7. Compared to the performance obtained using our MEMO dataset, the performance obtained using the HRF and DRIVE datasets in stage 1 decreased in both cases. Additionally, using the HRF dataset achieved better performance than using the DRIVE dataset. One possible reason is that the HRF dataset has a VD more similar to that of our MEMO dataset. The average VDs of MEMO (EMA), HRF and DRIVE are 4.71%, 10.05% and 11.21%, respectively. This implies that selecting ground truth vessel segmentation masks whose VD is closer to that of the target images (EMA images in our case) might be very important for achieving better results with the proposed framework.


Table 7. Results of using different vessel segmentation datasets during stage 1 of LVD-Seg (Best results are marked in bold)

7.4 Generalization ability of the proposed method across different NHPs

So far, all experiments conducted on the MEMO dataset have utilized a train-test split that included both NHPs in each set. In this section, we further explored whether the proposed method trained on images of one NHP could generalize to another. To achieve this, we divided the MEMO dataset by NHP. Specifically, eight image pairs captured from one NHP were used as the training set, while the remaining 22 image pairs captured from the other NHP were used as the test set. Other than that, the default settings were adopted for training. The results are shown in Table 8. Given that CycleGAN-based achieved the second-best performance in Table 4, we compared our method with it in this study. Interestingly, we found that training and testing on different NHPs did not affect the performance of our method, which achieved a success rate of 90.9%. Additionally, our method outperformed CycleGAN-based by 9% in success rate, indicating its potential to register image pairs from different subjects.


Table 8. Results of splitting the dataset by NHPs (Best results are marked in bold)

7.5 Potential of the proposed method

The proposed VDD-Reg requires very little labeling. It could potentially be applied to other vessel imaging modalities, especially modality pairs with large differences in vessel structure. This has wider applications for any comparison of SLO images with OCTA. For instance, the registration of FAF to OCTA images may benefit from this approach [75]. Furthermore, multimodal adaptive optics devices which use both AO-SLO and AO-OCT methods could also benefit from this approach [76,77].

8. Conclusion

In this paper, we present MEMO, the first public multimodal EMA and OCTA retinal image dataset. MEMO provides registration ground truth, EMA image sequences and OCTA projection images, which are valuable for various research fields. With MEMO, we first uncover a unique challenge of multimodal retinal image registration between modalities with large VD differences. We then propose a segmentation-based deep-learning registration framework, VDD-Reg, to deal with the large vessel density difference between EMA and OCTA in multimodal retinal image registration. Moreover, to train the segmentation module in our VDD-Reg, we design a novel two-stage semi-supervised learning framework, LVD-Seg, which combines supervised and unsupervised losses. Both quantitative and qualitative results demonstrate that VDD-Reg outperforms the existing methods for both small VD differences (i.e., CF-FA) and large VD differences (i.e., MEMO). Additionally, VDD-Reg requires as few as three annotated vessel segmentation masks to maintain its performance, which demonstrates its promising potential for registering other modalities.

Funding

University of Maryland (MPowering the State); National Institutes of Health (R01EY031731).

Disclosures

The authors declare no conflicts of interest.

Data availability

The CF-FA dataset and MEMO dataset underlying the results presented in this paper are available at [30] and [78].

References

1. M. T. Nicolela, B. E. Walman, A. R. Buckley, et al., “Ocular hypertension and primary open-angle glaucoma: a comparative study of their retrobulbar blood flow velocity,” J. glaucoma 5(5), 308–310 (1996). [CrossRef]  

2. V. Patel, S. Rassam, R. Newsom, et al., “Retinal blood flow in diabetic retinopathy,” Br. Med. J. 305(6855), 678–683 (1992). [CrossRef]  

3. T. A. Ciulla, A. Harris, H. S. Chung, et al., “Color Doppler imaging discloses reduced ocular blood flow velocities in nonexudative age-related macular degeneration,” Am. J. Ophthalmol. 128(1), 75–80 (1999). [CrossRef]  

4. G. T. Feke, B. T. Hyman, R. A. Stern, et al., “Retinal blood flow in mild cognitive impairment and Alzheimer’s disease,” Alzheimer’s & Dementia: Diagn. Assess. & Dis. Monit. 1(2), 144–151 (2015). [CrossRef]  

5. F. Berisha, G. T. Feke, C. L. Trempe, et al., “Retinal abnormalities in early Alzheimer’s disease,” Invest. Ophthalmol. Vis. Sci. 48(5), 2285–2289 (2007). [CrossRef]  

6. C. E. Riva, M. Geiser, B. L. Petrig, et al., “Ocular blood flow assessment using continuous laser Doppler flowmetry,” Acta Ophthalmol. 88(6), 622–629 (2010). [CrossRef]  

7. B. Lee, E. A. Novais, N. K. Waheed, et al., “En face Doppler optical coherence tomography measurement of total retinal blood flow in diabetic retinopathy and diabetic macular edema,” JAMA Ophthalmol. 135(3), 244–251 (2017). [CrossRef]  

8. Y. Jia, O. Tan, J. Tokayer, et al., “Split-spectrum amplitude-decorrelation angiography with optical coherence tomography,” Opt. Express 20(4), 4710–4725 (2012). [CrossRef]  

9. W. Goebel, W. E. Lieb, A. Ho, et al., “Color Doppler imaging: a new technique to assess orbital blood flow in patients with diabetic retinopathy,” Invest. Ophthalmol. Vis. Sci. 36(5), 864–870 (1995).

10. A. Roorda, “Applications of adaptive optics scanning laser ophthalmoscopy,” Optom. Vis. Sci. 87(4), 260–268 (2010). [CrossRef]  

11. S. Arichika, A. Uji, M. Hangai, et al., “Noninvasive and direct monitoring of erythrocyte aggregates in human retinal microvasculature using adaptive optics scanning laser ophthalmoscopy,” Invest. Ophthalmol. Vis. Sci. 54(6), 4394–4402 (2013). [CrossRef]  

12. M. Pircher and R. J. Zawadzki, “Review of adaptive optics OCT (AO-OCT): principles and applications for retinal imaging,” Biomed. Opt. Express 8(5), 2536–2562 (2017). [CrossRef]  

13. J. Carroll, D. B. Kay, D. Scoles, et al., “Adaptive optics retinal imaging–clinical opportunities and challenges,” Curr. Eye Res. 38(7), 709–721 (2013). [CrossRef]  

14. R. Flower, E. Peiretti, M. Magnani, et al., “Observation of erythrocyte dynamics in the retinal capillaries and choriocapillaris using ICG-loaded erythrocyte ghost cells,” Invest. Ophthalmol. Vis. Sci. 49(12), 5510–5516 (2008). [CrossRef]  

15. O. Saeedi, B. Tracey, C. Renner, et al., “Determination of absolute erythrocyte velocity and flow in the human retinal microvasculature by direct visualization of ICG-labelled erythrocytes,” Invest. Ophthalmol. Vis. Sci. 59, 3950 (2018).

16. B. M. Tracey, L. N. Mayo, C. T. Le, et al., “Measurement of retinal microvascular blood velocity using erythrocyte mediated velocimetry,” Sci. Rep. 9(1), 20178 (2019). [CrossRef]  

17. S. Asanad, A. Park, J. Pottenburgh, et al., “Erythrocyte-mediated angiography: quantifying absolute episcleral blood flow in humans,” Ophthalmology 128(5), 799–801 (2021). [CrossRef]  

18. D. Wang, A. Haytham, L. Mayo, et al., “Automated retinal microvascular velocimetry based on erythrocyte mediated angiography,” Biomed. Opt. Express 10(7), 3681–3697 (2019). [CrossRef]  

19. A. H. Kashani, C. L. Chen, J. K. Gahm, et al., “Optical coherence tomography angiography: a comprehensive review of current methods and clinical applications,” Prog. Retinal Eye Res. 60, 66–100 (2017). [CrossRef]  

20. S. S. Gao, Y. Jia, M. Zhang, et al., “Optical coherence tomography angiography,” Invest. Ophthalmol. Vis. Sci. 57, OCT27–OCT36 (2016).

21. Y. Watanabe, Y. Takahashi, and H. Numazawa, “Graphics processing unit accelerated intensity-based optical coherence tomography angiography using differential frames with real-time motion correction,” J. Biomed. Opt. 19(2), 021105 (2013). [CrossRef]  

22. M. Santarossa, A. Kilic, C. von der Burchard, et al., “Medregnet: unsupervised multimodal retinal-image registration with gans and ranking loss,” in Medical Imaging 2022: Image Processing, vol. 12032 (SPIE, 2022), pp. 321–333.

23. T. De Silva, E. Y. Chew, N. Hotaling, et al., “Deep-learning based multi-modal retinal image registration for the longitudinal analysis of patients with age-related macular degeneration,” Biomed. Opt. Express 12(1), 619–636 (2021). [CrossRef]  

24. G. Luo, X. Chen, F. Shi, et al., “Multimodal affine registration for ICGA and MCSL fundus images of high myopia,” Biomed. Opt. Express 11(8), 4443–4457 (2020). [CrossRef]  

25. M. Arikan, A. Sadeghipour, B. Gerendas, et al., “Deep learning based multi-modal registration for retinal imaging,” in Interpretability of Machine Intelligence in Medical Image Computing and Multimodal Learning for Clinical Decision Support: Second International Workshop, iMIMIC 2019, and 9th International Workshop, ML-CDS 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 17, 2019, Proceedings 9, (Springer, 2019), pp. 75–82.

26. J. A. Lee, P. Liu, J. Cheng, et al., “A deep step pattern representation for multimodal retinal image registration,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), pp. 5077–5086.

27. J. Zhang, Y. Wang, J. Dai, et al., “Two-step registration on multi-modal retinal images via deep neural networks,” IEEE Trans. on Image Process. 31, 823–838 (2022). [CrossRef]  

28. Y. Wang, J. Zhang, M. Cavichini, et al., “Robust content-adaptive global registration for multimodal retinal images using weakly supervised deep-learning framework,” IEEE Trans. on Image Process. 30, 3167–3178 (2021). [CrossRef]  

29. A. Sindel, B. Hohberger, A. Maier, et al., “Multi-modal retinal image registration using a keypoint-based vessel structure aligning network,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2022), pp. 108–118.

30. S. Hajeb Mohammad Alipour, H. Rabbani, and M. R. Akhlaghi, “Diabetic retinopathy grading by digital curvelet transform,” Comput. Mathematical Methods Med. 2012, 1–11 (2012). [CrossRef]  

31. C. F. Burgoyne, “The non-human primate experimental glaucoma model,” Exp. Eye Res. 141, 57–73 (2015). [CrossRef]  

32. R. Rocholz, F. Corvi, J. Weichsel, et al., “OCT angiography (OCTA) in retinal diagnostics,” in High Resolution Imaging in Microscopy and Ophthalmology: New Frontiers in Biomedical Optics (Springer, 2019), pp. 135–160.

33. J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14 (Springer, 2016), pp. 694–711.

34. E. Decenciere, G. Cazuguel, X. Zhang, et al., “Teleophta: Machine learning and image processing methods for teleophthalmology,” IRBM 34(2), 196–203 (2013). [CrossRef]  

35. M. Ortega, M. G. Penedo, J. Rouco, et al., “Retinal verification using a feature points-based biometric pattern,” EURASIP J. on Adv. Signal Process. 2009(1), 235746 (2009). [CrossRef]  

36. K. M. Adal, P. G. van Etten, J. P. Martinez, et al., “Accuracy assessment of intra- and intervisit fundus image registration for diabetic retinopathy screening,” Invest. Ophthalmol. Vis. Sci. 56(3), 1805–1812 (2015). [CrossRef]  

37. C. Hernandez-Matas, X. Zabulis, A. Triantafyllou, et al., “Fire: fundus image registration dataset,” Model. Artif. Intell. Ophthalmol. 1(4), 16–28 (2017). [CrossRef]  

38. L. Ding, T. Kang, A. Kuriyan, et al., “Flori21: Fluorescein angiography longitudinal retinal image registration dataset,” IEEE Dataport, 2021, https://doi.org/10.21227/ydp8-zf19.

39. M. Li, Y. Chen, Z. Ji, et al., “Image projection network: 3d to 2d image segmentation in octa images,” IEEE Trans. Med. Imaging 39(11), 3343–3354 (2020). [CrossRef]  

40. L. Ding, A. E. Kuriyan, R. S. Ramchandran, et al., “Weakly-supervised vessel detection in ultra-widefield fundus photography via iterative multi-modal registration and learning,” IEEE Trans. Med. Imaging 40(10), 2748–2758 (2021). [CrossRef]  

41. J. Chen, J. Tian, N. Lee, et al., “A partial intensity invariant feature descriptor for multimodal retinal image registration,” IEEE Trans. Biomed. Eng. 57(7), 1707–1718 (2010). [CrossRef]  

42. Z. Ghassabi, J. Shanbehzadeh, A. Sedaghat, et al., “An efficient approach for robust multimodal retinal image registration based on UR-SIFT features and PIIFD descriptors,” EURASIP J. on Image Video Process. 2013(1), 25 (2013). [CrossRef]  

43. G. Wang, Z. Wang, Y. Chen, et al., “Robust point matching method for multimodal retinal image registration,” Biomed. Signal Process. Control. 19, 68–76 (2015). [CrossRef]  

44. J. Addison Lee, J. Cheng, B. Hai Lee, et al., “A low-dimensional step pattern analysis algorithm with application to multimodal retinal image registration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 1046–1053.

45. Z. Hossein-Nejad and M. Nasri, “A-ransac: adaptive random sample consensus method in multimodal retinal image registration,” Biomed. Signal Process. Control. 45, 325–338 (2018). [CrossRef]  

46. M. Hernandez, G. Medioni, Z. Hu, et al., “Multimodal registration of multiple retinal images based on line structures,” in 2015 IEEE Winter Conference on Applications of Computer Vision (IEEE, 2015), pp. 907–914.

47. Á. S. Hervella, J. Rouco, J. Novo, et al., “Multimodal registration of retinal images using domain-specific landmarks and vessel enhancement,” Procedia Computer Science 126, 97–104 (2018). [CrossRef]  

48. D. Motta, W. Casaca, and A. Paiva, “Vessel optimal transport for automated alignment of retinal fundus images,” IEEE Trans. on Image Process. 28(12), 6154–6168 (2019). [CrossRef]  

49. Z. Li, F. Huang, J. Zhang, et al., “Multi-modal and multi-vendor retina image registration,” Biomed. Opt. Express 9(2), 410–422 (2018). [CrossRef]  

50. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, arXiv:1409.1556 (2014). [CrossRef]  

51. H. Su, V. Jampani, D. Sun, et al., “Pixel-adaptive convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 11166–11175.

52. J.-Y. Zhu, T. Park, P. Isola, et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision (2017), pp. 2223–2232.

53. J. Pottenburgh, L. Mayo, E. Ma, et al., “Use of fitc and cfse labeled erythrocytes for in vivo retinal imaging in non-human primates,” Invest. Ophthalmol. Vis. Sci. 61(7), 897 (2020).

54. S.-E. Chen, V. Chen, J. Pottenburgh, et al., “In vivo measurement of plexus-specific retinal erythrocyte velocity and acceleration in human subjects and nhps,” Invest. Ophthalmol. Vis. Sci. 63(7), 3502 (2022).

55. K.-K. Maninis, J. Pont-Tuset, P. Arbeláez, et al., “Deep retinal image understanding,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19 (Springer, 2016), pp. 140–148.

56. D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition workshops, (2018), pp. 224–236.

57. L. Gatys, A. S. Ecker, and M. Bethge, “Texture synthesis using convolutional neural networks,” Advances in Neural Information Processing Systems 28 (2015).

58. X. Jiang, J. Ma, G. Xiao, et al., “A review of multimodal image matching: Methods and applications,” Inf. Fusion 73, 22–71 (2021). [CrossRef]  

59. M. Zhao, G. Zhang, and M. Ding, “Heterogeneous self-supervised interest point matching for multi-modal remote sensing image registration,” Int. J. Remote. Sens. 43(3), 915–931 (2022). [CrossRef]  

60. P.-E. Sarlin, C. Cadena, R. Siegwart, et al., “From coarse to fine: Robust hierarchical localization at large scale,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 12716–12725.

61. J. Zhang, D. Sun, Z. Luo, et al., “Learning two-view correspondences and geometry using order-aware network,” in Proceedings of the IEEE/CVF international conference on computer vision (2019), pp. 5845–5854.

62. M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM 24(6), 381–395 (1981). [CrossRef]  

63. J. Sun, Z. Shen, Y. Wang, et al., “Loftr: Detector-free local feature matching with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), pp. 8922–8931.

64. P.-E. Sarlin, D. DeTone, T. Malisiewicz, et al., “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), pp. 4938–4947.

65. A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” Advances in Neural Information Processing Systems (2017).

66. A. Sindel, B. Hohberger, S. F. Dehcordi, et al., “A keypoint detection and description network based on the vessel structure for multi-modal retinal image registration,” in Bildverarbeitung für die Medizin 2022: Proceedings, German Workshop on Medical Image Computing, Heidelberg, June 26-28, 2022 (Springer, 2022), pp. 57–62.

67. J. Zhang, C. An, J. Dai, et al., “Joint vessel segmentation and deformable registration on multi-modal retinal images based on style transfer,” in 2019 IEEE International Conference on Image Processing (ICIP) (IEEE, 2019), pp. 839–843.

68. D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv, arXiv:1412.6980 (2014). [CrossRef]  

69. K. J. Zuiderveld, “Contrast limited adaptive histogram equalization,” in Graphics Gems (Academic Press, 1994).

70. A. F. Frangi, W. J. Niessen, K. L. Vincken, et al., “Multiscale vessel enhancement filtering,” in Medical Image Computing and Computer-Assisted Intervention—MICCAI’98: First International Conference Cambridge, MA, USA, October 11–13, 1998 Proceedings 1 (Springer, 1998), pp. 130–137.

71. V. Balntas, K. Lenc, A. Vedaldi, et al., “Hpatches: a benchmark and evaluation of handcrafted and learned local descriptors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), pp. 5173–5182.

72. A. Dai, A. X. Chang, M. Savva, et al., “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 5828–5839.

73. A. Budai, R. Bock, A. Maier, et al., “Robust vessel segmentation in fundus images,” Int. J. Biomed. Imaging 2013, 1–11 (2013). [CrossRef]  

74. J. Staal, M. D. Abràmoff, M. Niemeijer, et al., “Ridge-based vessel segmentation in color images of the retina,” IEEE Trans. Med. Imaging 23(4), 501–509 (2004). [CrossRef]  

75. H. Narasimha-Iyer, B. Lujan, J. Oakley, et al., “Registration of cirrus HD-OCT images with fundus photographs, fluorescein angiographs and fundus autofluorescence images,” Invest. Ophthalmol. Vis. Sci. 49(13), 1831 (2008).

76. Z. Liu, J. Tam, O. Saeedi, et al., “Trans-retinal cellular imaging with multimodal adaptive optics,” Biomed. Opt. Express 9(9), 4246–4262 (2018). [CrossRef]  

77. C. T. Le, D. Wang, R. Villanueva, et al., “Novel application of long short-term memory network for 3D to 2D retinal vessel segmentation in adaptive optics—optical coherence tomography volumes,” Appl. Sci. 11(20), 9475 (2021). [CrossRef]  

78. C. Y. Wang, F. K. Sadrieh, Y. T. Shen, et al., “MEMO: a multimodal EMA and OCTA retinal image dataset,” GitHub, 2024, https://chiaoyiwang0424.github.io/MEMO/




Figures (9)

Fig. 1. Sample images of (a) CF, (b) FA, (c) EMA and (d) OCTA with vessel density (VD). (a) and (b) are taken from the CF-FA dataset [30]. In this example, the vessel density of OCTA (d) is five times greater than that of EMA (c), since most capillaries cannot be visualized in EMA images.

Fig. 2. A typical sample EMA and OCTA pair from our MEMO dataset. Images inside the orange boxes were used for ground truth labeling. (A-1, A-2 and A-3: frames 0, 10 and 20 in the sample EMA image sequence. A-4: the stacked images of the EMA sequence. C-1, C-2 and C-3: the sample OCTA projection images representing the DCP, ICP and SVP layers. C-4: the OCTA B-scan image. B and D: the six corresponding point pairs of the sample EMA and OCTA pair.)

Fig. 3. The procedure for image acquisition. The numbers shown in the figure indicate the order.

Fig. 4. Image samples corresponding to each eye of each NHP. The EMA image is placed on top of the OCTA image for each image pair.

Fig. 5. The statistics of our MEMO dataset. The number of image pairs (count) falling within different ranges of (a) translation along the x-axis (pixels), (b) translation along the y-axis (pixels), or (c) rotation (degrees) is presented. The division of training and test data for the MEMO dataset is outlined in Sec. 5.1.1.

Fig. 6. The proposed VDD-Reg framework. VDD-Reg includes a vessel segmentation module and a registration module. The vessel segmentation module is trained with the proposed two-stage semi-supervised learning framework (LVD-Seg). DRIU [55] and SuperPoint [56] are adopted for our segmentation networks and registration network, respectively. $M_{reg}^{global}$ denotes the partial affine transformation matrix for global image registration.

Fig. 7. The network structure of SuperPoint [56]. The green circles indicate the feature points detected by SuperPoint.

Fig. 8. The average (a) Soft Dice and (b) Masked Soft Dice values over image pairs in MEMO obtained by adding different x and y shifts to the ground truth registration. The top-left value in each panel represents the average Soft Dice or Masked Soft Dice value obtained by the ground truth registration. All values are color-coded.

Fig. 9. Registration results of our method and the existing methods on a selected image pair from our MEMO dataset. The top row shows the grid images in which the EMA and OCTA images are interlaced as small grids. The bottom row shows the overlay images of the EMA (green) and OCTA (orange) extracted vessels registered by each method. The vessel segmentation masks were generated by our method. The RMSE, MAE and Masked Soft Dice of each method are listed below each overlay image.

Tables (8)

Table 1. Comparison of Public Retinal Image Datasets with Image Pairs
Table 2. Summary of Five Existing Methods
Table 3. Results of different methods on the CF-FA test set (Best results are marked in bold)
Table 4. Results of different methods on the MEMO test set (Best results are marked in bold)
Table 5. Results of removing either stage of LVD-Seg when training the segmentation module on the MEMO dataset (Best results are marked in bold)
Table 6. Results of using different numbers of annotated vessel segmentation masks in stage 1 of LVD-Seg (Best results are marked in bold)
Table 7. Results of using different vessel segmentation datasets during stage 1 of LVD-Seg (Best results are marked in bold)
Table 8. Results of splitting the dataset by NHPs (Best results are marked in bold)

Equations (9)

$$L_{v} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{Pred}(I^{(i)}) - M^{(i)}\right)^{2}.$$

$$L_{sc} = \mathrm{MSE}\left(\mathrm{Rot90}\left(\mathrm{Pred}\left(\mathrm{Rot90}(I)\right)\right),\ \mathrm{Pred}(I)\right),$$

$$L_{s1} = w_{v}L_{v} + w_{sc}L_{sc},$$

$$G_{j}(I)_{c,c'} = \frac{1}{C_{j}H_{j}W_{j}}\sum_{h=1}^{H_{j}}\sum_{w=1}^{W_{j}} \phi_{j}(I)_{h,w,c}\,\phi_{j}(I)_{h,w,c'}.$$

$$L_{st}^{j}\left(\mathrm{Pred}(I),\ M_{t}\right) = \left\lVert G_{j}\left(\mathrm{Pred}(I)\right) - G_{j}(M_{t})\right\rVert_{F}^{2},$$

$$L_{st} = \sum_{j\in J} L_{st}^{j}\left(\mathrm{Pred}(I),\ M_{t}\right).$$

$$L_{s2} = w_{st}^{e}L_{st}^{e} + w_{st}^{o}L_{st}^{o} + w_{sc}\left(L_{sc}^{e} + L_{sc}^{o}\right).$$

$$\mathrm{Soft\ Dice} = \frac{2\sum_{i=1}^{N}\min\left(F_{i}^{s},\ F_{i}^{t}\right)}{\sum_{i=1}^{N}F_{i}^{s} + \sum_{i=1}^{N}F_{i}^{t}},$$

$$\mathrm{Masked\ Soft\ Dice} = \frac{2\sum_{i=1}^{N}\min\left(M_{i}^{e}F_{i}^{e},\ M_{i}^{e}F_{i}^{o}\right)}{\sum_{i=1}^{N}M_{i}^{e}F_{i}^{e} + \sum_{i=1}^{N}M_{i}^{e}F_{i}^{o}},$$
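As a reference for the formulas above, a minimal NumPy sketch of the Gram-matrix building block and of the Soft Dice and Masked Soft Dice metrics is given below. Variable names mirror the notation in the equations (F for vessel probability maps, M for the EMA field-of-view mask, a (C, H, W) array for the feature maps phi); the small eps term is an added numerical-stability guard that is not part of the original formulas, and the sketch assumes the probability maps and mask have already been flattened to 1-D arrays of equal length.

import numpy as np

def soft_dice(f_src, f_tgt, eps=1e-8):
    # Soft Dice between two aligned vessel probability maps (flattened arrays).
    overlap = np.minimum(f_src, f_tgt).sum()
    return 2.0 * overlap / (f_src.sum() + f_tgt.sum() + eps)

def masked_soft_dice(f_ema, f_octa, m_ema, eps=1e-8):
    # Soft Dice restricted to the EMA field of view by the binary mask m_ema.
    num = np.minimum(m_ema * f_ema, m_ema * f_octa).sum()
    den = (m_ema * f_ema).sum() + (m_ema * f_octa).sum() + eps
    return 2.0 * num / den

def gram_matrix(features):
    # Gram matrix G_j of a (C, H, W) feature map, normalized by C*H*W.
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_loss(pred_features, target_features):
    # Squared Frobenius distance between Gram matrices (one term of L_st).
    diff = gram_matrix(pred_features) - gram_matrix(target_features)
    return float((diff ** 2).sum())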