Development of ocean color algorithms for estimating chlorophyll-a concentrations and inherent optical properties using gene expression programming (GEP)

Chih-Hua Chang

doi:10.1364/OE.23.005417

1. Introduction

The application of ocean-color remote sensing is promising with regard to its ability to allow for a high spatial and temporal assessment of important physical, chemical, and biological parameters of the global ocean from space [1]. Deriving optical properties associated with these essential parameters from remotely sensed radiance above the sea surface, is known as the “inverse problem” in the field of ocean optics. Retrieval algorithms or inversion algorithms/models are applied to solve the inverse problem, which use inputs representing “ocean-color” [2] (R_rs, the ratio between the water-leaving radiance and downwelling irradiance, sr⁻¹) measured by a satellite sensor at a limited number of spectral bands in the visible range. Typical outputs are the total phytoplankton content in terms of Chlorophyll-a concentration (Chla); and ocean inherent optical properties (IOPs) [3], such as the absorption and back-scattering characteristics for phytoplankton, detritus (non-algal particles) and colored dissolved organic matters (CDOM or gelbstoff). Estimation of the bio-optical properties through solving the inverse problem is of importance in understanding the related environmental dynamics at both local and global scales.

The various inversion algorithms can generally be classified into two different approaches based on whether a forward-model that physically describes the bio-optical relationships and radiative-transfer processes is or is not involved in the inversion process [4]. First, model-based approaches solve the inverse problem by deriving algebraic results [5], or approximating suitable solutions using optimization or spectral matching methods, such as the Levenberg-Marquart approach [6], and nonlinear least-square schemes [7, 8]. If a representative model is available, the major benefit obtained from such approaches is that few training data are required to develop the algorithm. In order to obtain algebraic solutions or improve the efficiency of optimization, it is commonly necessary to simplify the forward-model, such as by the use of semi-analytical and quasi-analytical approximations [5, 6]. However, simplification may create unrepresentative models and lead to incorrect results, particularly when dealing with optically complex waters [4]. Second, without the use of a predefined model, empirical or regression approaches can directly relate the inputs to the outputs by training with a set of known samples. Although it is indicated that the model-based approach may have a higher level of complexity and perhaps better accuracy for Case-2 waters [4], current operational algorithms used for mapping global ocean-color products are developed using a data-driven or empirical approach, mainly due to higher model transferability and lower computational load. The IOP products were mostly derived by model-based approaches or empirical algorithms developed by data generated from mechanistic models. The performance of empirical methods is highly dependent on the level of representativeness and complexity of the training data. Because the interactions between different IOPs are commonly not linear [4], bio-optical models are not globally the same [9], and bio-optical measurements often return imprecise and uncertain data [10, 11], the concepts of a universally applicable model or representative data are problematic in real oceans. This has led to the use of soft-computing techniques in the ocean-color community, to develop more reliable and robust algorithms to deal with such nonlinear systems. For example, recent work has applied artificial neural networks (ANNs) [12–17], support vector regression (SVR) [18, 19], cross-entropy method [20], active learning method [21], and evolutionary algorithms (EAs) [22, 23].

EAs are inspired by natural processes that drive evolution [24], such as inheritance, variation and the Darwinian struggle for survival. EAs have been widely used in many disciplines to solve complex problems, and are effective, robust and easy-to-implement methods that can search for global solutions to optimization problems [25]. In general, EAs encompass all adaptive and computational models of natural evolutionary systems [25], such as genetic algorithms (GAs) [24] and genetic programming (GP) techniques [26]. GAs are genotype-algorithms, in which the individuals are fixed-length linear strings and function as genomes (chromosomes). A GA works as a nonlinear optimization method in the model-based approach for solving the inverse problem [22, 23], in which candidate solutions are encoded as chromosomes and manipulated by natural selection operators, e.g., reproduction, crossover and mutation, and an individual’s fitness for survival is evaluated in decoded forms (using real values) by the predefined forward model and objective function. In contrast, a GP is a phenotype-algorithm, where the phenotypes of individuals are represented and manipulated as nonlinear entities of different sizes and shapes, commonly a tree-structured function [26]. Solving the inverse problem with GP is also known as symbolic regression [27, 28], in which the most appropriate backward model for linking inputs and outputs is developed by directly optimizing a population of tree-structured functions through the related evolutionary processes. GPs have had significant success in creating innovative (i.e., commonly improved) functions and equations for industrial applications. However, in a Koza-style GP [26], both the searching operators and the fitness functions are based on the phenotype. Therefore, without the use of a genotype, illegal or invalid tree-structures may often be created in the selection process, which limits the optimization efficiency and algorithm transferability [29].

Several genotype-phenotype GP techniques which can manipulate individuals genetically as GA and can maintain the functional complexity of the phenotype, have recently been proposed, such as the grammatical evolution [30] and gene expression programming (GEP) techniques [29]. A genotype-phenotype system makes a separation between an organism’s genotype and phenotype, which means that it is more compatible with how genetics actually operates in nature, and is more likely to provide reasonable, efficient and simple algorithms to solve the inverse problem. In GEP, the individuals are encoded as GAs, which are then translated into tree-structured functions, the gene expression trees (ETs), by a functional relationship between chromosomes and phenotype expressions. With the unequivocal translation system [29], any modification made in the genome can always result in correct ETs. However to date no research has been performed using the GEP to develop ocean-color retrieval algorithms.

With an attempt to gain the computational advantages from both the evolutionary algorithm and empirical approaches, this work develops new inversion algorithms for the retrieval of Chla and IOPs from remote sensing of ocean color using GEP. First, version 2 of the NASA bio-Optical Marine Algorithm Data set (NOMAD) [31] is examined to summarize all the available data points of R_rs(λ), Chla and IOPs. Second, available data pairs are sampled into two subsets: a training data set for the creation of inversion models, and a validation data set that is used to check the performance of the model. Third, the GEP technique is applied to develop six ocean-color inversion algorithms for the estimation of (1) Chla (mg/m³); (2) absorption coefficients at 443nm (m⁻¹), including phytoplankton [a_ph(443)], detritus-gelbstoff [a_dg(443)], particulate matters [a_p(443)], and total optically-active-substances [a(443)]; and (3) the particle backscattering coefficients at 555nm (m⁻¹) [bb_p(555)]. Finally, the application of these new models is validated against the in situ, synthetic, and Sea Wide Field-of-view Sensor (SeaWiFS) match-up data sets compiled by NASA and International Ocean Color Coordinate Group (IOCCG) [32]. Most of the in situ data used in this study comes from locations relatively close to the coast (some are Case 2 waters), and the IOCCG synthetic data set comprises a wide range of parameters characterizing the global ocean (Case 1 waters). Compared to existing empirical- and model-based approaches, the inversion algorithms proposed in this study are clearly expressed functions which have higher transferability and can be readily applied, and they are capable of: (1) providing better retrievals for in situ data; (2) being used to process the SeaWiFS data at regional and global scales; and (3) generating ocean-color products which give more information on the distribution of water constituents.

2. Data sets

NOMAD is a collection of in situ coincident observation data of radiometric, phytoplankton pigment concentrations and bio-optical properties from NASA’s SeaWiFS Bio-optical Archive and Storage System (SeaBASS). NOMAD is also the most popular and representative publicly available data set used to support algorithm development and satellite product validation, having been cited more than 100 times in the ISI Web of Knowledge since 2005. The latest version of the NOMAD data set (updated on 18 July, 2008), referred as NOMADv2a, was used in this study for the development of inversion models through GEP. The IOP data with coincident ground measured values of R_rs(λ_i) available at the first five SeaWiFS bands (λ_i = 411, 443, 489, 510 and 555nm) are used in this work, resulting in the following numbers of corresponding data points: HPLC measured Chla is 1,210, a_ph(443) is 897, a_dg(443) is 788, a_p(443) is 932, a(443) is 799 and bb_p(555) is 332. Although studies have shown that the R_rs measured at 670nm is very important for inversion of coastal waters [33], it was not used in this study when developing the algorithm, because there is an insufficient number of accurate measurements in NOMADv2a. The frequency distributions of the six selected data pairs are shown in Fig. 1. The NOMADv2a data set contains some near coastal samples, but the majority of the data is distributed in the lower end of each histogram, representing clear waters. The six data pairs were sorted by increasing the levels of Chla and IOPs, and afterwards the odd- and even-numbered data were used separately as training and validation samples for the development of GEP inversion models. A limited number of bb_p(555) observations were available, therefore an additional 56 data points were allocated to the training samples from the validation set, and thus two-thirds of the data were used for training. As shown in Table 1, there were 222 to 605 data points available to train the GEP algorithms.

Fig. 1 The frequency distributions of six selected IOP data points from the NOMADv2a data set. The variations of coincident observed R_rs(λ) are illustrated as boxplots and labeled on sub-axes.

Download Full Size | PDF

Table 1. Statistical summary of the bio-optical data for training and validation of the GEP-derived inversion models

View Table | View all tables in this article

The IOCCG prepared a synthetic and an in situ data set (hereafter referred as IOCCG synthetic and in situ data set) containing IOPs and the corresponding apparent optical properties (AOPs) for its Report Number 5 [32], in order to evaluate and compare the performance of a variety of retrieval algorithms commonly used in remote sensing practices. Note that the uncertainties in IOP measurements were designed into the IOCCG synthetic data by mapping one Chla to 25 different sets of IOPs based on theory and field data [34], resulting in 500 IOPs that are generated from 20 levels of Chla in the range of 0.03–30 mg/m³, and the corresponding R_rs(λ) simulated by Hydrolight for the sun at 30° and 60° from zenith, respectively. The IOCCG in situ data set comprises 656 cases including Chla, a(443), a_dg(443), a_ph(443) and R_rs(λ), which also originated from the SeaBASS data.

Another subset of NOMAD, released earlier as version 1.3 on 19 Sept 2005 (referred to as NOMADv13), consists of coincident observations of R_rs(λ) from SeaWiFS imagery, and was used to ensure the inversion models are also able to process real satellite data. Due to difficulties in meeting the criteria of satellite “match-up” [35], the sample size of NOMADv13 is relatively small, where 268, 123, 111, 123,111 and 28 cases are available for on-orbit validation of the derived Chla, a_ph(443), a_dg(443), a_p(443), a(443) and bb_p(555), respectively.

Overall, among the four data sets used, only the training samples (Table 1) in NOMADv2a are used to develop the inversion models, with the remainder used for validation. The IOCCG synthetic and in situ data sets are applied to evaluate and compare the performance of our inversion models and existing algorithms. NOMADv13 was used as satellite-matchup data for the validation of retrievals generated by GEP-derived inversion models, in which the inputs are the satellite observed R_rs(λ), rather than the ground measurements in other data sets.

3. Symbolic regression

The ocean color inverse problem can be regarded as an issue of symbolic regression [26], which requires finding a function in symbolic form that fits the relationships among given data points that consist of R_rs(λ) observed at a number of wavebands as the predictor variables, and the corresponding measurements of an ocean-color product as the variable to be predicted. Symbolic regression is also known as error-driven-evolution [26]. In an ocean-color inverse problem, where the evolution of solutions identified by symbolic regression is driven by a quantifiable objective, namely fitness. The fitness evaluation function is commonly a measure of the deviation between the simulated and observed values. Compared to other conventional linear or nonlinear regression techniques, the major difference with symbolic regression is that no a priori or pre-defined functions/models are specified. In symbolic regression, the computational constraint is that all of the mathematical operators or logical expressions used in the function should be chosen from a pre-defined pool of mathematical expressions or symbols. In the first part of this section, the general procedures for solving the symbolic regression problem with GEP will be introduced. Particular attention will be given to demonstrate how the unequivocal translation system works. The steps and settings for the development of ocean-color inversion algorithms using the GEP symbolic regression approach will then be described in the second part.

3.1 GEP based symbolic regression

GEP is a genotype-phenotype genetic algorithm, in which the genotype of each individual is represented as a linear symbolic string (chromosome) with one or more fixed-length genes, as shown in Fig. 2(a); and the inferred phenotype (ET) is expressed nonlinearly in diagram form, as shown in Fig. 2(b). The computational strength of GEP is that not only can the ET be read straightforwardly from the “easily manipulated” linear string, but that it can also represent a real mathematic function [Fig. 2(c)] without losing the computational complexity. The process of transferring genetic information from a gene into an ET is carried out by the so-called translation system, in which several rules are applied to prevent the forming of illegal ETs and ensure that all possible messages or changes made in the gene can be directly translated into an ET. The rules that govern this are briefly described, as follows:

Fig. 2 Example of different representations of an individual in GEP, (a) an example of GEP coded linear chromosome with one 13-digit gene, (b) gene expression tree (ET) of the example chromosome, and (c) the corresponding algebraic expression.

Download Full Size | PDF

1. The head region of a gene [Fig. 2(a)] can contain symbols that represent both the functional and terminal nodes in the inferred ET [Fig. 2(b)], whereas the tail region contains only terminals. For instance, the symbols “Q”, “*”, “M” and “+” in the head part [Fig. 2(a)] represent the square root, multiplication, maximum of the two, and addition function, respectively, and these are functional nodes. The symbols “a” and “b” represent the predictor variables or numerical constants, and these are terminal nodes [Fig. 2(a)].
2. The length of the head region (h) in a chromosome is freely chosen, and the length of the tail (t) is then calculated as follows: $t = h (n - 1) + 1.$
Where n is the maximum number of arguments that a predefined function or operator can take. In Fig. 2, h is 6 and n equals the number of arguments used in the “maximum of two” function (represented as “M”), and hence the tail length is 7. Consequently, the size and shape of an ET, as well as the created real function, depend on the number/type of functional symbols used in the head region. In this case, when the head region is full of mathematical expressions, each of which has an argument number of 2, then all of the noncoding region will be coded, and the size of ET reaches the maximum available number of 13.
3. Multigenic chromosomes are often used in GEP to increase the diversity of individuals, in which each gene of the chromosome represents a particular subunit of the related ET. Note that both the fitness values of sub-ETs and multi-subunit ETs will be evaluated during the evolutionary process. This is useful when both the size (complexity) and performance (fitness) of the identified function are of interest to developers.
4. Reproduction of the genotype of an individual with best performing fitness through genetic operators, i.e., replication, mutation, transposition and recombination (crossover), is an essential part of evolution. With the exception of replication, the common genetic operators used in GEP are illustrated in Fig. 3. Because of the linear-string structure, possible modifications made on phenotypes by the genetic operators can be directly translated into the phenotype without using special rules, as shown in Figs. 3(a) and 3(b).

Fig. 3 Example of performing different genetic operations to the genotype of individuals with GEP and the corresponding changes in the phenotype, (a) an example of GEP one = point mutation, (b) an example of genetic transposition operation, and (c) the GEP recombination operation.

Download Full Size | PDF

To solve symbolic regression problems, the GEP creates and optimizes regression models to the highest fitness value following the steps below:

1. The GEP process begins by randomly generating a population of individuals (genotype).
2. Each individual, which is translated to one possible regression model (phenotype) by the translation system, can be any combination of predefined mathematical operators, logical expressions, predictor variables and numerical constants.
3. The probability for each individual to survive in the evolutionary process is assessed by a fitness evaluation function.
4. The second and subsequent generations are created through a reproduction process. In this process, an individual with a high fitness value has a greater opportunity of being selected and passing its genetic characteristics on to its offspring by duplicating or processing genetic operations.
5. In the final generation, the optimum regression model linking the predictor and predictand variables can be obtained.

3.2 Development of ocean-color retrieval algorithms using GEP

The symbolic regression function for solving the ocean-color inverse problem can be defined as the following form:

\hat{Y} = f (d_{0}, d_{1}, ..., d_{4}) .

Where

\hat{Y}

is the (water constituent) bio-optical property predicted by the measured ocean color, and d_i (i = 0 to 4) is the R_rs(λ_i) measured at the first five SeaWiFS bands in the visible range, i.e., λ_i = 411, 443, 489, 510 and 555nm. By using the GEP approach, the process of finding the optimum function, as shown in Eq. (2), consists of the following steps:

First, formulate the fitness functions to guide and constrain the direction of evolution. The most widely used evaluation functions in symbolic regression are the coefficient of determination (R²) and root-mean-square-error (RMSE) [29, 36], which tend to match several peak and higher values rather than fitting the overall distribution patterns [37]. In the contrast, the log-transformed RMSE (logRMSE), which places more emphasis on fitting the lower observed values [5, 38], can cover a greater portion of IOP data in NOMADv2a [Fig. 1], which is formulated as follows:

logRMSE = {\frac{\sum_{i = 1}^{N} {[\log ({\hat{Y}}_{i}) - \log (Y_{i})]}^{2}}{N}}^{1 / 2} .

The evaluation of fitness is also subject to “constraints” to guide the individual towards making not only accurate but also sensible predictions. For example, the predicted IOP should always be a positive value. Therefore, the multi-objective fitness function used is formulated as:

f i t n e s s = 1000 (\frac{1}{1 + logRMSE}) - p u n i s h .

Where punish is the number of predicted values that failed to fall in the range of observed Y (Table 1, calculated from both the training and validation samples).

Second, set up the pre-defined pool of mathematical expressions or symbols used to form the regression models. A total of 13 mathematical functions are selected (Table 2) based on the recommendation from Ferreira [29] for symbolic regression problems, and a review of existing empirical algorithms [39–41]. Moore, et al. [42] indicated that the use of the R_rs(λ) pre-classification technique can better characterize the uncertainty of satellite data products, and it is also applicable to screening an appropriate algorithm for a specific optical water type. To implement this concept of pre-classification in the proposed inversion model using GEP, another six logical expressions are introduced in this study, as shown in Table 2, including the “maximum of two”, “maximum of three”, and four different “Greater Or Equal To” (GOE2) function sets.

Table 2. Summary of the GEP symbolic regression settings for the development of ocean-color inversion models

View Table | View all tables in this article

Third, create the genetic diversity that is needed for individuals to evolve, thus making it more likely that the fittest individual will be found. In GEP, the level of genetic diversification is controlled by various parameters describing the probability of specific genetic operations occurring, e.g., the rate of mutation and crossover. The initial settings of the genetic operation parameters are given in Table 2, and were also recommended by Ferreira [29] for optimum evolution.

Finally, other settings for planning the regression model architecture, e.g., the head size, available number of numerical constants, number of genes (sub-ETs) per chromosome, and function linking sub-ETs, are based on the discussions in the forums hosted by Gepsoft [43], and determined by trial and error tests. All of the GEP settings are summarized in Table 2, where (i) x, y and z cannot only represent a predictor or a numerical constant, but also a functional node. For example, the addition function set “x+y” can be the addition of two predictors “d₀+d₁”, one predictor and one function set “d₀+d₂/d₁”, and two function sets, and (ii) the function sets consist of only one variable x indicates that the number of arguments the function can take (arity) is 1. For example, the number of arities for x + y and max3(x,y,z) is 2 and 3, respectively.

4. Performance and model-closure evaluation

To evaluate and compare the performance of the GEP derived inversion models, measurements of (1) model rationality, in terms of the number of valid retrievals (n); (2) coefficient of determination in log phase (R²); (3) logRMSE; (4) mean absolute percentage deviation (MAPD); and (5) Type II regression results (the intercept and slope of major axis regression) are used in this study.

First, the performance of trained GEP models was assessed when applied to the overall NOMADv2a data set by comparing their retrievals to those from (1) NASA’s operational near-surface Chla algorithm for SeaWiFS (OC4v6); (2) a recently published Chla retrieval algorithm developed using GP based symbolic regression (GPSR) [28]; and (3) the latest version of the Quasi-Analytical Algorithm for IOPs (QAAv5) [5]. Note that the GPSR model was also developed based on NOMADv2a, but with three log-10 band reflectance ratios as inputs. In order to independently evaluate the strength of GEP between other data-driven empirical approaches, three inversion models were developed using the same training data as GEP, and also included in the benchmark comparison: (1) the OC4-type band-ratio polynomial regression model (OC4x); (2) the Multilayer Perceptron model (MLP), which has one hidden layer (6 neurons) and a logistic activation function for both hidden and output layers; and (3) the ε-insensitive SVR model (SVR) equipped with the Radial Basis Function kernels [18]. The MLP and SVR were both developed and optimally tuned using the DTREG commercial software, version 10.7.18.

Secondly, performance indicators for the GEP inversion models are calculated with the same procedures and data sets utilized in the IOCCG Report Number 5 [32], and are evaluated by comparing them with the indicator values obtained from various algorithms tested in this earlier report. The following inversion algorithms were selected for comparison: (1) the QAAv4, (2) the MERIS neural network algorithm (NN) [44], (3) the linear matrix inversion algorithm (LMI) [45], (4) the GSM SA bio-optical model (GSM) [46] and (5) the SA reflectance model based inversion algorithm (SARA) [47].

Inversion algorithms for IOPs and diffuse attenuation coefficients are mostly based on physical models or a bio-optical data set generated by “unbiased” and “closure-achieved” forward models. Since the GEP-derived IOP algorithms were fully based on in situ observations to define regression functions, they may be affected by errors within the data set. Additionally, the application of symbolic regression for IOP is limited to only one dependent variable, e.g., a_ph(443), and each IOP retrieval algorithm is developed independently and might have difficulties in achieving closure when combined with each other. Therefore, the validity of proposed algorithms should be evaluated by confirming that the IOPs obtained by independent algorithms are capable of being linked by bio-optical models forming IOPs spectra, and then achieving optical closure when a radiative-transfer model is applied. As shown on Fig. 4, a model-closure evaluation for the proposed algorithms is conducted by comparing the input R_rs(λ) (IOCCG synthetic data) of the inversion algorithms with the R_rs(λ) simulated by the forward models and corresponding IOP retrievals. The models, parameters, and approaches used to generate a(λ) and bb(λ) with IOPs derived at specific wavelengths, and the settings of HydroLight radiative transfer model used to simulate R_rs(λ) with the generated IOP spectra, are given in Lee (2006) [34]. Slight changes were made by shifting the reference wavelengths of absorption and scattering from 440 and 550nm to 443 and 555nm, respectively, and combining a_d(λ) and a_g(λ) in Lee’s approach to give a_dg(λ). The q-index developed by Doerffer and Schiller (2007) [44] to determine the degree of R_rs(λ) mismatch (mismatch when q $\geq$ 4) was used for closure evaluation.

Fig. 4 The procedure of closure evaluation for IOPs derived from a_ph(443), a_dg(443) and bb_p(555) algorithms.

Download Full Size | PDF

5. Results and discussion

5.1 The GEP-derived ocean-color inversion models

The inversion models developed by GEP were in the form of ETs, which were translated to the following algebraic equations, based on the procedure described in Section 3.1.

\begin{array}{l} C h l a = {(\frac{f_{3} (d_{0} d_{4}, 0.881 - d_{4})}{d_{1}^{2}})}^{2} + Abs [f_{3} (\frac{f_{3} (d_{3}, d_{4})}{0 .09} +0 .226, \frac{1}{f_{3} (d_{3}, d_{0})})] \\ + f_{2} {f_{3} [f_{4} (max(d_{3}, d_{1}, d_{0}), d_{3}), d_{1} + d_{4}], f_{3} [f_{3} (d_{4},0 .369), \sqrt{d_{2}}]}, \end{array}

a_{p h} (443) = d_{0}^{(\frac{d_{3}}{d_{4}}) + 0.8423 - {(d_{0}^{d_{3}})}^{2}} + f_{3} (\max 2 (f_{3} {(d_{2}, d_{0})}^{2}, d_{0}^{2}), 0 .0092) + {[0.2475 {(\frac{d_{3}}{d_{2}})}^{2}]}^{{(\sqrt[3]{d_{1}} + 1.3014)}^{2}},

\begin{array}{l} a_{d g} (443)= f_{4} (d_{3} - 0.9756, \frac{d_{4}}{d_{0}}) (d_{0} - 0.0798 - f_{2} (d_{1}, d_{2})) + f_{3} (\sqrt{\frac{f_{4} (d_{4}, d_{2})}{d_{0}}}, d_{4} - 0 .0002 - d_{3}) \\ + f_{3} [d_{1} + f_{4} (d_{2} - 2 d_{3},-1 .0691 d_{3}), d_{4}], \end{array}

\begin{array}{l} a_{p} (443) = d_{2} + f_{4} ({0.0048}^{10^{d_{0} -0 .016}}, d_{3}^{d_{0} / d_{2}}) - \max 3 (f_{2} [f_{2} (0 .0236, d_{4}), f_{3} (d_{4}, d_{0})], d_{1},-0 .2639) \\ + f_{3} (0.5034, d_{4}^{2}) (125 + \frac{0.2262}{d_{2}^{2}}), \end{array}

\begin{array}{l} a (443) = {(\frac{0.4344 d_{4}}{\max 3 (d_{0} - 0.005, 0.3757 d_{4}, d_{2})})}^{2} + f_{4} [0 .0162 + 0.8085 d_{0}, f_{4} (2 d_{3}, f_{2} (d_{2}, d_{0}))] \\ and \\ + f_{4} [f_{1} (\frac{d_{4}^{2}}{d_{0}}, \max 2 (d_{2}, 0.0068)), f_{2} (d_{3}, d_{2}) + \max 3 (d_{4}, d_{2}, 0.0071)], \end{array}

\begin{array}{l} b b_{p} (555) = d_{4} - f_{4} {[d_{1}, f_{3} (\frac{d_{0}}{d_{3}}, f_{3} (d_{2}, d_{4}))]}^{2} + f_{4} {[f_{2} (0.0043, d_{1}), f_{2} (d_{3}, 0.003)]}^{2} \\ + f_{3} [\max 3 (d_{1}, d_{3}, 0.0035), 10^{d_{1}}] + f_{4} [f_{4} (ln(d_{4})+2 .3932,0 .0021), - 0.6994 d_{4} - d_{0} + d_{1}] . \end{array}

Where f₁, f₂, f₃, and f₄ represent the “Greater Or Equal 2” function of GOE2A, GOE2C, GOE2D and GOE2E (Table 2); max2 is the maximum of two function; and max3 is the maximum of three function. Interestingly, the GOE2 functions are the most frequently chosen by the above equations. It is evident that logical expressions play an important role in GEP-derived inversion models by varying the functional relationships between variables according to the input value of R_rs(λ). Although similar pre-classification strategies have been widely used in existing algorithms, such as the max3 function in OCx algorithms [41] and the color index (CI) threshold in the CI-based algorithm [48], this study is the first to develop an optimum inversion model derived using the GOE2 functions.

5.2 The limitation of GEP-derived algorithms

A synthetic R_rs(λ) data set was developed in this study to evaluate the robustness of the proposed algorithms. The test spectra were generated by taking 100,000 random Latin-hypercube samples of the R_rs(λ) spectrum from a multivariate lognormal distribution with a mean and covariance matrix obtained by all available measurements (n = 3,079) in the NOMADv2a data set. The evaluation aimed to be comprehensive and independent, as a major portion of the measurements was not included in the training data, and the testing spectra were simulated to sufficiently cover possible spectral values and patterns in the NOMADv2a data set. However, the simulated data may contain biases from the measurement errors within the in situ data set. The evaluation results are shown on Table 3. Although for the bb_p(555) retrievals are relatively less robust, more than 99% of the retrieved values from other algorithms are within the scope of training/validation data set. The proposed algorithms were robust even when R_rs(411) was at a relatively low level of 0.00014 sr⁻¹. However, an exceptional condition may occur while processing ocean-color imagery over coastal regions, where the R_rs(411) and R_rs(443) are often biased lower than 0.00014 sr⁻¹ or become negative values due to the atmospheric correction errors [49]. Further analysis using different SeaWiFS scenes is required to ensure the validity of GEP-derived algorithms.

Table 3. Statistical analysis results of Chla and IOPs retrievals obtained by the GEP-derived algorithms

View Table | View all tables in this article

5.3 Performance of GEP models for the NOMADv2a in situ bio-optical data set

The overall performance of the GEP-derived algorithms is shown in Table 4 and Fig. 5, by comparing their retrievals to the known Chla and IOPs in both training and validation data sets. The performance indicators for GEP, MLP, and SVR in training and validation stages are both reported in Table 4. The performance for all data is generally the average value of both stages. There are no significant differences among all empirical algorithms with regard to the type II regression. Because of a greater size of training data, the OC4v6 developed using all of the available samples in the NOMADv2a data set (n = 1,243) performs better than the OC4x based on the same training data used in this study (n = 605). Nevertheless, the other four empirical algorithms derived by the state-of-the-art regression approaches with less training data, e.g., GEP, GPSR, MLP and SVR, provide tighter and stronger relationships with the known values in terms of having lower errors (logRMSE and MAPD) and higher R² values than OC4v6. Based on the same training data, although the SVR performs the best across GEP and MLP for all and the major portion of cases (within the range of 0.25 and 3 mg/m³), the GEP performs the best for oligotrophic waters, where the levels of Chla were commonly lower than 0.25 mg/m³ [48], and well-balanced its performance across different levels of Chla. This advantage can help to process an ocean-color image with a higher dynamic range of Chla distribution. As shown in Fig. 5(a), because of the relatively few samples taken from eutrophic waters, a similar trend of under-estimation was found in both GEP and OC4v6 algorithms when the observations were higher than 10 mg/m³. The band-ratio algorithm derived by symbolic regression (GPSR) has a simpler architecture than our GEP-derived model [Eq. (5)]. However, the increased complexity of the GEP algorithm with additional GOE2 functions can obtain more accurate results, as seen in the fact that all of the performance indicators are better. In practice, there are no significant differences between GEP, SVR, and MLP in terms of the overall performance. The GEP-derived algorithm has the potential to be improved by using more function sets and adding more sub-ETs. The SVR and MLP can also be improved by tuning the network structures until fully optimized. The key strengths of the GEP-derived algorithm are its greater computational efficiency, which has been shown by GP in Tang et al. (2012) [28], and its greater algorithm transferability, due to its explicit nature (compared to SVR and MLP, which are implicit, black-box approaches).

Table 4. Performance of the GEP-derived and other compared inversion algorithms for Chla and IOPs, using the NOMADv2a in situ data set

View Table | View all tables in this article

Fig. 5 Comparing the retrievals obtained by GEP and selected algorithms to the known Chla and IOPs data from the NOMADv2a data set. (a) GEP- and OC4v6-derived Chla, (b) GEP- and QAAv5-derived a_ph(443), (c) GEP- and QAAv5-derived a_dg(443), (d) GEP-derived a_p(443), (e) GEP- and QAAv5-derived a(443), and (f) GEP- and QAAv5-derived bb_p(555).

Download Full Size | PDF

Comparisons were also made to the IOP retrievals from the model-based QAAv5 algorithm using the same data set, as shown in Table 4 and Fig. 5. The empirically-based GEP algorithms perform better than the model-based QAAv5 in terms of all performance metrics, for the NOMADv2a in situ data. For the category of a(443) or bb_p(555), which combine the contributions from both phytoplankton and detritus-gelbstoff, two algorithms are able to attain good retrievals. However, without the use of the additional measurements of R_rs(667), it is evident that QAAv5 has difficulty in separating the contributions of a_ph(443) and a_dg(443) from a(443), particularly when a_ph(443) is lower than 0.1 m⁻¹. With the advantage of using an independent algorithm for each inversion [Eqs. (6) and (7)], the GEP has better retrievals of both a_ph(443) and a_dg(443), which can help in accurately deriving the concentration of the specified water constituent.

The four- and three-component bio-optical model is summarized in the following equation:

a (λ) = a_{w} (λ) + a_{p h} (λ) + a_{d} (λ) + a_{g} (λ) = a_{w} (λ) + a_{p h} (λ) + a_{d g} (λ) .

Where a_w(λ) is the known absorption coefficient of pure water [50]. Based in Eq. (11), the a_dg(443) can be calculated either by Eq. (7) alone, or by subtracting a_ph(443) and a_w(443) from a(443) by utilizing the Eqs. (6)-(9). Although these equations were developed independently by GEP, and appear to be very different in their formulation, a comparison between the a_dg(443) retrievals obtained in two different ways [Fig. 6(c)] shows that both approaches are capable of carrying out the inversion, and hence the algorithms are generally interoperable. Therefore, specific information about gelbstoff and detritus in Eq. (11), i.e., a_g(443) and a_d(443), is obtainable from an “allocation” procedure. The a_g(443) can be obtained by subtracting a_p(443) and a_w(443) from a(443), and then used to derive a_d(443) from a_dg(443). The performances of deriving a_g(443) and a_d(443) using Eqs. (6)-(9) and following the above procedure are shown in Figs. 6(a) and 6(b), respectively. Although a similar overestimation is found for both a_dg(443) and a_g(443) in the lower values, all of the a_g(443) retrievals are valid (n/N = 100% and R² = 0.65) by comparing them to the available cases in NOVADv2a (with coincident measurements of a_g(443) and R_rs(λ) at the first five SeaWiFS visible bands). On the other hand, the a_d(443) retrievals from GEP algorithms and the “allocation” procedure appear to be reasonable when higher than 0.01 m⁻¹, but most are underestimated, with invalid estimates (<0) for the lower values. This indicates that even though the proposed GEP algorithms are developed empirically and individually, they are still capable of dividing the contributions between a_ph(443), a_g(443) and a_d(443), from a(443) based on the assumptions of the four-component bio-optical model. As shown on Figs. 6(a) and 6(b), such application may be limited when the derived a_g(λ) < 0.1 m⁻¹ or a_d(λ) < 0.01 m⁻¹, while a_dg(λ) is too small to be correctly allocated.

Fig. 6 Comparisons between GEP derived and known IOPs for the NOMADv2a in situ data. (a) a_g(443), (b) a_dg(443), and (c) a_dg(443) obtained through two different pathways of GEP algorithms.

Download Full Size | PDF

5.4 Comparison with model-based IOP algorithms using IOCCG synthetic data set

Table 5 summaries the performance indicators of GEP and selected inversion algorithms by comparing their retrievals to 500 synthetic and 656 in situ cases in the IOCCG data set. Because the performance (Table 5) is compared without using the same data set to tune the algorithms, it might not be an independent and unbiased evaluation of algorithm performance. Results included in this section allow for easy comparison between various existing methods and the new GEP-derived algorithms, and provides general information on the utilization of the algorithms for interested researchers. Note that the bb_p(555) is not included in the in situ data, and the comparison made to the bb_p(555) retrievals from GEP is limited to values less than 0.01 m⁻¹ in the synthetic data set, as the data set used to train and validate the GEP algorithms (Table 1) does not have any observations higher than 0.01 m⁻¹. Due to wavelength mismatch, the performance of the NN algorithm for in situ data was not applied and reported in the IOCCG Report. Additionally, the proposed GEP algorithm is the only full-empirical approach among all the algorithms compared.

Table 5. Performance of the GEP-derived and other compared inversion algorithms for IOPs, using a synthetic and an in situ data set compiled by IOCCG

View Table | View all tables in this article

When the synthetic data set is applied, both the GEP and model-based approaches can achieve good retrievals, as shown in Table 5. Notably, an earlier version of QAAv5, the QAAv4, obtains the best retrievals for nearly all categories of the IOCCG synthetic data set, whereas the GEP approach generally ranks second to fourth among the tested algorithms. As shown in Fig. 7, R_rs(λ) simulated at different sun angles may not produce retrievals that are significantly different from one another. The IOCCG synthetic data set was generated using various reasonable bio-optical models, with parameters that were mostly obtained from field measurements [34]. In contrast, the GEP-derived algorithms may include all of the possible relationships that exist between IOPs and R_rs(λ), whether they are reasonable or not. Therefore, some minor systematic deviations were found in the GEP derived retrievals, and these decreased the overall performance of GEP. As shown in Fig. 7 and the results of the type II regression, over-estimation was commonly observed for Chla, a_ph(443), a_p(443) and the lower part of the a_dg(443) retrievals, while under-estimation occurred in the higher parts of both a_dg(443) and a(443). Except for the systematic errors originating from the differences between the empirical and model-based approaches, the GEP algorithms generally perform well in dealing with the noisy synthetic data.

Fig. 7 Comparisons between GEP-algorithms derived and known IOPs for the IOCCG synthetic data. (a) Chla, (b) a_ph(443), (c) a_dg(443), (d) a_p(443), (e) a(443) and (f) bb_p(555).

Download Full Size | PDF

As shown in Table 5 (numbers in the parentheses), the performance of all inversion algorithms decreased due to the higher uncertainty in measurement errors in the in situ data set. For example, the logRMSE values of the QAAv4 increased from 2.0 (for a_ph) to 2.9 (for a) times the results of the synthetic cases. The empirical-based GEP algorithms are relatively robust, with only the a(443) having a 1.3 times higher logRMSE value, and they perform the best on the categories of a_ph and a_dg. Additionally, no system deviations were found in retrieving the 656 in situ cases using the GEP-derived algorithms, as seen in the type II regression results (Table 5) and Fig. 8. Overall, compared with algorithms that disregard several cases due some quality checking mechanisms, the performance of the GEP algorithms was evaluated for a combination of 1,156 cases (for bb_p(553) is 893). No unreasonable retrievals (i.e., negative IOPs) were found for any of the cases.

Fig. 8 Comparisons between GEP-algorithms derived and known IOPs for the IOCCG in situ data. (a) Chla, (b) a_ph(443), (c) a_dg(443), and (d) a(443).

Download Full Size | PDF

The results of the closure evaluation for IOPs obtained by GEP- algorithms are shown in Fig. 9 and Table 6. Again, the closure evaluation was only performed with the 240 available R_rs(λ) cases that had retrieved bb_p(555) values of less than 0.01 m⁻¹. The GEP were fully based on in situ data to define empirical functions, and hence the goodness of closure were highly dependent to the level of errors and uncertainties within the data. In the other hand, the model-based algorithms were algebraic or optimization solutions of simplified analytical models, which are commonly regarded as unbiased and closure-achieved. Model-based approaches such as QAA and GSM can achieve nearly 100% closure using synthetic data. As shown on Fig. 9 and Table 6, the agreement between the input reflectance values and those derived by the retrieved IOPs is reasonable for an empirical algorithm developed by in situ data. The q-indexes show that there are only three mismatched cases among the 240 test spectra, with a low average value of 1.48. These results confirm that the IOPs retrieved at a single wavelength are capable of being combined by forward models to achieve closure when bb_p(555) retrievals are valid within the region of training data.

Fig. 9 Comparison between the input reflectance values and those simulated by the retrieved IOPs at the closure evaluated wavelengths.

Download Full Size | PDF

Table 6. Summary of the closure evaluation results for 240 selected R_rs spectra in the IOCCG synthetic data set

View Table | View all tables in this article

5.5 Algorithm verification with level-2 match-up data

The satellite match-up data from NOMADv13 was applied to validate the practical application of GEP-derived algorithms, in which the retrievals were processed through R_rs(λ) from SeaWiFS imagery, rather than the ground measurements in the NOMADv2a data set. Note that a few cases were excluded in the verification when one of the R_rs(λ) values was lower than an estimated detection limit of 0.00005 sr⁻¹. Table 7 and Fig. 10 present the verification results by comparing the retrievals to the SeaWiFS match-ups. The retrievals from GEP-derived algorithms have good agreement with the match-up data, with all R² values higher than 0.8 and MAPD less than 50% on most categories, except for the HPLC-measured Chla and a_dg(443). Compared to the 248 Chla data measured on-boat using fluorometers (an average of 1.07 mg/m³), much fewer samples were analyzed by HPLC and these were collected mostly in coastal regions, where the higher concentrations (an average of 1.55 mg/m³) led to larger retrieving errors. Problems with over-estimation were consistently found for lower a_dg(443) values from all data sets used in this study. This partly indicates that the current selection of logical/mathematical expressions in GEP may be further improved for a_dg(443), from the view of symbolic regression.

Table 7. Performance of the GEP-derived inversion algorithms for satellite match-ups, using the NOMADv13 in situ data set

View Table | View all tables in this article

Fig. 10 Comparing the retrievals derived by GEP-algorithms to known SeaWiFS match-up data. (a) A combination of HPLC measured Chla and Fluorometric Chla, (b) a_ph(443), (c) a_dg(443), (d) a_p(443), (e) a(443) and (f) bb_p(555).

Download Full Size | PDF

5.6 Application of GEP-derived algorithms

Figure 11 shows the results of applying the developed inversion algorithms [Eqs. (5)-(8)] to the SeaWiFS Global Area Coverage (GAC) data. In the test scene, more than 99% of the retrieved pixel values were within the range of training/validation data and flagged as reasonable retrievals. Note that the bb_p retrievals were relatively low, at 97%, because of its narrower parameter range. The majority of not-valid pixels are located in very turbid coastal regions, due to their negative R_rs(411) or R_rs(443) values. Compared to the OC4v6, which uses a maximum R_rs(blue) function to determine the most appropriate blue-to-green ratio, the GEP algorithms developed in their current form rely more heavily on the quality of R_rs(blue). This special SeaWiFS scene taken of the Black Sea and part of the Mediterranean on May 4, 2002, and shows a very wide range of Chla concentrations and optical properties across different water bodies (Sea of Azov, Black Sea and Mediterranean). Figure 11(b) shows that the standard SeaWiFS Chla product can provide differentiation of phytoplankton blooms in the Black Sea at a gross level. However, both the Chla and a_ph(443) derived by the GEP algorithms [Figs. 11(c) and 11(d)] provide differentiation at a finer scale. For example, using the two algorithms developed in this study we are not only able to depict algal blooms close to the western shorelines of the Black Sea, but also identify the special offshore “bright blooms” [51]. It is notable that the a_p(443) algorithm improved contrast in the speckling pattern [Fig. 11(f)], which may indicate the presence of artifacts from the empirical algorithm, or may be a correct depiction of the spatial distribution of oceanic particles. The band-shift corrections [52] can be applied with respect to the NOMAD training data to allow for the application of GEP algorithms to modern sensors with different central-wavebands, e.g., MODIS and MERIS.

Fig. 11 Applications of the GEP-derived algorithms in processing Level-2 SeaWiFS Global Area Coverage data for the Black Sea and part of the Mediterranean (May 4, 2002)” (a) true color image showing bright blooms developed in Black Sea [51], (b) Chla derived by OC4v6 algorithm (standard SeaWiFS product), (c) Chla derived by Eq. (5), (d) a_ph(443) derived by Eq. (6), (e) a_dg(443) derived by Eq. (7), and (f) a_p(443) from Eq. (8).

Download Full Size | PDF

6. Concluding remarks

This work presents a new symbolic regression approach using GEP for the development of ocean-color inversion algorithms. The algorithms are trained by a subset of samples from the NOMADv2a data set, and are validated and evaluated against the remainder of NOMADv2a, a synthetic and in situ data set compiled by IOCCG, and a satellite match-up data set. The GEP-derived algorithms consist of various mathematical and logical expressions that are optimized and selected through evolutionary processes. The results of function finding indicate that a series of special GOE2 functions are the most chosen, and play a significant role in the classification of R_rs(λ) inputs and determining the algorithm formulations.

The results of comparison of GEP algorithms with other algorithms using the NOMADv2a data set demonstrate that the state-of-the-art regression approaches, e.g., GEP, GPSR, MLP and SVR, outperform the OCv6 with less training data. The SVR and GEP perform the best for the majority of cases (3 ≥ Chla > 0.25 mg/m³) and for oligotrophic waters (Chla ≤ 0.25 mg/m³), respectively. However, it is concluded that there are no significant differences between GEP, SVR, and MLP in terms of the overall performance, since these state-of-the-art data interpolation approaches all have the potential to be improved. The GEP algorithms proposed in their current forms are recommended for ocean color data processing because of the following advantages: (1) well-balanced performance across different Chla levels; (2) higher model transferability owing to their explicit nature; (3) appropriate function size compared to other GP derived algorithms, which commonly appear in the form of tree-structure functions or source code for computing; and (4) higher flexibility of model improvement, due to a wide choice of function sets and model architectures.

For IOPs retrievals, the GEP algorithms are robust and their performance is comparable to those of existing model-based approaches regarding the model-synthesized IOCCG data with added noise. Analytical or semi-analytical algorithms enable opportunities to understand the basics of ocean optics. On the other hand, simple empirical relationships extend the potential application of ocean-color products due to the easy and efficient processing ability. It might be the fundamental limitation for algorithms capable to perform complex data interpolations is that they are only valid at the data interval used for the model development. The general “out-of-range” limitation is applicable for these approaches to characterize the quality of each retrieved pixel value in ocean-color products. In this study, the statistical analysis using 100,000 simulated R_rs spectra shows that the GEP-derived algorithms are very reliable and robust at conditions similar to the NOMADv2a data set. The results of image processing show that some exceptional cases may occur in very turbid coastal regions where the R_rs(411) or R_rs(443) are negative. For those within the parameter region, but with questionable retrievals, the results of the closure evaluation provide some support for the GEP-derived IOP algorithms. The success of closure evaluation depends strongly upon the accurate field measurements in the NOMADv2a data set. However, unlike the MLP approach, which has the advantage of building bi-directional networks to derive pixel-by-pixel q-indexes as a threshold to identify not-valid values, the GEP symbolic regression is limited to only one dependent variable. To the best of my knowledge, there are no established procedures for estimating confidence or prediction intervals for symbolic regression models to prevent questionable retrievals. The strategy of assembling a set of diverse models to provide a confidence prediction [53] is thus suggested in any future GEP scheme.

As an empirically-based approach, the GEP-derived algorithms are easily utilized and fast at processing the satellite imagery. After processing by the GEP-derived algorithms, the SeaWiFS GAC image shows more details of developing algal blooms in the Black Sea, and gives more information on the distribution of particles and CDOM.

The GEP algorithms proposed in this study fully rely on training data to define the symbolic regression model, and thus it is recommended that a further study include more data for algorithm development, which can be a data set simulated by mechanistic models or a collection of accurate in situ data, particularly for higher bb_p samples.

Acknowledgments

The authors thank the Ministry of Science and Technology, Taiwan ROC, for the financial support through grants NSC-101-2221-E-006-153 and MOST-103-2221-E-006-014.

References and links

1. T. Platt, “Primary production of the ocean water column as a function of surface light intensity: algorithms for remote sensing,” Deep Sea Res. Part I Oceanogr. Res. Pap. 33(2), 149–163 (1986). [CrossRef]

2. Z. P. Lee, K. L. Carder, R. G. Steward, T. G. Peacock, C. O. Davis, and J. L. Mueller, “Remote sensing reflectance and inherent optical properties of oceanic waters derived from above-water measurements,” Proc. SPIE 2963, 160–166 (1997). [CrossRef]

3. A. Morel, “Optical modeling of the upper ocean in relation to its biogenous matter content (case I waters).,” J. Geophys. Res.-Oceans 93(C9), 10749–10768 (1988). [CrossRef]

4. S. Sathyendranath, Remote Sensing of Ocean Colour in Coastal, and Other Optically-Complex, Waters (IOCCG, 2000)

5. Z. P. Lee, K. L. Carder, and R. A. Arnone, “Deriving inherent optical properties from water color: a multiband quasi-analytical algorithm for optically deep waters,” Appl. Opt. 41(27), 5755–5772 (2002). [CrossRef] [PubMed]

6. K. L. Carder, F. R. Chen, Z. P. Lee, S. K. Hawes, and D. Kamykowski, “Semianalytic Moderate-Resolution Imaging Spectrometer algorithms for chlorophyll a and absorption with bio-optical domains based on nitrate-depletion temperatures,” J. Geophys. Res.- Oceans 104(C3), 5403–5421 (1999). [CrossRef]

7. S. A. Garver and D. A. Siegel, “Inherent optical property inversion of ocean color spectra and its biogeochemical interpretation. 1. Time series from the Sargasso Sea,” J. Geophys. Res.- Oceans 102(C8), 18607–18625 (1997). [CrossRef]

8. C. C. Liu and R. L. Miller, “A spectrum matching method for estimating the chlorophyll-a concentration, CDOM ratio and backscatter fraction from remote sensing of ocean color,” Can. J. Rem. Sens. 34(4), 343–355 (2008). [CrossRef]

9. F. Melin, “Global distribution of the random uncertainty associated with satellite-derived Chla,” IEEE Geosci. Remote Sens. 7(1), 220–224 (2010). [CrossRef]

10. Z. Lee, R. Arnone, C. Hu, P. J. Werdell, and B. Lubac, “Uncertainties of optical parameters and their propagations in an analytical ocean color inversion algorithm,” Appl. Opt. 49(3), 369–381 (2010). [CrossRef] [PubMed]

11. D. A. Toole, D. A. Siegel, D. W. Menzies, M. J. Neumann, and R. C. Smith, “Remote-sensing reflectance determinations in the coastal ocean environment: impact of instrumental characteristics and environmental variability,” Appl. Opt. 39(3), 456–469 (2000). [CrossRef] [PubMed]

12. J. Chen, W. Quan, T. Cui, Q. Song, and C. Lin, “Remote sensing of absorption and scattering coefficient using neural network model: Development, validation, and application,” Remote Sens. Environ. 149, 213–226 (2014). [CrossRef]

13. P. Cipollini, G. Corsini, M. Diani, and R. Grasso, “Retrieval of sea water optically active parameters from hyperspectral data by means of generalized radial basis function neural networks,” IEEE Trans. Geosci. Remote 39(7), 1508–1524 (2001). [CrossRef]

14. D. D’Alimonte and G. Zibordi, “Phytoplankton determination in an optically complex coastal region using a multilayer perceptron neural network,” IEEE Trans. Geosci. Remote 41(12), 2861–2868 (2003). [CrossRef]

15. I. Ioannou, A. Gilerson, B. Gross, F. Moshary, and S. Ahmed, “Neural network approach to retrieve the inherent optical properties of the ocean from observations of MODIS,” Appl. Opt. 50(19), 3168–3186 (2011). [CrossRef] [PubMed]

16. C. Jamet, H. Loisel, and D. Dessailly, “Retrieval of the spectral diffuse attenuation coefficient K-d(λ) in open and coastal ocean waters using a neural network inversion,” J. Geophys. Res.- Oceans 117(C10), C10023 (2012). [CrossRef]

17. L. E. Keiner and X. H. Yan, “A neural network model for estimating sea surface chlorophyll and sediments from thematic mapper imagery,” Remote Sens. Environ. 66(2), 153–165 (1998). [CrossRef]

18. G. Camps-Valls, L. Gomez-Chova, J. Munoz-Mari, J. Vila-Frances, J. Amoros-Lopez, and J. Calpe-Maravilla, “Retrieval of oceanic chlorophyll concentration with relevance vector machines,” Remote Sens. Environ. 105(1), 23–33 (2006). [CrossRef]

19. H. G. Zhan, P. Shi, and C. Q. Chen, “Retrieval of oceanic chlorophyll concentration using support vector machines,” IEEE Trans. Geosci. Remote 41(12), 2947–2951 (2003). [CrossRef]

20. M. S. Salama and F. Shen, “Stochastic inversion of ocean color data using the cross-entropy method,” Opt. Express 18(2), 479–499 (2010). [PubMed]

21. H. T. Shahraiyni, S. B. Shouraki, F. Fell, M. Schaale, J. Fischer, A. Tavakoli, R. Preusker, M. Tajrishy, M. Vatandoust, and H. Khodaparast, “Application of the active learning method to the retrieval of pigment from spectral remote sensing reflectance data,” Int. J. Remote Sens. 30(4), 1045–1065 (2009). [CrossRef]

22. C. H. Chang, C. C. Liu, and C. G. Wen, “Integrating semianalytical and genetic algorithms to retrieve the constituents of water bodies from remote sensing of ocean color,” Opt. Express 15(2), 252–265 (2007). [CrossRef] [PubMed]

23. H. G. Zhan, Z. P. Lee, P. Shi, C. Q. Chen, and K. L. Carder, “Retrieval of water optical properties for optically deep waters using genetic algorithms,” IEEE Trans. Geosci. Remote 41(5), 1123–1128 (2003). [CrossRef]

24. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, 1989).

25. K. Miettinen, P. Neittaanmaki, M. M. Makela, and J. Periaux, Evolutionary Algorithms in Engineering and Computer Sciences (John Wiley & Sons, 1999).

26. J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT Press, 1992).

27. C. Fonlupt, “Solving the ocean color problem using a genetic programming approach,” Appl. Soft Comput. 1(1), 63–72 (2001). [CrossRef]

28. S. Tang, C. Michel, and P. Larouche, “Development of an explicit algorithm for remote sensing estimation of chlorophyll a using symbolic regression,” Opt. Lett. 37(15), 3165–3167 (2012). [CrossRef] [PubMed]

29. C. Ferreira, “Gene expression programming: a new adaptive algorithm for solving problems,” Complex Syst. 13, 87–129 (2001).

30. M. O'Neill and C. Ryan, “Grammatical Evolution: A steady state approach,” in Late Breaking Papers at the Genetic Programming 1998 Conference, J. R. Koza, ed., (Stanford University Bookstore, 1998), pp. 180–185.

31. P. J. Werdell and S. W. Bailey, “An improved in-situ bio-optical data set for ocean color algorithm development and satellite data product validation,” Remote Sens. Environ. 98(1), 122–140 (2005). [CrossRef]

32. Z. P. Lee, Remote Sensing of Inherent Optical Properties: Fundamentals, Tests of Algorithms, and Applications (IOCCG, 2006).

33. D. Odermatt, A. Gitelson, V. E. Brando, and M. Schaepman, “Review of constituent retrieval in optically deep and complex waters from satellite imagery,” Remote Sens. Environ. 118, 116–126 (2012). [CrossRef]

34. Z. P. Lee, “Models, parameters, and approaches that used to generate wide range of absorption and backscattering spectra,” http://www.ioccg.org/groups/lee_data.pdf, accessed 2/16/15.

35. S. W. Bailey and P. J. Werdell, “A multi-sensor approach for the on-orbit validation of ocean color satellite data products,” Remote Sens. Environ. 102(1-2), 12–23 (2006). [CrossRef]

36. M. Z. Hashmi, A. Y. Shamseldin, and B. W. Melville, “Statistical downscaling of watershed precipitation using Gene Expression Programming (GEP),” Environ. Model. Softw. 26(12), 1639–1646 (2011). [CrossRef]

37. A. Efstratiadis and D. Koutsoyiannis, “One decade of multi-objective calibration approaches in hydrological modelling: a review,” Hydrol. Sci. J. 55(1), 58–78 (2010). [CrossRef]

38. E. G. Bekele and J. W. Nicklow, “Multi-objective automatic calibration of SWAT using NSGA-II,” J. Hydrol. (Amst.) 341(3-4), 165–176 (2007). [CrossRef]

39. A. H. Barnard, J. R. V. Zaneveld, and W. S. Pegau, “In situ determination of the remotely sensed reflectance and the absorption coefficient: closure and inversion,” Appl. Opt. 38(24), 5108–5117 (1999). [CrossRef] [PubMed]

40. Z. P. Lee, K. L. Carder, R. G. Steward, T. G. Peacock, C. O. Davis, and J. S. Patch, “An empirical algorithm for light absorption by ocean water based on color,” J. Geophys. Res.- Oceans 103(C12), 27967–27978 (1998). [CrossRef]

41. J. E. O’Reilly, S. Maritorena, B. G. Mitchell, D. A. Siegel, K. L. Carder, S. A. Garver, M. Kahru, and C. McClain, “Ocean color chlorophyll algorithms for SeaWiFS,” J. Geophys. Res.- Oceans 103(C11), 24937–24953 (1998). [CrossRef]

42. T. S. Moore, J. W. Campbell, and M. D. Dowell, “A class-based approach to characterizing and mapping the uncertainty of the MODIS ocean chlorophyll product,” Remote Sens. Environ. 113(11), 2424–2430 (2009). [CrossRef]

43. Gepsoft, “Getting Started with Regression,” http://www.gepsoft.com/tutorials.htm, assessed 2/15/15.

44. R. Doerffer and H. Schiller, “The MERIS case 2 water algorithm,” Int. J. Remote Sens. 28(3-4), 517–535 (2007). [CrossRef]

45. F. E. Hoge and P. E. Lyon, “Satellite retrieval of inherent optical properties by linear matrix inversion of oceanic radiance models: An analysis of model and radiance measurement errors,” J. Geophys. Res.- Oceans 101(C7), 16631–16648 (1996). [CrossRef]

46. S. Maritorena, D. A. Siegel, and A. R. Peterson, “Optimization of a semianalytical ocean color model for global-scale applications,” Appl. Opt. 41(15), 2705–2714 (2002). [CrossRef] [PubMed]

47. S. Sathyendranath and T. Platt, “Analytic model of ocean color,” Appl. Opt. 36(12), 2620–2629 (1997). [CrossRef] [PubMed]

48. C. Hu, Z. Lee, and B. Franz, “Chlorophyll a algorithms for oligotrophic oceans: A novel approach based on three-band reflectance difference,” J. Geophys. Res.- Oceans 117(C1), C01011 (2012). [CrossRef]

49. W. Shi and M. H. Wang, “Detection of turbid waters and absorbing aerosols for the MODIS ocean color data processing,” Remote Sens. Environ. 110(2), 149–161 (2007). [CrossRef]

50. R. M. Pope and E. S. Fry, “Absorption spectrum (380-700 nm) of pure water. II. Integrating cavity measurements,” Appl. Opt. 36(33), 8710–8723 (1997). [CrossRef] [PubMed]

51. G. C. Feldman, “Feature of the Day (May 4, 2002), Ocean Color Image Gallery,” http://oceancolor.gsfc.nasa.gov/cgi/image_archive.cgi, assessed 2/15/15.

52. G. Zibordi, F. Melin, and J. F. Berthon, “Comparison of SeaWiFS, MODIS and MERIS radiometric products at a coastal site,” Geophys. Res. Lett. 33(6), L06617 (2006). [CrossRef]

53. M. Kotanchek, G. Smits, and E. Vladislavleva, “Trustable symbolic regression models: using ensembles, interval arithmetic and pareto fronts to develop robust and trust-aware models,” in Genetic Programming Theory and Practice V, R. Riolo, T. Soule, and B. Worzel, eds. (Springer, 2008), pp. 201–220.

Y	Subset	N	Range	Average	s
Chla	Training Validation	605 605	0.017–65.57 0.019–70.21	2.472 2.531	5.94 6.25
a_ph(443)	Training Validation	446 445	0.002–1.48 0.002–1.46	0.074 0.073	0.142 0.136
a_dg(443)	Training Validation	394 394	0.002-1.164 0.002-1.493	0.105 0.107	0.17 0.18
a_p(443)	Training Validation	466 466	0.003–1.679 0.003–1.878	0.098 0.100	0.190 0.199
a(443)	Training Validation	400 399	0.012–2.605 0.013–2.255	0.208 0.205	0.341 0.328
bb_p(555)*	Training Validation	222 110	0.0012–0.0087 0.0013–0.0088	0.0035 0.0036	1.602 1.553

Procedure	Settings
Function set	Mathematical expressions: x + y, x-y, xy, x/y, 1/x, \|x\|, x^y, 10^x, e^x, ln(x), $\sqrt{x}$ , $x^{1 / 3}$ , x²	Logical expression: max2(x, y, z); max3(x, y, z); GOE2A: if , then x, else y; GOE2C: if , then x + y, else x-y; GOE2D: if , then xy, else x/y; GOE2E: if , then x + y, else xy.
Model architecture	Head size: 8 Max number of numerical constants in a gene (sub-ET): 8 Number of genes in a chromosome: 3 Function linking all genes (sub-ETs): Addition
Genetic operators	For one gene: Mutation rate: 0.00138 Inversion rate: 0.00546 Transposition rate: 0.00546 Recombination rate: 0.00277	For a multigenic chromosome: Gene recombination rate: 0.00277 Gene transposition rate: 0.00277
General	Number of individuals in a generation: 30 Stop condition: no change has occurred within the past 3,000,000 generations

R_rs (λ, range) (nm, sr⁻¹)	99% interval of retrieved values estimated by a lognormal distribution (percentage valid retrievals)
R_rs (λ, range) (nm, sr⁻¹)	Chla (mg/m³)	a_ph(443) (m⁻¹)	a_dg(443) (m⁻¹)	a_p(443) (m⁻¹)	a(443) (m⁻¹)	bb_p(555) (m⁻¹)
411, 1.4E-4–1.2E-1 443, 2.5E-4–6.4E-2 489, 4.5E-4–4.4E-2 510, 4.5E-4–3.0E-2 555, 1.1E-4–7.4E-2	0.015– 22.01 (99.9%)	0.0020– 0.5423 (100.0%)	0.0037– 0.5591 (99.8%)	0.0025– 0.7581 (99.2%)	0.0093– 1.0920 (99.8%)	0.0005– 0.0175 (95.0%)

Product	Model	n	Tran./Val. R²	Tran./Val. logRMSE	Intercept	Slope	Tran./Val. MAPD
Chla	GEP	1210	0.907/0.882	0.219/0.248	−0.012	0.959	41.7/47.8
	OC4v6		0.868	0.261	−0.017	0.960	50.2
	GPSR		0.880	0.250	0.010	0.962	51.4
	OC4x		0.859/0.854	0.269/0.276	−0.032	0.915	51.1/49.4
	MLP		0.889/0.883	0.238/0.246	−0.026	0.942	45.6/45.9
	SVR		0.909/0.889	0.217/0.241	−0.030	0.926	41.3/44.2
Chla ≤ 0.25	GEP	402	0.704/0.597	0.154/0.173	0.100	1.054	32.8/36.9
	OC4v6		0.618	0.173	0.161	1.132	36.0
	GPSR		0.597	0.182	0.204	1.168	39.2
	OC4x		0.593/0.500	0.196/0.199	0.193	1.132	44.1/44.9
	MLP		0.660/0.603	0.173/0.176	0.192	1.151	36.2/38.0
	SVR		0.683/0.579	0.163/0.182	0.136	1.081	35.3/40.2
0.25 < Chla ≤ 3	GEP	607	0.627/0.561	0.236/0.260	0.031	1.401	48.5/55.4
	OC4v6		0.548	0.275	0.059	1.509	60.7
	GPSR		0.550	0.261	0.086	1.359	61.0
	OC4x		0.550/0.504	0.259/0.263	0.032	1.371	56.3/52.5
	MLP		0.598/0.549	0.238/0.251	0.027	1.336	50.8/50.2
	SVR		0.624/0.579	0.215/0.233	0.009	1.232	43.5/45.6
a_ph(443)	GEP	891	0.863	0.186	−0.088	0.941	37.1
a_ph(443)	QAAv5	887	0.029	0.751	−0.745	0.128	641
a_dg(443)	GEP	788	0.769	0.249	−0.172	0.866	48.0
a_dg(443)	QAAv5	746	0.617	0.615	0.265	1.523	58.8
a(443)	GEP	799	0.896	0.146	−0.077	0.922	24.7
a(443)	QAAv5	799	0.866	0.170	−0.114	0.922	24.4
bb_p(555)	GEP	332	0.801	0.088	−0.324	0.873	15.0
bb_p(555)	QAAv5	328	0.488	0.196	1.488	1.587	28.0
a_p(443)	GEP	932	0.888	0.171	−0.069	0.948	32.1

IOP	Algorithm	n	R²	logRMSE	Intercept	Slope
a_ph(λ)	GEP	500 (656)	0.95 (0.80)	0.207 (0.218)	0.314 (0.143)	1.143 (1.091)
	QAAv4	476 (630)	0.93 (0.59)	0.160 (0.321)	0.044 (0.033)	1.008 (1.067)
	NN	500	0.94	0.271	0.407	1.163
	LMI	484 (642)	0.88 (0.65)	0.222 (0.332)	0.053 (0.336)	0.989 (1.208)
	GSM	479 (646)	0.96 (0.74)	0.173 (0.350)	0.162 (0.029)	1.171 (1.175)
	SARA	492 (656)	0.83 (0.65)	0.288 (0.442)	0.145 (0.654)	1.156 (1.537)
a_dg(λ)	GEP	500 (656)	0.93 (0.82)	0.219 (0.205)	−0.140 (−0.225)	0.826 (0.870)
	QAAv4	476 (630)	0.99 (0.79)	0.093 (0.221)	−0.032 (−0.087)	0.960 (0.989)
	NN	500	0.96	0.23	0.21	0.959
	LMI	484 (642)	0.96 (0.65)	0.161 (0.325)	0.069 (−0.142)	1.052 (1.007)
	GSM	479 (646)	0.97 (0.80)	0.145 (0.246)	0.036 (0.003)	1.053 (1.084)
	SARA	492 (491)	0.87 (0.38)	0.348 (0.470)	−0.119 (0.416)	1.090 (1.312)
a(λ)	GEP	500 (656)	0.97 (0.82)	0.139 (0.184)	−0.040 (0.004)	0.880 (0.969)
	QAAv4	500 (656)	0.99 (0.84)	0.060 (0.175)	−0.004 (−0.081)	0.971 (0.969)
	NN	500	0.99	0.052	−0.009	1.006
	LMI	484 (642)	0.96 (0.87)	0.14 (0.163)	0.067 (−0.082)	1.036 (0.96)
	GSM	479 (646)	0.97 (0.84)	0.115 (0.223)	0.032 (−0.034)	1.068 (1.097)
	SARA	500 (656)	0.96 (0.76)	0.166 (0.218)	−0.122 (0.011)	0.974 (1.048)
bb_p(λ)	GEP	240	0.92	0.104	−0.603	0.763
	QAAv4	500	0.99	0.079	−0.063	0.983
	NN	500	0.98	0.082	−0.038	0.993
	LMI	484	0.94	0.149	0.008	1.018
	SARA	500	0.94	0.160	−0.093	0.981

Development of ocean color algorithms for estimating chlorophyll-a concentrations and inherent optical properties using gene expression programming (GEP)

Abstract

1. Introduction

2. Data sets

3. Symbolic regression

3.1 GEP based symbolic regression

3.2 Development of ocean-color retrieval algorithms using GEP

4. Performance and model-closure evaluation

5. Results and discussion

5.1 The GEP-derived ocean-color inversion models

5.2 The limitation of GEP-derived algorithms

5.3 Performance of GEP models for the NOMADv2a in situ bio-optical data set

5.4 Comparison with model-based IOP algorithms using IOCCG synthetic data set

5.5 Algorithm verification with level-2 match-up data

5.6 Application of GEP-derived algorithms

6. Concluding remarks

Acknowledgments

References and links

Cited By

Figures (11)

Tables (7)

Equations (11)

Optics Express

	R_rs(410)	R_rs(440)	R_rs(490)	R_rs(510)	R_rs(555)	R_rs(λ)
MAPD (%)	23	21	17	15	14	18
R²	0.92	0.94	0.89	0.84	0.87	0.97
RMSE	0.0031	0.0021	0.0013	0.0008	0.0005	0.0014

	Fluorometric Chla	HPLC chla	a_ph(443)	a_dg(443)	a_p(443)	a(443)	bb_p(555)
n	248	30	123	111	123	111	28
R²	0.832	0.920	0.839	0.805	0.838	0.877	0.825
logRMSE	0.234	0.324	0.312	0.246	0.190	0.165	0.071
MAPD (%)	37.0	52.5	46.2	55.4	36.7	32.7	10.7