Fast discrimination of traditional Chinese medicine according to geographical origins with FTIR spectroscopy and advanced pattern recognition techniques

Ning Li; Yan Wang; Kexin Xu

doi:10.1364/OE.14.007630

1. Introduction

In traditional Chinese medicine, danshen is often used for promoting coronary circulation and has become an important component in curing coronary heart diseases.

The quality and efficacy of danshen samples are somewhat different owing to growing conditions in the area of geographical origin. Trueborn traditional medicine has the best pharmacological effect and its growing area is called the “trueborn area.” The trueborn areas of danshen are still being researched. It is essential to classify the origin of samples for determining the trueborn area and for selecting specific danshen for curing diseases.

Currently the main method for identifying various samples of danshen is chromatography, which evaluates medicine by the content of one or several of the most effective components. However, based on the holistic theory of Chinese pharmacology, medicinal materials take effect in curing diseases as a whole. Therefore, any method or experiment that destroys the wholeness of traditional Chinese medicines will not be fundamentally accepted.

Infrared spectroscopy is a promising candidate for determining danshen origins because it is fast, accurate, and nondestructive. Y. A. Woo et al. ^[1] discriminated herbal medicines by near-infrared spectroscopy using two methods, the Mahalanobis distance method and discriminant PLS2. R. Hua et al. ^[2] discriminated fritillary by two-dimensional correlation IR spectroscopy.

In this paper danshen is treated as an integral system, and comprehensive information is extracted for analysis. We first applied PCA to ascertain whether discrimination is possible with infrared transmission spectroscopy. Then two pattern recognition techniques, SIMCA ^[3] and ANN ^[4,5], were used to study the spectral features of 53 samples from different geographical areas in China. The objective of this study was to find a new method to effectively discriminate danshen according to geographical origin.

2. Experimental setup

2.1 Sample preparation

Fifty-three samples of danshen were collected from four places in China. Their source areas and corresponding serial numbers are listed in Table 1. Geographically, Shanxi province is a little farther away from the others, while Hebei, Tianjin, and Shandong are adjacent or near to each other. We can therefore conjecture that danshen from Shanxi differs somewhat in compound content and molecular structure from danshen from the other areas.

Table 1. Serial numbers and source areas of danshen samples.

2.2 Experimental instrument and parameters

In the experiment, samples were ground into powder and sifted. After being mixed with KBr powder, the samples were pressed into tablets (0.3 mm thick and 13 mm in diameter). A Spectrum GX1 Fourier transform spectrometer (Perkin Elmer, UK) with a mid-infrared DTGS detector was used to scan the spectra of the tablets. The spectra were recorded from 400–4000 cm^-1 at a resolution of 4 cm^-1, and 16 scans were taken for each sample.

3. Results and discussion

3.1 Effect of spectra preprocessing on clustering analysis

As shown in Fig. 1(a), the original spectra of danshen from different areas span 400–4000 cm^-1. Before further analysis the spectra are preprocessed, a step that plays an important role in pattern recognition because it can reduce or nearly eliminate the disturbance caused by random noise and sample background. One preprocessing method, multiplicative scatter correction (MSC), is used in our research to reduce the effect of scattering and nonuniform particle size on the spectra.

MSC is carried out in three steps. First, the mean spectrum \bar{M} is computed by Eq. (1)

\bar{M}_{j} = (1 / n) \sum_{i = 1}^{n} M_{ij} .

Second, each spectrum is regressed on the mean spectrum by least squares. The slope coefficient \hat{a}_{i} and intercept \hat{b}_{i} of sample i are obtained from Eq. (2)

M_{ij} = \hat{a}_{i} \bar{M}_{j} + \hat{b}_{i} .

Finally, the corrected spectrum is obtained by Eq. (3)

M_{ij (MSC)} = (M_{ij} - \hat{b}_{i}) / \hat{a}_{i} .

In these equations, M is the spectral matrix of the samples, M_{ij} is the value of the spectrum of sample i at wavenumber j, n is the number of samples, and \bar{M} is the mean spectrum, while \hat{a}_{i} is the slope coefficient and \hat{b}_{i} is the intercept fitted for sample i.
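The three MSC steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in this work: the function name `msc`, the synthetic spectra, and the use of `np.polyfit` for the least-squares fit are assumptions made for the example, and one slope and one intercept are fitted per spectrum.

```python
import numpy as np

def msc(spectra):
    """Multiplicative scatter correction following Eqs. (1)-(3).

    spectra: array of shape (n_samples, n_wavenumbers).
    """
    spectra = np.asarray(spectra, dtype=float)
    mean_spectrum = spectra.mean(axis=0)           # Eq. (1)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Eq. (2): least-squares fit of row = a * mean_spectrum + b
        a, b = np.polyfit(mean_spectrum, row, deg=1)
        corrected[i] = (row - b) / a               # Eq. (3)
    return corrected

# Synthetic check (not measured data): one band shape distorted by a
# different scale and offset per sample, mimicking scatter effects.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
scales = np.array([0.8, 1.0, 1.3])
offsets = np.array([0.1, -0.2, 0.05])
raw = scales[:, None] * base + offsets[:, None]
fixed = msc(raw)
```

Because the synthetic spectra differ only by a scale and an offset, the corrected rows all collapse onto the mean spectrum, which is the kind of coherence visible in Fig. 1(b).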

Figures 1(a) and 1(b) show the danshen spectra before and after MSC, respectively. After MSC the spectra are far more consistent with one another, which indicates that the effect of scattering has been reduced to some extent.

Fig. 1. Spectra of danshen before (a) and after (b) MSC process.

3.2 Spectral analysis

The main chemical constituents of danshen include fat-soluble compounds and some water-soluble compounds; they contain C-H, N-H, and O-H groups. As shown in Fig. 1(b), there is an absorption band from about 3300 to 3500 cm^-1, which corresponds to the characteristic region of amines (N-H) and is the protein region. From about 1650 to 1750 cm^-1 there is a strong absorption peak, which corresponds to carbonyl (C=O). The region below 1500 cm^-1 is the fingerprint region of the IR spectrum. Absorption peaks in this region can hardly be assigned to specific bonds; however, they are sensitive to changes in molecular structure.

In our research we chose the 1000 to 1200 cm^-1 part of the fingerprint region for analysis for two reasons. First, stretching vibrations of single bonds, including C-C, C-O, C-N, and C-X, and bending vibrations of C-H and O-H occur in this region. For danshen, the content of water and water-soluble compounds differs with geographical origin, so we presume that more information can be extracted from this region. Second, the absorption spectra of the danshen samples from 400–1400 cm^-1, shown in Fig. 2, differ more obviously between source areas. These differences are the basis of pattern recognition.

In Fig. 2, the absorption peaks of all the spectra appear at almost the same wavenumbers. Samples from different areas can be classified by the different intensities and shapes of the peaks. We took the absorbance at intervals of four wavenumbers (4 cm^-1) as the input for pattern recognition.
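As an illustration of how such an input vector might be assembled, the sketch below interpolates one spectrum onto a regular 4 cm^-1 grid inside the 1000–1200 cm^-1 window. The function name `select_band`, the 1 cm^-1 spacing of the synthetic spectrum, and the use of linear interpolation are assumptions made for the example; the text does not state how the points were extracted.

```python
import numpy as np

def select_band(wavenumbers, absorbance, low=1000.0, high=1200.0, step=4.0):
    """Interpolate a spectrum onto a regular grid inside [low, high] cm^-1."""
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    absorbance = np.asarray(absorbance, dtype=float)
    grid = np.arange(low, high + step / 2, step)
    order = np.argsort(wavenumbers)     # np.interp needs increasing x values
    return grid, np.interp(grid, wavenumbers[order], absorbance[order])

# Synthetic spectrum (not measured data): a single band centred at 1100 cm^-1,
# stored from high to low wavenumber as FTIR files often are.
wn = np.arange(4000.0, 399.0, -1.0)
ab = np.exp(-((wn - 1100.0) / 50.0) ** 2)
grid, features = select_band(wn, ab)
```

With these settings each spectrum is reduced to 51 absorbance values, which then form one row of the data matrix passed to the pattern recognition step.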

Fig. 2. Absorption spectra of danshen (400~1400 cm^-1) from different source areas.

3.3 Clustering by PCA

The main objective of PCA is to reduce the dimension of the data matrix so that useful information can be extracted from overlapping chemical information. The principal components can be regarded as the projection of the original data matrix X onto a new space; this projection is called the score matrix. After MSC preprocessing and standardization of X, the data matrix Z is obtained.

By decomposing the covariance matrix of Z, the principal components are obtained, and from them the cumulative contribution and the score matrix. The score plot for the original spectra is shown in Fig. 3(a), where the serial numbers correspond to those listed in Table 1. When the spectral data processed by MSC are taken as the input of PCA, the score plot in Fig. 3(b) is obtained. In it, as we expected, samples from Shanxi province are easily separated from the others.
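The decomposition described above can be sketched as follows. This is an illustrative NumPy version, not the software used in this work: the function name `pca_scores`, the choice of two components, and the two synthetic groups standing in for the spectra are all assumptions.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return the score matrix and the cumulative contribution of the PCs."""
    X = np.asarray(X, dtype=float)
    # Standardization: zero mean and unit variance for every variable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # largest variance first
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    scores = Z @ eigvecs[:, :n_components]       # projection onto the PCs
    return scores, cumulative

# Synthetic data (not measured spectra): a large group and a small shifted group.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.1, size=(20, 6))
group_b = rng.normal(0.0, 0.1, size=(5, 6)) + 1.0
scores, cumulative = pca_scores(np.vstack([group_a, group_b]))
```

Plotting the first score column against the second gives a score plot of the kind shown in Fig. 3; in this synthetic case the two groups separate along the first principal component.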

Fig. 3. Score plot of PCA result before (a) and after (b) MSC.

In Fig. 3(b), a dividing line can be drawn between two geographical groups. Shandong, Hebei, and Tianjin are located near each other and share similar growing conditions, so the content and structure of the medicine from these areas are similar, and their samples fall into one group. The result is consistent with our conjecture that the compound content and molecular structure of danshen from Shanxi differ from those of samples from the other areas. The influence of scattering is reduced by the MSC correction. From this analysis we conclude that traditional Chinese medicine can be discriminated by mid-infrared transmission spectroscopy and PCA.

3.4 Clustering analysis by SIMCA

PCA allows us to divide the samples roughly into groups through the score plot, but for practical use the identification should be quantitative and more accurate. The pattern recognition technique SIMCA is therefore introduced. It is based on the following assumption: samples of the same type have similar characteristics, so they gather in a particular region when projected into an eigenspace, while samples of different types do not gather in this way. A class model for each type of specimen in the training set is built by PCA, with the optimal number of principal components for each model chosen by cross validation. Then the SIMCA distances between the samples in the test set and each class model are calculated. According to these distances, each sample is classified into one known class, several known classes, or a new unknown class ^[6].
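The idea of one PCA model per class and a distance-based assignment can be sketched as below. This is a deliberately simplified version and not the SIMCA implementation used in this work: the class name `ClassModel`, the function `simca_assign`, the fixed acceptance ratio of 3 (in place of the F-test and cross validation normally used), and the synthetic training data are all assumptions.

```python
import numpy as np

class ClassModel:
    """PCA model of one class, as used in a simplified SIMCA."""

    def __init__(self, X, n_components):
        X = np.asarray(X, dtype=float)
        self.mean = X.mean(axis=0)
        # Rows of vt are the principal directions of the centred class data.
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.loadings = vt[:n_components]
        # Typical residual size of the training samples themselves.
        self.s0 = np.sqrt(np.mean(self.residual(X) ** 2))

    def residual(self, X):
        """Distance of each sample from the class PCA subspace."""
        centred = np.atleast_2d(X) - self.mean
        projected = centred @ self.loadings.T @ self.loadings
        return np.linalg.norm(centred - projected, axis=1)

def simca_assign(models, x, ratio=3.0):
    """Return the labels of every class whose model accepts sample x."""
    return [label for label, m in models.items()
            if m.residual(x)[0] <= ratio * m.s0]

# Synthetic training data (not measured spectra): each class varies along
# its own direction around its own centre, plus small noise.
rng = np.random.default_rng(1)

def make(n, centre, direction):
    t = rng.normal(size=(n, 1))
    return centre + t * direction + rng.normal(0.0, 0.05, size=(n, 5))

train_a = make(30, np.zeros(5), np.array([1.0, 0.0, 0.0, 0.0, 0.0]))
train_b = make(30, np.full(5, 3.0), np.array([0.0, 1.0, 0.0, 0.0, 0.0]))
models = {"A": ClassModel(train_a, 1), "B": ClassModel(train_b, 1)}

fit_a = simca_assign(models, np.array([0.5, 0.0, 0.0, 0.0, 0.0]))
fit_b = simca_assign(models, np.array([3.0, 3.7, 3.0, 3.0, 3.0]))
fit_none = simca_assign(models, np.full(5, -4.0))
```

The returned list may contain one label, several labels, or none, which corresponds to the three outcomes named above: one known class, several known classes, or a new unknown class.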

Of the 53 samples, 36 are selected to build the class models, while the remaining 17 form the test set. Samples from Shanxi are placed in group “B” and samples from the other locales in group “A”. The classification results of SIMCA with and without MSC preprocessing are shown in Table 2.

Table 2. Classification result of SIMCA.

As shown in Table 2, the correct classification rate of SIMCA without MSC reaches only 71% (12 of 17). When the MSC-treated data are used, a better correct identification rate of 82% (14 of 17) is obtained. Combined with MSC preprocessing, SIMCA can be a good method for identifying the geographical source of a danshen sample.
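The rates quoted from Table 2 can be reproduced with a short calculation. In the sketch below the lists are transcribed from Table 2; counting a sample as a hit only when it is assigned to exactly its actual group, so that the ambiguous entries "A or B" and "A,B" are not hits, is an assumption, although it is the reading that reproduces the reported counts. The helper name `hit_rate` is hypothetical.

```python
# Entries transcribed from Table 2; "AB" marks an ambiguous assignment.
AB = "AB"
actual = list("ABAAAAAAAABAAAAAA")
raw_pred = [AB, AB, "A", "A", "A", "B", "A", "A", AB, "A", AB] + ["A"] * 6
msc_pred = [AB, "B", "A", "A", "A", AB, "A", "A", AB, "A", "B"] + ["A"] * 6

def hit_rate(pred, truth):
    """Count unambiguous correct assignments and return (hits, rate)."""
    hits = sum(p == t for p, t in zip(pred, truth))
    return hits, hits / len(truth)

raw_hits, raw_rate = hit_rate(raw_pred, actual)
msc_hits, msc_rate = hit_rate(msc_pred, actual)
print(f"original: {raw_hits}/17 = {raw_rate:.0%}")
print(f"with MSC: {msc_hits}/17 = {msc_rate:.0%}")
```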

3.5 Clustering analysis by ANN

A back-propagation (BP) network is a kind of ANN that must be trained before it is applied. For each input, the discrepancy between the output and the expected target is calculated, and the network adjusts itself until this discrepancy reaches a minimum ^[7]. In this paper, a BP network is designed, and its flow chart is shown in Fig. 4.

Before being applied to the identification of the source areas of danshen, the network is trained on the training set. Thirty-six samples are selected as the training set, and the other 17 are designated as the test set.

The network parameters are determined by calculation. All raw spectra are treated by MSC, and the score matrix of PCA is then taken as the input of the BP network. The principal component vectors are orthogonal and uncorrelated with each other, which meets the feature optimization requirements of the BP network well. Moreover, PCA allows a few principal components to be extracted as inputs, which greatly simplifies the structure of the BP network and reduces computation.

The BP network is trained on the training set according to the flow process. Samples from Shanxi province are placed in group “B” and the other samples in group “A”. The predicted groups are compared with the actual groups of the test set, and they match completely. We therefore conclude that the BP network can be applied in practice to classify the geographical origins of traditional Chinese medicine.
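A small BP network of this kind can be sketched in NumPy as follows. The network size, learning rate, number of epochs, and the cross-entropy loss with a sigmoid output are assumptions chosen for the illustration, since the architecture is not specified in the text, and the two-dimensional "scores" are synthetic stand-ins for the PCA scores of the 53 samples.

```python
import numpy as np

def train_bp(X, y, hidden=4, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network by back-propagation."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    y = y.reshape(-1, 1).astype(float)
    for _ in range(epochs):
        # Forward pass.
        h = np.tanh(X @ W1 + b1)
        out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
        # Backward pass: propagate the output error to both layers.
        d_out = (out - y) / len(X)        # gradient for sigmoid + cross-entropy
        d_h = (d_out @ W2.T) * (1.0 - h ** 2)
        W2 -= lr * h.T @ d_out
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h
        b1 -= lr * d_h.sum(axis=0)
    return W1, b1, W2, b2

def predict(params, X):
    """Return group labels, "B" where the network output exceeds 0.5."""
    W1, b1, W2, b2 = params
    out = 1.0 / (1.0 + np.exp(-(np.tanh(X @ W1 + b1) @ W2 + b2)))
    return np.where(out.ravel() > 0.5, "B", "A")

# Synthetic scores (not measured data): 53 points in two separated clusters,
# split into 36 training and 17 test samples as in the text.
rng = np.random.default_rng(42)
scores_a = rng.normal([-1.0, 0.0], 0.3, size=(40, 2))
scores_b = rng.normal([2.0, 0.5], 0.3, size=(13, 2))
X = np.vstack([scores_a, scores_b])
y = np.array([0] * 40 + [1] * 13)
idx = rng.permutation(53)
train, test = idx[:36], idx[36:]
params = train_bp(X[train], y[train])
pred = predict(params, X[test])
truth = np.where(y[test] == 1, "B", "A")
```

Feeding the network a handful of PCA scores instead of full spectra keeps the input layer small, which is the simplification described above.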

Fig. 4. Flow process of BP artificial neural network.

4. Conclusions

With little separation or extraction required, mid-infrared transmission spectroscopy provides integral information regarding traditional Chinese medicine. IR transmission spectroscopy combined with a pattern recognition technique can be a feasible and rapid way to discriminate the origins of Chinese medicine. It provides a noninvasive way to evaluate traditional Chinese medicine, and can also accelerate the modernization of Chinese medicine.

References and links

1. Y. A. Woo, H. J. Kim, J. H. Cho, and H. Chung, “Discrimination of herbal medicines according to geographical origin with near infrared reflectance spectroscopy and pattern recognition techniques,” J. Pharm. Biomed. Anal. 21, 407–413 (1999).

2. R. Hua, S. Sun, Q. Zhou, I. Noda, and B. Wang, “Discrimination of fritillary according to geographical origin with Fourier transform infrared spectroscopy and two-dimensional correlation IR spectroscopy,” J. Pharm. Biomed. Anal. 33, 199–209 (2003).

3. L. Xu and X. Shao, Methods of Chemometrics (Science Press, Beijing, 2004), Ch. 1, 3.

4. T. Aoyama, Y. Suzuki, and H. Ichikawa, “Neural networks applied to quantitative structure-activity relationship analysis,” J. Med. Chem. 33, 2583–2590 (1990).

5. P. J. Gemperline, J. R. Long, and V. G. Gregoriou, “Nonlinear multivariate calibration using principal components regression and artificial neural networks,” Anal. Chem. 63, 2313–2323 (1991).

6. S. Sun, J. Tang, Z. Yuan, and Y. Bai, “FTIR and classification study on trueborn tuber dioscoreae samples,” Chin. J. Spectrosc. Spectral Anal. 23, 258–261 (2003).

7. X. Yang and J. Zheng, Artificial neural network and blind signal processing (Tsinghua University Press, Beijing, 2003).

Test sample	Classification by original spectrum	Classification by spectrum with MSC processing	Actual group
1	A or B	A or B	A
2	A,B	B	B
3	A	A	A
4	A	A	A
5	A	A	A
6	B	A,B	A
7	A	A	A
8	A	A	A
9	A,B	A,B	A
10	A	A	A
11	A,B	B	B
12	A	A	A
13	A	A	A
14	A	A	A
15	A	A	A
16	A	A	A
17	A	A	A
Number of hits	12	14	–
Correct identification rate	12/17 = 71%	14/17 = 82%	–

Optics Express

Source area	Serial numbers
Hebei	1–2, 15–17, 40–42
Shanxi	3–4, 29–31
Tianjin	5–8, 25–26, 27–28, 48–50
Shandong	9–14, 18–24, 32–39, 43–47, 51–53