
Development of a multilingual digital signage system using a directional volumetric display and language identification

Open Access

Abstract

In a previous study, we developed a directional volumetric display that handles multiple directional images. To realize a multilingual digital signage system, we implement a language identification function on the directional volumetric display. The language identification (English, Spanish, or French) is performed using a convolutional neural network, which discriminates the three languages with 91.9% accuracy. By combining the language identification function with the directional volumetric display, an image can be directed toward a specific speaker. The result is a multilingual digital signage system that adapts the image presentation to the spoken language.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

In recent years, image display in three-dimensional (3-D) space has been actively studied [1–4]. Among the various 3-D display technologies is the volumetric display, which presents 3-D images on a medium with depth. As the display medium, researchers have employed light-emitting diodes [5], bubbles [6], and other media.

Our group has proposed a technology that records multiple images in the same space using a volumetric display [7,8]. Based on this technology, we developed a directional volumetric display in which the presented images depend on the direction of observation [9]. The directional images can be displayed in any direction. Furthermore, because each image can be recognized as meaningful only from its display direction, the directional volumetric display can present different images to multiple observers at different observation positions. The observers’ positions are detected by a motion sensor [10].

Alongside these developments, parallel computation on graphics processing units and tensor processing units, together with machine learning frameworks such as TensorFlow [11], has advanced in recent years. These techniques are increasingly applied in deep learning, where they improve accuracy by deepening the network structure [12]. In particular, the advent of voice assistant systems such as Microsoft Cortana and Amazon Alexa has sparked much interest in speech recognition systems based on deep learning models [13]. As pre-processing for speech recognition, many researchers have employed language identification techniques that extract features such as prosody and phonemes from speech data and thereby classify the spoken language. Typical language identification methods employ a convolutional neural network (CNN) or a recurrent neural network (RNN). A CNN is a multilayered neural network comprising convolutional layers that extract features of the input data and pooling layers that reduce dimensions while maintaining translation invariance [14]. CNNs are garnering attention in image processing tasks that address spatial data, such as handwritten character recognition [15] and object recognition [16], because they extract features while preserving the translation invariance of the input data. However, overfitting is likely to occur when the training dataset is small relative to the complexity of the CNN architecture [17]. In contrast, an RNN is a neural network with an autoregressive structure that temporarily stores the current state in a hidden layer and propagates past states to subsequent states [18]. RNNs are attracting attention in natural language processing tasks that address time-series data, such as machine translation [19] and speech recognition [20], because past states propagate to the present state. However, the learning speed decreases as the number of hidden layers increases [21].

CNN-based language identification models include the five-layer CNN proposed by Shyamapada Mukherjee et al. [22], which distinguishes three languages (German, English, and Spanish) with 92.7% accuracy. Himadri Mukherjee et al. [23] distinguished seven languages (Bangla, Marathi, Telugu, Tamil, Malayalam, Kannada, and Hindi) with 95.5% accuracy using a two-layer CNN. Among the RNN-based language identification models, Bartz et al. [25] distinguished four languages (English, German, French, and Spanish) with 91.0% accuracy using a five-layer CNN combined with long short-term memory. Sarthak et al. [24] identified six languages (English, French, German, Spanish, Russian, and Italian) with 95.4% accuracy using four CNN blocks and a bi-directional gated recurrent unit.

In this study, we combine the directional volumetric display with a language identification function to develop a multilingual digital signage system that directs images toward a speaker of a specific language at a specific position.

2. Development method of a multilingual digital signage system

2.1 Architecture of language identification model

The system is designed to distinguish three languages: English, Spanish, and French. We employ the speech corpus provided by VoxForge [26] as the dataset and randomly extract 22,000 wav-format speech samples, sampled at 8 kHz, per language (66,000 speech samples in total). From the speech samples of each language, we randomly select 15,000 samples as training data, 3,500 samples as validation data, and 3,500 samples as test data. The total speech corpus is thus divided into 45,000 training samples (68.2%), 10,500 validation samples (15.9%), and 10,500 test samples (15.9%).
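As a concrete illustration, the following is a minimal sketch of this per-language split; the directory layout, file naming, and use of Python's random module are assumptions for illustration, not the authors' actual tooling.

```python
import random
from pathlib import Path

# Hypothetical layout: one folder of 8 kHz WAV files per language.
CORPUS_DIR = Path("voxforge_8khz")                            # assumed corpus location
LANGUAGES = ["english", "spanish", "french"]
SPLITS = {"train": 15000, "validation": 3500, "test": 3500}   # per language, as in the text

random.seed(0)
dataset = {name: [] for name in SPLITS}
for lang in LANGUAGES:
    files = random.sample(sorted((CORPUS_DIR / lang).glob("*.wav")), 22000)
    start = 0
    for name, count in SPLITS.items():
        dataset[name] += [(f, lang) for f in files[start:start + count]]
        start += count

# Totals: 45,000 training (68.2%), 10,500 validation (15.9%), 10,500 test (15.9%).
for name, items in dataset.items():
    print(name, len(items))
```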

The speech features are extracted from a log-mel spectrogram. For learning purposes, the spectrogram features are converted to (128 × 110)-pixel binary images. Our language identification model comprises four CNN blocks and a fully connected layer, including the output layer. The CNN-based language identification model is shown in Fig. 1 and configured as shown in Table 1.

Fig. 1. CNN-based language identification model.

Table 1. Configuration of the CNN-based language identification model.

Each CNN block performs four processes: feature extraction by a convolutional layer, batch normalization of each mini-batch, activation by a rectified linear unit, and feature-map reduction by a max-pooling layer. In the fully connected layer, we introduce a dropout layer [27] to improve the generalization performance. The classification is based on the features extracted through the four CNN blocks, and the probability of each language is estimated by the softmax function. The model is built and trained using the neural network library Keras with a TensorFlow backend [11]. The batch size is 64, the number of epochs is 30, and the dropout ratio is 0.20. The optimization is performed by Adam [28] with a learning rate of 1.0 × 10−4 and a weight decay rate of 1.0 × 10−3.
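For illustration, a minimal Keras sketch of such a model is given below. The filter counts, kernel sizes, and the width of the fully connected layer are assumptions (the exact values are those listed in Table 1), and applying the 1.0 × 10−3 weight decay as L2 regularization is likewise an assumption.

```python
from tensorflow.keras import layers, models, optimizers, regularizers

def cnn_block(x, filters):
    """One CNN block: convolution -> batch normalization -> ReLU -> max pooling."""
    x = layers.Conv2D(filters, (3, 3), padding="same",
                      kernel_regularizer=regularizers.l2(1e-3))(x)  # assumed form of the weight decay
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return layers.MaxPooling2D((2, 2))(x)

inputs = layers.Input(shape=(128, 110, 1))              # (128 x 110)-pixel spectrogram image
x = inputs
for filters in (16, 32, 64, 128):                       # assumed filter progression
    x = cnn_block(x, filters)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)             # assumed width of the fully connected layer
x = layers.Dropout(0.20)(x)                             # dropout ratio 0.20
outputs = layers.Dense(3, activation="softmax")(x)      # English / Spanish / French

model = models.Model(inputs, outputs)
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=64, epochs=30, validation_data=(x_val, y_val))
```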

2.2 Speech extraction method

The average reproduction times of the English, Spanish, and French speech samples were 4.95, 8.04, and 5.97 s, respectively (with standard deviations of 1.75, 2.34, and 1.34 s, respectively); the average over all three languages was 6.32 s (standard deviation 2.26 s). To improve the identification accuracy of the language identification model, the speech features are extracted from the collected speech samples through pre-processing. First, the reproduction time is edited to accommodate the inconsistent reproduction times of the speech samples in different languages. In this process, the average reproduction time (6.32 s) is rounded up to 7.00 s for training. If the reproduction time of a speech sample exceeds 7.00 s, the data beyond 7.00 s are truncated; if it is shorter than 7.00 s, the sample is repeated until the reproduction time reaches 7.00 s. Next, the speech features are extracted from the samples. In speech recognition studies, speech features are commonly extracted as mel-frequency cepstral coefficients (MFCCs), which convert acoustic signals into features that account for human frequency perception. The procedure for deriving MFCCs is as follows [29–31]:

  • 1. Apply the short-time Fourier transform with a Hamming window to the voice signal to obtain the magnitude spectrum.
  • 2. To map the magnitude spectrum to the mel scale that matches the human auditory characteristics, employ the mel filter bank and obtain the mel spectrogram.
  • 3. Obtain the discrete cosine transform (DCT) of the mel spectrogram to obtain MFCC.
However, in deep learning, the spatial features are lost by the DCT used in the conversion to MFCCs; therefore, the log-mel spectrogram, which omits the DCT step, is used instead [32]. In this study, the speech features were extracted as a log-mel spectrogram and converted into a (128 × 110)-pixel binary image for training. The pre-processing of the speech samples is outlined in Fig. 2.
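A minimal sketch of this pre-processing, using librosa, is shown below; the STFT parameters and the binarization threshold are assumptions, as they are not stated in the text.

```python
import numpy as np
import librosa

SR = 8000                       # sampling rate of the corpus
TARGET_LEN = 7 * SR             # 7.00 s of audio
N_MELS, N_FRAMES = 128, 110     # output image size (128 x 110)

def to_image(y):
    """Convert an 8 kHz waveform into a (128 x 110 x 1) binary log-mel image."""
    # Edit the reproduction time: truncate beyond 7.00 s, or repeat the sample up to 7.00 s.
    if len(y) >= TARGET_LEN:
        y = y[:TARGET_LEN]
    else:
        y = np.tile(y, int(np.ceil(TARGET_LEN / len(y))))[:TARGET_LEN]
    # Log-mel spectrogram (the DCT step of the MFCC pipeline is omitted).
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS,
                                         n_fft=1024, hop_length=TARGET_LEN // N_FRAMES)
    logmel = librosa.power_to_db(mel)[:, :N_FRAMES]
    # Binarize; thresholding at the mean level is an assumption.
    return (logmel > logmel.mean()).astype(np.float32)[..., np.newaxis]

def preprocess(path):
    y, _ = librosa.load(path, sr=SR)   # resamples to 8 kHz if necessary
    return to_image(y)
```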

Fig. 2. Workflow of pre-processing speech samples (a) longer than 7.00 s, and (b) shorter than 7.00 s.

2.3 Structure of the multilingual digital signage system

Our multilingual signage system comprises a directional volumetric display with threads, a projector, and a speech processing system based on a Kinect v2 sensor (Microsoft Japan Co., Ltd., Japan), as shown in Fig. 3(a). The images are projected using an MH550 projector (BenQ Japan Co., Ltd., Japan). When the projector is oriented at θ = 0° relative to the directional volumetric display, the Kinect v2 is installed at θ = 45°. The speech is reproduced by a WALKMAN NW-S764 (Sony Co., Ltd., Japan) and output through a Lenovo M0620 speaker (Lenovo Japan Co., Ltd., Japan). When the frontal direction of the Kinect v2 is set to θk = 0°, the speech is reproduced from positions on an arc of radius 0.50 m covering −50° ≤ θk ≤ 50°. The positional relationship between the Kinect v2 and the audio source reproduction device is shown in Fig. 3(b). The directional volumetric display is 1.87 m high, 0.95 m wide, and 0.95 m deep, and the upper part of the frame is mounted with a square magnetic board with a side length of 0.90 m. Vinymo MBT threads (Nagai Yoriito Co., Ltd., Japan) with a fineness of 280 dtex are hung from the magnetic board. Owing to constraints on the thread placement [9], a maximum of 211 threads could be placed in this study.

Fig. 3. External views of the system: (a) photograph of the developed multilingual digital signage system, and (b) positional relationship between the Kinect v2 sensor and the audio source reproduction device.

2.4 Speech processing system

The observer localization and speech acquisition are performed by the Kinect v2 sensor. As shown in Fig. 3(b), the frontal direction of the Kinect v2 is θk = 0°. The observer location is specified by calculating the directional angle of an observer speaking within the horizontal audio source acquisition range of the Kinect v2 (−50° ≤ θk ≤ 50°). The audio data, sampled at 16 kHz by the Kinect v2 sensor, are down-sampled to 8 kHz for compatibility with the input of the language identification model. Based on the processed speech data, the proposed language identification model identifies the language spoken by the observer. The language with the highest probability among the three probabilities output by the model is taken as the identification result.
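A minimal sketch of this identification step is given below; capturing the 16 kHz waveform from the Kinect v2 is outside its scope, and `to_image` and `model` refer to the hypothetical sketches in Sections 2.2 and 2.1, respectively.

```python
import numpy as np
import librosa

LANGUAGES = ["English", "Spanish", "French"]

def identify_language(audio_16k, model):
    """audio_16k: mono waveform captured by the Kinect v2 at 16 kHz."""
    # Down-sample to the 8 kHz rate expected by the language identification model.
    audio_8k = librosa.resample(audio_16k, orig_sr=16000, target_sr=8000)
    image = to_image(audio_8k)                        # same pre-processing as in training
    probs = model.predict(image[np.newaxis])[0]       # softmax probabilities of the three languages
    return LANGUAGES[int(np.argmax(probs))]           # language with the highest probability
```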

2.5 Generation of projected image

The images generated by the developed system must be projected onto the directional volumetric display according to the observer position determined by the speech processing system and according to the language identification result. This section describes how the projected images are generated.

The directional volumetric display displays the 3-D data as voxels. As shown in Fig. 4, the calculated voxel values depend on the images to be displayed and the display direction [7,8]. Consider a voxel value Vx,y,z at 3-D coordinates (x, y, z) = (i, j, k) in the right-handed coordinate system. When two displayed images IA and IB are orthogonal (Fig. 4(a)), the voxel value Vx, y, z is calculated as follows:

$$V_{x,y,z} = \lambda \times I_A(i,\,j) \times I_B(k,\,j)$$
where λ is a constant that normalizes the voxel value. To determine the voxel values of an image displayed in any direction (see Fig. 4(b)), we set the angle of the observer’s position relative to the frontal direction as θo. The coordinates (i′, j′, k′), obtained by rotating (i, j, k) by θo about the y-axis from the frontal direction, are calculated as follows:
$$\begin{aligned} \left[ \begin{array}{c} i' \\ j' \\ k' \end{array} \right] &= \left[ \begin{array}{ccc} \cos\theta_o & 0 & \sin\theta_o \\ 0 & 1 & 0 \\ -\sin\theta_o & 0 & \cos\theta_o \end{array} \right] \left[ \begin{array}{c} i \\ j \\ k \end{array} \right]\\ &= \left[ \begin{array}{c} i\cos\theta_o + k\sin\theta_o \\ j \\ -i\sin\theta_o + k\cos\theta_o \end{array} \right]. \end{aligned}$$
If image A in the frontal direction is IA(i, j, 0) and image B in the observer’s direction is IB(i′, j′, 0), the voxel value Vx, y, z is given by
$$\begin{aligned} V_{x,y,z} &= \lambda \times I_A(i,\,j,\,0) \times I_B(i',\,j',\,0)\\ &= \lambda \times I_A(i,\,j,\,0) \times I_B(i\cos\theta_o + k\sin\theta_o,\,j,\,0). \end{aligned}$$

The above voxel values were calculated for two images, but the number of displayed images can be expanded to an arbitrary number N [7]. Directional images are displayed by projecting the calculated voxel values onto the corresponding threads.
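As an illustration, the following sketch evaluates Eq. (3) on a cubic voxel grid for one frontal image and one image directed at θo; the array indexing convention and the nearest-neighbour rounding of the rotated coordinate are assumptions.

```python
import numpy as np

def voxel_values(image_a, image_b, theta_o_deg, lam=1.0):
    """Compute V_{x,y,z} = lam * I_A(i, j) * I_B(i cos(theta_o) + k sin(theta_o), j), as in Eq. (3).

    image_a: frontal image I_A indexed as [i, j]; image_b: image I_B shown toward theta_o.
    """
    n = image_a.shape[0]                 # assume square images of side n and an n x n x n grid
    theta = np.deg2rad(theta_o_deg)
    volume = np.zeros((n, n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                i_rot = int(round(i * np.cos(theta) + k * np.sin(theta)))
                if 0 <= i_rot < n:       # skip rotated coordinates that fall outside I_B
                    volume[i, j, k] = lam * image_a[i, j] * image_b[i_rot, j]
    return volume
```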

Fig. 4. Calculation of voxel values when (a) two displayed images are projected orthogonally and (b) the images are projected to arbitrary positions.

3. Results

3.1 Language identification results

The accuracy of the language identification model was evaluated using the test data in the speech samples. Figure 5 shows the confusion matrix obtained when evaluating the test data.

Fig. 5. Confusion matrix obtained when evaluating the test data.

The average identification accuracies of English, Spanish, and French in the test data were 89.6%, 94.6%, and 91.6%, respectively. The probability of misclassifying English as Spanish was 6.9%, and the probability of misclassifying Spanish as English was 5.2%, showing that English and Spanish were sometimes confused. The overall identification accuracy across the three languages was 91.9%.
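For reference, these figures can be derived from the test-set predictions as in the following sketch; `model`, `x_test`, and `y_true` are assumed to come from the training pipeline sketched in Section 2.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = np.argmax(model.predict(x_test), axis=1)        # predicted language indices
cm = confusion_matrix(y_true, y_pred)                    # rows: true language, columns: prediction
per_language_accuracy = cm.diagonal() / cm.sum(axis=1)   # e.g. 89.6%, 94.6%, 91.6%
overall_accuracy = cm.diagonal().sum() / cm.sum()        # e.g. 91.9% over the three languages
```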

3.2 Operation results of the multilingual digital signage system

The original display images were (132 × 15)-pixel blocks containing the English characters “NONE,” “ENGLISH,” “SPANISH,” and “FRENCH.” By sliding a window over these images pixel by pixel, (15 × 15)-pixel frames were extracted and displayed in sequence. The original displayed image is shown in Fig. 6(a). The final displayed image depended on the output of the language identification model.
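A minimal sketch of this frame extraction is given below; representing the original image as a (15 × 132) NumPy array (rows × columns) is an assumption.

```python
import numpy as np

def sliding_frames(original):
    """Extract (15 x 15)-pixel frames by sliding a window pixel by pixel over a (15 x 132) image."""
    height, width = original.shape                     # 15 rows, 132 columns
    return np.stack([original[:, s:s + height]         # one frame per horizontal offset
                     for s in range(width - height + 1)])

# Example: 132 - 15 + 1 = 118 frames, each converted to voxel values and projected in turn
# at ten frames per second.
```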

Fig. 6. Original images and display results: (a) original image corresponding to each language, and the display results of reproducing (b) Spanish at θ = 90°, (c) English at θ = 45°, and (d) French at θ = 0°.

The displayed images and their corresponding original images are shown in Figs. 6(b)–6(d). First, when the Spanish speech was reproduced at θ = 90°, the display “SPANISH” was oriented at θ = 90°, and the display “NONE” was oriented at θ = 0°. The display result at θ = 90° and the corresponding original image are shown in Fig. 6(b). Next, when the English speech was reproduced at θ = 45°, the display “NONE” was oriented at θ = 0°, and “ENGLISH” was oriented at θ = 45° (see Fig. 6(c)). Finally, when the French speech was reproduced at θ = 0°, the display “FRENCH” was oriented at θ = 0°, and “ENGLISH” was oriented at θ = 45° (see Fig. 6(d)). In all cases, a meaningful image could be discerned only in the displayed direction. The video was displayed at ten frames per second, and can be viewed in Visualization 1.

4. Discussion and conclusion

In this study, the speech and direction of an audio source were acquired by a Kinect v2 sensor, and the language of the acquired speech (English, Spanish, or French) was identified by a CNN-based language identification model. The identification accuracy for the three languages was 91.9%, comparable to those of previous studies [22–25]. The language identification results were presented on the directional volumetric display, which uses threads as the display medium; specifically, they were displayed in the direction of the audio source from which the speech was extracted. The displays in Fig. 6 confirm that the language identification results were directed toward the audio source, demonstrating the effectiveness of our multilingual digital signage system. For example, if this system were installed at airports or tourist attractions visited by many foreigners, it could provide practical multilingual guidance. In today’s internationalized world, such a system that eliminates language barriers is valuable, and this system is also significant as a practical application of the directional volumetric display. However, the original speech samples and the speech data acquired by the Kinect v2 yielded different accuracies in the language identification model. White noise was evident in the sensor data recorded in the actual measurement environment, whereas the training data were generated by converting the VoxForge audio, which contains no such noise, directly into log-mel spectrogram images; we attribute the difference in accuracy to this mismatch. In fact, the measurement environment contaminates the speech samples with server operation noise and other noise. To enhance the robustness of the system, the language identification model should be trained on speech samples augmented with the noise of the measurement environment, or the noise should be removed from the acquired speech data before identification.

The image oriented at θ = 45° was poorer in quality than the images directed at θ = 0° and 90°. One probable cause of this degradation is the square arrangement of the threads on the magnetic board. When the image is displayed at θ = 45°, its horizontal extent is √2 times that of the image displayed at θ = 0° or 90°. Because differently sized images are composed of the same number of threads, the larger image becomes sparser and its quality deteriorates. This is explained using Fig. 7. The perspective axes at θ = 0° and 45° are referred to as Perspective axes 1 and 2, respectively. The display area matched to each perspective axis is shown in Fig. 7(a); note that the display area at θ = 45° and the outline of the thread arrangement do not completely overlap. The image observed from Perspective axis 2 is shown in Fig. 7(b). The blue part constitutes noise because it is not required when displaying the image of Perspective axis 2. The red part in Fig. 7(b) denotes the display image of Perspective axis 2; however, because threads are also arranged in the blue part, the threads composing the red part become sparse and the image quality deteriorates. Therefore, as shown in Fig. 7(c), by making the outline of the thread arrangement circular, the size of the displayed image does not change for any perspective axis and the differences in image quality can be suppressed. Furthermore, this study addressed only binary images, such as characters, for which the image quality improvement algorithm [33] provides almost no benefit; we therefore did not adopt it. However, when handling high-quality images with gradation, improved image quality can be expected, so the algorithm should be applied depending on the images being handled. Because the image quality improvement increases the processing time, it is important to decide whether to use the algorithm according to the system being developed.

Fig. 7. Problems with the current thread arrangement and proposed new thread arrangement: (a) top view of two display spaces and two perspective axes when θ = 0° and 45°, (b) image displayed when viewing from Perspective axis 2, and (c) the arrangement of circular threads to ensure the same size of the displayed images.

Funding

Japan Society for the Promotion of Science (18K11599); Yazaki Memorial Foundation for Science and Technology.

Disclosures

The authors declare no conflicts of interest.

References

1. D. E. Smalley, E. Nygaard, K. Squire, J. V. Wagoner, J. Rasmussen, S. Gneiting, K. Qaderi, J. Goodsell, W. Rogers, M. Lindsey, K. Costner, A. Monk, M. Pearson, B. Haymore, and J. Peatross, “A photophoretic-trap volumetric display,” Nature 553(7689), 486–490 (2018). [CrossRef]  

2. S. Hunter, R. Azuma, J. M. Thompson, D. MacLeod, and D. Disanjh, “Mid-Air Interaction with a 3D Aerial Display,” ACM SIGGRAPH 2017 Emerging Technologies 17, 1–2 (2017).

3. K. Rathinavel, H. Wang, A. Blate, and H. Fuchs, “An Extended Depth-at-Field Volumetric Near-Eye Augmented Reality Display,” IEEE Trans. Visual. Comput. Graphics 24(11), 2857–2866 (2018). [CrossRef]  

4. M. D. Medeiros, J. Nascimento, J. Henriques, S. Barrao, A. F. Fonseca, N. A. Silva, N. M. Coelho, and V. Agoas, “Three-Dimensional Head-Mounted Display System for Ophthalmic Surgical Procedures,” Retina 37(7), 1411–1414 (2017). [CrossRef]  

5. M. Gately, Y. Zhai, M. Yeary, E. Petrich, and L. Sawalha, “A Three-Dimensional Swept Volume Display Based on LED Arrays,” J. Disp. Technol. 7(9), 503–514 (2011). [CrossRef]  

6. K. Kumagai, S. Hasegawa, and Y. Hayasaki, “Volumetric bubble display,” Optica 4(3), 298–302 (2017). [CrossRef]  

7. H. Nakayama, A. Shiraki, R. Hirayama, N. Masuda, T. Shimobaba, and T. Ito, “Three-dimensional volume containing multiple two-dimensional information patterns,” Sci. Rep. 3(1), 1931 (2013). [CrossRef]  

8. A. Shiraki, D. Matsumoto, R. Hirayama, H. Nakayama, T. Kakue, T. Shimobaba, and T. Ito, “Improvement of an algorithm for displaying multiple images in one space,” Appl. Opt. 58(5), A1–A6 (2019). [CrossRef]  

9. A. Shiraki, M. Ikeda, H. Nakayama, R. Hirayama, T. Kakue, T. Shimobaba, and T. Ito, “Efficient method for fabricating a directional volumetric display using strings displaying multiple images,” Appl. Opt. 57(1), A33–A38 (2018). [CrossRef]  

10. D. Matsumoto, R. Hirayama, N. Hoshikawa, H. Nakayama, T. Shimobaba, T. Ito, and A. Shiraki, “Interactive directional volumetric display that keeps displaying directional image only to a particular person in real-time,” OSA Continuum 2(11), 3309–3322 (2019). [CrossRef]  

11. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems,” arXiv 1603.04467, (2016).

12. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]  

13. V. Kepuska and G. Bohouta, “Next-generation of virtual personal assistants (Microsoft Cortana, Apple Siri, Amazon Alexa and Google Home),” 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), 99–103 (2018).

14. Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Object Recognition with Gradient-Based Learning,” Shape, Contour and Grouping in Computer Vision 1681, 319–345 (1999).

15. F. Sultana, A. Sufian, and P. Dutta, “Advancements in Image Classification using Convolutional Neural Network,” 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), 122–129 (2018).

16. R. L. Galvez, A. A. Bandala, E. P. Dadios, R. R. P. Vicerra, and J. M. Z. Maningo, “Object Detection Using Convolutional Neural Networks,” TENCON 2018 - 2018 IEEE Region 10 Conference, 2023–2027 (2018).

17. J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen, “Recent advances in convolutional neural networks,” Pattern Recognition 77, 354–377 (2018). [CrossRef]  

18. A. Sherstinsky, “Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) network,” Phys. D 404, 132306 (2020). [CrossRef]  

19. S. K. Mahata, D. Das, and S. Bandyopadhyay, “MTIL2017: Machine Translation Using Recurrent Neural Network on Statistical Machine Translation,” J. Intelligent Systems 28(3), 447–453 (2018). [CrossRef]  

20. A. Amberkar, P. Awasarmol, G. Deshmukh, and P. Dave, “Speech Recognition using Recurrent Neural Networks,” 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), 1–4 (2018).

21. S. Hojjat, B. Julianne, S. Sharan, B. Joseph, C. Errol, and V. Shahrokh, “Recent advances in recurrent neural networks,” arXiv:1801.01078, (2018).

22. S. Mukherjee, N. Shivam, A. Gangwal, L. Khaitan, and A. J. Das, “Spoken Language Recognition using CNN,” 2019 International Conference on Information Technology (ICIT), 37–41 (2019).

23. H. Mukherjee, S. Ghosh, S. Sen, O. M. Sk, K. C. Santosh, S. Phadikar, and K. Roy, “Deep learning for spoken language identification: Can we visualize speech signal patterns?” Neural Comput. & Applic. 31(12), 8483–8501 (2019). [CrossRef]  

24. S. Jauhari, S. Shukla, and G. Mittal, “Spoken Language Identification using ConvNets,” European Conference on Ambient Intelligence, 252–265 (2019).

25. C. Bartz, T. Herold, H. Yang, and C. Meinel, “Language Identification Using Deep Convolutional Recurrent Neural Networks,” International Conference on Neural Information Processing, 880–889 (2017).

26. voxforge.org, “Free speech recognition (linux, windows and mac),” https://www.voxforge.org/, accessed on 1 Aug 2020.

27. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” J. Machine Learning Res. 15, 1929–1958 (2014). [CrossRef]  

28. D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” Proc. ICLR 2015 (2015).

29. S. K. Kopparapu and M. Laxminarayana, “Choice of Mel filter bank in computing MFCC of a resampled speech,” 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010), 121–124 (2010).

30. T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: from features to supervectors,” Speech Commun. 52(1), 12–40 (2010). [CrossRef]  

31. M. Sahidullah and G. Saha, “Design analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition,” Speech Commun. 54(4), 543–565 (2012). [CrossRef]  

32. H. Purwins, B. Li, T. Virtanen, J. Schlüter, S. Y. Chang, and T. Sainath, “Deep Learning for Audio Signal Processing,” IEEE J. Sel. Top. Signal Process. 13(2), 206–219 (2019). [CrossRef]  

33. R. Hirayama, H. Nakayama, A. Shiraki, T. Kakue, T. Shimobaba, and T. Ito, “Image quality improvement for a 3D structure exhibiting multiple 2D patterns and its implementation,” Opt. Express 24(7), 7319–7327 (2016). [CrossRef]  

Supplementary Material (1)

Visualization 1: This video shows the operation results of the multilingual signage system, i.e., the displayed images at each position when the Spanish speech was reproduced at θ = 90°, the English speech was reproduced at θ = 45°, and the French speech was reproduced at θ = 0°.


