FEATURE EXTRACTION USING MULTIMODAL CONVOLUTIONAL NEURAL NETWORKS FOR VISUAL SPEECH RECOGNITION

Eric Tatulli, Thomas Hueber
CNRS / Univ. Grenoble Alpes, GIPSA-lab, Grenoble, France

ABSTRACT

This article addresses the problem of continuous speech recognition from visual information only, without exploiting any audio signal. Our approach [...]

2. VISUAL FEATURE EXTRACTION USING CNN

2.1. Convolutional neural networks

A CNN is a multilayer stack of learning modules well suited to processing bi-dimensional datasets, i.e. images. CNNs are a subclass of neural networks that combine the nonlinear processing of hidden-layer neurons with the essential properties of weight sharing over customizable sub-images (the so-called convolutional filters), pooling and down-sampling. As a consequence, such networks are expected to learn representations of the data with increasing levels of abstraction, regrouped by semantic similarities. The canonical structure of a CNN [11] contains: (1) a given number of convolutional layers, each divided into four sub-tasks (convolutional filtering, non-linearity, pooling and sub-sampling); (2) a set of fully connected layers with properties identical to those of classical neural networks; (3) a softmax layer, which outputs posterior probabilities for each class.

2.2. Independent processing of the visual modalities

Our first implementation is based on two CNNs, each one independently processing one visual modality (i.e. ultrasound or video). At training stage, the classical gradient-descent back-propagation technique is used to estimate the parameters in a supervised manner, the phonetic labels being used as targets. For each modality, a vector of visual features is extracted from the network by taking the output of the last fully connected layer, just before the final softmax layer.
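The canonical pipeline described above, together with the feature-extraction point (output of the last fully connected layer, just before the softmax), can be sketched as follows. This is a minimal NumPy illustration, not the authors' MatConvNet implementation; all layer sizes and the toy input are assumptions for the sake of the example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D convolutional filtering of one image with one kernel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Pooling / sub-sampling: max over non-overlapping size x size blocks."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cnn_forward(image, filters, W_fc, b_fc, W_out, b_out):
    """One convolutional layer (filtering + ReLU + pooling), one fully
    connected layer, one softmax layer. Returns (posteriors, features):
    the feature vector is taken at the output of the last fully connected
    layer, just before the softmax, as in the text."""
    maps = [max_pool(np.maximum(conv2d_valid(image, f), 0.0)) for f in filters]
    flat = np.concatenate([m.ravel() for m in maps])
    features = np.maximum(W_fc @ flat + b_fc, 0.0)   # last fully connected layer
    posteriors = softmax(W_out @ features + b_out)   # posterior per phone class
    return posteriors, features

rng = np.random.default_rng(0)
image = rng.random((16, 16))              # toy stand-in for an ultrasound frame
filters = rng.standard_normal((4, 3, 3))  # 4 convolutional filters (assumption)
flat_dim = 4 * 7 * 7                      # 4 maps of ((16-3+1)//2)**2 after pooling
W_fc, b_fc = rng.standard_normal((32, flat_dim)) * 0.1, np.zeros(32)
W_out, b_out = rng.standard_normal((34, 32)) * 0.1, np.zeros(34)  # 34 phone classes

posteriors, features = cnn_forward(image, filters, W_fc, b_fc, W_out, b_out)
```

The penultimate-layer activations (`features`) are what would be handed to the HMM-GMM decoder; the softmax outputs are only used as training targets against the phonetic labels.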
2.3. Multimodal architecture

We propose a multimodal CNN that jointly processes pairs of video and ultrasound images. This architecture is illustrated in Fig. 2 and consists in the fusion of two canonical CNNs [16]. Importantly, it includes a fusion layer combining the ultrasound and video modalities. Such an architecture aims at extracting high-level features from the simultaneous observation of tongue, lips and jaw. As in Section 2.2, the multimodal CNN is trained in a supervised manner, the phonetic labels being used as targets.

Based on this multimodal architecture, we investigated two ways of extracting visual features: (1) extracting one single feature vector at the output of the fusion layer (implementation S3, see Fig. 2, top), and (2) extracting two feature vectors, one for each modality, at the output of the last fully connected layer before the fusion layer (implementation S4, see Fig. 2, top). Let us mention that the resulting features may differ from those obtained when considering two CNNs trained separately: here, the two modalities are tied together, their parameters being jointly estimated and thus able to influence each other.

2.4. HMM-based visual speech recognition

In this study, CNNs are used as feature extractors and are combined with a conventional HMM-GMM phonetic decoder. This architecture allows the introduction of prior linguistic knowledge during decoding, via the use of a pronunciation dictionary and a language model. Such prior knowledge remains of particular interest in the context of visual speech recognition, for regularization purposes: as a matter of fact, several sources of information, such as voicing, are missing when considering only visual data.

Two strategies can be investigated for combining the ultrasound and video modalities within the HMM-GMM decoder: (1) an early fusion strategy, in which the feature vectors related to each modality are concatenated together and modeled using a 1-stream HMM-GMM decoder, and (2) a middle fusion strategy, based on a 2-stream HMM-GMM decoder in which the modalities are combined at the HMM state level. The combination of the two CNN-based feature extraction techniques (independent vs. joint modeling) with these two fusion strategies (early vs. middle) results in 4 VSR architectures, referred to as S1, S2, S3 and S4 and illustrated in Fig. 2 (bottom).
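The two extraction points of the multimodal CNN (S3 vs. S4) can be made concrete with a small sketch. This is an illustrative assumption, not the authors' implementation: the per-modality vectors stand for the outputs of the last fully connected layer of each branch, and the fusion-layer size of 34 follows the value reported later in the text.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical fusion-layer weights: the fusion layer maps the concatenated
# per-modality vectors (2 x 32 dims here, an assumption) to a 34-dim output.
rng = np.random.default_rng(0)
W_fus = rng.standard_normal((34, 64)) * 0.1
b_fus = np.zeros(34)

def extract_features(us_feat, vid_feat, mode):
    """S3: one single feature vector at the output of the fusion layer.
    S4: the two per-modality vectors, taken just before the fusion layer."""
    if mode == "S3":
        return relu(W_fus @ np.concatenate([us_feat, vid_feat]) + b_fus)
    elif mode == "S4":
        return np.concatenate([us_feat, vid_feat])
    raise ValueError(mode)

us, vid = rng.random(32), rng.random(32)
f_s3 = extract_features(us, vid, "S3")   # one 34-dim vector (fusion output)
f_s4 = extract_features(us, vid, "S4")   # two 32-dim vectors, concatenated
```

The early-vs-middle fusion choice is then made on the decoder side: a 1-stream HMM-GMM over the concatenated vector, or a 2-stream HMM-GMM combining per-modality likelihoods at the state level.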
Fig. 2. Top: architecture of the proposed multimodal convolutional neural network (CNN). Bottom: schematic representation of the 4 proposed VSR systems (S1, S2, S3 and S4) combining CNN-based feature extraction and HMM-GMM decoding.

3. EXPERIMENTS

3.1. Database

Experiments were conducted on the same database used in [10], which contains 488 sentences pronounced by a male French speaker. Ultrasound images (320x240 grayscale images, 60 fps) were acquired using the Terason T3000 medical ultrasound system with a 128-element microconvex transducer (3.5 MHz frequency, 140° angle, 7 cm penetration depth). Video images of the speaker's face (640x480 grayscale images, 60 fps) were recorded using an industrial CMOS camera. Ultrasound and video sensors were kept fixed with respect to the speaker's head using a stabilization helmet. Visual and audio data were recorded simultaneously using the Ultraspeech software [17], in a sound-proof room and under stable lighting conditions. The French language was described using a set of 34 phonemes. The phonetic transcription of each recorded sentence was extracted automatically and manually post-checked. The temporal boundaries of each phoneme were extracted from the audio signal using a conventional ASR system and a forced-alignment procedure. The phonetic segmentation of the audio signal was then used to label the visual data (audio, ultrasound and video data being recorded synchronously) and to train the CNNs and HMM-GMM decoders.
3.2. Implementation details

For the S1 and S2 systems (independent processing of the two modalities), it appeared that the simplest CNN structure (one convolution layer, one fully connected layer and one softmax layer), with only a moderate number of filters (respectively 16 and 8), provides satisfying results, as discussed in Sec. 4. For the S3 and S4 systems (multimodal CNN), the best results were also obtained using a simple structure (one convolution layer, one fully connected layer, one fusion layer and one softmax layer). Given the number of free parameters at play, we do not claim that the proposed architecture is optimal, and tuning is likely to improve its performance. For all CNNs, the Rectified Linear Unit (ReLU) non-linearity was used in all convolutional, fusion and fully connected layers. All CNNs were implemented using the MatConvNet toolkit [18] and were trained using GPU acceleration.

All HMM-GMM decoders were built using the HTK toolkit [19], with a standard HMM topology (3 states) and a standard training procedure (tied-state, context-dependent triphone modeling). For all experiments, the visual features were modeled together with their first derivatives. At decoding stage, the most likely sequence of phonemes was estimated by decoding the HMM-GMM state posterior probabilities using the Viterbi algorithm (the model insertion penalty was optimized on the training set). For the 2-stream HMM architecture, the weighting parameters used to combine the stream likelihoods were also optimized on the training set; optimal values were found to be 0.7 for ultrasound and 0.3 for video (a similar result was found in [10]).

Since the present study aims only at probing the ability of the multimodal CNN to process the visual data, recognition experiments were conducted without exploiting prior linguistic information. The performance was measured by calculating the phoneme recognition accuracy Tp, defined as Tp = (Np - D - S - I) / Np, where Np is the number of phonemes in the test corpus, and D, S and I are respectively the numbers of deletions, substitutions and insertions. The 95% confidence interval of the phonetic recognition rate was computed following [20]. An 8-fold cross-validation was used to refine our statistics, splitting the corpus into eight subsets, keeping seven subsets for training and the remaining one for testing, and taking into account all possible permutations.
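The deletion, substitution and insertion counts in the accuracy formula above come from a minimal-cost alignment of the decoded phone sequence against the reference. A self-contained sketch of that computation (a standard edit-distance alignment; we assume it matches the scoring used by the HTK tools, which is not detailed in the text):

```python
def edit_counts(ref, hyp):
    """Align a hypothesis phone sequence against the reference with minimal
    edit cost; return (deletions, substitutions, insertions)."""
    n, m = len(ref), len(hyp)
    # T[i][j] = (total cost, D, S, I) for ref[:i] vs hyp[:j]
    T = [[None] * (m + 1) for _ in range(n + 1)]
    T[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):                       # all-deletions border
        c = T[i - 1][0]
        T[i][0] = (c[0] + 1, c[1] + 1, c[2], c[3])
    for j in range(1, m + 1):                       # all-insertions border
        c = T[0][j - 1]
        T[0][j] = (c[0] + 1, c[1], c[2], c[3] + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dele, ins, sub = T[i - 1][j], T[i][j - 1], T[i - 1][j - 1]
            mismatch = int(ref[i - 1] != hyp[j - 1])
            T[i][j] = min(
                (dele[0] + 1, dele[1] + 1, dele[2], dele[3]),
                (ins[0] + 1, ins[1], ins[2], ins[3] + 1),
                (sub[0] + mismatch, sub[1], sub[2] + mismatch, sub[3]),
            )
    _, D, S, I = T[n][m]
    return D, S, I

def phoneme_accuracy(ref, hyp):
    """Tp = (Np - D - S - I) / Np, as defined in the text."""
    D, S, I = edit_counts(ref, hyp)
    return (len(ref) - D - S - I) / len(ref)
```

Note that, unlike a plain correctness rate, this measure also penalizes insertions, so Tp can be negative for a decoder that over-generates phones.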
was not the case for the baseline since B1 B2,tion based on CNN S1 S2 S3 S4 and PCA B1 B2 Com. parison with an ASR system trained on the audio stream For 2 The best performance was obtained with the multi. all experiments the 95 confidence interval is 1 5 modal CNN architecture and the middle fusion strategy. ASR MFCC S4 with 80 4 accuracy However the difference. Tp 84 with the system S2 lies within the confidence interval. VSR PCA VSR CNN Therefore the benefit of considering jointly the two. B1 B2 S1 S2 S3 S4 modalities at the feature extraction stage need to be. Tp 74 7 73 1 77 8 79 9 75 8 80 4 confirmed with additional experiments. 3 The lowest performance was obtained with the system. S3 in which the features are extracted at the output of. 3 3 Baseline the fusion layer Among other possible explanations. The CNN based feature extraction approach was compared we can conjecture that the fusion layer operates as a. to a PCA based approach as used in our previous studies 1 bottle neck in the network and directly control the di. 10 This technique is a slight adaptation of the EigenFaces mension of the extracted feature vector In this study. technique 21 and aims at finding a decomposition basis that we empirically set this parameters to 34 in order to. best explains the variation of pixel intensity in a set of training match the number of target phonetic classes and limit. frames At feature extraction stage resized and normalized the number of fully connected layers before the final. video ultrasound frame is projected onto this basis and the softmax layer Considering the relatively low perfor. visual features are defined as the D first coordinates for each mance we infer that the reduction of the dimension by. stream The number of coordinates is a free parameter In our a factor of 2 in comparison with S1 S2 and S4 is too. implementation it is optimized on the training set by keeping severe Hence the optimization of the size of the fusion. 
For each experiment, the performance was also compared with that obtained when considering the audio data recorded simultaneously with the visual data. Considering that audio provides thorough information, we assume that the ASR accuracy gives the upper bound reachable by a VSR system. The audio signal was parametrized using MFCC decomposition, resulting in a vector of 13 static coefficients (with their first derivatives) extracted every 5 ms. The HMM-GMM decoder was trained using the same procedure as for the VSR systems (3-state, tied-state, context-dependent triphone models).
4. RESULTS AND DISCUSSION

Results are presented in Table 1.

Table 1. Accuracy of the 4 systems of visual speech recognition based on CNN (S1, S2, S3, S4) and PCA (B1, B2), compared with an ASR system trained on the audio stream. For all experiments, the 95% confidence interval is ±1.5%.

            ASR       VSR (PCA)        VSR (CNN)
            MFCC      B1      B2       S1      S2      S3      S4
    Tp (%)  84        74.7    73.1     77.8    79.9    75.8    80.4

First, the CNN-based approaches systematically outperform the PCA-based baselines, regardless of the strategy used to combine the modalities (early or middle fusion). This demonstrates the potential of CNNs to extract relevant features from the raw video and ultrasound images. Second, the differences observed between the 4 CNN-based VSR systems are more difficult to interpret. Nonetheless, the following conclusions can be drawn:

1. The middle fusion strategy always outperforms early fusion (since S2 > S1 and S4 > S3); surprisingly, this was not the case for the baseline (since B1 > B2).

2. The best performance was obtained with the multimodal CNN architecture and the middle fusion strategy (S4), with 80.4% accuracy. However, the difference with system S2 lies within the confidence interval; therefore, the benefit of considering the two modalities jointly at the feature extraction stage needs to be confirmed with additional experiments.

3. The lowest performance was obtained with system S3, in which the features are extracted at the output of the fusion layer. Among other possible explanations, we can conjecture that the fusion layer operates as a bottleneck in the network and directly controls the dimension of the extracted feature vector. In this study, we empirically set this parameter to 34, in order to match the number of target phonetic classes and to limit the number of fully connected layers before the final softmax layer. Considering the relatively low performance, we infer that the resulting reduction of the dimension by a factor of 2 (in comparison with S1, S2 and S4) is too severe. Hence, the optimization of the size of the fusion layer seems to be a key issue and should be addressed carefully in a future study.

4. With 80.4% accuracy, the best VSR system (S4) approaches the upper-bound accuracy of 84% derived from the audio data. Such performance is very encouraging and makes the multimodal CNN a good candidate for a practical VSR system.
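Reference [20] is not reproduced in this excerpt, so as an illustration we assume the usual binomial normal-approximation formula for the 95% confidence interval of a recognition rate; under that assumption, the reported ±1.5% half-width is consistent with a test set of a few thousand phones per fold (the exact count is not given here).

```python
import math

def ci95_halfwidth(p, n):
    """Half-width of the 95% confidence interval of an accuracy p measured on
    n independent trials, under the standard binomial normal approximation:
    1.96 * sqrt(p * (1 - p) / n). This formula is an assumption; the paper
    follows reference [20], which may differ."""
    return 1.96 * math.sqrt(p * (1.0 - p) / n)

# With a hypothetical 2700 test phones, an accuracy of 80.4% yields a
# half-width close to the ±1.5% quoted in Table 1.
hw = ci95_halfwidth(0.804, 2700)
```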
5. CONCLUSIONS

In this article, we investigated the use of CNNs for extracting visual features from ultrasound and video images of the tongue and lips. We proposed a multimodal architecture in which the two visual modalities are jointly processed, and derived different systems in which the CNN is used as a feature extractor combined with an HMM-GMM decoder. Experiments were conducted on a continuous-speech VSR task. Results have demonstrated the potential of the CNN over a previously published baseline. However, further experiments should be conducted to validate the potential benefit of the multimodal architecture over the use of two distinct CNNs; such experiments will be conducted in future studies on a multi-speaker database. Future work will also focus on the design of an end-to-end VSR system, in line with recent work on ASR [22], combining convolutional layers for processing the raw visual data with a recurrent architecture (as in [23]) to model the dynamics of speech articulation.
6. REFERENCES

[1] Thomas Hueber, Elie-Laurent Benaroya, Gérard Chollet, Bruce Denby, Gérard Dreyfus, and Maureen Stone, "Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips," Speech Communication, vol. 52, no. 4, pp. 288-300, 2010.

[2] Bruce Denby, Tanja Schultz, Kiyoshi Honda, Thomas Hueber, Jim M. Gilbert, and Jonathan S. Brumberg, "Silent speech interfaces," Speech Communication, vol. 52, no. 4, pp. 270-287, 2010.

[12] Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and Atilla Baskurt, "Spatio-temporal convolutional sparse auto-encoder for sequence classification," in Proc. BMVC, 2012, pp. 1-12.

[13] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proc. IEEE CVPR, 2014, pp. 1725-1732.

[14] Karen Simonyan and Andrew Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proc. NIPS, 2014, pp. 56…
