Secondary abstract: |
This thesis focuses on quality assessment of multimodal services in contemporary telecommunication systems. It addresses quality degradations that affect user experience. Depending on their origin, they can be categorized as source or network impairments. Their impact can be measured with subjective or objective methods. Since multimodal services can be bi-directional systems, it is necessary to have control over both the input and output modalities of the system. This leads to intermodal influences between the modalities as a consequence of human perception. Furthermore, because users focus on Regions-of-Interest (ROI), degradations in those regions have a greater impact on the overall quality, which can be exploited for differential quality assessment. The aim of this thesis is to propose a model for quality assessment of multimodal services and to develop the concept of a quality evaluator that takes the above-mentioned facts into account. The thesis is therefore divided into three sections. In the first section, the impact of quality degradations on the input modality is determined. In the second, a suitable multimodal database comprising HD recordings is established. This section also presents subjective and objective assessment of the output modality, in which subjective mean opinion scores (subMOS) and objective mean opinion scores (objMOS) were obtained. Based on the results, a new model of multimodal quality assessment is proposed. The last section addresses differential quality evaluation based on ROI. As part of the evaluation of the effect of quality degradations on the input modality, a voice-driven IVR service with a built-in automatic speech recognition (ASR) module is analyzed. Assessment begins by measuring objMOS values of the samples from the SpeechDat(II) database. Samples were degraded by transcoding and packet loss. There were substantial differences between the speech codecs used, and even between different configurations of the same codec.
Generally, deterioration was greater for codecs with lower bandwidth. The voice signal degraded to such an extent that it was necessary to use a more robust modality, i.e. DTMF dialing. After an analysis of the results, a classifier of input modality based on Gaussian Mixture Models (GMM) was proposed. When training the classifier, different classification parameters were tested. The test phase confirmed that the classifier operates successfully on the input modality under various packet loss scenarios. To assess the impact of degradations on the quality of the output modality, a purpose-built multimodal database was established. It comprised audio (AAC at 48 kbps), video (H.264/AVC at a resolution of 1920x1080 pixels) and combined audio and video clips for a total of 240 samples, used in various packet loss scenarios. Subjective tests with 20 subjects were then conducted, providing reference data for objective quality assessment. Objective quality was measured separately for the audio and video modalities. To assess the audio modality, the standardized PESQ speech quality metric was used, and to assess the video modality, the NQM video metric was applied. Then, using regression, a linear model for evaluating the quality of multimodal services was proposed, which takes into account the type of modality, type of scene, amount of degradation and unimodal objMOS scores. The resulting correlation is 0.892. The differential quality evaluation consists of two stages. First, an ROI face detector was used, based on the Viola-Jones object detection algorithm with cascade classifiers built from weak Haar-like features. Then, given the good detection results, the optimization possibilities offered by differential quality assessment of the visual modality are analyzed.
The analysis proposes evaluating the quality of the ROI with a more complex algorithm (NQM), since those regions attract more visual attention, and using a simpler quality metric (PSNR) for the background, i.e. the non-ROI regions. |