Language: English
You searched for +publisher:"University of New South Wales" +contributor:("Sethu, Vidhyasaharan, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW").
Showing records 1 – 7 of 7 total matches. No search limiters apply to these results.

University of New South Wales
1.
Wataraka Gamage, Kalani.
Speech-Based Emotion Recognition: Linguistic and Saliency-Based Systems.
Degree: Electrical Engineering & Telecommunications, 2018, University of New South Wales
URL: http://handle.unsw.edu.au/1959.4/60426 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:52192/SOURCE02?view=true
Speech-based emotion recognition is a research field of growing interest, which aims to identify human emotions based on speech. The main contributions of this thesis revolve around the use of verbal and non-verbal vocalisation cues for speech-based emotion recognition, which are complementary to popularly used acoustic features for both emotion classification and continuous emotion prediction tasks.

This thesis initially explores the supra-segmental feature representations generated by the vectorisation of the Mel-frequency cepstral coefficient frame-level feature distribution models for emotion classification, which is an alternative to the default acoustic supra-segmental features. Next, the thesis focuses on the development of approaches for incorporating the emotional saliency and pronunciation of verbal cues (lexical features) for emotion classification.

Apart from lexical features, non-verbal vocal events such as laughter, sighs, and expressions such as "grrr!", "oh!", and disfluency patterns including filled pauses such as "hmm" are identified within the linguistic feature domain. These elements of speech are instrumental in portraying both voluntary and involuntary emotions in human communication. Despite this, they have not been used for emotion recognition in a completely automatic manner, and their effect on emotion recognition has not yet been adequately analysed. This thesis proposes and develops several models to utilise emotionally salient linguistic cues, including non-verbal gestures and disfluencies, implicitly for emotion classification and continuous emotion prediction tasks. This is achieved without the need for tagged and time-aligned non-verbal vocalisation labels. The proposed novel approaches allow emotion recognition systems to utilise linguistic information independent of manual transcripts or automatic speech recognition.

Inspired by the analysis of the influence of non-verbal vocalisations on continuous emotion prediction, as well as emotion psychology concepts related to the symbolic reference function of such expressions, this thesis proposes a novel view of continuous emotion prediction, leading to the development of a transparent framework for continuous emotion prediction. This framework is modelled as a time-invariant filter array for continuous emotion prediction, and is distinct from the pointwise regression mapping taken by traditional approaches.

All proposed approaches are extensively evaluated on state-of-the-art emotion databases.
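The record does not spell out how the frame-level MFCC distribution models are vectorised; one common instance of the idea is to fit a small Gaussian mixture model to an utterance's MFCC frames and stack its parameters into a fixed-length supra-segmental vector. A minimal sketch under that assumption (librosa and scikit-learn; the function name and model sizes are illustrative, not the thesis' exact method):

```python
# Sketch: supra-segmental representation via vectorised MFCC frame distributions.
# A GMM "supervector" is one common instance of the general idea in the abstract.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_supervector(wav_path, n_mfcc=13, n_components=4):
    """Fit a GMM to the utterance's MFCC frames and stack its parameters."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag', random_state=0).fit(mfcc)
    # Concatenate weights, means and diagonal covariances into one fixed-length vector.
    return np.concatenate([gmm.weights_,
                           gmm.means_.ravel(),
                           gmm.covariances_.ravel()])
```

Every utterance, whatever its duration, then maps to a vector of the same size, which can feed a standard classifier for the emotion classes.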
Advisors/Committee Members: Ambikairajah, Eliathamby, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, Sethu, Vidhyasaharan, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW.
Subjects/Keywords: Emotion classification; Speech based Emotion Recognition; Continuous Emotion Prediction; Vocal Gestures; Non-verbal vocalizations; Lexical
APA (6th Edition):
Wataraka Gamage, K. (2018). Speech-Based Emotion Recognition: Linguistic and Saliency-Based Systems. (Doctoral Dissertation). University of New South Wales. Retrieved from http://handle.unsw.edu.au/1959.4/60426 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:52192/SOURCE02?view=true
Chicago Manual of Style (16th Edition):
Wataraka Gamage, Kalani. “Speech-Based Emotion Recognition: Linguistic and Saliency-Based Systems.” 2018. Doctoral Dissertation, University of New South Wales. Accessed April 23, 2021.
http://handle.unsw.edu.au/1959.4/60426 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:52192/SOURCE02?view=true.
MLA Handbook (7th Edition):
Wataraka Gamage, Kalani. “Speech-Based Emotion Recognition: Linguistic and Saliency-Based Systems.” 2018. Web. 23 Apr 2021.
Vancouver:
Wataraka Gamage K. Speech-Based Emotion Recognition: Linguistic and Saliency-Based Systems. [Internet] [Doctoral dissertation]. University of New South Wales; 2018. [cited 2021 Apr 23].
Available from: http://handle.unsw.edu.au/1959.4/60426 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:52192/SOURCE02?view=true.
Council of Science Editors:
Wataraka Gamage K. Speech-Based Emotion Recognition: Linguistic and Saliency-Based Systems. [Doctoral Dissertation]. University of New South Wales; 2018. Available from: http://handle.unsw.edu.au/1959.4/60426 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:52192/SOURCE02?view=true

2.
Cummins, Nicholas.
Automatic assessment of depression from speech: paralinguistic analysis, modelling and machine learning.
Degree: Electrical Engineering & Telecommunications, 2016, University of New South Wales
URL: http://handle.unsw.edu.au/1959.4/55642 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:38198/SOURCE02?view=true
Clinical depression is a prominent cause of disability and burden worldwide. Despite this prevalence, the diagnosis of depression, due to its complex clinical characterisation, is a difficult and time-consuming task. Currently, there is a real need for an automated and objective diagnostic aid for use in primary care settings and specialist clinics. Accordingly, this thesis investigates the use of paralinguistic cues for automatically assessing a speaker's level of depression.

Investigations are undertaken to establish the effects of depression in spectral representations of speech and their subsequent acoustic models. A novel Probabilistic Acoustic Volume (PAV) method for robustly estimating Acoustic Volume is presented, and a Monte Carlo approximation that enables the computation of this measure is outlined. Results indicate that reductions in spectral variations can quantitatively characterise speech affected by depression. Within the acoustic models, the following statistically significant findings are made across two key datasets: reductions in localised acoustic variance, a flattening of the acoustic trajectory and reductions in three different Acoustic Volume measures. Further results gained using an array of PAV points give strong statistical evidence that the spectral feature space also becomes more concentrated. Together these observations demonstrate that there is a reduction in the local and global spread of phonetic events in acoustic space in speech affected by depression.

A range of novel approaches for performing depression prediction are also investigated. A comprehensive series of acoustic supervector experiments demonstrates the suitability of the Kullback-Leibler divergence based representation to the task and highlights the difficulties of performing nuisance mitigation within this paradigm. A further series of tests opens up the possibilities for using Relevance Vector Machines when predicting depression using a brute-forced feature space. Of particular interest are tests performed using a novel two-stage rank regression framework designed specifically for regression analysis using ordinal depression scores. Three unique implementations are shown to match or outperform corresponding conventional regression systems. Further results presented highlight the benefits of using the framework; most notably that, in contrast to conventional regressor fusion, score-level fusion of the two-stage systems consistently improves prediction performance.
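The PAV formulation itself is not given in this record; the sketch below only illustrates the underlying Monte Carlo idea of estimating the volume of feature space in which a fitted acoustic density exceeds a threshold. The bounding box, threshold and mixture size are all assumptions for illustration:

```python
# Sketch of a Monte Carlo estimate of "acoustic volume": the volume of the
# region of feature space where a fitted density exceeds a threshold.
# Illustrative only; the thesis' exact PAV measure is defined in the thesis.
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_acoustic_volume(frames, threshold=1e-3, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    gmm = GaussianMixture(n_components=8, covariance_type='diag',
                          random_state=0).fit(frames)
    # Uniformly sample a bounding box around the data.
    lo, hi = frames.min(axis=0) - 1.0, frames.max(axis=0) + 1.0
    box_volume = np.prod(hi - lo)
    samples = rng.uniform(lo, hi, size=(n_samples, frames.shape[1]))
    # Fraction of the box where the density exceeds the threshold.
    inside = np.exp(gmm.score_samples(samples)) > threshold
    return box_volume * inside.mean()
```

Per the abstract's findings, speech affected by depression should yield a smaller estimated volume, reflecting a more concentrated spectral feature space.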
Advisors/Committee Members: Epps, Julien, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, Sethu, Vidhyasaharan, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW.
Subjects/Keywords: Paralinguistic cues; Acoustic Volume measures; Probabilistic Acoustic Volume (PAV); Acoustic models
APA (6th Edition):
Cummins, N. (2016). Automatic assessment of depression from speech: paralinguistic analysis, modelling and machine learning. (Doctoral Dissertation). University of New South Wales. Retrieved from http://handle.unsw.edu.au/1959.4/55642 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:38198/SOURCE02?view=true
Chicago Manual of Style (16th Edition):
Cummins, Nicholas. “Automatic assessment of depression from speech: paralinguistic analysis, modelling and machine learning.” 2016. Doctoral Dissertation, University of New South Wales. Accessed April 23, 2021.
http://handle.unsw.edu.au/1959.4/55642 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:38198/SOURCE02?view=true.
MLA Handbook (7th Edition):
Cummins, Nicholas. “Automatic assessment of depression from speech: paralinguistic analysis, modelling and machine learning.” 2016. Web. 23 Apr 2021.
Vancouver:
Cummins N. Automatic assessment of depression from speech: paralinguistic analysis, modelling and machine learning. [Internet] [Doctoral dissertation]. University of New South Wales; 2016. [cited 2021 Apr 23].
Available from: http://handle.unsw.edu.au/1959.4/55642 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:38198/SOURCE02?view=true.
Council of Science Editors:
Cummins N. Automatic assessment of depression from speech: paralinguistic analysis, modelling and machine learning. [Doctoral Dissertation]. University of New South Wales; 2016. Available from: http://handle.unsw.edu.au/1959.4/55642 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:38198/SOURCE02?view=true

3.
Brown, Stefanie.
Analysis and minimisation of aliasing and truncation errors in the extraction of soundfields using spherical microphone arrays.
Degree: Electrical Engineering & Telecommunications, 2017, University of New South Wales
URL: http://handle.unsw.edu.au/1959.4/59104 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:48464/SOURCE02?view=true
The spherical harmonic (SH) framework is a powerful representation that can be used to describe a 3D soundfield. It decomposes waves propagating through space into a sum of infinitely many SH functions, a set of orthonormal basis functions over the sphere, and a set of source-dependent soundfield coefficients. The coefficients encode spatial information about the source and a wave propagation model. 3D soundfields can be captured or recorded using microphone arrays, manipulated or modified to enhance or reduce contributions from certain sources/locations, and reproduced using loudspeaker arrays to a degree of accuracy dependent on the array geometry. Other uses of the representation include SH beamforming, source localisation and isolation, and acoustic holography.

This thesis focuses on the processes of coefficient extraction and some aspects of soundfield manipulation. Traditional extraction methods seek to gain a representation of all sources within the soundfield to a greater or lesser degree of accuracy. This thesis attempts to extract only the contributions from certain sources, spatially filtering the soundfield. The extraction process (involving discrete spatial sampling) introduces errors into the SH representation, in particular truncation error (TE) from using only a finite number of coefficients to reproduce a soundfield, and spatial aliasing (SA) errors in the extracted coefficients. This thesis derives several closed-form solutions for the TE (excluding SA) under various soundfield conditions, and shows the connection between the plane and spherical wave cases. It also investigates the patterns of SA caused by the regularised inverse (RI) and orthonormal extraction (OE) methods and observes the combined effects of SA and TE for a particular spherical microphone array.

This thesis proposes a spatial Wiener filter (SWF) that makes use of infinite spatial order prior models of the propagation model, source power and location to reduce SA errors and to reduce the contribution of unwanted sources to the coefficients. The RI and OE methods are analysed in the same manner and the SWF is shown to be superior at reducing SA. The SWF is then extended to a finite spatial order case. These methods are compared under various spatial scenarios and show the benefit of the SWF and of eliminating an unwanted source, using the proposed measures of total mean square SA error and mean combined truncation and SA error.
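For concreteness, the regularised-inverse extraction the abstract mentions can be sketched as a damped least-squares fit of truncated SH coefficients to the pressures sampled at the microphone directions. The order, regularisation weight and geometry below are placeholders, not the thesis' settings:

```python
# Sketch of regularised-inverse (RI) extraction of spherical harmonic (SH)
# coefficients from pressures p sampled at microphone directions.
import numpy as np
from scipy.special import sph_harm

def sh_matrix(theta, phi, order):
    """SH basis values at the mic directions; theta = azimuth, phi = polar angle."""
    cols = [sph_harm(m, n, theta, phi)
            for n in range(order + 1) for m in range(-n, n + 1)]
    return np.stack(cols, axis=1)  # (n_mics, (order+1)**2)

def extract_coefficients(p, theta, phi, order=3, lam=1e-3):
    Y = sh_matrix(theta, phi, order)
    # Damped least squares: a = (Y^H Y + lam*I)^(-1) Y^H p.
    A = Y.conj().T @ Y + lam * np.eye(Y.shape[1])
    return np.linalg.solve(A, Y.conj().T @ p)
```

Truncating at a finite order is the source of the truncation error analysed in the thesis; sampling at finitely many microphones is the source of the spatial aliasing.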
Advisors/Committee Members: Sen, Deep, Qualcomm, Taubman, David, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, Sethu, Vidhyasaharan, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW.
Subjects/Keywords: Truncation error; Spherical microphone array; Spatial aliasing; spatial Wiener filter
APA (6th Edition):
Brown, S. (2017). Analysis and minimisation of aliasing and truncation errors in the extraction of soundfields using spherical microphone arrays. (Doctoral Dissertation). University of New South Wales. Retrieved from http://handle.unsw.edu.au/1959.4/59104 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:48464/SOURCE02?view=true
Chicago Manual of Style (16th Edition):
Brown, Stefanie. “Analysis and minimisation of aliasing and truncation errors in the extraction of soundfields using spherical microphone arrays.” 2017. Doctoral Dissertation, University of New South Wales. Accessed April 23, 2021.
http://handle.unsw.edu.au/1959.4/59104 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:48464/SOURCE02?view=true.
MLA Handbook (7th Edition):
Brown, Stefanie. “Analysis and minimisation of aliasing and truncation errors in the extraction of soundfields using spherical microphone arrays.” 2017. Web. 23 Apr 2021.
Vancouver:
Brown S. Analysis and minimisation of aliasing and truncation errors in the extraction of soundfields using spherical microphone arrays. [Internet] [Doctoral dissertation]. University of New South Wales; 2017. [cited 2021 Apr 23].
Available from: http://handle.unsw.edu.au/1959.4/59104 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:48464/SOURCE02?view=true.
Council of Science Editors:
Brown S. Analysis and minimisation of aliasing and truncation errors in the extraction of soundfields using spherical microphone arrays. [Doctoral Dissertation]. University of New South Wales; 2017. Available from: http://handle.unsw.edu.au/1959.4/59104 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:48464/SOURCE02?view=true

4.
Irtza, Saad.
Scalable Hierarchical Language Identification System.
Degree: Electrical Engineering & Telecommunications, 2018, University of New South Wales
URL: http://handle.unsw.edu.au/1959.4/60018 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51118/SOURCE2?view=true
Human speech carries a great deal of information, including linguistic content (i.e., what is being said), speaker identity, language spoken, gender and emotions. The capacity of a machine to automatically recognize a person's spoken language is referred to as automatic spoken language identification (LID). The last few decades have witnessed significant improvements in this area. However, there are still challenges that need to be addressed. State-of-the-art LID systems treat all language hypotheses identically and do not appear to benefit from knowledge about language families. Moreover, these systems suffer when presented with unknown, or out-of-set (OOS), languages to which the systems have not been exposed during training, and when scaled to large numbers of languages. OOS situations are inevitable when LID systems are deployed in real-world applications, since developing a system with speech excerpts from all languages is not feasible. "Scalability" is a highly desirable trait in LID systems, referring to a LID system's potential to accommodate new languages without significant performance degradation.

A hierarchical framework is proposed to develop robust, scalable LID systems that address these shortfalls, based on tree structures that consider language clusters of decreasing size as the tree is traversed from root (the set of all target languages) to leaves (individual languages). A novel approach is proposed to incorporate knowledge of the language clusters into the front-ends of the classification systems employed in each node of a hierarchical LID system. This approach also investigates the use of feature representations tuned to the particular language cluster identification sub-problem at each node. This thesis also explores a novel decision strategy that incorporates information about language cluster model memberships into the front-ends at each node. Experimental results show that the inclusion of new target languages in the hierarchical framework requires minimal re-training of the system compared to non-hierarchical approaches. Furthermore, a novel approach is proposed to develop OOS language models without using additional non-target language data for OOS language model training. The OOS language models are incorporated at multiple levels of the hierarchical framework, leading to better rejection of unknown languages. This research showcases the flexibility of the hierarchical language identification framework and opens up avenues for future research in this field.
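A minimal sketch of the tree-traversal idea described above: each node holds a classifier over its children's language clusters, and identification descends from the root (all target languages) to a leaf (a single language). The node structure and the sklearn-style classifier interface are placeholders, not the thesis' models:

```python
# Sketch of hierarchical language identification: each node classifies among
# its children's language clusters; prediction descends root -> leaf.
class LIDNode:
    def __init__(self, children=None, language=None, classifier=None):
        self.children = children or []   # child LIDNodes (clusters or languages)
        self.language = language         # set only at leaf nodes
        self.classifier = classifier     # predicts an index into children

    def identify(self, features):
        node = self
        while node.children:
            idx = node.classifier.predict([features])[0]
            node = node.children[idx]
        return node.language
```

Under this structure, adding a new target language only requires retraining the classifiers on the root-to-leaf path of its cluster rather than the whole system, which is the scalability benefit the abstract describes.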
Advisors/Committee Members: Ambikairajah, Eliathamby, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, Sethu, Vidhyasaharan, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW.
Subjects/Keywords: Language Clustering; Spoken Language Identification; Hierarchical Structure; Deep Neural Network, CNN, LSTM; End-to-End Language Identification System
APA (6th Edition):
Irtza, S. (2018). Scalable Hierarchical Language Identification System. (Doctoral Dissertation). University of New South Wales. Retrieved from http://handle.unsw.edu.au/1959.4/60018 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51118/SOURCE2?view=true
Chicago Manual of Style (16th Edition):
Irtza, Saad. “Scalable Hierarchical Language Identification System.” 2018. Doctoral Dissertation, University of New South Wales. Accessed April 23, 2021.
http://handle.unsw.edu.au/1959.4/60018 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51118/SOURCE2?view=true.
MLA Handbook (7th Edition):
Irtza, Saad. “Scalable Hierarchical Language Identification System.” 2018. Web. 23 Apr 2021.
Vancouver:
Irtza S. Scalable Hierarchical Language Identification System. [Internet] [Doctoral dissertation]. University of New South Wales; 2018. [cited 2021 Apr 23].
Available from: http://handle.unsw.edu.au/1959.4/60018 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51118/SOURCE2?view=true.
Council of Science Editors:
Irtza S. Scalable Hierarchical Language Identification System. [Doctoral Dissertation]. University of New South Wales; 2018. Available from: http://handle.unsw.edu.au/1959.4/60018 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51118/SOURCE2?view=true

5.
Dang, Ting.
Speech based Continuous Emotion Prediction: An investigation of Speaker Variability and Emotion Uncertainty.
Degree: Electrical Engineering & Telecommunications, 2018, University of New South Wales
URL: http://handle.unsw.edu.au/1959.4/60161 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51221/SOURCE02?view=true
Understanding and describing human emotional state is important for many applications, such as interactive human-computer interface design and clinical diagnosis tools. Speech based emotion prediction is generally viewed as a regression problem, where speech waveforms are labelled in terms of affective attributes such as arousal and valence, with numerical values indicating the short-term emotion intensity. Current research on continuous emotion prediction has primarily focused on improving the backend, developing novel features or improving feature selection techniques. However, emotion expressions and perceptions are in general heterogeneous across individuals, depending on a wide range of factors such as cultural background and the speaker's gender. The impact of these sources of variation on continuous emotion prediction systems has not yet been fully explored and is the focus of this thesis.

Speaker variability, i.e., differences in emotion expression among speakers, has been shown to be one of the most confounding factors in categorical emotion recognition systems, but there is limited literature that analyses its effect on continuous emotion prediction systems. In this thesis, a probabilistic framework is proposed to quantify speaker variability in continuous emotion systems in both the feature and the model domains. Furthermore, three compensation techniques for speaker variability are developed, and in-depth analyses in both the feature and model spaces are carried out.

Another confounding factor is inter-rater variability, i.e., differences in emotion perception among raters, which is ignored in current approaches by taking the average rating across multiple raters as the 'true' representation of the emotion state. However, differences in perception among raters suggest that prediction certainty varies with time. A novel approach for the prediction of emotion uncertainty is proposed and implemented by including the inter-rater variability as a representation of the uncertainty information in a probabilistic model. In addition, Kalman filters are incorporated into this framework to take into account the temporal dependencies of the emotion uncertainty, as well as providing the flexibility to relax the Gaussianity assumption on the emotion distribution that reflects the uncertainty. The proposed frameworks and methods have been extensively evaluated on multiple state-of-the-art databases and the results have demonstrated the potential of the proposed solutions.
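The record does not give the thesis' model, but the core idea of tracking an affective attribute and its uncertainty over time with a Kalman filter can be sketched in one dimension, using the per-frame inter-rater variance as the observation noise. The random-walk state model and process noise below are assumptions:

```python
# Sketch: tracking an affective attribute (e.g., arousal) and its uncertainty
# with a 1-D Kalman filter; per-frame inter-rater variance is the observation noise.
import numpy as np

def track_emotion(mean_ratings, rater_variances, q=0.01):
    x, P = mean_ratings[0], rater_variances[0]    # initial state and variance
    means, variances = [x], [P]
    for z, r in zip(mean_ratings[1:], rater_variances[1:]):
        P = P + q                      # predict under a random-walk model
        K = P / (P + r)                # Kalman gain; noisier raters -> smaller K
        x = x + K * (z - x)            # update towards the mean rating
        P = (1.0 - K) * P
        means.append(x)
        variances.append(P)
    return np.array(means), np.array(variances)   # prediction and its uncertainty
```

Frames where raters disagree strongly (large r) pull the filter less and leave a wider posterior variance, i.e., higher predicted emotion uncertainty at those times.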
Advisors/Committee Members: Sethu, Vidhyasaharan, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, Ambikairajah, Eliathamby, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW.
Subjects/Keywords: pattern recognition; speech processing; machine learning
APA (6th Edition):
Dang, T. (2018). Speech based Continuous Emotion Prediction: An investigation of Speaker Variability and Emotion Uncertainty. (Doctoral Dissertation). University of New South Wales. Retrieved from http://handle.unsw.edu.au/1959.4/60161 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51221/SOURCE02?view=true
Chicago Manual of Style (16th Edition):
Dang, Ting. “Speech based Continuous Emotion Prediction: An investigation of Speaker Variability and Emotion Uncertainty.” 2018. Doctoral Dissertation, University of New South Wales. Accessed April 23, 2021.
http://handle.unsw.edu.au/1959.4/60161 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51221/SOURCE02?view=true.
MLA Handbook (7th Edition):
Dang, Ting. “Speech based Continuous Emotion Prediction: An investigation of Speaker Variability and Emotion Uncertainty.” 2018. Web. 23 Apr 2021.
Vancouver:
Dang T. Speech based Continuous Emotion Prediction: An investigation of Speaker Variability and Emotion Uncertainty. [Internet] [Doctoral dissertation]. University of New South Wales; 2018. [cited 2021 Apr 23].
Available from: http://handle.unsw.edu.au/1959.4/60161 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51221/SOURCE02?view=true.
Council of Science Editors:
Dang T. Speech based Continuous Emotion Prediction: An investigation of Speaker Variability and Emotion Uncertainty. [Doctoral Dissertation]. University of New South Wales; 2018. Available from: http://handle.unsw.edu.au/1959.4/60161 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51221/SOURCE02?view=true

6.
Sriskandaraja, Kaavya.
Spoofing countermeasures for secure and robust voice authentication system: Feature extraction and modelling.
Degree: Electrical Engineering & Telecommunications, 2018, University of New South Wales
URL: http://handle.unsw.edu.au/1959.4/60356 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51915/SOURCE02?view=true
The ability to employ automatic speaker verification systems without face-to-face contact makes them more prone to spoofing attacks compared to other biometric systems. The study of spoofing countermeasures has become increasingly important and is currently a critical area of research, which is the principal objective of this thesis. Additionally, as preliminary work, this thesis aimed to make the automatic speaker verification system robust to adverse noise conditions by proposing a self-adaptive voice activity detector, which combines cepstral modelling and smoothed energy with effective post-processing stages. Thus, the overarching goal of this thesis is to significantly advance the state-of-the-art in automatic speaker verification systems by making them more secure and robust.

Spoofing attacks can be categorised into one of four types: impersonation, replay, voice conversion or speech synthesis. Among these, speech synthesis (SS), voice conversion (VC) and replay attacks have been identified as the most effective and accessible. Accordingly, this thesis investigates and develops a framework to extract discriminative features to deflect these three attacks. Investigations are undertaken to analyse the discrimination between spoofed and genuine speech as a function of frequency bands across the speech bandwidth, which in turn informed some novel filter bank designs for spoofing detection. In order to capture a richer representation of the spectral content of speech, novel hierarchical scattering decomposition based features are proposed to implement effective front-ends for stand-alone spoofing detection. The results showed that the proposed scattering features were superior to all other front-ends that had previously been benchmarked on the VC, SS and replay corpora. Consequently, a hybrid network consisting of a scattering network followed by a convolutional network is also investigated.

Finally, a novel approach to evaluate the similarities between pairs of speech samples is proposed to detect replayed speech, based on a suitable embedding learned by deep Siamese architectures. Siamese networks are particularly suited to this task and have been shown to be effective in problems where intra-class variability is large and the number of training samples per class is relatively small. The proposed Siamese architecture produces state-of-the-art performance when evaluated on the ASVspoof2017 challenge corpus.
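A minimal sketch of the Siamese-embedding idea for replay detection: twin encoders share weights, and a contrastive loss pulls same-class pairs together and pushes genuine/replayed pairs apart. The layer sizes, input dimension and margin are placeholders, not the thesis' architecture (PyTorch assumed):

```python
# Sketch of a Siamese embedding for replay detection with a contrastive loss.
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    def __init__(self, in_dim=257, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, emb_dim))

    def forward(self, a, b):
        return self.net(a), self.net(b)   # shared weights: one encoder, used twice

def contrastive_loss(ea, eb, same, margin=1.0):
    d = torch.norm(ea - eb, dim=1)
    # same == 1: pull the pair together; same == 0: push apart up to the margin.
    return (same * d.pow(2)
            + (1 - same) * torch.clamp(margin - d, min=0).pow(2)).mean()
```

At test time a trial can be scored by the distance between its embedding and those of enrolled genuine examples; pairwise training of this kind suits tasks with few samples per class, as the abstract notes.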
Advisors/Committee Members: Ambikairajah, Eliathamby, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, Sethu, Vidhyasaharan, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW.
Subjects/Keywords: Spoofing countermeasures; Speech Processing; Speaker verification; Voice authentication
APA (6th Edition):
Sriskandaraja, K. (2018). Spoofing countermeasures for secure and robust voice authentication system: Feature extraction and modelling. (Doctoral Dissertation). University of New South Wales. Retrieved from http://handle.unsw.edu.au/1959.4/60356 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51915/SOURCE02?view=true
Chicago Manual of Style (16th Edition):
Sriskandaraja, Kaavya. “Spoofing countermeasures for secure and robust voice authentication system: Feature extraction and modelling.” 2018. Doctoral Dissertation, University of New South Wales. Accessed April 23, 2021.
http://handle.unsw.edu.au/1959.4/60356 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51915/SOURCE02?view=true.
MLA Handbook (7th Edition):
Sriskandaraja, Kaavya. “Spoofing countermeasures for secure and robust voice authentication system: Feature extraction and modelling.” 2018. Web. 23 Apr 2021.
Vancouver:
Sriskandaraja K. Spoofing countermeasures for secure and robust voice authentication system: Feature extraction and modelling. [Internet] [Doctoral dissertation]. University of New South Wales; 2018. [cited 2021 Apr 23].
Available from: http://handle.unsw.edu.au/1959.4/60356 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51915/SOURCE02?view=true.
Council of Science Editors:
Sriskandaraja K. Spoofing countermeasures for secure and robust voice authentication system: Feature extraction and modelling. [Doctoral Dissertation]. University of New South Wales; 2018. Available from: http://handle.unsw.edu.au/1959.4/60356 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51915/SOURCE02?view=true

7.
Ma, Jianbo.
Modelling and compensation techniques for short duration speaker verification.
Degree: Electrical Engineering & Telecommunications, 2019, University of New South Wales
URL: http://handle.unsw.edu.au/1959.4/61432 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:55498/SOURCE02?view=true
Voice based biometric systems have been the focus of active research for a number of decades. These systems have a number of advantages, including their non-invasive nature and the ability to transmit voice over a variety of channels. However, a key difficulty that impedes their widespread use is the inability of current systems to work accurately with short speech utterances, since it is unrealistic to collect long utterances in many scenarios. The goal of this thesis is to improve the accuracy of speaker verification systems on utterances that are less than ten seconds long.

This thesis shows that the conventional i-vector representation, derived from the total variability model, is not an accurate representation for short duration utterances. More accurate models are developed in this thesis. Firstly, a generalisation of the total variability model is developed by allowing the distribution of latent variables to be a mixture of Gaussians. Secondly, it was found that the information in each phonetic group, referred to as the local acoustic variability, is complementary to the total variability model. Consequently, this thesis proposes a local acoustic model that utilises this information. Thirdly, the current i-vector representation of an utterance is sensitive to phonetic mismatch, which is severe in short utterances. This thesis proposes a mixture of total variability models to obtain speaker-phonetic vector representations.

The vectors representing short utterances are also distributed differently to those representing long utterances, and this difference will propagate into the back-end. As such, this thesis proposes compensation techniques. Specifically, projection methods based on a Gaussian probabilistic linear discriminant analysis (GPLDA) model with tied latent variables, and neural networks, are proposed to normalise duration mismatches in the vector representation space. In the projected space, vector representations of long and short utterances are more likely to be similarly distributed. Finally, a twin model GPLDA back-end that uses two different sets of parameters to model short and long utterances differently, connected by a shared speaker identity, is proposed to generate more reliable scores.

The modelling and compensation techniques proposed in this thesis are effective in mitigating the problems caused by short durations in speaker verification.
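The thesis' specific projections cannot be reproduced from this record; the sketch below only illustrates the general compensation idea of learning a mapping that makes short-utterance vectors distribute more like their long-utterance counterparts, using a generic regressor in place of the thesis' tied-GPLDA and neural-network methods:

```python
# Sketch of duration-mismatch compensation: learn a mapping from vectors of
# short utterances to vectors of long utterances from the same speakers,
# then project short-utterance vectors through it before back-end scoring.
# The regressor and pairing scheme are illustrative, not the thesis' models.
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_duration_compensator(short_vecs, long_vecs):
    """short_vecs[i] and long_vecs[i] are utterance vectors from the same speaker."""
    model = MLPRegressor(hidden_layer_sizes=(256,), max_iter=500, random_state=0)
    model.fit(short_vecs, long_vecs)
    return model

# Usage: project a short test vector into the "long-utterance" space, where it
# is more likely to be distributed like the enrolment vectors.
# compensated = model.predict(short_test_vec.reshape(1, -1))
```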
Advisors/Committee Members: Ambikairajah, Eliathamby, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, Sethu, Vidhyasaharan, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW.
Subjects/Keywords: Generalized variability model; Automatic speaker verification; Short duration; Ig-vector; Duration mismatch; Twin model GPLDA
APA (6th Edition):
Ma, J. (2019). Modelling and compensation techniques for short duration speaker verification. (Doctoral Dissertation). University of New South Wales. Retrieved from http://handle.unsw.edu.au/1959.4/61432 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:55498/SOURCE02?view=true
Chicago Manual of Style (16th Edition):
Ma, Jianbo. “Modelling and compensation techniques for short duration speaker verification.” 2019. Doctoral Dissertation, University of New South Wales. Accessed April 23, 2021.
http://handle.unsw.edu.au/1959.4/61432 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:55498/SOURCE02?view=true.
MLA Handbook (7th Edition):
Ma, Jianbo. “Modelling and compensation techniques for short duration speaker verification.” 2019. Web. 23 Apr 2021.
Vancouver:
Ma J. Modelling and compensation techniques for short duration speaker verification. [Internet] [Doctoral dissertation]. University of New South Wales; 2019. [cited 2021 Apr 23].
Available from: http://handle.unsw.edu.au/1959.4/61432 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:55498/SOURCE02?view=true.
Council of Science Editors:
Ma J. Modelling and compensation techniques for short duration speaker verification. [Doctoral Dissertation]. University of New South Wales; 2019. Available from: http://handle.unsw.edu.au/1959.4/61432 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:55498/SOURCE02?view=true