2005
Authors
Ferreira, AJS;
Publication
9th European Conference on Speech Communication and Technology
Abstract
Current signal processing techniques do not match the astonishing ability of the Human Auditory System in recognizing isolated vowels, particularly in the case of female or child speech. As didactic and clinical interactive applications are needed using sound as the main medium of interaction, new signal features must be used that capture important perceptual cues more effectively than popular features such as formants. In this paper we propose the new concept of Perceptual Spectral Cluster (PSC) and describe its implementation. Test results are presented for child and adult speech, and indicate that features elicited by the PSC concept permit reliable and robust identification of vowels, even at high pitches.
2006
Authors
Ferreira, AJS; Sirilia, D;
Publication
Audio Engineering Society - 120th Convention Spring Preprints 2006
Abstract
3G mobile and wireless communication networks elicit new ways of multimedia human interaction and communication, notably two-way high-quality audio communication. This is in line with both the consumer expectation of new audio experiences and functionalities, and with the motivation of Telecom Operators to offer consumers new services and communication modalities. In this paper we describe the design and optimization of a monophonic audio coder (Audio Communication Coder - ACC) that features low-delay coding (< 50 ms) and intrinsic error robustness, while minimizing complexity and achieving competitive coding gains and audio quality at bit rates around 32 kbit/s and higher. ACC source, perceptual and bandwidth extension tools are described and an emphasis is placed on ACC structural and operational features making it suitable for real-time, two-way audio communication. A few performance results are also presented. Audio demos are available at http://www.atc-labs.com/acc/ .
2008
Authors
Ferreira, A;
Publication
SIGMAP 2008: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND MULTIMEDIA APPLICATIONS
Abstract
Vowel recognition is frequently based on Linear Prediction (LP) analysis and formant estimation techniques. However, the performance of these techniques decreases in the case of female or child speech because at high pitch frequencies (F0) the magnitude spectrum is sparsely sampled, making formant estimation unreliable. In this paper we describe the implementation of a perceptually motivated concept of vowel recognition that is based on Perceptual Spectral Clusters (PSC) of harmonic partials. PSC based features were evaluated in automatic recognition tests using the Mahalanobis distance and a database of five natural Portuguese vowel sounds uttered by 44 speakers, 27 of whom are child speakers. LP based features and Mel-Frequency Cepstral Coefficients (MFCC) were also included in the tests as a reference. Results show that while the recognition performance of PSC features falls between that of LP based features and that of MFCC coefficients, the normalization of PSC features by F0 increases the performance and approaches that of MFCC coefficients. PSC features are not only amenable to a psychophysical interpretation (as LP based features are) but also have the potential to compete with global shape features such as MFCCs.
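As a rough illustration of the Mahalanobis-distance recognition step mentioned in the abstract above, the following sketch fits one mean vector and covariance matrix per vowel class and assigns a new feature vector to the nearest class. It is not the paper's implementation: the two-dimensional feature vectors, the class labels and the training values are invented for the example, and the helper names (`fit_class`, `mahalanobis_sq`, `classify`) are hypothetical.

```python
from statistics import mean

def fit_class(samples):
    """Return (mean vector, inverse covariance matrix) for 2-D samples."""
    xs = [s[0] for s in samples]
    ys = [s[1] for s in samples]
    mx, my = mean(xs), mean(ys)
    n = len(samples)
    # Unbiased sample covariance terms.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    det = sxx * syy - sxy * sxy
    # Closed-form inverse of a 2x2 symmetric matrix.
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv

def mahalanobis_sq(point, model):
    """Squared Mahalanobis distance from a point to a class model."""
    (mx, my), inv = model
    dx, dy = point[0] - mx, point[1] - my
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

def classify(point, models):
    """Pick the class whose model is closest in Mahalanobis distance."""
    return min(models, key=lambda label: mahalanobis_sq(point, models[label]))

# Invented training data: two hypothetical features per utterance.
training = {
    "a": [(1.0, 2.0), (1.2, 2.1), (0.8, 1.8), (1.1, 2.3)],
    "i": [(3.0, 0.5), (3.2, 0.7), (2.9, 0.4), (3.1, 0.8)],
}
models = {label: fit_class(samples) for label, samples in training.items()}
```

Calling `classify((1.0, 2.0), models)` selects the class whose distribution best explains the point, taking the spread and correlation of each class into account, which a plain Euclidean distance would ignore.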
2011
Authors
Sousa, R; Ferreira, A;
Publication
12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5
Abstract
In this paper we introduce new phase-related features denoting the delay between the harmonics and the fundamental frequency of a periodic signal, notably of voiced singing. These features are identified as Normalized Relative Delay (NRD) and denote the phase contribution to the shape invariance of a periodic signal. Thus, NRDs are amenable to a physical and psychophysical interpretation and are structurally independent of the overall time shift of the signal, an important property that is shared with the magnitude spectrum in the case of a locally stationary signal. We describe the NRD and report on preliminary studies testing the discrimination capability of NRDs applied to singing signals.
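The time-shift independence claimed in the abstract above can be illustrated with a small sketch. It assumes one plausible formulation of a relative delay feature, namely the phase of harmonic k minus k times the phase of the fundamental, expressed as a fraction of a cycle; the abstract does not give the exact NRD definition, so this formula and the function names (`relative_delays`, `shift_phases`) are assumptions for illustration only.

```python
import math

def relative_delays(phases):
    """Relative delay of each harmonic with respect to the fundamental.

    phases[0] is the phase (radians) of the fundamental, phases[k-1] the
    phase of the k-th harmonic. The result is a fraction of a cycle in
    [0, 1). Assumed formulation, not necessarily the paper's definition.
    """
    phi1 = phases[0]
    return [((phi_k - k * phi1) / (2 * math.pi)) % 1.0
            for k, phi_k in enumerate(phases, start=1)]

def shift_phases(phases, f0, tau):
    """Phases of the same harmonics after delaying the signal by tau seconds."""
    return [phi_k - 2 * math.pi * k * f0 * tau
            for k, phi_k in enumerate(phases, start=1)]

# Invented example: three harmonics of a 220 Hz tone, delayed by 1.3 ms.
phases = [0.3, 1.1, -0.7]
before = relative_delays(phases)
after = relative_delays(shift_phases(phases, f0=220.0, tau=0.0013))
```

Because a delay of tau changes the phase of harmonic k by k times the change in the fundamental's phase, the two lists `before` and `after` agree up to rounding, while the raw phases do not.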
2023
Authors
Silva, JM; Oliveira, MA; Saraiva, AF; Ferreira, AJS;
Publication
ACOUSTICS
Abstract
The estimation of the frequency of sinusoids has been the object of intense research for more than 40 years. Its importance in classical fields such as telecommunications, instrumentation, and medicine has been extended to numerous specific signal processing applications involving, for example, speech, audio, and music processing. In many cases, these applications run in real-time and, thus, require accurate, fast, and low-complexity algorithms. Taking the normalized Cramér-Rao lower bound as a reference, this paper evaluates the relative performance of nine non-iterative discrete Fourier transform (DFT)-based individual sinusoid frequency estimators when the target sinusoid is affected by full-bandwidth quasi-harmonic interference, in addition to stationary noise. Three levels of the quasi-harmonic interference severity are considered: no harmonic interference, mild harmonic interference, and strong harmonic interference. Moreover, the harmonic interference is amplitude-modulated and frequency-modulated, reflecting real-world conditions, e.g., in singing and musical chords. Results are presented for Signal-to-Noise Ratios (SNR) between -10 dB and 70 dB, and they reveal that the relative performance of different frequency estimators depends on the SNR and on the selectivity and leakage of the window that is used, but also changes drastically as a function of the severity of the quasi-harmonic interference. In particular, when this interference is strong, the performance curves of the majority of the tested frequency estimators collapse to a few trends around and above 0.4% of the DFT bin width.
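To make the idea of a non-iterative DFT-based frequency estimator concrete, the sketch below refines the DFT peak location with parabolic interpolation of the log-magnitude spectrum after Hann windowing. This is a widely used textbook estimator chosen here as an assumption; the abstract does not name the nine estimators that were evaluated, and the test signal (a clean 1010 Hz sinusoid, no noise or interference) is invented for the example.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes of the first N/2 DFT bins (direct O(N^2) evaluation)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def estimate_frequency(x, fs):
    """Estimate the frequency (Hz) of the strongest sinusoid in x."""
    n = len(x)
    # Hann window to reduce spectral leakage.
    windowed = [x[t] * (0.5 - 0.5 * math.cos(2 * math.pi * t / n))
                for t in range(n)]
    mags = dft_magnitudes(windowed)
    # Coarse estimate: the largest bin that has both neighbours.
    k = max(range(1, len(mags) - 1), key=mags.__getitem__)
    # Fine estimate: vertex of the parabola through three log-magnitudes.
    a, b, c = (math.log(mags[i]) for i in (k - 1, k, k + 1))
    delta = 0.5 * (a - c) / (a - 2 * b + c)
    return (k + delta) * fs / n

# Invented test signal: 1010 Hz sinusoid, 8 kHz sampling, 256 samples.
fs, n, f_true = 8000.0, 256, 1010.0
signal = [math.sin(2 * math.pi * f_true * t / fs) for t in range(n)]
f_est = estimate_frequency(signal, fs)
```

With a bin width of 31.25 Hz, picking the peak bin alone would return 1000 Hz; the interpolation step recovers a sub-bin estimate from the same three spectral samples without any iteration.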
2023
Authors
Jesus, LMT; Ferreira, JFS; Ferreira, AJS;
Publication
JASA EXPRESS LETTERS
Abstract
The temporal distribution of acoustic cues in whispered speech was analyzed using the gating paradigm. Fifteen Portuguese participants listened to real disyllabic words produced by four Portuguese speakers. Lexical choices, confidence scores, isolation points (IPs), and recognition points (RPs) were analyzed. Mixed effects models predicted that the first syllable and 70% of the total duration of the second syllable were needed for lexical choices to be above chance level. Fricatives' place, not voicing, had a significant effect on the percentage of correctly identified words. IP and RP values of words with postalveolar voiced and voiceless fricatives were significantly different.