The speech signal is a nonstationary signal generated by a complex phenomenon influenced by the autonomic and somatic nervous systems through the modulation of breathing activity, vocal muscle tension, salivation, and mucus secretion. Generally, the phonatory system can be modelled according to the so-called source-filter theory of speech production. In this model, the source is the pulsatile or turbulent airflow generated by the modulating action of the vocal folds: through their closing and opening motion, the vocal folds modulate the airflow coming from the lungs. This source is then filtered according to the resonance characteristics of the supraglottal vocal tract. These resonances depend on the size and shape of the tract, and they are continuously modified to allow the emission of specific sound targets.
Figure 1: Contribution of the different tracts in speaking
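The source-filter model described above can be illustrated with a minimal numerical sketch: a pulse train stands in for the glottal source, and a two-pole resonator stands in for a single vocal-tract formant. All numerical values (120 Hz fundamental, a 700 Hz formant) are illustrative assumptions, not parameters from this study.

```python
import numpy as np

def glottal_source(f0, fs, duration):
    """Pulse train approximating the periodic glottal airflow (voiced source)."""
    n = int(fs * duration)
    src = np.zeros(n)
    period = int(fs / f0)            # samples per glottal cycle
    src[::period] = 1.0              # one pulse per vocal-fold closure
    return src

def formant_filter(x, freq, bandwidth, fs):
    """Two-pole IIR resonator modelling a single vocal-tract formant."""
    r = np.exp(-np.pi * bandwidth / fs)       # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs             # pole angle from centre frequency
    a1, a2 = -2 * r * np.cos(theta), r * r    # denominator coefficients
    y = np.zeros_like(x)
    for i in range(len(x)):
        y[i] = x[i] - a1 * (y[i-1] if i >= 1 else 0.0) - a2 * (y[i-2] if i >= 2 else 0.0)
    return y

fs = 16000
src = glottal_source(f0=120.0, fs=fs, duration=0.1)            # source: ~120 Hz pulse train
out = formant_filter(src, freq=700.0, bandwidth=100.0, fs=fs)  # filter: /a/-like first formant
```

The spectrum of `out` keeps the harmonic structure of the source, with energy concentrated around the resonance, which is exactly the factorization the source-filter theory postulates.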
Ekman stated that moods are emotional feelings lasting for an extended period of time, whereas emotions are temporary feelings that tend to come and go quite quickly. Speech can be usefully investigated to characterize the emotional and/or mood state of a speaker, and many studies have addressed both. Emotions can be studied using several kinds of databases, in which the emotions may be natural, acted, or induced. Moods are usually investigated in relation to mental disorders. In particular, bipolar disorder is characterized by great mood variability, since bipolar patients experience sudden and sometimes extreme mood swings.
Nevertheless, most efforts in the literature have focused on patients affected by depression.
Generally, speech-related features can be divided into three main categories. The first category aims at investigating the prosodic dynamics of speech; perceived rhythm, stress, intonation, pitch, speaking rate, and loudness are some of the cues that can be studied. The second category is related to the source of voice production and to the airflow streaming from the lungs through the glottis. Source features are also investigated to obtain information about voice quality, i.e. the auditory perception of changes in vocal fold vibration and vocal tract shape. Finally, the third category is related to the spectral analysis of the speech signal.
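Pitch is the prototypical prosodic cue from the first category. A common way to estimate it is to pick the first strong autocorrelation peak within a plausible pitch range; the following sketch applies this to a synthetic 200 Hz frame (the pitch range and frame length are illustrative assumptions).

```python
import numpy as np

def estimate_f0_autocorr(frame, fs, fmin=50.0, fmax=500.0):
    """Rough F0 estimate: highest autocorrelation peak within a plausible pitch range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)              # search window in samples
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / lag

fs = 8000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 200.0 * t)   # synthetic 200 Hz "voiced" frame
f0 = estimate_f0_autocorr(frame, fs)
```

Tracking this estimate frame by frame yields the pitch contour from which intonation and other prosodic statistics are derived.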
The voice signal is nonstationary, but it can be considered stationary when analysed over a sufficiently short period of time. When this kind of signal is investigated, it is important to properly detect and segment the voiced segments: minimizing the rate of segment mislabelling is mandatory to achieve reliable estimates of the investigated features. Moreover, since the voice signal is characterized by high intra- and inter-day variability, the features must be chosen carefully. They should highlight statistically significant differences related to emotion or mood state transitions while remaining robust to natural daily variability. In this frame, the recording environment must also be taken into account: environmental noise and/or reverberation can severely alter the acquired signal, leading to misleading results and conclusions.
Figure 2: Segmentation of the word "author".
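A simple way to sketch voiced-segment detection is to combine short-time energy (voiced speech is loud) with zero-crossing rate (voiced speech is periodic, so it crosses zero rarely). The thresholds below are illustrative assumptions, not the ones used in this work, and real detectors are considerably more robust.

```python
import numpy as np

def voiced_mask(x, fs, frame_ms=25, energy_thresh=0.01, zcr_thresh=0.25):
    """Label each frame as voiced using short-time energy and zero-crossing rate."""
    n = int(fs * frame_ms / 1000)
    mask = []
    for start in range(0, len(x) - n + 1, n):
        frame = x[start:start + n]
        energy = np.mean(frame ** 2)                          # loudness cue
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2    # periodicity cue
        mask.append(energy > energy_thresh and zcr < zcr_thresh)
    return mask

fs = 8000
t = np.arange(fs) / fs
voiced = np.sin(2 * np.pi * 150 * t[:fs // 2])        # periodic segment: low ZCR
rng = np.random.default_rng(0)
unvoiced = 0.3 * rng.standard_normal(fs // 2)         # noise-like segment: high ZCR
mask = voiced_mask(np.concatenate([voiced, unvoiced]), fs)
```

On this synthetic signal the first half of the frames is labelled voiced and the second half unvoiced; on real speech, mislabelled frames at segment boundaries are exactly the error source the text warns about.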
In our activities, speech signals were investigated to recognize and characterize emotional states in actors and mood states in patients affected by bipolar disorder. The investigation was performed at different levels of description: the initial studies focused on micro-prosodic phenomena, observing small emotion- and mood-related changes of the glottal cycle, while global prosodic and voice quality studies were conducted later. More in detail, four methods covering three description levels were investigated in this study. The first method focuses on vocal features (lower description level), namely glottal features: the mean and standard deviation of the fundamental frequency, and jitter. Two further methods focus on prosodic features (mid description level): the first performs a prosodic analysis within every voiced segment, while the second considers the prosodic behaviour of the speech as a whole. Finally, the last method investigates voice quality (higher description level) by means of the Long-Term Average Spectrum of voice.
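The glottal features of the first method can be computed directly from the sequence of glottal cycle lengths. The sketch below uses the common "local jitter" definition (mean absolute difference of consecutive periods relative to the mean period); the period values are hypothetical, inserted only to make the computation concrete.

```python
import numpy as np

def jitter_local(periods):
    """Local jitter (%): mean absolute difference of consecutive glottal periods,
    relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

# Hypothetical cycle lengths (ms) measured between consecutive glottal closures.
periods_ms = [8.0, 8.1, 7.9, 8.05, 7.95]

f0_per_cycle = 1000.0 / np.asarray(periods_ms)   # instantaneous F0 in Hz
f0_mean = 1000.0 / np.mean(periods_ms)           # mean F0 from the mean period
f0_std = np.std(f0_per_cycle)                    # F0 variability
jit = jitter_local(periods_ms)                   # cycle-to-cycle perturbation
```

Mean and standard deviation of F0 capture slow pitch behaviour, while jitter captures the cycle-to-cycle perturbation that the micro-prosodic analysis targets.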
The developed algorithms were tested on synthetic datasets, healthy control subjects, and a neutral database providing both audio and electroglottographic (EGG) recordings (the CMU Arctic Database). Emotional studies were conducted on the German Emotional Database, composed of actors playing different emotions. Finally, the mood investigation was performed on a database of audio samples acquired from bipolar patients. The patients were enrolled within the PSYCHE European project and performed two different vocal tasks: text reading and free image commenting.
Concerning the German Emotional Database, the results showed that the proposed methods were able to highlight statistically significant differences among emotional speeches according to the arousal level of the acted emotion: the more aroused the subjects, the more their speech features differ from those of low-arousal states.
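Such a between-group comparison can be sketched with a two-sample statistic; the source does not state which test was used, so the following uses Welch's t statistic on entirely hypothetical mean-F0 values for high- vs low-arousal utterances, purely to illustrate the kind of comparison involved.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

# Hypothetical mean-F0 values (Hz): high-arousal speech tends to show raised pitch.
high_arousal = [210, 225, 218, 230, 222]
low_arousal = [180, 175, 190, 185, 178]
t = welch_t(high_arousal, low_arousal)
```

A large positive t here would indicate that the feature separates the two arousal groups well; in practice the resulting p-value would be compared against a significance threshold.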
As regards the analysis of the bipolar data, even if the limited number of enrolled patients does not allow generalization, the results showed statistically significant differences at every description level. Some feature trends were observed, but not all of them were coherent among the enrolled patients or across the investigated tasks; some may be patient- or task-specific. Nevertheless, it is important to highlight that at the higher description levels some features showed coherent trends: unlike the features investigated at lower description levels, which showed patient-specific trends, some of the inter-state analyses at higher levels highlighted coherent feature trends among the enrolled patients. This suggests that a higher level of description might be needed to overcome the problem of high vocal variability. For this purpose, still higher levels of description could be taken into account: for instance, the information gathered by a semantic analysis of the speech, together with the approaches investigated here, could lead to a deeper understanding of these phenomena.