
A Finnish audiovisual speech synthesizer

Mikko Sams, Kimmo Kaski, and Janne Kulju

Laboratory of Computational Engineering, Helsinki University of Technology, 02015 HUT, Espoo, Finland

In face-to-face communication, speech perception is both auditory and visual. Under very noisy conditions, visual information from the talker's articulation can help a listener understand speech that is barely audible. We have started developing a Finnish audio-visual speech synthesizer. In our current model, we have combined a three-dimensional facial model, based on the work of Parke [1], with a commercial audio text-to-speech synthesizer. The visual speech is based on a straightforward letter-to-viseme mapping (a viseme is the visual equivalent of a phoneme), in which each letter of a written text corresponds to a viseme. Visual speech is animated by linear interpolation between visemes.

Our facial model is a parameter-controlled topological model. It is currently controlled by 49 parameters, 12 of which are employed for visual speech, affecting jaw rotation and lip shape. The speech parameters are defined in terms of the model's coordinate system rather than the physiological properties of the face. The audio-visual speech synthesizer consists of the facial model and the MikroPuhe 4.1 audio speech synthesizer (Timehouse, Inc.), which has been modified slightly to provide the signals needed for synchronization. The currently active viseme, as well as the previous and subsequent visemes, are obtained from the audio synthesizer.
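As a rough illustration of what such a parameter-driven control interface could look like, consider the following Python sketch. It is not the authors' actual implementation; all parameter names are hypothetical, and only the parameter counts (49 in total, 12 for speech) come from the text above.

    # Hypothetical sketch of a parameter-controlled facial model interface.
    # Parameter names below are illustrative; the real model uses 49
    # parameters, 12 of which drive visual speech (jaw and lips).

    class FacialModel:
        """Topological face mesh deformed by named control parameters."""

        SPEECH_PARAMS = [
            "jaw_rotation",      # opens/closes the jaw
            "lip_width",         # horizontal mouth stretch
            "lip_protrusion",    # forward movement of the lips
            "upper_lip_raise",
            "lower_lip_lower",
            # remaining speech parameters omitted for brevity
        ]

        def __init__(self):
            # All parameters start at a neutral (rest) position.
            self.params = {name: 0.0 for name in self.SPEECH_PARAMS}

        def set_params(self, values):
            """Update a subset of parameters and redraw the mesh."""
            self.params.update(values)
            self.redraw()

        def redraw(self):
            # In the real system this would deform and render the 3D mesh;
            # left as a placeholder here.
            pass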

The 12 viseme parameter values were determined heuristically by trial and error. One letter in the text corresponds to one viseme, with the exception of "nk" and "ng", which are each represented by a single viseme. Coarticulation has not been taken into account, except for "h", "k", "g", "nk", and "ng", whose visemes depend on the preceding and following visemes. Due to the parametrization used and the lack of a tongue in the model, the available viseme collection is limited.
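A minimal sketch of such a letter-to-viseme mapping might read as follows. The viseme labels and the letter groupings are hypothetical; only the merging of the "nk" and "ng" digraphs into one viseme follows the text above.

    # Hypothetical letter-to-viseme table; viseme identifiers are illustrative.
    VISEME_OF = {
        "a": "V_A", "e": "V_E", "i": "V_I", "o": "V_O", "u": "V_U",
        "m": "V_M", "p": "V_M", "b": "V_M",   # bilabials could share one viseme
        "nk": "V_NG", "ng": "V_NG",           # the two digraphs map to one viseme
        # remaining letters omitted
    }

    def text_to_visemes(text):
        """Map written text to a viseme sequence, merging 'nk'/'ng' digraphs.

        In the real system the visemes for 'h', 'k', and 'g' additionally
        depend on the neighboring visemes; that context rule is omitted here.
        """
        visemes, i = [], 0
        text = text.lower()
        while i < len(text):
            digraph = text[i:i + 2]
            if digraph in ("nk", "ng"):
                visemes.append(VISEME_OF[digraph])
                i += 2
            else:
                visemes.append(VISEME_OF.get(text[i], "V_NEUTRAL"))
                i += 1
        return visemes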

The input to the synthesizer is given as text. The visual speech is animated by linear interpolation with 1-5 intermediate steps between the parameter values corresponding to the initial and final visemes. The synthesizer reproduces unlimited Finnish text. A 3D impression can be simulated with the aid of stereoscopic glasses. The graphics are implemented using libraries available on multiple hardware platforms; currently our model runs in SGI IRIX and PC (Windows) environments.
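The interpolation step itself can be written down directly. The sketch below assumes the viseme parameters are stored as equal-length lists of floats and that the 1-5 intermediate frames exclude the endpoint visemes themselves; the text does not specify the latter, so it is an assumption.

    def interpolate_visemes(start, end, n_steps):
        """Linearly interpolate between two viseme parameter vectors.

        start, end : equal-length lists of parameter values
        n_steps    : number of intermediate frames (1-5 in our system)
        """
        frames = []
        for k in range(1, n_steps + 1):
            t = k / (n_steps + 1)   # fraction of the way from start to end
            frames.append([(1 - t) * s + t * e for s, e in zip(start, end)])
        return frames

    # Example: three intermediate frames between a closed and an open mouth.
    closed = [0.0, 0.2]   # e.g. (jaw_rotation, lip_width); values hypothetical
    opened = [1.0, 0.5]
    for frame in interpolate_visemes(closed, opened, 3):
        print(frame)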

The audio-visual speech synthesizer can be used to prepare well-controlled stimuli for speech research and cognitive neuroscience. In addition, various application areas will benefit from high-quality audio-visual speech synthesis, including telecommunications, human-computer interfaces, and speech therapy.

  1. Parke, F.I. Parameterized models for facial animation. IEEE Computer Graphics & Applications 2(9): 61-68, 1982

