The Doctor Can Understand You Now

Researchers develop a new translation system for clinics, emergency rooms and ambulances.


At medical facilities around the country, care is delayed, complicated and even jeopardized because doctors and patients don't speak the same language. The situation is particularly dire in diverse megacities like Los Angeles and New York.

Now, USC computer scientists, communication specialists and health professionals hope to create a cheap, robust and effective speech-to-speech (S2S) translation system for clinics, emergency rooms and even ambulances.

The initial SpeechLinks system will translate between English and Spanish. Professor Shrikanth Narayanan, who directs the Signal Analysis and Interpretation Laboratory at the USC Viterbi School of Engineering, hopes to test and deliver a working prototype within the 4-year window of a recently awarded $2.2 million NSF grant for "An Integrated Approach to Creating Context Enriched Speech Translation Systems."

Narayanan, who holds appointments in the USC departments of electrical engineering, computer science, linguistics and psychology, will collaborate on the project with fellow engineering faculty member Panayiotis Georgiou, Professor Margaret McLaughlin of the Annenberg School for Communication, and researchers and clinicians from the Keck School of Medicine at USC.

The project will also include investigators from two corporations, BBN and AT&T, who will not only collaborate on the research but serve as mentors to the students working on the project.

The detailed prospectus for the effort begins by explaining the need: "While large medical facilities and hospitals in urban centers such as Los Angeles tend to have dedicated professional language interpreters on their staff (a plan which still suffers from cost and scalability issues), multitudes of smaller clinics have to rely on other ad hoc measures including family members, volunteers or commercial telephone translation services. Unfortunately, human resources for in-person or phone-based interpretation are typically not easily available, tend to be financially prohibitive or raise privacy issues (such as using family members or children as translators)."

Filling these needs, Narayanan says, will require a system that can perceive and interpret not just words but a wide range of human communication, an improvement on current, limited "pipeline" translation technology. "We want to let people communicate," he says. "We need to go beyond literal translation" (which is heavily based on translating written texts rather than spoken language) "to rich expressions in speech and nonverbal cues. We want to enhance human communication capabilities."

The additional cues to be analyzed and incorporated into the translation mix include, according to the plan:

  • Prosodic information: Spoken language uses word prominence, emphasis and contrast, and intonational cues (is it a statement or a question?). Speech also divides subjects and thoughts in ways that aren't always clear in a word-by-word literal translation. Prosodic cues will serve as an important information source for robust intelligence in the proposed work.
  • Discourse information: Capturing contextual cues of dialog becomes especially important since the goal is to enable interpersonal interactions, in contrast to applications where the end result is just translated text. The group plans to model and track the cross-lingual dialog flow to improve the information transfer between the interlocutors.
  • User state information: User state information such as affect and attitude is also a critical part of interpersonal communication. The group plans to investigate markers of valence (positive/negative) and activation (strong/weak) conveyed in spoken language. Specifically, the effort plans to capture this meta-information and transfer these source utterance characteristics and speaker attitudes to the target language through augmented expressive synthesis schemes.
  • Other elements of the mix include embedding these sets of analyzed cues into speech that is synthesized from inputs keyboarded into the interface as a response or a question.