You, dear reader, are at a cocktail party. The party is loud. You notice a friend, and yell at them across the room. Your friend comes over, the two of you begin to talk, and you understand your friend’s speech perfectly. Over the hubbub of the party, you hear that your friend dislikes the hors d’oeuvres and is ready to leave.
But how, exactly, were you able to understand your friend? What information did you use to decipher the complex acoustic frequencies reaching your brain?
To answer this question, your correspondent spoke with Carleton Professor of Psychology Julia Strand. She studies speech, and quite enjoys cocktail parties.
“A cocktail party is actually a great place to talk about the kind of research we do,” said Strand. “You’re understanding my speech, and you are understanding what I’m saying, even though there’s all this other noise that you have to ignore, and just focus on what I’m saying.”
Speech seems simple; even toddlers speak with marginal lucidity. But the research in Strand’s lab shows that the task is rather tricky. If you disagree, ask Siri for the weather outside using non-Standard English while there is a light breeze. Siri fumbles. Yet we humans understand the speech of others all day long, sometimes through muffled phone signals.
Humans have access to a variety of sensory cues besides the voice itself. When we converse, for instance, we gesture with our hands, and our mouths move visibly as we speak. That last cue is what the Strand group studies: the information conveyed by a speaker’s moving mouth.
“Is it that you’re watching my lips, and you see that, ‘that particular mouth movement is associated with this particular sound, so she’s probably saying that?’” asked Strand. “Or, is it the case that the way that my mouth is moving is just signaling to you that ‘here is an important part in the speech signal,’ because when mouths are big, speech tends to be loud?” Mouth movement is either giving you, the listener, information about the words being spoken or information about which part of an utterance is important. Strand’s lab hopes to find out which.
“In an experiment, typically we record a speaker saying something, and we present speech in headphones with a lot of background noise, and we just ask what they heard,” said Lab Manager Violet Brown. “And, typically, they report out loud, which means that the lovely research assistants go in and listen to the audio files, and code what people said.” The results are still uncertain, but the research assistants’ coding is bringing the group closer to an answer.
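To make that coding step concrete, here is a minimal sketch of how transcribed responses might be scored against the target words once the research assistants have typed them up. The function names, the normalization rule, and the example trials are all hypothetical illustrations, not the Strand lab’s actual pipeline.

```python
# Hypothetical sketch: scoring transcribed responses against target words.
# The helpers and example data are illustrative only, not the lab's real code.

import string

def normalize(text: str) -> str:
    """Lowercase a response and strip punctuation and extra whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())

def score_response(target: str, response: str) -> int:
    """Return 1 if the transcribed response matches the target word, else 0."""
    return int(normalize(target) == normalize(response))

# Example (target, transcribed response) pairs from a noisy listening trial.
trials = [("bridge", "Bridge"), ("fin", "thin"), ("gill", "gill ")]
accuracy = sum(score_response(t, r) for t, r in trials) / len(trials)
print(f"Proportion correct: {accuracy:.2f}")  # -> 0.67
```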
Yet the mouth is not the only clue to deciphering speech at a busy cocktail party. We also form expectations, based on who we’re listening to, about what a speaker will say. A professor, for example, might be expected to use “smart-sounding” language. The words we’re likely to hear also vary with context: in a lecture on shark anatomy, a professor is more likely to use words like “fin” or “gill” than the word “referendum.” Similarly, sentence structure matters for speech recognition. If I say, “I thought about the _____,” you know that the next word will be a noun, thanks to your implicit knowledge of English grammar.
In a 2017 study, Strand’s lab explored how listeners combine auditory input with expectations derived from context. In these experiments, participants heard a phrase over headphones while their gaze was recorded with an eye tracker. A set of images appears on a screen, and the tracker shines infrared light at the participant’s eyes; the reflection off the eyes lets the device calculate where on the screen the participant is looking at each moment. Participants are typically asked to click on one of the images, which reveals what they are predicting the speaker will say.
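As a rough illustration of that last step, the sketch below shows how recorded gaze coordinates might be assigned to the pictures on the screen. The region coordinates, function name, and gaze samples are invented for illustration; in real studies, the eye tracker’s own software typically handles this mapping.

```python
# Hypothetical sketch: mapping gaze samples to on-screen images.
# Screen regions and sample coordinates are made up for illustration.

# Each image occupies a rectangle on the screen: (x_min, y_min, x_max, y_max).
regions = {
    "bridge": (100, 100, 400, 400),
    "bring":  (600, 100, 900, 400),
}

def classify_gaze(x: float, y: float) -> str:
    """Return the name of the image the gaze point falls inside, or 'elsewhere'."""
    for name, (x0, y0, x1, y1) in regions.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return "elsewhere"

# Gaze samples recorded while the participant hears the spoken word unfold.
samples = [(250, 250), (260, 240), (700, 300), (720, 310), (50, 500)]
looks = [classify_gaze(x, y) for x, y in samples]
print(looks)  # ['bridge', 'bridge', 'bring', 'bring', 'elsewhere']
```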
“We either gave them a grammatical context, like ‘He thought about the ____,’ or no context, like, ‘The word is ______,’” said Strand. “That word could be any word in English. And, if you don’t give people a context clue, one of the findings in the research is that when you hear a particular word, you also think about other words that sound like that.”
Strand offered an example. “If I say, when there’s no context, that ‘The word is bridge,’ and I have a picture on the screen of a bridge and a picture of someone ‘bringing’ something along, when I say, ‘The word is bri-,’ people will start to look at both the bridge and the picture of someone bringing something.
“But, if I say, ‘He thought about the bri-,’ then people will look at bridge, but they tend not to look as much at bring, because they’re anticipating that that word, which is a verb, isn’t going to complete the sentence,” said Strand.
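The pattern Strand describes can be summarized by tallying what proportion of gaze samples land on each picture in each condition. The sketch below does that tally; the look sequences are toy values invented only to mirror the pattern she describes and are not data from the study.

```python
# Hypothetical sketch: summarizing looks to each picture by condition.
# The look sequences are toy values, not results from the 2017 study.

from collections import Counter

def look_proportions(looks: list[str]) -> dict[str, float]:
    """Proportion of gaze samples directed at each on-screen region."""
    counts = Counter(looks)
    total = len(looks)
    return {name: round(count / total, 2) for name, count in counts.items()}

# Looks while hearing "bri-" with no sentence context vs. with a grammatical context.
no_context = ["bridge", "bring", "bridge", "bring", "elsewhere", "bridge"]
with_context = ["bridge", "bridge", "bridge", "elsewhere", "bridge", "bridge"]

print("No context:  ", look_proportions(no_context))
print("With context:", look_proportions(with_context))
```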
Neat, yes? The Strand lab relies on participants, mainly Carleton students, to help determine how humans decipher speech. The study published last year included 132 participants, and a team of nine research assistants collects data from student participants all day long.
In the lab, participants sit in a sound booth wearing headphones, looking at a computer screen, hearing speech, responding to visual stimuli, or maybe having their eye movements tracked. For more information on participating or about Strand’s research, visit her page on the psychology department site.