Demo video
https://github.com/hojinYang/whispertalk/assets/31153283/e8377ffc-e90e-49d2-9d8b-005003540162
The small version of the model is available on the Hugging Face model hub under the AFL-3.0 license. We are currently training a larger and more sophisticated model. If you have any questions or would like early access, please contact us at hojin.yang7@gmail.com.
Project Description:
WhisperTalk is an audio-to-text model based on the transformer architecture. It takes audio input, optionally accompanied by the preceding conversation or a prompt, and predicts the next utterance. The model offers several notable features:
1️⃣ Comprehensive Understanding: WhisperTalk understands the content conveyed in audio and generates appropriate responses accordingly.
2️⃣ Transition Detection: The model determines whether the speech input continues the ongoing speech or marks the end of an utterance.
3️⃣ Basic Prompt Support: In addition to predicting the next utterance, WhisperTalk can interpret basic audio features, such as sentiment and gender.
Motivation:
In recent years, language models (LMs) have advanced significantly, leading to a wide range of LM-based applications. However, integrating voice input seamlessly into these models remains a challenge, which limits how much voice-based interaction can improve the user experience. A key obstacle is the reliance on transcription to convert voice input into text, which discards essential vocal characteristics such as tone, mood, nuance, and speaker transitions.
To address this challenge, our project proposes a novel approach to incorporating voice input into LMs. The model takes both voice and text-based prompts as input and uses them to predict the next sequence of words. With voice input integrated directly, LM-based services can better capture the nuances of user communication, including emotion, the end of speech, and other vocal details. This, in turn, enables more natural and seamless interactions between humans and LMs.
The potential impact of this project extends beyond enhancing user experiences. Just as text-based LMs have tackled various text-related tasks, such as improving writing and summarization, integrating voice input opens avenues for addressing challenges specific to vocal communication.
Here are examples of the text output generated by WhisperTalk for different audio inputs:
Vocal characteristics
https://github.com/hojinYang/whispertalk/assets/31153283/81dd4b25-e1b7-4ce0-bfd6-56d8b7464844
a woman is expressing her happiness through laughter.
https://github.com/hojinYang/whispertalk/assets/31153283/df65da71-7114-4c45-b265-62ade394845b
the sound is that of a man who is furious.
https://github.com/hojinYang/whispertalk/assets/31153283/0047a70c-89ed-4459-aa95-e79290094d9f
a sad cry is coming from the baby.
https://github.com/hojinYang/whispertalk/assets/31153283/aa07863d-7373-4828-894b-63bbbd65b596
a male laughing with joy.
https://github.com/hojinYang/whispertalk/assets/31153283/3fee6ede-7dbd-4b10-847a-b8a77e3abd77
a male voice with a neutral emotional state.
Caption
https://github.com/hojinYang/whispertalk/assets/31153283/aa3373c7-dbe2-4094-8bdc-7fd7fe74478b
the sound in the audio is reminiscent of grunge rock music.
https://github.com/hojinYang/whispertalk/assets/31153283/919d2cdb-cd7e-4267-8a56-399d7171244c
the audio contains a mix of speech and music.
https://github.com/hojinYang/whispertalk/assets/31153283/f23c6df2-a03c-43c7-9dd6-dd05a9c6f3e9
the sound of clapping is audible.
https://github.com/hojinYang/whispertalk/assets/31153283/c77f4430-8fc5-4a04-8bd9-78ed0c10d9a9
the sound of an electric toothbrush buzzing can be heard.
https://github.com/hojinYang/whispertalk/assets/31153283/6c02d48f-8295-48e5-a9d1-a94abd3a0952
the sound of boiling water can be heard.
https://github.com/hojinYang/whispertalk/assets/31153283/7a791d2e-d2d2-4948-a75b-01826f23f26d
a vehicle starts and revs up, then stops.
Next token prediction
https://github.com/hojinYang/whispertalk/assets/31153283/f8af7762-5c12-4de2-926b-167ce055c694
(turn) i was. i was a breakfast club fan.
https://github.com/hojinYang/whispertalk/assets/31153283/5a5d21cc-a166-43d7-a01e-f164288e950e
real business. and that's what we're doing. we're bringing a real business to the table.
https://github.com/hojinYang/whispertalk/assets/31153283/057df693-806e-44e9-917d-ea1067227ba4
(turn)i'm sorry, but i don't have any information on how to make a hamburger. can you provide more context or details?
More examples: click here
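Some of the next token prediction outputs above begin with a `(turn)` marker, while others do not. A downstream application might want to separate that marker from the generated text. The snippet below is a minimal sketch of how this could be done. It assumes that a leading `(turn)` signals the end of the input utterance and the start of a response, which this README does not state explicitly. The names `TURN_MARKER`, `ParsedOutput`, and `split_turn_marker` are hypothetical and are not part of WhisperTalk.

```python
from typing import NamedTuple

# Assumed marker, taken from the example outputs in this README.
TURN_MARKER = "(turn)"


class ParsedOutput(NamedTuple):
    is_new_turn: bool  # True if the output started with the marker
    text: str          # generated text with the marker removed


def split_turn_marker(output: str) -> ParsedOutput:
    """Separate a leading "(turn)" marker from the generated text."""
    stripped = output.lstrip()
    if stripped.startswith(TURN_MARKER):
        # Drop the marker and any whitespace that follows it.
        return ParsedOutput(True, stripped[len(TURN_MARKER):].lstrip())
    return ParsedOutput(False, stripped)


if __name__ == "__main__":
    examples = [
        "(turn) i was. i was a breakfast club fan.",
        "real business. and that's what we're doing.",
        "(turn)i'm sorry, but i don't have any information on how to make a hamburger.",
    ]
    for raw in examples:
        parsed = split_turn_marker(raw)
        label = "new turn" if parsed.is_new_turn else "continuation"
        print(f"[{label}] {parsed.text}")
```

The helper accepts the marker with or without a following space, since both forms appear in the examples.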