
WhisperTalk

Demo video

https://github.com/hojinYang/whispertalk/assets/31153283/e8377ffc-e90e-49d2-9d8b-005003540162

We have uploaded the small version of the model to the Hugging Face model hub under the AFL-3.0 license, so please check it out. We are currently training a larger and more sophisticated model. If you have any questions or would like early access, please contact us at hojin.yang7@gmail.com.

Project Description:

(Figure: model architecture)

WhisperTalk is an audio-to-text model based on the transformer architecture. It takes audio input, optionally together with the preceding conversation or a prompt, and predicts the next utterance. The model offers several notable features:

1️⃣ Comprehensive Understanding: WhisperTalk understands the content conveyed in the audio and generates appropriate responses accordingly.

2️⃣ Transition Detection: the model can determine whether the speech input is a continuation of the ongoing utterance or the end of a turn.

3️⃣ Basic Prompt Support: in addition to predicting the next utterance, WhisperTalk can interpret basic audio attributes such as sentiment and speaker gender.
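The interface implied by these features can be sketched as follows. This is a toy illustration, not the released API: the function name `build_model_input`, the special tokens, and the feature format are all hypothetical. It shows the general idea of combining raw audio features with an optional conversation history and prompt into one input sequence, so that vocal information is not lost to transcription.

```python
# Toy sketch of assembling a multimodal input for next-utterance prediction.
# All names and token markers here are hypothetical, for illustration only.

AUDIO_START = "<|audio|>"   # hypothetical marker opening the audio span
AUDIO_END = "<|/audio|>"    # hypothetical marker closing the audio span

def build_model_input(audio_frames, history=None, prompt=None):
    """Combine audio features with an optional conversation history and prompt.

    audio_frames: list of per-frame feature vectors (e.g. log-mel frames)
    history: list of previous utterance strings, oldest first
    prompt: optional instruction, e.g. "Describe the speaker's emotion."
    """
    parts = []
    if prompt:
        parts.append(prompt)
    if history:
        parts.extend(history)
    # The audio enters as a delimited span of continuous features rather than
    # transcribed text, so tone, mood, and other vocal cues are preserved.
    parts.append(AUDIO_START)
    parts.append(f"[{len(audio_frames)} audio frames]")
    parts.append(AUDIO_END)
    return "\n".join(parts)

example = build_model_input(
    audio_frames=[[0.1, 0.2], [0.3, 0.4]],
    history=["speaker A: how are you?"],
    prompt="Predict the next utterance.",
)
print(example)
```

In a real pipeline the bracketed placeholder would be a span of continuous audio embeddings fed to the transformer alongside the text tokens.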

Motivation:

In recent years, language models (LMs) have achieved significant advancements, leading to the development of various LM-based applications. However, integrating voice input seamlessly into these models remains a challenge, hindering the potential for improved user experiences through voice-based interactions. One of the key obstacles is the reliance on transcription to convert voice input into text, resulting in the loss of essential vocal characteristics, including tone, mood, nuances, and speaker transitions.

To address this challenge, our project proposes a novel approach to incorporating voice input into LMs. Our proposed model takes both voice and text-based prompts as input, leveraging them to predict the next sequence of words. By integrating voice input, LM-based services can better grasp the nuances of user communication, including emotions, speech termination, and other vocal intricacies. This, in turn, enables more natural and seamless interactions between humans and LMs.
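One concrete way a voice-based service could exploit speech-termination detection is to branch on a turn-end marker in the model's prediction, like the `(turn)` prefix visible in the next-token-prediction examples later in this README. The sketch below is hypothetical and not part of this repository; it only illustrates the downstream decision logic.

```python
# Hypothetical downstream logic: decide whether to respond now or keep
# listening, based on whether the predicted continuation begins with a
# turn-end marker such as "(turn)".

TURN_MARK = "(turn)"

def handle_prediction(prediction: str) -> str:
    """Return an action for a voice assistant given a model prediction."""
    if prediction.startswith(TURN_MARK):
        # The user's utterance is complete; the remainder is the reply to speak.
        reply = prediction[len(TURN_MARK):].strip()
        return f"RESPOND: {reply}"
    # No turn marker: the audio is likely mid-utterance, so wait for more input.
    return "WAIT: user is still speaking"

print(handle_prediction("(turn) i was. i was a breakfast club fan."))
print(handle_prediction("real business. and that's what we're doing."))
```

This keeps the termination decision inside the model rather than relying on silence-based endpointing, which is exactly the kind of vocal nuance lost when audio is first transcribed to plain text.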

The potential impact of this project extends beyond enhancing user experiences. Similar to text-based LMs that have tackled various text-related tasks, such as improving writing and summarization, integrating voice input opens avenues for addressing numerous challenges specific to vocal communication.

Examples

Here are examples of the text output generated by WhisperTalk for different audio inputs.

vocal characteristics

https://github.com/hojinYang/whispertalk/assets/31153283/81dd4b25-e1b7-4ce0-bfd6-56d8b7464844

a woman is expressing her happiness through laughter.

https://github.com/hojinYang/whispertalk/assets/31153283/df65da71-7114-4c45-b265-62ade394845b

the sound is that of a man who is furious.

https://github.com/hojinYang/whispertalk/assets/31153283/0047a70c-89ed-4459-aa95-e79290094d9f

a sad cry is coming from the baby.

https://github.com/hojinYang/whispertalk/assets/31153283/aa07863d-7373-4828-894b-63bbbd65b596

a male laughing with joy.

https://github.com/hojinYang/whispertalk/assets/31153283/3fee6ede-7dbd-4b10-847a-b8a77e3abd77

a male voice with a neutral emotional state.

caption

https://github.com/hojinYang/whispertalk/assets/31153283/aa3373c7-dbe2-4094-8bdc-7fd7fe74478b

the sound in the audio is reminiscent of grunge rock music.

https://github.com/hojinYang/whispertalk/assets/31153283/919d2cdb-cd7e-4267-8a56-399d7171244c

the audio contains a mix of speech and music.

https://github.com/hojinYang/whispertalk/assets/31153283/f23c6df2-a03c-43c7-9dd6-dd05a9c6f3e9

the sound of clapping is audible.

https://github.com/hojinYang/whispertalk/assets/31153283/c77f4430-8fc5-4a04-8bd9-78ed0c10d9a9

the sound of an electric toothbrush buzzing can be heard.

https://github.com/hojinYang/whispertalk/assets/31153283/6c02d48f-8295-48e5-a9d1-a94abd3a0952

the sound of boiling water can be heard.

https://github.com/hojinYang/whispertalk/assets/31153283/7a791d2e-d2d2-4948-a75b-01826f23f26d

a vehicle starts and revs up, then stops.

next token prediction

https://github.com/hojinYang/whispertalk/assets/31153283/f8af7762-5c12-4de2-926b-167ce055c694

(turn) i was. i was a breakfast club fan.

https://github.com/hojinYang/whispertalk/assets/31153283/5a5d21cc-a166-43d7-a01e-f164288e950e

real business. and that's what we're doing. we're bringing a real business to the table.

https://github.com/hojinYang/whispertalk/assets/31153283/057df693-806e-44e9-917d-ea1067227ba4

(turn)i'm sorry, but i don't have any information on how to make a hamburger. can you provide more context or details?

More examples: click here