AI-Voice
Description
Inspired by Vercel's Language Model Specification, this is a proposal for introducing a Speech Model Specification to streamline the integration of various speech providers into our platform. This specification aims to provide a standardized interface for interacting with different speech models, eliminating the complexity of dealing with unique APIs and reducing the risk of vendor lock-in.
Problem Statement
Currently, there are numerous speech providers available, each with its own distinct method for interfacing with their models. This lack of standardization complicates the process of switching providers and increases the likelihood of vendor lock-in. Developers are required to learn and implement different APIs for each provider, leading to increased development time and maintenance overhead.
The open-source community has created the following providers:
- OpenAI Provider (@bishwenduk029-ai-voice/openai)
- ElevenLabs Provider (@bishwenduk029-ai-voice/elevenlabs)
- Deepgram Provider (@bishwenduk029-ai-voice/deepgram)
- PlayHt Provider (@bishwenduk029-ai-voice/playht)
Features
- React Hook `useVoiceChat` for easy transcription and streaming voice chat on the client side.
- Consistent API for streaming speech responses from various AI speech providers.
- Works well with `streamText` from the Vercel AI SDK.
Installation
pnpm install @callmyai/ai
Usage
Client Side React Hook
🎙️ Real-Time Speech Transcription React Hook
Seamlessly integrate real-time speech transcription into your React applications with this powerful and efficient hook! 🚀
✨ Hook Capabilities
- 🎤 Detect the end of human speech or silence using the robust @ricky0123/vad-react library
- ⏱️ Intelligently debounce speech input, ensuring continuous recording and transcription as long as the user speaks within a configurable time frame (e.g., 500ms)
- 🗣️ Gracefully handle speech interruptions, allowing users to pause and resume speaking naturally
- 🌐 Efficiently trigger REST calls for transcription, optimizing performance by waiting for the user to pause before sending requests
- 🔌 Easy to integrate into your existing React projects, with a simple and intuitive API
...
import { useVoiceChat } from '@callmyai/ai/ui'
...
export default function Chat({ id, initialMessages, className }: ChatProps) {
  const { speaking, listening, thinking, initialized, messages, setMessages } =
    useVoiceChat({
      api: '/api/chat/voice',
      initialMessages,
      transcribeAPI: '/api/transcribe',
      body: {
        id
      },
      speakerPause: 500,
      onSpeechCompletion: async () => {
        if (id) {
          const chat = await getChat(id)
          setMessages(chat?.messages || [])
        }
      }
    })
  return (
    <>
      <div className={cn('pb-[200px] pt-4 md:pt-10', className)}>
        {messages.length ? (
          <>
            <ChatList messages={messages} />
            <ChatScrollAnchor trackVisibility={initialized} />
          </>
        ) : (
          <EmptyScreen />
        )}
      </div>
      <ChatPanel
        id={id}
        initialized={initialized}
        speaking={speaking}
        listening={listening}
        thinking={thinking}
        messages={messages}
      />
    </>
  )
}
Hook up Server APIs
/api/chat/voice
import 'server-only'
...
import { streamText } from 'ai'
import { ollama, createOllama } from 'ollama-ai-provider'
import { openaiSpeech, playhtSpeech, streamSpeech } from '@callmyai/ai/server'
...
export const runtime = 'edge'
const model = ollama('llama3:latest')
export async function POST(req: Request) {
  const cookieStore = cookies()
  const supabase = createRouteHandlerClient<Database>({
    cookies: () => cookieStore
  })
  const json = await req.json()
  const { messages } = json
  // const userId = (await auth({ cookieStore }))?.user.id
  // if (!userId) {
  //   return new Response('Unauthorized', {
  //     status: 401
  //   })
  // }
  const systemPrompt = `
    Your role is to act as a friendly human assistant, addressing the user by their preferred name. Your given name is Nova.
  `
  const result = await streamText({
    model,
    messages: [
      {
        role: 'system',
        content: systemPrompt
      },
      ...messages.map((message: { role: any; content: any }) => ({
        role: message.role,
        content: message.content
      }))
    ]
  })
  // OpenAI - env:OPENAI_API_KEY
  const speechModel = openaiSpeech(
    'tts-1', // openai_speech_model
    'nova' // openai_voice_id
  )
  // ElevenLabs - env:ELEVENLABS_API_KEY
  // const speechModel = elevenlabsSpeech(
  //   'eleven_turbo_v2', // elevenlabs_speech_model
  //   'DIBkDE5u33APYlfhjihh' // elevenlabs_voice_id
  // )
  // PlayHt - env:PLAYHT_API_KEY
  // const speechModel = playhtSpeech(
  //   'PlayHT2.0-turbo', // playht_speech_model
  //   '<your-playht-user-id>',
  //   's3://voice-cloning-zero-shot/1afba232-fae0-4b69-9675-7f1aac69349f/delilahsaad/manifest.json' // playht_voice_id
  // )
  // Deepgram - env:DEEPGRAM_API_KEY
  // const speechModel = deepgramSpeech('aura-asteria-en')
  try {
    const speech = await streamSpeech(speechModel)(result.textStream)
    return new Response(speech, {
      headers: { 'Content-Type': 'audio/mpeg' }
    })
  } catch (error) {
    console.error(error)
    return new Response(null, {
      status: 500
    })
  }
}
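/api/transcribe
The client hook above also posts the recorded audio to the endpoint configured as transcribeAPI ('/api/transcribe' in the example). That route is not shipped with this package, so the sketch below is only one way to implement it: it assumes the audio arrives as multipart form data under a field named "audio", assumes the hook expects a JSON response of the shape { text: string }, and delegates the actual transcription to OpenAI Whisper. Check the hook's source for the exact request/response contract before relying on these assumptions.

```ts
// app/api/transcribe/route.ts — a minimal sketch, not part of this package.
// Assumptions (verify against the hook's source): the audio blob is posted as
// multipart form data under the field name "audio", and the hook expects a
// JSON body shaped like { text: string }.
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {
  const formData = await req.formData()
  const audio = formData.get('audio')

  // formData.get returns File | string | null; we need an uploaded file.
  if (!audio || typeof audio === 'string') {
    return new Response('No audio file provided', { status: 400 })
  }

  // Transcribe the recorded audio with OpenAI Whisper.
  const transcription = await openai.audio.transcriptions.create({
    file: audio,
    model: 'whisper-1'
  })

  return Response.json({ text: transcription.text })
}
```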
Limitations
Currently, speech streaming only works for English text streams. Multilingual support is planned for a future release.
Roadmap
- [x] Speech Model Specification
- [ ] Improve the sentence-boundary detection algorithm used for the text-stream-to-sentence-stream conversion (a simplified sketch of the idea is shown below)
- [ ] Add more tests
- [ ] Enhance the specification to also cover WebSocket-based speech providers
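For context, the conversion mentioned in the roadmap takes the token stream coming out of the LLM and turns it into a stream of complete sentences, so each sentence can be handed to the speech provider as soon as it is finished instead of waiting for the whole response. The sketch below only illustrates that idea and is not the package's actual implementation; it uses a naive punctuation regex (., !, ?) as the boundary heuristic, which is exactly the kind of detection the roadmap item aims to improve.

```ts
// Simplified illustration of text-stream → sentence-stream conversion.
// NOT the package's internal implementation; boundary detection here is a
// naive regex on ., ! and ? followed by whitespace.
async function* toSentenceStream(
  textStream: AsyncIterable<string>
): AsyncGenerator<string> {
  let buffer = ''
  for await (const chunk of textStream) {
    buffer += chunk
    // Emit every complete sentence currently sitting in the buffer.
    let match: RegExpMatchArray | null
    while (
      (match = buffer.match(/[.!?]+\s/)) !== null &&
      match.index !== undefined
    ) {
      const end = match.index + match[0].length
      yield buffer.slice(0, end).trim()
      buffer = buffer.slice(end)
    }
  }
  // Flush whatever is left once the LLM stream ends.
  if (buffer.trim().length > 0) {
    yield buffer.trim()
  }
}
```

A sentence stream produced this way can be fed to a speech model sentence by sentence, which keeps time-to-first-audio low while the rest of the LLM response is still being generated.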