# VoxSDK
VoxSDK is a comprehensive toolkit designed to facilitate easy integration of AI-driven speech recognition and synthesis into your applications. With a focus on simplicity and efficiency, VoxSDK offers a set of React hooks and utilities to seamlessly connect with AI services for voice interactions.
## Features

- `VoxProvider`: A context provider to encapsulate the SDK's functionalities and make them accessible throughout your React application.
- `useListen`: A hook to capture and transcribe user speech in real-time.
- `useSpeak`: A hook for text-to-speech functionality, converting text responses into natural-sounding speech.
## Installation

Install VoxSDK using npm:

```bash
npm install vox-sdk
```

Or using yarn:

```bash
yarn add vox-sdk
```

Also install tslib.

Using npm:

```bash
npm install tslib --save-dev
```

Using yarn:

```bash
yarn add tslib -D
```
## Setup

To set up VoxSDK, you will need to generate a `speech_key` and `region` from the Azure Portal.

Visit the Microsoft Azure Portal and create a Speech resource to obtain your `speech_key` and `region`.

Learn more about text-to-speech and speech-to-text with Microsoft Cognitive Services Speech.

You will need to set up both the server and the client.
### Server Setup

- On your server, create a `GET` endpoint at `/token`.
- Using the `speech_key` and `region`, generate an authorization token from Microsoft's APIs.
- Set these values in the `.env` file as `SPEECH_KEY` and `SPEECH_REGION`.
- The `/token` endpoint should return the following response: `{ token: string, region: string }`.
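For reference, a minimal `.env` file for the server could look like the following (all values are placeholders; use the key and region from your own Azure Speech resource):

```env
# Placeholder values only – replace with your own Azure Speech credentials
SPEECH_KEY=your-azure-speech-key
SPEECH_REGION=eastus
# Used by the sample server below for CORS
FRONTEND_URL=http://localhost:3000
```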
Here's a sample implementation of the `/token` endpoint:
```js
import express from "express";
import cors from "cors";
import "dotenv/config";
import axios from "axios";

const app = express();

app.use(
  cors({
    origin: process.env.FRONTEND_URL,
  })
);

// Cached authorization token, refreshed on demand
let token = null;
const speechKey = process.env.SPEECH_KEY;
const speechRegion = process.env.SPEECH_REGION;

const getToken = async () => {
  try {
    // Axios request config; the subscription key authenticates against the STS endpoint
    const config = {
      headers: {
        "Ocp-Apim-Subscription-Key": speechKey,
        "Content-Type": "application/x-www-form-urlencoded",
      },
    };
    const tokenResponse = await axios.post(`https://${speechRegion}.api.cognitive.microsoft.com/sts/v1.0/issueToken`, null, config);
    token = tokenResponse.data;
  } catch (error) {
    console.error("Error while getting token:", error);
  }
};

app.get("/token", async (req, res) => {
  try {
    res.setHeader("Content-Type", "application/json");
    // When the client asks to refresh the token
    const refreshTheToken = req.query?.refresh;
    if (!token || refreshTheToken) {
      await getToken();
    }
    res.send({
      token: token,
      region: speechRegion,
    });
  } catch (error) {
    console.error("Error while handling /token request:", error);
    res.status(500).send({ error: "An error occurred while processing your request." });
  }
});

app.listen(8080, () => console.log("Server running on port 8080"));
```
- For detailed documentation, you can refer to the sample app.
### Client Setup
Wrap your application with `VoxProvider` to make the SDK available throughout your app:

```jsx
import { VoxProvider } from "vox-sdk";

function App() {
  return <VoxProvider>{/* Your app components go here */}</VoxProvider>;
}

export default App;
```
`VoxProvider` expects a `config` object, which includes:

- `baseUrl`: The URL of your backend, e.g. `https://exampleapp.com`. Ensure that the `/token` route serves the token and region.
- `onAuthRefresh`: A callback function that is invoked when an authentication error occurs or the token expires.
- `headersForBaseUrl`: Optional headers to pass along with requests to the `baseUrl`.
Here's the implementation of the above steps:
```jsx
<VoxProvider
  config={{
    baseUrl: "https://exampleapp.com",
    onAuthRefresh: async () => {
      const { data } = await axios.get("https://exampleapp.com/token?refresh=true");
      return { token: data.token, region: data.region };
    },
    headersForBaseUrl: {
      // ... Bearer authentication token or other config
    },
  }}
>
  <App />
</VoxProvider>
```
The `onAuthRefresh` callback will refresh the token and return it along with the region. For more details, you can refer to the sample app implementation.
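For reference, the shape of the `config` object described above can be summarized roughly as follows. This is a sketch based on the properties shown in this README, not the SDK's published type definitions:

```ts
// Rough sketch of the VoxProvider config shape implied above;
// the actual exported types of vox-sdk may differ.
interface VoxConfigSketch {
  // Base URL of your backend; its /token route must serve { token, region }
  baseUrl: string;
  // Called on auth errors or token expiry; must resolve to a fresh token and region
  onAuthRefresh: () => Promise<{ token: string; region: string }>;
  // Optional headers sent with requests to baseUrl (e.g. a Bearer token)
  headersForBaseUrl?: Record<string, string>;
}
```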
## Usage

### Using useListen Hook

After setting up the server and `VoxProvider`, we are ready to use `useListen` and `useSpeak`.

Integrate speech-to-text functionality in your components:
```jsx
import { useListen } from "vox-sdk";
import React from "react";

const SpeechToText = () => {
  const { answers, loading, startSpeechRecognition, stopSpeechRecognition } = useListen({
    onEndOfSpeech: () => {
      console.log(answers);
    },
    automatedEnd: true,
    delay: 1000,
  });

  return (
    <>
      <button disabled={loading} onClick={startSpeechRecognition}>
        Start Listening
      </button>
      <button onClick={stopSpeechRecognition}>Stop Listening</button>
    </>
  );
};

export default SpeechToText;
```
The `useListen` hook expects the following parameters:

- `automatedEnd`:
  - Expects a boolean value; the default is `true`.
  - When the user finishes speaking, the hook will automatically start the speech-to-text conversion.
  - To listen continuously until the user clicks `stopSpeechRecognition`, pass `false` (see the sketch after this list).
- `delay`:
  - Expects a value in milliseconds.
  - This is the debounce duration for listening to the user.
  - The default is 2000 ms.
- `onEndOfSpeech`:
  - Expects a callback function that is invoked when speech ends.
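As a quick illustration of `automatedEnd: false`, here is a minimal sketch of a continuous-listening component. It only uses the options and return values documented in this README; the component itself is illustrative:

```jsx
import { useListen } from "vox-sdk";
import React from "react";

// Minimal sketch: keep listening until the user explicitly stops.
const ContinuousListening = () => {
  const { answers, loading, startSpeechRecognition, stopSpeechRecognition } = useListen({
    // Do not stop automatically when the user pauses
    automatedEnd: false,
    onEndOfSpeech: () => {
      console.log("Final transcript:", answers);
    },
  });

  return (
    <>
      <button disabled={loading} onClick={startSpeechRecognition}>
        Start
      </button>
      <button onClick={stopSpeechRecognition}>Stop</button>
    </>
  );
};

export default ContinuousListening;
```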
The `useListen` hook returns:

- `startSpeechRecognition`: Function to start speech recognition.
- `stopSpeechRecognition`: Function to stop speech recognition.
- `answers`: An array of strings containing all the transcribed text.
- `answer`: The last transcribed text.
- `recognizerRef`: A ref to the `microsoft-cognitiveservices-speech-sdk` recognizer instance.
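To show how these return values might be consumed, here is a small sketch that renders the live transcript using `answer` and `answers`. The JSX structure is illustrative, not part of the SDK:

```jsx
import { useListen } from "vox-sdk";
import React from "react";

// Illustrative transcript view built on useListen's documented return values.
const TranscriptView = () => {
  const { answer, answers, startSpeechRecognition, stopSpeechRecognition } = useListen({
    automatedEnd: true,
    delay: 2000,
    onEndOfSpeech: () => {},
  });

  return (
    <>
      <button onClick={startSpeechRecognition}>Start</button>
      <button onClick={stopSpeechRecognition}>Stop</button>
      {/* Latest utterance */}
      <p>Current: {answer}</p>
      {/* Full history of transcribed utterances */}
      <ul>
        {answers.map((text, i) => (
          <li key={i}>{text}</li>
        ))}
      </ul>
    </>
  );
};

export default TranscriptView;
```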
### Using useSpeak Hook

Implement text-to-speech in your application:
```jsx
import React, { useState } from "react";
import { useSpeak, SpeechVoices } from "vox-sdk";

const TextToSpeech = () => {
  const [text, setText] = useState("");
  const { interruptSpeech, speak, isSpeaking } = useSpeak({
    onEnd: () => {
      console.log("Speech ended");
    },
    shouldCallOnEnd: true,
    throttleDelay: 1000,
    voice: SpeechVoices.enUSAIGenerate1Neural, // AI voices
  });

  return (
    <>
      <h3>Text To Speech</h3>
      <input type="text" onChange={(e) => setText(e.target.value)} value={text} />
      <button
        onClick={() => {
          speak(text);
        }}
        disabled={isSpeaking}
      >
        Start Speaking
      </button>
      <button
        disabled={!isSpeaking}
        onClick={() => {
          interruptSpeech();
        }}
      >
        Stop Speaking
      </button>
    </>
  );
};

export default TextToSpeech;
```
The `useSpeak` hook expects the following parameters:

- `voice`:
  - Expects a string value.
  - Choose your preferred AI voice from Microsoft Azure (an example with a non-English voice follows the parameter list below).
  - Here's the list of available voices:
```ts
export enum SpeechVoices {
  // Arabic
  arAEFatimaNeural = "ar-AE-FatimaNeural",
  arBHAliNeural = "ar-BH-AliNeural",
  arEGSalmaNeural = "ar-EG-SalmaNeural",
  arJOTaimNeural = "ar-JO-TaimNeural",
  arKWFahedNeural = "ar-KW-FahedNeural",
  arLYImanNeural = "ar-LY-ImanNeural",
  arQAAmalNeural = "ar-QA-AmalNeural",
  arSAHamedNeural = "ar-SA-HamedNeural",
  arSYAmanyNeural = "ar-SY-AmanyNeural",
  arTNHediNeural = "ar-TN-HediNeural",
  arYEMaryamNeural = "ar-YE-MaryamNeural",
  // Chinese
  zhCNXiaoxiaoNeural = "zh-CN-XiaoxiaoNeural",
  zhCNYunxiNeural = "zh-CN-YunxiNeural",
  zhCNYunyeNeural = "zh-CN-YunyeNeural",
  zhHKHiuGaaiNeural = "zh-HK-HiuGaaiNeural",
  zhHKHiuMaanNeural = "zh-HK-HiuMaanNeural",
  zhTWHsiaoChenNeural = "zh-TW-HsiaoChenNeural",
  zhTWHsiaoYuNeural = "zh-TW-HsiaoYuNeural",
  // Danish
  daDKChristelNeural = "da-DK-ChristelNeural",
  daDKJeppeNeural = "da-DK-JeppeNeural",
  // Dutch
  nlBEArnaudNeural = "nl-BE-ArnaudNeural",
  nlBEDenaNeural = "nl-BE-DenaNeural",
  nlNLColetteNeural = "nl-NL-ColetteNeural",
  nlNLFennaNeural = "nl-NL-FennaNeural",
  // English (Australia)
  enAUNatashaNeural = "en-AU-NatashaNeural",
  enAUWilliamNeural = "en-AU-WilliamNeural",
  // English (Canada)
  enCAClaraNeural = "en-CA-ClaraNeural",
  enCALiamNeural = "en-CA-LiamNeural",
  // English (India)
  enINNeerjaNeural = "en-IN-NeerjaNeural",
  enINPrabhatNeural = "en-IN-PrabhatNeural",
  // English (UK)
  enGBLibbyNeural = "en-GB-LibbyNeural",
  enGBRyanNeural = "en-GB-RyanNeural",
  // English (US)
  enUSAIGenerate1Neural = "en-US-AIGenerate1Neural",
  enUSAmberNeural = "en-US-AmberNeural",
  enUSAriaNeural = "en-US-AriaNeural",
  enUSAshleyNeural = "en-US-AshleyNeural",
  enUSBrandonNeural = "en-US-BrandonNeural",
  enUSChristopherNeural = "en-US-ChristopherNeural",
  enUSCoraNeural = "en-US-CoraNeural",
  enUSDavisNeural = "en-US-DavisNeural",
  enUSElizabethNeural = "en-US-ElizabethNeural",
  enUSEricNeural = "en-US-EricNeural",
  enUSGuyNeural = "en-US-GuyNeural",
  enUSJacobNeural = "en-US-JacobNeural",
  enUSJasonNeural = "en-US-JasonNeural",
  enUSJennyNeural = "en-US-JennyNeural",
  enUSMichelleNeural = "en-US-MichelleNeural",
  enUSMonicaNeural = "en-US-MonicaNeural",
  enUSSaraNeural = "en-US-SaraNeural",
  enUSTonyNeural = "en-US-TonyNeural",
  // Finnish
  fiFINooraNeural = "fi-FI-NooraNeural",
  fiFISelmaNeural = "fi-FI-SelmaNeural",
  // French (Canada)
  frCADiegoNeural = "fr-CA-DiegoNeural",
  frCAFelixNeural = "fr-CA-FelixNeural",
  frCAJeanNeural = "fr-CA-JeanNeural",
  frCASylvieNeural = "fr-CA-SylvieNeural",
  // French (France)
  frFRDeniseNeural = "fr-FR-DeniseNeural",
  frFREloiseNeural = "fr-FR-EloiseNeural",
  frFRHenriNeural = "fr-FR-HenriNeural",
  // German
  deDEKatjaNeural = "de-DE-KatjaNeural",
  deDEKillianNeural = "de-DE-KillianNeural",
  // Greek
  elGRAthinaNeural = "el-GR-AthinaNeural",
  elGRNestorasNeural = "el-GR-NestorasNeural",
  // Hindi
  hiINMadhurNeural = "hi-IN-MadhurNeural",
  hiINSwaraNeural = "hi-IN-SwaraNeural",
  // Italian
  itITDiegoNeural = "it-IT-DiegoNeural",
  itITElsaNeural = "it-IT-ElsaNeural",
  // Japanese
  jaJPAoiNeural = "ja-JP-AoiNeural",
  jaJPNanamiNeural = "ja-JP-NanamiNeural",
  // Korean
  koKRInJoonNeural = "ko-KR-InJoonNeural",
  koKRSunHiNeural = "ko-KR-SunHiNeural",
  // Portuguese (Brazil)
  ptBRFranciscaNeural = "pt-BR-FranciscaNeural",
  ptBRAntonioNeural = "pt-BR-AntonioNeural",
  // Russian
  ruRUDmitryNeural = "ru-RU-DmitryNeural",
  ruRUSvetlanaNeural = "ru-RU-SvetlanaNeural",
  // Spanish (Mexico)
  esMXJorgeNeural = "es-MX-JorgeNeural",
  esMXDaliaNeural = "es-MX-DaliaNeural",
  // Spanish (Spain)
  esESElviraNeural = "es-ES-ElviraNeural",
  esESAlvaroNeural = "es-ES-AlvaroNeural",
  // Swedish
  svSESofieNeural = "sv-SE-SofieNeural",
  svSEMattiasNeural = "sv-SE-MattiasNeural",
}
```
- `throttleDelay`:
  - Expects a value in milliseconds.
  - This is the throttle duration applied to speech calls.
  - The default is 2000 ms.
- `onEnd`:
  - Expects a callback function that is invoked when the AI speech ends.
  - To invoke this, set `shouldCallOnEnd` to `true`.
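As a quick illustration of the `voice` parameter, here is a minimal sketch that selects a Hindi voice from the `SpeechVoices` enum. The component is illustrative, and it assumes the other `useSpeak` options are optional and fall back to the defaults described above:

```jsx
import React from "react";
import { useSpeak, SpeechVoices } from "vox-sdk";

// Illustrative component: speak a fixed phrase with a Hindi neural voice.
const HindiGreeting = () => {
  const { speak, isSpeaking } = useSpeak({
    voice: SpeechVoices.hiINSwaraNeural,
  });

  return (
    <button disabled={isSpeaking} onClick={() => speak("Namaste!")}>
      Speak in Hindi
    </button>
  );
};

export default HindiGreeting;
```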
The `useSpeak` hook returns:

- `speak`:
  - Function to start text-to-speech synthesis.
  - Expects a string argument to be converted to speech.
- `interruptSpeech`:
  - Function to stop the AI speech.
- `hasAllSentencesBeenSpoken`:
  - Returns a boolean value indicating whether all sentences have been spoken.
- `isSpeaking`:
  - Returns a boolean value indicating whether the AI is speaking.
- `streamedSentences`:
  - Returns an array of strings with all streamed sentences.
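To illustrate the remaining return values, here is a rough sketch that displays `streamedSentences` and signals completion via `hasAllSentencesBeenSpoken`. The display markup is illustrative, and the hook options are assumed to behave as documented above:

```jsx
import React from "react";
import { useSpeak, SpeechVoices } from "vox-sdk";

// Illustrative status view built on useSpeak's documented return values.
const SpeechStatus = () => {
  const { speak, interruptSpeech, isSpeaking, streamedSentences, hasAllSentencesBeenSpoken } = useSpeak({
    voice: SpeechVoices.enUSJennyNeural,
    shouldCallOnEnd: true,
    onEnd: () => console.log("All done"),
  });

  return (
    <>
      <button onClick={() => speak("Hello there. This is VoxSDK speaking.")}>Speak</button>
      <button disabled={!isSpeaking} onClick={interruptSpeech}>
        Stop
      </button>
      {/* Sentences streamed for synthesis so far */}
      <ol>
        {streamedSentences.map((sentence, i) => (
          <li key={i}>{sentence}</li>
        ))}
      </ol>
      <p>{hasAllSentencesBeenSpoken ? "Finished speaking." : "Still speaking or idle."}</p>
    </>
  );
};

export default SpeechStatus;
```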
## Contributing
Contributions are welcome! Please read our Contributing Guide for more information.
## License
This project is licensed under the MIT License.