@echogarden/windows-media-tts
v0.1.0
Published
Node.js binding to the Windows Media text-to-speech API (Windows.Media.SpeechSynthesis).
Downloads
71
Maintainers
Readme
Node.js binding to the Windows Media Speech Synthesis API
Uses N-API to bind to the Universal Windows Platform speech-synthesis API. The native speech synthesizer on Windows 10 and newer.
- Speech is returned as a
Uint8Array
, in WAVE format - Will recognize voice packages installed via the Windows Speech settings
- Supports SSML (when option
enableSsml
is set totrue
) - Provides markers and metadata tracks
- Addon binaries are pre-bundled. Doesn't require any install-time postprocessing
- Supports Windows 10 and newer, on both x64 and arm64
Installation
npm install @echogarden/windows-media-tts
Usage examples
List voices:
import { getVoiceList, synthesize } from '@echogarden/window-media-tts'
const voices = getVoiceList()
console.log(voices)
Prints:
[
{
id: 'HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Speech_OneCore\\Voices\\Tokens\\MSTTS_V110_enUS_DavidM',
displayName: 'Microsoft David',
description: 'Microsoft David - English (United States)',
language: 'en-US',
gender: 'male'
},
{
id: 'HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Speech_OneCore\\Voices\\Tokens\\MSTTS_V110_enUS_ZiraM',
displayName: 'Microsoft Zira',
description: 'Microsoft Zira - English (United States)',
language: 'en-US',
gender: 'female'
},
//...
]
Synthesize text
import { writeFile } from 'node:fs/promises'
import { synthesize } from '@echogarden/window-media-tts'
const text = `
Hello World! How are you doing today?
`
const { audioData } = synthesize(text, {
voiceName: 'Microsoft Zira',
speakingRate: 0.9,
audioPitch: 1.0,
enableSsml: false
})
await writeFile(`hello-world.wav`, audioData)
The voiceName
property will match any of id
, displayName
or description
properties of the target voice.
Synthesize SSML
const ssml = `
Hello World! <mark name="a" />How are you <mark name="b" />doing today?
`
const { audioData, markers } = synthesize(ssml, {
voiceName: 'Microsoft David',
speakingRate: 0.8,
audioPitch: 1.0,
enableSsml: true
})
// ...
console.log(markers)
Markers show as:
[ { name: 'a', time: 1.6964375 }, { name: 'b', time: 2.2614375 } ]
Notes:
- The wrapping SSML
<speak>
tag is automatically added, and includes the language name for the selected voice (or for the default one, if not selected). That is crucial for the synthesis engine to successfully accept the input - The Windows synthesis engine is very strict about the validity of the SSML markup. If the SSML has errors, like tags that aren't properly closed, it will fail without a helpful error message
Timed metadata
The Windows synthesis API returns the start time of each sentence and word, in a TimedMetadataTracks
object.
This object is converted to a JavaScript object and returned in the timedMetadataTracks
property of the returned object.
const { audioData, timedMetadataTracks } = synthesize(...)
Note: it seems to give correct timing. However, the cue identifiers seem to be empty, cue durations are 0, and it contains no text offset information. This makes it harder to reliably match the cues to the words and sentences in the text, in practice.
Building the N-API addons
The library is bundled with pre-built addons, so recompilation shouldn't be needed.
If you still want to compile yourself, for a modification or a fork:
- Install Visual Studio 2022 build tools
- In the
addons
directory, runnpm install
, which would install the necessary build tools. Then runnpm run build-x64
- To cross-compile for arm64 go to
Visual Studio Installer -> Individual components
, and ensureMSVC v143 - VS 2022 C++ ARM64 build tools (latest)
is checked. Then runnpm run build-arm64
- Resulting binaries should be written to the
addons/bin
directory
License
MIT