STT-TTS Service
A lightweight, local speech-to-text (STT) and text-to-speech (TTS) service that runs on any device connected to your OpenClaw server. It is intended for voice-enabled workflows and lets you run speech processing on whichever machine has the resources to spare.
Features
- Speech-to-Text: Transcribe audio using faster-whisper (up to 4x faster than the reference OpenAI Whisper implementation)
- Text-to-Speech: Generate natural speech using piper-tts, with pyttsx3 as a fallback
- 100% Local: No cloud APIs, works offline after initial model download
- Flexible Deployment: Run on any device - Raspberry Pi, laptop, or GPU server
- HTTP API: Simple REST endpoints for easy integration
Quick Start
Installation
# Clone or download this skill
cd stt-tts-service
# Install dependencies
pip install -r requirements.txt
# Start the service
python main.py
Docker Deployment
docker build -t stt-tts-service .
docker run -p 8765:8765 stt-tts-service
API Endpoints
POST /stt - Speech to Text
Transcribe audio files to text.
curl -X POST http://localhost:8765/stt \
  -F "audio=@recording.wav"
Response:
{
  "text": "Hello, this is the transcribed text.",
  "language": "en",
  "duration": 3.5
}
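For callers that prefer Python to curl, the same upload can be sketched with only the standard library. This is a minimal sketch rather than an official client: the helper names `build_multipart` and `transcribe` are hypothetical, and it assumes the endpoint accepts a single multipart field named `audio`, as the curl example does.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost:8765") -> dict:
    """POST an audio file to /stt and return the decoded JSON response."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("audio", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        f"{base_url}/stt",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the service running, `transcribe("recording.wav")["text"]` would return the transcription field shown in the response above.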
POST /tts - Text to Speech
Convert text to audio.
curl -X POST http://localhost:8765/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "default"}' \
  --output speech.wav
Parameters:
- text (required): Text to synthesize
- voice (optional): Voice ID to use
- speed (optional): Speech rate multiplier (0.5-2.0)
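The parameter rules can be checked on the client before a request is sent. The sketch below builds the JSON body and rejects an out-of-range speed; `build_tts_payload` and `synthesize` are hypothetical helper names, and whether the server itself enforces the same range is not confirmed here.

```python
import json
import urllib.request
from typing import Optional


def build_tts_payload(text: str, voice: Optional[str] = None,
                      speed: Optional[float] = None) -> bytes:
    """Return the JSON body for POST /tts, validating the documented fields."""
    if not text.strip():
        raise ValueError("text is required")
    payload = {"text": text}
    if voice is not None:
        payload["voice"] = voice
    if speed is not None:
        if not 0.5 <= speed <= 2.0:
            raise ValueError("speed must be between 0.5 and 2.0")
        payload["speed"] = speed
    return json.dumps(payload).encode("utf-8")


def synthesize(text: str, out_path: str,
               base_url: str = "http://localhost:8765", **options) -> None:
    """POST the payload to /tts and write the returned audio to out_path."""
    request = urllib.request.Request(
        f"{base_url}/tts",
        data=build_tts_payload(text, **options),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

For example, `synthesize("Hello world", "speech.wav", voice="default", speed=1.25)` mirrors the curl call above with a faster speech rate.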
GET /health
Health check endpoint.
curl http://localhost:8765/health
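A script that starts the service and then uses it may want to wait until the health check passes. The following is a sketch under one assumption: that a healthy service answers `/health` with HTTP 200 (the response body is not documented here, so it is ignored). The names `is_healthy` and `wait_until_healthy` are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(base_url: str = "http://localhost:8765") -> bool:
    """Return True when GET /health answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe: Callable[[], bool] = is_healthy,
                       attempts: int = 10, delay: float = 1.0) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The probe is passed in as a parameter so the retry loop can be exercised without a running server.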
GET /models
List available models and voices.
curl http://localhost:8765/models
WebSocket Streaming (Real-time Voice)
For real-time voice conversations, use WebSocket endpoints:
WS /ws/stt - Streaming Speech-to-Text
Stream audio and receive transcriptions in real-time.
const ws = new WebSocket('ws://localhost:8765/ws/stt');
// Receive transcriptions
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(data.text); // Transcribed text
};
// Wait for the connection to open before sending anything
ws.onopen = () => {
  // Send audio chunks (16kHz, 16-bit, mono PCM)
  ws.send(audioBuffer);
  // Flush remaining audio
  ws.send(JSON.stringify({action: "flush"}));
};
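The streaming endpoint expects raw 16kHz, 16-bit, mono PCM, so a client has to cut its audio into chunks of that format before sending. The sketch below shows the arithmetic using only the standard library. The 100 ms chunk length and the helper names `load_pcm` and `pcm_chunks` are assumptions for illustration; the service may accept other chunk sizes.

```python
import wave
from typing import Iterator

SAMPLE_RATE = 16_000  # Hz
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
CHANNELS = 1          # mono


def load_pcm(path: str) -> bytes:
    """Read a WAV file and return its raw frames, rejecting other formats."""
    with wave.open(path, "rb") as wav:
        fmt = (wav.getframerate(), wav.getsampwidth(), wav.getnchannels())
        if fmt != (SAMPLE_RATE, SAMPLE_WIDTH, CHANNELS):
            raise ValueError(f"expected 16kHz 16-bit mono, got {fmt}")
        return wav.readframes(wav.getnframes())


def pcm_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive chunk_ms-long slices of raw PCM; the last may be shorter."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    bytes_per_ms = SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS // 1000  # 32 bytes
    step = bytes_per_ms * chunk_ms
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]
```

At this format one second of audio is 32,000 bytes, so a 100 ms chunk is 3,200 bytes. Each yielded chunk would be sent as one binary WebSocket message, followed by the flush action once the audio ends.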
WS /ws/tts - Streaming Text-to-Speech
Send text and receive audio chunks in real-time.
const ws = new WebSocket('ws://localhost:8765/ws/tts');
// Receive audio chunks
ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    // Audio chunk - play it (playAudio is your own playback function)
    playAudio(event.data);
  }
};
// Send text to synthesize once the connection is open
ws.onopen = () => {
  ws.send(JSON.stringify({text: "Hello world"}));
};
WS /ws/voice - Full Duplex Voice Conversation
Stream audio input and receive audio output for real-time voice-to-voice.
const ws = new WebSocket('ws://localhost:8765/ws/voice');
// Stream microphone audio
navigator.mediaDevices.getUserMedia({audio: true})
  .then(stream => {
    // Send audio chunks to WebSocket
  });
// Handle responses
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "transcript") {
    // User's speech transcribed - send to your AI
    sendToAI(data.text);
  }
};
// Send AI response to be spoken (call this once your AI has replied)
const speak = (aiResponse) => {
  ws.send(JSON.stringify({action: "speak", text: aiResponse}));
};
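The message flow above (a transcript arrives, your AI produces a reply, the reply goes back as a speak action) can be isolated into one function that is independent of any WebSocket library. This is a sketch: `handle_voice_message` and the `ask_ai` callback are hypothetical names, and treating binary frames as audio to be skipped is an assumption about the protocol.

```python
import json
from typing import Callable, Optional, Union


def handle_voice_message(raw: Union[str, bytes],
                         ask_ai: Callable[[str], str]) -> Optional[str]:
    """Turn a transcript message into a speak action; return None for anything else."""
    if isinstance(raw, bytes):
        return None  # assumed to be an audio frame, not a JSON control message
    try:
        message = json.loads(raw)
    except ValueError:
        return None
    if not isinstance(message, dict) or message.get("type") != "transcript":
        return None
    text = message.get("text")
    if not isinstance(text, str) or not text.strip():
        return None
    # Ask the AI for a reply and wrap it in the documented speak action.
    return json.dumps({"action": "speak", "text": ask_ai(text)})
```

A client loop would pass each incoming message through this function and send any non-None result back over the same socket.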
Configuration
Set environment variables or edit config.py:
| Variable | Default | Description |
|----------|---------|-------------|
| STT_MODEL | base | Whisper model: tiny, base, small, medium |
| TTS_ENGINE | auto | TTS engine: piper, pyttsx3, auto |
| DEVICE | auto | Compute device: cpu, cuda, auto |
| HOST | 0.0.0.0 | Server bind address |
| PORT | 8765 | Server port |
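To illustrate how the variables and defaults in this table fit together, here is a sketch of an environment-based loader. It is not the contents of the shipped config.py, which may differ: the `Settings` class, `load_settings` function, and the strict validation against the listed values are assumptions made for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    stt_model: str = "base"
    tts_engine: str = "auto"
    device: str = "auto"
    host: str = "0.0.0.0"
    port: int = 8765


# Accepted values as listed in the table (validating them is an assumption).
ALLOWED = {
    "stt_model": {"tiny", "base", "small", "medium"},
    "tts_engine": {"piper", "pyttsx3", "auto"},
    "device": {"cpu", "cuda", "auto"},
}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to the defaults."""
    settings = Settings(
        stt_model=env.get("STT_MODEL", "base"),
        tts_engine=env.get("TTS_ENGINE", "auto"),
        device=env.get("DEVICE", "auto"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8765")),
    )
    for name, allowed in ALLOWED.items():
        value = getattr(settings, name)
        if value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return settings
```

For example, starting the service with `STT_MODEL=small PORT=9000` set in the environment would yield the small model on port 9000, with every other field left at its default.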
Model Sizes
| STT Model | Size | Speed | Accuracy |
|-----------|------|-------|----------|
| tiny | ~75MB | Fastest | Basic |
| base | ~150MB | Fast | Good |
| small | ~500MB | Medium | Better |
| medium | ~1.5GB | Slower | Best |
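On storage-constrained devices, the approximate sizes above can drive the choice of STT_MODEL. The sketch below picks the largest model that fits a given download budget. The function name `largest_model_within` is hypothetical, and the figures are the rounded sizes from the table, not measured values.

```python
# Approximate download sizes in MB, taken from the table above.
MODEL_SIZES_MB = {"tiny": 75, "base": 150, "small": 500, "medium": 1500}


def largest_model_within(budget_mb: float) -> str:
    """Return the largest model whose approximate size fits within budget_mb."""
    fitting = [name for name, size in MODEL_SIZES_MB.items() if size <= budget_mb]
    if not fitting:
        raise ValueError(f"no model fits within {budget_mb} MB")
    return max(fitting, key=MODEL_SIZES_MB.__getitem__)
```

The result can be exported as STT_MODEL before starting the service. Disk size is only one constraint; speed and memory use also grow with model size.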
OpenClaw Integration
Register this service with your OpenClaw server:
openclaw service register http://device-ip:8765
Then use in your workflows:
- action: stt
  input: ${audio_file}
  output: transcription
- action: tts
  input: "Hello, ${user_name}!"
  output: greeting_audio
Requirements
- Python 3.9+
- 2GB RAM minimum (4GB recommended for the medium model)
- ~500MB disk space (plus model storage)