Once the image part is done, it's time to work on the audio. There are two APIs for handling audio:

- **Transcription API**: transcribes audio into text, which means generating subtitles for the audio. It uses the `whisper` model.
- **Text-To-Speech API**, also known as TTS: generates speech from text.
Transcription#
Let's start with the code; the overall steps are the same as before:
private final OpenAiAudioTranscriptionModel openAiAudioTranscriptionModel;

/**
 * Audio transcription
 * @param file Audio file
 * @return String
 */
@PostMapping(value = "/transcriptions")
public String transcriptions(@RequestPart("file") MultipartFile file) {
    var transcriptionOptions = OpenAiAudioTranscriptionOptions.builder()
            .withResponseFormat(OpenAiAudioApi.TranscriptResponseFormat.TEXT)
            .withTemperature(0f)
            .build();
    AudioTranscriptionPrompt transcriptionRequest = new AudioTranscriptionPrompt(file.getResource(), transcriptionOptions);
    AudioTranscriptionResponse response = openAiAudioTranscriptionModel.call(transcriptionRequest);
    return response.getResult().getOutput();
}
The most important parameter here is `ResponseFormat`, which determines the format of the generated output. It can be `txt`, `json`, `srt`, etc.; `srt` is the commonly used subtitle file format. For other parameter configurations, please refer to the official documentation.

The only thing to note is that the audio file must be passed in as a `Resource`.
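For example, to get SRT subtitles from a file on disk instead of an uploaded one, the call could look like the following. This is a minimal sketch: the file path is made up, and any Spring `Resource` implementation (`FileSystemResource`, `ClassPathResource`, `ByteArrayResource`, ...) should work.

// A minimal sketch: request SRT output for a local file (the path is hypothetical).
var srtOptions = OpenAiAudioTranscriptionOptions.builder()
        .withResponseFormat(OpenAiAudioApi.TranscriptResponseFormat.SRT)
        .withTemperature(0f)
        .build();
// Any Resource implementation will do here.
Resource audioFile = new FileSystemResource("/tmp/podcast.mp3");
AudioTranscriptionResponse srtResponse = openAiAudioTranscriptionModel
        .call(new AudioTranscriptionPrompt(audioFile, srtOptions));
String subtitles = srtResponse.getResult().getOutput(); // SRT-formatted text, ready to save as a .srt file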
If you are going through an intermediary (relay) API provider, check whether this model is supported before testing.
TTS#
There are two types of TTS responses: regular and streaming. Let's focus on the streaming response. Here is the code:
private final OpenAiAudioSpeechModel openAiAudioSpeechModel;

/**
 * Real-time TTS streaming
 * @param message Text
 * @return SseEmitter
 */
@GetMapping(value = "/tts", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public SseEmitter tts(@RequestParam String message) {
    OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
            .withVoice(OpenAiAudioApi.SpeechRequest.Voice.ALLOY)
            .withSpeed(1.0f)
            .withResponseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
            .withModel(OpenAiAudioApi.TtsModel.TTS_1_HD.value)
            .build();
    SpeechPrompt speechPrompt = new SpeechPrompt(message, speechOptions);
    String uuid = UUID.randomUUID().toString();
    SseEmitter emitter = SseEmitterUtils.connect(uuid);
    Flux<SpeechResponse> responseStream = openAiAudioSpeechModel.stream(speechPrompt);
    responseStream.subscribe(response -> {
        // Each chunk is raw MP3 bytes; Base64-encode it before pushing it over SSE
        byte[] output = response.getResult().getOutput();
        String base64Audio = Base64.getEncoder().encodeToString(output);
        SseEmitterUtils.sendMessage(uuid, base64Audio);
    });
    return emitter;
}
Here is an explanation of each parameter:

| Parameter | Explanation |
|---|---|
| Voice | The voice of the narrator |
| Speed | The speed of speech synthesis. The acceptable range is 0.25 (slowest) to 4.0 (fastest); 1.0 is the default |
| ResponseFormat | The format of the audio output. Supported formats are mp3, opus, aac, flac, wav, and pcm |
| Model | The model, either TTS_1 or TTS_1_HD; the HD variant produces better results |
Please note that currently only the first four formats (mp3, opus, aac, and flac) are actually available for the audio output; wav and pcm are not supported. This matters quite a bit, because PCM can be decoded directly in the browser and is well suited to streaming, whereas MP3 has to be transcoded first.
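The snippet above covers the streaming variant; the regular variant is simply a single blocking call that returns the whole clip at once. A minimal sketch, reusing the `speechOptions` built above (writing the bytes to a local file is only for illustration):

// Minimal sketch of the non-streaming call: one request, one complete audio clip.
SpeechPrompt prompt = new SpeechPrompt("Hello from Spring AI", speechOptions);
SpeechResponse response = openAiAudioSpeechModel.call(prompt); // blocks until the full clip is generated
byte[] audio = response.getResult().getOutput();
Files.write(Path.of("tts-output.mp3"), audio); // saving to disk here is purely illustrative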
As you can see, the result of TTS is a `byte[]` array, which is Base64-encoded and sent to the frontend via SSE. The frontend then needs to decode it back into audio. I asked Claude to write a test page for this:
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Real-time Streaming MP3 TTS Player</title>
</head>
<body>
    <h1>Real-time Streaming MP3 TTS Player</h1>
    <input type="text" id="textInput" placeholder="Enter the text to convert">
    <button onclick="startStreaming()">Start Playback</button>
    <audio id="audioPlayer" controls></audio>

    <script>
        let mediaSource;
        let sourceBuffer;
        let audioQueue = [];
        let isPlaying = false;

        function startStreaming() {
            const text = document.getElementById('textInput').value;
            const encodedText = encodeURIComponent(text);
            const eventSource = new EventSource(`http://127.0.0.1:8868/audio/tts?message=${encodedText}`);
            const audio = document.getElementById('audioPlayer');

            // Feed the <audio> element through a MediaSource so MP3 chunks can be appended as they arrive
            mediaSource = new MediaSource();
            audio.src = URL.createObjectURL(mediaSource);

            mediaSource.addEventListener('sourceopen', function() {
                sourceBuffer = mediaSource.addSourceBuffer('audio/mpeg');
                sourceBuffer.addEventListener('updateend', playNextChunk);
            });

            audio.play();

            eventSource.onopen = function(event) {
                console.log('Connection opened');
            };

            eventSource.onmessage = function(event) {
                const audioChunk = base64ToArrayBuffer(event.data);
                audioQueue.push(audioChunk);
                if (!isPlaying) {
                    playNextChunk();
                }
            };

            eventSource.onerror = function(error) {
                console.error('Error:', error);
                if (eventSource.readyState === EventSource.CLOSED) {
                    console.log('Connection closed');
                }
                eventSource.close();
            };
        }

        // Decode one Base64 SSE message back into raw MP3 bytes
        function base64ToArrayBuffer(base64) {
            const binaryString = window.atob(base64);
            const len = binaryString.length;
            const bytes = new Uint8Array(len);
            for (let i = 0; i < len; i++) {
                bytes[i] = binaryString.charCodeAt(i);
            }
            return bytes.buffer;
        }

        // Append the next queued chunk to the SourceBuffer once it is free
        function playNextChunk() {
            if (audioQueue.length > 0 && !sourceBuffer.updating) {
                isPlaying = true;
                const chunk = audioQueue.shift();
                sourceBuffer.appendBuffer(chunk);
            } else {
                isPlaying = false;
            }
        }
    </script>
</body>
</html>