Use a websocket URL to send audio media to AutoTranscribe. Begin the stream with a `startStream` message with the appropriate feature parameters specified. When there is no more audio to send for the stream request, send a `finishStream` message to indicate this to the Voice server. The server then returns a `finalResponse`, which contains a summary of the stream request. Transcription settings for the stream are specified in the `startStream` message.
For real-time live streaming, ASAPP recommends that you stream audio chunk-by-chunk: send every 20ms or 100ms of audio as one binary message, then send the next chunk after the corresponding 20ms or 100ms interval.
If the chunk is too small, it requires more audio binary messages and more downstream message handling; if the chunk is too big, it increases buffering pressure and slows down server responsiveness.
Exceptionally large chunks may result in WebSocket transport errors such as timeouts.
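For instance, a minimal pacing sketch in Python, assuming an already-open WebSocket connection (`ws`) and a raw 8 kHz, 16-bit PCM audio file; the file handling and connection setup are assumptions for the example:

```python
import asyncio

CHUNK_MS = 100                      # send 100 ms of audio per binary message
SAMPLE_RATE = 8000                  # samples/sec, per AutoTranscribe requirements
BYTES_PER_SAMPLE = 2                # 16-bit PCM
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 1600 bytes

async def stream_audio(ws, pcm_path: str) -> None:
    """Send raw PCM audio chunk-by-chunk, pacing sends to real time."""
    with open(pcm_path, "rb") as f:
        while chunk := f.read(CHUNK_BYTES):
            await ws.send(chunk)                  # one binary WebSocket message per chunk
            await asyncio.sleep(CHUNK_MS / 1000)  # wait before sending the next chunk
```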
Training recordings should be provided as `.WAV` media files with speaker-separated channels. Recordings for training and real-time streams should both have the same sample rate (8000 samples/sec) and audio encoding (16-bit PCM). See the Customization section of the AutoTranscribe Product Guide for more on data requirements for transcription model training.

All API requests must use the `HTTPS` protocol. Traffic using `HTTP` will not be redirected to `HTTPS`.

`asapp-api-id` and `asapp-api-secret` are required header parameters, both of which will be provided to you by ASAPP.

Requests must also include an `externalId`. ASAPP refers to this identifier from the client's system in real-time streaming use cases to redact utterances using context from other utterances in the same conversation (e.g. a reference to a credit card in an utterance from 20 seconds earlier). It is the client's responsibility to ensure `externalId` is unique.

`POST /autotranscribe/v1/streaming-url`
Headers (required)
The endpoint returns a websocket URL containing a short-lived access token:

`wss://<internal-voice-gateway-ingress>?token=<short_lived_access_token>`

A WebSocket connection will be established if the `short_lived_access_token` is validated. Otherwise, the requested connection will be rejected.
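A minimal sketch of obtaining the streaming URL and opening the WebSocket connection; the base URL and the `streamingUrl` response field name are assumptions for this example:

```python
import asyncio
import requests
import websockets

API_BASE = "https://api.sandbox.asapp.com"   # assumed base URL for this example

def get_streaming_url(api_id: str, api_secret: str) -> str:
    """Request a short-lived websocket URL from the streaming-url endpoint."""
    resp = requests.post(
        f"{API_BASE}/autotranscribe/v1/streaming-url",
        headers={"asapp-api-id": api_id, "asapp-api-secret": api_secret},
    )
    resp.raise_for_status()
    return resp.json()["streamingUrl"]        # hypothetical response field name

async def main() -> None:
    url = get_streaming_url("my-api-id", "my-api-secret")
    async with websockets.connect(url) as ws:
        ...  # send startStream, stream audio, send finishStream

asyncio.run(main())
```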
| Step | Send Your Request | Receive ASAPP Response |
|---|---|---|
| 1 | `startStream` message | `startResponse` message |
| 2 | Stream audio | `transcript` message |
| 3 | `finishStream` message | `finalResponse` message |
Send a `startStream` message with information about the speaker, including their role (customer, agent) and their unique identifier (`externalId`) from your system, before sending any audio packets.
Use the `startStream` message to adjust default transcription settings.

For example, the default `language` transcription setting is `en-US` if not specified in the `startStream` message. To set the language to Spanish, set the `language` field to `es-US`. Once set, AutoTranscribe will expect a Spanish conversation in the audio stream and return transcribed message text in Spanish.
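For illustration, a `startStream` payload using the fields documented later on this page might look like the sketch below; the `message` envelope key and exact nesting are assumptions, so follow the schema provided to you by ASAPP:

```python
import json

# Hypothetical startStream payload for a Spanish-language agent stream;
# the "message" envelope key and nesting are assumptions.
start_stream = {
    "message": "startStream",
    "sender": {"role": "agent", "externalId": "BL2341334"},
    "language": "es-US",
    "samplingRate": 8000,
    "encoding": "L16",
    "smartFormatting": True,
    "detailedToken": False,
    "audioRecordingAllowed": False,
    "redactionOutput": "redacted",
}

# await ws.send(json.dumps(start_stream))
```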
After you send the `startStream` message, the server will respond with a `startResponse` message if the request is granted:
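A plausible `startResponse` shape, built from the `streamID` and status fields described below (exact field names and nesting may differ):

```python
# Plausible startResponse payload; field names other than streamID are assumptions.
start_response = {
    "message": "startResponse",
    "streamID": "b1f9c2d4-8a3e-4f5a-9c6d-0e1f2a3b4c5d",
    "status": {"code": 1000, "description": "OK"},
}
```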
`streamID` is a unique identifier assigned to the connection by the ASAPP server. The status code and description may contain additional useful information.

If there is an application status code error with the request, the ASAPP server sends a `finalResponse` message with an error description, and the server then closes the connection.
Audio can be streamed as soon as the `startStream` message is sent, without needing to wait for the `startResponse`. However, it is possible a request could be rejected, either due to an invalid `startStream` or internal server errors. If that is the case, the server notifies with a `finalResponse` message, and any streamed audio packets will be dropped by the server.
Audio must be sent as binary data over the WebSocket protocol:
ws.send(<binary_blob>)
The server does not acknowledge receiving individual audio packets. The summary in the `finalResponse` message can be used to verify if any audio packet was not received by the server.
If audio can be transcribed, the server sends back `transcript` messages asynchronously.
For real-time live streaming, it is recommended that audio streams are sent chunk-by-chunk, sending every 20ms or 100ms of audio as one binary message. Exceptionally large chunks may result in WebSocket transport errors such as timeouts.
Transcribed text is returned in a `transcript` message, which contains one complete utterance.

Example of a `transcript` message:
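A plausible `transcript` payload, using the `start`, `end`, and `utterance` fields documented below; the envelope key and nesting are assumptions:

```python
# Plausible transcript payload for one complete utterance; nesting is an assumption.
transcript = {
    "message": "transcript",
    "streamID": "b1f9c2d4-8a3e-4f5a-9c6d-0e1f2a3b4c5d",
    "start": 0,
    "end": 300,
    "utterance": [
        {"text": "Hi, my ID is 123."},
    ],
}
```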
You can also call `GET /messages` to receive all the transcript messages for a completed call.
Conversation transcripts are available for seven days after they are completed.
Indicate the end of audio by sending a `finishStream` message. Any audio message sent after `finishStream` will be dropped by the service.

Not only will audio sent after `finishStream` be dropped, the service will send a `finalResponse` with error code 4056 (Wrong message order) and the connection will be closed.

The server sends a `finalResponse` at the end of the streaming session and closes the connection, after which the server will stop processing incoming messages for the stream. It is safe to close the WebSocket connection when the `finalResponse` message is received.
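A sketch of ending a stream, assuming `finishStream` and `finalResponse` use a JSON envelope with a `message` key as in the earlier examples:

```python
import json

async def finish_stream(ws) -> dict:
    """Signal end of audio, then wait for the finalResponse summary."""
    await ws.send(json.dumps({"message": "finishStream"}))   # envelope key is an assumption
    while True:
        # Remaining transcript messages are consumed and ignored here.
        msg = json.loads(await ws.recv())
        if msg.get("message") == "finalResponse":
            return msg   # safe to close the WebSocket once this arrives
```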
The server will end a given stream session if any of the following are true:

- a `finishStream` message has been received and all audio received has been processed

When the session ends, the server sends a `finalResponse` message.
The `finalResponse` message has a summary of the stream along with the status code, which you can use to verify whether any audio packets or transcript messages are missing.
The following fields can be set in the `startStream` message:

| Field | Description | Default | Supported Values |
|---|---|---|---|
| sender.role (required) | A participant role, usually the customer or an agent for human participants. | n/a | "agent", "customer" |
| sender.externalId (required) | Participant ID from the external system; it should be the same for all interactions of the same individual. | n/a | "BL2341334" |
| language | IETF language tag | en-US | "en-US", "es-US" |
| samplingRate | Audio samples/sec | 8000 | 8000 |
| encoding | 'L16': PCM data with 16 bits/sample | L16 | "L16" |
| smartFormatting | Request for post-processing: Inverse Text Normalization (convert spoken form to written form, e.g., 'twenty two' -> '22'), plus automatic punctuation and capitalization. | true | true, false |
| detailedToken | If true, outputs word-level details such as word content, timestamp, and word type. | false | true, false |
| audioRecordingAllowed | false: ASAPP will not record the audio; true: ASAPP may record and store the audio for this conversation. | false | true, false |
| redactionOutput | If detailedToken is true along with value 'redacted' or 'redacted_and_unredacted', the request will be rejected. If no redaction rules are configured by the client for 'redacted' or 'redacted_and_unredacted', the request will be rejected. If smartFormatting is false, requests with value 'redacted' or 'redacted_and_unredacted' will be rejected. | redacted | "redacted", "unredacted", "redacted_and_unredacted" |
Each `transcript` message contains the following fields:

| Field | Description | Format | Example Syntax |
|---|---|---|---|
| start | Start time of the utterance (in milliseconds) relative to the start of the audio input | integer | 0 |
| end | End time of the utterance (in milliseconds) relative to the start of the audio input | integer | 300 |
| utterance.text | The written text of the utterance. While an utterance can have multiple alternatives (e.g., 'me two' vs. 'me too'), ASAPP provides only the most probable alternative, based on model prediction confidence. | array | "Hi, my ID is 123." |
If `detailedToken` in the `startStream` request is set to true, additional fields are provided within the `utterance` array for each token:
| Field | Description | Format | Example Syntax |
|---|---|---|---|
| token.content | Text or punctuation | string | "is", "?" |
| token.start | Start time of the token (in milliseconds) relative to the start of the audio input | integer | 170 |
| token.end | End time of the token's audio (in milliseconds) relative to the start of the audio input. There may be silence afterward, so it does not necessarily match the start of the next token. | integer | 200 |
| token.punctuationAfter | Optional; punctuation attached after the content | string | '.' |
| token.punctuationBefore | Optional; punctuation attached in front of the content | string | '"' |
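When `detailedToken` is true, each utterance entry may carry token-level details like the sketch below; the `tokens` key and nesting are assumptions:

```python
# Hypothetical utterance entry with detailedToken output; nesting is an assumption.
utterance_with_tokens = {
    "text": "Hi, my ID is 123.",
    "tokens": [
        {"content": "is", "start": 170, "end": 200},
        {"content": "123", "start": 210, "end": 480, "punctuationAfter": "."},
    ],
}
```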
To boost recognition of custom vocabulary phrases, send an `updateVocabulary` message.
The `updateVocabulary` message can be sent multiple times during a session. Vocabulary is additive, which means the new vocabulary words are appended to the previous ones. If vocabulary is sent in between audio packets, it will take effect only after the end of the utterance currently being processed.
All `updateVocabulary` changes are valid only for the current WebSocket session.
The following fields are part of an `updateVocabulary` message:
| Field | Description | Mandatory | Example Syntax |
|---|---|---|---|
| phrase | Phrase that needs to be boosted. Avoid adding long phrases; add them as separate entries instead. | Yes | "IEEE" |
| soundsLike | The ways in which the phrase can be said/pronounced. Rules: spell out numbers (25 -> 'two five' and/or 'twenty five'); spell out acronyms (WHO -> 'w h o'); use lowercase letters for everything; limit phrases to English and Spanish-language letters (accented consonants and vowels accepted). | No | "i triple e" |
| category | Supported categories: 'address', 'name', 'number'. Categories help the AutoTranscribe service normalize the provided phrase so it can infer the ways in which the phrase may be pronounced; e.g., '717 N Blvd' with the 'address' category will be normalized to 'seven one seven North Boulevard'. | No | "address", "name", "number", "company", "currency" |
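For illustration, an `updateVocabulary` message built from these fields might look like the following; the envelope key and the `phrases` list wrapper are assumptions:

```python
import json

# Hypothetical updateVocabulary payload; the envelope key and list wrapper are assumptions.
update_vocabulary = {
    "message": "updateVocabulary",
    "phrases": [
        {"phrase": "IEEE", "soundsLike": ["i triple e"]},
        {"phrase": "717 N Blvd", "category": "address"},
    ],
}

# await ws.send(json.dumps(update_vocabulary))
```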
The following status codes may be returned:

| Status code | Description |
|---|---|
| 1000 | OK |
| 1008 | Invalid or expired access token |
| 2002 | Error in fetching conversationId. This error code is only possible when integration with other AI Services is enabled |
| 4040 | Message format incorrect |
| 4050 | Language not supported |
| 4051 | Encoding not supported |
| 4053 | Sample rate not supported |
| 4056 | Wrong message order or missing required message |
| 4080 | Unable to transcribe the audio |
| 4082 | Audio decode failure |
| 4083 | Connection idle timeout. Try streaming audio in real time |
| 4084 | Custom vocabulary phrase exceeds limit |
| 4090 | Streaming duration over limit |
| 4091 | Invalid vocabulary format |
| 4092 | Redact only smart formatted text |
| 4093 | Redaction only supported if detailedToken is true |
| 4094 | redactionOutput cannot be unredacted or redacted_and_unredacted because the global config is set to always redact |
| 5000 | Internal service error |
| 5001 | Service shutting down |
| 5002 | No instances available |
GET /conversation/v1/conversation/messages
Use this endpoint to retrieve all the transcript messages for a completed call.
When to Call
Once the conversation is complete. Conversation transcripts are available for seven days after they are completed.
Request Details
Requests must include a call identifier with the GUID/UCID of the SIPREC call.
Response Details
When successful, this endpoint responds with an array of objects, each of which corresponds to a single message. Each object contains the text of the message, the sender’s role and identifier, a unique message identifier, and timestamps.
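A sketch of calling this endpoint, assuming the GUID/UCID is passed as a query parameter; the parameter name and base URL here are assumptions:

```python
import requests

API_BASE = "https://api.sandbox.asapp.com"   # assumed base URL for this example

def get_transcript(api_id: str, api_secret: str, call_guid: str) -> list[dict]:
    """Fetch all transcript messages for a completed call."""
    resp = requests.get(
        f"{API_BASE}/conversation/v1/conversation/messages",
        headers={"asapp-api-id": api_id, "asapp-api-secret": api_secret},
        params={"externalConversationId": call_guid},  # hypothetical parameter name
    )
    resp.raise_for_status()
    return resp.json()   # array of message objects: text, sender, id, timestamps
```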
Transcription settings are specified in the `startStream` websocket message when call transcription is initiated. All transcripts retrieved after the call will reflect the settings initially requested in the `startStream` message.

Transcript data is also available in bulk through the File Exporter service's `utterances` data feed.
The File Exporter service is meant to be used as a batch mechanism for exporting data to your data warehouse, either on a scheduled basis (e.g. nightly, weekly) or for ad hoc analyses. Data that populates feeds for the File Exporter service updates once daily at 2:00AM UTC.
Visit Retrieving Data for AI Services for a guide on how to interact with the File Exporter service.