Deploying AI Transcribe for Twilio

This guide covers the Twilio Media Gateway solution pattern, which consists of the following components to receive speech audio from Twilio, receive call signals, and return call transcripts:

Media gateways for receiving call audio from Twilio
HTTPS API which enables the customer to GET a streaming URL to which call audio is sent and POST requests to start and stop call transcription
Webhook to POST real-time transcripts to a designated URL of your choosing, alongside two additional APIs to retrieve transcripts after-call for one or a batch of conversations

ASAPP works with you to understand your current telephony infrastructure and ecosystem. Your ASAPP account team will also identify the main use case(s) for the transcript data to decide where and how call transcripts should be sent. ASAPP then completes the architecture definition, including integration points into the existing infrastructure.

Integration Steps

There are four steps to integrate AI Transcribe into Twilio:

Authenticate with ASAPP and Obtain a Twilio Media Stream URL
Send Audio to Media Gateway
Send Start and Stop Requests
Receive Transcript Outputs

Requirements

Audio Stream Codec

Twilio provides audio in mu-law format at an 8000 Hz sample rate, which ASAPP supports. You do not need to modify or transcode the audio when forking it to ASAPP.

When supplying recorded audio to ASAPP for AI Transcribe model training prior to implementation, send uncompressed .WAV media files with speaker-separated channels.

Recordings for training should have a sample rate of 8000 Hz and 16-bit PCM audio encoding. See the Customization section of the AI Transcribe Product Guide for more on data requirements for transcription model training.

Developer Portal

ASAPP provides an AI Services Developer Portal. Within the portal, developers can do the following:

Access relevant API documentation (e.g. OpenAPI reference schemas)
Access API keys for authorization
Manage user accounts and apps

Visit Get Started for instructions on creating a developer account, managing teams and apps, and setting up AI Service APIs.

Integrate with Twilio

1. Authenticate with ASAPP and Obtain a Twilio Media Stream URL

A Twilio media stream URL is required to start streaming audio. Begin by authenticating with ASAPP to obtain this URL.

All requests to ASAPP sandbox and production APIs must use the HTTPS protocol. Traffic using HTTP will not be redirected to HTTPS.

The following HTTPS REST API enables authentication with the ASAPP API Gateway:

GET /mg-autotranscribe/v1/twilio-media-stream-url

HTTP headers (required):

{
    "asapp-api-id": "<ASAPP-provided API ID>",
    "asapp-api-secret": "<ASAPP-provided API secret>"
}

ASAPP provides these header parameters to you in the Developer Portal.

HTTP response body:

{
   "streamingUrl": "<short-lived URL for twilio media stream>"
}

If authentication succeeds, the HTTP response body returns a short-lived secure WebSocket access URL. The TTL (time-to-live) for this URL is 5 minutes.

The system checks the validity of the short-lived URL only at the beginning of the WebSocket connection, so sessions can last as long as needed. You can use the same short-lived access URL to start as many unique sessions as desired within the 5-minute TTL.

For example, if the call center has an average rate of 1 new call per second, it can use the same short-lived access URL to initiate 300 total calls (60 calls per minute × 5 minutes). Each call can last as long as needed, whether it is 2 minutes long or longer than 30 minutes. After the 5-minute TTL, however, you need to obtain a new short-lived access URL to start any new calls. We recommend obtaining a new short-lived URL before the current one expires so that a valid URL is always available.
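The reuse-then-refresh behavior described above can be sketched in Python as follows. This is a minimal illustration, not an ASAPP-provided client: the host `api.example.invalid`, the one-minute refresh margin, and the names `fetch_streaming_url` and `StreamingUrlCache` are assumptions made for the example. Only the endpoint path, the header names, the `streamingUrl` field, and the 5-minute TTL come from this guide.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Hypothetical host: replace with the API host ASAPP gives you.
ASAPP_API_HOST = "https://api.example.invalid"
URL_PATH = "/mg-autotranscribe/v1/twilio-media-stream-url"
URL_TTL_SECONDS = 300        # 5-minute TTL stated in this guide
REFRESH_MARGIN_SECONDS = 60  # assumption: refresh one minute before expiry


def fetch_streaming_url(api_id: str, api_secret: str) -> str:
    """GET a short-lived streaming URL from the ASAPP API Gateway."""
    request = urllib.request.Request(
        ASAPP_API_HOST + URL_PATH,
        headers={"asapp-api-id": api_id, "asapp-api-secret": api_secret},
        method="GET",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["streamingUrl"]


class StreamingUrlCache:
    """Reuses one URL for new calls until it nears the end of its TTL."""

    def __init__(self, fetch: Callable[[], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._clock = clock
        self._url: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return a URL that is still safe to use for starting a new call."""
        age = self._clock() - self._fetched_at
        if self._url is None or age >= URL_TTL_SECONDS - REFRESH_MARGIN_SECONDS:
            self._url = self._fetch()
            self._fetched_at = self._clock()
        return self._url
```

In use, you would construct the cache once, for example `StreamingUrlCache(lambda: fetch_streaming_url(api_id, api_secret))`, and call `get()` each time a new call needs a Media Stream. Calls already in progress are unaffected by a refresh because validity is checked only when the WebSocket connection opens.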

2. Send Audio to Media Gateway

With the URL obtained in the previous step, instruct Twilio to start sending a Media Stream to the ASAPP Media Gateway components. Media Gateway (MG) components receive real-time audio along with Call SID metadata.

Twilio provides multiple ways to initiate a Media Stream, which are described in their documentation.

When instructing Twilio to send Media Streams, we highly recommend that you provide a statusCallback URL. Twilio calls this URL if connectivity is lost or an error occurs. The customer call center then needs to process this callback and instruct Twilio to start a new Media Stream, assuming transcriptions are still desired.

See Handling Failures for Twilio Media Streams below for details.
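One of the ways Twilio offers to initiate a Media Stream is TwiML. The sketch below builds a `<Start><Stream>` instruction that includes a statusCallback URL. It is an illustration under assumptions: the function name `build_stream_twiml` is hypothetical, and the choice of `track="both_tracks"` assumes you want both the customer and agent audio forked. Check the attribute names against Twilio's `<Stream>` documentation before relying on them.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(streaming_url: str, status_callback_url: str) -> str:
    """Return TwiML that forks call audio to the given WebSocket URL."""
    response = ET.Element("Response")
    start = ET.SubElement(response, "Start")
    ET.SubElement(
        start,
        "Stream",
        {
            "url": streaming_url,                   # short-lived URL from step 1
            "track": "both_tracks",                 # assumption: fork both directions
            "statusCallback": status_callback_url,  # receives stream status events
            "statusCallbackMethod": "POST",
        },
    )
    # ElementTree escapes attribute values, so special characters in the
    # URL are serialized as valid XML.
    return ET.tostring(response, encoding="unicode")
```

Building the document with an XML library rather than string formatting avoids malformed TwiML when the URL contains characters that need escaping.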

ASAPP offers a software-as-a-service approach to hosting MGs at ASAPP’s VPC in the PCI-scoped zone.

Network Connectivity

Twilio cloud will send audio to ASAPP cloud via secure (TLS 1.2) WebSocket connections over the internet. The system does not require additional or custom networking.

Port Details

Ports and protocols in use for the AI Transcribe implementations are shown below:

Audio Streams: Secure WebSocket with destination port 443
API Endpoints: TCP 443

Handling Failures for Twilio Media Streams

There are multiple reasons (e.g., intermediate internet failures, scheduled maintenance) why a Twilio Media Stream could be interrupted mid-call. The only way to know that the Media Stream was interrupted is to use the statusCallback parameter (along with statusCallbackMethod if needed) of the Twilio API. Should a failure occur, the URL you specified in the statusCallback parameter receives an HTTP request reporting the failure.

If you receive a failure notification, ASAPP has stopped receiving audio from Twilio and no more transcriptions for that call will take place. To restart transcriptions:

Obtain a Twilio Media Stream URL: you can reuse the original streaming URL only if it is still within its 5-minute TTL; otherwise, request a new one.
Send Audio to Media Gateway: instruct Twilio through their API to start a new Media Stream to the Twilio Media Stream URL that ASAPP provided.
Send a Start request (see 3. Send Start and Stop Requests for details).
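The restart steps above could be triggered from a statusCallback handler along these lines. This is a sketch under assumptions: the form field names `CallSid` and `StreamEvent` and the event values `stream-error` and `stream-stopped` reflect Twilio's stream status callbacks as best understood here and should be verified against Twilio's documentation. The `call_is_active` and `restart` callables are hypothetical hooks into your own system.

```python
from typing import Callable
from urllib.parse import parse_qs

# Assumed Twilio event names that indicate audio is no longer flowing.
INTERRUPTION_EVENTS = {"stream-error", "stream-stopped"}


def handle_stream_status(body: str,
                         call_is_active: Callable[[str], bool],
                         restart: Callable[[str], None]) -> bool:
    """Restart transcription when a Media Stream ends while the call continues.

    `body` is the form-encoded payload of the statusCallback request.
    `restart` is expected to perform the three steps listed above:
    obtain a URL, start a new Media Stream, and send a Start request.
    Returns True when a restart was triggered.
    """
    params = {key: values[0] for key, values in parse_qs(body).items()}
    call_sid = params.get("CallSid", "")
    event = params.get("StreamEvent", "")
    if event in INTERRUPTION_EVENTS and call_sid and call_is_active(call_sid):
        restart(call_sid)
        return True
    return False
```

Checking that the call is still active avoids restarting a stream that stopped simply because the call ended.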

Generating Call Identifiers

AI Transcribe uses your call identifier to ensure a given call can be referenced in subsequent start and stop requests and associated with transcripts. Twilio will automatically generate a unique Call SID identifier for the call.

3. Send Start and Stop Requests

As outlined in Requirements, you must create user accounts in the Developer Portal to enroll apps and receive API keys to interact with ASAPP endpoints.

The /start-streaming and /stop-streaming endpoints of the Start/Stop API control when transcription occurs for every call. See the API Reference to learn how to interact with this API.

ASAPP does not begin transcribing call audio until you request it. This prevents transcription of audio at the very beginning of the audio streaming session, which may include IVR, hold music, or queueing.

Stop requests pause or end transcription for any needed reason. For example, you could use a stop request mid-call when the agent places the call on hold, or at the end of the call to prevent transcribing post-call interactions such as satisfaction surveys.

AI Transcribe is only meant to transcribe conversations between customers and agents. You should implement start and stop requests to ensure the system does not transcribe non-conversation audio (e.g., hold music, IVR menus, surveys). Attempted transcription of non-conversation audio will negatively impact other services meant to consume conversation transcripts, such as ASAPP AI Summary.
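As an illustration of pairing start and stop requests with call events, the sketch below builds request bodies using the field names shown in the Use Case Example later in this guide and tracks whether transcription is currently on. The `TranscriptionController` class and the injected `post` callable are hypothetical; authentication and error handling are left to whatever HTTP client you supply.

```python
from typing import Any, Callable, Dict

START_PATH = "/mg-autotranscribe/v1/start-streaming"
STOP_PATH = "/mg-autotranscribe/v1/stop-streaming"

# Hypothetical transport hook: post(path, json_body) sends one HTTPS POST.
Post = Callable[[str, Dict[str, Any]], None]


def build_start_request(call_sid: str, customer_id: str, agent_id: str,
                        language: str = "en-US") -> Dict[str, Any]:
    """Body for /start-streaming, mirroring the Use Case Example."""
    return {
        "namespace": "twilio",
        "guid": call_sid,
        "customerId": customer_id,
        "agentId": agent_id,
        "autotranscribeParams": {"language": language},
        "twilioParams": {"trackMap": {"inbound": "customer", "outbound": "agent"}},
    }


def build_stop_request(call_sid: str) -> Dict[str, Any]:
    """Body for /stop-streaming."""
    return {"namespace": "twilio", "guid": call_sid}


class TranscriptionController:
    """Sends at most one request per state change for a single call."""

    def __init__(self, post: Post, call_sid: str,
                 customer_id: str, agent_id: str) -> None:
        self._post = post
        self._call_sid = call_sid
        self._customer_id = customer_id
        self._agent_id = agent_id
        self._active = False

    def start(self) -> None:
        """Call when the agent connects or a hold ends."""
        if not self._active:
            self._post(START_PATH, build_start_request(
                self._call_sid, self._customer_id, self._agent_id))
            self._active = True

    def stop(self) -> None:
        """Call when a hold or survey begins, or the call ends."""
        if self._active:
            self._post(STOP_PATH, build_stop_request(self._call_sid))
            self._active = False
```

Tracking the active state locally keeps duplicate telephony events (for example, two hold notifications in a row) from producing redundant requests.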

4. Receive Transcript Outputs

AI Transcribe outputs transcripts using three separate mechanisms, each corresponding to a different temporal use case:

Real-time: Webhook posts complete utterances to your target endpoint as they are transcribed during the live conversation
After-call: GET endpoint responds to your requests for a designated call with the full set of utterances from that completed conversation
Batch: File Exporter service responds to your request for a designated time interval with a link to a data feed file that includes all utterances from that interval’s conversations

Real-Time via Webhook

ASAPP sends transcript outputs in real-time via HTTPS POST requests to a target URL of your choosing.

Authentication

Once the target is selected, work with your ASAPP account team to implement one of the following supported authentication mechanisms:

Custom CAs: Custom CA certificates for regular TLS (1.2 or above).
mTLS: Mutual TLS using custom certificates provided by the customer.
Secrets: A secret token. The secret name is configurable as is whether it appears in the HTTP header or as a URL parameter.
OAuth2 (client_credentials): Client credentials to fetch tokens from an authentication server.
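For the secret-token mechanism, the receiving server has to compare the token on each incoming POST with the value agreed with ASAPP. A minimal sketch of that check follows. The header name `x-webhook-secret` is a placeholder (the actual secret name is configured with your ASAPP account team), and `is_authorized` is a hypothetical helper.

```python
import hmac
from typing import Mapping

# Placeholder header name; the real name is whatever you configure with ASAPP.
SECRET_HEADER = "x-webhook-secret"


def is_authorized(headers: Mapping[str, str], expected_secret: str) -> bool:
    """Return True when the request carries the expected secret token."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    supplied = lowered.get(SECRET_HEADER, "")
    # compare_digest runs in constant time to avoid leaking the secret
    # through response-time differences.
    return hmac.compare_digest(supplied.encode("utf-8"),
                               expected_secret.encode("utf-8"))
```

Requests that fail this check would typically be rejected before the body is parsed.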

Expected Load

Target servers should be able to support receiving transcript POST messages for each utterance of every live conversation on which AI Transcribe is active. For reference, an average live call sends approximately 10 messages per minute. At that rate, 50 concurrent live calls represent approximately 8 messages per second. Please ensure you load test the selected target server to support anticipated peaks in concurrent call volume.

Transcript Timing and Format

Once you have started transcription for a given call stream using the /start-streaming endpoint, AI Transcribe begins to publish transcript messages, each of which contains a full utterance for a single call participant. The expected latency between when ASAPP receives audio for a completed utterance and provides a transcription of that same utterance is 200-600 ms.

Perceived latency will also be influenced by any network delay sending audio to ASAPP and receiving transcription messages in return.

Though we send messages in the order they are transcribed, network latency may impact the order in which they arrive or cause the system to drop messages due to timeouts. Where latency causes timeouts, the system drops the oldest pending messages first; AI Transcribe does not retry delivery of dropped messages.

The message body for transcript type messages is JSON encoded with these fields:

Field	Subfield	Description	Example Value
externalConversationId		Unique identifier with the Twilio Call SID for the call	CA5b040e075515c424391012acc5a870cf
streamId		Unique identifier that ASAPP assigns to each call participant’s stream returned in response to `/start-streaming` and `/stop-streaming`	5ce2b755-3f38-11ed-b755-7aed4b5c38d5
sender	externalId	Customer or agent identifier as provided in request to `/start-streaming`	ef53245
sender	role	A participant role, either customer or agent	customer, agent
autotranscribeResponse	message	Type of message	transcript
autotranscribeResponse	start	The start ms of the utterance	0
autotranscribeResponse	end	Elapsed ms since the start of the utterance	1000
autotranscribeResponse	utterance	Transcribed utterance text	Are you there?

Expected transcript message format:

{
  "type": "transcript",
  "externalConversationId": "<twilio call SID>",
  "streamId": "<streamId>",
  "sender": {
    "externalId": "<id>",
    "role": "<customer or agent>"
  },
  "autotranscribeResponse": {
    "message": "transcript",
    "start": 0,
    "end": 1000,
    "utterance": [
      {"text": "<transcript text>"}
    ]
  }
}
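Because messages can arrive out of order, a receiver may want to re-sort utterances before displaying or storing them. The sketch below parses the message format shown above and orders utterances by their `start` value. It assumes that `start` offsets for both participants of a call are on a shared timeline, which this guide does not state explicitly; `parse_transcript` and `ConversationBuffer` are hypothetical names.

```python
import json
from typing import Dict, List, Tuple

Utterance = Tuple[int, str, str]  # (start ms, role, text)


def parse_transcript(body: str) -> Tuple[str, Utterance]:
    """Extract the call identifier and one utterance from a webhook POST body."""
    message = json.loads(body)
    response = message["autotranscribeResponse"]
    # "utterance" is a list of objects with a "text" field; join them in order.
    text = " ".join(part["text"] for part in response["utterance"])
    utterance = (response["start"], message["sender"]["role"], text)
    return message["externalConversationId"], utterance


class ConversationBuffer:
    """Collects utterances per call and returns them sorted by start time."""

    def __init__(self) -> None:
        self._by_call: Dict[str, List[Utterance]] = {}

    def add(self, body: str) -> None:
        call_id, utterance = parse_transcript(body)
        self._by_call.setdefault(call_id, []).append(utterance)

    def ordered(self, call_id: str) -> List[Utterance]:
        # Assumption: start offsets are comparable across both participants.
        return sorted(self._by_call.get(call_id, []))
```

Since dropped messages are not retried, a buffer like this can reorder what arrives but cannot detect or fill gaps. The after-call or batch mechanisms are the way to obtain a complete transcript.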

Error Handling

Should your target server return an error in response to a POST request, ASAPP will record the error details for the failed message delivery and drop the message.

After-Call via GET Request

AI Transcribe makes a full transcript available at the following endpoint for a given completed call:

GET /conversation/v1/conversation/messages

Once a conversation is complete, make a request to the endpoint using a conversation identifier; the system returns every message in the conversation.

Message Limit

This endpoint responds with up to 1,000 transcribed messages per conversation, approximately a two-hour continuous call. You receive all messages in a single response without any pagination. To retrieve all messages for calls that exceed this limit, use either a real-time mechanism or File Exporter for transcript retrieval.

You set transcription settings (e.g., language, detailed tokens, redaction) for a given call with the Start/Stop API when you initiate call transcription. All transcripts retrieved after the call reflect the settings initially requested with the Start/Stop API.

See the API Reference to learn how to interact with this API.

Batch via File Exporter

AI Transcribe makes full transcripts for batches of calls available using the File Exporter service’s utterances data feed. You can use the File Exporter service as a batch mechanism for exporting data to your data warehouse, either on a scheduled basis (e.g., nightly, weekly) or for ad hoc analyses.

Data that populates feeds for the File Exporter service updates once daily at 2:00 AM UTC. Visit Retrieving Data from ASAPP for a guide on how to interact with the File Exporter service.

Use Case Example

Real-Time Transcription

This real-time transcription use case example consists of an English-language call between an agent and a customer with redaction enabled, ending with a hold. Note that redaction is enabled by default and does not need to be requested explicitly.

Obtain a Twilio media streaming URL destination by authenticating with ASAPP.

GET /mg-autotranscribe/v1/twilio-media-stream-url

Response STATUS 200: OK - Twilio Media Stream URL in the response body

    {
      "streamingUrl": "wss://localhost/twilio-media?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"
    }

With the URL obtained in the previous step, instruct Twilio to start a Media Stream to the ASAPP Media Gateway components. ASAPP will now receive real-time audio via the Twilio Stream along with metadata, most notably the call’s SID: CA5b040e075515c424391012acc5a870cf

When the customer and agent are connected, send ASAPP a request to start transcription for the call:

POST /mg-autotranscribe/v1/start-streaming

Request

    {
     "namespace": "twilio",
     "guid": "CA5b040e075515c424391012acc5a870cf",
     "customerId": "TT9833237",
     "agentId": "RE223444211993",
     "autotranscribeParams": {
       "language": "en-US"
     },
     "twilioParams": {
       "trackMap": {
         "inbound": "customer",
         "outbound": "agent"
       }
     }
    } 

Response STATUS 200: Router processed the request, details are in the response body

    {
     "isOk": true,
     "autotranscribeResponse": {
       "customer": {
         "streamId": "5ce2b755-3f38-11ed-b755-7aed4b5c38d5",
         "status": {
           "code": 1000,
           "description": "OK"
         }
       },
       "agent": {
         "streamId": "cf31116-3f38-11ed-9116-7a0a36c763f1",
         "status": {
           "code": 1000,
           "description": "OK"
         }
       }
     }
    }

The agent and customer begin their conversation, and ASAPP’s webhook publisher sends separate HTTPS POST transcript messages for each participant to a target endpoint configured to receive the messages.

HTTPS POST for Customer Utterance

    {
      "type": "transcript",
      "externalConversationId": "CA5b040e075515c424391012acc5a870cf",
      "streamId": "5ce2b755-3f38-11ed-b755-7aed4b5c38d5",
      "sender": {
        "externalId": "TT9833237",
        "role": "customer"
      },
      "autotranscribeResponse": {
        "message": "transcript",
        "start": 400,
        "end": 3968,
        "utterance": [
          {"text": "I need help upgrading my streaming package and my PIN number is ####"}
        ]
      }
    }

HTTPS POST for Agent Utterance

    {
      "type": "transcript",
      "externalConversationId": "CA5b040e075515c424391012acc5a870cf",
      "streamId": "cf31116-3f38-11ed-9116-7a0a36c763f1",
      "sender": {
        "externalId": "RE223444211993",
        "role": "agent"
      },
      "autotranscribeResponse": {
        "message": "transcript",
        "start": 4744,
        "end": 8031,
        "utterance": [
          {"text": "Thank you sir, let me pull up your account."}
        ]
      }
    }

Later in the conversation, the agent puts the customer on hold. This triggers a request to the /stop-streaming endpoint to pause transcription and prevent hold music and promotional messages from being transcribed.

POST /mg-autotranscribe/v1/stop-streaming

Request

    {
     "namespace": "twilio",
     "guid": "CA5b040e075515c424391012acc5a870cf"
    }

Response STATUS 200: Router processed the request, details are in the response body

    {
     "isOk": true,
     "autotranscribeResponse": {
       "customer": {
         "streamId": "5ce2b755-3f38-11ed-b755-7aed4b5c38d5",
         "status": {
           "code": 1000,
           "description": "OK"
         },
         "summary": {
           "totalAudioBytes": 1334720,
           "audioDurationMs": 83420,
           "streamingSeconds": 84,
           "transcripts": 2
         }
       },
       "agent": {
         "streamId": "cf31116-3f38-11ed-9116-7a0a36c763f1",
         "status": {
           "code": 1000,
           "description": "OK"
         },
         "summary": {
           "totalAudioBytes": 1334720,
           "audioDurationMs": 83420,
           "streamingSeconds": 84,
           "transcripts": 2
         }
       }
     }
    }

Data Security

ASAPP’s security protocols protect data at each point of transmission: from first user authentication, to secure communications, to our auditing and logging system (which includes hashing of data in transit), through to securing the environment when data is at rest in the data logging system. ASAPP teams also operate under tight constraints on access to data. These security protocols protect both ASAPP and its customers.
