Speech and Text Processing APIs: How Does Each of the Major Players Measure Up?

At Support Partners, we pride ourselves on not being affiliated with any one supplier or vendor. That means that we have to know about as many different tools as possible so that we can design and implement the perfect solution for you and your team.

We’ve been checking out the machine learning APIs available from Microsoft, Google, Amazon and IBM, seeing which platform offers which features and where each of them excels. This means we can make sure that whichever option you and your team choose is the most suitable for your needs and your environment. All four offer the most critical features, such as speech-to-text, text-to-speech, low-quality audio handling and intent analysis. Outside of these core capabilities, each platform offers a variety of other interesting processing skills.

Below, we have taken a brief look at each of the platforms and the tools they have in place for their text and speech processing APIs. Following this, we have pulled together a table to give you a quick guide to what is included in each system.

Microsoft Azure’s Cognitive Services

Cognitive Services is user-friendly and can be integrated into existing infrastructures:

  • The Language segment gives your apps the ability to process language: to understand the meaning behind unstructured text and to recognise the intent behind natural language. This segment has the following capabilities:
    • Text Analytics
    • Translator Text
    • Bing Spell Check
    • Content Moderator
    • Language Understanding
  • The Speech segment lets users integrate speech processing capabilities into any of their services or apps. This includes both converting spoken language into text and producing speech from text. Within the Speech segment, there are the following capabilities:
    • Speech to Text
    • Speaker Recognition
    • Text to Speech
    • Speech Translation
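
To give a feel for what calling one of these looks like, here is a minimal sketch of how a sentiment request to Text Analytics could be assembled. It assumes the v2.0 REST path that was current around the time of writing (later versions use different paths), and the endpoint and key are placeholders you would take from your own Azure resource. The function name is ours, not Microsoft's, and the sketch only builds the request rather than sending it.

```python
import json


def build_sentiment_request(endpoint, subscription_key, texts, language="en"):
    """Return (url, headers, body) for a Text Analytics sentiment call.

    Assumption: the v2.0 path; check the version your resource exposes.
    """
    url = endpoint.rstrip("/") + "/text/analytics/v2.0/sentiment"
    headers = {
        # Cognitive Services authenticates with a subscription key header.
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/json",
    }
    # Each document needs its own id so results can be matched back to inputs.
    documents = [
        {"id": str(i), "language": language, "text": text}
        for i, text in enumerate(texts, start=1)
    ]
    return url, headers, json.dumps({"documents": documents})
```

The returned pieces can be handed to whichever HTTP client you already use, sent as a POST.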

Google Cloud AI

In Google Cloud AI, there are a number of different modules, each finely tuned to carry out specific services:

  • Dialogflow Enterprise Edition supports the development of conversational interfaces, such as chatbots. These are capable of natural, rich interactions, using machine learning to analyse the intent and context of speech and text.
  • Cloud Text-to-Speech promises natural-sounding speech using groundbreaking WaveNet research from DeepMind and Google’s neural networks.
  • Cloud Speech-to-Text converts audio to text, recognising 120+ languages in short or long form.
  • Cloud Natural Language API analyses the structure and meaning of text, using machine learning models that can be built and trained by your developers. This API can be used to extract information and understand sentiment and intent from conversations.
  • Cloud Translation API is a highly responsive translation tool utilising AutoML Translation.
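
As an illustration of the speech-to-text side, the sketch below builds the JSON body for the v1 `speech:recognize` REST method. It assumes short, uncompressed 16-bit PCM audio sent inline; the function name and the defaults (British English, 16 kHz) are our own choices for the example, and authentication is left out entirely.

```python
import base64

# REST endpoint for short (synchronous) recognition requests.
RECOGNIZE_URL = "https://speech.googleapis.com/v1/speech:recognize"


def build_recognize_body(audio_bytes, language_code="en-GB", sample_rate_hertz=16000):
    """Build the request body for a synchronous recognition call.

    Assumption: the audio is raw LINEAR16 (16-bit PCM) at the given rate.
    """
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
        },
        # Inline audio must be base64-encoded; longer files are normally
        # referenced by a Cloud Storage URI instead.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }
```

The body would be sent as a POST to `RECOGNIZE_URL` with your project's credentials attached.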

Amazon

Amazon offers a number of different tools that work together to cover speech and text processing. These are:

  • Amazon Transcribe is an Automatic Speech Recognition (ASR) service, allowing developers to add speech-to-text capabilities to their applications.
  • With Amazon Lex, developers can add conversational interfaces to any application. Lex’s deep learning functionalities and natural language understanding mean that you can build applications with sophisticated and lifelike conversational interactions, powering tools such as chatbots.
  • Amazon Polly is a service that enables text-to-speech capabilities, offering a variety of languages and voices.
  • Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text.
  • Amazon Translate works to enable natural and accurate language translation.
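
For a sense of how little code text-to-speech needs, here is a minimal sketch using Amazon Polly through boto3, the AWS SDK for Python. It assumes boto3 is installed and AWS credentials are already configured; the helper names and the default voice ("Joanna") are our choices for the example rather than anything Polly requires.

```python
def build_polly_request(text, voice_id="Joanna", output_format="mp3"):
    """Collect the keyword arguments for Polly's synthesize_speech call."""
    return {"Text": text, "VoiceId": voice_id, "OutputFormat": output_format}


def synthesize_to_file(text, path, **options):
    """Synthesise `text` and write the resulting audio to `path`.

    Assumption: boto3 is installed and credentials/region are configured.
    """
    import boto3  # imported here so the helper above works without the SDK

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_polly_request(text, **options))
    # The audio comes back as a streaming body that is read like a file.
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

Calling `synthesize_to_file("Hello from Polly", "hello.mp3")` would then save a short MP3, with other voices selected via `voice_id`.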

IBM Watson

The following features and services are available with the IBM Watson platform:

  • Language Translator uses Neural Machine Translation to let users instantly translate into multiple languages, allowing for written translation as well as multilingual chatbots.
  • The Natural Language Classifier applies cognitive computing techniques to analyse information and return appropriate answers or actions.
  • Natural Language Understanding analyses text to extract appropriate metadata, such as concepts, topics, keywords, sentiment and emotion.
  • Speech to Text converts voice recordings into text.
  • Text to Speech converts text into natural-sounding speech, with a range of voices available.
  • The Tone Analyzer is able to recognise and learn a range of emotions, social propensities and language styles, which are then used to refine and improve communications.
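
Natural Language Understanding is a good example of a "pick your features" API: the request names the kinds of metadata you want back. The sketch below builds such a request body for the service's `/v1/analyze` method. The function name and the default feature list are ours, each feature is requested with its default options, and the service URL, API version date and credentials are deliberately left for you to supply.

```python
def build_nlu_body(text, features=("concepts", "keywords", "sentiment", "emotion")):
    """Build the JSON body for a Natural Language Understanding analyze call.

    Assumption: every feature is requested with default options (an empty
    object); individual features accept further settings not shown here.
    """
    return {
        "text": text,
        "features": {name: {} for name in features},
    }
```

For example, `build_nlu_body("The support team was brilliant", features=("sentiment",))` asks for sentiment only.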

A High-Level Breakdown of Each Platform’s Offerings:

Accurate as of August 2018
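
If you want to work with this comparison programmatically, the services named in the sections above can be collected into a simple lookup. This is only a sketch: the capability groupings are our own reading of the lists in this article, and a missing entry means only that those lists do not name a matching service, not that the vendor lacks one.

```python
# Capability -> vendor -> service name, taken from the lists in this article.
SERVICES = {
    "speech-to-text": {
        "Microsoft": "Speech to Text",
        "Google": "Cloud Speech-to-Text",
        "Amazon": "Amazon Transcribe",
        "IBM": "Speech to Text",
    },
    "text-to-speech": {
        "Microsoft": "Text to Speech",
        "Google": "Cloud Text-to-Speech",
        "Amazon": "Amazon Polly",
        "IBM": "Text to Speech",
    },
    "translation": {
        "Microsoft": "Translator Text",
        "Google": "Cloud Translation API",
        "Amazon": "Amazon Translate",
        "IBM": "Language Translator",
    },
    "text analysis": {
        "Microsoft": "Text Analytics",
        "Google": "Cloud Natural Language API",
        "Amazon": "Amazon Comprehend",
        "IBM": "Natural Language Understanding",
    },
    "conversational interfaces": {
        "Google": "Dialogflow Enterprise Edition",
        "Amazon": "Amazon Lex",
    },
}


def service_for(capability, vendor):
    """Return the named service, or None if the article does not list one."""
    return SERVICES.get(capability, {}).get(vendor)


def vendors_for(capability):
    """Return the vendors with a listed service, in alphabetical order."""
    return sorted(SERVICES.get(capability, {}))
```

For instance, `service_for("text-to-speech", "Amazon")` returns `"Amazon Polly"`, while an unknown capability simply yields `None`.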

For more information about any of these, or any other speech and text processing questions, get in touch: hello@support-partners.com