Last month, we took a look at the speech and text processing APIs from a number of the biggest companies out there. This month we’re looking at Computer Vision – a computer’s ability to “see” and understand images – and exploring what Microsoft Azure, Google Cloud, AWS and IBM Watson each have to offer.

Computer Vision is often spoken about in close relation to artificial intelligence; a computer must interpret what it is shown and then analyse or perform specific predefined actions. With this particular type of computer science, the machine is configured and taught to identify and process images or videos as a human would.

Below, we have outlined the key skills that each of these platforms offers ‘out of the box’. The aim is to give you a starting point when venturing into the complex and ever-changing world of AI. Whilst each platform has different strengths, almost all of them can be trained to understand and recognise many different things – all you need is the data set and time to teach it!

Azure – Computer Vision API

Microsoft Azure offers many of the key features that highlight the power of computer vision and is a great starting point for those who need convincing that this is where the world is heading:

Analyse an image: Returns information about the visual content found in an image or video. Tagging, domain-specific models and descriptions in four languages can be used to identify content and label it.

Read text in images: Text detection in an image using optical character recognition (OCR). Recognised words are extracted into a machine-readable character stream.

Read handwritten text from images: Handwritten text from notes, letters, essays, whiteboards, forms and other sources can be recognised and extracted. Currently in preview mode, available for English text on backgrounds such as white paper, yellow sticky notes and whiteboards.

Recognise celebrities and landmarks: More than 200,000 celebrities from business, politics, sports and entertainment, as well as 9,000 natural and man-made landmarks from around the world.

Analyse video in near real-time: Use the Computer Vision APIs on video by extracting frames of the video on your device and then sending those frames to your chosen API calls.

Generate a thumbnail: Produces a high-quality, storage-efficient thumbnail from any image, with options to modify its size, shape and style.
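
To make the near real-time video approach above a little more concrete, here is a minimal sketch of the frame-sampling step in Python. It does not call Azure at all: the `analyse` callback is a hypothetical stand-in for whichever API call you choose, and the names `sample_frames`, `analyse_stream` and `every_nth` are ours, not Microsoft's.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

Frame = TypeVar("Frame")
Result = TypeVar("Result")


def sample_frames(frames: Iterable[Frame], every_nth: int) -> Iterator[Tuple[int, Frame]]:
    """Yield (index, frame) for every Nth frame, starting with the first."""
    if every_nth < 1:
        raise ValueError("every_nth must be at least 1")
    for index, frame in enumerate(frames):
        if index % every_nth == 0:
            yield index, frame


def analyse_stream(
    frames: Iterable[Frame],
    analyse: Callable[[Frame], Result],
    every_nth: int,
) -> List[Tuple[int, Result]]:
    """Send each sampled frame to `analyse` (hypothetical: your chosen API call)."""
    return [(index, analyse(frame)) for index, frame in sample_frames(frames, every_nth)]
```

Sampling every Nth frame rather than every frame is one simple way to keep the number of API calls manageable; the right value depends on your frame rate and how quickly the content changes.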

Google Cloud – Cloud Vision API

Google was one of the first platforms to bring image recognition to the mainstream, through its search engine. Needless to say, Google have access to a wealth of data which gives their algorithm a huge amount of information to learn from. Accessible via REST API, the tool is easy to integrate with.

Insight from your images: Detect objects in images across thousands of object categories. Vision API improves over time as new concepts are introduced and accuracy is improved.

Extract text: Optical Character Recognition (OCR) enables detection of text within images, along with automatic language identification.

Power of the web: Google Image Search finds topical entities like celebrities, logos, or news events. This can be combined with Visually Similar Search to find similar images on the web.

Content moderation: Easily moderate images by detecting inappropriate content.

Logo detection: Detect popular product logos within an image.

Face detection: Detect multiple faces within an image as well as key facial attributes such as emotional state or whether headwear is being worn.
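
As a rough illustration of the REST integration, the sketch below builds the JSON body for the Vision API's `images:annotate` method in Python. It only constructs the request; actually sending it would also need an API key or OAuth token, which is not shown. The helper name `build_annotate_request` is our own assumption, and the feature types listed are the ones that appear to correspond to the capabilities above.

```python
import base64
from typing import Dict, List

# REST endpoint for the images:annotate method (POST, JSON body).
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

# Feature types that roughly match the capabilities described above.
FEATURES = [
    "LABEL_DETECTION",        # insight from your images
    "TEXT_DETECTION",         # extract text (OCR)
    "WEB_DETECTION",          # power of the web
    "SAFE_SEARCH_DETECTION",  # content moderation
    "LOGO_DETECTION",         # logo detection
    "FACE_DETECTION",         # face detection
]


def build_annotate_request(
    image_bytes: bytes,
    feature_types: List[str],
    max_results: int = 10,
) -> Dict:
    """Build the JSON body for one image (hypothetical helper, not a Google function)."""
    return {
        "requests": [
            {
                # Image bytes are sent inline as base64-encoded text.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [
                    {"type": feature, "maxResults": max_results}
                    for feature in feature_types
                ],
            }
        ]
    }
```

Several features can be requested for the same image in one call, which is handy when you want labels, text and moderation results together.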

AWS – Amazon Rekognition

Whilst the aforementioned platforms are easy to integrate with, Rekognition is the easiest starting point for the casual user – all you need is an S3 bucket and some content and away you go.

The key features are:

Object, scene, and activity detection: Identifying thousands of objects and scenes in both images and videos. When analysing video, it can also identify specific activities happening in the frame, such as “delivering a package” or “playing football”.

Facial recognition: Identification of a person in a photo or video using your own private repository of face images.

Facial analysis: Analyse the attributes of faces in images and videos, determining emotions, characteristics and other visual elements (such as items worn). In videos, you can see how these things change over time.

Pathing: Capture the path of people in the scene, such as following the movements of an athlete on the pitch.

Unsafe content detection: Identify potentially unsafe or inappropriate content across both image and video assets.

Celebrity recognition: Quickly identify well known people in your video and image libraries.

Text in images: Detect and recognise text from images, such as street names, captions, product names, and license plates.
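
To show roughly what “an S3 bucket and some content” looks like in practice, here is a minimal Python sketch of object and scene detection using the `detect_labels` operation of the boto3 Rekognition client. It assumes boto3 is installed and AWS credentials are already configured; the helper names, the default thresholds and any bucket or key you pass in are our own placeholders.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_detect_labels_args(
    bucket: str,
    key: str,
    max_labels: int = 10,
    min_confidence: float = 80.0,
) -> Dict[str, Any]:
    """Arguments for Rekognition's detect_labels, pointing at an image in S3."""
    return {
        "Image": {"S3Object": {"Bucket": bucket, "Name": key}},
        "MaxLabels": max_labels,
        "MinConfidence": min_confidence,
    }


def detect_labels(bucket: str, key: str, client: Optional[Any] = None) -> List[Tuple[str, float]]:
    """Return (label name, confidence) pairs for an image stored in S3."""
    if client is None:
        # Imported lazily so the helper above works without boto3 installed.
        import boto3

        client = boto3.client("rekognition")
    response = client.detect_labels(**build_detect_labels_args(bucket, key))
    return [(label["Name"], label["Confidence"]) for label in response["Labels"]]
```

Passing the client in as a parameter keeps the function easy to try out with a stub before pointing it at a real account.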

IBM Watson – Visual Recognition

While IBM has only a few ‘bots’ for computer vision out of the box, its standout feature is that you alone own the Intellectual Property rights. The platforms above are great for those getting started but if you have a library of valuable content that you want to train an algorithm to learn, then this is the platform. Your algorithm, your data, your IP.

General model: Classification from thousands of classes.

Face model: Facial analysis with age and gender.

Explicit model: Detect whether an image contains inappropriate content.

Food model: Classify specific images of food items.

Text model (Private beta): Text extraction from natural scenes and images.


Not being tied to any of the platforms, we are able to help you choose and implement the one that best fits the needs of you and your business. If you’d like to speak with us further about any of these products or about any other machine learning capabilities you are looking to add into your video production workflows, get in touch: