95% seems to be the magic number in the field of Artificial Intelligence.
In a recent study, DeepMind’s AI correctly identified various types of eye disease 94.5% of the time. This puts these algorithms “on a par with expert performance at diagnosing OCT scans”.
Last year, Microsoft reported that the performance of their AI on speech recognition was 94.9%. This number is derived by comparing the software’s output with a manual transcription produced by a human; the word error rate between the two transcripts defines the performance of the system.
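To make that metric concrete, here is a minimal sketch of how a word error rate (WER) can be computed as word-level edit distance divided by the length of the reference transcript. The two transcripts are made up for illustration; a reported performance of 94.9% corresponds roughly to a WER of 5.1%.

```python
# Minimal sketch of a word error rate (WER) calculation.
# The transcripts below are hypothetical; real evaluations compare the
# recogniser's output against a careful human transcription.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # 2 errors / 9 words ≈ 22.2%
```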
In another AI field – Machine Translation – SDL announced that 95% of their system’s output was labelled as “human-level” by professional Russian-English translators.
95% is impressive in itself, but to achieve that on such a complex language combination is truly outstanding. (Russian and English are very different in terms of grammar, inflection and word order).
What’s unique about this AI challenge is that there are many valid results. A true “objective error rate” is almost impossible to compute. Anyone who speaks more than one language knows that there are many correct ways to translate a sentence.
Chris Manning highlights in this Stanford NLP lecture that human language is a very compressed communication channel, using very few messages (words) to communicate a lot of meaning. The reason it works so well is that the recipient is responsible for reconstructing the context of the communication, using their accumulated world knowledge.
A machine translation system has to transfer this meaning from one language to another without altering it. That’s why the machine translation problem is considered “AI-complete”.
Back to evaluating the performance of such an AI system: the most accurate way is a blind test in which specialists score translated phrases without knowing whether they were produced by a human or a machine. The scores fall on a Likert scale ranging from “completely wrong” at one end to “perfect translation” (i.e. human-level) at the other. That is the methodology SDL used for their assessment.
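As an illustration of how such blind judgements might be aggregated into a single headline number, here is a toy sketch. The segments, scores, and the “human-level” cut-off are all hypothetical; SDL’s exact protocol may differ.

```python
# Illustrative aggregation of blind Likert-scale judgements (hypothetical data).
from statistics import mean

# Each translated segment is scored by several raters on a 1-5 scale,
# where 1 = "completely wrong" and 5 = "perfect translation" (human-level).
scores_per_segment = {
    "segment_01": [5, 5, 4],
    "segment_02": [5, 5, 5],
    "segment_03": [3, 4, 4],
    "segment_04": [5, 4, 5],
}

HUMAN_LEVEL_THRESHOLD = 4.5  # assumed cut-off for counting a segment as human-level

human_level = [
    seg for seg, scores in scores_per_segment.items()
    if mean(scores) >= HUMAN_LEVEL_THRESHOLD
]
share = len(human_level) / len(scores_per_segment)
print(f"{share:.0%} of segments judged human-level")  # 75% in this toy sample
```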
So how should we interpret this 95%? Does it imply that we have completely solved these AI challenges?
Not quite…
SDL accurately highlights that these are generic systems, and when applied to specialised content (like pharmaceutical labels or financial contracts), the performance changes. Domain specificity is a key challenge in any AI application using Machine Learning.
There is another aspect. In all of the examples listed above, AI performance is compared with human performance.
performance(machines) vs. performance(humans)
However, AI is not a zero-sum game, so we should rewrite the comparison above as:
performance(humans + machines) vs. performance(humans)
In a 2016 research study on computer vision, an AI system was able to correctly identify cancerous cells in lymph node images with 92.5% accuracy. The accuracy of a human pathologist was 96.6%. However, when the AI and human outputs were combined, the accuracy jumped to 99.5%!
performance(humans) = 96.6%
performance(machines) = 92.5%
performance(humans + machines) = 99.5%
In other words:
performance(humans + machines) > performance(humans)
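A back-of-the-envelope sketch shows why this can happen when the two judges make fairly independent mistakes. The simulation below reuses the error rates quoted above but the combination rule (disagreements get resolved by a closer second look) is an assumption for illustration; the actual study fused the model’s prediction scores with the pathologist’s diagnoses.

```python
# Toy simulation: combining two fairly independent judges can beat either alone.
# Error rates echo the numbers above; the combination rule is hypothetical.
import random

random.seed(0)
N = 100_000
P_HUMAN_ERR, P_MACHINE_ERR = 0.034, 0.075  # 96.6% and 92.5% accuracy

both_wrong = 0
for _ in range(N):
    human_wrong = random.random() < P_HUMAN_ERR
    machine_wrong = random.random() < P_MACHINE_ERR
    # Assume a disagreement triggers a closer second look that resolves the case,
    # so the combined system only fails when both are wrong at the same time.
    if human_wrong and machine_wrong:
        both_wrong += 1

print(f"human alone:   {1 - P_HUMAN_ERR:.1%}")
print(f"machine alone: {1 - P_MACHINE_ERR:.1%}")
print(f"combined:      {1 - both_wrong / N:.1%}")  # ≈ 99.7% if errors were independent
```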
When it comes to professional language translation, SDL has a similar view. Instead of comparing machines vs. humans, it’s probably more important to focus on the combined value (productivity, speed, precision) unlocked by teaming AI with human expertise.