Kalev Leetaru, Contributor
As deep learning has become ubiquitous, evaluations of its accuracy typically compare its performance against an idealized baseline of flawless human results that bears no resemblance to the actual human workflows those algorithms are being designed to replace. For example, the accuracy of real-time algorithmic speech recognition is frequently compared against human captioning produced offline in multi-coder, reconciled environments and subjected to multiple rounds of review, yielding flawless transcripts that look nothing like actual real-time human transcription. If we really wish to understand the usability of AI today, we should compare it against the human workflows it is designed to replace, not an impossible vision of nonexistent human perfection.
While the press is filled with the latest superhuman exploits of bleeding-edge research AI systems besting humans at yet another task, the reality of production AI systems is far more mundane. Most commercial applications of deep learning achieve higher accuracy than their human counterparts at some tasks and lower accuracy at others.
Instead of comparing deep learning algorithms against trained humans placed in the same situations, we as a society have developed the habit of comparing them against idealized but entirely nonexistent baselines that look nothing like actual human performance on those tasks.
Take the rapidly growing use of deep learning to understand human speech, driven by the desire for ever more accurate smart speakers, personal assistants and other voice interfaces. Today, the performance of such systems is frequently compared against human baselines coded offline through multi-coder, reconciled workflows designed to yield absolutely flawless transcriptions.
In reality, when compared against the actual results of real-time human transcription, today's deep learning systems perform remarkably well, yielding similar accuracy and often greater verbatim fidelity, since human captioners working under real-time pressure routinely condense and paraphrase.
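To make the point about baselines concrete, here is a purely illustrative sketch, using invented example strings rather than data from any real system, that scores a hypothetical ASR transcript and a hypothetical real-time human caption track against the same idealized offline reference using word error rate. Against the "flawless" baseline both look imperfect, but the machine is roughly on par with the real-time human workflow it would actually replace.

```python
# Illustrative sketch only: word error rate (WER) of a hypothetical ASR output
# and a hypothetical real-time human caption track, both scored against the
# same idealized offline reference. All strings below are invented examples.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Idealized baseline: multi-coder, reconciled, repeatedly reviewed transcript.
offline_reference = "the quarterly results exceeded expectations across every region"

# What a human captioner might plausibly produce in real time, under pressure.
realtime_captions = "the quartely results exceeded expectations across every region"

# What a deep learning system might plausibly produce in real time.
asr_output = "the quarterly results exceeded expectation across every region"

print(f"ASR vs. idealized baseline:        {word_error_rate(offline_reference, asr_output):.2%}")
print(f"Real-time human vs. same baseline: {word_error_rate(offline_reference, realtime_captions):.2%}")
```

Both hypotheses come out at the same error rate here by construction; the takeaway is simply that the reported "accuracy" of a system depends entirely on which human baseline it is measured against.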
Similarly, when evaluating image or video understanding systems, companies typically focus on their error rates when encountering particularly complex edge cases. Yet as those same companies are only too aware, large teams of human reviewers rarely perform much better on those edge cases and frequently perform far worse.
Having personally overseen many large human categorization workflows, I can attest that even the most accurate and conscientious human analysts exhibit extreme variation in their accuracy from day to day and hour to hour.
One of the debates within the driverless car industry is what constitutes a fair comparison when machines get things wrong. When a driverless car gets into an accident, the public, policymakers and pundits are quick to rush forward with concerns over just how a machine could possibly have failed in that circumstance.
In some cases, the criticism is warranted, but in many others, it is unlikely that an average human driver placed in the same situation could have done much better.
If a driverless car fails to avoid an accident in a situation where a human driver would likely have been unable to avoid it either, should we fault deep learning for not being better than a human?
In most cases, the public's answer seems to be yes. To many, the point of driverless cars is to replace error-prone and easily distracted human drivers with flawless machines that never make mistakes and deliver perfect safety records.
Would a driverless car that performs exactly on par with the average human driver be considered a success story ready to turn loose on public roads, or should machines be held to a vastly higher standard than we hold ourselves?
This raises the question of just what the point of the AI revolution really is.
If the point of deep learning is to automate mundane tasks in order to permit population-level scalability, ensure consistency, enable more complex workflows, transition human employees to more creative and interesting work, and protect them from harmful tasks like content moderation, then algorithms performing at human accuracy levels should be good enough.
In contrast, the public seems to view the point of deep learning as the development of superhuman algorithms with near-flawless operational records that entirely eliminate human fallibility. The former is already here in many fields, while the latter remains quite a way off.
Putting this all together, we have a tendency today to compare the actual output of deep learning systems against the fiction of flawless, unerring humans. When compared against the imperfect reality of real people performing the same tasks in the real world, deep learning algorithms typically prove far more comparable to the humans they would replace.
In the end, should we hold out for perfect algorithms that never make mistakes, refusing to allow any application that falls short of that nonexistent ideal, or should we accept deep learning systems that are no better, but also no worse, than ourselves and that open the door to new possibilities through their scalability and their ability to insulate humans from harmful tasks like content moderation?