A lot has been written recently about problems with artificial intelligence: speech recognition that fails to deal correctly with Irish accents, image recognition that cannot detect faces of colour, all kinds of biases in AI decision-making, and autonomous vehicles causing accidents. In response, some have called for more rigorous testing, quality assurance, and trust certification of AI products and services and their providers.
For example, in his recent keynote at a Committee for Economic Development of Australia (CEDA) event in Sydney, Dr Alan Finkel, Australia’s Chief Scientist, proposed a solution to “overcoming our mistrust of robots in our homes and workplaces”.
In his talk Dr Finkel proposed a trust certification for AI, whereby providers would submit to a voluntary auditing process to establish the quality of their AI-based products, as well as the ethical standards that their organisation embodies.
Dr Finkel claims that “true quality is achieved by design, not by test and reject,” and that we should therefore ensure that AI products are reliable before releasing them into the wild. While this proposal makes intuitive sense, I argue that it misses the point about AI.
This is because we are dealing with a fundamentally new computing paradigm that does not submit to the same criteria for establishing quality as traditional computing.
The technology most often at work in modern AI applications is deep learning (or variants thereof). Deep learning (DL) systems can be immensely useful in fields such as document classification, image and speech recognition, or decision support based on large amounts of data.
However, at the same time DL systems are also fundamentally unreliable, which comes with important implications and responsibilities.
A shift in computing paradigm
A Kuhnian paradigm shift does not only bring about new ways of doing things. A new paradigm also comes with its own value system, and it requires new ways of judging what counts as quality. As a result, a new paradigm is generally incommensurable with the existing one that everyone is used to. It is impossible to fully make sense of and judge a new paradigm against the criteria of the established one. We do so at our own peril, as the history of technological progress shows.
For example, our own research into disruption in the music industry has demonstrated the inability of incumbent record companies to make sense of and judge the quality of the new emerging digital music paradigm from within their established one. Initial dismissals of mp3 as inferior in sound quality turned out to be irrelevant over time because the new paradigm articulated its own quality criteria: accessibility, portability and shareability. On those yardsticks mp3 was clearly superior to the CD.
I suggest a similar shift is currently under way in computing.
Traditional computing vs deep learning
Traditional computing is in principle deterministic; the very idea of an algorithm is a finite set of pre-determined steps to achieve a goal with certainty. In other words, we program software with a set of instructions, or rules. For example, we might encode in a software product the explicit rules by which a bank establishes whether or not a person should be given a loan.
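To make the contrast concrete, here is a minimal sketch of such explicit rules in Python. The thresholds and criteria are invented for illustration, not taken from any real bank:

```python
# A toy, rule-based loan decision: every rule is human-written and explicit.
# All thresholds below are hypothetical.
def loan_decision(income, debt, years_employed):
    """Deterministic: the same input always yields the same, traceable answer."""
    if debt > income * 0.5:
        return 'decline'           # rule 1: too much existing debt
    if years_employed < 2:
        return 'refer to manager'  # rule 2: short employment history
    return 'approve'               # all rules passed

print(loan_decision(income=80_000, debt=10_000, years_employed=5))  # → approve
```

Because every step is spelled out, we can trace exactly which rule produced a given decision, and we can test each rule exhaustively.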
Deep learning, however, is in principle non-deterministic. Rather than encoding explicit rules for solving a problem, the idea is to derive patterns from existing data sets that are later used for classification. Deep learning systems are classification systems in the sense that they associate an input with one category from a set of predetermined outputs.
This is done by way of ‘training’ layers of complex networks of numerical values (so-called neural networks) such that for every item in the training data a pathway through the network is generated that arrives at the right output category. The network thus ‘learns’ to recognise patterns in the input data. The usefulness of deep learning is that the network will classify any input of the same kind, even if that input was not contained in the training data. It does so by interpolating, by filling in the blanks so to speak.
For example, we might feed the neural network all kinds of data about customers that the bank has access to (as input) and customers’ past credit history (e.g. whether they defaulted or not). The network will then ‘learn’ and inscribe in its internal values those combinations of customer characteristics associated with good or bad credit history. When fed with a new customer data set, the network will then indicate if a loan should or should not be provided.
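The mechanics can be illustrated with a toy, pure-Python classifier: a single ‘neuron’ trained on invented customer data. A real deep learning system has many layers and millions of values, but the principle is the same: adjust numerical weights until the training inputs map to the right outputs, then reuse those weights on inputs the system has never seen. All feature names and numbers below are made up:

```python
import math

def predict(weights, features):
    # weighted sum squashed into a 0..1 'probability of default'
    z = sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

def train(data, labels, lr=0.5, epochs=2000):
    weights = [0.0] * len(data[0])
    for _ in range(epochs):
        for features, label in zip(data, labels):
            p = predict(weights, features)
            # 'test and reject': nudge each weight to reduce the error
            for i, x in enumerate(features):
                weights[i] += lr * (label - p) * x
    return weights

# Hypothetical training data: [income (scaled), existing debt (scaled)]
past_customers = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.3, 0.8]]
defaulted      = [0, 0, 1, 1]   # 1 = defaulted on a past loan

w = train(past_customers, defaulted)
new_customer = [0.7, 0.3]       # not in the training data
print('default risk:', predict(w, new_customer))
```

Note that no human ever wrote a lending rule here; the ‘rule’ lives implicitly in the learned weights, and the network happily scores a customer it has never encountered by interpolating between the examples it was trained on.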
What is exciting about deep learning is the ability to classify beyond the training data. Traditional algorithms are limited to what the programmer encodes in their instructions, whereas deep learning ‘learns’ patterns too complex for humans to understand (e.g. to make predictions about the future – is a particular customer likely to default?) or recognises patterns in images in ways that go beyond human abilities (e.g. detecting cancerous cells in MRI images). However, this ability comes at a price.
Known issues with deep learning
In a recent paper, Gary Marcus, former head of AI at Uber and professor at New York University (NYU), outlines a range of problems and limitations with deep learning. Most importantly, he explains in detail that deep learning is in principle unreliable and non-deterministic.
In short, problems arise from 1) the uncertainty with which deep learning extrapolates beyond its training data, and 2) the fact that such systems have no way of recognising that they might make a mistake.
The issue is that the system will always provide a classification, even if the input data falls outside its training space and the extrapolation fails. Where traditional algorithms will provide an error message or exception when fed with unknown inputs, deep learning networks will not and cannot do so, because any input will always link to one of the possible outputs.
While immensely useful most of the time, this renders DL systems fundamentally unreliable. Indeed, recent research in image recognition has shown that DL systems are easily fooled, for example mistaking a turtle for a rifle when only a few pixels in an image are changed. This also reveals that it is not clear beforehand what might ‘throw off’ a DL system.
Further problems stem from the reliance on training data in building DL systems. As a result, any biases, such as over- or under-representation of certain characteristics in the training data (e.g. race, gender, social status of people) will affect the outcomes of the DL system. Without access to the training data such problems are often hard to spot.
This is aggravated by the fact that the DL system is largely a black box, in that it is not possible to fully understand how a particular output was arrived at (e.g. how particular characteristics in the input data contribute to the output).
What counts as ‘quality’ is changing
As a result, deep learning challenges our traditional understanding of quality and reliability in computing. When Dr Finkel claims that “true quality is achieved by design, not by test and reject,” he misses the point about deep learning at a fundamental level. “Test and reject” is literally the principle by which deep learning software is built. After all, the ‘neural network’ is trained to recognise patterns from thousands, often millions, of iterations of classifying training data inputs, testing and affirming or rejecting its outputs along the way.
Hence, where quality in traditional computing is achieved through “exacting design and business practices”, that “[bake] in the expectation of quality from the start,” in the form of reliable algorithms, as Dr Finkel spells out, deep learning in principle does not adhere to such a quality regime.
My point is that it is dangerous to attempt to apply quality criteria and assurance practices of one paradigm to another. Much like mp3 could not be understood on quality criteria of a CD world, DL systems cannot simply be judged by quality criteria of an algorithmic computing paradigm. What we need instead is greater understanding for the nature of this new emerging world of AI computing. This starts with how we talk about these systems.
Current narratives are unhelpful
We often hear machine learning or deep learning systems referred to as either ‘algorithms’ or as ‘robots’ or ‘artificial intelligence’. Without a better understanding of what DL systems actually can or can’t do, both these narratives evoke misleading comparisons:
- The notion of algorithm implies that DL systems are a variant of traditional computing, thus suggesting a level of determinism, reliability, and accountability that deep learning cannot live up to in principle. DL systems cannot be subjected to the same rigorous debugging and testing procedures that achieve reliability in algorithmic computing.
- Evoking notions of robots and intelligence suggests that we have achieved human cognition in machines. This is problematic, because it suggests that machines ‘think’, ‘understand’, ‘have insights’ or even agency, and that in principle we should be able to understand ‘them’ because they are like us, which they are not.
So, what now?
Deep learning and its variants are immensely useful and, as with every new paradigm, we have yet to fully understand their potential. Yet we also need to understand their limitations, and how to mitigate them. Research to do so is under way, but the public conversation is fraught with overblown expectations and anxiety about unwanted consequences.
What is needed is a level-headed discourse about this emerging computing paradigm. In this context, language matters. I argue that it is neither helpful to evoke the reliability associated with traditional algorithms, nor notions of judgement and reason associated with ‘intelligence’.
Let’s name this new way of computing for what it is: ‘non-deterministic’ or ‘probabilistic computing’, and set out to fully understand what we are dealing with when we employ it in different parts of society. This will also require us to decide where not to employ DL systems, such as in contexts where reliability is paramount and we are not comfortable with the occurrence of undetectable mistakes.
For example, would we tolerate such unreliability in court rooms, where decisions determine someone’s jail time? Or in cars, planes and other mission-critical systems? What is our tolerance for error in those systems, and what can we do to make DL more reliable by surrounding it with other systems? These are open questions that are more pressing than creating new certification systems.
But if we go down the certification route, the best way forward is to figure out ways of auditing DL systems that take into account their particular characteristics, as Cathy O’Neil’s new venture to audit deep learning systems for hidden biases in training data sets out to do. But we should not imply that this will guarantee reliability in a traditional sense.