There is an epidemic of AI misinformation, to use Gary Marcus’ characterization of the current state of AI[1]. Despite the fog of misinformation, there have also been some great successes in applying AI technologies in both research and commercial practice.

It pays to be cynical.

    • Apply all the standard purchasing due-diligence appropriate for the type of procurement you’re considering.
    • Bring in an expert from your center of technical expertise.
    • Ignore sales pressure, executive nonsense, and references to obscure scientific reports.  
    • Ask the supplier to strip away the rhetoric, similes, and metaphors.

Ask them to tell you about the following in simple language:

1. Business results: What can we expect, based on which specific use cases? Tie your answer into our business models and specify phasing, success and failure criteria, timing, cost, and risks. Include verifiable references.

This item is at the top of the list because it should drive all decision making. A full discussion of business results is outside the scope of this article, but it is an essential element of every evaluation process. The next nine questions are specific to analyzing the AI technology components of a business proposition. For each of the nine, consider how the answers tie back to question number 1. (Note that not all projects need, or would benefit from, the inclusion of AI technology.)

2. Maturity: Is this proposal based on an academic finding, a prototype, a bespoke implementation, or an extensible, off-the-shelf solution? How far away is it from being a viable product for general use?

Too often, we see announcements of intent to explore applications instead of after-the-fact recounting of the business results of implementation.

How many announcements of intent have resulted in implementations? How many implementations have failed to achieve their objectives? How do the successes differ from the failures? Please be specific.

3. Functions: What does the AI technology actually do?

Don’t accept answers like “It’s Cognitive Computing!” Ask, for example, if that term implies it truly understands text, like a human. (Any “like a human” simile used in explanations is hyperbole.)

Or is it searching for similar phrases in a large repository, and building its collection of related phrases to search for by using a graph database and a fancy dictionary (sometimes called an ontology)?

Don’t accept answers like “It learns for itself.” Ask how much code, or how many rules, must be loaded into the system before it can process the data it is fed, and whether that code is handwritten on the spot or taken from libraries that others have written.

Look for all human-behavior similes and ignore them. Focus on business needs and results. The similes and metaphors create a fog of misinformation and hyperbole.  

4. Generalization: How generalizable are the results you’re referencing?

AlphaGo works fine on a 19×19 board but would need to be retrained to play on a rectangular board. The lack of transfer is telling. In that vein:

      • Does a radiologist’s assistant work as well on one generation and brand of MRI machine as another? How much retraining would be required? What happens if there are images from many generations and brands of MRI machine? 
      • Does the vision system work as well in a sandstorm as it does in a rainstorm? How can we verify that claim?
      • How much will it cost to adapt the offering to deliver the industry-specific, business-model-specific, and business-objective-specific results you need?

5. Magic: Does the technology discover patterns never before seen by people?

Could the results be a demonstration of the infinite-monkey theorem[2], which postulates that an infinite number of monkeys typing for an endless period of time could come up with the written works of Shakespeare?

We could substitute random character strings for monkeys typing. The point is to focus the discussion on how the system comes up with these novel insights. Is it the result of exploring previously unexplored patterns through iteration and then testing those patterns using rules fed to it by another (adversarial) system or by coders?

Can it perform its pattern discovery effectively with an infinitely large search space? What are its search space limits? (Infinity is a big number and very difficult to achieve. Finite solutions may be too limited.)
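To put rough numbers on the search-space question, here is a minimal Python sketch. It assumes a hypothetical 27-symbol alphabet (26 letters plus a space) and uniformly random, independent attempts; the function names are invented for illustration and do not describe any supplier's system.

```python
from fractions import Fraction

ALPHABET_SIZE = 27  # assumption: 26 lowercase letters plus a space


def search_space_size(length: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Number of distinct strings of the given length."""
    return alphabet_size ** length


def chance_of_random_match(target: str, alphabet_size: int = ALPHABET_SIZE) -> Fraction:
    """Probability that one uniformly random string of the same length equals target."""
    return Fraction(1, search_space_size(len(target), alphabet_size))


def expected_attempts(target: str, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Expected number of independent random attempts until the first match.

    For a success probability p = 1/N, the expected wait is N attempts.
    """
    return search_space_size(len(target), alphabet_size)


if __name__ == "__main__":
    phrase = "to be or not to be"
    print(len(phrase), "characters")
    print(search_space_size(len(phrase)), "possible strings of that length")
```

Even this 18-character phrase sits in a space of more than 10^25 candidate strings, so a useful follow-up question is how the system prunes or guides its search instead of enumerating it.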

6. Human performance: If the AI system is allegedly better than people at the specific task, which people is the supplier referring to, and how much better is it?

For example, for facial recognition for security purposes, you might want to know how the technology compares to the performance of the humans most skilled on the task. Demand comparative data versus the top 1% of people on the task – that’s whom you’d want to hire, isn’t it? There are significant individual differences across the overall population[3].
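One way to make the “which people?” question concrete is to ask where the system's score falls within the distribution of individual human scores, not just whether it beats the average. The Python sketch below is illustrative only: the function names are hypothetical, and the scores in the usage example are synthetic placeholders, not measurements.

```python
from statistics import mean


def share_of_people_outperformed(system_score: float, human_scores: list[float]) -> float:
    """Fraction of individuals whose score is strictly below the system's score."""
    if not human_scores:
        raise ValueError("need at least one human score")
    beaten = sum(1 for score in human_scores if score < system_score)
    return beaten / len(human_scores)


def matches_top_performers(
    system_score: float, human_scores: list[float], top_fraction: float = 0.01
) -> bool:
    """True if the system outscores everyone outside the top `top_fraction` of people."""
    return share_of_people_outperformed(system_score, human_scores) >= 1.0 - top_fraction


if __name__ == "__main__":
    # Synthetic placeholder data: 100 people with scores 0.00, 0.01, ..., 0.99.
    people = [i / 100 for i in range(100)]
    system = 0.90
    print("beats the average person:", system > mean(people))
    print("share of people outperformed:", share_of_people_outperformed(system, people))
    print("matches the top 1%:", matches_top_performers(system, people))
```

In this made-up example the system comfortably beats the average person yet still falls short of the top 1%, which is the gap that the supplier's comparative data should expose.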

7. Robustness: How robust is the system? How far can its training be extended without massive amounts of retraining?

8. Expectations management: How do we keep our expectations realistic given the confusion engendered in the public around terms like “artificial intelligence”? What expectations should we exclude from our thinking?

9. Access: Is there a demo that we can probe for ourselves? Can we safely test it with our own data?

10. Other limitations? 

Listen carefully. Poke and probe at the edges. Question the speaker’s or writer’s authority, particularly if they don’t mention the general degree of difficulty, complexity, fragility, transparency, access to and ownership of data, and system engineering limits.

Search for examples of cognitive biases in the proponents’ observations, data, and recommendations. Here are three examples of cognitive biases (out of the 50 or so that have been studied by psychologists):

        • Confirmation bias. The tendency to search for or interpret information in a way that confirms one’s preconceptions, while ignoring whatever doesn’t support them.
        • Selection bias. The results you see may be an artifact of the sample that was used, because the sample doesn’t reflect the larger population.
        • Survivor bias. People who failed aren’t properly represented in the results.
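As a toy illustration of survivor bias, the sketch below simulates hypothetical project outcomes and compares the average across all projects with the average across only those that “survived” to be written up as case studies. Everything here is synthetic: the normally distributed outcomes, the publication threshold, and the function names are assumptions made for illustration.

```python
import random
from statistics import mean


def simulate_outcomes(n_projects: int, seed: int = 0) -> list[float]:
    """Synthetic project results; assumption: normally distributed around zero."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_projects)]


def survivors(outcomes: list[float], threshold: float = 0.5) -> list[float]:
    """Keep only results above the threshold, i.e. the ones that get published."""
    return [x for x in outcomes if x > threshold]


if __name__ == "__main__":
    outcomes = simulate_outcomes(1000)
    published = survivors(outcomes)
    print(f"all projects:      n={len(outcomes)}, mean={mean(outcomes):.2f}")
    print(f"published studies: n={len(published)}, mean={mean(published):.2f}")
```

The published subset always looks better than the full population by construction, which is why it pays to ask a supplier how many projects never made it into the case-study deck.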

Ignore the momentum-class claims (e.g., “Leaders are already doing this!”). If the supplier starts by segmenting the world into broad categories (such as Leaders versus Middle-of-the-road versus Laggards) and then focuses on the leaders, exercise great caution, because that type of classification may ignore important industry differences. In many cases, intra-industry data is more valuable than cross-industry data.

Explore the reproducibility of case studies: exactly how do their needs resemble, and differ from, yours? Industry segments, business models, existing infrastructure, end-customer requirements, sourcing and distribution, and governance and regulation may all impact reproducibility.

Analyze the level of ecosystem growth, dependence on a single supplier, breadth of user adoption, and the level of technical volatility – how well protected are you against shifts in the core technology?

And let me know what questions you’d add to this list.