“It’s concerning that it can do that when it shouldn’t be able to.”
Such was the insightful comment of one of my coworkers on learning that Artificial Intelligence systems can determine the race of a patient from his X-rays, matching the patient’s self-reported race with 80-99% accuracy.
He wasn’t alone in thinking this. That is, emoting this; little to no rational thought was involved. Of the dozen people in the meeting, half or more spoke up, saying that the AI must have been trained wrong, that the race must have been encoded on the X-ray, or that it was the quality of the X-ray (with nonwhites of course being scanned by older equipment). Something. Not stated was that race doesn’t exist and is purely a social construct; that’s revealed truth to almost all of my coworkers and need not be spoken unless someone utters heresy.
“I know something is true and what I’m seeing doesn’t match what I know is true so what I’m seeing must not be true.” Or, to paraphrase Groucho Marx, “Who are you going to believe, your beliefs or your lying eyes?”
Most people will deny what their eyes are telling them. They’ll deny truth, reality, in favor of their beliefs. This of course applies to a lot more than an AI determining race “when it shouldn’t be able to”, but I’ll focus on AIs because that’s what came to my attention and because I know more than most about AI systems. (And to limit the amount of time spent in writing this; as usual I have many more things to do than time to do them.) (Proof of this: it’s taken me four months to get an uninterrupted hour to finish writing this essay.)
AIs have useful results in many fields, completely unrelated to one another. They’ve been trained to sift through mountains of noisy astronomical data to find exoplanets, a task largely beyond humans because the data is too great for our brains.
AIs have been used to predict recidivism, and seem to be much better than humans, at 90% versus little better than guessing. This has led to complaints by the accused (claiming bias, presumably because the AI wasn’t trained to be sensitive about disparate impact) and by judges (who inform us that these matters are more an art than a science).
AIs have been used by HR departments to make hiring decisions or at least to prioritize candidates. But there’s a problem. Two problems. First, the criteria for determining whether the AI (or the human) made a good decision are squishy. In predicting recidivism, regardless of your feelings about racial bias or disparate impact, if an offender is arrested again within a year, that’s a solid data point.
When it comes to employee performance, HR departments often try to apply objective criteria such as number of trouble tickets resolved, but those are seldom useful for more than an overview of one facet of the job. Most of the employee evaluation is squishy, and seldom completely honest for any number of reasons.
And that gets us to the other problem with using AIs for hiring decisions: very often, racial minorities are underrepresented in the selected candidates, regardless of the job. In the computer fields, minority underrepresentation is even worse when the candidates are chosen by AIs. Women are also under-selected.
The obvious explanation, of course, is that the AI was trained wrong. It’s probably deliberate bias, because almost everyone involved in creating and training AIs is a white man. (That’s not true. In the US, men of European descent are a minority of people involved in building, training, and running AIs.) Or unconscious bias, because the training data is not fully representative of all job seekers or convicted criminals or hospital patients. Or some other kind of bias because the data contains patterns that no one’s noticed before and it’s throwing off the results.
That last one is a reasonable concern. In the example above of determining race from X-rays, it’s possible that a hospital serving almost only blacks used one model of X-ray machine and another hospital serving almost only whites used another model. When the AI was trained using data from these two hospitals, it could have picked up the relationship between X-ray machine model and patient race. “Deep learning” systems are notoriously opaque, so something like that can sneak in and not be found for a while. (Though human thought processes are not exactly open for scrutiny, either, and very seldom do humans realize that their decisions are influenced by how recently they’ve eaten or the similarity of a job candidate to an old girlfriend.)
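The confound mechanism is easy to demonstrate with a toy simulation (this is a sketch of the general problem, not the actual X-ray study; the hospitals, numbers, and "artifact" feature are all made up). The "model" below never sees race at all, only an image artifact stamped by each hospital's machine, yet it predicts race at roughly the rate at which hospital membership tracks race:

```python
import random

random.seed(0)

def make_patient():
    # Two hypothetical hospitals, each serving mostly one population
    # and each using a different X-ray machine model.
    hospital = random.choice(["A", "B"])
    if hospital == "A":
        race = "black" if random.random() < 0.9 else "white"
    else:
        race = "white" if random.random() < 0.9 else "black"
    # Each machine stamps a characteristic artifact on the image.
    artifact = (1.0 if hospital == "A" else 0.0) + random.gauss(0, 0.1)
    return artifact, race

data = [make_patient() for _ in range(10_000)]

# A trivial "classifier": threshold on the machine artifact alone.
correct = sum((artifact > 0.5) == (race == "black") for artifact, race in data)
print(f"accuracy from machine artifact alone: {correct / len(data):.0%}")
```

With a 90/10 split at each hospital, the artifact alone predicts race about 90% of the time, which is why careful teams hold out data from hospitals (or machines) the model never trained on before believing a result like this.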
On the other hand, data scientists and AI developers are trained to be aware of this issue, to recognize the signs of a spurious correlation, and to actively look for problems. Furthermore, many of the announced, unacceptable results have been replicated many times with consistent findings. It’s not impossible but is very unlikely that the same error appears in different systems created by different teams using different data.
Even on projects in which only a single AI has been trained for some problem, claims that the training data must be biased are not accompanied by examples of problems in the data. The claim of bias comes by backward reasoning from the results which must be wrong, rather than from any direct evidence.
When AIs are trained to solve real-world problems using real-world data and they repeatedly come up with unacceptable results, there’s something wrong. Something major, something systemic.
It’s possible that the problem is in the way the AIs are designed and built and trained. Despite what I said a few paragraphs ago, errors do creep in. It’s conceivable that the same bias crept in and threw off the hiring recommendations of half a dozen HR AIs, all in the same racially and sexually biased fashion, or that the identification of hospital patients needing more intensive post-release follow-up was thrown off by richer (i.e., white) patients getting more expensive treatment than poorer (i.e., black) patients. (That did happen. It was corrected as soon as the problem was noticed, but that mistake has become one of the go-to examples of racial bias in AIs. One of the very few examples that anyone can point at, I’ll note.)
I don’t believe this is a systemic problem, though, because AIs have proved themselves useful and accurate in any number of areas, from optimizing warehouse layout in a manner no human would have thought of (and showing actual benefits in the amount of time needed to collect items) to finding risk factors that predict which elderly people will fall and require hospitalization. (One of my coworkers found those a couple of years ago, via “deep learning” examination of dozens of demographic and clinical factors. What popped out of the AI were factors such as age and weight, which were known to doctors and which served as a good check of the AI’s function, as well as a few surprises like a blood test showing some hormone over some threshold. Sorry about not remembering the details; I’m not a doc and they meant nothing to me. However, the docs put the findings to work and wound up with improved patient outcomes, showing that the AI’s findings were valid.)
As mentioned above, the claim that the demographics of the teams developing AIs result in systemic bias, somehow, is frequently made. The mechanism of this bad result is never detailed, merely hand-waved as a “Well, what else can it be?” As with the claims about biased training data, this is reasoning backward from a result which cannot be accepted to the conclusion that something must be wrong in the way the AI was built because of the people who built it.
The factor which is missing from “analyses” that the AIs must be wrong is the possibility that the AIs are right and that the common wisdom is wrong. That is, when there’s a discrepancy between A and B, why do they always assume that A is right and B is wrong?
It’s known that judges do a poor job in predicting recidivism. The good ones hitting maybe 60% is nothing to brag about.
It’s known that HR departments are bad at hiring, retention, and raise decisions because they apply criteria other than objective measures of expected or actual performance.
If what you care about is good results, then you should compare the real-world effectiveness of what humans recommend versus what the AI recommends and go with whichever gets better results.
If the AI is giving bad results, you probably need to look at what it was told to do. More than that, you need to look at how you are deciding that the AI is wrong or unacceptable. By what criteria is the AI’s recommendation a bad one?
So far as I can see, only certain kinds of AI results provoke butthurt. Not even radiologists complain when an AI detects early breast cancer at 99% accuracy, easily beating out the most experienced radiologists and reading a hundred times as many images in a day. It’s only when the AIs’ results trample on shibboleths of race or sex or other social factors that people complain.
There’s a saying which is common among some groups: Reality is that which doesn’t go away when you don’t believe in it. There’s an addendum to that: When religion collides with reality, reality wins but the truly faithful won’t admit it. (And they often use the unwelcome reality to strengthen their faith.)
The truth which the True Believers don’t want to admit is that a machine can do a better job of objectively seeing reality than humans can. The machine is more honest in looking at the world through whatever lens it’s told to look through.
AIs do a good job of finding patterns for what they’re optimized for. They give “wrong” results because there’s a mismatch between what people say is important and what is really important to them — e.g., avoiding the appearance of racial bias is more important than likelihood of repaying a loan.
If you want to make sure that “enough” black families get bank loans to meet some quota, program that into the AI. (The most straightforward way would be to order that the percentage of black loan recipients must exceed the percentage of local population which is black and let the AI figure out how to make that happen, but that may be too straightforward and honest for executives and managers to accept.)
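The “program that into the AI” approach can be sketched as a post-processing step: leave the model’s repayment scores alone and pick each group’s approval cutoff so that its approval rate hits the mandated target. Everything below is illustrative — the group names, scores, and target are invented, and no real lender’s system is being described:

```python
# Toy sketch of programming a quota into a lending AI: keep the
# upstream model's repayment scores, then set a per-group score
# cutoff so each group's approval rate meets the mandated target.
def threshold_for_rate(scores, target_rate):
    """Return the score cutoff that approves roughly target_rate of applicants."""
    ranked = sorted(scores, reverse=True)
    n_approved = max(1, round(target_rate * len(ranked)))
    return ranked[n_approved - 1]

# Hypothetical repayment scores from some upstream model.
scores_by_group = {
    "group_a": [0.91, 0.85, 0.78, 0.66, 0.52, 0.40],
    "group_b": [0.80, 0.72, 0.61, 0.55, 0.47, 0.33],
}
target = 0.5  # e.g., a mandated approval rate for every group

approved = {
    group: [s for s in scores if s >= threshold_for_rate(scores, target)]
    for group, scores in scores_by_group.items()
}
for group, loans in approved.items():
    print(group, len(loans) / len(scores_by_group[group]))
```

Note what this makes explicit: the two groups end up with different score cutoffs (0.78 versus 0.61 in this toy data), which is exactly the honesty the author suggests executives may not want on paper.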
If you want to believe that there’s no such thing as race, I don’t know what to tell you. Machines, which are not told in advance that there is no such thing as race, keep finding clusters of physiology or behavior or preference which bear a shocking similarity to race as understood by most people. Netflix came under fire for racism in their recommendations. Recommendations were based solely on the syllogism “Most people who liked X also liked Y. You liked X. We recommend Y to you.” But that was unacceptable because white people tended to like different things than black people. I don’t know how Netflix resolved that controversy, but I suspect that it was enough to simply point out that their customer sign-up form doesn’t ask about race.
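The syllogism behind that kind of recommender is simple enough to sketch in a few lines. This is a generic co-occurrence counter, not Netflix’s actual system, and the titles and viewing histories are made up:

```python
# Toy sketch of "most people who liked X also liked Y" recommendation:
# count how often pairs of titles are liked by the same person, then
# recommend unseen titles that co-occur most with what you liked.
from collections import Counter
from itertools import combinations

histories = [               # each set: titles one (hypothetical) person liked
    {"TitleA", "TitleB", "TitleC"},
    {"TitleA", "TitleB"},
    {"TitleB", "TitleC", "TitleD"},
    {"TitleA", "TitleD"},
]

co_liked = Counter()
for liked in histories:
    for x, y in combinations(sorted(liked), 2):
        co_liked[(x, y)] += 1
        co_liked[(y, x)] += 1

def recommend(you_liked, k=2):
    """Rank titles you haven't seen by co-occurrence with titles you have."""
    scores = Counter()
    for seen in you_liked:
        for (a, b), n in co_liked.items():
            if a == seen and b not in you_liked:
                scores[b] += n
    return [title for title, _ in scores.most_common(k)]

print(recommend({"TitleA"}))
```

Nothing in the input is race; the only signal is who liked what. Any demographic pattern in the output is a pattern that was already in the viewing histories.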
Whatever way you go, be honest about how you’re deciding if an AI’s results are good. And stop attributing your own racism and sexism to the AI programmers.