Most of us hold a level of confidence about our factual knowledge and predictions that doesn’t match our abilities. That’s what I learned, summarized in a recent post, from running a website called The Confidence Gap for a year.
When I published that post, someone linked to it from Hacker News, causing 9000 people to read the post, 3000 of which took the trivia quiz(es) and assigned confidence levels to their true/false responses. Presumably, many of those people learned from the post that the each group of ten questions in the survey had one question about the environment or social issues designed to show an extra level of domain-specific overconfidence. Presumably, those readers are highly educated. I would have thought this would have changed the gap between accuracy and confidence in those who used the site before and after that post. But the results from visits after the blog post and Hacker News coverage, even in those categories, were almost identical to those of the earlier group.
Hacker News users pointed out several questions where the the Confidence Gap answers were obviously wrong. Stop signs actually do have eight sides, for example. Readers also reported questions with typos that could have confused the respondents. For example “Feetwood Mac” did not perform “Black Magic Woman” prior to Santana, but Fleetwood Mac did. The site called the statement with “Feetwood” true; it was a typo, not a trick.
A Hacker News reader challenged me on the naïve averaging of confidence estimates, saying he assumed the whole paper was similarly riddled with math errors. The point is arguable, but not a math error. Philip Tetlock, Tony Cox, Gerd Gigerenzer, Sarah Lichtenstein, Baruch Fischhoff, Paul Slovic and other heavyweights of the field used the same approach I used. Thank your university system for teaching that interpretation of probability is a clear-cut matter of math as opposed to vexing issue in analytic philosophy (see The Trouble with Probability and The Trouble with Doomsday).
Those criticisms acknowledged, I deleted the response data from the questions where Hacker News reported errors and typos. This changed none of the results by more than a percentage point. And, as noted above, prior knowledge of “trick” questions had almost no effect on accuracy or overconfidence.
On the point about the media’s impact on people’s confidence about fact claims that involve environmental issues, consider data from the 2nd batch of responses (post blog post) to this question:
According to the United Nations the rate of world deforestation increased between 1995 and 2015
This statement about a claim made by the UN is false, and the UN provides a great deal of evidence on the topic. The average confidence level given by respondents was 71%. The fraction of people answering correctly was 29%. The average confidence value specified by those who answered incorrectly was 69%. Independent of different interpretations of probability and confidence, this seems a clear sign of overconfidence about deforestation facts.
Compare this to the responses given for the question of whether Oregon borders California. 88% of respondents answered correctly and their average confidence specified was 88%.
According to OurWorldInData.org, the average number of years of schooling per resident was higher in S. Korea than in USA in 2010
The statement is false. Average confidence was 68%. Average correctness was 20%
For all environmental and media-heavy social questions answered by the 2nd group of respondents (who presumably had some clue that people tend to be overconfident about such issues) the average correctness was 46% and the average confidence was 67%. This is a startling result; the proverbial dart throwing chimps would score vastly higher on environmental issues (50% by chimps, 20% on the schooling question and 46% for all “trick” questions by humans) than American respondents who were specifically warned that environmental questions were designed to demonstrate that people think the world is more screwed up than it is. Is this a sign of deep pessimism about the environment and society?
For comparison, average correctness and confidence on all World War II questions were both 65%. For movies, 70% correct, 71% confidence. For science, 75% and 77%. Other categories were similar, with some showing a slightly higher overconfidence. Most notably, sports mean accuracy was 59% with mean confidence of 68%.
Richard Feynman famously said he preferred not knowing to holding as certain the answers that might be wrong (No Ordinary Genius). Freeman Dyson famously said, “it is better to be wrong than to be vague” (The Scientist As Rebel). For the hyper-educated class, spoon-fed with facts and too educated to grasp the obvious, it seems the preferred system might now be phrased “better wrong than optimistic.”
. . . _ _ _ . . .
The man who despairs when others hope is admired as a sage. – John Stuart Mill. Speech on Perfectibility, 1828
Optimism is cowardice. Otto Spengler, Man and Technics, 1931
The U.S. life expectancy will drop to 42 years by 1980 due to cancer epidemics. Paul Ehrlich. Ramparts, 1969
It is the long ascent of the past that gives the lie to our despair. HG Wells. The Discovery of the Future, 1902
There is a trend in conservative politics to blame postmodernism for everything that is wrong with America today. Meanwhile conservatives say a liberal who claims that anything is wrong with the USA should be exiled to communist Canada. Postmodernism in this context is not about art and architecture. It is a school of philosophy – more accurately, criticism – that not long ago was all the rage in liberal education and seems to have finally reached business and government. Right-wing authors tie it to identity politics, anti-capitalism, anti-enlightenment and wokeness. They’re partly right.
Postmodernism challenged the foundations of knowledge, science in particular, arguing against the univocity of meaning. Postmodernists think that there is no absolute truth, no certain knowledge. That’s not the sort of thing Republicans like to hear.
Deconstruction, an invention of Jacques Derrida, was a major component of heyday postmodernism. Deconstruction dug into the fine points of the relationship between text and meaning, something like extreme hermeneutics (reading between the lines, roughly). Richard Rorty, the leading American postmodernist, argued that there can be no unmediated expression of any non-linguistic entity. And what does that mean?
It means in part that there is no “God’s eye view” or “view from nowhere,” at least none that we have access to. Words cannot really hook onto reality for several reasons. There are always interpreters between words and reality (human interpreters), and between you and me. And we really have no means of knowing whether your interpretation is the same as mine. How could we test such a thing? Only by using words on each other. You own your thoughts but not your words once they’ve left your mouth or your pen. They march on without you. Never trust the teller; trust the tale, said D.H. Lawrence. Derrida took this much farther, exploring “oppositions inside text,” which he argued, mostly convincingly, can be found in any nontrivial text. “There is nothing outside the text,” Derrida proclaimed.
Derrida was politically left but not nearly as left as his conservative enemies pretended. Communists had no use for Derrida. Conservatives outright despised him. Paul Gross and Norman Levitt spent an entire chapter deriding Derrida for some inane statements he made about Einstein and relativity back before Derrida was anyone. In Higher Superstition – The Academic Left and its Quarrels with Science, they attacked from every angle, making much of Derrida’s association with some famous Nazis. This was a cheap shot having no bearing on the quality of Derrida’s work.
Worse still, Gross and Levitt attacked the solid aspects of postmodern deconstruction:
“The practice of close, exegetical reading, of hermeneutics, is elevated an ennobled by Derrida and his followers. No longer is it seen as a quaint academic hobby-horse for insular specialists, intent on picking the last meat from the bones of Jane Austen and Herma Melville. Rather, it has now become the key to comprehension of the profoundest matters of truth and meaning, the mantic art of understanding humanity and the universe at their very foundation.”
There was, and is, plenty of room between Jane Austen hermeneutics and arrogantly holding that nothing has any meaning except that which the god of deconstruction himself has tapped into. Yes, Derrida the man was an unsavory and pompous ass, and much of his writing was blustering obscurantism. But Derrida-style deconstruction has great value. Ignore Derrida’s quirks, his arrogance, and his political views. Ignore “his” annoying scare quotes and “abuse” of “language.” Embrace his form of deconstruction.
Here’s a simple demo of oppositions inside text. Hebrews 13 tells us to treat people with true brotherly love, not merely out of adherence to religious code. “Continue in brotherly love. Do not neglect to show hospitality to strangers, for by so doing some have entertained angels without knowing it.” The Hebrews author has embedded a less pure motive in his exhortation – a favorable review from potential angels in disguise. Big Angel is watching you.
Can conservatives not separate Derrida from his work, and his good work from his bad? After all, they are the ones who think that objective criteria can objectively separate truth from falsehood, knowledge from mere belief, and good (work) from bad.
Another reason postmodernists say the distinction between truth and falsehood is fatally flawed – as is in fact the whole concept of truth – is the deep form of the “view from nowhere” problem. This is not merely the impossibility of neutrality in journalism. It is the realization that no one can really evaluate a truth claim by testing its correspondence to reality – because we have no unmediated access to the underlying reality. We have only our impressions and experience. If everyone is wearing rose colored glasses that cannot be removed, we can’t know whether reality is white or is rose colored. Thus falls the correspondence theory of truth.
Further, the coherence theory of truth is similarly flawed. In this interpretation of truth, a statement is judged likely true if it coheres with a family of other statements accepted as true. There’s an obvious bootstrapping problem here. One can imagine a large, coherent body of false claims. They hang together, like the elements of a tall tale, but aren’t true.
Beyond correspondence and coherence, we’re basically out of defenses for Truth with a capital T. There are a few other theories of truth, but they more or less boil down to variants on these two core interpretations. Richard Rorty, originally an analytic philosopher (the kind that studies math and logic and truth tables), spent a few decades poking all aspects of the truth problem, borrowing a bit from historian of science Thomas Kuhn. Rorty extended what Kuhn had only applied to scientific truth to truth in general. Experiments – and experience in the real world – only provide objective verification of truth claims if your audience (or opponents) agree that it does. For Rorty, this didn’t mean there was no truth out there, but it meant that we don’t have any means of resolving disputes over incompatible truth claims derived from real world experience. Applying Kuhn to general knowledge, Truth is merely the assent of the relevant community. Rorty’s best formulation of this concept was that truth is just a compliment we pay to claims that satisfy our group’s validation criteria. Awesome.
Conservatives cudgeled the avowed socialist Rorty as badly as they did Derrida. Dinesh D’Souza saw Rorty as the antichrist. Of course conservatives hadn’t bothered to actually read Rorty any more than they had bothered to read Derrida. Nor had conservatives ever read a word from Michel Foucault, another postmodern enemy of all things seen as decent by conservatives. Foucault was once communist. He condoned sex between adults and consenting children. I suspect some religious conservatives secretly agree. He probably had Roman Polanski’s ear. He politicized sexual identity – sort of (see below). He was a moral relativist; there is no good or bad behavior, only what people decide is good for them. Yes, Foucault was a downer and a creep, but some of his his ideas on subjectivity were original and compelling.
The conservatives who hid under their beds from Derrida, Rorty and Foucault did so because they relied on the testimony of authorities who otherwise told them what they wanted to hear about postmodernism. Thus they missed out on some of the most original insights about the limitations of what it is possible to know, what counts as sound analytical thinking, and the relationship between the teller and the hearer. Susan Sontag, an early critic of American exceptionalism and a classic limousine liberal, famously condemned interpretation. But she emptied the tub leaving the baby to be found by nuns. Interpretation and deconstruction are useful, though not the trump card the postmodernism founders originally thought they had. They overplayed their hand, but there was something in that hand.
Postmodernists, in their critique of science, thought scientists were incapable of sorting through evidence because of their social bias, their interest, as Marxists like to say. They critically examined science – in a manner they judged to be scientific, oddly enough. They sought to knock science down a peg. No objective truth, remember. Postmodern social scientists found that interest pervaded hard science and affected its conclusions. These social scientists, using scientific methods, were able to sort through interest in the way that other scientists could not sort through evidence. See a problem here? Their findings were polluted by interest.
When a certain flavor of the vegan, steeped in moral relativism, argues that veganism is better for your health, and, by the way, it is good for the planet, and, by the way, animals have rights, and, by the way, veganism is what our group of social activists do…, then I am tempted to deploy some deconstruction. We can’t know motives, some say. Or can we? There is nothing outside the text. Can an argument reasonably flow from multiple independent reasons? We can be pretty sure that some of those reasons were backed into from a conclusion received from the relevant community. Cart and horse are misconfigured.
Conservatives didn’t read Derrida, Foucault and Rorty, and liberals only made it through chapter one. If they had read further they wouldn’t now be parroting material from the first week of Postmodernism 101. They wouldn’t be playing the village postmodernist.
Foucault, patron saint of sexual identity among modern liberal academics, himself offered that to speak of homosexuals as a defined group was historically illiterate. He opined that sexual identity was an absurd basis to form one’s personal identity. They usually skip that part during protest practice. The political left in 2021 exists at the stage of postmodern thought before the great postmodernists, Derrida and crew, realized that the assertion that it is objectively true that nothing is objectively true is more than a bit self-undermining. They missed a boat that sailed 50 years back. Postmodern thought, applied to postmodernism, destroys postmodernism as a program. But today its leading adherents don’t know it. On the death of Postmodernism with a capital P we inherited some good tools and perspectives. But the present postmodern evangelists missed the point where logic flushed the postmodern program down the same drain where objective truth had gone. They are like Sunday Christians, they’re the Cafeteria Catholics of postmodernism.
Richard Rorty, a career socialist, late in his life, using postmodern reasoning, took moral relativism to its logical conclusion. He realized that the implausibility of moral absolutism did not support its replacement by moral relativism. The former could be out without the latter being in. If two tribes hold incommensurable “truths,” it is illogical for either to conclude the other is equally correct. After all, each reached its conclusion based on the evidence and what that community judged to be sound reasoning. It would be hypocritical or incoherent to be less resolved about a conclusion, merely by knowing that a group with whom you did not share moral or epistemic values, concluded otherwise. That reasoning has also escaped the academic left. This was the ironic basis for Rorty’s intellectual defense of ethnocentrism, which got him, once the most prominent philosopher in the world, booted from academic prominence, deleted from libraries, and erased from history.
Rorty’s 1970’s socialist side does occasionally get trotted out by The New Yorker to support identify politics whenever needed, despite his explicit rejection of that concept by name. His patriotic side, which emerged from his five decade pursuit of postmodern thought, gets no coverage in The New Republic or anywhere else. National pride, Rorty said, is to countries what self-respect is to individuals – a necessary condition for self-improvement. Hearing that could put some freshpersons in the campus safe place for a few days. Are the kittens ready?
Postmodern sock puppets, Derrida, Foucault, and Rorty are condemned by conservatives and loved by liberals. Both read into them whatever they want and don’t want to hear. Appropriation? Or interpretation?
Derrida would probably approve. He is “dead.” And he can make no “claim” to the words he “wrote.” There is nothing outside the text.
For most of us, there is a large gap between what we know and what we think we know. We hold a level of confidence about our factual knowledge and predictions that doesn’t match our abilities. Since our personal decisions are really predictions about the future based on our available present knowledge, it makes sense to work toward adjusting our confidence to match our skill.
Last year I measured the knowledge-confidence gap of 3500 participants in a trivia game with a twist. For each True/False trivia question the respondents specified their level of confidence (between 50 and 100% inclusive) with each answer. The questions, presented in banks of 10, covered many topics and ranged from easy (American stop signs have 8 sides) to expert (Stockholm is further west than Vienna).
I ran this experiment on a website using 1500 True/False questions, about half of which belonged to specific categories including music, art, current events, World War II, sports, movies and science. Visitors could choose between the category “Various” or from a specific category. I asked for personal information such as age, gender current profession, title, and education. About 20% of site visitors gave most of that information. 30% provided their professions.
Participants were told that the point of the game was not to get the questions right but to have an appropriate level of confidence. For example, if a your average confidence value is 75%, 75% of their your answers should be correct. If your confidence and accuracy match, you are said to be calibrated. Otherwise you are either overconfident or underconfident. Overconfidence – sometime extreme – is more common, though a small percentage are significantly underconfident.
Overconfidence in group decisions is particularly troubling. Groupthink – collective overconfidence and rationalized cohesiveness – is a well known example. A more common, more subtle, and often more dangerous case exists when social effects and the perceived superiority of judgment of a single overconfident participant can leads to unconscious suppression of valid input from a majority of team members. The latter, for example, explains the Challenger launch decision for more than classic groupthink does, though groupthink is often cited as the cause.
I designed the trivia quiz system so that each group of ten questions under the Various label included one that dealt with a subject about which people are particularly passionate – environmental or social justice issues. I got this idea from Hans Rosling’s book, Factfulness. As expected, respondents were both overwhelmingly wrong and acutely overconfident about facts tied to emotional issues, e.g., net change in Amazon rainforest area in last five years.
I encouraged people to use take a few passes through the Various category before moving on to the specialty categories. Assuming that the first specialty categories that respondents chose was their favorite, I found them to be generally more overconfident about topics they presumable knew best. For example, those that first selected Music and then Art showed both higher resolution (correctness) and higher overconfidence in Music than they did in Art.
Mean overconfidence for all first-chosen specialties was 12%. Mean overconfidence for second-chosen categories was 9%. One interpretation is that people are more overconfident about that which they know best. Respondents’ overconfidence decreased progressively as they answered more questions. In that sense the system served as confidence calibration training. Relative overconfidence in the first specialty category chosen was present even when the effect of improved calibration was screened off, however.
For the first 10 questions, mean overconfidence in the Various category was 16% (16% for males, 14% for females). Mean overconfidence for the nine question in each group excepting the “passion” question was 13%.
Overconfidence seemed to be constant across professions, but increased about 1.5% with each level of college education. PhDs are 4.2% more overconfident than high school grads. I’ll leave that to sociologists of education to interpret. A notable exception was a group of analysts from a research lab who were all within a point or two of perfect calibration even on their first 10 questions. Men were slightly more overconfident than women. Underconfidence (more than 5% underconfident) was absent in men and present in 6% of the small group identifying as women (98 total).
The nature of overconfidence is seen in the plot of resolution (response correctness) vs. confidence. Our confidence roughly matches our accuracy up to the point where confidence is moderately high, around 85%. After this, increased confidence occurs with no increase in accuracy. At at 100% confidence level, respondents were, on average, less correct than they were at 95% confidence. Much of that effect stemmed from the one “trick” question in each group of 10; people tend to be confident but wrong about hot topics with high media coverage.
The distribution of confidence values expressed by participants was nominally bimodal. People expressed very high or very low confidence about the accuracy of their answers. The slight bump in confidence at 75% is likely an artifact of the test methodology. The default value of the confidence slider (website user interface element) was 75%. On clicking the Submit button, users were warned if most of their responses specified the default value, but an acquiescence effect appears to have present anyway. In Superforecasters Philip Tetlock observed that many people seem to have a “three settings” (yes, no, maybe) mindset about matters of probability. That could also explain the slight peak at 75%.
I’ve been using a similar approach to confidence calibration in group decision settings for the past three decades. I learned it from a DoD publication by Sarah Lichtenstein and Baruch Fischhoff while working on the Midgetman Small Intercontinental Ballistic Missile program in the mid 1980s. Doug Hubbard teaches a similar approach in his book The Failure of Risk Management. In my experience with diverse groups contributing to risk analysis, where group decisions about likelihood of uncertain events are needed, an hour of training using similar tools yields impressive improvements in calibration as measured above.
The website I used for this experiment (https://www.congap.com/) is still live with most of the features enabled. It’s running on a cheap hosting platform an may be slow to load (time to spin up an instance) if it hasn’t been accessed recently. Give it a minute. Performance is good once it loads.
Wikipedia describes risk-neutrality in these terms: “A risk neutral party’s decisions are not affected by the degree of uncertainty in a set of outcomes, so a risk-neutral party is indifferent between choices with equal expected payoffs even if one choice is riskier”
While a useful definition, it doesn’t really help us get to the bottom of things since we don’t all remotely agree on what “riskier” means. Sometimes, by “risk,” we mean an unwanted event: “falling asleep at the wheel is one of the biggest risks of nighttime driving.” Sometimes we equate “risk” with the probability of the unwanted event: “the risk of losing in roulette is 35 out of 36. Sometimes we mean the statistical expectation. And so on.
When the term “risk” is used in technical discussions, most people understand it to involve some combination of the likelihood (probability) and cost (loss value) of an unwanted event.
We can compare both the likelihoods and the costs of different risks, but deciding which is “riskier” using a one-dimensional range (i.e., higher vs. lower) requires a scalar calculus of risk. If risk is a combination of probability and severity of an unwanted outcome, riskier might equate to a larger value of the arithmetic product of the relevant probability (a dimensionless number between zero and one) and severity, measured in dollars.
But defining risk as such a scalar (area under the curve, therefore one dimensional) value is a big step, one that most analyses of human behavior suggests is not an accurate representation of how we perceive risk. It implies risk-neutrality.
Most people agree, as Wikipedia states, that a risk-neutral party’s decisions are not affected by the degree of uncertainty in a set of outcomes. On that view, a risk-neutral party is indifferent between all choices having equal expected payoffs.
Under this definition, if risk-neutral, you would have no basis for preferring any of the following four choices over another:
1) a 50% chance of winning $100.00
2) An unconditional award of $50.
3) A 0.01% chance of winning $500,000.00
4) A 90% chance of winning $55.56.
If risk-averse, you’d prefer choices 2 or 4. If risk-seeking, you’d prefer 1 or 3.
Now let’s imagine, instead of potential winnings, an assortment of possible unwanted events, termed hazards in engineering, for which we know, or believe we know, the probability numbers. One example would be to simply turn the above gains into losses:
1) a 50% chance of losing $100.00
2) An unconditional payment of $50.
3) A 0.01% chance of losing $500,000.00
4) A 90% chance of losing $55.56.
In this example, there are four different hazards. Many argue that rational analysis of risk entails quantification of hazard severities, independent of whether their probabilities are quantified. Above we have four risks, all having the same $50 expected value (cost), labeled 1 through 4. Whether those four risks can be considered equal depends on whether you are risk-neutral.
If forced to accept one of the four risks, a risk-neutral person would be indifferent to the choice; a risk seeker might choose risk 3, etc. Banks are often found to be risk-averse. That is, they will pay more to prevent risk 3 than to prevent risk 4, even though they have the same expected value. Viewed differently, banks often pay much more to prevent one occurrence of hazard 3 (cost = $500,000) than to prevent 9000 occurrences of hazard 4 (cost = $500,000).
Businesses compare risks to decide whether to reduce their likelihood, to buy insurance, or to take other actions. They often use a heat-map approach (sometimes called risk registers) to visualize risks. Heat maps plot probability vs severity and view any particular risk’s riskiness as the area of the rectangle formed by the axes and the point on the map representing that risk. Lines of constant risk therefore look like y = 1 / x. To be precise, they take the form of y = a/x where a represents a constant number of dollars called the expected value (or mathematical expectation or first moment) depending on area of study.
By plotting the four probability-cost vector values (coordinates) of the above four risks, we see that they all fall on the same line of constant risk. A sample curve of this form, representing a line of constant risk appears below on the left.
In my example above, the four points (50% chance of losing $100, etc.) have a large range of probabilities. Plotting these actual values on a simple grid isn’t very informative because the data points are far from the part of the plotted curve where the bend is visible (plot below on the right).
Students of high-school algebra know the fix for the problem of graphing data of this sort (monomials) is to use log paper. By plotting equations of the form described above using logarithmic scales for both axes, we get a straight line, having data points that are visually compressed, thereby taming the large range of the data, as below.
The risk frameworks used in business take a different approach. Instead of plotting actual probability values and actual costs, they plot scores, say from one ten. Their reason for doing this is more likely to convert an opinion into a numerical value than to cluster data for easy visualization. Nevertheless, plotting scores – on linear, not logarithmic, scales – inadvertently clusters data, though the data might have lost something in the translation to scores in the range of 1 to 10. In heat maps, this compression of data has the undesirable psychological effect of implying much small ranges for the relevant probability values and costs of the risks under study.
A rich example of this effect is seen in the 2002 PmBok (Project Management Body of Knowledge) published by the Project Management Institute. It assigns a score (which it curiously calls a rank) of 10 for probability values in the range of 0.5, a score of 9 for p=0.3, and a score of 8 for p=0.15. It should be obvious to most having a background in quantified risk that differentiating failure probabilities of .5, .3, and .15 is pointless and indicative of bogus precision, whether the probability is drawn from observed frequencies or from subjectivist/Bayesian-belief methods.
The methodological problem described above exists in frameworks that are implicitly risk-neutral. The real problem with the implicit risk-neutrality of risk frameworks is that very few of us – individuals or corporations – are risk-neutral. And no framework is right to tell us that we should be. Saying that it is somehow rational to be risk-neutral pushes the definition of rationality too far.
As proud king of a small distant planet of 10 million souls, you face an approaching comet that, on impact, will kill one million (10%) in your otherwise peaceful world. Your scientists and engineers rush to build a comet-killer nuclear rocket. The untested device has a 90% chance of destroying the comet but a 10% chance of exploding on launch thereby killing everyone on your planet. Do you launch the comet-killer, knowing that a possible outcome is total extinction? Or do you sit by and watch one million die from a preventable disaster? Your risk managers see two choices of equal riskiness: 100% chance of losing one million and a 10% chance of losing 10 million. The expected value is one million lives in both cases. But in that 10% chance of losing 10 million, there is no second chance. It’s an existential risk.
If these two choices seem somehow different, you are not risk-neutral. If you’re tempted to leave problems like this in the capable hands of ethicists, good for you. But unaware boards of directors have left analogous dilemmas in the incapable hands of simplistic and simple-minded risk frameworks.
The risk-neutrality embedded in risk frameworks is a subtle and pernicious case of Hume’s Guillotine – an inference from “is” to “ought” concealed within a fact-heavy argument. No amount of data, whether measured frequencies or subjective probability estimates, whether historical expenses or projected costs, even if recorded as PmBok’s scores and ranks, can justify risk-neutrality to parties who are not risk-neutral. So why is it embed it in the frameworks our leading companies pay good money for?
Toxicity is binary in California. Or so says its governor and most of its residents.
Governor Newsom, who believes in science, recently signed legislation making California the first state to ban 24 toxic chemicals in cosmetics.
The governor’s office states “AB 2762 bans 24 toxic chemicals in cosmetics, which are linked to negative long-term health impacts especially for women and children.”
The “which” in that statement is a nonrestrictive pronoun, and the comma preceding it makes the meaning clear. The sentence says that all toxic chemicals are linked to health impacts and that AB 2762 bans 24 of them – as opposed to saying 24 chemicals that are linked to health effects are banned. One need not be a grammarian or George Orwell to get the drift.
California continues down the chemophobic path, established in the 1970s, of viewing all toxicity through the beloved linear no-threshold lens. That lens has served gullible Californians well since the 1974, when the Sierra Club, which had until then supported nuclear power as “one of the chief long-term hopes for conservation,” teamed up with the likes of Gov. Jerry Brown (1975-83, 2011-19) and William Newsom – Gavin’s dad, investment manager for Getty Oil – to scare the crap out of science-illiterate Californians about nuclear power.
That fear-mongering enlisted Ralph Nadar, Paul Ehrlich and other leading Malthusians, rock stars, oil millionaires and overnight-converted environmentalists. It taught that nuclear plants could explode like atom bombs, and that anything connected to nuclear power was toxic – in any dose. At the same time Governor Brown, whose father had deep oil ties, found that new fossil fuel plants could be built “without causing environmental damage.” The Sierra Club agreed, and secretly took barrels of cash from fossil fuel companies for the next four decades – $25M in 2007 from subsidiaries of, and people connected to, Chesapeake Energy.
What worked for nuclear also works for chemicals. “Toxic chemicals have no place in products that are marketed for our faces and our bodies,” said First Partner Jennifer Siebel Newsom in response to the recent cosmetics ruling. Jennifer may be unaware that the total amount of phthalates in the banned zipper tabs would yield very low exposure indeed.
Chemicals cause cancer, especially in California, where you cannot enter a parking garage, nursery, or Starbucks without reading a notice that the place can “expose you to chemicals known to the State of California to cause birth defects.” California’s litigator-lobbied legislators authored Proposition 65 in a way that encourages citizens to rat on violators, the “citizen enforcers” receiving 25% of any penalties assessed by the court. The proposition lead chemophobes to understand that anything “linked to cancer” causes cancer. It exaggerates theoretical cancer risks stymying the ability of the science-ignorant educated class to make reasonable choices about actual risks like measles and fungus.
California’s linear no-threshold conception of chemical carcinogens actually started in 1962 with Rachel Carson’s Silent Spring, the book that stopped DDT use, saving all the birds, with the minor side effect of letting millions of Africans die of malaria who would have survived (1, 2, 3) had DDT use continued.
But ending DDT didn’t save the birds, because DDT wasn’t the cause of US bird death as Carson reported, because the bird death at the center of her impassioned plea never happened. This has been shown by many subsequent studies; and Carson, in her work at Fish and Wildlife Service and through her participation in Audubon bird counts, certainly had access to data showing that the eagle population doubled, and robin, catbird, and dove counts had increased by 500% between the time DDT was introduced and her eloquence, passionate telling of the demise of the days that once “throbbed with the dawn chorus of robins, catbirds, and doves.”
Carson also said that increasing numbers of children were suffering from leukemia, birth defects and cancer, and of “unexplained deaths,” and that “women were increasingly unfertile.” Carson was wrong about increasing rates of these human maladies, and she lied about the bird populations. Light on science, Carson was heavy on influence: “Many real communities have already suffered.”
In 1969 the Environmental Defense Fund demanded a hearing on DDT. Lasting eight months, the examiner’s verdict concluded DDT was not mutagenic or teratogenic. No cancer, no birth defects. In found no “deleterious effect on freshwater fish, estuarine organisms, wild birds or other wildlife.”
William Ruckleshaus, first director of the EPA didn’t attend the hearings or read the transcript. Pandering to the mob, he chose to ban DDT in the US anyway. It was replaced by more harmful pesticides in the US and the rest of the world. In praising Ruckleshaus, who died last year, NPR, the NY Times and the Puget Sound Institute described his having a “preponderance of evidence” of DDT’s damage, never mentioning the verdict of that hearing.
When Al Gore took up the cause of climate, he heaped praise on Carson, calling her book “thoroughly researched.” Al’s research on Carson seems of equal depth to Carson’s research on birds and cancer. But his passion and unintended harm have certainly exceeded hers. A civilization relying on the low-energy-density renewables Gore advocates will consume somewhere between 100 and 1000 times more space for food and energy than we consume at present.
California’s fallacious appeal to naturalism regarding chemicals also echoes Carson’s, and that of her mentor, Wilhelm Hueper, who dedicated himself to the idea that cancer stemmed from synthetic chemicals. This is still overwhelmingly the sentiment of Californians, despite the fact that the smoking-tar-cancer link now seems a bit of a fluke. That is, we expected the link between other “carcinogens” and cancer to be as clear as the link between smoking and cancer. It is not remotely. As George Johnson, author of The Cancer Chronicles, wrote, “as epidemiology marches on, the link between cancer and carcinogen seems ever fuzzier” (re Tomasetti on somatic mutations). Carson’s mentor Hueper, incidentally, always denied that smoking caused cancer, insisting toxic chemicals released by industry caused lung cancer.
This brings us back to the linear no-threshold concept. If a thing kills mice in high doses, then any dose to humans is harmful – in California. And that’s accepting that what happens in mice happens in humans, but mice lie and monkeys exaggerate. Outside California, most people are at least aware of certain hormetic effects (U-shaped dose-response curve). Small amounts of Vitamin C prevent scurvy; large amounts cause nephrolithiasis. Small amounts of penicillin promote bacteria growth; large amount kill them. There is even evidence of biopositive effects from low-dose radiation, suggesting that 6000 millirems a year might be best for your health. The current lower-than-baseline levels of cancers in 10,000 residents of Taiwan accidentally exposed to radiation-contaminated steel, in doses ranging from 13 to 160 mSv/yr for ten years starting in 1982 is a fascinating case.
Radiation aside, perpetuating a linear no-threshold conception of toxicity in the science-illiterate electorate for political reasons is deplorable, as is the educational system that produces degreed adults who are utterly science-illiterate – but “believe in science” and expect their government to dispense it responsibly. The Renaissance physician Paracelsus knew better half a millennium ago when he suggested that that substances poisonous in large doses may be curative in small ones, writing that “the dose makes the poison.”
To demonstrate chemophobia in 2003, Penn Jillette and assistant effortlessly convinced people in a beach community, one after another, to sign a petition to ban dihydrogen monoxide (H2O). Water is of course toxic in high doses, causing hyponatremia, seizures and brain damage. But I don’t think Paracelsus would have signed the petition.
“The first thing we do, let’s kill all the lawyers.” – Shakespeare, Henry VI, Part 2, Act IV
My last post discussed the failure of most physicians to infer the chance a patient has the disease given a positive test result where both the frequency of the disease in the population and the accuracy of the diagnostic test are known. The probability that the patient has the disease can be hundreds or thousands of times lower than the accuracy of the test. The problem in reasoning that leads us to confuse these very different likelihoods is one of several errors in logic commonly called the prosecutor’s fallacy. The important concept is conditional probability. By that we mean simply that the probability of x has a value and that the probability of x given that y is true has a different value. The shorthand for probability of x is p(x) and the shorthand for probability of x given y is p(x|y).
“Punching, pushing and slapping is a prelude to murder,” said prosecutor Scott Gordon during the trial of OJ Simpson for the murder of Nicole Brown. Alan Dershowitz countered with the argument that the probability of domestic violence leading to murder was very remote. Dershowitz (not prosecutor but defense advisor in this case) was right, technically speaking. But he was either as ignorant as the physicians interpreting the lab results or was giving a dishonest argument, or possibly both. The relevant probability was not the likelihood of murder given domestic violence, it was the likelihood of murder given domestic violence and murder. “The courtroom oath – to tell the truth, the whole truth and nothing but the truth – is applicable only to witnesses,” said Dershowitz in The Best Defense. In Innumeracy: Mathematical Illiteracy and Its Consequences. John Allen Paulos called Dershowitz’s point “astonishingly irrelevant,” noting that utter ignorance about probability and risk “plagues far too many otherwise knowledgeable citizens.” Indeed.
The doctors’ mistake in my previous post was confusing
P(positive test result) vs.
P(disease | positive test result)
Dershowitz’s argument confused
P(husband killed wife | husband battered wife) vs.
P(husband killed wife | husband battered wife | wife was killed)
In Reckoning With Risk, Gerd Gigerenzer gave a 90% value for the latter Simpson probability. What Dershowitz cited was the former, which we can estimate at 0.1%, given a wife-battery rate of one in ten, and wife-murder rate of one per hundred thousand. So, contrary to what Dershowitz implied, prior battery is a strong indicator of guilt when a wife has been murdered.
As mentioned in the previous post, the relevant mathematical rule does not involve advanced math. It’s a simple equation due to Pierre-Simon Laplace, known, oddly, as Bayes’ Theorem:
P(A|B) = P(B|A) * P(A) / P(B)
If we label the hypothesis (patient has disease) as D and the test data as T, the useful form of Bayes’ Theorem is
P(D|T) = P(T|D) P(D) / P(T) where P(T) is the sum of probabilities of positive results, e.g.,
P(T) = P(T|D) * P(D) + P(T | not D) * P(not D) [using “not D” to mean “not diseased”]
Cascells’ phrasing of his Harvard quiz was as follows: “If a test to detect a disease whose prevalence is 1 out of 1,000 has a false positive rate of 5 percent, what is the chance that a person found to have a positive result actually has the disease?”
Plugging in the numbers from the Cascells experiment (with the parameters Cascells provided shown below in bold and the correct answer in green):
- P(D) is the disease frequency = 0.001 [ 1 per 1000 in population ] therefore:
- P(not D) is 1 – P(D) = 0.999
- P(T | not D) = 5% = 0.05 [ false positive rate also 5%] therefore:
- P(T | D) = 95% = 0.95 [ i.e, the false negative rate is 5% ]
P(T) = .95 * .001 + .999 * .05 = 0.0509 ≈ 5.1% [ total probability of a positive test ]
P(D|T) = .95 * .001 / .0509 = .0019 ≈ 2% [ probability that patient has disease, given a positive test result ]
I hope this seeing is believing illustration of Cascells’ experiment drives the point home for those still uneasy with equations. I used Cascells’ rates and a population of 100,000 to avoid dealing with fractional people:
Extra credit: how exactly does this apply to Covid, news junkies?
Edit 5/21/20. An astute reader called me on an inaccuracy in the diagram. I used an approximation, without identifying it. P = r1/r2 is a cheat for P = 1 – Exp(- r1/r2). The approximation is more intuitive, though technically wrong. It’s a good cheat, for P values less that 10%.
Note 5/22/20. In response to questions about how this sort of thinking bears on coronavirus testing -what test results say about prevalence – consider this. We really have one equation in 3 unknowns here: false positive rate, false negative rate, and prevalence in population. A quick Excel variations study using false positive rates from 1 to 20% and false neg rates from 1 to 3 percent, based on a quick web search for proposed sensitivity/specificity for the Covid tests is revealing. Taking the low side of the raw positive rates from the published data (1 – 3%) results in projected prevalence roughly equal to the raw positive rates. I.e., the false positives and false negatives happen to roughly wash out in this case. That also leaves P(d|t) in the range of a few percent.
Most medical doctors, having ten or more years of education, can’t do simple statistics calculations that they were surely able to do, at least for a week or so, as college freshmen. Their education has let them down, along with us, their patients. That education leaves many doctors unquestioning, unscientific, and terribly overconfident.
A disturbing lack of doubt has plagued medicine for thousands of years. Galen, at the time of Marcus Aurelius, wrote, “It is I, and I alone, who has revealed the true path of medicine.” Galen disdained empiricism. Why bother with experiments and observations when you own the truth. Galen’s scientific reasoning sounds oddly similar to modern junk science armed with abundant confirming evidence but no interest in falsification. Galen had plenty of confirming evidence: “All who drink of this treatment recover in a short time, except those whom it does not help, who all die. It is obvious, therefore, that it fails only in incurable cases.”
Galen was still at work 1500 years later when Voltaire wrote that the art of medicine consisted of entertaining the patient while nature takes its course. One of Voltaire’s novels also described a patient who had survived despite the best efforts of his doctors. Galen was around when George Washington died after five pints of bloodletting, a practice promoted up to the early 1900s by prominent physicians like Austin Flint.
But surely medicine was mostly scientific by the 1900s, right? Actually, 20th century medicine was dragged kicking and screaming to scientific methodology. In the early 1900’s Ernest Amory Codman of Massachusetts General proposed keeping track of patients and rating hospitals according to patient outcome. He suggested that a doctor’s reputation and social status were poor measures of a patient’s chance of survival. He wanted the track records of doctors and hospitals to be made public, allowing healthcare consumers to choose suppliers based on statistics. For this, and for his harsh criticism of those who scoffed at his ideas, Codman was tossed out of Mass General, lost his post at Harvard, and was suspended from the Massachusetts Medical Society. Public outcry brought Codman back into medicine, and much of his “end results system” was put in place.
20th century medicine also fought hard against the concept of controlled trials. Austin Bradford Hill introduced the concept to medicine in the mid 1920s. But in the mid 1950s Dr. Archie Cochrane was still fighting valiantly against what he called the God Complex in medicine, which was basically the ghost of Galen; no one should question the authority of a physician. Cochrane wrote that far too much of medicine lacked any semblance of scientific validation and knowing what treatments actually worked. He wrote that the medical establishment was hostile the idea of controlled trials. Cochrane fought this into the 1970s, authoring Effectiveness and Efficiency: Random Reflections on Health Services in 1972.
Doctors aren’t naturally arrogant. The God Complex is passed passed along during the long years of an MD’s education and internship. That education includes rights of passage in an old boys’ club that thinks sleep deprivation builds character in interns, and that female med students should make tea for the boys. Once on the other side, tolerance of archaic norms in the MD culture seems less offensive to the inductee, who comes to accept the system. And the business of medicine, the way it’s regulated, and its control by insurance firms, pushes MDs to view patients as a job to be done cost-effectively. Medical arrogance is in a sense encouraged by recovering patients who might see doctors as savior figures.
As Daniel Kahneman wrote, “generally, it is considered a weakness and a sign of vulnerability for clinicians to appear unsure.” Medical overconfidence is encouraged by patients’ preference for doctors who communicate certainties, even when uncertainty stems from technological limitations, not from doctors’ subject knowledge. MDs should be made conscious of such dynamics and strive to resist inflating their self importance. As Allan Berger wrote in Academic Medicine in 2002, “we are but an instrument of healing, not its source.”
Many in medical education are aware of these issues. The calls for medical education reform – both content and methodology – are desperate, but they are eerily similar to those found in a 1924 JAMA article, Current Criticism of Medical Education.
Covid19 exemplifies the aspect of medical education I find most vile. Doctors can’t do elementary statistics and probability, and their cultural overconfidence renders them unaware of how critically they need that missing skill.
A 1978 study, brought to the mainstream by psychologists like Kahnemann and Tversky, showed how few doctors know the meaning of a positive diagnostic test result. More specifically, they’re ignorant of the relationship between the sensitivity and specificity (true positive and true negative rates) of a test and the probability that a patient who tested positive has the disease. This lack of knowledge has real consequences In certain situations, particularly when the base rate of the disease in a population is low. The resulting probability judgements can be wrong by factors of hundreds or thousands.
In the 1978 study (Cascells et. al.) doctors and medical students at Harvard teaching hospitals were given a diagnostic challenge. “If a test to detect a disease whose prevalence is 1 out of 1,000 has a false positive rate of 5 percent, what is the chance that a person found to have a positive result actually has the disease?” As described, the true positive rate of the diagnostic test is 95%. This is a classic conditional-probability quiz from the second week of a probability class. Being right requires a), knowing Bayes Theorem, and b), being able to multiply and divide. Not being confidently wrong requires only one thing: scientific humility – the realization that all you know might be less than all there is to know. The correct answer is 2% – there’s a 2% likelihood the patient has the disease. The most common response, by far, in the 1978 study was 95%, which is wrong by 4750%. Only 18% of doctors and med students gave the correct response. The study’s authors observed that in the group tested, “formal decision analysis was almost entirely unknown and even common-sense reasoning about the interpretation of laboratory data was uncommon.”
As mentioned above, this story was heavily publicized in the 80s. It was widely discussed by engineering teams, reliability departments, quality assurance groups and math departments. But did it impact medical curricula, problem-based learning, diagnostics training, or any other aspect of the way med students were taught? One might have thought yes, if for no reason than to avoid criticism by less prestigious professions having either the relevant knowledge of probability or the epistemic humility to recognize that the right answer might be far different from the obvious one.
Similar surveys were done in 1984 (David M Eddy) and in 2003 (Kahan, Paltiel) with similar results. In 2013, Manrai and Bhatia repeated Cascells’ 1978 survey with the exact same wording, getting trivially better results. 23% answered correctly. They suggesting that medical education “could benefit from increased focus on statistical inference.” That was 35 years after Cascells, during which, the phenomenon was popularized by the likes of Daniel Kahneman, from the perspective of base-rate neglect, by Philip Tetlock, from the perspective of overconfidence in forecasting, and by David Epstein, from the perspective of the tyranny of specialization.
Over the past decade, I’ve asked the Cascells question to doctors I’ve known or met, where I didn’t think it would get me thrown out of the office or booted from a party. My results were somewhat worse. Of about 50 MDs, four answered correctly or were aware that they’d need to look up the formula but knew that it was much less than 95%. One was an optometrist, one a career ER doc, one an allergist-immunologist, and one a female surgeon – all over 50 years old, incidentally.
Despite the efforts of a few radicals in the Accreditation Council for Graduate Medical Education and some post-Flexnerian reformers, medical education remains, as Jonathan Bush points out in Tell Me Where It Hurts, basically a 2000 year old subject-based and lecture-based model developed at a time when only the instructor had access to a book. Despite those reformers, basic science has actually diminished in recent decades, leaving many physicians with less of a grasp of scientific methodology than that held by Ernest Codman in 1915. Medical curriculum guardians, for the love of God, get over your stodgy selves and replace the calculus badge with applied probability and statistical inference from diagnostics. Place it later in the curriculum later than pre-med, and weave it into some of that flipped-classroom, problem-based learning you advertise.
Congress and Richard Nixon had no intention to pull a bait-and-switch when the enacted the National Maximum Speed Law (NMSL) on Jan. 2, 1974. The emergency response to an embargo, NMSL (Public Law 93-239), specified that it was “an act to conserve energy on the Nation’s highways.” Conservation, in this context, meant reducing oil consumption to prevent the embargo proclaimed by the Organization of Arab Petroleum Exporting in October 1973 from seriously impacting American production or causing a shortage of oil then used for domestic heating. There was a precedent. A national speed limit had been imposed for the same reasons during World War II.
By the summer of 1974 the threat of oil shortage was over. But unlike the case after the war, many government officials, gently nudged by auto insurance lobbies, argued that the reduced national speed limit would save tens of thousands of lives annually. Many drivers conspicuously displayed their allegiance to the cause with bumper stickers reminding us that “55 Saves Lives.” Bad poetry, you may say in hindsight, a sorry attempt at trochaic monometer. But times were desperate and less enlightened drivers had to be brought onboard. We were all in it together.
Over the next ten years, the NMSL became a major boon to jurisdictions crossed by interstate highways, some earning over 80% of their revenues from speeding fines. Studies reached conflicting findings over whether the NMSL had saved fuel or lives. The former seems undeniable at first glance, but the resulting increased congestion caused frequent brake/stop/accelerate effects in cities, and the acceleration phase is a gas guzzler. Those familiar with fluid mechanics note that the traffic capacity of a highway is proportional to the speed driven on it. Some analyses showed decreased fuel efficiency (net miles per gallon). The most generous analyses reported a less than 1% decrease in consumption.
No one could argue that 55 mph collisions were more dangerous than 70 mph collisions. But some drivers, particularly in the west, felt betrayed after being told that the NMSL was an emergency measure (”during periods of current and imminent fuel shortages”) to save oil and then finding it would persist indefinitely for a new reason, to save lives. Hicks and greasy trucker pawns of corporate fat cats, my science teachers said of those arguing to repeal the NMSL.
The matter was increasingly argued over the next twelve years. The states’ rights issue was raised. Some remembered that speed limits had originally been set by a democratic 85% rule. The 85th percentile speed of drivers on an unposted highway became the limit for that road. Auto fatality rates had dropped since 1974, and everyone had their theories as to why. A case was eventually made for an experimental increase to 65 mph, approved by Congress in December 1987. The insurance lobby predicted carnage. Ralph Nader announced that “history will never forgive Congress for this assault on the sanctity of human life.”
Between 1987 and 1995, 40 states moved to the 65 limit. Auto fatality rates continued to decrease as they had done between 1973 and 1987, during which time some radical theorists had argued that the sudden drop in fatality rate in early 1974 had been a statistical blip regressed to the mean a year later and that better cars and seat belt usage accounted for the decreased mortality. Before 1987, those arguments were commonly understood to be mere rationalizations.
In December 1995, more than twenty years after being enacted, Congress finally undid the NMSL completely. States had the authority to set speed limits. An unexpected result of increasing speed limits to 75 mph in some western states was that, as revealed by unmanned radar, the number of vehicles driving above 80 mph dropped by 85% compared to when the speed limit was 65.
From a systems-theory perspective, it’s clear that the highway transportation network is a complex phenomenon, one resistant to being modeled through facile conjecture about causes and effects, naive assumptions about incentives and human behavior, and ivory-tower analytics.
Playing poker online is far more addictive than gambling in a casino. Online poker, and other online gambling that involves a lot of skill, is engineered for addiction. Online poker allows multiple simultaneous tables. Laptops, tablets, and mobile phones provide faster play than in casinos. Setup time, for an efficient addict, can be seconds per game. Better still, you can rapidly switch between different online games to get just enough variety to eliminate any opportunity for boredom that has not been engineered out of the gaming experience. Completing a hand of Texas Holdem in 45 seconds online increases your chances of fast wins, fast losses, and addiction.
Tilt is what poker players call it when a particular run of bad luck, an opponent’s skill, or that same opponent’s obnoxious communications put you into a mental state where you’re playing emotionally and not rationally. Anger, disgust, frustration and distress is precipitated by bad beats, bluffs gone awry, a run of dead cards, losing to a lower ranked opponent, fatigue, or letting the opponent’s offensive demeanor get under your skin.
Tilt is so important to online poker that many products and commitment devices have emerged to deal with it. Tilt Breaker provides services like monitoring your performance to detect fatigue and automated stop-loss protection that restricts betting or table count after a run of losses.
A few years back, some friends and I demonstrated biometric tilt detection using inexpensive heart rate sensors. We used machine learning with principal dynamic modes (PDM) analysis running in a mobile app to predict sympathetic (stress-inducing, cortisol, epinephrine) and parasympathetic (relaxation, oxytocin) nervous system activity. We then differentiated mental and physical stress using the mobile phone’s accelerometer and location functions. We could ring an alarm to force a player to face being at risk of tilt or ragequit, even if he was ignoring the obvious physical cues. Maybe it’s time to repurpose this technology.
In past crises, the flow of bad news and peer communications were limited by technology. You could not scroll through radio programs or scan through TV shows. You could click between the three news stations, and then you were stuck. Now you can consume all of what could be home work and family time with up to the minute Covid death tolls while blasting your former friends on Twitter and Facebook for their appalling politicization of the crisis.
You yourself are of course innocent of that sort of politicizing. As a seasoned poker player, you know that the more you let emotions take control your game, the farther your judgments will stray from rational ones.
Still yet, what kind of utter moron could think that the whole response to Covid is a media hoax? Or that none of it is.
Everyone knows about the marshmallow test. Kids were given a marshmallow and told that they’d get a second one if they resisted eating the first one for a while. The experimenter then left the room and watched the kids endure marshmallow temptation. Years later, the kids who had been able to fight temptation were found to have higher SAT scores, better jobs, less addiction, and better physical fitness than those who succumbed. The meaning was clear; early self control, whether innate or taught, is key to later success. The test results and their interpretation were, scientifically speaking, too good to be true. And in most ways they weren’t true.
That wrinkle doesn’t stop the marshmallow test from being trotted out weekly on LinkedIn and social sites where experts and moralists opine. That trotting out comes with behavioral economics lessons, dripping with references to Kahnemann, Ariely and the like about our irrationality as we face intertemporal choices, as they’re known in the trade. When adults choose an offer of $1000 today over an offer for $1400 to be paid in one year, even when they have no pressing financial need, they are deemed irrational or lacking self control, like the marshmallow kids.
The famous marshmallow test was done by Walter Mischel in the 1960s through 1980s. Not only did subsequent marshmallow tests fail to show as much correlation between not waiting for the second marshmallow and a better life, but, more importantly, similar tests for at least twenty years have pointed to a more salient result, one which Mischel was aware of, but which got lost in popular retelling. Understanding the deeper implications of the marshmallow tests, along with a more charitable view of kids who grabbed the early treat, requires digging down into the design of experiments, Bayesian reasoning, and the concept of risk neutrality.
Intertemporal choice tests like the marshmallow test involve choices between options that involve different payoffs at different times. We face these choices often. And when we face them in the real world, our decision process is informed by memories and judgments about our past choices and their outcomes. In Bayesian terms, our priors incorporate this history. In real life, we are aware that all contracts, treaties, and promises for future payment come with a finite risk of default.
In intertemporal choice scenarios, the probability of the deferred payment actually occurring is always less than 100%. That probability is rarely known and is often unknowable. Consider choices A and B below. This is how the behavioral economists tend to frame the choices.
|$1,000 now||$1,400 paid next year|
But this framing ignores an important feature of any real-world, non-hypothetical intertemporal choice situation: the probability of choice B is always less than 100%. In the above example, even risk-neutral choosers (those indifferent to all choices having the same expected value) would pick choice A over choice B if they judge the probability of non-default (actually getting the deferred payment) to be less than a certain amount.
|$1000 now||$1,400 in one year, P= .99||$1,400 in one year, P= 0.7|
|Expected value =$1000||Expected value = $1386||Expected value = $980|
As shown above, if choosers believe the deferred payment likelihood to be less than about 70%, they cannot be called irrational for choosing choice A.
Lack of Self Control – or Rational Intuitive Bayes?
Now for the final, most interesting twist in tests like the marshmallow test, almost universally ignored by those who cite them. Unlike my example above where the wait time is one year, in the marshmallow tests, the time period during which the subject is tempted to eat the first marshmallow is unknown to the subject. Subjects come into the game with a certain prior – a certain belief about the probability of non-default. But, as intuitive Bayesians, these subjects update the probability they assign to non-default, during their wait, based on the amount of time they have been waiting. The speed at which they revise their probability downward depends on their judgment of the distribution of wait times experienced in their short lives.
If kids in the marshmallow tests have concluded, based on their experience, that adults are not dependable, choice A makes sense; they should immediately eat the first marshmallow, since the second one may never materialize. Kids who endure temptation for a few minutes only to give in and eat their first marshmallow are seen as both irrational and being incapable of self-control.
But if those kids adjust their probability judgments that the second marshmallow will appear based on a prior distribution that is not a normal distribution (i.e., if as intuitive Bayesians they model wait times imposed by adults as a power-law distribution), then their eating the first marshmallow after some test-wait period makes perfect sense. They rightly conclude, on the basis of available evidence, that wait times longer than some threshold period may be very long indeed. These kids aren’t irrational, and self-control is not their main problem. Their problem is that they have been raised by irresponsible adults who have both displayed a tendency to default on payments and who are late to fulfill promises by time durations obeying power-law distributions.
Subsequent marshmallow tests have verified this. In 2013, psychologist Laura Michaelson, after more sophisticated versions of the marshmallow test, concluded “implications of this work include the need to revise prominent theories of delay of gratification.” Actually, tests going back over 50 years have shown similar results (A.R. Mahrer, The role of expectancy in delayed reinforcement, 1956).
In three recent posts (first, second, third) I suggested that behavioral economists and business people who follow them are far too prone to seeing innate bias everywhere, when they are actually seeing rational behavior through their own bias. This is certainly the case with the common misuse of the marshmallow tests. Interpreting these tests as rational behavior in light of subjects’ experience is a better explanatory theory, one more consistent with the evidence, and one that coheres with other explanatory observations, such as humans’ capacity for intuitive Bayesian belief updates.
Charismatic pessimists about human rationality twist the situation so that their pessimism is framed as good news, in the sense that they have at least illuminated an inherent human bias. That pessimism, however cheerfully expressed, is both misguided and harmful. Their failure to mention the more nuanced interpretation of marshmallow tests is dishonest and self-serving. The problem we face is not innate, and it is mostly curable. Better parenting can fix it. The marshmallow tests measure parents more than they measure kids.
Walter Mischel died in 2018. I heard his 2016 talk at the Long Now Foundation in San Francisco. He acknowledged the relatively weak correlation between marshmallow test results and later success, and he mentioned that descriptions of his experiments in popular press were rife with errors. But his talk still focused almost solely on the self-control aspect of the experiments. He missed a great opportunity to help disseminate a better story about the role of trustworthiness and reliability of parents in delayed gratification of children.
A better description of the way we really work through intertemporal choices would require going deeper into risk neutrality and how, even for a single person, our departure from risk neutrality – specifically risk-appetite skewness – varies between situations and across time. I have enjoyed doing some professional work in that area. Getting it across in a blog post is probably beyond my current blog-writing skills.