
The Dunning-Kruger effect: Misunderstood, misrepresented, overused and … non-existent?

Just stop using it!

A couple of weeks ago a statement popped up in my Facebook feed that surprised me.

New evidence suggests that the Dunning-Kruger effect doesn’t exist – people who don’t know what they’re talking about are aware of that fact.

The post came from the QI Elves, who delight in posting quirky and unexpected scientific trivia, but rarely include sources and occasionally get fooled by “press release”-science. They are very popular though, so I thought I should look into the claim, which sent me down a rabbit hole, or rather a complete rabbit warren. After reading more social psychology and education science papers than I care to ever do again, I’ve come to a few conclusions that I’d like to share with you all.

1. The Dunning-Kruger effect is a mess

The original paper

If you’ve never been curious about where the term comes from, you have presumably also never read the 1999 paper by social psychologists Justin Kruger and David Dunning that first described the effect. And even if you have read it, you likely need a very quick refresher. Kruger and Dunning did four studies, one on recognizing humor, two on solving logic puzzles and one on knowing grammar, in which they compared the results volunteer psychology students got on a test in the domain with the same students’ self-assessments related to that test.

[Figure: Self-assessed ability to recognize humor, based on Kruger and Dunning, 1999, figure 1. (Copyright author)]

Their results showed that on average the students overestimated themselves, and that this was chiefly due to the self-assessments of the lowest-scoring quartile, while the highest-scoring quartile slightly underestimated their own performance. Their hypothesized explanation was that for certain domains, the skill required for self-assessment is also required to succeed, leading those who lack the skill to be “Unskilled and unaware of it” (the title of their paper).
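To make the shape of that analysis concrete, here’s a minimal Python sketch with made-up data (none of these numbers are Kruger and Dunning’s): give each simulated participant a latent skill, an actual percentile rank, and a noisy, “better than average”-shifted self-estimate, then compare quartile means the way the paper does.

```python
import numpy as np

# Toy data, NOT Kruger and Dunning's: each participant has a latent skill,
# an actual percentile rank derived from it, and a noisy self-estimate
# shifted upward by a "better than average" bias.
rng = np.random.default_rng(0)
n = 65  # an arbitrary class-sized sample
skill = rng.normal(0, 1, n)
actual_pct = 100 * (np.argsort(np.argsort(skill)) + 0.5) / n
perceived_pct = np.clip(55 + 10 * skill + rng.normal(0, 15, n), 0, 100)

quartile = np.digitize(actual_pct, [25, 50, 75])  # 0 = bottom, 3 = top
for q in range(4):
    m = quartile == q
    print(f"Q{q + 1}: actual {actual_pct[m].mean():5.1f}, "
          f"perceived {perceived_pct[m].mean():5.1f}")
# Typical output shape: the bottom quartile sits near the 12th percentile
# but places itself in the 40s, while the top quartile slightly
# underestimates itself -- the pattern in figure 1 of the paper.
```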

This explanation was meant just for some domains, mind you, as they state in the paper:

We do not mean to imply that people are always unaware of their incompetence. We doubt whether many of our readers would dare take on Michael Jordan in a game of one-on-one, challenge Eric Clapton with a session of dueling guitars, or enter into a friendly wager on the golf course with Tiger Woods.

They also mention that the patterns they show in self-assessment would also be influenced by the so-called “Better Than Average” effect and regression toward the mean, but that they believe the magnitude of their result, and their experimental setups, justify ascribing the pattern to the “dual curse” of being unskilled and unable to recognize it.

They also mention in passing, at the end, a slew of other influencing factors:

We have little doubt that other factors such as motivational biases (Alicke, 1985; Brown, 1986; Taylor & Brown, 1988), self-serving trait definitions (Dunning & Cohen, 1992; Dunning et al., 1989), selective recall of past behavior (Sanitioso, Kunda, & Fong, 1990), and the tendency to ignore the proficiencies of others (Klar, Medding, & Sarel, 1996; Kruger, 1999) also play a role.

But if you ask someone today what causes someone to overestimate their skill in a field, it’s likely the only thing they have heard of is the Dunning-Kruger effect (DKE).

Public perception

It just fits so neatly with our own personal observations, doesn’t it? Everyone encounters overconfident people in their life, and it feels good to put them down with a reference to SCIENCE! So DKE has become a very popular and often misused and misunderstood concept. Here are a few things DKE is not, even if people say it is:

[Figure: Skill vs confidence graph that is often mislabeled a Dunning-Kruger graph. Not a Dunning-Kruger graph, no matter what the internet tells you. (By the author.)]
  • It is not “unintelligent people think they are smart” – as described, DKE applies to anyone with low skill in a domain, even if they are capable in others, as when physicists turn into armchair epidemiologists.
  • It is not “amateurs think they are smarter than the experts” – DKE says the unskilled overestimate their skill on average, but only in a small minority of studies does the estimate from the bottom quartile put them in the top quartile. Which hasn’t prevented even bloggers at Psychology Today from peddling the oft-seen “alternative” DKE graph, which is always just made up in MS Paint, like mine.
  • It is not that thing where you start studying a subject, think you’ve got it all figured out after the first two semesters, and realize just how little you know halfway through the third. This sometimes uses the same pulled-from-the-anus graph.
  • It is not the similar thing where the relationship between knowledge in a field and willingness to defend strong opinions about the field is weirdly non-linear.

They are close enough though, at least in spirit, to help make DKE as popular as it is, in both real and warped variations. David Dunning mentions some of these, and the problem with them, in a recent article on MSN: What is the Dunning-Kruger effect? And Dunning does have a paper about the overconfidence of beginners, with a graph that bears a very slight resemblance to the Not a Dunning-Kruger graph, but since it’s not showing the Dunning-Kruger effect, and that paper doesn’t have Kruger as an author, it should really be called a Dunning-Sanchez graph.

And of course there’s the same problem one encounters with any finding of this kind: it describes patterns in messy data, not rules applying at the individual level. Self-assessment varies greatly within each group, but the findings rely on patterns in the means and are presented only through those means, so in public perception those means come to represent everyone and are applied to individuals.

Later work

So public perception is a mess. But what about the scientific perception? Well, although the paper has received criticism right from the start, there are also multiple studies replicating the result in other domains. What seems to be lacking in a lot of later papers though, in my cursory review, is acknowledgement of the caveats in the original paper and the criticisms received since. The authors have responded to at least some of the criticism with papers and reanalyses, and I’m not competent to judge the quality of that work, but most replications seem to stick with the original level of analysis.

There also seems to be limited recognition that the effect isn’t universal. Yes, it has shown up in a diverse selection of tasks, but it has also been shown to vary with the difficulty level of the task, with the domain, with the type of self-assessment, and possibly with the cultural background of the test subjects, although I haven’t been able to find a source for this last aspect that uses comparable methodology. So what we have scientifically is an effect that seems to appear in a lot of circumstances, but is known to have limitations. And these limitations are largely unexplored, and unfortunately ignored by critics and fans alike. In other words, it’s a mess.

2. The Dunning-Kruger effect doesn’t exist, maybe

As I mentioned above, Dunning and Kruger acknowledged that the Better-Than-Average effect and regression toward the mean would have an effect on their studies, but believed the effect they observed was too large, and too clearly influenced by things like the introduction of training, to be due to these alone. Other researchers disagreed, and although Dunning and Kruger have responded to previous criticism with reanalyses and additional studies, new papers critical of their findings keep appearing.

One such paper, written by Gilles E. Gignac and Marcin Zajenkowski, was published early this year in Intelligence and bears the confident title The Dunning-Kruger effect is mostly a statistical artefact: Valid approaches to testing the hypothesis with individual differences data. They use simulated data on IQ and self-assessed IQ and supposedly find a Dunning-Kruger effect where none should exist, and then they use real data on IQ and self-assessed IQ to do a statistical analysis they believe should reveal any DKE, and find none.

Based on this they state that:

When such valid statistical analyses are applied to individual differences data, we believe that evidence ostensibly supportive of the Dunning-Kruger hypothesis derived from the mean difference approach employed by Kruger and Dunning (1999) will be found to be substantially overestimated.
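To get a feel for what a “valid approach” amounts to, here is a minimal sketch, under my reading of the paper, of a Glejser-style heteroscedasticity check: regress the self-estimate on the measured score, then ask whether the size of the error shrinks as ability grows. All numbers below are made up; this is not their data or code.

```python
import numpy as np
from scipy import stats

# Made-up data: measured IQ and a noisy self-estimate whose error variance
# does NOT depend on ability (i.e. no DKE in the "differential accuracy"
# sense being tested for).
rng = np.random.default_rng(42)
n = 1000
iq = rng.normal(100, 15, n)
self_iq = 100 + 0.3 * (iq - 100) + rng.normal(0, 12, n)

# Regress the self-estimate on measured IQ, then test whether the absolute
# residuals correlate with measured IQ (a Glejser-style check).
slope, intercept = np.polyfit(iq, self_iq, 1)
abs_resid = np.abs(self_iq - (intercept + slope * iq))
r, p = stats.pearsonr(iq, abs_resid)
print(f"correlation of |error| with IQ: r = {r:.3f}, p = {p:.3f}")
# A significant negative r would mean the less skilled misjudge themselves
# by more; on this no-DKE toy data no such correlation appears.
```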

The statistics in their approach seem to check out, but I think it’s worth considering a few points:

  • Much like previous critics, they go back to that first paper, and seem to ignore the authors’ responding papers.
  • They have tested one domain, and extrapolate to all others.
  • The Dunning-Kruger effect they believe is overestimated is the difference in degree between how much the least skilled and the most skilled misestimate their level. They still show a powerful mean overestimation in the least skilled group.

The first two speak to how this paper adds to the existing mess by ignoring the mess that already exists. And the third shows that even if the scientific DKE hypothesis is on uncertain foundations, the popular perception of it would not necessarily be affected. Which brings me to the third conclusion I’d like to share.

3. Everyone should move past Kruger and Dunning (1999)

My favouritest of the recent papers critical of the Dunning-Kruger effect is the first one I came across when I started searching for what the QI Elves might be referring to: How Random Noise and a Graphical Convention Subverted Behavioral Scientists’ Explanations of Self-Assessment Data: Numeracy Underlies Better Alternatives by Nuhfer et al. This paper also suffers from some of the problems listed above: it focuses on one domain and it ignores all the elements of DKE research it isn’t directly focused on. But even if that means it fails, at least in my opinion, to refute previous DKE findings, it raises some important points and, again in my opinion, makes a good case for scientists in fields where DKE is relevant to abandon the concept and the approach popularized by that first paper, even with the modifications made since.

[Figure: Simulated data “DKE graph”. Data from 64 “participants” guessing randomly. (By author)]

Let me explain by showing you another DKE-type graph. At first glance it seems like a plausible outcome for a DKE-type experiment, yes? Unless you have a lot of experience with such graphs, it will take you a little while to notice that it’s very flat, has a surprising downward slope, and that all the points are very close to the 50th percentile.

That is because, as you already know if you read the caption, it is random data. You can’t tell that straight from the graph though, because, as in most DKE-type graphs, there is no visual representation of variance. In fact, I suspect that for at least some DKE studies you would still find it hard to tell whether the data was random or not.

This does not mean all DKE graphs are nonsense, but it does mean that “noise” in a self-assessment is visually similar to some aspects of the DKE in such a graph. It’s regression to the mean, compounded by the effect of sorting the data into quartiles on one axis and percentiles on the other. Combine that with floor and ceiling effects for the top and bottom quartiles, and the better-than-average effect, and a noisy self-assessment in an underpowered study is bound to produce something similar to the DKE.
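You can see the artefact for yourself in a few lines. A minimal sketch mirroring my random-data graph above (my own toy code, not Nuhfer et al.’s): draw actual ranks and self-assessed percentiles independently, then look at the quartile means.

```python
import numpy as np

# 64 "participants" whose actual percentile ranks and self-assessed
# percentiles are drawn independently -- pure noise, no self-knowledge.
rng = np.random.default_rng(1)
n = 64
actual = 100 * (rng.permutation(n) + 0.5) / n   # actual percentile ranks
guessed = rng.uniform(0, 100, n)                # random self-assessments

quartile = np.digitize(actual, [25, 50, 75])    # 0 = bottom, 3 = top
for q in range(4):
    m = quartile == q
    print(f"Q{q + 1}: actual mean {actual[m].mean():5.1f}, "
          f"self-assessed mean {guessed[m].mean():5.1f}")
# Every quartile guesses ~50 on average, so the bottom quartile looks
# hugely "overconfident" and the top quartile "underconfident" -- a
# DKE-shaped pattern out of nothing.
```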

The actual study in this paper is anything but underpowered though. Nuhfer et al. have had several thousand students take their Science Literacy Concept Inventory (SLCI), and their results from analyzing that data are fascinating (e.g. taking science courses appears not to correlate with learning science literacy, but spending more time in university does), but this paper is based on a smaller subset of 1154 students and professors who have also done related self-assessments, mainly a Knowledge Survey of the SLCI (KSSLCI).

At the superficial level of visually inspecting a Dunning-Kruger-type graph, this data appears to show a weak DKE. (Figure 1 in the paper, if you’re very interested.) Nuhfer et al., based on a previous paper and simulations, contend that this is entirely due to floor and ceiling effects, and focus on an alternative graphical representation and a different way of defining who is an expert.

Let me first briefly describe their graphical representation. Since they have data on university students at all levels, as well as professors, they graph the over/under-confidence for each class rank, and they show the spread. They also show the confidence interval for the mean.

Their choice of plotting software, or their use of it, leaves a little to be desired for showing the spread, so I’m giving you my violin plot of their data instead. The dot shows the mean in each group, and the width of the blob represents the number of individuals at that level. Overconfidence is up, and 0 represents a precise assessment of one’s own score.
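(For the curious, this is roughly how such a violin plot can be built with matplotlib; the group labels and numbers below are made-up stand-ins, not the SLCI data.)

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up over/under-confidence scores per class rank, standing in for
# the real self-assessed-minus-actual differences.
rng = np.random.default_rng(3)
groups = ["Freshman", "Sophomore", "Junior", "Senior", "Graduate", "Professor"]
errors = [rng.normal(mu, 12, 150) for mu in (6, 4, 2, 1, -1, -3)]

fig, ax = plt.subplots()
ax.violinplot(errors, showmeans=True)
ax.axhline(0, linestyle="--", linewidth=1)  # 0 = perfectly accurate
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(groups)
ax.set_ylabel("Overconfidence (self-assessed minus actual)")
plt.show()
```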

[Figure: Class rank vs self-assessment accuracy (Graph by author)]

It’s possible to see a weak trend, with overconfidence among Freshmen becoming underconfidence among Professors, but we also see that the spread in self-assessment is quite wide, and Nuhfer et al. show that the confidence intervals for most of the means include zero. (Their graph does have that advantage over mine.)

For a lot of DKE papers the self-assessment is in the form of guessing where you would rank in your group. Nuhfer et al. have that type of self-assessment for this data material as well, and it shows a much more DKE-like result. But their contention is that this is a noisy and not particularly interesting form of self-assessment that will always show some level of DKE. Again I think they are treating past analyses a bit unfairly by focusing entirely on the graphical representation, but they still have a point. That you estimate yourself to be of average skill even when you’re not doesn’t necessarily mean you think you’re really good at it or on par with an expert; it may just mean you think most people are as bad at it as you.

As I mentioned, in addition to the alternative graphical representations (they also suggest using histograms of the various sub-groups, and show how a similar understanding of the self-assessments can be read from those), they offer a different way to discuss self-assessment accuracy in a study. (If you are interested in the histograms, read the paper; this post is too long as it is.) By looking at the group that is arguably made up of “real experts”, the professors, they establish boundaries for “good”, “adequate” and “inadequate” self-assessment, and show that by those measures 43% of the Freshmen and Sophomores gave “good” self-assessments and 17% were “adequate”. A small number were extremely over- or underconfident, but the group was not significantly one or the other, showing that the impression one might get from a DKE graph of the data would be misleading.
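To show the shape of that idea in code: derive “good” and “adequate” limits from the spread of the expert group’s errors, then classify everyone else against them. The numbers and cutoffs below are entirely invented; the paper derives its actual boundaries in its supplement.

```python
import numpy as np

# Hypothetical sketch of expert-derived boundaries: these cutoffs are
# invented, not the ones Nuhfer et al. actually derive.
rng = np.random.default_rng(7)
prof_err = rng.normal(-2, 8, 100)    # made-up professor over/under-confidence
fresh_err = rng.normal(5, 14, 400)   # made-up freshman errors

good_limit = np.percentile(np.abs(prof_err), 50)      # e.g. median expert |error|
adequate_limit = np.percentile(np.abs(prof_err), 90)  # e.g. 90th percentile

labels = np.select(
    [np.abs(fresh_err) <= good_limit, np.abs(fresh_err) <= adequate_limit],
    ["good", "adequate"], default="inadequate")
for label in ("good", "adequate", "inadequate"):
    print(label, f"{(labels == label).mean():.0%}")
```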

The paper of course goes into much more detail; the analysis of what should be considered a “good” self-assessment is part of a 17-page supplement, but I hope I’ve managed to summarize it in a not too confusing fashion.

In conclusion

As mentioned at the beginning, my trip through all these papers led me to three conclusions, which can be summed up like this: it’s time to leave the Dunning-Kruger effect behind. If you’re using it as an insult or to explain individual behavior, you are over-simplifying to the extent that you are just wrong. If you are using it as the one explanation of differences between self-assessment and actual performance, you are probably not producing results that are useful. Analyzing differences between self-assessment and actual performance has an important place in improving education, and education researchers are leading the way to better ways of presenting those differences than DKE papers have been using for two decades.

Bjørnar

Bjørnar used to be a CompSci-major high school teacher in Norway, but now lives with his American wife in Boston where he gives her programming advice, walks the dog, works as a tutor and tries to decide what to be when he grows up.

2 Comments

  1. *looks at title* Oh thank god someone is saying it. I did some minimal research into DKE years ago and was convinced, ironically enough, that people do not understand it as well as they think they do. I’ll have further commentary when I have time to read this in more detail.

  2. A few thoughts:
    1. The “Better than Average” effect is not generalizable to other cultures, and the opposite has been found in studies of East Asians. The obvious experiment, then, is to test East Asian subjects for DKE. If DKE theory is correct, then you’d expect people with low test scores to systematically underestimate their test scores even more than people with high test scores do. This implies a highly nonlinear “perceived ability” curve. This isn’t impossible, but it sounds highly implausible to me.

    A fundamental problem I have with the DKE is that there is a philosophical distinction between an accurate self-assessment, and the ability to produce an accurate self-assessment. If I’m a person of above average ability, and I assess my own ability by rolling ten dice and taking the average plus one, I might produce accurate results but it does not show my ability to produce accurate results. If everybody used this method of self-assessment, then you would find that people in the upper range have more accurate results, but it would be wrong to conclude that they have greater ability to produce accurate results. Dunning and Kruger have responded to “regression to the mean” criticisms by performing statistical tests, but I cannot see how any statistical test could possibly bridge the philosophical gap.
