The mystery of the .003

It is intriguing that my paper, "NASA faked the moon landing—therefore (climate) science is a hoax: An anatomy of the motivated rejection of science", continues to attract attention, nearly 7 years after it first saw the light of day and after numerous replications of the main finding that caused such a stir. The most recent report of an association between endorsement of conspiracy theories and the rejection of climate science appeared in Nature Climate Change in an article by Mat Hornsey and colleagues.

A few days ago a blogpost appeared under the somewhat alarming headline that Lewandowsky data had been altered.

I have not altered any data, but .003 of the observations in the extended data set (139 out of 44,655) differed from those in the data set on which the original analysis was based. The differences were sufficiently slight that I did not notice them when I ran a correlational analysis (rounded to a few digits) on both data sets roughly 5 years ago, when the data were (re-)posted after I moved to Bristol.
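For the record, the ".003" is simply the fraction of discrepant observations, which a one-line check confirms:

```python
# The proportion of observations in the extended data set that differed
differing = 139
total = 44_655
print(round(differing / total, 3))  # → 0.003
```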

Let me first explain the problem: The original analysis as reported in the paper did not include a number of other variables (items) that were surveyed because those additional items were of little relevance to the topic of the paper. Some of those items were analyzed in the online supplement that accompanied the paper, and their existence was also mentioned in the paper itself.

For convenience, I posted two data sets after I moved to Bristol (and after a long and nuanced story involving the conflict between UK privacy laws and Australian consent forms). One data set contains the variables analyzed in the paper, and the other additionally contains all other items that have no missing observations for the same 1,145 participants. Because the former data set is a subset of the latter, the variables the two have in common should be identical.

They were identical, except for two items (FMThreatEnv and CO2TempUp), for which 81 and 59 observations, respectively, differed between the two data sets. In those cases, a "1" was coded as a "2" and a "4" as a "3". The differences appear to be random (i.e., there is no discernible pattern, and only 27 participants were coded incorrectly in both variables).
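The consistency check involved here is straightforward: for every variable the two data sets share, count the rows where the values disagree. A minimal sketch, using made-up toy values rather than the actual survey responses:

```python
import pandas as pd

def diff_counts(a, b):
    """For each column shared by two data sets, count rows that differ."""
    shared = a.columns.intersection(b.columns)
    return {col: int((a[col] != b[col]).sum()) for col in shared}

# Toy stand-ins for the posted data sets (values are illustrative only)
original = pd.DataFrame({
    "FMThreatEnv": [1, 2, 4, 3],
    "CO2TempUp":   [4, 1, 2, 2],
})
extended = pd.DataFrame({
    "FMThreatEnv": [2, 2, 3, 3],   # a "1" miscoded as "2", a "4" as "3"
    "CO2TempUp":   [3, 2, 2, 2],
    "ExtraItem":   [5, 5, 5, 5],   # additional items not analyzed in the paper
})

print(diff_counts(original, extended))  # → {'FMThreatEnv': 2, 'CO2TempUp': 2}
```

Running this over the real files would flag exactly the two affected items, with all other shared columns showing zero differences.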

So which data set is correct?

After checking this all again and re-running the script that dropped potentially identifying information from the data and performed some other minimal preprocessing, I found the published results and the corresponding data set to be correct. Nothing changes anywhere in the paper or the online supplement, except that the extended data set (which was publicly available but which was not used for analysis in the paper) had to be updated.

This has been done, and the new extended data set, which supersedes the old one, is now available here. (Because the original data set was given a DOI that is enshrined for 20 years or more, it continues to be available but has a pointer to the data set that superseded it.)

A final question is how this could have happened in the first place. By what mechanism would .003 of nearly 45,000 data points take on different values, seemingly at random? I do not have an answer to that question. There is nothing in my archives that could have generated the initial extended data set. The script I used to strip identifying information and so on delivers the correct results now, and it delivered the same results 7 years ago.