Saturday, May 12, 2012

Oh the irony - new #OpenAccess #PLoSOne paper on Research Blogs doesn't share data behind analyses.

Interesting new paper: PLoS ONE: Research Blogs and the Discussion of Scholarly Information. All about the new world of science blogging.  Much of the context here relates to openness.  Yet as far as I can tell, the data that make up the meat of the analyses in the paper are not shared.  Ugh.

Is there something I am missing here? Shouldn't a prerequisite of publishing this kind of paper be sharing the information / data used in the analyses?  Shouldn't that be released with the paper?

Definitely time to start "Open Data Watch" where people have a place to complain about lack of open availability of data behind papers (I came up with the name as a mimic of Ivan Oransky's diverse watch sites like Retraction Watch).  Originally in thinking about doing this I had been thinking about genomic data.  But I am sure this is a problem in other areas.  Consider paleontology, where openness to fossils and other samples is, well, not as common as it should be.  It is not that hard anymore to find a place to share one's data.  With places like Data Dryad and Biotorrents and FigShare and Merritt and 100s of others it is really inexcusable not to share the data behind a paper in most cases.  Certainly, in some cases there may be privacy issues, but that is not the case here (I think) and not an issue in most cases.

Come on people.  If scientific papers are to be reproducible and testable, you need to give people access to the data you used.

ResearchBlogging.org: Shema, H., Bar-Ilan, J., & Thelwall, M. (2012). Research Blogs and the Discussion of Scholarly Information. PLoS ONE, 7(5). DOI: 10.1371/journal.pone.0035869

13 comments:

  1. Shouldn't a prerequisite of publishing this kind of paper be sharing the information / data used in the analyses?
    From the editing & publishing policies of PLoS One:
    Publication is conditional upon the agreement of the authors to make freely available any materials and information described in their publication that may be reasonably requested by others for the purpose of academic, non-commercial research.

    ReplyDelete
  2. Build it and they will come... I could list at least a dozen examples like this off the top of my head, and I for one would certainly contribute to a system that could log failures to share data.

    In this case, since it is a PLoS ONE paper, I would leave a comment on the paper highlighting that the data are not available and suggesting that the authors submit the data, e.g., to Dryad.

    ReplyDelete
  3. I do not understand why journals let people get away with this.

    ReplyDelete
    Replies
    1. This is the basic problem; until journals mend their ways, authors will get away with the bare minimum.

      As for "open data watch": I'd rethink that idea. You'd be covering practically every paper published.

      Delete
    2. Yeah - good point. Maybe I should start "Open data watch" but use it to highlight papers that actually share everything ...

      Delete
  4. You're right. It has to do more with me switching computers, etc. than anything else. I'll try to upload them when things calm down a bit.

    ReplyDelete
    Replies
    1. Thanks for the reply --- sooner would be better than later ... and wherever you share the data I would suggest posting a link from the PLOS One site for the paper ...

      Delete
  5. I am all for requiring people to provide raw data. But I would not be 100% about it and would provide room for exceptions, approved by editors, etc.

    How about:

    - data that are essentially written in pencil on reams and reams of paper
    - data that make no sense until they are visualized (and where analysis can begin only once the numbers are turned into images, not before, and images are provided in the main body of the paper)
    - qualitative or subjective data, all or representative sample of which are provided in the main body of the paper
    - data that can be processed in only one way, so everyone would always get the same resulting numbers
    - data formatted by software that cannot be read by any computer younger than 20 years?

    And assume that the lab is gone, people are gone, equipment is gone, money is gone and there are no resources to type in tens of thousands of handwritten numbers into an Excel sheet, or to translate data from old software to new (probably expensive) software.

    Should such papers be prevented from being published?

    ReplyDelete
    Replies
    1. Well, I hate rules, so I am OK with not being 100% adamant about it. But here are some points

      1. I am not against people sharing ideas, and even general impressions of what they have found. But in general, the less that is shared the more something is an anecdote and the less it is science. Again, I am not against sharing anecdotes. But if I were running a journal, I would discourage anecdotes and encourage papers where as much as possible was shared.

      2. The "as much as possible" to me would include methods, tools, data, etc. I suppose if someone described their methods perfectly and provided the exact raw materials that were used, one could in theory reproduce each stage of what they did. But in general, this is impossible. So for example, if you say "I made a multiple sequence alignment from these genbank sequences using this program with these settings" you should still share the alignment. The same applies to any other type of data.

      3. As for your specific examples above


      - data that are essentially written in pencil on reams and reams of paper

      Not sure what this could be but it could be scanned and shared.

      - data that make no sense until they are visualized (and where analysis can begin only once the numbers are turned into images, not before, and images are provided in the main body of the paper)

      Well, this is basically what happens with sequencing these days -- the raw data (images from sequencers) are usually not shared, but some have argued they should be. The only good argument I know of for not sharing such "raw" data would be storage capacity.

      - qualitative or subjective data, all or representative sample of which are provided in the main body of the paper

      If all is in the paper that is fine. If not, and if any of the stuff not in the paper was used in some way it should be shared. In fact, even if it wasn't used, it should be shared. I have seen (and blogged about) some examples where people cleaned up data, and only shared the stuff they found interesting/important. Well, if people don't have the raw data they can't tell if that cleaning up was valid ..

      - data that can be processed in only one way, so everyone would always get the same resulting numbers

      See alignment example above .. very very hard to imagine a real world example of this.

      - data formatted by software that cannot be read by any computer younger than 20 years?

      Well, sure this could be a complication. But you would be amazed at how many times I have heard this with people saying things like "I had this in an ancient version of Excel and I cannot open it anymore." and then it turns out that this is really a Microsoft "feature" where new versions say they cannot open old files directly (but you can open them from within Excel).

      Regarding

      "And assume that the lab is gone, people are gone, equipment is gone, money is gone and there are no resources to type in tens of thousands of handwritten numbers into an Excel sheet, or to translate data from old software to new (probably expensive) software."

      How was the data analyzed in the first place? If it was done by hand/manually, then just scan the sheets and share those .. no need to turn it into an Excel file, but if one does not convert it into digital form, one should share the analog information.

      Delete
  6. - data that can be processed in only one way, so everyone would always get the same resulting numbers

    This is an important consideration for computationally-oriented papers with simulated datasets. The simulated datasets can often be generated quickly and repeatably using the same pseudo random number generator and seed, but storing these datasets might take a million or more times the space needed to store the program that generates them.

    This might also be true for MCMC type analyses wherein the same chain can be regenerated from only the random number seed, but that can be more compute intensive than simple simulations.
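    To make the point above concrete, here is a minimal sketch (not from the paper; the function name and parameters are made up for illustration) of how a simulated dataset can be regenerated exactly from just a script and a seed, so only those need to be archived:

    ```python
    import random

    def simulate_dataset(seed, n=1000):
        """Deterministically regenerate a toy 'simulated dataset' from a seed."""
        rng = random.Random(seed)  # independent generator, fixed by the seed
        return [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Two runs with the same seed reproduce the data exactly, so sharing
    # the generating code plus the seed stands in for sharing the dataset.
    a = simulate_dataset(seed=42)
    b = simulate_dataset(seed=42)
    assert a == b
    ```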

    ReplyDelete
    Replies
    1. Reproducibility is really the key. If something can be regenerated with code then sharing the code is all that is needed ....

      Delete