The Journal of eScience Librarianship (JeSLIB) has just published the following two articles:
- An Analysis of Data Management Plans in University of Illinois National Science Foundation Grant Proposals by William H. Mischo, Mary C. Schlembach, and Megan N. O’Donnell
- Research Data MANTRA: A Labour of Love by Robin Rice
These two articles are part of Volume 3, Issue 1 of JeSLIB that will be published in January. An announcement will be made when the issue is published.
Registration is now open for the 7th annual University of Massachusetts and New England Area Librarian e-Science Symposium, to be held on Thursday, April 9, 2015. For details and to register, visit the 2015 e-Science Symposium conference site. Registration is on a first-come, first-served basis and will be capped at 90 people.
Submitted by guest contributor Nancy Glassman, Assistant Director for Informatics, D. Samuel Gottesman Library, Albert Einstein College of Medicine
In conjunction with Albert Einstein College of Medicine’s Faculty Development Program I lead an introduction to research data management workshop. Attendees usually include a mix of clinical and basic science faculty, as well as a few postdocs and graduate students. To set the stage at a recent workshop, I asked the group if they were surprised to have a librarian as the instructor. Taken aback by nodding heads around the table, I quickly recovered my composure and decided to make the most of this “teachable moment.”
All of the workshop’s attendees use the library’s resources and services, but as long as things are running smoothly and they find the information they need, they don’t really need to think about how it was made available to them. Many library users are unaware of what librarians actually do, and that’s just fine. But it’s worthwhile to take a few minutes to show researchers how the traditional library services they use almost every day require a similar, if not the same, skill set as managing research data.
Librarians are, arguably, the original data managers. Think about it. Librarians have been managing data and information in one form or another for thousands of years, practically since the dawn of the written word. Archaeologists in Turkey have found collections of stone tablets dating back to the 17th-13th centuries BCE containing early forms of metadata.(1) These examples describe metadata concepts such as attribution and versioning:
“Written by the Hand of Lu, son of Nuggassar, in the presence of Anuwanza …”
“This tablet was damaged. In the presence of Mahhuzi and Halwalu, I, Duda, restored it…” (1)
Fast forward to the library of the twenty-first century. We work and live in the era of big data in which “everything is available for free on the Internet.” Who makes sense of this information overload? Who selects, catalogs, curates, backs up, makes available relevant sources of information? Who helps users cite these resources properly? Who safeguards patron information?
- Librarians are expert at making data meaningful and easily discoverable. Look no further than the library’s catalog, a classic example of metadata in action. In medical libraries MeSH (Medical Subject Headings) is used to categorize material by subject. Author names and titles are standardized. Call numbers make it easy to find items on bookshelves.
- Although librarians are not copyright lawyers, we do have a lot of practice navigating copyright, licensing agreements, and open access as part of our regular activities. This includes negotiating with vendors and managing interlibrary loan, as well as public- and open-access initiatives (including the NIH Public Access Policy).
- Researchers rely on librarians for help in finding relevant, evidence-based information. In addition to being experienced searchers of online databases such as PubMed, Embase, and Web of Science, we also mine the “deep web” to find those elusive resources.
- Librarians are familiar with the rules and nuances of proper citation and attribution practices. We support many citation management programs, including EndNote, RefWorks, and Mendeley, and teach students how to cite correctly and avoid plagiarism.
- Data comes in a lot of different packages, and long-term preservation and data storage are important aspects of managing research data. Over the millennia we have maintained and preserved collections of tablets, scrolls, manuscripts, maps, audiovisual materials, print books and journals, e-books, e-journals, websites, blogs, wikis, and data sets.
Although the media and the volume of data have changed radically over time, the expertise to manage all of this remains essentially the same. Librarians are particularly adept at adapting to change. Helping researchers manage their data is a logical extension of a long-standing tradition.
After the workshop, one attendee approached me, and acknowledged that at first he was skeptical about taking a class on research data management led by a librarian, but after I described the ways traditional librarian skills apply, it all made sense. Conversations like this can open users’ eyes to librarians’ wide range of information management skills and may lead to new and interesting partnerships.
1. Casson L. Libraries in the ancient world. New Haven: Yale University Press; 2001. xii, 177 p. p 13.
The following announcement is posted on behalf of the ACRL Digital Curation Interest Group Team.
The ACRL Digital Curation Interest Group is looking for proposals for our Spring Webinars and for ALA Annual 2015. The group would like to host three webinars in the Spring and have 3-4 panelists for ALA Annual 2015. So please consider submitting a short abstract proposal!
CFP for our Spring webinars:
We invite proposals on topics germane to digital curation activities including (but not limited to) the following topics:
- Documentation and organization
- Digital preservation
- Digital curation software and tools
- Metadata specialists
- Non-institutional repositories
- Skills needed/Skills learned to tackle digital curation
- Specific data management procedures such as file naming
- Data purchased from vendors
- Careers in digital curation
- Digital curation lifecycle
We seek webinars of 60 minutes in length (including time for questions). If you have an idea for a webinar, please send a short description of it to Megan Toups at firstname.lastname@example.org by January 31, 2015.
CFP for ALA Annual 2015:
We are putting together a panel of 3-4 people to present for ~10 minutes each covering digital curation from a variety of perspectives. Panelists will present and then engage the audience in a productive conversation on digital curation.
We’d love to have a diverse set of panelists representing a variety of different digital curation perspectives–research data, archives and digital curation, theory, practice, etc. Want to be a part of this interesting panel? Please submit a short description of what you’d like to present to Megan Toups at email@example.com by January 31, 2015.
Thank you for your submissions!
The DCIG Team–Megan Toups, Suzanna Conrad, Rene Tanner
RDAP15, the sixth annual Research Data Access and Preservation Summit, is accepting proposals (max. 300 words) for panels, interactive posters, lightning talks, and discussion tables. Themes for RDAP15 were selected by this year’s planning committee with input from previous years’ attendees and RDAP community members.
These are the proposal deadlines for the 2015 RDAP Summit:
December 19, 2014: Panel Presentations Submissions Due
January 16, 2015: Interactive Posters and Lightning Talks Submissions Due
For further details see RDAP15’s Call for Proposals webpage.
The ALCTS interest group of ALA has issued a Call for Presentations for the program “Metadata Services for Research Data Management,” part of the ALCTS Virtual Preconference “Planning for the Evolving Role of Metadata Librarians,” which will be held prior to the ALA annual meeting in June 2015 in San Francisco. The deadline for proposals is this Friday, Dec. 5th. See the full announcement on the Metadata Interest Group blog.
OCLC is sponsoring a series of workshops that build upon the framework presented in its recent research report The Evolving Scholarly Record. Workshops will be held in Washington, DC, Chicago, San Francisco, and Amsterdam. Seating is limited so you are encouraged to register now. See announcement for further details.
The upcoming 2015 New England Science Boot Camp will be held June 17-19 on the beautiful campus of Bowdoin College in Brunswick, Maine. Plans for session topics and activities are currently underway and will be announced in the next few months.
By Andrew Creamer, Scientific Data Management Specialist, Brown University
The National Science Foundation (NSF) explains that Data Management Plans are to be “reviewed as an integral part of the proposal, coming under Intellectual Merit or Broader Impacts or both, as appropriate for the scientific community of relevance.” As the librarian responsible for writing data management and sharing plans, I was invited to be a part of my institution’s Broader Impacts Committee, which aims to “help Brown faculty and researchers respond effectively to the Broader Impacts criterion and other outreach requirements of governmental funding agencies.” For example, it helps to build collaborations between the K-12 educators in my state and the university’s researchers, and it promotes a database to share STEM curricula, among others.
The NSF views Broader Impacts through the lens of societal outcomes:
NSF values the advancement of scientific knowledge and activities that contribute to the achievement of societally relevant outcomes. Such outcomes include, but are not limited to: full participation of women, persons with disabilities, and underrepresented minorities in science, technology, engineering, and mathematics (STEM); improved STEM education and educator development at any level; increased public scientific literacy and public engagement with science and technology; improved well-being of individuals in society; development of a diverse, globally competitive STEM workforce; increased partnerships between academia, industry, and others; improved national security; increased economic competitiveness of the United States; and enhanced infrastructure for research and education.
Recently I was asked to speak at a Broader Impacts Workshop for faculty. In my presentation I focused on several ways that a proposal’s DMP can connect with the societal outcomes described in its Broader Impacts. For example, in their NSF DMPs researchers detail when and how they will make their data and research products available to other researchers and/or the public, and how they will archive and preserve access to those products after the project ends. They also outline the dissemination strategy for their projects’ research products, which can include citing and sharing the projects’ data, metadata, and code in their publications and presentations and depositing these items into a data-sharing repository. Retaining, preserving, and making data, metadata, and code accessible, along with the resulting publications, maximizes the potential for replication and reproduction of research results. These practices therefore further the impact of the project by making it possible for its data and research products to be discovered, used, repurposed, and cited to aid in new research and discoveries.
Ways the Library Can Support Broader Impacts and Preserve and Disseminate Related Research Products
- The library can advise on selecting optimal file formats and media in which data can be stored, shared, and accessed. Proprietary software and data formats used to collect and capture data can impact the potential for a dataset to be of use by others. Researchers can work with the library to identify and export their data files into data-sharing and preservation-friendly formats.
- The library can collaborate with researchers to create the documentation and contextual details (metadata) that can make their data discoverable and meaningful to others. The library can help researchers locate metadata schema, standards and ontologies for a specific discipline, and it can also help to create metadata for data being prepared for upload into a data-sharing repository.
- Depositing their Broader Impacts curricula and data into a repository is a way for researchers to help ensure that their research products will be discovered and used by others. It is also the easiest way to locate and access data years after a project ends. Libraries can offer a number of repository-related services. They can help researchers choose and evaluate potential repositories. The library can also offer an institutional repository (IR) as an option for some researchers to publish, archive, and preserve their project’s data after their projects end.
- More libraries are offering a global persistent identifier service for researchers wishing to maximize the dissemination and discoverability of their datasets. A digital object identifier (DOI) is one way the library can provide researchers and the public a way to locate and cite data. The library, for example through EZID, can issue researchers DOIs even if their datasets are not in the IR. For example, the library can issue researchers DOIs for the datasets they have deposited in NCBI databases that have accession numbers, so they can then cite these datasets in their publications, presentations, and grant reports. The library also mints DOIs for researchers who are required by publishers to submit a DOI for the datasets underlying their manuscripts or for compliance with their publishers’ data availability and data archiving policies.
While researchers may not have thought about the library when it comes to societal outcomes and disseminating research data, we librarians hope that they will begin to see the library as the ideal institutional space to plan for data retention: appraising which research products should be retained, archived, and preserved; exploring options for sharing and long-term preservation-friendly file formats; creating documentation and metadata to make data discoverable and useful; publishing and archiving data in a repository; citing data; and disseminating and measuring the impact of data.
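As a rough illustration of how a DOI lets readers both locate and cite a dataset, the sketch below turns a DOI into a resolvable link through the public doi.org resolver and assembles a simple citation string. The function names, the creator/year/title/publisher/identifier ordering, and the DOI itself are assumptions made for illustration; the DOI is a made-up placeholder, not a real dataset, and individual publishers may require a different citation style.

```python
def doi_url(doi):
    """Return a resolvable link for a DOI using the public doi.org resolver."""
    doi = doi.strip()
    # Accept the common "doi:" prefix and drop it before building the URL.
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return "https://doi.org/" + doi


def cite_dataset(creators, year, title, publisher, doi):
    """Assemble a simple dataset citation (illustrative ordering only)."""
    return f"{creators} ({year}). {title}. {publisher}. {doi_url(doi)}"


# Hypothetical dataset details; the DOI below is a placeholder, not a real record.
citation = cite_dataset(
    creators="Doe, J.",
    year=2014,
    title="Example Stream Temperature Readings",
    publisher="Example University Repository",
    doi="doi:10.5072/FK2EXAMPLE",
)
print(citation)
```

A researcher could paste a string like this into a manuscript's data availability statement or a grant report, adjusting the format to whatever the publisher or funder asks for.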
From around the web (mostly from the ALA job list): here’s a list of recent job openings that may be of interest to the e-Science Community:
California State University, East Bay Library: Health Sciences and Scholarly Communications Librarian
California State University, San Marcos: Health Sciences and Human Services Librarian
Cornell University Library: Director of Preservation Services
Dartmouth College: Research and Education Librarian, Biomedical Libraries
Institute for Health Metrics and Evaluation, University of Washington: Data Indexer
Iowa State University: Science & Technology Librarian (Engineering & Physical Sciences)
New York University: Research Data Management Librarian
Pennsylvania State University: Science Data Librarian
Tufts University: Research & Instruction Librarian
University of California at Los Angeles (UCLA): Geospatial Resources Librarian
University of New Hampshire: Life Sciences and Agriculture Librarian
University of New Mexico Libraries: Research Services Librarian for the Engineering, Life & Physical Sciences
The following announcement has been posted on behalf of the Boston Library Consortium and Digital Science. For information about the workshop or to register, please contact Susan Stearns at firstname.lastname@example.org.
Addressing the Emerging Needs of the Research Ecosystem: An Invitation
The Boston Library Consortium and Digital Science invite you to attend a free workshop focused on the management, dissemination, and collaboration around research data in the university. Today’s research ecosystem is increasingly complex and includes players from many different departments and groups within the academy: research and sponsored program staff, the CIO and IT staff, library deans/directors and their scholarly communications and research data management librarians, university marketing and communications staff and, of course, the researchers themselves.
Meeting the diverse requirements of these varied groups in efficient and cost-effective ways requires that quality data are able to flow in and out of university information systems, often populating such diverse technologies as grants management systems, researcher profiles, institutional repositories, and enterprise data warehouses. Non-traditional measures of research impact such as Altmetrics and the increasingly prevalent funder mandates create new challenges for universities as they look to ensure a robust research information management environment.
Our goal for this workshop is to assemble a representative cross-section of stakeholders from a variety of BLC institutions. The workshop will bring together experts from Digital Science, a technology company with a focus on the sciences that provides software and tools to support the research ecosystem, and speakers with direct experience of evaluating and implementing research information management systems and services. We hope you will actively encourage your colleagues to attend.
Two options are available for the workshop as indicated below. BLC is considering offering live-streaming of one or both sessions if there is adequate interest.
Friday, November 21st at Tufts University, Medford Campus – 9:30am – 2:30pm; lunch included
Tuesday, November 25th at the University of Massachusetts Medical School, Worcester – 10:00am – 3:00pm; lunch included
Workshop speakers will include Jonathan Breeze, CEO of Symplectic; Mark Hahnel, CEO of Figshare; and the Vice Provost for Research or equivalent from a local Boston University Consortium member institution.
To register or for further information, send an e-mail to email@example.com indicating which of the above sessions you are interested in attending.
Submitted by guest contributor Daina Bouquin, Data & Metadata Services Librarian, Weill Cornell Medical College of Cornell University, firstname.lastname@example.org
The role of the data librarian extends far beyond helping researchers write data management plans. Rather, librarians working where data-intensive science is happening spend their time answering questions about the entire data life cycle: data pre-processing, analysis, visualization, and data validation are all important, and sometimes highly intricate, parts of the research process. As a data services librarian I have personally found myself advising researchers to rework their workflows to make use of the tools available to them, to help make their research more replicable, efficient, and shareable at these various stages of the research process. Unfortunately, though, I do not always have hands-on experience with the tools and techniques I’m advising researchers to use, nor is it possible for me to have experience with every tool available to researchers in computational environments. However, I do believe it’s important for me to get as much hands-on experience as possible with the most useful, commonly used tools, so that I can develop both refined expertise in my field and empathy for my patrons. E-Science Portal editor Donna Kafel recently wrote a wonderful post in which she reflected upon, and pulled advice from others about, self-learning and the challenges associated with it. Here, I aim to outline how I’m making use of some of the excellent advice offered in that post, while focusing on an area of the data life cycle that I believe is sometimes oversimplified in discussion: the version control processes inherent in good data management.
“Be single-minded. Identify one topic or skills you want to learn and focus on mastering it.” – Donna Kafel, Challenges of Self Learning
I decided the advice I would take to heart most fiercely from Donna’s self-learning post was the above take-away. It rang true with me because I regularly run into problems when I try to tackle too many new topics at once. If I don’t use something regularly, it’s difficult for me to become proficient, especially with technically challenging tools. It makes sense that I should focus on mastering a single skill before moving on to anything new, but how to choose what to focus on? This is where Version Control Systems (VCS) or “Revision Control Systems” come in. VCSs are incredibly diverse in both complexity and application, and while I rarely see them discussed at length by librarians, I find them to be exceedingly important to researchers in collaborative environments. I regularly read discussions on file naming as an approach to control versioning and to aid researchers in a multitude of data management processes, and I do not want to discredit that discussion because it is so important (check out some of the great writing on this topic right here on the portal blog!), but I’m hoping to extend that conversation a bit more in this post. Below I focus on Git as both a self-learning opportunity and an incredibly useful VCS.
Git is a technology that “records changes to a file or set of files over time so that you can recall specific versions later”1. You can use Git for just about any type of file, but it is primarily used by people working with code files. Oftentimes, people use simpler version-control methods, like copying files into a time-stamped directory, but this tactic is risky: one could forget which directory files are stored in or accidentally write over the wrong file. Careful file naming helps here, but an even better approach is using a tool like Git. 1
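To make the idea of recording changes and recalling specific versions more concrete, here is a minimal toy sketch in Python. It is not how Git stores data; the class name TinyHistory, its methods, and the sample readings are all invented for illustration. It only shows the core idea that every committed snapshot gets an identifier and stays retrievable, so nothing is overwritten by accident.

```python
import hashlib


class TinyHistory:
    """Toy version store: keeps every committed snapshot of one file's text."""

    def __init__(self):
        self.snapshots = {}  # version id -> file contents at that version
        self.log = []        # version ids, oldest first

    def commit(self, contents, message):
        # Build an id from the log position, message, and contents.
        # This is illustrative only and is NOT Git's actual object format.
        payload = f"{len(self.log)}:{message}:{contents}".encode("utf-8")
        version_id = hashlib.sha1(payload).hexdigest()[:8]
        self.snapshots[version_id] = contents
        self.log.append(version_id)
        return version_id

    def recall(self, version_id):
        # Earlier versions remain available after later commits.
        return self.snapshots[version_id]


history = TinyHistory()
first = history.commit("temp_c,site\n21.5,A\n", "raw readings")
second = history.commit("temp_c,site\n21.5,A\n22.1,B\n", "add site B")

# The first version can still be recalled after the second commit.
print(history.recall(first))
```

Compared with copying files into time-stamped directories, the history here lives in one place and each version is addressed by an identifier rather than by remembering where a copy was saved.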
Git is what is called a Distributed Version Control System (DVCS), but it is easier to understand DVCS if you first understand Centralized Version Control Systems (CVCS). CVCSs have a single server that contains all the versioned files a group of people are working on. Individuals can “check out” files from that central place so everyone knows to some extent what other people on the project are doing. Admins have control over who can do what so there is some centralized authority making it easier to manage than local version control solutions. Examples of CVCSs include the popular Apache tool Subversion. 1
There are, though, some drawbacks to using a CVCS, namely the single-server situation. If the server goes down, no one can make any changes to anything that’s being worked on. Worse, if the server is damaged or corrupted, the individuals working on the project are completely reliant on there being sufficient backups of all versions of their files. This is, again, quite risky.
To mitigate this problem, DVCSs were developed. In distributed systems (like Git) people do not just check out the latest version of a file; they completely “mirror” the repository, including its full history. In this way, if the server dies, anyone who mirrored the repository can copy it back to the server and restore it. Every such mirror is a full backup of the data.
Distributed systems are also capable of working well with several remote repositories at once, allowing people to collaborate with multiple groups in different ways concurrently on the same project. 1
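The difference between checking out a file and mirroring a repository can be sketched with another toy example. Again, this is an illustration under assumptions rather than Git's real implementation: the Repo class, its methods, and the sample data are hypothetical. The point is simply that a clone carries the full history, so a lost server can be rebuilt from any collaborator's copy.

```python
import copy


class Repo:
    """Toy repository: an ordered list of (message, contents) commits."""

    def __init__(self, history=None):
        self.history = list(history or [])

    def commit(self, message, contents):
        self.history.append((message, contents))

    def clone(self):
        # A clone is a full, independent mirror of the entire history,
        # not just a copy of the latest version.
        return Repo(copy.deepcopy(self.history))


# A shared server repository with two commits.
server = Repo()
server.commit("raw readings", "21.5\n")
server.commit("add site B", "21.5\n22.1\n")

# A collaborator mirrors the repository onto a laptop.
laptop = server.clone()

# Simulate losing the server, then restore it from the laptop's mirror.
server = None
restored = laptop.clone()
print(len(restored.history))
```

In a centralized setup the laptop would hold only the files it checked out, so the same failure would depend entirely on server backups.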
However, I did not decide to focus my single-minded self-learning on Git just because it is so useful for version control; I wanted to learn as many skills as possible while still staying focused. You see, in learning to use Git, I’d have more opportunity to learn about the Bash Unix shell. Although I have some background in using command line interfaces, I am still a beginner with the Terminal, and I figured that learning Git would make me much more proficient at navigating my computer via the command line, which in turn could help me build the confidence to learn how to use a Linux operating system. Learning Git would also help me learn how to use GitHub, which is growing by the day in popularity as a place for people to store and share code. The GitHub graphical user interface would also help get me off the ground. So I found Git to be the great door-opener to many other skillsets on my list of self-learning goals.
Thus, I have begun learning to use Git and GitHub. I was able to get some hands-on experience with it by participating in a Software Carpentry Bootcamp this past summer, but didn’t find the time to follow up on it, because I was not staying focused on learning a single new skill. So now I am regrouping. I have primarily been using the resources I am providing below; however, there is so much more out there. These resources are just a great place to start, and having made some headway in my own reading of these documents, I hope to be trying out Git more in the very near future.
Pro Git Great free eBook and videos on getting started with and better understanding Git and version control. I used this excellent book in writing this post.
Pro Git Documentation External Links Tutorials, books and videos, to help get you started.
Even if you don’t think learning to use Git is right for you, learning more about the tools researchers are using to work with their data and getting a look under the hood about how those technologies work can be a great way to continue to grow professionally. I hope you all have the opportunity to join me in exploring a new skill and share your experiences with the e-Science Portal Community.
1. Chacon, S. (2014). Pro Git. Berkeley, CA: Apress. http://git-scm.com/book/en/v2
And just in case you weren’t already overwhelmed, here’s a great TED Blog on places to learn how to code!
Science and the World’s Future
Lecture given by Bruce Alberts, Professor of Science and Education, UCSF
Part of the Sanger Series at Virginia Commonwealth University, Richmond, VA
Bruce Alberts’ lecture was a review of his career that focused on the lessons he learned along the way and how they are important for the future of science research and the earth.
He failed his initial PhD exam at Harvard, but earned the degree 6 months later after more research. This taught him that having a good strategy in science research was a key to success, and that negative results were okay.
Alberts started his own lab at the age of 28, and he believes that it should be easier for researchers to set up their own labs earlier in their careers – so funding needs to change.
After many years of research, Alberts became president of the National Academy of Sciences (NAS) and started learning about science policy. Science allows humans to gain a deep understanding of the natural world, and we can use this knowledge to predict future events or problems. Many government people wanted the NAS reports to be kept secret or have changes made, but he felt that science was for all and that NAS was providing independent policy advice based on science, so there could be no changes or secrecy. Now the full text of a report goes on the website when the government gets it.
Alberts’ work with NAS and as editor for Science magazine led him to international work with science academies. Alberts said that science and technology developed in North America or Europe can’t always be exported to the countries that need it. Countries need national, merit-based science institutions to help with policy and support science. Only local scientists have the credibility to rescue a nation from misguided local policies. Alberts’ examples were AIDS in Africa and the polio vaccine in Nigeria. Alberts feels that, for the success of every nation, the world needs more of the creativity, rationality, openness, and tolerance that are inherent to science: what Pandit Jawaharlal Nehru of India called “scientific temper.”
Alberts suggested strategies to help the world’s future:
- Education – active learning, open access, start by changing college science teaching since that is where high school science teachers learn science. (Science special issue April 19, 2013: Grand Challenges in Science Education and Education Portal http://portal.scienceintheclassroom.org/ )
- Promote science knowledge as a public good – open access again, not just papers but other educational materials, e.g., http://www.ibiology.org/
- Empowering the best young scientists – Global Young Academy
- Developing scientists as connectors – science communication, scientists need to connect with policy makers and the public, such as the AAAS Science & Technology Policy Fellowship program
- Develop and harness research evidence to improve policies.
What can librarians do
Obviously information literacy is huge when it comes to making sure students, and future voting adults, can find the information they need to make decisions about health, technology, and science. Teaching regularly about the reliability of web sites and other information sources must be part of this training.
I think librarians can also help harness the research evidence needed to improve policies. We have excellent search skills and many of us already have experience doing systematic reviews, which is what is needed to find all the evidence.
If you want to read more about Bruce Alberts, this interview by Jane Gitschier is good: Scientist Citizen: An Interview with Bruce Alberts
I liked this quote used by Alberts:
“The society of scientists is simple because it has a directing purpose: to explore the truth. Nevertheless, it has to solve the problem of every society, which is to find a compromise between the individual and the group. It must encourage the single scientist to be independent, and the body of scientists to be tolerant. From these basic conditions, which form the prime values, there follows step by step a range of values: dissent, freedom of thought and speech, justice, honor, human dignity and self respect.
Science has humanized our values. Men have asked for freedom, justice and respect precisely as the scientific spirit has spread among them.”
— Jacob Bronowski, Science and Human Values, 1956
The e-Science Portal design team has been conducting a series of online Optimal Workshop user studies of the portal over the past few months. In May the team had issued a Call for Participation for Usability Testing of the e-Science Portal, and we were happy to receive over ninety volunteers! With these volunteers’ participation, we’ve conducted three separate tests and gleaned valuable information from their responses. With this information, we’ll be “tweaking” the design of the portal, but before we do so, we need further input from a new pool of participants.
Whether or not you’re familiar with e-Science and/or the e-Science Portal for New England Librarians, we need you! On average the test takes 12-15 minutes to complete. You do not need to be a web design expert or have previous experience in user testing, and the instructions are easy.
To volunteer, please complete the following e-Science Portal Usability Testing form at https://docs.google.com/forms/d/1Wb6kk4QYtfvi4bZuVMnRoZxF_KQmdYUTsQrnK1VDWDE/viewform by Monday, October 27th.
Thank you for participating,
Donna Kafel, Coordinator for the e-Science Portal
Posted on behalf of Chris Erdmann, Head Librarian, Harvard-Smithsonian Center for Astrophysics, Harvard.
Workshop: Improving integrity in scientific research: How openness can facilitate reproducibility
Time: 3:00pm – 5:30pm
Date: Tuesday, November 25th
Location: Center for Astrophysics, Phillips Auditorium
“Using Zenodo to share and safely store your research data”
Lars Holm Nielsen, CERN
Is your 10-year-old dataset stored safely? Is it openly accessible? In the workshop, you will learn how to preserve, share and receive credit for your research data using Zenodo (https://zenodo.org/), created by OpenAIRE and CERN, and supported by the European Commission. We will explore the different aspects and issues related to research data and software publishing, why preservation is important, how to link it up and make your research data discoverable. We will also see how research software hosted by GitHub can be automatically preserved with just a few clicks. In addition, we will look at how research communities can be created in Zenodo to support a variety of publication activities.
Requirements: None, but it’s highly preferable to bring your own laptop and an example research output (dataset, software, presentation, poster, publication, …) you would like to share to be able to follow the interactive part of the workshop.
Improving integrity in scientific research: How openness can facilitate reproducibility
Courtney Soderberg, COS
Have you heard about the reproducibility crisis in science (e.g., in AAAS and The Economist) and worried about false positive results? Ever wondered how you could increase the reproducibility of your own work and help the accumulation of scientific knowledge? Join us for a workshop on reproducible research, hosted by the Center for Open Science.
This presentation will briefly review the evidence and challenges for reproducibility and discuss how greater transparency and openness across the entire scientific workflow (from project inception, to data sets and analysis, to publication and beyond) can increase levels of reproducibility. It will also include a hands-on demonstration of the Open Science Framework (http://osf.io/), a free, open source web application developed to help researchers connect, document, and share all aspects of their scientific workflow to increase the reproducibility of their work.
Attendees are encouraged to bring laptops and research materials (stimuli, analysis scripts, data sets, etc.) they would like to share so they can follow along with the hands-on section of the presentation.
How can you tell if data has been useful to other researchers?
Tracking how often data has been cited (and by whom) is one way, but data citations only tell part of the story, part of the time. (The part that gets published in academic journals, if and when those data are cited correctly.) What about the impact that data has elsewhere?
We’re now able to mine the Web for evidence of diverse impacts (bookmarks, shares, discussions, citations, and so on) for diverse scholarly outputs, including data sets. And that’s great news, because it means that we now can track who’s reusing our data, and how.
All of this is still fairly new, however, which means that you likely need a primer on data metrics beyond citations. So, here you go.
In this post, I’ll give an overview of the different types of data metrics (including citations and altmetrics), the “flavors” of data impact, and specific examples of data metric indicators.
What do data metrics look like?
There are two main types of data metrics: data citations and altmetrics for data. Each of these types of metrics is important for its own reasons, and offers the ability to understand different dimensions of impact.
Data citations
Much like traditional, publication-based citations, data citations are an attempt to track data’s influence and reuse in scholarly literature.
The reason why we want to track scholarly data influence and reuse? Because “rewards” in academia are traditionally counted in the form of formal citations to works, printed in the reference list of a publication.
There are two ways to cite data: cite the data package directly (often by pointing to where the data is hosted in a repository), and cite a “data paper” that describes the dataset, functioning primarily as detailed metadata, and offering the added benefit of being in a format that’s much more appealing to many publishers.
In the rest of this post, I’m going to mostly focus on metrics other than citations, which are being written about extensively elsewhere. But first, here’s some basic information on data citations that can help you understand how data’s scholarly impacts can be tracked.
How data packages are cited
Much like how citations to publications differ depending on whether you’re using Chicago style or APA style formatting, citations to data tend to differ according to the community of practice and the recommended citation style of the repository that hosts data. But there is a core set of minimums for what should be included in a citation. Jon Kratz has compiled these “core elements” (as well as “common elements”) over on the DataPub blog. The core elements include:
Creator(s): Essential, of course, to publicly credit the researchers who did the work. One complication here is that datasets can have large (into the hundreds) numbers of authors, in which case an organizational name might be used.
Date: The year of publication or, occasionally, when the dataset was finalized.
Title: As is the case with articles, the title of a dataset should help the reader decide whether your dataset is potentially of interest. The title might contain the name of the organization responsible, or information such as the date range covered.
Publisher: Many standards split the publisher into separate producer and distributor fields. Sometimes the physical location (City, State) of the organization is included.
Arguably the most important principle? The use of a persistent identifier like a DOI, ARK, or Handle. They’re important for two reasons: no matter if the data’s URL changes, others will still be able to access it; and PIDs provide citation aggregators like the Data Citation Index and Impactstory.org an easy, unambiguous way to parse out “mentions” in online forums and journals.
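To make the core elements above concrete, here is a minimal sketch that assembles them into a single citation string. The element order and punctuation are just one illustrative choice, not an official style, and the names, title, publisher, and DOI in the example are all made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataCitation:
    """Core elements of a data citation, as listed above."""
    creators: List[str]   # people, or a single organizational name
    year: int             # year of publication (or when the data was finalized)
    title: str
    publisher: str        # repository or organization distributing the data
    identifier: str       # persistent identifier, e.g. a DOI


def format_citation(c: DataCitation) -> str:
    """Render the core elements in one illustrative (not official) order."""
    creators = "; ".join(c.creators)
    # Present DOIs as resolvable links; leave other identifiers untouched.
    pid = c.identifier
    if pid.lower().startswith("doi:"):
        pid = "https://doi.org/" + pid[4:].strip()
    elif pid.startswith("10."):
        pid = "https://doi.org/" + pid
    return f"{creators} ({c.year}). {c.title}. {c.publisher}. {pid}"


example = DataCitation(
    creators=["Doe, J.", "Roe, R."],            # made-up names
    year=2014,
    title="Example field survey measurements",  # made-up title
    publisher="Example Data Repository",        # made-up publisher
    identifier="doi:10.1234/example.5678",      # made-up DOI
)
print(format_citation(example))
```

In practice, the repository that hosts your data will usually recommend its own citation format, and that recommendation should win over a home-grown one like this.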
It’s worth noting, however, that as few as 25% of journal articles tend to formally cite data. (Sad, considering that so many major publishers have signed on to FORCE11’s data citation principles, which include the need to cite data packages in the same manner as publications.) Instead, many scholars reference data packages in their Methods section, forgoing formal citations, making text mining necessary to retrieve mentions of those data.
How to track citations to data packages
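When data are only mentioned in a Methods section, the text mining involved can start with something as simple as scanning article text for DOIs. The sketch below is a rough illustration under stated assumptions: the regular expression is a common heuristic rather than a complete DOI grammar, the function names are hypothetical, and the sample text and DOI are made up.

```python
import re
from typing import List

# Heuristic pattern for modern DOIs: "10.", a 4-9 digit registrant prefix,
# a slash, then a suffix that runs until whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text: str) -> List[str]:
    """Return DOIs found in text, lowercased and de-duplicated in order."""
    found: List[str] = []
    for match in DOI_PATTERN.findall(text):
        # Trim punctuation that usually belongs to the sentence, not the DOI.
        # (Assumption: a few real DOIs do end in ")" and would be clipped.)
        doi = match.rstrip(".,;:)]").lower()
        if doi not in found:
            found.append(doi)
    return found


def mentions(text: str, dataset_doi: str) -> bool:
    """True if the given DOI appears in the text (DOIs are case-insensitive)."""
    return dataset_doi.lower() in find_dois(text)


methods = (
    "Occurrence records were downloaded from an example repository "
    "(doi:10.1234/example.5678). Analysis followed standard practice."
)
print(find_dois(methods))
print(mentions(methods, "10.1234/EXAMPLE.5678"))
```

A scan like this only catches mentions that include the identifier; datasets referenced by name or by repository URL alone would need additional matching.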
When you want to track citations to your data packages, the best option is the Data Citation Index. The DCI functions similarly to Web of Science. If your institution has a subscription, you can search the Index for citations that occur in the literature that reference data from a number of well-known repositories, including ICPSR, ANDS, and PANGAEA.
Here’s how: log in to the DCI, then head to the home screen. In the Search box, type in your name or the dataset’s DOI. Find the dataset in the search results, then click on it to be taken to the item record page. On the item record, find and click the “Create Citation Alert” button on the right-hand side of the page, where you’ll also find a list of articles that reference that dataset. Now you have a list of the articles that reference your data to date, and you’ll also receive automated email alerts whenever someone new references your data.
Another option comes from CrossRef Search. This experimental search tool works for any dataset that has a DataCite DOI and is referenced in the scholarly literature that’s indexed by CrossRef. (DataCite issues DOIs for Figshare, Dryad, and a number of other repositories.) Right now, the search is a very rough one: you’ll need to view the entire list of DOIs, then use your browser search (often accessed by hitting CTRL + F or Command + F) to check the list for your specific DOI. It’s not perfect–in fact, sometimes it’s entirely broken–but it does provide a view into your data citations not entirely available elsewhere.
How data papers are cited
Data papers tend to be cited like any other paper: by recording the authors, title, journal of publication, and any other information that’s required by the citation style you’re using. Data papers are also often cited using permanent identifiers like DOIs, which are assigned by publishers.
How to find citations for data papers
There’s no guarantee that your data paper is included in any given citation database, though, since data paper journals are still a niche publication type in some fields, and thus aren’t tracked by some major databases. You’ll be smart to follow up your database search with a Google Scholar search, too.
Altmetrics for data
Citations are good for tracking the impact of your data in the scholarly literature, but what about other types of impact, among other audiences like the public and practitioners?
Altmetrics are indicators of the reuse, discussion, sharing, and other interactions humans can have with a scholarly object. These interactions tend to leave traces on the scholarly web.
Altmetrics are so broadly defined that they include pretty much any type of indicator sourced from a web service. For the purposes of this post, we’ll separate out citations from our definition of altmetrics, but note that many altmetrics aggregators tend to include citation data.
There are two main types of altmetrics for data: repository-sourced metrics (which often measure not only researchers’ impacts, but also repositories’ and curators’ impacts), and social web metrics (which more often measure other scholars’ and the public’s use and other interactions with data).
First, let’s discuss the nuts and bolts of data altmetrics. Then, we’ll talk about services you can use to find altmetrics for data.
Altmetrics for how data is used on the social web
Data packages can be shared, discussed, bookmarked, viewed, and reused using many of the same services that researchers use for journal articles: blogs, Twitter, social bookmarking sites like Mendeley and CiteULike, and so on. There are also a number of services that are specific to data, and these tend to be repositories with altmetric “indicators” particular to that platform.
For an in-depth look into data metrics and altmetrics, I recommend that you read Costas et al’s report, “The Value of Research Data” (2013). Below, I’ve created a basic chart of various altmetrics for data and what they can likely tell us about the use of data.
Quick caveat: aside from the Costas et al report, there’s been little research done into altmetrics for data. (DataONE, PLOS, and California Digital Library are in fact the first organizations to do major work in this area, and they were recently awarded a grant to do proper research that will likely confirm or negate much of the below list. Keep an eye out for future news from them.) The metrics and their meanings listed below are, at best, estimations based on experience with both research data and altmetrics.
Repository- and publisher-based indicators
Note that some of the repositories below are primarily used for software, but can sometimes be used to host data, as well.
What it might tell us
Akin to “favoriting” a tweet or underlining a favorite passage in a book, GitHub stars may indicate that someone who has viewed your dataset wants to remember it for later reference.
A user is interested enough in your dataset (stored in a “repository” on GitHub) that they want to be informed of any updates.
A user has adapted your code for their own uses, meaning they likely find it useful or interesting.
GitHub, Impactstory, PlumX
Ratings & Recommendations
What do others think of your data? And do they like it enough to recommend it to others?
Dryad, Figshare, and most institutional and subject repositories
Views & Downloads
Is there interest in your work, such that others are searching for and viewing descriptions of it? And are they interested enough to download it for further examination and possible future use?
Dryad, Figshare, and IR platforms; Impactstory (for Dryad & Figshare); PlumX (for Dryad, Figshare, and some IRs)
Implicit endorsement. Do others like your data enough to share it with others?
Figshare, Impactstory, PlumX
Supplemental data views, figure views
Are readers of your article interested in the underlying data?
PLOS, Impactstory, PlumX
A user is interested enough in your dataset that they want to be informed of any updates.
Social web-based indicators
What it might tell us
tweets that include links to your product
Others are discussing your data–maybe for good reasons, maybe for bad ones. (You’ll have to read the tweets to find out.)
PlumX, Altmetric.com, Impactstory
Delicious, CiteULike, Mendeley
Bookmarks may indicate that someone who has viewed your dataset wants to remember it for later reference.
Impactstory, PlumX; Altmetric.com (CiteULike & Mendeley only)
Mentions (sometimes also called “citations”)
Do others think your data is relevant enough to include it in Wikipedia articles?
ResearchBlogging, Science Seeker
Blog post mentions
Is your data being discussed in your community?
Altmetric.com, PlumX, Impactstory
How to find altmetrics for data packages and papers
Aside from looking at each platform that offers altmetrics indicators, consider using an aggregator, which will compile them from across the web. Most altmetrics aggregators can track altmetrics for any dataset that’s either got a DOI or is included in a repository that’s connected to the aggregator. Each aggregator tracks slightly different metrics, as we discussed above. For a full list of metrics, visit each aggregator’s site.
Impactstory (full disclosure: my current employer) easily tracks altmetrics for data uploaded to Figshare, GitHub, Dryad, and PLOS journals. Connect your Impactstory account to Figshare and GitHub and it will auto-import your products stored there and find altmetrics for them. To find metrics for Dryad datasets and PLOS supplementary data, provide DOIs when adding products one-by-one to your profile, and the associated altmetrics will be imported. Here’s an example of what altmetrics for a dataset stored on Dryad look like on Impactstory.
PlumX tracks similar metrics, and offers the added benefit of tracking altmetrics for data stored on institutional repositories, as well. If your university subscribes to PlumX, contact the PlumX team about getting your data included in your researcher profile. Here’s what altmetrics for a dataset stored on Figshare look like on PlumX.
Altmetric.com can track metrics for any dataset that has a DOI or Handle. To track metrics for your dataset, you’ll either need an institutional subscription to Altmetric or the Altmetric bookmarklet, which you can use when on the item page for your dataset on a website like Figshare or in your institutional repository. Here’s what altmetrics for a dataset stored on Figshare look like on Altmetric.com.
Flavors of data impact
While scholarly impact is very important, it’s far from the only type of impact one’s research can have. Both data citations and altmetrics can be useful in illustrating these flavors. Take the following scenarios for example.
Useful for teaching
What if your field notebook data were used to teach undergraduates how to use and maintain their own field notebooks? Or if a longitudinal dataset you created were used to help graduate students learn the programming language R? These examples are fairly common in practice, and yet they’re often not counted when considering impacts. Potential impact metrics could include full-text mentions in syllabi, views & downloads in Open Educational Resource repositories, and GitHub forks.
Reuse for new discoveries
Researcher and open data advocate Heather Piwowar (full disclosure: the co-founder of Impactstory and my boss) once noted, “the potential benefits of data sharing are impressive: less money spent on duplicate data collection, reduced fraud, diverse contributions, better tuned methods, training, and tools, and more efficient and effective research progress.” If those outcomes aren’t indicative of impact, I don’t know what is! Potential impact metrics could include data citations in the scholarly literature, GitHub forks, and blog post and Wikipedia mentions.
Curator-related metrics
Could a view-to-download ratio be an indicator of how well a dataset has been described and how usable a repository’s UI is? Or of the overall appropriateness of the dataset for inclusion in the repository? Weber et al (2013) recently proposed a number of indicators that could get at these and other curatorial impacts upon research data, indicators that are closely related to previously-proposed indicators by Ingwersen and Chavan (2011) at the GBIF repository. Potential impact metrics could include those proposed by Weber et al and Ingwersen & Chavan, as well as a repository-based view-to-download ratio.
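As a rough illustration of the repository-based view-to-download ratio mentioned above, the sketch below computes views per download for a few datasets. The function name and the usage counts are hypothetical, and how to read the result (for example, whether a high ratio points to a weak description or simply to casual browsing) remains the open question raised in the paragraph.

```python
from typing import Optional


def view_to_download_ratio(views: int, downloads: int) -> Optional[float]:
    """Views per download; None when there are no downloads to divide by."""
    if views < 0 or downloads < 0:
        raise ValueError("counts must be non-negative")
    if downloads == 0:
        return None
    return views / downloads


# Made-up counts for three hypothetical datasets: (views, downloads).
usage = {
    "dataset-a": (400, 100),
    "dataset-b": (250, 10),
    "dataset-c": (30, 0),
}

for name, (views, downloads) in usage.items():
    ratio = view_to_download_ratio(views, downloads)
    if ratio is None:
        label = "no downloads yet"
    else:
        label = f"{ratio:.1f} views per download"
    print(f"{name}: {label}")
```

Returning None for datasets without downloads keeps them visible in a report instead of hiding them behind a division error or an arbitrary placeholder value.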
Ultimately, more research is needed into altmetrics for datasets before these flavors–and others–are fully understood.
Now that you know about data metrics, how will you use them?
Some options include using them in grant applications and in your tenure and promotion dossier, and demonstrating the impacts of your repository to administrators and funders. I’d love to talk more about this on Twitter.
Recommended reading
CODATA-ICSTI Task Group. (2013). Out of Cite, Out of Mind: The current state of practice, policy, and technology for the citation of data [report]. doi:10.2481/dsj.OSOM13-043
Costas, R., Meijer, I., Zahedi, Z., & Wouters, P. (2013). The Value of research data: Metrics for datasets from a cultural and technical point of view [report]. Copenhagen, Denmark. Knowledge Exchange. www.knowledge-exchange.info/datametrics
Submitted by Donna Kafel, Project Coordinator for the e-Science Portal.
Here are some recent job postings for science, health sciences, and data librarians at various institutions across the US and Canada.
California State University, East Bay Library, Health Sciences and Scholarly Communications Librarian: https://csucareers.calstate.edu/Detail.aspx?pid=41475
Carilion Clinic, Roanoke, VA, Clinical Research Librarian: https://www.healthcaresource.com/carilion/index.cfm?fuseaction=search.jobDetails&template=dsp_job_details.cfm&cJobId=734201&fromCarilion=true
Lewis & Clark College (Portland, OR): Science & Data Services Librarian: https://jobs.lclark.edu/postings/4720
New York University Health Sciences Libraries, Knowledge Management Librarian: http://hsl.med.nyu.edu/content/knowledge-management-librarian
Life Sciences Librarian, New York University: http://library.nyu.edu/about/jobs.html#sciences
Research Data Services Librarian, New York University: http://library.nyu.edu/about/jobs.html#RDM
McGill University: Data Reference Services Librarian: http://joblist.ala.org/modules/jobseeker/Data-Reference-Services-Librarian/27493.cfm
University of Cincinnati, Digital Metadata Librarian: http://www.libraries.uc.edu/about/employment.html
University of Connecticut, Sciences Librarian: http://joblist.ala.org/modules/jobseeker/Sciences-Librarian/27501.cfm
University of Delaware, Science Liaison Librarian: http://www2.lib.udel.edu/personnel/employment/102465ScienceLiaisonLibrarian.pdf
University of Kentucky: Head of Science Library and e-Science Initiatives: http://www.diglib.org/archives/6865/
University of Massachusetts Medical School, Assoc. Director of Library Education and Research: https://careers-umms.icims.com/jobs/23818/assoc-dir%2c-lib-education-%26-research/job?mobile=false&width=1837&height=500&bga=true&needsRedirect=false
IASSIST (International Association of Social Science Information Services and Technology) announces a Call for Papers for IASSIST 2015, which will be held June 2-5 in Minneapolis, MN.
Submitted by Donna Kafel, e-Science Coordinator, University of Massachusetts Medical School
Data Visualization, Research Methods in Information, How to Think Like a Computer Scientist, Interactive Web Design, Blindspot, the Harvard edX course “Introduction to Computer Science.” These are just a few examples of the many topics and items on my to-read and to-learn list. I want to learn Python scripting and R, become better versed in research methodology, develop self-paced educational modules, be more aware of hidden biases, and develop proficiency in data science techniques. Knowing these things would be very useful for me professionally. And I’m sure I’d enjoy learning some of them if I could find the time.
The following picture depicts my typical daily dilemma. During the course of my workday, I come across a book, or a new tool, or an online course, or something that I want to learn about. And I think to myself, when I go home tonight I’m going to delve into reading about this topic and learn something. Or I’m going to set aside an hour every night for a week and learn the basics of Python. I’m going to learn the ins and outs of a new database. These ideas seem very do-able in the light of the workday. Yet after work, other demands and tasks take over, and I let these great aspirations fall by the wayside, night after night.
Reflecting on this vicious cycle, I got to wondering about how my colleagues approach self-learning. Do they set specific goals for themselves? Do they set aside work time to learn a new technology? Do they ever sleep?
I decided to interview two librarians whom I admire for their creativity, unique skills, and passion for learning: Sally Gore and Chris Erdmann. I work with Sally at the Lamar Soutter Library at UMass Medical School. Sally works as an Embedded Research Informationist and is involved in some very interesting projects with faculty researchers who are investigating things like patient compliance in mammogram screening and developing a system for citing neuroimages. Sally is a thoughtful and articulate writer who regularly shares her insights about her experiences working as a librarian in the research environment and emerging trends in librarianship in her blog A Librarian by any Other Name. Chris is the Head Librarian at the Harvard-Smithsonian Center for Astrophysics. Much of his work there focuses on astrophysics data and developing library data services that support the needs of astrophysics researchers. Chris works directly with researchers doing data processing and analysis, assisting them with data citation and publishing, and exploring new approaches for repository systems that support access to huge astrophysics data sets. What’s particularly striking about Chris is his passion for teaching other librarians data science techniques in his DST4L (Data Science Training For Librarians) class, which is now in its third iteration. In this class, Chris and his associates have taught librarians programming skills and technologies through hands-on activities and group projects.
I interviewed Sally and Chris individually but both of their responses are noted below each question.
How do you find time to “teach” yourself new things?
Sally: I set aside one morning a week, usually Friday mornings, for professional reading and writing my blog posts. Making this a weekly practice is a good habit. I strongly believe that librarians need to make an active effort to stay informed, and to do that, we need to set aside some work time for reading and learning. In my spare time I also take the opportunity to attend seminars and learning events, such as Science Café Woo. I also try to meet new people at such events, by sitting with people I don’t know and talking with them about their interests and the work they do.
Chris: When things are quieter at work, I seize moments to focus on learning a new skill. One of my fears is that I won’t be able to keep up with the rapid pace of changing technologies. It’s a huge challenge to find this time, but that’s how I learned a lot of computer programming, during breaks.
I do encourage librarians who are interested in the DST4L class to advocate for professional development time to take the class by pointing out to their administrators the usefulness of the skills they’ll learn. I have thought about teaching the class online but it wouldn’t work well that way. One of the key factors to successfully sticking with a class is being involved in group projects in which your classmates are counting on your participation. No one wants to let their group down, so they consistently attend the classes.
Did your educational background prior to library school help you with your work now?
Sally: I have a B.A. in Philosophy, a Master’s in Divinity, and a Master’s in Exercise Physiology. These are all very different fields from the research disciplines that I’m involved with right now. I do think having worked in a research environment while studying physiology has been a huge plus. It gave me a sound background in research methods and familiarity with research work and environments.
Chris: My background is a B.A. in History with a minor in Agriculture and Managerial Economics. Very different from computer science! But several years back, I really wanted to get a job as a programmer and was pretty sure that I could teach myself the basics. I learned programming initially by picking up a C++ book years ago and studying it. It wasn’t easy but I was determined to learn programming so I could work in a software company. I did get hired as a programmer. The first week on the job was a bit shaky, but I persevered and learned as I went along.
The thing I missed, though, in working as a programmer was working directly with users. I enjoy working with people. I did a consulting gig for a while and was able to work more with users then. As I thought more about wanting to work with people, I started to consider library school.
What has inspired you?
Sally: I have been working at UMass Medical School for nine years now, but it’s only been during the last two years that I’ve worked directly with researchers. I know much more now about the research work that is being done at the school, yet the more I know the more I realize how much more is being done here that I know nothing about. I’m inspired by the incredibly bright researchers with whom I have the opportunity to work. I enjoy the work I’m doing now as an informationist on a neuroscience project. I like looking at the big picture: understanding the project activities, the project design, and the data management challenges. In this project, we’re trying to explore new ways for effectively citing individual neuroimages that are part of a “dataset” that is basically a collection of neuroimages.
Chris: I was inspired by an internship I did at the Smithsonian in which I worked on DigiLab, an interactive exhibit of digital materials. I enjoyed the work I did there so much, it inspired me to continue learning everything I could about programming for the web.
Another thing that I’ve found motivating is going to user forums to learn new coding skills. They’re ideally places where you can informally learn from a community of other users. When I started learning, forums used to be intimidating spaces, but they have improved dramatically. Now the users are generally more welcoming, though there is still work to be done to improve the culture so it is less male dominated. I’ve been favorably impressed by Software Carpentry. They’ve been great to work with, and I always recommend the bootcamps they run to students.
From there my conversations with both Sally and Chris veered to new roles for libraries, data repositories, and revamping library school curricula to include data science courses—topics for other blog posts. However, I did come away from the interviews with a few do-able approaches for self-learning:
· Be single-minded. Identify one topic or skill you want to learn and focus on mastering it.
· Seize opportunities to attend lectures, seminars, and poster presentations on research topics (there are many of these in an academic institution).
· Enroll in a face-to-face class with required projects.
I found these helpful and hope they’ll help others who are paralyzed by the “so much to learn, so little time” conundrum. I’ll let you know how my self-learning proceeds in a future post!
Enhancing Reproducibility and Transparency of Research Findings.
Lecture given by Lawrence Tabak, Principal Deputy Director of the NIH
Part of the Sanger Series at Virginia Commonwealth University, Richmond, VA
The starting point of Dr. Tabak’s lecture was the editorial he and Dr. Francis Collins published in Nature, January 30, 2014, on NIH plans to enhance the reproducibility of research.
One of the things they noted in the article was that the NIH can’t make the changes to research alone. Scientists, in their roles as reviewers of grants and articles, editors of journals, and members of tenure panels, can help with the process.
Science has always been viewed as self-correcting, and it generally is over the long term, but the checks and balances for reproducibility in the short and medium term are a problem. Tabak discussed several problems with current research publications. Journals want exciting articles, “Cartoon biology” according to Tabak, and so the methods sections are shrinking – “more like method tweets”. Add to this issues with:
- poor research design, e.g., not using blinding or randomizing, using a small sample,
- incorrectly identified materials, e.g., not verifying cell strains or antibodies,
- variability of animals,
- contamination of cells,
- sex differences
Along with methods issues, Tabak identified problems with poor training in experimental design, poor evaluation leading to more errata and retractions, the difficulty of publishing negative findings, and the “perverse reward incentives” in the US biomedical research system.
What is the NIH doing?
As well as speaking at many venues outside of the NIH (such as this lecture at VCU) there are efforts to work with editors, industry, and other groups to improve research.
Editors of journals with the most NIH researcher publications were invited to a workshop in June 2014, and a set of principles for journal publication was drawn up. Science and Nature will run editorials in November with the finalized principles. The principles will include encouraging rigorous statistical analysis, transparency in reporting, data and materials sharing, and establishing best practices guidelines.
NIH is working with industry through PhRMA to make training materials on research design available to everyone. And there will be some training films developed at the NIH for use around the country.
Tabak mentioned a couple of projects that should help with the validation of high-quality experimental results. The Reproducibility Initiative is a collaboration between Science Exchange, PLOS, figshare, and Mendeley. The Open Science Framework, from the Center for Open Science, allows researchers to register materials and methods before research, similar to a clinical trials register.
Tabak also discussed a checklist of core elements that might be used when reviewing grants. Included was the idea that researchers need to make sure their background articles are reproducible and of high-quality. He mentioned that some of the false hopes of patients for a cure for devastating illnesses, such as ALS or cancer, are based on poorly designed animal studies that should have never progressed to clinical trials.
Post-publication review of papers, in forums like PubMed Commons, is one way to ensure transparency. As well as discussing and clarifying the research, some authors have linked to data sets, including negative data sets, which increases the usefulness of the Commons model.
There was also a discussion of funding for replication studies, and alternative funding methods to increase stability for mid-career researchers. Tabak concluded by mentioning that these new funding models would need to be evaluated and it may be that different institutes will use different models.
What can librarians do?
Throughout the lecture I thought of things librarians could be doing to support scientific transparency and reproducibility. We can encourage the best possible background searching for research, which means training students as well as working with researchers to refine their searching. We can encourage citation searching and show researchers how to follow up on errata and retractions so they know what others think of the research they are reading. We can encourage the use of social media for informal communications about research. And be sure to keep an eye out for the principles the journals will be sharing in November.
As a data librarian, I can encourage proper data management and documentation for reliable reporting. I can suggest that researchers share data in various venues and link that data to their articles. I can suggest that data management training be part of the research design training, and I can offer to do it.
And libraries can invite lectures such as this one – VCU Libraries sponsored this lecture with our Office of Research and Innovation. And it was very well attended.
I’m sure there are ideas I’ve missed. I would love to hear any ideas you might have for ways librarians can help research transparency and reproducibility.
Video of the lecture is available at http://youtu.be/E06QJTZ6LUw