The SPECTRa Project
Jim Downing
University of Cambridge
This presentation
- The need for Open Data in Chemistry
- The SPECTRa project
- Project Outline
- Crystallography
- Social challenges - Technical implications
- DSpace
- Potential collaborations
- Summary
The need for Open Data
This is the context for the SPECTRa project.
Problems in Chemistry
- Data is lost - an estimated 80-99% (*) of
high-quality scientific data never leaves the laboratory
- Extra work required to publish data
- Lack of infrastructure to handle Open Data
- Hearts & minds
- Many publishers unconvinced about / antagonistic towards OA
- => Many academics unconvinced about / antagonistic towards OA
- Lack of exemplars to illustrate benefits
SPECTRa ...
Submission, Preservation and Exposure of Chemistry
Training and Research Data
- Study needs of Chemistry researchers
- Develop Open Source tools to enable deposit of and access to Open
Data using DSpace institutional repositories


http://www.lib.cam.ac.uk/spectra/
... To The Rescue
- Data is lost
- Workflow-integrated tools to capture data
- Extra work required to publish data
- User-driven development of tools to minimize the effort of
publication to the archive
- Lack of exemplars to illustrate benefits
- Participating scientists will use tools day to day - demonstrable
success.
- Lack of infrastructure to handle Open Data
- Tools will be portable, customizable Open Source software
components
- Integration with the DSpace repositories at Cambridge and Imperial College.
SPECTRa Phases
- Planning: Oct 2005 - Dec 2005
- Crystallography: Jan 2006 - Jun 2006
- Computational Chemistry: May 2006 - Sept 2006
- Organic Synthetic Chemistry: Aug 2006 - Dec 2006
- Distribution & Dissemination: Jan 2007 - March 2007
Crystallography from 30,000ft
Measuring diffraction patterns to determine crystal and molecular
structure
Structural data + chemical context is valuable.
Problems
- 300-500 structures per year per department
- Most DO NOT end up being published
- Crystallographer would like to publish structures
- Chemists' and crystallographers' interests in the data are often
different
- Chemist interested in confirmation of structure
- Crystallographers interested in analysis of large datasets of
structures
- Additional metadata required from Chemist at start of process
- Low tolerance for additional work that isn't scientific research
- OA opposition and FUD
Crystallography Sample Manager

Crystallography 2
- Crystallographer has all the data required to publish structure

All the data can be present, but the structure can't be published.
Publication sensitivity
- Chemist's work is often in co-operation with commercial partners
with stringent IP requirements
- Chemists often need to keep their research interests hidden from
competitor research groups
- Structure publication must be embargoed until
- Chemist publishes work involving structures
- Chemist moves on to a different line of research
- Structures can be published
- But only after a period of time
Failure to collect metadata up front, coupled with
this time delay, has been a major barrier to crystallographers
publishing their own structures
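The embargo rule described on this slide could be sketched as follows. This is a minimal illustration, not the SPECTRa implementation: the `EscrowedStructure` class, its field names, and the three-year default are assumptions (the slide says only "a period of time"), and the rule is read as "release once the chemist has published, or once the embargo period has elapsed".

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed default; the slides do not state how long the embargo lasts.
DEFAULT_EMBARGO = timedelta(days=3 * 365)


@dataclass
class EscrowedStructure:
    """A structure held in escrow. All field names are hypothetical."""
    identifier: str                    # local ID assigned at deposit
    deposited: date                    # date the structure entered escrow
    chemist_published: bool = False    # chemist has published work using it
    embargo: timedelta = DEFAULT_EMBARGO


def is_releasable(item: EscrowedStructure, today: date) -> bool:
    """Return True when the structure may leave escrow for open publication."""
    if item.chemist_published:
        # The chemist's own publication lifts the embargo immediately.
        return True
    # Otherwise the structure is released only after the embargo has elapsed.
    return today >= item.deposited + item.embargo
```

Because the metadata and the deposit date are recorded when the structure enters escrow, the release decision needs no further input from the chemist once the time comes.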
DSpace Escrow Repository
- Data only
- Headless (No GUI)
DSpace Core Strengths
- Content agnostic
- Packaging mechanism
- Extensible metadata handling
- OAI-PMH
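As an illustration of the OAI-PMH point, a harvester asks a repository for records with a plain HTTP GET request. The sketch below only builds such a request URL; the base URL is a hypothetical placeholder, while `ListRecords`, `metadataPrefix`, `set` and `oai_dc` are names defined by the OAI-PMH 2.0 protocol.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optionally restrict harvesting to one set (e.g. one collection).
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used for illustration only.
url = list_records_url("https://repository.example.org/oai/request")
print(url)
```

The response is an XML document of records, which is how metadata for deposited chemistry data can be exposed to external services without any DSpace-specific client code.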
DSpace Core Development Areas
- Identifier handling
- Honour existing Handles in item packages
- Honour collection-unique domain-specific IDs (e.g. InChI)
- Scalability with number of items still problematic
- Simple network API
- DSpace LNI satisfies most of our functional requirements
- JISC Deposit API (under construction!)
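The two identifier requirements above can be sketched in a few lines. This is not DSpace code: the `Collection` class, its `ingest` method and the handle prefix are hypothetical, and the sketch only shows the intended behaviour, namely keeping a Handle that arrives with the item package and refusing a second item with the same domain-specific ID (here an InChI string) in one collection.

```python
from itertools import count
from typing import Dict, Optional


class Collection:
    """Toy collection that tracks Handles and domain-specific IDs."""

    def __init__(self, handle_prefix: str) -> None:
        self.handle_prefix = handle_prefix        # hypothetical prefix
        self._next_suffix = count(1)              # source of new Handle suffixes
        self._by_domain_id: Dict[str, str] = {}   # domain ID -> Handle

    def ingest(self, domain_id: str, existing_handle: Optional[str] = None) -> str:
        """Register an item and return its Handle."""
        if domain_id in self._by_domain_id:
            # Domain-specific IDs must be unique within the collection.
            raise ValueError(f"duplicate domain ID in collection: {domain_id}")
        # Honour a Handle supplied in the package; otherwise mint a new one.
        # (Collisions between supplied and minted Handles are not checked here.)
        handle = existing_handle or f"{self.handle_prefix}/{next(self._next_suffix)}"
        self._by_domain_id[domain_id] = handle
        return handle


collection = Collection("123456789")
methane = collection.ingest("InChI=1S/CH4/h1H4")
water = collection.ingest("InChI=1S/H2O/h1H2", existing_handle="1810/999")
```

Here `methane` receives a newly minted Handle, while `water` keeps the Handle it arrived with; a second deposit of the methane InChI would raise `ValueError`.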
Potential Collaborations?
- Cross repository network API interoperability work (e.g. JISC
Deposit API)
- Chemistry services for data collected by SPECTRa tools - e.g.
substructure searching, characterization
- LNI
- Improvements to Handle handling
- Potential requirement to support DOIs as well / instead
SPECTRa In Summary
- Collaboration between libraries and chemistry departments at
Imperial College, London and University of Cambridge, and with the eBank project
- Runs until March 2007
- Developing Open Source tools to help chemists publish Open Data to
Institutional Repositories
- Using DSpace as an IR and also a dark repository for content escrow
Thanks for listening!
Questions?