[linux-elitists] [CHMINF-L] Open Data - an appeal (fwd from pm286@CAM.AC.UK)

Eugen Leitl eugen@leitl.org
Sun Apr 17 02:06:46 PDT 2005

----- Forwarded message from Peter Murray-Rust <pm286@CAM.AC.UK> -----

From: Peter Murray-Rust <pm286@CAM.AC.UK>
Date: Sun, 17 Apr 2005 09:54:23 +0100
Subject: [CHMINF-L] Open Data - an appeal
X-Mailer: QUALCOMM Windows Eudora Version

At 02:16 17/04/2005, Brian Lynch wrote:
>I attended a "National Consultation on Access to Scientific Research
>Data [NCASRD]" in Ottawa last November, convened by three Canadian
>Federal Research Agencies - discussion was general rather than
>specific; a final report was to appear but a Google search today on
>"NCASRD" didn't find it.  As I recall open data was regarded as akin
>to old-fashioned motherhood in that no-one attending would question
>its desirability; but no government representatives seemed able to
>suggest that provision of open data access might cost real $$$$.

Many thanks Brian,
        I suppose if any list can discover this information it should be 
this one.

        I agree that data publication and preservation costs money. 
However there are some positive ways forward.

* We should not strive for perfection either in quality or 
comprehensiveness. For example we should start at 2005 and move forward, 
not look back. If I want historical data in existing Journals, secondary 
publications, etc. then I should be prepared to pay for it in some form. 
Personally I would rather have access to 50% of published chemical data, of 
highly varying quality, than 0% of wonderfully curated and annotated data. 
(If I were designing a chemical process, or defending a patent, my views 
would be different, but a large number of chemists - and other scientists 
who use chemistry - are doing neither.)

* I think Institutional Repositories will become universal. If they don't 
it will be the effective end of the Scientific librarian and this role will 
be taken over by commercial companies selling curated information of 
various sorts. There is, and must be, funding for IRs. I would urge that 
their role for data preservation and dissemination is at least as important 
as the deposition of a rather fragmented set of "full-text" PDFs. For 
example at our JISC meeting, Robert Terry (of the UK Wellcome Trust funding 
body) said that 1-2% of a research grant should be for dissemination. If 
that battle can be won, then we have a useful way forward for the transition.

* There is an increasing realisation of the value of Open Source. This 
means that it is possible to build systems, including chemical ones, that 
are not restricted by additional licenses from the software vendors. (For 
example the Zinc database of ca 3 million compounds cannot be redistributed 
because of the restrictions placed by both the original providers and the 
software vendors.) We have built a proof-of-concept system WWMM 
(WorldWideMolecularMatrix) where InChI is used as a central index and links 
to XML data repositories. For the size of data (ca 1 million compounds) we 
find that the Open XML repository eXist works extremely well and can 
replace more heavyweight SQL-based systems. The database is then, 
essentially, an XML file which can be easily updated on a per-compound basis.

* Open Web-based tools such as RSS, RDF, Google/MSN and free-text indexers 
are at least as powerful as the average scientific database manager requires. A 
database need not be centralised. For example we would not replicate data 
in PubChem but point to it. Our main data addition is high-quality 
reproducible QM calculation and this again is available to any other 
collection. The use of InChI as an identifier ties all this together 
painlessly. New additions to the global chemical information are 
communicated by CMLRSS.

* Preservation. The issue of digital preservation is far larger than 
chemistry. If there are generic solutions, then chemistry will benefit - 
there are relatively few preservation issues which are unique to chemistry 
as long as we use XML/CML for the deposition.
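
As a rough illustration of the indexing idea in the points above (this is 
not the WWMM code), the sketch below keys small XML records by InChI, 
updates them one compound at a time, and stores pointers to other 
collections instead of copying their data. The class name CompoundIndex, 
the element names and the example.org URL are all invented for the 
example; the InChI strings are illustrative identifiers for methane and 
water. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET


class CompoundIndex:
    """Toy InChI-keyed index; each entry is a small XML record."""

    def __init__(self):
        self.records = {}  # InChI string -> XML element

    def upsert(self, inchi, name=None, link=None):
        """Create or update the record for one compound only."""
        rec = self.records.get(inchi)
        if rec is None:
            rec = ET.Element("compound", {"inchi": inchi})
            self.records[inchi] = rec
        if name is not None:
            ET.SubElement(rec, "name").text = name
        if link is not None:
            # Point to the external source rather than replicating its data.
            ET.SubElement(rec, "link", {"href": link})
        return rec

    def links(self, inchi):
        """Return the external URLs recorded for one InChI."""
        rec = self.records.get(inchi)
        if rec is None:
            return []
        return [e.get("href") for e in rec.findall("link")]

    def to_xml(self):
        """Serialise the whole index as a single XML document."""
        root = ET.Element("compounds")
        for inchi in sorted(self.records):
            root.append(self.records[inchi])
        return ET.tostring(root, encoding="unicode")


index = CompoundIndex()
index.upsert("InChI=1/CH4/h1H4", name="methane")
# Hypothetical URL standing in for a record held by another collection.
index.upsert("InChI=1/CH4/h1H4", link="http://example.org/collection-a/methane")
index.upsert("InChI=1/H2O/h1H2", name="water")
print(index.to_xml())
```

Because the InChI is the only key, two collections that describe the same 
compound end up in the same record without any central coordination, which 
is the property the bullet on decentralised databases relies on.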

Appeal for Open Data

We really are starting very much from scratch. I know of virtually no 
clearly OPEN data/molecules other than PubChem/ NCI/ KEGG/ WWMM/ EBI/ EPA/ 
Moldata. [Unfortunately the NIST Webbook is not Open. The Open/closed 
status of Chembank is unclear from its web pages.] We are particularly 
missing lists of simple compounds of the sort that could be found in 
undergraduate courses or in Departmental stores. I am hopeful that there is 
much that could be made available now. If you know of such material, please 
consider making it available.

** At this stage please indicate interest and URLs but do NOT send data! **

If you are still reading, here is a suggested  checklist:

* positively state that it is Open Data and either assert that its reuse is 
consistent with the BOAI or use a Creative Commons license. ESSENTIAL.
* include your metadata (institution, possibly purpose of use).
* All information should be in ASCII, preferably US-ANSI. Binary data, and 
the use of ISO-8859-* or other encodings, are problematic. Images (e.g. GIFs 
for alpha characters) are not useful.
* make sure that no information is closed (e.g. copyright identifiers). We 
and many others assert that scientific facts are not copyrightable.
* include a machine-processable connection table. MOL/SDF, SMILES and many 
other formats are acceptable - we can convert them automatically to 
InChI/CML (and since the service is Open, so can you if you want to try). 
(Names by themselves are not useful). ESSENTIAL.

That is the minimum that is useful. From the connection table it is 
possible to generate 3D structures, submit them to QM calculations, etc. A 
(largish) list of SMILES with an institution's name and a statement of 
purpose could be useful in itself, but normally more info is available.
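
Most of the minimum checklist can be checked mechanically before a 
collection is offered. The following is a minimal sketch, assuming each 
entry is a plain dictionary of text fields. The function check_submission 
and the field names (licence, institution, connection_table) are invented 
for illustration and do not belong to any existing service. It needs 
Python 3.7 or later for str.isascii(), and it cannot judge whether a field 
contains closed information; that still needs a human.

```python
def check_submission(record):
    """Return a list of problems; an empty list means the minimum is met."""
    problems = []
    if not record.get("licence", "").strip():
        problems.append("no Open Data statement or licence")
    if not record.get("institution", "").strip():
        problems.append("no institution metadata")
    if not record.get("connection_table", "").strip():
        problems.append("no machine-processable connection table")
    # Every text field should be plain ASCII.
    for key, value in record.items():
        if isinstance(value, str) and not value.isascii():
            problems.append(f"non-ASCII text in field '{key}'")
    return problems


good = {
    "licence": "Open Data, reuse consistent with the BOAI",
    "institution": "Example University",  # hypothetical institution
    "connection_table": "CCO",  # SMILES for ethanol
}
bad = {
    "institution": "Example University",
    "name": "\u03b1-pinene",  # Greek alpha, not ASCII; a name alone is not enough
}

for entry in (good, bad):
    print(check_submission(entry))
```

The second entry fails on three counts: no licence statement, no 
connection table, and a non-ASCII character in the name.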

The following (which are commonly published) would be very useful additions 
if they exist. Any or all could be included. We have Open software (OSCAR) 
which is capable of extracting information from common formats.

* physical properties - e.g. mpt, bpt, alpha
* spectral peaks and assignments (UV, IR, HNMR, CNMR, FNMR, LRMS, HRMS)
* one or more names.
* identifiers which the owner has the right to distribute (e.g. suppliers 
* supplier (but not pricing)
* descriptive text

Does anyone know of any suppliers who have made their information Open? I 
assume that much of PubChem comes from this source and if so I assume the 
suppliers have allowed the catalog details to be redistributed.

Spectra. Online collections of Open Spectra would be highly valued. However 
there is no agreed metadata at present and although we are working on this, 
it is less urgent than the InChIs.

Note, as Dr Karthikeyan has done, that all of this is present in many - if 
not most - chemical theses. The costs involved in capturing it are fairly 
modest. The main challenge is changing the culture. For example at present 
the average graduate student spends 2 weeks transcribing electronic spectra 
to text. Tools to facilitate this Byzantine requirement would therefore be 
highly valued and would capture the complete spectrum. Its preservation in 
a well-found repository should be fairly straightforward. We hope to have a 
pilot project in this area.

Hoping for some positive contributions


Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069 Fax: +44 1223 763076

----- End forwarded message -----
Eugen* Leitl <a href="http://leitl.org">leitl</a>
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net