[linux-elitists] [CHMINF-L] Open Data - an appeal (fwd from pm286@CAM.AC.UK)
Sun Apr 17 02:06:46 PDT 2005
----- Forwarded message from Peter Murray-Rust <pm286@CAM.AC.UK> -----
From: Peter Murray-Rust <pm286@CAM.AC.UK>
Date: Sun, 17 Apr 2005 09:54:23 +0100
Subject: [CHMINF-L] Open Data - an appeal
X-Mailer: QUALCOMM Windows Eudora Version 188.8.131.52
Reply-To: CHEMICAL INFORMATION SOURCES DISCUSSION LIST <CHMINF-L@LISTSERV.INDIANA.EDU>
At 02:16 17/04/2005, Brian Lynch wrote:
> I attended a "National Consultation on Access to Scientific
>Research Data [NCASRD]" in Ottawa last November,
>convened by three Canadian Federal Research Agencies - discussion was
>general rather than specific; a final report was
>to appear but a Google search today on "NCASRD" didn't find it. As I
>recall open data was regarded as akin to old-fashioned motherhood in that
>no-one attending would question its desirability; but no government
>representatives seemed able to suggest that
>provision of open data access might cost real $$$$.
Many thanks Brian,
I suppose if any list can discover this information it should be
I agree that data publication and preservation costs money.
However there are some positive ways forward.
* We should not strive for perfection either in quality or
comprehensiveness. For example we should start at 2005 and move forward,
not look back. If I want historical data in existing Journals, secondary
publications, etc. then I should be prepared to pay for it in some form.
Personally I would rather have access to 50% of published chemical data, of
highly varying quality, than 0% of wonderfully curated and annotated data.
(If I were designing a chemical process or defending a patent, my views
would be different, but a large number of chemists - and other scientists
who use chemistry - are not doing either.)
* I think Institutional Repositories will become universal. If they don't
it will be the effective end of the Scientific librarian and this role will
be taken over by commercial companies selling curated information of
various sorts. There is, and must be, funding for IRs. I would urge that
their role for data preservation and dissemination is at least as important
as the deposition of a rather fragmented set of "full-text" PDFs. For
example at our JISC meeting, Robert Terry (of the UK Wellcome Trust funding
body) said that 1-2% of a research grant should be for dissemination. If
that battle can be won, then we have a useful way forward for the transition.
* There is an increasing realisation of the value of Open Source. This
means that it is possible to build systems, including chemical ones, that
are not restricted by additional licenses from the software vendors. (For
example the Zinc database of ca 3 million compounds cannot be redistributed
because of the restrictions placed by both the original providers and the
software vendors.) We have built a proof-of-concept system WWMM
(WorldWideMolecularMatrix) where InChI is used as a central index and links
to XML data repositories. For the size of data (ca 1 million compounds) we
find that the Open XML repository eXist works extremely well and can
replace more heavyweight SQL-based systems. The database is then,
essentially, an XML file which can be easily updated on a per-compound basis.
* Open Web-based tools such as RSS, RDF, Google/MSN and free-text indexers
are at least as powerful as the average scientific database manager requires. A
database need not be centralised. For example we would not replicate data
in PubChem but point to it. Our main data addition is high-quality
reproducible QM calculation and this again is available to any other
collection. The use of InChI as an identifier ties all this together
painlessly. New additions to the global chemical information are
communicated by CMLRSS.
* Preservation. The issue of digital preservation is far larger than
chemistry. If there are generic solutions, then chemistry will benefit -
there are relatively few preservation issues which are unique to chemistry
as long as we use XML/CML for the deposition.
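
The idea of using InChI as a central index that points to data held elsewhere, rather than replicating it, can be sketched as follows. This is only an illustration and not the WWMM implementation: the class name `InChIIndex`, its methods, and the example.org URLs are all hypothetical.

```python
from collections import defaultdict


class InChIIndex:
    """Hypothetical index mapping an InChI string to links held elsewhere.

    Only identifiers and URLs are stored; the data itself stays in the
    remote collection that owns it.
    """

    def __init__(self):
        self._links = defaultdict(set)

    def add_link(self, inchi, url):
        # Minimal sanity check only; this does not validate the InChI itself.
        if not inchi.startswith("InChI="):
            raise ValueError("not an InChI string: %r" % inchi)
        self._links[inchi].add(url)

    def lookup(self, inchi):
        # Return every known location for this compound, in a stable order.
        return sorted(self._links.get(inchi, ()))


index = InChIIndex()
methane = "InChI=1S/CH4/h1H4"
# Placeholder URLs standing in for records in two independent collections.
index.add_link(methane, "https://example.org/collection-a/methane")
index.add_link(methane, "https://example.org/collection-b/qm/methane.cml")
print(index.lookup(methane))
```

Because the InChI is derived from the structure, two collections that have never coordinated still end up under the same key.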
Appeal for Open Data
We really are starting very much from scratch. I know of virtually no
clearly OPEN data/molecules other than PubChem/ NCI/ KEGG/ WWMM/ EBI/ EPA/
Moldata. [Unfortunately the NIST Webbook is not Open. The Open/closed
status of Chembank is unclear from its web pages.] We are particularly
missing lists of simple compounds of the sort that could be found in
undergraduate courses or in Departmental stores. I am hopeful that there is
much that could be made available now. If you know of such material, please
consider making it available.
** At this stage please indicate interest and URLs but do NOT send data! **
If you are still reading, here is a suggested checklist:
* positively state that it is Open Data and either assert that its reuse is
consistent with the BOAI or use a Creative Commons license. ESSENTIAL.
* include your metadata (institution, possibly purpose of use).
* All information should be in ASCII, preferably US-ANSI. Binary data, and
the use of ISO-8859-* or other encodings, are problematic. Images (e.g. GIFs
for alpha characters) are not useful.
* make sure that no information is closed (e.g. copyright identifiers). We
and many others assert that scientific facts are not copyrightable.
* include a machine-processable connection table. MOL/SDF, SMILES and many
other formats are acceptable - we can convert them automatically to
InChI/CML (and since the service is Open, so can you if you want to try).
(Names by themselves are not useful). ESSENTIAL.
That is the minimum that is useful. From the connection table it is
possible to generate 3D structures, submit them to QM calculations, etc. A
(largish) list of SMILES with an institution's name and a statement of
purpose could be useful in itself, but normally more info is available.
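
A contributor could pre-check a record against the two ESSENTIAL items and the ASCII requirement with something like the sketch below. The field names (`open_data_statement`, `connection_table`, and so on) are assumptions made for illustration, not an agreed submission format, and the check only tests that a connection table is present, not that it is chemically valid.

```python
def check_submission(record):
    """Return a list of problems found in a hypothetical submission record."""
    problems = []
    # ESSENTIAL: an explicit statement that the data is Open.
    if not record.get("open_data_statement", "").strip():
        problems.append("missing Open Data statement")
    # ESSENTIAL: a machine-processable connection table (MOL/SDF, SMILES, ...).
    if not record.get("connection_table", "").strip():
        problems.append("missing connection table")
    # All text should be plain ASCII.
    for key, value in record.items():
        if isinstance(value, str) and not value.isascii():
            problems.append(f"non-ASCII text in field {key!r}")
    return problems


record = {
    "open_data_statement": "Open Data; reuse consistent with the BOAI",
    "institution": "Example University",   # placeholder metadata
    "connection_table": "CCO",             # SMILES for ethanol
}
print(check_submission(record))
```

An empty list means the record meets the minimum; anything else names what to fix before offering the data.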
The following (which are commonly published) would be very useful additions
if they exist. Any or all could be included. We have Open software (OSCAR)
which is capable of extracting information from common formats.
* physical properties - e.g. mpt, bpt, alpha
* spectral peaks and assignments (UV, IR, HNMR, CNMR, FNMR, LRMS, HRMS)
* one or more names.
* identifiers which the owner has the right to distribute (e.g. suppliers
* supplier (but not pricing)
* descriptive text
Does anyone know of any suppliers who have made their information Open? I
assume that much of PubChem comes from this source and if so I assume the
suppliers have allowed the catalog details to be redistributed.
Spectra. Online collections of Open Spectra would be highly valued. However
there is no agreed metadata at present and although we are working on this,
it is less urgent than the InChIs.
Note, as Dr Karthikeyan has done, that all of this is present in many - if
not most - chemical theses. The costs involved in capturing it are fairly
modest. The main challenge is changing the culture. For example at present
the average graduate student spends 2 weeks transcribing electronic spectra
to text. Tools to facilitate this Byzantine requirement would therefore be
highly valued and would capture the complete spectrum. Its preservation in
a well-found repository should be fairly straightforward. We hope to have a
pilot project in this area.
Hoping for some positive contributions
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069 Fax: +44 1223 763076
----- End forwarded message -----
Eugen* Leitl <a href="http://leitl.org">leitl</a>
ICBM: 48.07078, 11.61144 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE