The Cambridge Crystallographic Data Centre (CCDC):
40 years of database development, software and research

Beginnings

The CCDC was created to record crystal structures, and the Cambridge Structural Database (CSD) was one of the first numerical databases created anywhere in the world. The CCDC originated from a small group set up in 1959 by J D Bernal and Olga Kennard, initially at Birkbeck College, London and from 1962 at the Chemistry Department in Cambridge, collecting data on organic and metal-organic crystal structures and using these to investigate intermolecular arrangements and forces. In January 1965 David Watson joined the group and later that year the CCDC was formally established with a grant from the Office for Scientific and Technical Information. The collection of data was greatly accelerated and both numeric and bibliographic data were transferred from edge-punched cards to "machine readable" form. Subsequent CSD growth statistics suggest that, had this work started later, it is doubtful whether it would have started at all. But it did, and 40 years later the CSD contains 335,276 structures.

Development

By modern standards, early progress was horribly slow: computer technology was in its mainframe, card-chewing, batch-processing era, and hardware was temperamental. On the human level, staff were needed to acquire, log and encode data, and crystallographer-programmers were needed to turn the vision into a reality. Scientific abstractors and data entry personnel, most of whom worked from home, were also vital early colleagues on the developing production line. The CCDC itself was partly a hub that managed a complex data preparation network, and partly a scientific analysis centre that processed the raw material into a growing database. Data acquisition itself has now completed its own transformation from the days when all coordinates were printed to the current nirvana of electronic deposition via the CIF. In between we had to cope with the myriad vagaries of hard-copy depositions, some of which even arrived in handwritten form!

An early need was for structure validation software, to guard against local data entry mistakes and to locate the errors that occurred in some 10% of typed or typeset tables. Many errors were trivial to correct, but in the pre-email era a significant number had to be referred back to authors by letter. Crystallographers took these 'CCDC letters' in good part, and this was the beginning of a special relationship with the community that has enhanced the development of the CSD throughout the past 40 years.

An electronic bibliographic file was being regularly updated by 1970, and was disseminated via the Molecular Structures and Dimensions book series - itself one of the earliest handbooks to be typeset directly by computer. Meanwhile, the first 5,000 crystal structures were being validated and entered into a CSD data file. Finally, it was realised that a system of chemical structure representation was needed and a third component, a file of chemical connection tables, was created. 2D and 3D substructure searches now became possible, adding tremendous value to the underlying crystal structure information. These three separate files were eventually amalgamated into the CSD that we know today.

Millions of lines of code

Software development has always been at the heart of CCDC activities, and we have run the gamut from FORTRAN II to our current object-oriented C++ environment. FORTRAN, as its name implies, was never really created for text processing, and we pushed the available compilers to their limit and beyond in the early days.

The CCDC is responsible for three types of code:

CCDC software developers have blended the 3D representations of crystallography with the 2D representations of chemical informatics, and have been at the forefront in creating novel systems for 3D substructure searching, including searches for intermolecular interactions, and the statistical analysis and visualisation of parameter distributions retrieved from the CSD. More recently, we have generated knowledge-based libraries of structural information, and have diversified (often collaboratively) into software applications that use crystal structure information.

CSD System releases

By the mid-1970s, the first version of the CSD System had been released to academics in the UK, USA, Japan and Italy. Many other countries formed National Affiliated Centres and became subscribers to the service. The pharmaceutical and agrochemicals industries began to experiment with computational chemistry and modelling tools for rational molecular design, and the number of industrial subscribers began to rise during the 1980s. Early releases were on magnetic tape, and the number of 1600-foot tapes per release was certainly a challenge for the average postman, particularly the one who 'delivered' several CCDC parcels to a hedge 'somewhere in Europe'. Software was released as source code, to be compiled under the user's local operating system. Today all that has changed, with the universality of just a few operating systems, CDs and internet downloads, click-of-a-button installers, and email support desks.

1,200 Applications Papers

The first papers that made use of the CSD for fundamental research began to appear in the late 1970s, inspired by the work of Hans-Beat Buergi and Jack Dunitz on structure correlation. As researchers came to recognise the CSD as a growing library of geometric structures, this type of research accelerated rapidly from about 1980. A key issue was to improve database searching and develop a proper statistical basis for data analysis, so that improvements in distributed software were often driven by current research needs.

The CCDC itself has been heavily involved in this research effort, and has published applications papers covering both intramolecular and intermolecular topics. Tables of mean bond lengths published in J. Chem. Soc., Perkin Trans. (1987, pp. S1-S19) and J. Chem. Soc., Dalton Trans. (1989, pp. S1-S83) have now jointly received more than 10,000 citations. In the study of intermolecular interactions, the CSD has underpinned many fundamental contributions. These have helped to provide tools for studying protein-ligand interactions, and played a part in the emergence of crystal engineering as a sub-discipline. The CCDC's most cited paper in this area - more than 1,000 citations and the 60th most cited paper ever in the first 125 years of JACS - is the categorisation of short C-H...O interactions as true H-bonds (Taylor & Kennard, J. Amer. Chem. Soc., 104, 5063-70, 1982), work that re-shaped the global view of weaker interactions.

The CCDC maintains a web-accessible database of published applications of its products, and the 1,200 current entries chart the many and varied uses of the CSD. The CCDC is well represented with over 150 papers, but more than 1,000 other references show the truly international impact of CSD-based research.

The CSD at 40

On 1 January 2005, the CSD contained 335,276 crystal structures, having grown by nearly 29,000 structures during 2004. The size and complexity of structures have also increased steadily with time. The CCDC has excellent relationships with journals, and 84 titles now require electronic data deposition to the CCDC when a paper is submitted. These data enter the CSD when the paper is published, and the CCDC now maintains a growing parallel archive of more than 160,000 of the initial 'raw' CIFs.

Current CSD statistics are also available on the website, and although the CCDC encourages direct deposition of Private Communications, these statistics refer primarily to published data. The issue of the very large number of structures that languish unpublished in laboratory records is quite another matter, but one that must surely be addressed. Software for data processing and maintenance of both the CIF archive and the CSD is currently undergoing a major overhaul, and new software will incorporate much of the expert knowledge that has been gained over the past 40 years.

New Products

Two new components of the CSD System have been added since 1997. These are knowledge-based libraries of intramolecular geometry (Mogul) and intermolecular interactions (IsoStar). They provide click-of-a-button access to millions of individual pieces of geometrical and chemical information that can be derived from the CSD (and PDB protein-ligand complexes in the case of IsoStar). Further development, and integration of this structural knowledge with other software, is ongoing in both cases.

Recent years have also seen the CCDC diversify into developing and marketing specific software applications for rational drug design (GOLD, SuperStar, Relibase+) and for structure solution from powder diffraction data (DASH). All of these products make use of crystal structure data from the CSD or PDB in some way, and all except SuperStar are being developed through collaborations with industry and academia. The life sciences products, concentrating essentially on protein-ligand interactions and protein-ligand docking, help to solve difficult problems, and promote the value of small-molecule crystal structure data in structural biology and in the pharmaceutical and agrochemicals industries. The CCDC continues to broaden its horizons, by seeking new areas of science in which crystal structure data adds value to research and development activities.

The CCDC as an Independent Institution

The CCDC was grant-funded from 1965 until 1989, when it became an independent institution: a non-profit Company Limited by Guarantee with charitable status. This means that the CCDC must be financially self-sufficient, and that any surplus income must be ploughed back into the company (e.g. for new equipment) or into specific charitable activities. Thus, the CCDC provides grants-in-aid for access to the CSD System in developing countries, sponsorship for students who are working on projects allied to the CCDC's interests, and support for the activities of relevant professional organisations. The CCDC's affairs are overseen by an international Board of Governors, eight eminent scientists who, in their turn, are responsible to UK Companies House and to the Charity Commissioners for England and Wales.

Our most valuable assets: Staff, Customers and Collaborators

The CCDC has expanded steadily, and now has 50 employees divided between database creation, product development, research, scientific and technical support, and administration. The CCDC now has customers in academia and industry all over the world, and in 2004 nearly 2,000 CSD System licences were distributed across 56 countries. The CCDC has a long history of scientific collaboration with academia and industry, and this work has fuelled our research output and fed into our product developments. Currently, the Pfizer Institute for Pharmaceuticals Materials Research, a major partnership involving the CCDC, Cambridge University and Pfizer Inc., is generating exciting results and further extending our areas of scientific interest.

We do not have a precise total of the number of staff and visitors who have worked at the CCDC over the past 40 years, but it must be 250 or more. What we do know is that they have left, or are leaving, their own mark on the organisation. It is the stronger for their contributions. Customers, scientific collaborators and data depositors also leave their mark, through their constructive input and feedback on our efforts. The CSD, our products, and ultimately all of our customers, have benefited enormously from these interactions, and we are grateful for their involvement.

We look forward to the next 40 years.

Frank Allen
www.ccdc.cam.ac.uk


Published in 'Crystallography News', no. 92, March 2005, pages 5-8