Dave Weininger passed away recently. He was very well known in the chemical informatics community because of his contributions to the field and his personality. Dave and Yosi Taitz founded Daylight Chemical Information Systems back in the 1980s to turn some of his ideas into a business. It was very profitable. (As a bit of trivia, the "Day" in "Daylight" comes from "Dave And Yosi".)
Some of the key ideas that Dave and Daylight introduced are SMILES, SMARTS, and fingerprints (both the name and the hash-based approach). Together these made for a new way to handle chemical information search, and to do so in significantly less memory. The key realization, which I think led to the business success of the company, is that the cost of memory was falling faster than chemical information was being created. This trend, combined with the memory savings of SMILES and fingerprints, made it possible to store a corporate dataset in RAM, and do chemical searches about 10,000 times faster than the previous generation of hard-disk based tools, and do it before any competition could. I call this "Weininger's Realization". As a result, the Daylight Thor and Merlin databases, along with the chemistry toolkits, became part of the core infrastructure of many pharmaceutical companies.
I don't know if there was a specific "a-ha" moment when that realization occurred. It certainly wasn't what drove Dave to work on those ideas in the first place. He was a revolutionary, a Prometheus who wanted to take chemical information from what he derisively called 'the high priests' and bring it to the people.
An interest of mine in the last few years is to understand more about the history of chemical information. The best way I know to describe the impact of Dave and Daylight is to take some of the concepts back to the roots.
You may also be interested in reading Anthony Nicholls' description of some of the ways that Dave influenced him, and Derek Lowe's appreciation of SMILES.
Errors and Omissions
Before I get there, I want to emphasize that the success of Daylight cannot be attributed to just Dave, or Dave and Yosi. Dave's brother Art and his father Joseph were coauthors on the SMILES canonicalization paper. The company hired people to help with the development, both as employees and consultants. I don't know the details of who did what, so I will say "Dave and Daylight" and hopefully reduce the all too easy tendency to give all the credit to the most visible and charismatic person.
I'm unfortunately going to omit many parts of the Daylight technologies, like SMIRKS, where I don't know enough about the topic or its effect on cheminformatics. I'll also omit other important but invisible aspects of Daylight, like documentation or the work Craig James did to make the database servers more robust to system failures. Unfortunately, it's the jockeys and horses which attract the limelight, not those who muck the stables or shoe the horses.
Also, I wrote this essay mostly from what I have in my head and from presentations I've given, which means I've almost certainly made mistakes that could be fixed by going to my notes and primary sources. Over time I hope to spot and fix those mistakes in this essay. Please let me know of anything you want me to change or improve.
Dyson and Wiswesser notations
SMILES is a "line notation", that is, a molecular representation which can be described as a line of text. Many people reading this may have only a vague idea of the history of line notations. Without that history, it's hard to understand what helped make SMILES successful.
The original line notations were developed in the 1800s. By the late 1800s chemists began to systematize the language into what is now called the IUPAC nomenclature. For example, caffeine is "1,3,7-trimethylpurine-2,6-dione". The basics of this system are taught in high school chemistry class. It takes years of specialized training to learn how to generate the correct name for complex structures.
Chemical nomenclature helps chemists index the world's information about chemical structures. In short, if you can assign a unique name to a chemical structure (a "canonical" name), then you can use standard library science techniques to find information about the structure.
The IUPAC nomenclature was developed when books and index cards were the best way to organize data. Punched card machines brought a new way of thinking about line notations. In 1946, G. Malcolm Dyson proposed a new line notation meant for punched cards. The Dyson notation was developed as a way to mechanize the process of organizing and publishing a chemical structure index. It became a formal IUPAC notation in 1960, but was already on its last legs and was dead within a few years. While it might have been useful for mechanical punched card machines, it wasn't easily repurposed for the computer needs of the 1960s. For one, it depended on superscripts and subscripts, and used characters which didn't exist on the IBM punched cards.
William J. Wiswesser in 1949 proposed the Wiswesser Line Notation, universally called WLN, which could be represented in EBCDIC and (later) ASCII in a single line of text. More importantly, unlike the Dyson notation, which follows the IUPAC nomenclature tradition of starting with the longest carbon chain, WLN focuses on functional groups, and encodes many functional groups directly as symbols.
Chemists tend to be more interested in functional groups, and want to search based on those groups. For many types of searches, WLN acts as its own screen, that is, it's possible to do some types of substructure search directly on the symbols of the WLN, without having to convert the name into a molecular structure for a full substructure search. To search for structures containing a single sulfur, look for WLNs with a single occurrence of S, but not VS or US or SU. The chemical information scientists of the 1960s and 1970s developed several hundred such clever pattern searches to make effective use of the relatively limited hardware of that era.
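As a rough illustration of what such a symbol-level screen looks like, here is a minimal Python sketch. It takes the single-sulfur rule literally: exactly one S in the string, and none of the pairs VS, US, or SU. The function name `single_sulfur_screen` and the sample strings are made up for this example, and I have not checked the strings against the actual WLN rules. A production screen from that era would certainly have handled more special cases.

```python
# Pairs that, per the rule in the text, disqualify a match on "S".
EXCLUDED_PAIRS = ("VS", "US", "SU")


def single_sulfur_screen(wln: str) -> bool:
    """Return True if the notation string passes the single-sulfur screen.

    This is plain text matching on the notation itself; no molecular
    structure is ever built.
    """
    # Require exactly one occurrence of the symbol S.
    if wln.count("S") != 1:
        return False
    # Reject the string if that S appears in an excluded pair.
    return not any(pair in wln for pair in EXCLUDED_PAIRS)


# Illustrative strings only; assumed for the example, not verified WLNs.
records = ["1S1", "1VS1", "SUYZZ", "1S1S1", "QR"]
hits = [wln for wln in records if single_sulfur_screen(wln)]
print(hits)
```

The point of the sketch is that the whole search is a couple of string operations per record, which is why this style of screening was practical on the hardware of the time.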
WLNs started to disappear in the early 1980s, before SMILES came on the scene. Wendy Warr summarized the advantages and disadvantages of WLNs in 1982. She wrote "The principle disadvantage of WLN is that it is not user friendly. This can only be overcome by programs which will derive a canonical WLN from something else (but no one has yet produced a cost-effective program to do this for over 90% of compounds), by writing programs to generate canonical connection tables from noncanonical WLNs, or by accepting the intervention of a skilled "middle man"."
Dyson/IUPAC and WLNs were just two of dozens, if not hundreds, of proposed line notations. Nearly every proposal suffered from a fatal flaw - they could not easily