Selecting Molecules for Virtual Screening

E. K. Davies and C. J. Davies
Treweren Consultants Ltd, Holmleigh, Evesham Road, Harvington, Evesham WR11 8LU, UK

Abstract

This paper describes the origin of the 3.5 billion molecules that were used in the CAN-DDO Screen Saver Cancer Research project. In addition to commercially available catalogues and well-established combinatorial libraries, de novo derivative generation was used to increase the number of molecules by two orders of magnitude. The importance of drug-likeness criteria, and the criteria that were used, are also discussed.

Introduction

In order to make a significant contribution to pharmaceutical research, High Throughput Screening (HTS) requires a large number of molecules to be tested for biological activity. It is not unusual to be able to test over 10,000 samples per day, which means that most pharmaceutical in-house historical collections can be tested in 1-3 months. Unless larger numbers of molecules are to be tested, any further reduction in this time is unlikely to significantly reduce the timescales for drug discovery research. However, there are two other issues of potential concern: the cost of testing and the limited quantity of each sample in the collection. Consequently, if all molecules were tested on all screens, over time many companies would face the prospect of consuming their entire in-house historical collection, or at least substantially reducing the quantities of samples. For some companies combinatorial chemistry provides a means of generating samples for HTS. However, even if it were conceivable to make all drug-like molecules by combinatorial methods, the costs would be prohibitive.

This paper focuses on early stage lead generation and assumes that the elucidation of the human genome and the rapid increase in the number of 3-D protein receptor structures will mean that it is not necessary to make and biologically test large numbers of drug-like molecules. Instead, vast numbers of molecules should be computationally evaluated. Fortunately, there is a wide range of software which claims to be applicable to this problem (1,2), which leaves the issue of how to generate appropriate molecules in sufficient numbers.

Drug-likeness

It is obvious that many small organic molecules, such as those that are insoluble, reactive or toxic, are generally unsuitable as potential drugs. The importance of ADME-Tox (Absorption, Distribution, Metabolism, Excretion and Toxicity) is widely acknowledged and significant progress has been made in understanding and predicting such molecular characteristics (3-7). Nonetheless, the best predictions remain approximate, and some of the crudest and fastest approaches, which eliminate molecules based on simple properties such as molecular mass, number of heteroatoms, LogP etc., are often used. In addition, it is common practice to eliminate reactive and toxic molecules based on undesirable substructures.

The THINK software (8) includes functionality to calculate the range of properties shown in Table I. A number of these properties can be used in combination to filter out molecules which are not drug-like. The CAN-DDO project (9) used molecular mass, number of centres (hydrogen bond donors, acceptors, charged atoms, ring centres etc.), polar surface area, number of rotatable bonds and number of conformations. The choice of properties and ranges was complicated by the fact that some molecules which are marketed as drugs have property values outside the ranges indicated in Table I. It is acknowledged that during lead optimisation the molecular weight and lipophilicity often increase, and consequently leads which start with high values of either of these parameters are perhaps poor choices for optimisation.

Table I THINK properties and filters used for CAN-DDO

Name          Minimum(a)  Maximum(a)  Comment
Atoms                                 Counter
Bonds                                 Counter
HetAtoms                              Counter
Donors                                Count of hydrogen donors
Acceptors                             Count of hydrogen acceptors
Positives                             Count of positively charged centres
Negatives                             Count of negatively charged centres
Acids                                 Count of acidic H-donors
Bases                                 Count of basic H-acceptors
Rings                                 Counter
Aromatic                              Counter
HetAromatic                           Counter
Branches                              Counter
Halogens                              Counter
Centres       2           9           Count of centres
Chirals                               Count of chiral atoms
Mass          150         800
Flexibility                           On geometric scale (b)
Volume                                Based on VdW radii
Area                                  Based on VdW radii
Lipophilicity                         On scale of 0-1 (b)
PSA                                   Polar surface area
NPSA                                  Non-polar surface area
PFA                                   Polar fractional area
NPFA                                  Non-polar fractional area
XSA           20          240         O+N surface area
XFA                                   O+N fractional area
CPK-Contacts                          Counter
VDW-Contacts                          Counter
Rot-Bonds     0           10          Count excluding rings
Conformers    0           1000000     Based on product of increments
E-Torsion                             Torsional energy
(a) Where no minimum and maximum is specified no property filter was used
(b) The algorithms used are described elsewhere
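
The range filters of Table I can be illustrated with a minimal Python sketch. This is not the THINK implementation: the function names and the example property values are hypothetical, and reading the Centres entry as a minimum of 2 and a maximum of 9 is an interpretation of the table.

```python
# Ranges from Table I (only properties with a stated minimum and maximum).
# Reading "Centres" as 2 to 9 is an interpretation of the table.
RANGES = {
    "Centres": (2, 9),
    "Mass": (150, 800),
    "XSA": (20, 240),
    "Rot-Bonds": (0, 10),
    "Conformers": (0, 1_000_000),
}


def first_failed_property(props, ranges=RANGES):
    """Return the first property whose value lies outside its range, else None."""
    for name, (low, high) in ranges.items():
        if not low <= props[name] <= high:
            return name
    return None


def is_drug_like(props, ranges=RANGES):
    """True when every filtered property lies inside its range."""
    return first_failed_property(props, ranges) is None


# Hypothetical property values for illustration; not calculated by THINK.
candidate = {"Centres": 5, "Mass": 312.4, "XSA": 85.0, "Rot-Bonds": 4, "Conformers": 729}
too_heavy = dict(candidate, Mass=950.0)

print(is_drug_like(candidate))           # True
print(first_failed_property(too_heavy))  # Mass
```

Returning the name of the failing property, rather than only a yes/no answer, makes it possible to report why a molecule was rejected.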

If properties are used alone, then it is inevitable that some molecules are included which are reactive, easily metabolised or toxic. Consequently, it is common practice when selecting molecules for experimental High Throughput Screening to eliminate molecules which are considered undesirable based on a list of substructures. The substructures used in the CAN-DDO project (and by THINK by default) are indicated in Table II. This list was constructed following discussions with several vendors of samples for High Throughput Screening and certain pharmaceutical companies. Again, there are examples of drugs which contain many of these substructures and consequently some chemists would use a smaller or different list.

Table II Substructures used as filters

Unstable  Reactive        Undesirable
NOC       CC(H)=C(H)C=O   [M]
HNO       C=COH           [Si]N
[ND]=O    C=CNH           [Si]O
[NITR]    O=C[HAL]        SH
N[SSP2]   N=C[HAL]        C1XC1
O[SSP2]   S=C[HAL]        C1SC1
OO        [HAL]C[HAL]     SC#N
SS        HOCOH           C(=O)S
NN        O=COC=O         NP
N#N       COS(=O)(=O)C    PS
N=N=N     COS(=O)OC       C=P
N[CAK]O   P[HAL]
N=C=O     CN#C
N=C=S     PC#N
N=C=N     O=CC#N
O[HAL]    OCC#N
N[HAL]    NC#N
S[HAL]    CC(H)=O
OC[HAL]   PP
NC[HAL]
C=COH

THINK supports atom types and wildcards in square brackets
ND Nitrogen with double bond
NITR Nitrogen of nitro group
SSP2 Sp2 hybridised sulphur
CAK Alkyl chain carbon
HAL F, Cl, Br or I
M Metal atom
X Nitrogen or Oxygen

Catalogue Collections

Prior to commencing the CAN-DDO project the catalogues from some 13 suppliers were filtered to eliminate molecules that were not drug-like. The versions of the catalogues included in HTS Chemicals (10) were used in some cases, although updates were used where these were readily available. The numbers of molecules eliminated and some of the common reasons are summarized in Table III. The filters were applied to 3D structures in the order of Tables I and II, with the consequence that once a molecule had been eliminated it was not determined whether any other filters would also have removed it. The most common reason for eliminating molecules was the inability of THINK 1.0 to create valid 3D coordinates. While THINK 1.14 has some improvements, the inability to create 3D structures for bridged ring systems remains the most significant limitation, notwithstanding some difficulties in synthesising such molecules.
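
The effect of applying the filters in a fixed order, so that each eliminated molecule is counted under one reason only, can be sketched as follows. This is an illustrative sketch rather than the THINK code: the molecule records, the field names and the three filters shown are hypothetical stand-ins for real structures and for the full lists in Tables I and II.

```python
from collections import Counter

# Hypothetical molecule records; a real run would hold structures, not flags.
molecules = [
    {"id": "m1", "has_3d": False, "mass": 950.0, "substructures": {"NN"}},
    {"id": "m2", "has_3d": True, "mass": 950.0, "substructures": set()},
    {"id": "m3", "has_3d": True, "mass": 300.0, "substructures": {"NN"}},
    {"id": "m4", "has_3d": True, "mass": 300.0, "substructures": set()},
]

# Ordered (reason, predicate) pairs; a predicate returns True when the molecule passes.
FILTERS = [
    ("3D", lambda m: m["has_3d"]),
    ("Mass", lambda m: 150 <= m["mass"] <= 800),
    ("N-N", lambda m: "NN" not in m["substructures"]),
]


def tally_first_failures(mols, filters):
    """Apply filters in order and count only the first reason for each rejection."""
    kept, reasons = [], Counter()
    for mol in mols:
        for reason, passes in filters:
            if not passes(mol):
                reasons[reason] += 1
                break  # later filters are never evaluated for this molecule
        else:
            kept.append(mol["id"])
    return kept, reasons


kept, reasons = tally_first_failures(molecules, FILTERS)
print(kept)           # ['m4']
print(dict(reasons))  # {'3D': 1, 'Mass': 1, 'N-N': 1}
```

Molecule m1 would also fail the mass and N-N filters, but it is counted only under 3D. For this reason the per-filter columns of a table produced this way understate how many molecules each later filter would remove on its own.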

Table III Results of filtering catalogue molecules

Supplier  Total  Filtered  3D  Centres  Mass  XSA  [ND]=O  N-N  C=CNH  [HAL]C[HAL]
Asinex55003162431222161206014276259572713101935
Bionet240467935626560212054894818633302461
Chembridge23440210898612912252549396892575719094126075710
Chemstar49473167449151420426811316681394839331217
Comgenics-560000205857711756509401108176932292331
Comgenics-10400001281060612892046466263141525
Labotest2570077555069233737090020673593440454
Maybridge5392924314222932712921784299530416245417
Orion18405614424421178163530244516211961753
Sigma-Aldrich714102639211986613320714252598835304632292
ChemDiv 0114062599814131725033216221249547398
ChemDiv 1013202554610771687035814371061697589
ChemDiv 50115796444196905186601249114201283471853521
SPECS797272830621148707749218175465628035811300
SVETA97425307582457889011261877110291035240632132
TRG185337620331716047439184218661354150
Totals97111337055512209011921038672932578029801544157333185

It is apparent that, based on the THINK criteria, a significant proportion of the catalogue molecules marketed for High Throughput Screening are not drug-like. Most suppliers appear to eliminate molecules on molecular mass but not on the number of centres (or heteroatoms). Although it is known that some suppliers filter on lipophilicity, the range of property calculations used means that this is not apparent from our work. The final set of molecules used for the CAN-DDO project removed 55,760 duplicates and added an update of the ASINEX catalogue and the drug-like subset of the NCI collection, giving a total of 409,843 drug-like molecules which might be available for biological testing.

Combinatorial Chemistry Libraries

Combinatorial chemistry is now well-established as a means of making large numbers of molecules for High Throughput Screening. The range of chemistry which can be automated is quite large and continues to grow (11). Some of these libraries were developed to optimise specific series of molecules and do not have the generality of reagent that is common to many of the more well-known libraries. A set of 23 libraries was selected from the literature and is summarized in Table IV, together with the results of filtering using the THINK standard drug-like criteria.

Table IV Libraries and R-groups

Library Core                                R-groups                                                   Total(a)    Filtered
L1      S=C1N([3])C(=O)C([1])N1C[2]         [1]C(H)=O [2]N=C=S [3]C(H)(N)C(=O)OH                       268320      126105
L2      C1C(C(=O)O)N(C(=O)[2])C([1])S1      [1]C(=O)[HAL] [2]C(H)=O                                    372528      217989
L4(b)   [1]-[2]                             [1]N=C=O [2]NH                                             144744      114269
L5(b)   [1]-[2]                             [1]C(=O)Cl [2]NH                                           267972      179137
L6(b)   [1]-[2]                             [1]OC(=O)Cl [2]NH                                          52812       33702
L7(b)   [1]-[2]                             [1]S(=O)(=O)Cl [2]NH                                       76284       55321
L8      [1]C(=O)N([2])C([3])C(=O)O[4]       [1]C(=O)Cl [2]N#C [3]N=C=S [4]OH                           1156780872  28135200
L18     O=C(O)c1ccc2N([1])C(=O)N([2])c2c1   [1]N(H)(H) [2][HAL]                                        2364897     50119
L20     O=C(N)c1ccc2n([1])c(C[2])nc2c1      [1]N(H)(H) [2]NH                                           2529108     466930
L21     n1c2ccccc2n([1])c1CS[2]             [1]N(H)(H) [2]SH                                           156453      88314
L23B    NC(=O)c1ccc(cc1)-c2c([2])c([1])no2  [1]C(=O)OR' [2][HAL]                                       2183826     477703
L29     c1ccc(cc1)-c2nc([2])n([1])c2C(=O)N  [1]N(H)(H) [2]C(=O)OH                                      1543842     559489
L32     N1([1])CCC(=O)N([2])C1(=O)          [1]N(H)(H) [2]N=C=O                                        95682       71236
L36     c12ccccc1C(=O)N([1])C(=O)N([2])2    [1]N(H)(H) [2]N(H)(H)                                      1671849     566401
L37     [1]C1C(=O)NC([3])C(=O)N1C[2]        [1]C(H)(N(H)(H))C(=O)OH [2]C(H)=O [3]C(H)(N(H)(H))C(=O)OH  2875392     972364
L39     o1c([1])c([2])cc1C(=O)OCC           [1]C(=O)OH [2]C#CH                                         90744       62665
L41     n1([1])cc([2])cc1[3]                [1]C(=O)OH [2]C(H)(H)C(=O)H [3]C(H)(H)[NITR](=O)[O-]       1578753     618937
L43     C1([2])CC(=O)C=CN1C(=O)[1]          [1]C(=O)Cl [3][HAL]                                        250573      191834
L4A     [1]NC(=O)[2]                        [1]N=C=O [2]NH                                             144744      125387
L5A     [1]C(=O)[2]                         [1]C(=O)Cl [2]NH                                           267972      240212
L5C     [1]C(=O)[2]                         [1]C(=O)OH [2]NH                                           958782      736052
L7A     [1]S(=O)(=O)[2]                     [1]S(=O)(=O)Cl [2]NH                                       76284       67502
L44     C1([2])OC(=O)C=CN1C(=O)[1]          [1]C(=O)OH [2][HAL]                                        2183826     1170344
Total                                                                                                  1176936259  35327212
(a) Based on filtered R-groups in Sigma-Aldrich catalogue
(b) These libraries do not utilise valid reactions

During this work it became apparent that it was necessary to implement within THINK functionality to enumerate libraries faster than commercial products such as Chem-X (10), and in such a way that reagents and the associated products can be eliminated prior to enumeration. This required applying the upper property filters and substructure filters to the R-groups. In addition, use is made of the fact that many of the properties can be estimated by summing the properties of the R-groups, with the consequence that products can be eliminated without performing a detailed enumeration. As a consequence, the effective enumeration speed was increased.
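
The idea of rejecting reagents and products before enumeration can be sketched as below. This is a simplified illustration under stated assumptions, not the THINK algorithm: mass and rotatable bond counts are treated as exactly additive fragment contributions, the pre-filter tests each R-group together with the core, and all numerical values other than the Table I upper limits are hypothetical.

```python
from itertools import product

MAX_MASS = 800      # upper mass limit from Table I
MAX_ROT_BONDS = 10  # upper rotatable-bond limit from Table I


def prefilter(rgroups, core):
    """Drop R-groups that exceed an upper limit when combined with the core alone."""
    return [
        r for r in rgroups
        if core["mass"] + r["mass"] <= MAX_MASS
        and core["rot"] + r["rot"] <= MAX_ROT_BONDS
    ]


def count_passing_products(core, rgroup_lists):
    """Count products whose summed properties pass, without building any structures."""
    lists = [prefilter(rs, core) for rs in rgroup_lists]
    passing = 0
    for combo in product(*lists):
        mass = core["mass"] + sum(r["mass"] for r in combo)
        rot = core["rot"] + sum(r["rot"] for r in combo)
        if mass <= MAX_MASS and rot <= MAX_ROT_BONDS:
            passing += 1
    return passing


# Hypothetical fragment contributions for a two-position library.
core = {"mass": 100.0, "rot": 1}
r1 = [{"mass": 50.0, "rot": 1}, {"mass": 750.0, "rot": 0}]
r2 = [{"mass": 200.0, "rot": 2}, {"mass": 600.0, "rot": 9}]

print(count_passing_products(core, [r1, r2]))  # 1
```

In this example the 750 mass R-group is removed before any combination is formed, so only two of the four possible products are examined, and one of those fails on rotatable bonds. Only the surviving combinations would need a detailed enumeration.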

De Novo Derivative Generation

The number of molecules which can be made by combinatorial chemistry depends upon the available reagents and the number of libraries. In general, libraries with 4 or more R-groups, such as peptides and Ugi-type libraries, have a large proportion of molecules that are not drug-like because of their flexibility or molecular mass. Furthermore, some care needs to be taken when choosing libraries to avoid repeated use of the same lists of R-groups with subtly different cores. Thus, if larger numbers of molecules for virtual screening are to be generated, it might be necessary to consider reagents which are not commercially available.

For the CAN-DDO project, rather than generating reagent series and then enumerating and filtering the corresponding libraries, derivatives of molecules were generated and filtered at search time. This has the advantage of reducing the amount of data which has to be pre-processed, and it can also be applied to catalogue molecules. The current implementation uses a list of transformations which are summarized in Table V and include a range of oxidations, reductions, additions, eliminations and substitutions. The algorithm selects one of these at random and determines whether it can be applied. If so, one location is selected at random and the resulting molecule is checked for drug-likeness. In addition, in order to bias the molecules generated to be similar to the starting molecule, an annealing step is performed which effectively reduces the probability of molecular mass and other properties included in the drug-likeness criteria greatly increasing. It should also be recognised that it is not necessary to be restricted to transformations which correspond to chemical reactions.
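
The generation loop described above can be sketched in Python. This is a toy illustration and not the THINK implementation. It makes several loudly stated assumptions: molecules are plain strings, substructure matching is replaced by simple substring search, string length stands in for molecular mass, each accepted derivative becomes the starting point for the next attempt, and the acceptance rule exp(-growth / temperature) is one possible form of annealing step. The names derive, occurrences and size are hypothetical.

```python
import math
import random

# A few (substructure, replacement) pairs in the style of Table V.
TRANSFORMS = [
    ("OH", "Cl"),
    ("Cl", "OH"),
    ("OH", "N(H)(H)"),
    ("N(H)(H)", "OH"),
    ("C=O", "C(OH)(H)"),
    ("C(OH)(H)", "C=O"),
]


def occurrences(text, pattern):
    """Start positions of every occurrence of pattern in text."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def size(mol):
    """Toy stand-in for molecular mass: the length of the string."""
    return len(mol)


def derive(start, attempts, seed, max_size=40, temperature=2.0):
    """Attempt a number of random transformations and return the accepted derivatives."""
    rng = random.Random(seed)  # a separate seed for each starting molecule
    current, derivatives = start, []
    for _ in range(attempts):
        pattern, replacement = rng.choice(TRANSFORMS)
        sites = occurrences(current, pattern)
        if not sites:
            continue  # this transform cannot be applied
        pos = rng.choice(sites)
        candidate = current[:pos] + replacement + current[pos + len(pattern):]
        if size(candidate) > max_size:
            continue  # fails the toy drug-likeness check
        growth = size(candidate) - size(current)
        # Annealing-style step: growth is accepted only with reduced probability.
        if growth > 0 and rng.random() >= math.exp(-growth / temperature):
            continue
        derivatives.append(candidate)
        current = candidate
    return derivatives


series = derive("CC(OH)C=O", attempts=100, seed=1)
print(len(series) <= 100)  # True
```

Because the random number generator is created from the seed inside the function, the same starting molecule and seed always give the same series, while a different seed gives a different one.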

Table V Transforms used by de novo derivative generator

Substructure                             Replacement                              Type
[A]C(H)(H)[A]                            [A]C(H)(H)C(H)(H)[A]                     Chain increase
C(H)(H)C(H)(H)C(H)(H)                    C(H)(H)C(H)(H)C(=O)C(H)(H)C(H)(H)        Chain increase
C(H)(H)C(H)(H)C(H)(H)                    C(H)(H)C(H)(H)C(=O)N(H)C(H)(H)C(H)(H)    Chain increase
C(H)(H)C(H)(H)C(H)(H)                    C(H)(H)C(H)(H)C(=O)OC(H)(H)C(H)(H)       Chain increase
C(H)(H)C(H)(H)C(H)(H)                    C(H)(H)C(H)(H)OC(H)(H)C(H)(H)            Chain increase
C(H)(H)C(H)(H)C(H)(H)                    C(H)(H)C(H)(H)N(H)C(H)(H)C(H)(H)         Chain increase
C(H)(H)C(H)(H)C(H)(H)                    C(H)(H)C(H)(H)SC(H)(H)C(H)(H)            Chain increase
C(=O)C                                   C(=O)OC                                  Chain increase
C(=O)C                                   C(=O)N(H)C                               Chain increase
C(H)(H)C(H)(H)                           C(H)(H)                                  Chain reduction
C(H)(H)C(=O)C(H)(H)                      C(H)(H)                                  Chain reduction
C(H)(H)C(=O)N(H)C(H)(H)                  C(H)(H)                                  Chain reduction
C(=O)OC(H)(H)                            C(H)(H)                                  Chain reduction
C(H)(H)OC(H)(H)                          C(H)(H)                                  Chain reduction
C(H)(H)N(H)C(H)(H)                       C(H)(H)                                  Chain reduction
C(H)(H)SC(H)(H)                          C(H)(H)                                  Chain reduction
C(=O)OC                                  C(=O)C                                   Chain reduction
C(=O)N(H)C                               C(=O)C                                   Chain reduction
C(H)(H)C(H)(H)C(H)(H)                    C(H)(H)OC(H)(H)                          Heteroatoms
C(H)(H)C(H)(H)C(H)(H)                    C(H)(H)C(=O)C(H)(H)                      Heteroatoms
C(H)(H)C(H)(H)C(H)(H)                    C(H)(H)N(H)C(H)(H)                       Heteroatoms
C(H)(H)C(H)(H)C(H)(H)                    C(H)(H)SC(H)(H)                          Heteroatoms
[OAL]                                    C(H)(H)                                  Heteroatoms
C=O                                      C(H)(H)                                  Heteroatoms
[NAL]                                    C(H)                                     Heteroatoms
[SSP2]                                   C(H)(H)                                  Heteroatoms
OH                                       Cl                                       Substitution
OH                                       Br                                       Substitution
OH                                       I                                        Substitution
OH                                       F                                        Substitution
OH                                       C(H)(H)(H)                               Substitution
OH                                       C#N                                      Substitution
OH                                       C(H)(H)N(H)(H)                           Substitution
OH                                       N(H)(H)                                  Substitution
OH                                       OC(H)(H)(H)                              Substitution
OH                                       OC(H)(H)C(H)(H)(H)                       Substitution
Cl                                       OH                                       Substitution
Br                                       OH                                       Substitution
I                                        OH                                       Substitution
F                                        OH                                       Substitution
C(H)(H)(H)                               OH                                       Substitution
C#N                                      OH                                       Substitution
C(H)(H)N(H)(H)                           OH                                       Substitution
N(H)(H)                                  OH                                       Substitution
OC(H)(H)(H)                              OH                                       Substitution
OC(H)(H)C(H)(H)(H)                       OH                                       Substitution
N(H)(H)                                  N(C(H)(H)(H))(C(H)(H)(H))                Substitution
N(H)(H)                                  N(C(H)(H)C(H)(H)(H))(C(H)(H)C(H)(H)(H))  Substitution
N(H)(H)                                  F                                        Substitution
N(H)(H)                                  Cl                                       Substitution
N(H)(H)                                  Br                                       Substitution
N(H)(H)                                  I                                        Substitution
N(H)(H)                                  OH                                       Substitution
N(H)(H)                                  C#N                                      Substitution
N(H)(H)                                  N(H)C(=O)N(H)(H)                         Substitution
N(H)(H)                                  N(H)C(=S)N(H)(H)                         Substitution
N(H)(H)                                  C(H)(H)(H)                               Substitution
N(H)(H)                                  OC(H)(H)(H)                              Substitution
N(C(H)(H)(H))(H)                         N(H)(H)                                  Substitution
N(C(H)(H)(H))(C(H)(H)(H))                N(H)(H)                                  Substitution
N(C(H)(H)C(H)(H)(H))(H)                  N(H)(H)                                  Substitution
N(C(H)(H)C(H)(H)(H))(C(H)(H)C(H)(H)(H))  N(H)(H)                                  Substitution
F                                        N(H)(H)                                  Substitution
Cl                                       N(H)(H)                                  Substitution
Br                                       N(H)(H)                                  Substitution
I                                        N(H)(H)                                  Substitution
OH                                       N(H)(H)                                  Substitution
C#N                                      N(H)(H)                                  Substitution
N(H)C(=O)N(H)(H)                         N(H)(H)                                  Substitution
N(H)C(=S)N(H)(H)                         N(H)(H)                                  Substitution
C(H)(H)(H)                               N(H)(H)                                  Substitution
OC(H)(H)(H)                              N(H)(H)                                  Substitution
H                                        OH                                       Substitution
F                                        H                                        Substitution
Cl                                       H                                        Substitution
Br                                       H                                        Substitution
I                                        H                                        Substitution
OH                                       H                                        Substitution
N(H)(H)                                  H                                        Substitution
C(H)(H)(H)                               H                                        Substitution
C(H)(H)C(H)(H)(H)                        H                                        Substitution
C(H)(H)C(H)(H)C(H)(H)(H)                 H                                        Substitution
N(H)                                     NC(=O)C(H)(H)(H)                         Substitution
N(H)                                     NC(H)(H)(H)                              Substitution
N(H)                                     NC(H)(H)C(H)(H)(H)                       Substitution
[CAR]H                                   [CAR]N(H)(H)                             Substitution
C(OH)(H)                                 C=O                                      Oxidation
CN(H)(H)                                 C[N+](=O)[O-]                            Oxidation
C=O                                      C(OH)(H)                                 Reduction
C[N+](=O)[O-]                            CN(H)(H)                                 Reduction
C=C                                      C(H)C(H)                                 Hydrogenation
C([HAL])C(H)                             C=C                                      Elimination
C(OH)C(H)                                C=C                                      Elimination
C=C                                      C(H)C(H)                                 Addition
C=C                                      C(F)C(H)                                 Addition
C=C                                      C(Cl)C(H)                                Addition
C=C                                      C(Br)C(H)                                Addition
C=C                                      C(I)C(H)                                 Addition
C=C                                      C(OH)C(H)                                Addition

THINK supports atom types and wildcards in square brackets
A Any atom
OAL Aliphatic (non-aromatic) oxygen
NAL Aliphatic (non-aromatic) nitrogen
SSP2 Sp2 hybridised sulphur
CAR Aromatic carbon

When starting from similar molecules, it is conceptually possible to generate identical molecules, and in the worst case the series might converge to give identical structures at each subsequent step. The probability of this occurring is effectively eliminated by using a different random number generator starting number (seed) for each molecule. In the CAN-DDO project, 100 derivatives were attempted for each of the 35 million starting molecules, generating approximately 3.5 billion molecules. In order to give some indication of the possible overlap in derivatives, some simple experiments were performed:

(a) Starting with 3 commercially available molecules, 100 derivatives were generated twice for each molecule using different random number seeds. This resulted in 3, 5 and 3 identical molecules.
(b) A further experiment took two derivatives that had been generated (which may or may not be related by a single transform) and generated 100 derivatives from each of these using the same random number seed. This resulted in 2, 0 and 0 identical molecules.
(c) A final experiment extracted 3 random molecules from a medium sized library (L4A), together with the lists of molecules in this library that were more than 99% similar to these using THINK's functional group based keys. For each of the 3 starting molecules and an arbitrary similar molecule, 100 derivatives were generated using the same random number seed. On comparison, no identical molecules were found.
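
The comparisons in experiments (a) to (c) amount to counting the molecules that two derivative lists have in common. A minimal sketch follows, assuming each molecule is stored as a canonical string so that identical structures compare equal. How THINK represents and compares molecules is not described here, and the example strings are hypothetical.

```python
def count_identical(run_a, run_b):
    """Number of distinct molecules that appear in both derivative lists."""
    return len(set(run_a) & set(run_b))


# Hypothetical derivative lists from two runs on related starting molecules.
run_a = ["CCO", "CCN", "CCCl"]
run_b = ["CCN", "CCBr", "CCO", "CCO"]

print(count_identical(run_a, run_b))  # 2

# Scale of the CAN-DDO derivative set quoted above.
attempted = 100 * 35_000_000
print(attempted)  # 3500000000
```

Using sets means that a molecule generated more than once within a single run is counted only once in the overlap.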

These experiments confirm the intuitive conclusion that if the catalogues or libraries contain molecules that are closely related (e.g. by a transform known to the de novo derivative generator), then there is a small chance of generating duplicate molecules. In practice, very few molecules are related in this way, and even for these, different random number generator seeds are used, so the number of duplicate molecules is negligible.

The derivatives generated depend upon the transformations present, the random number generator and the starting seed. To reproduce a given analogue series that may be of historic interest, the order of transformations, the random number generator or the random number seed can be modified. Figure 1 shows an interesting series which starts from burimamide (a molecule which shows some histamine H2 antagonist activity) and proceeds through a series of transformations to cimetidine (which was the first commercial H2 antagonist) and then to the more potent ranitidine molecule. Although this example omits some of the earliest molecules and many analogues that were made, the discovery of ranitidine was made independently and those who discovered cimetidine failed to find a significantly better molecule (12). Perhaps if THINK had been available at the time, the fortunes and future of the company that discovered cimetidine would have been different.

Conclusions

This paper suggests that it is possible to generate vast numbers of drug-like molecules for virtual screening and that the catalogue molecules available represent a small proportion of these (about 0.1%). It is also apparent that the proportion of drug-like molecules from huge libraries is usually small and consequently such libraries are likely to be of lower value. There may be scope to increase the numbers of derivatives and/or the number of libraries to cover more drug-like molecules. However, this will inevitably give more hits for subsequent analysis and it is not necessarily true that this will have the consequence of reducing timescales and costs for drug discovery.

One might expect the inclusion of derivatives to result in series of related hits, which may be indicative of hits that are synthetically accessible and are potentially good leads. In addition, the approach should identify families that might be missed if only a small number of representatives for each family (or cluster) were screened. Obviously, some virtual screening software is too slow and/or cannot be run on a sufficiently large distributed processing scale to include derivatives in the virtual screening.

References

(1) Warr, W. Virtual High-Throughput Screening: Computational Tools for Drug Discovery and Design in Spectrum Life Sciences, Decision Resources, Inc, MA, USA. 2001
(2) Li, J., Murray, C.W., Waszkowycz, B. and Young, S.C. Targeted molecular diversity in drug discovery: integration of structure-based design and combinatorial chemistry. Drug Discovery Today 1998, 3, 105-112.
(3) Lipinski, C.A., Lombardo, F., Dominy, B.W. and Feeney, P.J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Advanced Drug Delivery Rev. 1997, 23, 3-25.
(4) Clark, D.E. and Pickett, S.D. Computational methods for the prediction of 'drug-likeness'. Drug Discov. Today 2000, 5, 49-58.
(5) Liu, R. and So, S-S. Development of Quantitative Structure-Property Relationship Models for Early ADME Evaluation in Drug Discovery. 1. Aqueous Solubility. J. Chem. Inf. Comput. Sci. 2001, 41, 1633-1639.
(6) Liu, R., Sun, H. and So, S-S. Development of Quantitative Structure-Property Relationship Models for Early ADME Evaluation in Drug Discovery. 2. Blood-Brain Barrier Penetration. J. Chem. Inf. Comput. Sci. 2001, 41, 1623-1632.
(7) Beresford, A.P., Selick, H.E. and Tarbit, M.H. The emerging importance of predictive ADME simulation in drug discovery. Drug Discov. Today 2002, 7, 109-116.
(8) Davies, E.K. and Davies, C.J. THINK: A new program for Virtual Screening. In preparation.
(9) Hand, L. Computing for Cancer Research. The Scientist 2001, 15, 1-5.
(10) HTS Chemicals was developed at Chemical Design by the authors for use with Chem-X. Both products were discontinued after the company was purchased by Oxford Molecular in 1998.
(11) Solid Phase Synthesis database available from Accelrys (www.Accelrys.com)
(12) Ganellin, C.R. Chemistry and Structure-Activity Relationships of Drugs Acting at Histamine Receptors in Pharmacology of Histamine Receptors; Ganellin, C.R. and Parsons, M.E. Eds; John Wright & Sons 1982, Chapter 2, p10-102.