The Technical Challenge
When we set out to build CompoundLookup, we faced a unique database challenge: index compounds by their constituent elements, not by their formulas.
Traditional chemistry databases index by: - Compound name - Molecular formula (exact match) - CAS number - InChI key
We needed to index by: which elements are present.
The Data Pipeline
Step 1: Data Extraction We download compound data from PubChem's public FTP servers. This includes millions of compounds with their formulas, names, and properties.
Step 2: Formula Parsing Each formula (like "C₆H₁₂O₆") is parsed to extract unique elements: - Input: "C6H12O6" - Output: [C, H, O]
This sounds simple but has edge cases: - Two-letter symbols (Fe, Na, Cl) - Parentheses in formulas - Ions and charges - Isotope notations
Step 3: Element Indexing For each compound, we store relationships between the compound and its elements. This creates a many-to-many relationship enabling queries like:
"Find all compounds WHERE elements CONTAIN [C, H, O]"
Step 4: Subset Generation To enable partial matching (find C-H compounds even if they also contain O), we pre-generate common element subsets for faster queries.
Query Optimization
When you select elements, we run optimized SQL:
SELECT compounds.* FROM compounds
WHERE EXISTS (element = 'C')
AND EXISTS (element = 'H')
AND EXISTS (element = 'O')
Indexes on element columns make this fast even with millions of compounds.
The Result
Our database enables something no other chemistry database can do: instant compound lookup by any element combination.
This technology powers every search on CompoundLookup, delivering results in milliseconds.