How We Indexed Millions of Compounds by Element

A technical look at how we built the database powering CompoundLookup's unique element-based search.

CompoundLookup Team

Author

January 28, 20266 min read

The Technical Challenge

When we set out to build CompoundLookup, we faced a unique database challenge: index compounds by their constituent elements, not by their formulas.

Traditional chemistry databases index by: - Compound name - Molecular formula (exact match) - CAS number - InChI key

We needed to index by: which elements are present.

The Data Pipeline

Step 1: Data Extraction We download compound data from PubChem's public FTP servers. This includes millions of compounds with their formulas, names, and properties.

Step 2: Formula Parsing Each formula (like "C₆H₁₂O₆") is parsed to extract unique elements: - Input: "C6H12O6" - Output: [C, H, O]

This sounds simple but has edge cases: - Two-letter symbols (Fe, Na, Cl) - Parentheses in formulas - Ions and charges - Isotope notations

Step 3: Element Indexing For each compound, we store relationships between the compound and its elements. This creates a many-to-many relationship enabling queries like:

"Find all compounds WHERE elements CONTAIN [C, H, O]"

Step 4: Subset Generation To enable partial matching (find C-H compounds even if they also contain O), we pre-generate common element subsets for faster queries.

Query Optimization

When you select elements, we run optimized SQL:

SELECT compounds.* FROM compounds
WHERE EXISTS (element = 'C')
  AND EXISTS (element = 'H')
  AND EXISTS (element = 'O')

Indexes on element columns make this fast even with millions of compounds.

The Result

Our database enables something no other chemistry database can do: instant compound lookup by any element combination.

This technology powers every search on CompoundLookup, delivering results in milliseconds.

Try Element-Based Search

Experience the revolutionary way to discover chemical compounds. No formulas needed — just select elements.

Start Searching