The International Chemical Identifier (InChI) is an open, non-proprietary string identifier for chemical substances developed by IUPAC (the International Union of Pure and Applied Chemistry) and the InChI Trust. Unlike the CAS number, which is a proprietary registry-assigned identifier, the InChI is generated algorithmically from the molecular structure. Run the same algorithm on the same structure anywhere in the world and you get the same string. This makes the InChI structurally unambiguous, machine-readable, free to use, and free to redistribute. Its compact form, the 27-character hashed InChIKey, is the one most commonly used in databases.
What an InChI looks like
A full InChI for water:
InChI=1S/H2O/h1H2
A full InChI for caffeine:
InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
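Every InChI starts with InChI= followed by the version (1, with a trailing S marking the standard form) and then layered information about the molecular formula, connectivity, hydrogen positions, charge, stereochemistry, and isotopes. Layers are separated by a /, and layers that do not apply to a molecule are simply omitted, which is why the two examples above contain only the formula, connectivity (c), and hydrogen (h) layers. The output is deterministic: running the InChI algorithm on the same molecule always produces the same string.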
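The layered layout can be illustrated with a few lines of Python. This is a minimal sketch, not part of the IUPAC reference software: the function name `split_inchi_layers` is made up for this example, and it assumes a simple standard InChI in which the formula layer is present and each prefix letter appears only once.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into its version, formula, and prefixed layers.

    Simplified on purpose: assumes the formula layer is present and that
    no prefix letter repeats (true for the examples in this article).
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *layers = inchi[len(prefix):].split("/")
    result = {"version": version}
    if layers:
        # The first layer after the version is the molecular formula;
        # it carries no prefix letter.
        result["formula"] = layers[0]
    for layer in layers[1:]:
        if layer:
            # Later layers start with a one-letter prefix,
            # e.g. c = connectivity, h = hydrogens, q = charge.
            result[layer[0]] = layer[1:]
    return result


water = split_inchi_layers("InChI=1S/H2O/h1H2")
print(water)  # {'version': '1S', 'formula': 'H2O', 'h': '1H2'}
```

Applied to the caffeine string above, the same function returns the formula C8H10N4O2 plus a c layer and an h layer.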
The InChI is information-dense but long. For database use the InChIKey is more practical:
| Substance | InChIKey |
|---|---|
| Water | XLYOFNOQVPJJNP-UHFFFAOYSA-N |
| Caffeine | RYYVLZVUVIJVGH-UHFFFAOYSA-N |
| Sodium hydroxide | HEMHJVSKTPXQMS-UHFFFAOYSA-M |
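The InChIKey is a 27-character identifier derived from a truncated SHA-256 hash of the InChI string. It has three blocks separated by hyphens. The first 14 characters are a hash of the connectivity information (the molecular skeleton). The next 10 consist of an 8-character hash of the remaining layers (stereochemistry, isotopes), a flag for standard (S) or non-standard (N) InChI, and a version character. The final character indicates the protonation state, with N meaning neutral. Because it is a hash, the InChIKey cannot be converted back into a structure, but it is fixed-length, hyperlink-friendly, and can be looked up in any database that ingests InChI.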
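The block layout can be checked mechanically. The sketch below is illustrative only: `parse_inchikey` and `InChIKeyParts` are names invented for this example, and the split of the second block into an 8-character hash, a standard/non-standard flag, and a version character reflects the published InChIKey format as understood here. It validates the shape of a key, not whether the key belongs to a real substance.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 letters + flag (S or N) + version letter, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton_hash: str      # first block: hash of the connectivity
    detail_hash: str        # hash of the remaining layers (stereo, isotopes)
    is_standard: bool       # True when the flag character is "S"
    version_char: str       # "A" for InChI version 1
    protonation_char: str   # "N" for neutral


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks, raising ValueError on a bad shape."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"malformed InChIKey: {key!r}")
    skeleton, detail, flag, version, proton = match.groups()
    return InChIKeyParts(skeleton, detail, flag == "S", version, proton)


parts = parse_inchikey("HEMHJVSKTPXQMS-UHFFFAOYSA-M")
print(parts.skeleton_hash, parts.is_standard, parts.protonation_char)
```

For the sodium hydroxide key from the table, the last field comes back as M rather than N, which is how the key signals a protonation state other than neutral.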
Why InChIKey beats CAS number for AI citation
The CAS number is the dominant identifier in chemical commerce. Every SDS lists CAS numbers. Every regulatory database uses CAS numbers. But the CAS number has three drawbacks for AI-citable chemistry content:
- CAS numbers are proprietary. The Chemical Abstracts Service charges for bulk access to the registry. Free public databases use CAS numbers under fair-use terms but do not have the full registry.
- CAS numbers are assigned by registry, not by structure. Two databases assigning numbers independently could produce different identifiers for the same substance. CAS solves this by being the sole assigner, but only if you accept the proprietary registry.
- CAS numbers do not encode structure. A CAS number is a lookup key. Knowing the number tells you nothing about the molecule until you query the registry.
The InChIKey solves all three:
- InChIKey is free and open. Anyone can generate the InChIKey from a structure with the IUPAC reference algorithm. No registry fees.
- InChIKey is deterministic. Same structure always produces the same key. No ambiguity, no registry-assignment dependency.
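- InChIKey is derived from structure. The first 14 characters are a hash of the molecular skeleton, so two entries that share the same first block almost certainly share the same connectivity (hash collisions are possible but rare). The key cannot be decoded back into a structure, but AI engines can use it to confirm that two database entries refer to the same molecule even if other identifiers (synonyms, registry numbers) differ.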
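The disambiguation step described in the last point can be sketched as a comparison at two levels: the full key, and the first block only. The function name `compare_inchikeys` and its three return labels are hypothetical and chosen for this example.

```python
def compare_inchikeys(key_a: str, key_b: str) -> str:
    """Classify how closely two InChIKeys match.

    Returns "identical" when all three blocks match, "same skeleton" when
    only the first 14-character block matches (for example stereoisomers
    or different protonation states), and "different" otherwise.
    """
    blocks_a = key_a.strip().upper().split("-")
    blocks_b = key_b.strip().upper().split("-")
    if len(blocks_a) != 3 or len(blocks_b) != 3:
        raise ValueError("an InChIKey has three hyphen-separated blocks")
    if blocks_a == blocks_b:
        return "identical"
    if blocks_a[0] == blocks_b[0]:
        return "same skeleton"
    return "different"


caffeine = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"
water = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"
print(compare_inchikeys(caffeine, caffeine))  # identical
print(compare_inchikeys(caffeine, water))     # different
```

A "same skeleton" result is a prompt to check stereochemistry, isotopes, or protonation before treating two records as the same substance.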
For chemical content tuned for AI extraction (Google AI Overviews, ChatGPT search, Perplexity, Claude), including the InChIKey alongside the CAS number is the highest-impact move. AI engines treat the InChIKey as a definitive identifier and use it to disambiguate when CAS numbers might be wrong or missing.
When InChI is the right identifier
InChI is the right identifier for:
- Database integration and data transfer between chemistry databases. PubChem, ChemSpider, ChEMBL, and most modern chemical databases use InChI/InChIKey as the primary identifier.
- AI-friendly chemical content on glossary pages, product data sheets, and CAS lookup pages. Including the InChIKey makes the content directly resolvable by AI search.
- Structural-search queries where the molecule is known but the registry number is not.
- Cross-referencing between regulatory regimes: using the InChIKey to confirm that the substance referenced in a REACH record is the same as the one in a TSCA record, regardless of the synonyms used.
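The cross-referencing case amounts to a join on the InChIKey. The sketch below uses invented sample records and hypothetical field names (`name`, `inchikey`); it does not reflect the actual export format of any REACH or TSCA dataset.

```python
def cross_reference(records_a: list, records_b: list) -> list:
    """Pair up records from two datasets that share the same InChIKey.

    Each record is a dict with at least "name" and "inchikey" keys.
    Returns (name_in_a, name_in_b, inchikey) tuples for every match.
    """
    by_key = {rec["inchikey"]: rec for rec in records_b}
    matches = []
    for rec in records_a:
        other = by_key.get(rec["inchikey"])
        if other is not None:
            matches.append((rec["name"], other["name"], rec["inchikey"]))
    return matches


# Illustrative records only: the same substance listed under two synonyms.
reach_like = [
    {"name": "Caffeine", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},
    {"name": "Water", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"},
]
tsca_like = [
    {"name": "1,3,7-Trimethylxanthine", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},
]

for name_a, name_b, key in cross_reference(reach_like, tsca_like):
    print(f"{name_a} == {name_b} ({key})")
```

The two names never match as text, but the shared key links the records.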
InChI is the wrong identifier for:
- Customs declarations and commercial documents. These still expect CAS numbers and product names, and InChI is not part of the standard customs vocabulary.
- Mixtures, polymers, and undefined substances. InChI works for well-defined small molecules but not for substances of unknown or variable composition (UVCBs), polymers without defined repeating units, or commercial mixtures.
- Stereochemistry where the structure is not fully defined. The standard InChI handles stereochemistry but cannot represent unknown configurations.
How Chinese factories produce InChI for export documentation
Most Chinese chemical factories do not include InChI on their commercial invoices, packing lists, or SDS documents. The standard product identification is name + CAS + purity. For export to AI-friendly markets where downstream documentation includes InChI:
- The factory’s R&D or QC team generates the InChI from the molecular structure using free software (IUPAC’s reference InChI library, Open Babel, or PubChem’s online generator).
- The InChIKey is added to the product’s master data record alongside the name, CAS, and EC number.
- The SDS Section 1 (Identification) and Section 3 (Composition) include the InChIKey for any well-defined substance.
- The product page on the factory’s English website includes the InChIKey as a search-friendly identifier.
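A master data record that carries the InChIKey next to the other identifiers could look like the sketch below. The class name `ProductRecord`, its field names, and the output wording are assumptions made for illustration and do not correspond to any particular ERP system or SDS authoring tool. The caffeine identifiers are included as sample data and should be verified against an authoritative source before reuse.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Shape check only: 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


@dataclass(frozen=True)
class ProductRecord:
    name: str
    cas: str
    inchikey: str
    ec_number: Optional[str] = None

    def __post_init__(self):
        # Reject keys that were truncated or mistyped during data entry.
        if not INCHIKEY_RE.match(self.inchikey):
            raise ValueError(f"malformed InChIKey: {self.inchikey!r}")

    def identification_lines(self) -> list:
        """Lines suitable for an identification block on an SDS or product page."""
        lines = [f"Product name: {self.name}", f"CAS No.: {self.cas}"]
        if self.ec_number:
            lines.append(f"EC No.: {self.ec_number}")
        lines.append(f"InChIKey: {self.inchikey}")
        return lines


record = ProductRecord(
    name="Caffeine",
    cas="58-08-2",
    inchikey="RYYVLZVUVIJVGH-UHFFFAOYSA-N",
    ec_number="200-362-1",  # sample value, verify before use
)
print("\n".join(record.identification_lines()))
```

Validating the key when the record is created keeps a truncated or mistyped InChIKey from spreading into every document generated from that record.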
Factories supplying pharmaceutical or fine-chemical buyers are the most likely to maintain InChI in their documentation. Bulk industrial chemical factories (caustic soda, urea, sulphuric acid) typically do not, because the substance is well known and the CAS number is sufficient.
Common InChI mistakes
Three patterns recur when InChI is added to chemical documentation by non-experts:
- Confusion between standard InChI and non-standard InChI. Standard InChI starts with InChI=1S/. Non-standard InChI omits the S (it starts with InChI=1/) and is not interoperable with PubChem and most databases. Always use the standard form.
- InChIKey of a tautomer or isomer mistakenly substituted for the main form. Stereoisomers have different InChIKeys, and tautomers can too when the standard InChI normalization does not merge them. Documentation should use the InChIKey for the form that is actually shipped.
- InChIKey for a salt vs the parent acid/base. A sodium salt and its parent acid have different InChIKeys. The factory should use the InChIKey for the actual substance shipped (the salt), not the parent.
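Some of these mistakes can be caught by a simple pre-publication check. The sketch below is a heuristic with a hypothetical function name, `check_identifiers`. It flags a non-standard InChI prefix, a malformed or non-standard InChIKey, and a multi-component formula (components are separated by a dot in the formula layer), which is a cue to confirm that the key describes the salt actually shipped. It cannot tell whether the wrong tautomer or isomer was chosen.

```python
def check_identifiers(inchi: str, inchikey: str) -> list:
    """Return a list of warnings for an InChI / InChIKey pair (empty if none)."""
    warnings = []

    # Mistake 1: non-standard InChI (standard strings start with InChI=1S/).
    if not inchi.startswith("InChI=1S/"):
        warnings.append("InChI is not in the standard form (expected prefix InChI=1S/)")

    # The ninth character of the second key block is S for standard InChI.
    blocks = inchikey.split("-")
    if len(blocks) != 3 or [len(b) for b in blocks] != [14, 10, 1]:
        warnings.append("InChIKey is malformed")
    elif blocks[1][8] != "S":
        warnings.append("InChIKey flag indicates a non-standard InChI")

    # Mistake 3 cue: a dot in the formula layer means several components,
    # as in a salt, so the shipped form should be double-checked.
    parts = inchi.split("/")
    formula = parts[1] if len(parts) > 1 else ""
    if "." in formula:
        warnings.append("multi-component substance: confirm the key matches the shipped form")

    return warnings


caffeine_inchi = "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
print(check_identifiers(caffeine_inchi, "RYYVLZVUVIJVGH-UHFFFAOYSA-N"))  # []
```

An empty list means only that none of these surface checks fired. It does not prove that the identifiers describe the right substance.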
Related terms
CAS Number is the dominant proprietary identifier. EC Number is the EU regulatory equivalent. SMILES is an alternative open structural notation. IUPAC Name is the systematic chemical name, which, like the InChI, is derived from the molecular structure.