Training
This page covers the tadah train command in depth. For a one-shot
introduction see Training in the quickstart; for the full
key-by-key reference see Configuration File.
Command
tadah train -c <config.train> [--force] [--stress] [--uncertainty] \
[--outfile <pot.tadah>] [--verbose <0|1|2>]
The required input is a single config file
(--config / -c) that declares the dataset(s) via DBFILE,
the descriptor(s), and the regression model. Every CLI flag maps
one-to-one to an upper-case configuration key (e.g. --force ↔
FORCE true); CLI flags override file values.
| Flag | Config key | Effect |
|---|---|---|
| --config, -c | | Training configuration file. |
| --outfile | OUTFILE | Trained-model output file (default pot.tadah). |
| --force | FORCE | Train on forces in addition to energies (default off). |
| --stress | STRESS | Train on stresses in addition to energies (default off). |
| --uncertainty | UNCERTAINTY | Compute and write per-weight uncertainty (default off). |
| --verbose, -v | VERBOSE | Log level: 0 ERROR, 1 WARNING, 2 INFO (default 1). |
Training data must be supplied as one or more Tadah! database files via
DBFILE in the configuration; see Dataset Format for the on-disk format.
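The override rule (CLI flags win over file values) can be pictured with a short Python sketch. This is not Tadah! source code: the function name resolve_settings and the dictionary representation of the settings are assumptions made purely for illustration.

```python
def resolve_settings(file_settings, cli_flags):
    """Merge CLI flags over config-file values; the CLI wins on conflict.

    file_settings: upper-case keys as read from the config file.
    cli_flags: long flag names without dashes; boolean switches map to True.
    (Both representations are hypothetical.)
    """
    resolved = dict(file_settings)
    for flag, value in cli_flags.items():
        key = flag.upper()  # --force <-> FORCE
        resolved[key] = "true" if value is True else str(value)
    return resolved


file_settings = {"FORCE": "false", "STRESS": "false", "VERBOSE": "1"}
resolved = resolve_settings(file_settings, {"force": True, "verbose": 2})
```

Here FORCE and VERBOSE take the CLI values, while STRESS keeps the value from the file because no flag touched it.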
Pipeline
For each tadah train invocation the engine runs the following stages
in order:
1. Load the structures declared by every DBFILE key into memory.
2. Apply load-time data transforms (LSCALE, ESHIFT*, EFILTER/FFILTER, EWEIGHT_TEMP, WDBFILE/WDBFILE_AUTO, ZERO_COM_FORCE); see Load-time data transforms.
3. Build the neighbour list out to RCUTMAX.
4. Run the pre-flight audit (AUDIT warn by default, error to refuse to launch on findings), which checks dataset statistics and per-descriptor metadata for ρ-collapse and out-of-range parameter combinations.
5. Fit the regression model declared by MODEL.
6. Write pot.tadah (or whatever --outfile requests) and the optional uncertainty file.
Load-time data transforms
The transforms below reshape the training set before any neighbour
finding, descriptor calculation, or fitting. They run inside
apply_load_transforms() as soon as the datasets are loaded and are
shared across tadah train, tadah predict, and tadah hpo.
Two of them — LSCALE and the resolved ESHIFT — are stamped into
pot.tadah so tadah predict round-trips them automatically; the
rest are training-time only. For the per-key CLI form and defaults see
Configuration File and tadah explain <key> (e.g. tadah explain
train.lscale).
Order of application
The transforms run in a fixed order, and the order matters because several feed the next:
1. Outlier filter (EFILTER/FFILTER): runs first, so an extreme outlier cannot poison the later least-squares and reweighting steps.
2. Length rescale (LSCALE).
3. ESHIFT: determine the per-element shifts, then subtract them.
4. Boltzmann reweight (EWEIGHT_TEMP): runs after ESHIFT, so the Boltzmann factor sees the shifted, user-meaningful energies.
5. COM-force zeroing (ZERO_COM_FORCE).
6. Per-dataset weights: WDBFILE composed multiplicatively with WDBFILE_AUTO.
7. Equilibrium-volume diagnostic: logs the lowest post-transform E/N configuration as a reference-state sanity check.
Key summary
| Key | One-line summary |
|---|---|
| LSCALE | Uniform length rescale (DFT → experimental scale). Persisted to pot.tadah. |
| ESHIFT | Explicit per-element energy shift; one value per element ordered by Z. |
| ESHIFT_ATOM | Derive ESHIFT from isolated-atom configurations in the dataset. |
| ESHIFT_DBATOM | Derive ESHIFT by least-squares atomic-energy fit over the database. |
| EFILTER | Drop configurations whose per-atom energy is outside [lo, hi]. |
| FFILTER | Drop configurations with any single force component above thr. |
| WDBFILE | Per-dataset weight multipliers; one per DBFILE. |
| WDBFILE_AUTO | Automatic size balancing: per-config weight × N_i^(−α). |
| EWEIGHT_TEMP | Boltzmann reweighting at temperature T (Kelvin). |
| ZERO_COM_FORCE | Subtract the mean force per configuration. |
LSCALE: uniform length rescale
LSCALE s multiplies every position and cell vector by s. To
keep the total energy invariant under the change of length scale,
forces are divided by s; stresses are stored as a virial (energy
units) and are left untouched. s = 1 (the default) is a no-op and
s <= 0 is a configuration error.
The intended use is matching a DFT lattice constant to experiment: set
LSCALE = a_exp / a_DFT so the trained model’s equilibrium volume
sits at the experimental scale.
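A minimal Python sketch of the rescale described above, assuming positions, cell vectors, and forces are stored as plain nested lists. The helper names and the lattice constants are illustrative assumptions, not Tadah! internals.

```python
def lscale_from_lattice(a_exp, a_dft):
    """LSCALE = a_exp / a_DFT, as stated in the text."""
    return a_exp / a_dft


def apply_lscale(positions, cell, forces, s):
    """Multiply lengths by s and divide forces by s; stresses are not touched."""
    if s <= 0:
        raise ValueError("LSCALE must be > 0")
    scaled_positions = [[x * s for x in row] for row in positions]
    scaled_cell = [[x * s for x in row] for row in cell]
    scaled_forces = [[f / s for f in row] for row in forces]
    return scaled_positions, scaled_cell, scaled_forces


def dot_sum(a, b):
    """Sum of row-wise dot products, used here as an invariance check."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Illustrative numbers only: a DFT lattice constant of 3.32 mapped onto 3.30.
s = lscale_from_lattice(a_exp=3.30, a_dft=3.32)
positions = [[0.0, 0.0, 0.0], [1.66, 1.66, 1.66]]
cell = [[3.32, 0.0, 0.0], [0.0, 3.32, 0.0], [0.0, 0.0, 3.32]]
forces = [[0.05, -0.02, 0.0], [-0.05, 0.02, 0.0]]

new_positions, new_cell, new_forces = apply_lscale(positions, cell, forces, s)
```

Because lengths pick up a factor s and forces a factor 1/s, any force-times-length product (and hence the energy change along a displacement) is the same before and after the rescale; dot_sum(forces, positions) and dot_sum(new_forces, new_positions) agree to rounding error.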
LSCALE is stamped into pot.tadah. tadah predict re-applies
it so a raw DFT dataset round-trips; pass --no-lscale to ignore the
stored value when the dataset is already at the trained scale. The
LAMMPS pair_style tadah does not re-apply LSCALE — supply
LAMMPS with positions already at the trained scale.
ESHIFT: per-element energy shift
For every configuration ESHIFT subtracts a per-element baseline:
E_after = E_before − Σ_Z N_Z · ESHIFT[Z]
ESHIFT[Z] is the per-atom reference energy of species Z to be
removed — feeding the isolated-atom energy turns total energies into
cohesive energies. There are three mutually exclusive ways to obtain
the values; if more than one is set the engine warns and applies the
precedence explicit > ATOM > DBATOM:
- Explicit: ESHIFT a b …, one value per unique species, ordered by Z.
- ESHIFT_ATOM: the mean energy of the single-atom (natoms == 1) configurations of each species. A species with no isolated-atom config gets 0 and a warning; a spread above 1 meV across same-species configs is logged.
- ESHIFT_DBATOM: a least-squares atomic-energy fit over the whole database (min ‖y − Mβ‖², with M[i,k] the count of species k in config i). Robust when the dataset has compositional diversity but no isolated-atom configs.
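The shift formula and the ESHIFT_ATOM rule can be sketched in a few lines of Python. The data layout (a dict with counts and energy per configuration) and the function names are assumptions for illustration; the energies are made-up numbers.

```python
from collections import defaultdict


def eshift_from_isolated_atoms(configs):
    """Mean energy of the single-atom configurations of each species."""
    samples = defaultdict(list)
    for cfg in configs:
        if sum(cfg["counts"].values()) == 1:
            (species,) = [z for z, n in cfg["counts"].items() if n == 1]
            samples[species].append(cfg["energy"])
    return {z: sum(e) / len(e) for z, e in samples.items()}


def apply_eshift(configs, eshift):
    """E_after = E_before - sum_Z N_Z * ESHIFT[Z]; missing species shift by 0."""
    return [
        cfg["energy"] - sum(n * eshift.get(z, 0.0) for z, n in cfg["counts"].items())
        for cfg in configs
    ]


configs = [
    {"counts": {"Ta": 1}, "energy": -11.80},  # isolated atom
    {"counts": {"Ta": 1}, "energy": -11.82},  # isolated atom
    {"counts": {"Ta": 2}, "energy": -40.00},  # dimer
]
eshift = eshift_from_isolated_atoms(configs)
shifted = apply_eshift(configs, eshift)
```

With these numbers the Ta shift is the mean of the two isolated-atom energies, and the dimer energy becomes the total energy minus twice that mean, that is, a cohesive energy.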
Whichever path is used, the resolved per-element values are stamped
into pot.tadah (the ESHIFT_ATOM / ESHIFT_DBATOM toggles are
not). tadah predict re-applies them; --no-eshift ignores the
stored values.
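For the ESHIFT_DBATOM path, the least-squares problem min ‖y − Mβ‖² can be illustrated with a dependency-free Python sketch. It solves the normal equations MᵀM β = Mᵀy by Gaussian elimination, which is a choice made here for brevity; the solver the engine actually uses is not specified in this page, and the function names and data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough compositional diversity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def eshift_dbatom(counts, energies):
    """counts[i][k]: atoms of species k in config i; energies[i]: total energy."""
    n_species = len(counts[0])
    mtm = [
        [sum(row[j] * row[k] for row in counts) for k in range(n_species)]
        for j in range(n_species)
    ]
    mty = [
        sum(row[j] * e for row, e in zip(counts, energies))
        for j in range(n_species)
    ]
    return solve_linear(mtm, mty)


# Synthetic two-species data built from per-atom energies of -2 and -3.
counts = [[2, 0], [1, 1], [0, 2], [3, 1]]
energies = [-4.0, -5.0, -6.0, -9.0]
beta = eshift_dbatom(counts, energies)
```

The synthetic energies are exactly linear in the species counts, so the fit recovers the per-atom values used to build them. If every configuration had the same composition ratio, MᵀM would be singular, which is why the text stresses compositional diversity.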
Filtering and weighting
- EFILTER lo hi drops any configuration whose per-atom energy E/N is outside [lo, hi]; FFILTER thr drops any configuration with a single-atom force magnitude above thr. Both run first and are training-time only.
- WDBFILE w_1 … w_n multiplies the eweight/fweight/sweight of every config in dataset i by w_i, with one value per DBFILE.
- WDBFILE_AUTO α multiplies dataset i’s per-config weight by N_i^(−α), where N_i is its post-filter config count. α = 0 disables it, α = 0.5 is a soft balance, α = 1 makes every dataset contribute equal aggregate loss. It composes multiplicatively with WDBFILE; α outside [0, 1] is rejected.
- EWEIGHT_TEMP T multiplies each eweight by exp(−(E/N − E_min)/(k_B·T)) (E_min the minimum per-atom energy, T in Kelvin), biasing the loss toward low-energy configs. Applied after ESHIFT; T <= 0 is rejected.
- ZERO_COM_FORCE subtracts the mean force of each configuration so its net force is exactly zero; this is useful against residual translational forces left by incomplete DFT relaxation.
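As a rough illustration of two of the per-configuration operations above, the following Python sketch implements an EFILTER-style energy window and ZERO_COM_FORCE-style mean-force subtraction. The data layout and function names are assumptions, and the bounds are treated as inclusive, following the [lo, hi] notation.

```python
def efilter(configs, lo, hi):
    """Keep configurations whose per-atom energy E/N lies inside [lo, hi]."""
    return [c for c in configs if lo <= c["energy"] / len(c["forces"]) <= hi]


def zero_com_force(forces):
    """Subtract the mean force so the net force on the configuration vanishes."""
    n = len(forces)
    mean = [sum(f[d] for f in forces) / n for d in range(3)]
    return [[f[d] - mean[d] for d in range(3)] for f in forces]


configs = [
    {"energy": -20.0, "forces": [[0.3, 0.0, -0.1], [-0.1, 0.2, 0.1]]},  # E/N = -10
    {"energy": 50.0, "forces": [[0.0, 0.0, 0.0]]},                      # E/N = +50
]
kept = efilter(configs, lo=-15.0, hi=-5.0)
zeroed = zero_com_force(kept[0]["forces"])
```

Only the first configuration survives the window, and after the subtraction its force components sum to zero along each axis up to rounding error.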
These transforms are training-time only: they are not stamped into
pot.tadah and not re-applied at predict time.
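The two weighting formulas can be checked numerically with a short Python sketch. The function names are hypothetical, and the Boltzmann constant is given in eV/K on the assumption that energies are in eV (this page does not state the unit).

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K (assumes eV energies)


def wdbfile_auto_factor(n_configs, alpha):
    """Per-config weight multiplier N_i^(-alpha) for a dataset of n_configs."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return n_configs ** (-alpha)


def boltzmann_factors(per_atom_energies, temperature):
    """exp(-(E/N - E_min) / (k_B * T)) for each configuration."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    e_min = min(per_atom_energies)
    return [
        math.exp(-(e - e_min) / (K_B_EV_PER_K * temperature))
        for e in per_atom_energies
    ]


# Aggregate weight of a dataset = number of configs x per-config factor.
sizes = [100, 10000]
aggregate_full = [n * wdbfile_auto_factor(n, 1.0) for n in sizes]
aggregate_soft = [n * wdbfile_auto_factor(n, 0.5) for n in sizes]

factors = boltzmann_factors([-10.00, -9.95, -9.80], temperature=1000.0)
```

With α = 1 both datasets end up with the same aggregate weight despite a hundredfold size difference, while α = 0.5 reduces the imbalance from 100:1 to 10:1. The lowest-energy configuration always receives a Boltzmann factor of exactly 1, and higher-energy ones are damped.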
MPI training caveat
The load transforms need a global view of the dataset that is not yet
wired through the MPI TrainerWorker. Under an MPI training build
the engine fails fast if LSCALE, ESHIFT (explicit or derived
via ESHIFT_ATOM / ESHIFT_DBATOM), EWEIGHT_TEMP, or
WDBFILE_AUTO is set — run the training without MPI to use them.
Note
The MPI training build is broken in the 1.3.0-beta.1 release (see Installation); this caveat is documented for forward reference only.
Output files
tadah train writes its results into the current working
directory (i.e. wherever you invoked the command).
| File | Contents |
|---|---|
| pot.tadah | The trained interatomic potential. Path overridable via --outfile. |
| weights.tadah | Per-weight mean and uncertainty, written only when --uncertainty is requested. |
pot.tadah format
The potential file is plain ASCII. Each line is one KEY VALUE[…]
pair using the same syntax as the training configuration file
(whitespace separates fields; comments and blank lines are not
required). The file is round-tripped through tadah predict and
pair_style tadah — every field needed to reproduce the prediction
is stored here.
A minimal single-species, two-body, kernel-ridge example:
ATOM Ta
BIAS false
DIMER false 0 false
EWEIGHT 1.0
FWEIGHT 1.0
INIT2B true
INITMB false
LAMBDA 0
MODEL M_KRR Kern_Linear 1
NORM false
OALGO 1
RCTYPE2B Cut_Dummy
RCUT2B 6.5
SWEIGHT 1.0
TYPE2B D2_LJ Ta Ta
WATOM 73
WEIGHTS 14.20421334... 2604.39696519...
Three categories of keys appear:
- Configuration echo: every key that fully specifies the model geometry: descriptor type and parameters (TYPE2B, RCUT2B, RCTYPE2B, INIT2B/INITMB, …), model choice and hyperparameters (MODEL, LAMBDA, NORM, BIAS, …), and the species list (ATOM/WATOM).
- Fitted parameters: the row WEIGHTS … carries the learned coefficients \(\mathbf w\). The ordering matches the column layout of the design matrix and is reproduced exactly on load.
- Stamped load-time transforms: when the training run resolved LSCALE (≠ 1) or ESHIFT (explicit or via ESHIFT_ATOM/ESHIFT_DBATOM), the resolved values are written so prediction re-applies them automatically. The *_ATOM/*_DBATOM toggles themselves are not stamped (only their resolved output).
Runtime-only knobs (VERBOSE, NUMERIC, OUTFILE, AUDIT,
HEALTH_LOG) are stripped before write — leaving them in the file
would trigger “Invalid key” errors in the LAMMPS-side parser.
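Since the file is one KEY VALUE[…] pair per line, it can be inspected with a few lines of Python. This is a reading aid under stated assumptions, not the parser used by tadah predict or pair_style tadah: the function name is hypothetical, and keeping repeated keys as a list of rows is a defensive choice rather than a documented property of the format.

```python
def parse_pot(text):
    """Return {KEY: [row, ...]} where each row is the list of value fields."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        entries.setdefault(fields[0], []).append(fields[1:])
    return entries


# A subset of the example above (the elided WEIGHTS row is left out).
example = """\
ATOM Ta
LAMBDA 0
MODEL M_KRR Kern_Linear 1
RCUT2B 6.5
TYPE2B D2_LJ Ta Ta
WATOM 73
"""
pot = parse_pot(example)
rcut = float(pot["RCUT2B"][0][0])
```

All values come back as strings; converting them (as done for rcut) is left to the caller because the expected type depends on the key.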
weights.tadah format
Written only when --uncertainty is requested. Three
whitespace-separated columns with a single header line:
Index Mean weight Uncertainty
0 1.420421e+01 3.117841e-02
1 2.604397e+03 4.882103e+00
…
Index matches the ordering of the WEIGHTS row in
pot.tadah. --uncertainty is not supported by the MPI training
build.
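A small Python sketch for reading this file, assuming exactly the layout shown above (one header line, then three whitespace-separated columns). The function name is hypothetical. Note that the header contains a space inside "Mean weight", so it should be skipped rather than split into column names.

```python
def parse_weights(text):
    """Return a list of (index, mean, uncertainty) tuples."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the single header line
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        rows.append((int(fields[0]), float(fields[1]), float(fields[2])))
    return rows


sample = """\
Index Mean weight Uncertainty
0 1.420421e+01 3.117841e-02
1 2.604397e+03 4.882103e+00
"""
rows = parse_weights(sample)
relative = [unc / abs(mean) for _, mean, unc in rows]
```

The relative list gives the uncertainty as a fraction of each weight, which is often easier to compare across weights of very different magnitude.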
Diagnostic output (verbose 2)
With --verbose 2 (INFO level) the engine logs progress to stdout:
load-time transform summaries, neighbour-finding wall-time, the
pre-flight audit report, and the regression solver status. None of
these lines are written to files; capture them with shell redirection
if needed (tadah train -c config.train -v 2 2>&1 | tee train.log).
See also
- Training: quickstart introduction.
- Configuration File: every configuration key, with defaults.
- Prediction: using a pot.tadah with tadah predict.
- Nested Fitting: driving training from an outer hyperparameter optimiser.