Examples
Example 0 - Basics: Tadah!MLIP
tadah is a command-line tool designed to simplify training and evaluation of machine learning models. It provides a user-friendly interface for managing configurations, datasets, and training processes. tadah is the main entry point for subcommands that perform specific tasks within your machine learning workflow.
In this example, we demonstrate how to:
Access help information using the tadah binary.
Use the explain subcommand to understand the purpose of different subcommands and their options.
Before getting started, ensure that Tadah! is installed and accessible through your $PATH. If it is not, either add it to your $PATH or use the full path to the Tadah! binary in your commands.
Note
If you need installation instructions or system prerequisites, refer to Installation.
Accessing Tadah! Help
The tadah binary provides a help command to display information about available subcommands and their options. This is especially useful for first-time users or those needing a refresher on Tadah!’s capabilities.
tadah --help
or:
tadah -h
These commands will print a list of subcommands and their short descriptions.
Subcommand Hierarchy
Tadah! supports multiple levels of subcommands for better organization and clarity:
Top-level subcommands: These include commands like train or data.
Second-level subcommands: Certain commands (like data) have subcommands of their own (e.g., split, balance).
For instance, tadah data split is a second-level subcommand: data is the top-level subcommand and split is one of its data-related operations.
Subcommand Help
For help with a specific subcommand, such as train:
tadah train --help
Tadah! explain command
Tadah! offers an explain subcommand to help you understand the purpose of each subcommand and its options. The explain subcommand takes a single argument:
The argument represents a query about either a particular (sub)command or an option.
It prints an extended description, including usage, purpose, and any additional helpful information.
To specify which subcommand or option to explain, use a dot-separated format: dots separate the subcommand levels and, if present, the option of interest (written without its leading dashes).
For example, to learn about the train command’s --task option:
tadah explain train.task
For a second-level subcommand, such as tadah data split:
tadah explain data.split
And to learn more about the --dbfile option of tadah data split:
tadah explain data.split.dbfile
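The dot-separated format can be illustrated with a small helper that assembles the query string from a subcommand path and an optional flag. This is only a sketch: explain_query is a hypothetical name and is not part of Tadah!. It assumes, as the three examples above suggest, that the leading dashes of an option are dropped.

```python
def explain_query(subcommands, option=None):
    """Build the dot-separated argument for `tadah explain`.

    Hypothetical helper for illustration only; not part of Tadah!.
    Assumes the option is written without its leading dashes.
    """
    parts = list(subcommands)
    if option is not None:
        parts.append(option.lstrip("-"))
    return ".".join(parts)


print(explain_query(["train"], "--task"))            # train.task
print(explain_query(["data", "split"]))              # data.split
print(explain_query(["data", "split"], "--dbfile"))  # data.split.dbfile
```

The resulting string is what would be passed as the single argument to tadah explain.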
Runtime Errors
Tadah! reports descriptive runtime errors, for example when a configuration file is missing a required key. While we strive for reliability, no code is entirely bug-free. If you encounter any issues, please report them on Tadah!’s GitLab.
Example 1 - Basics: Training and Prediction
This example demonstrates how to train a model using Tadah!MLIP. The config.train file defines
the necessary parameters for the model.
To use this example, ensure you have access to all required files available in the Git repository at: https://git.ecdf.ed.ac.uk/tadah/tadah.mlip/-/tree/main/examples/CLI/ex_1
Training
To train the model, use the following command:
tadah train -c config.train -v 2
-v 2: Enables verbosity for detailed output during the training process.
Prediction
Once the model is trained, you can run predictions on any Tadah!MLIP dataset. Here we simply reuse the training dataset for demonstration purposes.
tadah predict -p pot.tadah -d tdata.db -v 2 -A
-A: Generates useful analytics such as RMSE and R^2.
-v 2: Enables verbosity for detailed output during prediction.
Additional Notes
The dataset contains over 10,000 structures, so the process may take a while when using the -F or -S flags to train or predict with forces and stresses, respectively.
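For readers unfamiliar with the analytics printed by -A, the sketch below computes RMSE and R^2 using their textbook definitions. It is an illustration only: the function names and the sample numbers are made up, and Tadah! may normalise or report these quantities differently (for example per atom).

```python
import math


def rmse(reference, predicted):
    """Root-mean-square error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(reference, predicted)) / n)


def r_squared(reference, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# Made-up numbers purely for illustration (not Tadah! output).
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print(f"RMSE = {rmse(reference, predicted):.3f}")      # RMSE = 0.158
print(f"R^2  = {r_squared(reference, predicted):.2f}")  # R^2  = 0.98
```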
Example 2 - Basics: Nested Fitting Procedure
This example shows how to optimise a model’s hyper-parameters (HPs) with Tadah!MLIP. The workflow is deliberately simple and is intended only as a minimal, self-contained demonstration of the nested fitting (also called hyper-parameter optimisation, HPO) capability.
Scope
In this tutorial we optimise the HPs only against a validation data set. Doing so will generally lead to some degree of over-fitting to that validation set. How to add physics-based constraints or extra test sets to mitigate over-fitting is covered in a separate example.
Files provided
config.train – training-stage settings for the underlying potential.
# For a description of KEYS and corresponding values, see the documentation:
# https://tadah.readthedocs.io/
DBFILE tdata.db # Training dataset
INIT2B true # Use two-body descriptor
MODEL M_BLR BF_Linear # Use linear model
TYPE2B D2_Blip 4 4 Ta Ta # Use D2_Blip descriptor
RCTYPE2B Cut_Cos # Cutoff function for two-body descriptor
RCUT2B 5.3 # Cutoff distance
SGRID2B GEOM 4 0.1 1.0 # Automatic generation blips for SGRID
CGRID2B LIN 4 1.0 5.3 # Automatic generation blips for CGRID
LAMBDA 1e-7 # Control regularization parameter
OALGO 3 # Select optimization algorithm
BIAS true # Append constant to descriptor
config.val – list of configurations that form the validation data set.
# The validation datasets are used to obtain
# MAE, RMSE and R^2 on energies or forces.
# Specify one or more validation datasets with DBFILE keyword
# Path can be either absolute or relative to the program current
# working directory.
DBFILE vdata.db
config.hpo – instructions that tell Tadah!MLIP how to perform HPO (choice of optimiser, loss function, search space, logging policy, …).
# 1. Define the optimizer (see the documentation for the full list).
# Here we use dlib’s Global Function Search (also known as MaxLIPO+TR).
OPTIMIZER
LIB DLIB
ALGO GFS
MAXEVAL 2000 # maximum number of iterations
FTOL_ABS 1e-8 # numerical convergence threshold for local minima
ENDOPTIMIZER
# 2. Use the L2-norm loss function.
LOSS L2
# 3. Basic performance constraint: include the energy RMSE in the loss function with a target of 0 and a weight of 1.0.
# The RMSE is measured on the validation dataset defined in config.val.
PC_ERMSE 0 1.0
# 4. Define search-space constraints.
OPTIM CGRID2B (1-4) 0.0 5.5
OPTIM SGRID2B (1-4) 0.1 1.0
# 5. Output-control parameters. hpo always saves the current best model to `best_pot.tadah`.
# Log files are written as (best_)loss.tadah, (best_)params.tadah, and (best_)outvar.tadah.
OUTPUT 10 # write to logs every 10th iteration, regardless of the loss value
BOUTPUT 1 # always write to the best_ logs
# 6. Directories and dump rate for archiving intermediate potentials.
DUMP 100 newPotsDir # dump pot.tadah_i every 100th potential, regardless of the loss
BDUMP 1 newBestPotsDir # archive a potential whenever it improves on the previous best
Running the example
Invoke the nested fitting run with:
tadah hpo --config config.train --validation config.val --hpotarget config.hpo
The program iterates until it reaches the maximum number of evaluations set in
config.hpo or until you stop it manually (e.g. with <Ctrl-C>).
Whenever it finds a parameter set that improves upon the current best loss,
it writes the corresponding potential to best_pot.tadah.
Output files
The optimiser produces three main log files, updated at the interval set by OUTPUT in config.hpo:
loss.tadah – value of each individual loss term and the total loss.
params.tadah – hyper-parameter vector for the current iteration.
outvar.tadah – any additional output variables (by default each loss term).
For convenience, each of these files has a best counterpart that is updated only when a new optimum is reached:
best_loss.tadah
best_params.tadah
best_outvar.tadah
These “best_” logs allow you to track the progress of the optimisation without having to parse the full history.
Example 3 - DM_mBlipEnv: Train and Predict
This example exercises the DM_mBlipEnv many-body descriptor.
What the descriptor does
DM_mBlipEnv mixes two different basis families:
The atomic density rho is built from a Gaussian radial basis - Gaussian-type orbitals exp(-eta*(r - r_s)^2) summed over the neighbours of each atom. CGRIDMB sets the centre r_s and SGRIDMB the inverse width eta.
The embedding function F(rho) is built from blip (cubic B-spline) basis functions in rho-space, each multiplied by an inverse-cutoff envelope (Cut_SinInv) of width RCUTENV. CEMBFUNC / SEMBFUNC set the blip centres and widths.
So the descriptor is a Gaussian density fed through a blip embedding.
The blip embedding is a flexible spline basis: with enough blips it can
represent essentially any EAM-style embedding function F(rho).
The envelope enforces the physical EAM boundary condition
F(rho) -> 0 as rho -> 0 - an atom with no neighbours contributes
no embedding energy - by smoothly ramping F down over the
low-density region rho < RCUTENV.
Restriction on L
The L integer in TYPEMB is the maximum orbital angular momentum.
DM_mBlipEnv is rotationally invariant only for L = 0: its
embedding function is not F_SQ (rho^2), so for L > 0 the
embedding is applied to individual orbital densities before the
invariance-restoring quadratic sum and the descriptor is not
rotationally invariant (the code emits a runtime WARNING). Lmax
is also hard-capped at 6. This example therefore uses L = 0. For a
rotationally invariant L > 0 many-body descriptor use the
DM_PS_* family instead.
Files provided
config.train - training settings using DM_mBlipEnv.
tdata.db - 50 thermally displaced (MD) snapshots of bcc Ta, each a 54-atom 3x3x3 conventional supercell (lattice constant a0 ~ 3.32 Angstrom). The set is intentionally small so the example runs in seconds.
The TYPEMB line
TYPEMB DM_mBlipEnv 0 1 1 5 5 Ta Ta - the integers, in order, are:
0 - L, the maximum orbital angular momentum (see above);
1 - size(CGRIDMB), number of radial Gaussian centres;
1 - size(SGRIDMB), number of radial Gaussian widths;
5 - size(CEMBFUNC), number of embedding blip centres;
5 - size(SEMBFUNC), number of embedding blip widths;
followed by the element pair the descriptor block applies to (Ta Ta).
CGRIDMB / SGRIDMB must have size 1 - the dimension-expanding blip
embedding needs a single radial basis function; the descriptor output
dimension comes from CEMBFUNC / SEMBFUNC instead.
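The field order described above can be made concrete with a short parser that labels each token of the TYPEMB line and checks the size-1 restriction on the radial basis. This is a sketch for illustration: parse_typemb and its field names are hypothetical, and it is not how Tadah! itself reads configuration files.

```python
def parse_typemb(line):
    """Label the tokens of a DM_mBlipEnv TYPEMB line.

    Hypothetical helper; field names are chosen here for readability only.
    """
    tokens = line.split()
    if len(tokens) != 9 or tokens[0] != "TYPEMB":
        raise ValueError("expected: TYPEMB <descriptor> <5 integers> <2 elements>")
    names = ["L", "n_cgridmb", "n_sgridmb", "n_cembfunc", "n_sembfunc"]
    fields = {"descriptor": tokens[1]}
    fields.update(zip(names, (int(t) for t in tokens[2:7])))
    fields["elements"] = tuple(tokens[7:9])
    return fields


fields = parse_typemb("TYPEMB DM_mBlipEnv 0 1 1 5 5 Ta Ta")

# The radial basis must contain exactly one function (see the text above).
assert fields["n_cgridmb"] == 1 and fields["n_sgridmb"] == 1

for key, value in fields.items():
    print(f"{key:12s} {value}")
```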
Density range and WATOM
With the radial basis in config.train (CGRIDMB = 2.88,
SGRIDMB = 8.0, RCUTMB = 5.0, Cut_Cos) the raw per-atom
density of this dataset has a median of ~3.24. The WATOM key applies
a per-species weight to every neighbour’s contribution to rho;
WATOM = 0.309 (i.e. 1/3.24) rescales the density so it is ~1.
After scaling the per-atom density lies in rho ~ [0.89, 1.10] with
median ~1.00.
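The WATOM value can be reproduced with a few lines of arithmetic. The sketch below assumes the density scales linearly with the weight, which follows from the description above for a single-species system; the raw band it prints is back-calculated from the scaled band and is not a measured quantity.

```python
raw_median = 3.24  # median raw per-atom density quoted above

# Weight that brings the median density to about 1, rounded as in the text.
watom = round(1.0 / raw_median, 3)
scaled_median = raw_median * watom

# Assumption: rho scales linearly with WATOM, so the raw band implied by the
# scaled band [0.89, 1.10] is obtained by dividing by the weight.
scaled_band = (0.89, 1.10)
raw_band = tuple(rho / watom for rho in scaled_band)

print(f"WATOM          = {watom}")               # WATOM          = 0.309
print(f"scaled median  = {scaled_median:.3f}")   # scaled median  = 1.001
print(f"implied raw band = [{raw_band[0]:.2f}, {raw_band[1]:.2f}]")
```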
Choosing the blip basis (CEMBFUNC / SEMBFUNC)
The blips must give an embedding F(rho) that is finite for every
rho > 0 and goes to zero only at rho = 0. Two facts drive the
choice:
A blip contributes nothing outside its rho-support mu +/- 2/eta. If no blip has support at a given rho, the embedding is identically zero there - a flat dead zone.
A linear fit can only constrain a blip whose support overlaps the data band (rho ~ 0.89-1.10). A blip with no data overlap gets a zero weight from the regulariser, so adding narrow low-rho blips does not remove the dead zone - they are simply zeroed.
The fix is to make the blips wide: SEMBFUNC = 2.0 gives each
blip a support of width 4/eta = 2.0. Every blip in
CEMBFUNC = 0.5 0.75 1.0 1.25 1.5 then overlaps the data band (so its
weight is fit from data) and the lower blips reach all the way down to
rho = 0. The fitted F(rho) is consequently finite and non-zero
for all rho > 0.
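The support argument can be checked numerically. The sketch below uses only the support rule mu +/- 2/eta and the values quoted above; the helper names are hypothetical, and the narrow blip (centre 0.3, eta 20) is an invented counter-example, not a setting from config.train.

```python
def blip_support(mu, eta):
    """Interval [mu - 2/eta, mu + 2/eta] outside which the blip is zero."""
    half_width = 2.0 / eta
    return (mu - half_width, mu + half_width)


def overlaps(interval, band):
    """True if two open intervals share at least one point."""
    return interval[0] < band[1] and interval[1] > band[0]


data_band = (0.89, 1.10)               # scaled density range of the dataset
centres = [0.5, 0.75, 1.0, 1.25, 1.5]  # CEMBFUNC
eta = 2.0                              # SEMBFUNC

supports = [blip_support(mu, eta) for mu in centres]
for mu, (lo, hi) in zip(centres, supports):
    print(f"mu = {mu:4.2f}  support = [{lo:5.2f}, {hi:4.2f}]  "
          f"overlaps data: {overlaps((lo, hi), data_band)}")

# Every wide blip is constrained by data, and the lowest support reaches rho = 0.
all_overlap = all(overlaps(s, data_band) for s in supports)
reaches_zero = min(lo for lo, _ in supports) <= 0.0

# Invented counter-example: a narrow low-rho blip never sees the data band.
narrow = blip_support(0.3, 20.0)
narrow_overlaps = overlaps(narrow, data_band)

print(all_overlap, reaches_zero, narrow_overlaps)  # True True False
```

With eta = 2.0 the half-width is 1.0, so the five supports run from [-0.5, 1.5] to [0.5, 2.5] and jointly cover every rho between 0 and the data band.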
These wide blips give a non-zero embedding at rho = 0 itself, and
RCUTENV = 0.6 is exactly what cancels it: the Cut_SinInv envelope
ramps from 0 at rho = 0 up to 1 at rho = RCUTENV, smoothly
pulling F -> 0 as rho -> 0. RCUTENV = 0.6 lies below the data
band, so the envelope is saturated to 1 across rho ~ 0.89-1.10 and
does not reshape the fit where training data is present.
Training
tadah train -c config.train -v 2
Training should finish by writing the potential file pot.tadah to
the working directory (a 5-column design matrix, one column per blip).
Prediction
To verify the round-trip, predict back on the training set:
tadah predict -p pot.tadah -d tdata.db -v 2 -A
-A prints RMSE / R^2 analytics.
Expected results
The dataset is a near-uniform set of thermal bcc-Ta snapshots, so the total energy spread is tiny (standard deviation ~7.3 meV/atom). On this set the configuration above gives roughly:
Energy R^2 ~ 0.70
Energy RMSE ~ 4.0 meV/atom
i.e. the descriptor captures most of the (small) energy variance. This is a genuine, working fit - the descriptor block is non-zero and informative.
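The two figures above are mutually consistent, which a short calculation makes visible. The sketch assumes the textbook relation R^2 = 1 - MSE / variance, with the RMSE and the standard deviation both taken per atom over the same set of structures; whether Tadah! uses exactly this normalisation is an assumption.

```python
sigma = 7.3  # standard deviation of the reference energies, meV/atom
rmse = 4.0   # energy RMSE quoted above, meV/atom

# Assumed relation: R^2 = 1 - (RMSE / sigma)^2
r2 = 1.0 - (rmse / sigma) ** 2
unexplained = (rmse / sigma) ** 2

print(f"R^2 ~ {r2:.2f}")                               # R^2 ~ 0.70
print(f"unexplained variance ~ {unexplained:.0%}")     # unexplained variance ~ 30%
```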
A caveat on the embedding tails: this dataset only samples rho ~ 1
(dense bcc environments). The fitted F(rho) is finite and non-zero
for all rho > 0 - there is no dead zone - but its shape away from
rho ~ 1 is set by the wide-blip basis and the regulariser, not by
data. The values of F at low and high rho are therefore a smooth
extrapolation, not a physically constrained embedding; pinning those
regions would need training structures that actually sample low- and
high-density environments (surfaces, vacancies, dimers, compression).
For high-accuracy fits on this dataset the two-body D2_Blip baseline
in ex_1 remains a strong choice.