Examples

Example 0 - Basics: Tadah!MLIP

tadah is a command-line tool designed to simplify training and evaluation of machine learning models. It provides a user-friendly interface for managing configurations, datasets, and training processes. tadah is the main entry point for subcommands that perform specific tasks within your machine learning workflow.

In this example, we demonstrate how to:

  • Access help information using the tadah binary.

  • Use the explain subcommand to understand the purpose of different subcommands and their options.

Before getting started, ensure that Tadah! is installed and accessible through your $PATH. If it is not, either add it to your $PATH or use the full path to the Tadah! binary in your commands.
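The $PATH check described above can also be scripted. The following is a minimal Python sketch using the standard-library shutil.which; the helper name find_tadah is hypothetical, and only the executable name tadah comes from this page.

```python
import shutil

def find_tadah(binary="tadah"):
    """Return the full path of the binary if it is on $PATH, else None."""
    return shutil.which(binary)

path = find_tadah()
if path is None:
    print("tadah not found on $PATH; add it or call it by its full path")
else:
    print(f"tadah found at {path}")
```

If the lookup returns None, either extend $PATH or call the binary by its full path, as noted above.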

Note

If you need installation instructions or system prerequisites, refer to Installation.

Accessing Tadah! Help

The tadah binary provides a help option that displays the available subcommands and their options. This is especially useful for first-time users or those needing a refresher on Tadah!’s capabilities.

tadah --help

or:

tadah -h

These commands will print a list of subcommands and their short descriptions.

Subcommand Hierarchy

Tadah! supports multiple levels of subcommands for better organization and clarity:

  • Top-level subcommands: These include commands like train or data.

  • Second-level subcommands: Certain commands (like data) have subcommands of their own (e.g., split, balance).

For instance, in tadah data split, data is the top-level subcommand and split is one of its second-level subcommands.

Subcommand Help

For help with a specific subcommand, such as train:

tadah train --help

Tadah! explain command

Tadah! offers an explain subcommand to help you understand the purpose of each subcommand and its options. The explain subcommand takes a single argument:

  • The argument represents a query about either a particular (sub)command or an option.

  • It prints an extended description, including usage, purpose, and any additional helpful information.

  • To specify which subcommand or option to explain, use a dot-separated format: the dots separate the subcommand names from each other and from the option of interest, which is written without its leading dashes.

For example, to learn about the train command’s --task option:

tadah explain train.task

For a second-level subcommand, such as tadah data split:

tadah explain data.split

And to learn more about the --dbfile option of tadah data split:

tadah explain data.split.dbfile
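The dot-separated convention used in these three queries can be sketched as a small Python helper. The function explain_query is hypothetical and is not part of Tadah!; it only reproduces the naming pattern shown above.

```python
def explain_query(subcommands, option=None):
    """Join a subcommand path and an optional option into a dotted query.

    Leading dashes on the option are dropped, so "--dbfile" becomes "dbfile".
    """
    parts = list(subcommands)
    if option is not None:
        parts.append(option.lstrip("-"))
    return ".".join(parts)

# Rebuild the three queries from the examples above.
print("tadah explain", explain_query(["train"], "--task"))
print("tadah explain", explain_query(["data", "split"]))
print("tadah explain", explain_query(["data", "split"], "--dbfile"))
```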

Runtime Errors

Tadah! reports descriptive runtime errors, for example when a configuration file is missing a required key. While we strive for reliability, no code is entirely bug-free. If you encounter any issues, please report them on Tadah!’s GitLab.

Example 1 - Basics: Training and Prediction

This example demonstrates how to train a model with Tadah!MLIP and then use it for prediction. The config.train file defines the parameters of the model.

The required input files are available in the Git repository at: https://git.ecdf.ed.ac.uk/tadah/tadah.mlip/-/tree/main/examples/CLI/ex_1

Training

To train the model, use the following command:

tadah train -c config.train -v 2

  • -c config.train: Specifies the training configuration file.

  • -v 2: Sets the verbosity level to 2 for detailed output during the training process.

Prediction

Once the model is trained, you can run predictions on any Tadah!MLIP dataset. Here we simply reuse the training dataset for demonstration purposes.

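tadah predict -p pot.tadah -d tdata.db -v 2 -A

  • -p pot.tadah: The trained potential written by the training step.

  • -d tdata.db: The dataset to predict on (here, the training dataset).

  • -A: Generates analytics such as RMSE and R^2.

  • -v 2: Sets the verbosity level to 2 for detailed output during prediction.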
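For readers unfamiliar with the two analytics named above, here is a minimal Python sketch of the textbook definitions of RMSE and R^2. It assumes the standard formulas; the exact conventions used by the -A flag (for example per-atom normalisation) are not stated here and may differ. The sample numbers are made up.

```python
import math

def rmse(reference, predicted):
    """Root-mean-square error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(reference, predicted)) / n)

def r_squared(reference, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot

reference = [1.0, 2.0, 3.0, 4.0]   # made-up reference energies
predicted = [1.1, 1.9, 3.2, 3.8]   # made-up model predictions
print(f"RMSE = {rmse(reference, predicted):.4f}")
print(f"R^2  = {r_squared(reference, predicted):.4f}")
```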

Additional Notes

  • The dataset contains over 10,000 structures, so the process may take a while when using the -F or -S flags to train or predict with forces and stresses, respectively.

Example 2 - Basics: Nested Fitting Procedure

This example shows how to optimise a model’s hyper-parameters (HPs) with Tadah!MLIP. The workflow is deliberately simple and is intended only as a minimal, self-contained demonstration of the nested fitting (also called hyper-parameter optimisation, HPO) capability.

Scope

In this tutorial we optimise the HPs only against a validation data set. Doing so will generally lead to some degree of over-fitting to that validation set. How to add physics-based constraints or extra test sets to mitigate over-fitting is covered in a separate example.

Files provided

  • config.train – training-stage settings for the underlying potential.

# For a description of KEYS and corresponding values, see the documentation:
# https://tadah.readthedocs.io/

DBFILE tdata.db           # Training dataset

INIT2B true               # Use two-body descriptor

MODEL M_BLR BF_Linear     # Use linear model
TYPE2B D2_Blip 4 4 Ta Ta  # Use D2_Blip descriptor
RCTYPE2B Cut_Cos          # Cutoff function for two-body descriptor

RCUT2B 5.3                # Cutoff distance

SGRID2B GEOM 4 0.1 1.0    # Automatic generation of blips for SGRID
CGRID2B LIN 4 1.0 5.3     # Automatic generation of blips for CGRID

LAMBDA 1e-7               # Control regularization parameter
OALGO  3                  # Select optimization algorithm
BIAS true                 # Append constant to descriptor

  • config.val – specifies the dataset file(s) that make up the validation data set.

# The validation datasets are used to obtain
# MAE, RMSE and R^2 on energies or forces.

# Specify one or more validation datasets with the DBFILE keyword.
# Paths can be either absolute or relative to the program's current
# working directory.

DBFILE vdata.db

  • config.hpo – instructions that tell Tadah!MLIP how to perform HPO (choice of optimiser, loss function, search space, logging policy, …).

# 1. Define the optimizer (see the documentation for the full list).
#    Here we use dlib’s Global Function Search (also known as MaxLIPO+TR).
OPTIMIZER 
  LIB DLIB
  ALGO GFS
  MAXEVAL 2000    # maximum number of function evaluations
  FTOL_ABS 1e-8   # numerical convergence threshold for local minima
ENDOPTIMIZER

# 2. Use the L2-norm loss function.
LOSS L2

# 3. Basic performance constraint: include the energy RMSE in the loss function with a target of 0 and a weight of 1.0.
#    The RMSE is measured on the validation dataset defined in config.val.
PC_ERMSE 0 1.0

# 4. Define search-space constraints.
OPTIM CGRID2B (1-4) 0.0 5.5
OPTIM SGRID2B (1-4) 0.1 1.0

# 5. Output-control parameters. hpo always saves the current best model to `best_pot.tadah`.
#    Log files are written as (best_)loss.tadah, (best_)params.tadah, and (best_)outvar.tadah.
OUTPUT 10     # write to logs every 10th iteration, regardless of the loss value
BOUTPUT 1     # always write to the best_ logs

# 6. Directories and dump rate for archiving intermediate potentials.
DUMP 100 newPotsDir    # dump pot.tadah_i every 100th potential, regardless of the loss
BDUMP 1 newBestPotsDir # archive a potential whenever it improves on the previous best
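The SGRID2B GEOM 4 0.1 1.0 and CGRID2B LIN 4 1.0 5.3 lines in config.train request automatically generated grids. The Python sketch below shows what such grids could look like, assuming that LIN means linear spacing, GEOM means geometric spacing, and both endpoints are included. These are assumptions for illustration; the helper names lin_grid and geom_grid are hypothetical and the Tadah! documentation remains the reference for the exact rule.

```python
def lin_grid(n, start, stop):
    """n linearly spaced values from start to stop, endpoints included."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]

def geom_grid(n, start, stop):
    """n geometrically spaced values from start to stop, endpoints included."""
    ratio = (stop / start) ** (1.0 / (n - 1))
    return [start * ratio ** i for i in range(n)]

sgrid = geom_grid(4, 0.1, 1.0)   # assumed reading of: SGRID2B GEOM 4 0.1 1.0
cgrid = lin_grid(4, 1.0, 5.3)    # assumed reading of: CGRID2B LIN 4 1.0 5.3
print("SGRID2B:", [round(v, 4) for v in sgrid])
print("CGRID2B:", [round(v, 4) for v in cgrid])
```

The OPTIM lines in config.hpo then let the optimiser vary the four entries of each grid within the stated bounds.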

Running the example

Invoke the nested fitting run with:

tadah hpo --config config.train --validation config.val --hpotarget config.hpo

The program iterates until it reaches the maximum number of evaluations set in config.hpo or until you stop it manually (e.g. with <Ctrl-C>). Whenever it finds a parameter set that improves upon the current best loss, it writes the corresponding potential to best_pot.tadah.
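The structure of a nested fit can be illustrated with a toy problem in Python. This is a sketch under stated assumptions, not Tadah!'s implementation: the model is a single exponential basis function, the outer stage is a plain scan over four candidate values (standing in for MaxLIPO+TR), and the loss is taken to be the validation RMSE, which is how a single PC_ERMSE term with target 0 and weight 1.0 is assumed to behave. All function and variable names are hypothetical.

```python
import math

def inner_fit(eta, xs, ys, lam=1e-7):
    """Inner stage: ridge fit of one linear weight w for the basis exp(-eta*x)."""
    phi = [math.exp(-eta * x) for x in xs]
    return sum(p * y for p, y in zip(phi, ys)) / (sum(p * p for p in phi) + lam)

def validation_rmse(eta, w, xs, ys):
    """Loss for the outer stage: RMSE of the fitted model on validation data."""
    errs = [(w * math.exp(-eta * x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errs) / len(errs))

def target(x):
    """Made-up reference function that generates both data sets."""
    return 2.0 * math.exp(-1.5 * x)

train_x = [0.0, 0.5, 1.0, 1.5, 2.0]
val_x = [0.25, 0.75, 1.25, 1.75]
train_y = [target(x) for x in train_x]
val_y = [target(x) for x in val_x]

best_eta, best_loss = None, float("inf")
for eta in [0.5, 1.0, 1.5, 2.0]:        # outer stage: candidate hyper-parameters
    w = inner_fit(eta, train_x, train_y)  # inner stage: train with this candidate
    loss = validation_rmse(eta, w, val_x, val_y)
    if loss < best_loss:                  # keep only improvements, like best_pot.tadah
        best_eta, best_loss = eta, loss

print(f"best eta = {best_eta}, validation RMSE = {best_loss:.2e}")
```

Each outer iteration retrains the model from scratch, which is why the cost of a nested fit scales with the number of evaluations allowed in config.hpo.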

Output files

The optimiser produces three main log files, written at the interval set by OUTPUT (every 10th iteration in this example):

  • loss.tadah – value of each individual loss term and the total loss.

  • params.tadah – hyper-parameter vector for the current iteration.

  • outvar.tadah – any additional output variables (by default each loss term).

For convenience, each of these files has a best counterpart that is updated only when a new optimum is reached:

  • best_loss.tadah

  • best_params.tadah

  • best_outvar.tadah

These “best_” logs allow you to track the progress of the optimisation without having to parse the full history.

Example 3 - DM_mBlipEnv: Train and Predict

This example exercises the DM_mBlipEnv many-body descriptor.

What the descriptor does

DM_mBlipEnv mixes two different basis families:

  • The atomic density rho is built from a Gaussian radial basis - Gaussian-type orbitals exp(-eta*(r - r_s)^2) summed over the neighbours of each atom. CGRIDMB sets the centre r_s and SGRIDMB the inverse width eta.

  • The embedding function F(rho) is built from blip (cubic B-spline) basis functions in rho-space, each multiplied by an inverse-cutoff envelope (Cut_SinInv) of width RCUTENV. CEMBFUNC sets the blip centres mu and SEMBFUNC the inverse-width parameters eta, so that each blip has support mu +/- 2/eta.

So the descriptor is a Gaussian density fed through a blip embedding. The blip embedding is a flexible spline basis: with enough blips it can represent essentially any EAM-style embedding function F(rho). The envelope enforces the physical EAM boundary condition F(rho) -> 0 as rho -> 0 - an atom with no neighbours contributes no embedding energy - by smoothly ramping F down over the low-density region rho < RCUTENV.

Restriction on L

The L integer in TYPEMB is the maximum orbital angular momentum. DM_mBlipEnv is rotationally invariant only for L = 0. Its embedding function is not F_SQ (rho^2), so for L > 0 the embedding is applied to the individual orbital densities before the invariance-restoring quadratic sum, and the resulting descriptor is not rotationally invariant (the code emits a runtime WARNING). L is also hard-capped at 6. This example therefore uses L = 0. For a rotationally invariant many-body descriptor with L > 0, use the DM_PS_* family instead.

Files provided

  • config.train - training settings using DM_mBlipEnv.

  • tdata.db - 50 thermally displaced (MD) snapshots of bcc Ta, each a 54-atom 3x3x3 conventional supercell (lattice constant a0 ~ 3.32 Angstrom). The set is intentionally small so the example runs in seconds.

The TYPEMB line

TYPEMB DM_mBlipEnv 0 1 1 5 5 Ta Ta - the integers, in order, are:

  • 0 - L, the maximum orbital angular momentum (see above);

  • 1 - size(CGRIDMB), number of radial Gaussian centres;

  • 1 - size(SGRIDMB), number of radial Gaussian widths;

  • 5 - size(CEMBFUNC), number of embedding blip centres;

  • 5 - size(SEMBFUNC), number of embedding blip widths;

followed by the element pair the descriptor block applies to (Ta Ta). CGRIDMB / SGRIDMB must have size 1 - the dimension-expanding blip embedding needs a single radial basis function; the descriptor output dimension comes from CEMBFUNC / SEMBFUNC instead.
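The field order listed above can be made explicit with a short Python sketch that splits the TYPEMB line into named parts. This is only an illustration of the layout described here for DM_mBlipEnv; it is not Tadah!'s parser, and the function and key names are hypothetical.

```python
def parse_typemb(line):
    """Split a DM_mBlipEnv TYPEMB line into named fields (illustrative only)."""
    tokens = line.split()
    if tokens[0] != "TYPEMB":
        raise ValueError("not a TYPEMB line")
    names = ["L", "n_cgridmb", "n_sgridmb", "n_cembfunc", "n_sembfunc"]
    fields = {"descriptor": tokens[1]}
    fields.update(zip(names, (int(t) for t in tokens[2:7])))
    fields["elements"] = tuple(tokens[7:9])
    return fields

fields = parse_typemb("TYPEMB DM_mBlipEnv 0 1 1 5 5 Ta Ta")
for key, value in fields.items():
    print(f"{key:12s} {value}")
```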

Density range and WATOM

With the radial basis in config.train (CGRIDMB = 2.88, SGRIDMB = 8.0, RCUTMB = 5.0, Cut_Cos) the raw per-atom density of this dataset has a median of ~3.24. The WATOM key applies a per-species weight to every neighbour’s contribution to rho; WATOM = 0.309 (i.e. 1/3.24) rescales the density so that its median is ~1. After scaling, the per-atom density lies in rho ~ [0.89, 1.10] with median ~1.00.
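The arithmetic behind the WATOM value can be checked in a few lines of Python. The sketch assumes that the density scales linearly with WATOM, which follows from WATOM weighting each neighbour's contribution; the raw band is derived here from the quoted scaled band and is not a separately reported number.

```python
raw_median = 3.24                  # median raw density quoted in the text
watom = 1.0 / raw_median           # weight that maps the median to 1
scaled_band = (0.89, 1.10)         # scaled density range quoted in the text
raw_band = tuple(rho / watom for rho in scaled_band)

print(f"WATOM ~ {watom:.3f}")
print(f"implied raw density band ~ [{raw_band[0]:.2f}, {raw_band[1]:.2f}]")
```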

Choosing the blip basis (CEMBFUNC / SEMBFUNC)

The blips must give an embedding F(rho) that is finite for every rho > 0 and goes to zero only at rho = 0. Two facts drive the choice:

  • A blip contributes nothing outside its rho-support mu +/- 2/eta. If no blip has support at a given rho, the embedding is identically zero there - a flat dead zone.

  • A linear fit can only constrain a blip whose support overlaps the data band (rho ~ 0.89-1.10). A blip with no data overlap gets a zero weight from the regulariser, so adding narrow low-rho blips does not remove the dead zone - they are simply zeroed.

The fix is to make the blips wide: SEMBFUNC = 2.0 gives each blip a support of width 4/eta = 2.0. Every blip in CEMBFUNC = 0.5 0.75 1.0 1.25 1.5 then overlaps the data band (so its weight is fit from data) and the lower blips reach all the way down to rho = 0. The fitted F(rho) is consequently finite and non-zero for all rho > 0.
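The support argument can be verified numerically. The Python sketch below uses the standard cubic B-spline form for a blip and assumes each basis function is evaluated as blip(eta * (rho - mu)), which gives the support mu +/- 2/eta stated above. The narrow value eta = 20 is a made-up contrast case, and all names are hypothetical.

```python
def blip(x):
    """Standard cubic B-spline (blip), non-zero only for |x| < 2."""
    ax = abs(x)
    if ax < 1.0:
        return 1.0 - 1.5 * ax**2 + 0.75 * ax**3
    if ax < 2.0:
        return 0.25 * (2.0 - ax) ** 3
    return 0.0

def support(mu, eta):
    """Interval of rho on which a blip centred at mu is non-zero."""
    return (mu - 2.0 / eta, mu + 2.0 / eta)

def overlaps(mu, eta, band):
    """True if the blip support intersects the data band."""
    lo, hi = support(mu, eta)
    return lo < band[1] and hi > band[0]

centres = [0.5, 0.75, 1.0, 1.25, 1.5]   # CEMBFUNC
band = (0.89, 1.10)                      # scaled data band
for mu in centres:
    lo, hi = support(mu, 2.0)            # SEMBFUNC = 2.0 (wide blips)
    print(f"mu={mu:4.2f} support=[{lo:5.2f}, {hi:5.2f}] "
          f"overlaps data: {overlaps(mu, 2.0, band)}")

# Contrast: a narrow blip at low rho never sees the data band.
print("narrow blip (eta=20, mu=0.5) overlaps data:", overlaps(0.5, 20.0, band))
```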

These wide blips give a non-zero embedding at rho = 0 itself, and RCUTENV = 0.6 is exactly what cancels it: the Cut_SinInv envelope ramps from 0 at rho = 0 up to 1 at rho = RCUTENV, smoothly pulling F -> 0 as rho -> 0. RCUTENV = 0.6 lies below the data band, so the envelope is saturated to 1 across rho ~ 0.89-1.10 and does not reshape the fit where training data is present.
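The role of the envelope can be pictured with a stand-in ramp in Python. The cosine-shaped function below is an assumption chosen only because it rises smoothly from 0 at rho = 0 to 1 at rho = RCUTENV; it is not the actual Cut_SinInv formula, which is not given here.

```python
import math

def envelope(rho, rcutenv=0.6):
    """Stand-in smooth ramp: 0 at rho = 0, 1 for rho >= rcutenv.

    Illustrative shape only, NOT the Cut_SinInv definition.
    """
    if rho <= 0.0:
        return 0.0
    if rho >= rcutenv:
        return 1.0
    return 0.5 * (1.0 - math.cos(math.pi * rho / rcutenv))

for rho in (0.0, 0.3, 0.6, 0.89, 1.10):
    print(f"rho={rho:4.2f} envelope={envelope(rho):.3f}")
```

Whatever the exact functional form, the two properties that matter for this example are the ones shown: the factor vanishes at rho = 0 and equals 1 throughout the data band.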

Training

tadah train -c config.train -v 2

Training writes the trained potential to pot.tadah in the working directory. The underlying fit uses a 5-column design matrix, one column per blip.

Prediction

To verify the round-trip, predict back on the training set:

tadah predict -p pot.tadah -d tdata.db -v 2 -A

-A prints RMSE / R^2 analytics.

Expected results

The dataset is a near-uniform set of thermal bcc-Ta snapshots, so the total energy spread is tiny (standard deviation ~7.3 meV/atom). On this set the configuration above gives roughly:

  • Energy R^2 ~ 0.70

  • Energy RMSE ~ 4.0 meV/atom

i.e. the descriptor captures most of the (small) energy variance. This is a genuine, working fit - the descriptor block is non-zero and informative.
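The two figures are consistent with each other, as a short Python check shows. It assumes the usual relation R^2 = 1 - RMSE^2 / sigma^2, which holds when RMSE and the standard deviation sigma are both computed per atom on the same set of structures.

```python
sigma = 7.3   # meV/atom, standard deviation of the reference energies
rmse = 4.0    # meV/atom, energy RMSE quoted above
r2 = 1.0 - (rmse / sigma) ** 2
print(f"R^2 ~ {r2:.2f}")
```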

A caveat on the embedding tails: this dataset only samples rho ~ 1 (dense bcc environments). The fitted F(rho) is finite and non-zero for all rho > 0 - there is no dead zone - but its shape away from rho ~ 1 is set by the wide-blip basis and the regulariser, not by data. The values of F at low and high rho are therefore a smooth extrapolation, not a physically constrained embedding; pinning those regions would need training structures that actually sample low- and high-density environments (surfaces, vacancies, dimers, compression). For high-accuracy fits on this dataset the two-body D2_Blip baseline in ex_1 remains a strong choice.