Examples

Example 0 - Basics: Tadah!MLIP

tadah is a command-line tool designed to simplify training and evaluation of machine learning models. It provides a user-friendly interface for managing configurations, datasets, and training processes. tadah is the main entry point for subcommands that perform specific tasks within your machine learning workflow.

In this example, we demonstrate how to:

Access help information using the tadah binary.
Use the explain subcommand to understand the purpose of different subcommands and their options.

Before getting started, ensure that Tadah! is installed and accessible through your $PATH. If it is not, either add it to your $PATH or use the full path to the Tadah! binary in your commands.

Note

If you need installation instructions or system prerequisites, refer to Installation.
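As a quick check before running the commands in this example, the lookup that the shell performs on $PATH can be reproduced in a few lines of Python. This is a minimal sketch: the helper name find_tadah is hypothetical and is not part of Tadah!; it only relies on the standard-library function shutil.which.

```python
import shutil


def find_tadah(binary_name="tadah"):
    """Return the full path of the binary if it is on $PATH, else None."""
    return shutil.which(binary_name)


location = find_tadah()
if location is None:
    # Not on $PATH: extend $PATH or call the binary by its full path.
    print("tadah was not found on $PATH")
else:
    print(f"tadah found at {location}")
```

If the binary is not found, either extend $PATH or pass the full path to the binary in every command.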

Accessing Tadah! Help

The tadah binary provides a help command to display information about available subcommands and their options. This is especially useful for first-time users or those needing a refresher on Tadah!’s capabilities.

tadah --help

or:

tadah -h

These commands will print a list of subcommands and their short descriptions.

Subcommand Hierarchy

Tadah! supports multiple levels of subcommands for better organization and clarity:

Top-level subcommands: These include commands like train or data.
Second-level subcommands: Certain commands (like data) have subcommands of their own (e.g., split, balance).

For instance, in tadah data split, data is the top-level subcommand and split is a second-level subcommand that performs one of the data-related operations.

Subcommand Help

For help with a specific subcommand, such as train:

tadah train --help

Tadah! explain command

Tadah! offers an explain subcommand to help you understand the purpose of each subcommand and its options. The explain subcommand takes a single argument:

The argument is a query naming either a (sub)command or one of its options.
The query uses a dot-separated format: the dots separate the subcommand name(s) and, if present, the option of interest.
The command prints an extended description covering usage, purpose, and any additional helpful information.

For example, to learn about the train command’s --task option:

tadah explain train.task

For a second-level subcommand, such as tadah data split:

tadah explain data.split

And to learn more about the --dbfile option of tadah data split:

tadah explain data.split.dbfile
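The three queries above follow one pattern: subcommand names joined by dots, optionally followed by the option name without its leading dashes. The sketch below assembles such a query string. The helper explain_query is hypothetical (it is not shipped with Tadah!), and the dash-stripping rule is inferred only from the examples shown here.

```python
def explain_query(subcommands, option=None):
    """Build the dot-separated argument for `tadah explain`.

    subcommands: sequence of subcommand names, e.g. ["data", "split"]
    option: optional option name, with or without leading dashes
    """
    parts = list(subcommands)
    if option is not None:
        # Assumption: the query uses the option name without leading dashes.
        parts.append(option.lstrip("-"))
    return ".".join(parts)


print(explain_query(["train"], "--task"))            # train.task
print(explain_query(["data", "split"]))              # data.split
print(explain_query(["data", "split"], "--dbfile"))  # data.split.dbfile
```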

Runtime Errors

Tadah! reports descriptive runtime errors, for example when a configuration file is missing a required key. While we strive for reliability, no code is entirely bug-free. If you encounter any issues, please report them on Tadah!’s GitLab:

Tadah! GitLab Issues

Example 1 - Basics: Training and Prediction

This example demonstrates how to train a model with Tadah!MLIP and then use it for prediction. The config.train file defines the necessary parameters for the model.

To use this example, ensure you have access to all required files available in the Git repository at: https://git.ecdf.ed.ac.uk/tadah/tadah.mlip/-/tree/main/examples/CLI/ex_1

Training

To train the model, use the following command:

tadah train -c config.train -v 2

-v 2: Enables verbosity for detailed output during the training process.

Prediction

Once the model is trained, you can run predictions on any Tadah!MLIP dataset. Here we simply reuse the training dataset for demonstration purposes.

tadah predict -p pot.tadah -d tdata.db -v 2 -A

-A: Generates useful analytics such as RMSE and R^2.
-v 2: Enables verbosity for detailed output during prediction.
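For reference, RMSE and R^2 are conventionally defined as in the sketch below. It uses the textbook definitions on a small made-up data set. How Tadah! evaluates these quantities internally (for example, any per-atom normalisation) is not described here, so treat this only as an illustration of what the numbers mean.

```python
import math


def rmse(reference, predicted):
    """Root-mean-square error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n)


def r_squared(reference, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# Illustrative numbers only, not Tadah! output.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(f"RMSE = {rmse(reference, predicted):.3f}")       # RMSE = 0.158
print(f"R^2  = {r_squared(reference, predicted):.3f}")  # R^2  = 0.980
```

An RMSE close to 0 and an R^2 close to 1 indicate that predictions track the reference values closely.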

Additional Notes

The dataset contains over 10,000 structures, so training or prediction may take a while when forces (-F flag) or stresses (-S flag) are included.

Example 2 - Basics: Nested Fitting Procedure

This example shows how to optimise a model’s hyper-parameters (HPs) with Tadah!MLIP. The workflow is deliberately simple and is intended only as a minimal, self-contained demonstration of the nested fitting (also called hyper-parameter optimisation, HPO) capability.

Scope

In this tutorial we optimise the HPs only against a validation data set. Doing so will generally lead to some degree of over-fitting to that validation set. How to add physics-based constraints or extra test sets to mitigate over-fitting is covered in a separate example.

Files provided

config.train – training-stage settings for the underlying potential.

# For a description of KEYS and corresponding values, see the documentation:
# https://tadah.readthedocs.io/

DBFILE tdata.db           # Training dataset

INIT2B true               # Use two-body descriptor

MODEL M_BLR BF_Linear     # Use linear model
TYPE2B D2_Blip 4 4 Ta Ta  # Use D2_Blip descriptor
RCTYPE2B Cut_Cos          # Cutoff function for two-body descriptor

RCUT2B 5.3                # Cutoff distance

SGRID2B GEOM 4 0.1 1.0    # Automatic generation of blips for SGRID
CGRID2B LIN 4 1.0 5.3     # Automatic generation of blips for CGRID

LAMBDA 1e-7               # Regularization parameter
OALGO  3                  # Select optimization algorithm
BIAS true                 # Append constant to descriptor
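The SGRID2B and CGRID2B lines above request automatically generated grids. The sketch below shows what four geometrically and four linearly spaced values look like. It assumes that GEOM and LIN mean a geometric and a linear progression with both end points included; this reading is an assumption and should be checked against the Tadah! documentation. The function names are hypothetical.

```python
def linear_grid(n, start, stop):
    """n evenly spaced values from start to stop, end points included."""
    if n < 2:
        raise ValueError("n must be at least 2")
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def geometric_grid(n, start, stop):
    """n values from start to stop with a constant ratio between neighbours."""
    if n < 2 or start <= 0 or stop <= 0:
        raise ValueError("need n >= 2 and positive end points")
    ratio = (stop / start) ** (1.0 / (n - 1))
    return [start * ratio ** i for i in range(n)]


# Mirrors "CGRID2B LIN 4 1.0 5.3" and "SGRID2B GEOM 4 0.1 1.0" under the
# assumption stated above.
print([round(x, 4) for x in linear_grid(4, 1.0, 5.3)])
print([round(x, 4) for x in geometric_grid(4, 0.1, 1.0)])
```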

config.val – list of configurations that form the validation data set.

# The validation datasets are used to obtain
# MAE, RMSE and R^2 on energies or forces.

# Specify one or more validation datasets with the DBFILE keyword.
# The path can be either absolute or relative to the program's
# current working directory.

DBFILE vdata.db

config.hpo – instructions that tell Tadah!MLIP how to perform HPO (choice of optimiser, loss function, search space, logging policy, …).

# 1. Define the optimizer (see the documentation for the full list).
#    Here we use dlib’s global function search (GFS), also known as MaxLIPO+TR.
OPTIMIZER 
  LIB DLIB
  ALGO GFS
  MAXEVAL 2000    # maximum number of iterations
  FTOL_ABS 1e-8   # numerical convergence threshold for local minima
ENDOPTIMIZER

# 2. Use the L2-norm loss function.
LOSS L2

# 3. Basic performance constraint: include the energy RMSE in the loss function with a target of 0 and a weight of 1.0.
#    The RMSE is measured on the validation dataset defined in config.val.
PC_ERMSE 0 1.0

# 4. Define search-space constraints.
OPTIM CGRID2B (1-4) 0.0 5.5
OPTIM SGRID2B (1-4) 0.1 1.0

# 5. Output-control parameters. hpo always saves the current best model to `best_pot.tadah`.
#    Log files are written as (best_)loss.tadah, (best_)params.tadah, and (best_)outvars.tadah.
OUTPUT 10     # write to logs every 10th iteration, regardless of the loss value
BOUTPUT 1     # always write to the best_ logs

# 6. Directories and dump rate for archiving intermediate potentials.
DUMP 100 newPotsDir    # dump pot.tadah_i every 100th potential, regardless of the loss
BDUMP 1 newBestPotsDir # archive a potential whenever it improves on the previous best
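To make the roles of the target and weight in a line such as PC_ERMSE 0 1.0 concrete, the sketch below combines (value, target, weight) terms into a single weighted L2 norm. This is one common way to build such a loss. The exact expression used by Tadah!MLIP is not given here and may differ, and the function name is hypothetical.

```python
import math


def weighted_l2_loss(terms):
    """Weighted L2 norm of deviations from targets.

    terms: iterable of (value, target, weight) tuples.
    Assumed form: sqrt(sum((weight * (value - target)) ** 2)).
    """
    return math.sqrt(
        sum((weight * (value - target)) ** 2 for value, target, weight in terms)
    )


# One energy-RMSE term with target 0 and weight 1.0, as in "PC_ERMSE 0 1.0".
# The RMSE value 0.05 is made up for illustration.
print(weighted_l2_loss([(0.05, 0.0, 1.0)]))
```

With a single term, a target of 0, and a weight of 1.0, this loss reduces to the validation energy RMSE itself.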

Running the example

Invoke the nested fitting run with:

tadah hpo --config config.train --validation config.val --hpotargets config.hpo

The program iterates until it reaches the maximum number of evaluations set in config.hpo or until you stop it manually (e.g. with <Ctrl-C>). Whenever it finds a parameter set that improves upon the current best loss, it writes the corresponding potential to best_pot.tadah.

Output files

The optimiser produces three main log files, written at the interval set by OUTPUT in config.hpo:

loss.tadah – value of each individual loss term and the total loss.
params.tadah – hyper-parameter vector for the current iteration.
outvars.tadah – any additional output variables (by default each loss term).

For convenience, each of these files has a best_ counterpart that is updated only when a new best loss is found:

best_loss.tadah
best_params.tadah
best_outvars.tadah

These “best_” logs allow you to track the progress of the optimisation without having to parse the full history.
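One way to follow a running optimisation is to print the most recent entry of best_loss.tadah. The sketch below assumes only that the log is plain text with one record per line; the column layout is not specified here, so the line is returned unparsed. The helper last_record is hypothetical.

```python
from pathlib import Path


def last_record(log_path):
    """Return the last non-empty line of a log file, or None if there is none."""
    path = Path(log_path)
    if not path.exists():
        return None
    lines = [line.strip() for line in path.read_text().splitlines() if line.strip()]
    return lines[-1] if lines else None


latest = last_record("best_loss.tadah")
print(latest if latest is not None else "no best_loss.tadah record yet")
```

Run it from the directory in which tadah hpo was started, since that is where the log is assumed to be written.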