Examples
Example 0 - Basics: Tadah!MLIP
tadah is a command-line tool designed to simplify training and evaluation of machine learning models. It provides a user-friendly interface for managing configurations, datasets, and training processes. tadah is the main entry point for subcommands that perform specific tasks within your machine learning workflow.
In this example, we demonstrate how to:
Access help information using the tadah binary.
Use the explain subcommand to understand the purpose of different subcommands and their options.
Before getting started, ensure that Tadah! is installed and accessible through your $PATH. If it is not, either add it to your $PATH or use the full path to the Tadah! binary in your commands.
Note
If you need installation instructions or system prerequisites, refer to Installation.
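One way to confirm that the binary is reachable before continuing is sketched below. The helper name check_on_path is hypothetical and not part of Tadah!; it relies only on the standard POSIX `command -v` lookup.

```shell
#!/bin/sh
# Report whether an executable is reachable through $PATH.
# check_on_path is a hypothetical helper name, not part of Tadah!.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found on PATH; add its directory to PATH or use the full path"
        return 1
    fi
}

# Check the tadah binary; "|| true" keeps a strict shell from aborting here.
check_on_path tadah || true
```

If the check fails, either extend $PATH or call the binary by its full path, as described above.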
Accessing Tadah! Help
The tadah binary provides a help command to display information about available subcommands and their options. This is especially useful for first-time users or those needing a refresher on Tadah!’s capabilities.
tadah --help
or:
tadah -h
These commands will print a list of subcommands and their short descriptions.
Subcommand Hierarchy
Tadah! supports multiple levels of subcommands for better organization and clarity:
Top-level subcommands: These include commands like train or data.
Second-level subcommands: Certain commands (like data) have subcommands of their own (e.g., split, balance).
For instance, in tadah data split, data is the top-level subcommand and split is a second-level subcommand that performs one of the data-related operations.
Subcommand Help
For help with a specific subcommand, such as train:
tadah train --help
Tadah! explain command
Tadah! offers an explain subcommand to help you understand the purpose of each subcommand and its options. The explain subcommand takes a single argument:
The argument represents a query about either a particular (sub)command or an option.
It prints an extended description, including usage, purpose, and any additional helpful information.
To specify which subcommand or option to explain, use a dot-separated format: dots separate the subcommand levels from each other and from the option of interest, and the option is written without its leading dashes.
For example, to learn about the train command’s --task option:
tadah explain train.task
For a second-level subcommand, such as tadah data split:
tadah explain data.split
And to learn more about the --dbfile option of tadah data split:
tadah explain data.split.dbfile
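The mapping from a command line to an explain query can be sketched as a small shell function. The name explain_query is hypothetical and not part of Tadah!; it only joins its arguments with dots and strips a leading "--" from long options, following the pattern of the three examples above.

```shell
#!/bin/sh
# Build a dot-separated explain query from subcommands and an option.
# explain_query is a hypothetical helper, not part of Tadah!.
explain_query() {
    query=""
    for part in "$@"; do
        part="${part#--}"                # strip a leading "--" from a long option
        query="${query:+$query.}$part"   # join parts with dots
    done
    printf '%s\n' "$query"
}

explain_query train --task          # prints: train.task
explain_query data split --dbfile   # prints: data.split.dbfile
```

The result can then be passed on, for example as tadah explain "$(explain_query data split --dbfile)".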
Runtime Errors
Tadah! provides helpful runtime error messages, for example when a configuration file is missing a necessary key. While we strive for reliability, no code is entirely bug-free. If you encounter any issues, please report them on Tadah!’s GitLab.
Example 1 - Basics: Training and Prediction
This example demonstrates how to train a model with Tadah!MLIP and then use it for prediction. The config.train file defines
the necessary parameters for the model.
To use this example, ensure you have access to all required files available in the Git repository at: https://git.ecdf.ed.ac.uk/tadah/tadah.mlip/-/tree/main/examples/CLI/ex_1
Training
To train the model, use the following command:
tadah train -c config.train -v 2
-v 2: Enables verbosity for detailed output during the training process.
Prediction
Once the model is trained, you can run predictions on any Tadah!MLIP dataset. Here we simply reuse the training dataset for demonstration purposes.
tadah predict -p pot.tadah -d tdata.db -v 2 -A
-A: Generates useful analytics such as RMSE and R^2.
-v 2: Enables verbosity for detailed output during prediction.
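For reference, the two metrics named above are conventionally defined as follows. These are the standard textbook definitions; the source does not state which normalisation Tadah! applies (for example, per-atom energies), so treat the symbols as assumptions: $y_i$ is a reference value, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the reference values, and $N$ the number of data points.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}
               {\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}
```

A lower RMSE and an R^2 closer to 1 indicate better agreement between predictions and reference data.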
Additional Notes
The dataset contains over 10,000 structures, so the process may take a while when using the -F or -S flags to train or predict with forces and stresses, respectively.
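The two steps of this example can be chained in a short script, sketched below. The wrapper name run_train_predict and the TADAH variable are hypothetical; the tadah commands are exactly the ones shown above. The sketch assumes that training writes pot.tadah to the current directory, which the prediction command implies but the text does not state explicitly.

```shell
#!/bin/sh
# TADAH may be set to the full path of the binary if it is not on $PATH.
TADAH="${TADAH:-tadah}"

# run_train_predict is a hypothetical wrapper around the two commands above.
run_train_predict() {
    # Train; stop if training fails so a stale potential is never evaluated.
    "$TADAH" train -c config.train -v 2 || return 1
    # Predict on the training dataset and print analytics (-A).
    "$TADAH" predict -p pot.tadah -d tdata.db -v 2 -A
}

# Run only when the binary is actually available.
if command -v "$TADAH" >/dev/null 2>&1; then
    run_train_predict
fi
```

Run the script from the directory that contains config.train and tdata.db.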
Example 2 - Basics: Nested Fitting Procedure
This example shows how to optimise a model’s hyper-parameters (HPs) with Tadah!MLIP. The workflow is deliberately simple and is intended only as a minimal, self-contained demonstration of the nested fitting (also called hyper-parameter optimisation, HPO) capability.
Scope
In this tutorial we optimise the HPs only against a validation data set. Doing so will generally lead to some degree of over-fitting to that validation set. How to add physics-based constraints or extra test sets to mitigate over-fitting is covered in a separate example.
Files provided
config.train – training-stage settings for the underlying potential.
# For a description of KEYS and corresponding values, see the documentation:
# https://tadah.readthedocs.io/
DBFILE tdata.db # Training dataset
INIT2B true # Use two-body descriptor
MODEL M_BLR BF_Linear # Use linear model
TYPE2B D2_Blip 4 4 Ta Ta # Use D2_Blip descriptor
RCTYPE2B Cut_Cos # Cutoff function for two-body descriptor
RCUT2B 5.3 # Cutoff distance
SGRID2B GEOM 4 0.1 1.0 # Automatic generation blips for SGRID
CGRID2B LIN 4 1.0 5.3 # Automatic generation blips for CGRID
LAMBDA 1e-7 # Control regularization parameter
OALGO 3 # Select optimization algorithm
BIAS true # Append constant to descriptor
config.val – list of configurations that form the validation data set.
# The validation datasets are used to obtain
# MAE, RMSE and R^2 on energies or forces.
# Specify one or more validation datasets with DBFILE keyword
# Path can be either absolute or relative to the program current
# working directory.
DBFILE vdata.db
config.hpo – instructions that tell Tadah!MLIP how to perform HPO (choice of optimiser, loss function, search space, logging policy, …).
# 1. Define the optimizer (see the documentation for the full list).
# Here we use dlib’s Global Function Search (also known as MaxLIPO+TR).
OPTIMIZER
LIB DLIB
ALGO GFS
MAXEVAL 2000 # maximum number of iterations
FTOL_ABS 1e-8 # numerical convergence threshold for local minima
ENDOPTIMIZER
# 2. Use the L2-norm loss function.
LOSS L2
# 3. Basic performance constraint: include the energy RMSE in the loss function with a target of 0 and a weight of 1.0.
# The RMSE is measured on the validation dataset defined in config.val.
PC_ERMSE 0 1.0
# 4. Define search-space constraints.
OPTIM CGRID2B (1-4) 0.0 5.5
OPTIM SGRID2B (1-4) 0.1 1.0
# 5. Output-control parameters. hpo always saves the current best model to `best_pot.tadah`.
# Log files are written as (best_)loss.tadah, (best_)params.tadah, and (best_)outvar.tadah.
OUTPUT 10 # write to logs every 10th iteration, regardless of the loss value
BOUTPUT 1 # always write to the best_ logs
# 6. Directories and dump rate for archiving intermediate potentials.
DUMP 100 newPotsDir # dump pot.tadah_i every 100th potential, regardless of the loss
BDUMP 1 newBestPotsDir # archive a potential whenever it improves on the previous best
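The SGRID2B and CGRID2B lines in config.train each generate four values, and the OPTIM lines in config.hpo then let the optimiser vary those four values (1-4) within the given bounds. The sketch below shows what such starting grids could look like, assuming that LIN means linear spacing and GEOM means geometric spacing between the two bounds. The source does not spell this out, so both the interpretation and the helper name make_grid are assumptions; the function does not call Tadah!.

```shell
#!/bin/sh
# make_grid is a hypothetical helper illustrating an ASSUMED meaning of
# the LIN and GEOM keywords; it does not call Tadah!.
# Usage: make_grid LIN|GEOM count first last
make_grid() {
    awk -v kind="$1" -v n="$2" -v a="$3" -v b="$4" 'BEGIN {
        for (i = 0; i < n; i++) {
            t = (n > 1) ? i / (n - 1) : 0              # fraction of the way from a to b
            if (kind == "GEOM") v = a * (b / a) ^ t    # constant ratio between points
            else                v = a + (b - a) * t    # constant step between points
            printf "%s%.4f", (i ? " " : ""), v
        }
        printf "\n"
    }'
}

make_grid LIN 4 1.0 5.3     # cf. CGRID2B LIN 4 1.0 5.3
make_grid GEOM 4 0.1 1.0    # cf. SGRID2B GEOM 4 0.1 1.0
```

Under this reading, the OPTIM bounds 0.0 to 5.5 for CGRID2B and 0.1 to 1.0 for SGRID2B enclose every initial grid value.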
Running the example
Invoke the nested fitting run with:
tadah hpo --config config.train --validation config.val --hpotargets config.hpo
The program iterates until it reaches the maximum number of evaluations set in
config.hpo or until you stop it manually (e.g. with <Ctrl-C>).
Whenever it finds a parameter set that improves upon the current best loss,
it writes the corresponding potential to best_pot.tadah.
Output files
The optimiser produces three main log files that record every iteration:
loss.tadah – value of each individual loss term and the total loss.
params.tadah – hyper-parameter vector for the current iteration.
outvars.tadah – any additional output variables (by default each loss term).
For convenience, each of these files has a best counterpart that is updated only when a new optimum is reached:
best_loss.tadah
best_params.tadah
best_outvars.tadah
These “best_” logs allow you to track the progress of the optimisation without having to parse the full history.
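A quick way to peek at the current optimum while the run is in progress is sketched below. The helper name show_best is hypothetical, and the sketch assumes that each log is a plain-text file whose most recent record is its last line; the source does not document the log format.

```shell
#!/bin/sh
# show_best is a hypothetical helper, not part of Tadah!.
# ASSUMPTION: each best_ log is plain text with the newest record last.
# Usage: show_best [directory]   (defaults to the current directory)
show_best() {
    dir="${1:-.}"
    for name in best_loss.tadah best_params.tadah best_outvars.tadah; do
        if [ -f "$dir/$name" ]; then
            printf '%s: %s\n' "$name" "$(tail -n 1 "$dir/$name")"
        else
            printf '%s: not written yet\n' "$name"
        fi
    done
}

show_best .
```

Calling show_best from a second terminal in the run directory avoids interrupting the optimisation.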