Configuration File

Note

This page is auto-generated from the blueprint TOML on every docs build. If you spot a mismatch between this reference and runtime behaviour, please report it on the Tadah!MLIP GitLab.

Note

This configuration file is NOT used by the HPO (Hyperparameter Optimization) module of Tadah!MLIP. The HPO module uses a different configuration file format, which is documented under Nested Fitting.

This section describes the format of the configuration file used by Tadah!MLIP.

For example, the configuration file can control the training process, specifying one or more datasets for use during the training stage. It defines cutoff functions and corresponding radii along with the regression model and descriptor choices.

Important

Indexing of items that take positional indices (e.g., INDEX 1, CLI flag --index 1) starts from 1, not 0.

Key/Value Pairs

The primary structure in a configuration file is the KEY/VALUE pair. Each KEY/VALUE pair must be on a separate line, with the KEY appearing first. The KEY is always a string, followed by its VALUE. The format and type of a VALUE depend on the specific KEY.

Common Usage

Typically, only a subset of KEYS is needed to train a model. Tadah!MLIP will use default values for some keys. An error will occur if a required KEY with no default value is missing:

[user@host:~] $ tadah train -c config.train
terminate called after throwing an instance of 'std::runtime_error'
  what():  Key not found: DBFILE
Aborted (core dumped)

This message indicates that the dbfile KEY was not specified in the config.train file. To resolve this, add the dbfile key and its corresponding value to config.train.

Key Specifics

The meaning of some keys can vary with the chosen command. Check the documentation for that specific command, model or descriptor to see which keys are required and how they are interpreted. While we strive to keep key meanings consistent across Tadah!MLIP, occasional differences may still occur.

Comments

Use the # symbol to add comments in the configuration file. Both inline and full-line comments are supported.

Key Values and Formats

Some KEYS can have multiple values, specified in one of two ways:

  • Single line:

    KEY VALUE1 VALUE2 VALUE3
    
  • Multiple lines:

    KEY VALUE1
    KEY VALUE2
    KEY VALUE3
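
The rules so far (one KEY per line, # comments, values on one line or spread over repeated lines) can be illustrated with a short Python sketch. This is not the Tadah!MLIP parser: the function name parse_config is hypothetical, and treating keys as case-insensitive is an assumption suggested only by the DBFILE error message shown earlier.

```python
def parse_config(text):
    """Parse 'KEY VALUE ...' lines into a dict of key -> list of value strings."""
    config = {}
    for raw_line in text.splitlines():
        # Drop full-line and inline comments introduced by '#'.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        key, *values = line.split()
        # Assumption: keys are case-insensitive, so normalise to upper case.
        # Values of a repeated key are accumulated in file order.
        config.setdefault(key.upper(), []).extend(values)
    return config


example = """
# training set
dbfile /path/to/dbfile1
dbfile /path/to/dbfile2   # second dataset
RCUT2B 3.0 7.5
"""
parsed = parse_config(example)
```

The sketch flattens repeated lines into one list. Keys for which each line carries a separate list (for example CGRID2B with D2_mJoin) would need the lines kept apart.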
    

Value Limits

Each keyword takes a fixed number of values. Passing the wrong count can raise an error, but enforcement is not yet fully consistent. While this is being improved, keep these points in mind:

  • Too many values – Tadah!MLIP may reject the input with a clear message or quietly discard the extras; the latter can later surface as obscure run-time failures (even an occasional segmentation fault).

  • Too few values – usually triggers an error, although a crash remains possible in rare corner cases.

In short, give every keyword exactly the number of values it expects—no more, no less—to avoid unpleasant surprises.
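
Because enforcement is not yet consistent, it can help to check value counts before running Tadah!MLIP. The following Python sketch shows one way to do this. The helper check_value_count and the MAX_VALUES table are hypothetical; the limits are copied from the "Max number of values" lines of a few entries in this reference.

```python
# Maximum value counts copied from a few entries of this reference.
MAX_VALUES = {"ALPHA": 1, "LAMBDA": 2, "DIMER": 3, "ATOM": 118}


def check_value_count(key, values):
    """Return an error message if a key has too many values, else None."""
    limit = MAX_VALUES.get(key)
    if limit is not None and len(values) > limit:
        return f"{key}: got {len(values)} values, at most {limit} allowed"
    return None
```

Keys missing from the table are accepted unchanged, so the check only flags cases it knows about.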

Supported KEYS

This section contains all KEYS currently used by Tadah!MLIP.

ALPHA

[DOUBLE] <double>

Max number of values: 1 Default: 1.0

Description:

Weight precision hyper-parameter. This is the starting guess for the evidence approximation algorithm.

Example 1:

ALPHA 0.23

ATOM

[{STRING}] <element> [<element> …]

Max number of values: 118

Description:

Chemical elements.

Example 1:

"Kr"

AUDIT

[STRING] <string>

Max number of values: 1 Default: off

Description:

Pre-flight audit mode. ‘off’ (default) skips the audit. ‘warn’ emits diagnostics but proceeds. ‘error’ makes any FAIL-level finding fatal. The audit’s data scan is sampled by default (see AUDIT_SAMPLE).

Example 1:

off

Example 2:

warn

Example 3:

error

AUDIT_SAMPLE

[INT] <integer>

Max number of values: 1 Default: 256

Description:

Number of training structures sampled (deterministic pseudo-random selection) for the pre-flight audit’s dataset statistics. 0 means use the entire StructureDB. Has no effect when AUDIT is ‘off’.

Example 1:

256

Example 2:

1024

Example 3:

0

BASIS

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Basis vectors for non-linear Kernel Ridge Regression. They represent the features or functions used to map input data into a higher-dimensional feature space.

Example 1:

2.0 -4.65 0.4

Example 2:

-1.0

BETA

[DOUBLE] <double>

Max number of values: 1 Default: 1.0

Description:

Noise precision hyper-parameter. This is the starting guess for the evidence approximation algorithm.

Example 1:

BETA 0.0001

CEMBFUNC

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Position parameters for an embedding function. Used by certain many-body descriptors (e.g., F_RLR). When using DM_mJoin, supply one or more lists of parameters matching those in SEMBFUNC.

Example 1:

CEMBFUNC 0.14 0.45 1.00 1.1

CGRID2B

[{DOUBLE}] <double> [<double> …], [{STRING INT DOUBLE DOUBLE}] (<algorithm> <n> <start> <stop>) […]

Max number of values: 2147483647

Description:

Controls the center positions for radial basis functions (two-body). The parameter list may be provided manually or generated automatically. When using the meta descriptor D2_mJoin, specify one or more lists of centers corresponding to each descriptor. The number of centers should typically match the number of width parameters (SGRID2B) and remain below the cutoff distance. Alternatively, use the algorithm keyword followed by parameters to generate centers automatically (e.g., LOG or LIN).

Example 1:

CGRID2B LIN 10 0 6

Example 2:

CGRID2B 1.0 2.0

Example 3:

CGRID2B   1.0 2.0
CGRID2B   1.5 2.5

CGRIDMB

[{DOUBLE}] <center> …, [{STRING INT DOUBLE DOUBLE}] <algorithm> <N> <START> <STOP>

Max number of values: 2147483647

Description:

Specifies the center positions for many-body radial basis functions. Centers may be provided manually or generated automatically. When using the DM_mJoin meta descriptor, supply one or more lists of centers for each concatenated descriptor. Alternatively, include an algorithm such as ‘L’ (logarithmic) or ‘U’ (uniform spacing) followed by parameters.

Example 1:

CGRIDMB LIN 4 0 6.2

Example 2:

CGRIDMB 0.5 0.7

Example 3:

CGRIDMB   0.5 0.7
CGRIDMB   0.6 0.8

DIMER

[BOOL DOUBLE BOOL] <boolean> <double> <boolean>

Max number of values: 3 Default: false / 0 / false

Description:

Control for DIMER models. Users should not modify this key.

Example 1:

DIMER true 1.104 true

EWEIGHT

[DOUBLE] <double>

Max number of values: 1 Default: 1.0

Description:

Global energy scaling factor. Energies are always scaled by 1/(number of atoms). Additional configuration-level scaling factors can apply. Combined factor = EWEIGHT*(config eweight)/(#atoms).

Example 1:

EWEIGHT 0.96

FIXINDEX

[{INDEX_PATTERN}] <index>[,<index>…], [**] <start>-<stop>, [**] <start>-<stop>:<step>

Max number of values: 2147483647

Description:

Indices of weights to be fixed in optimization. Must be used with FIXWEIGHT. dim(FIXINDEX) = dim(FIXWEIGHT). Allows flexible selection of indices. Supports single indices, ranges (e.g., start-stop), lists, or intervals (start-stop:step). Indices are 1-based. Repeated indices are removed automatically.

Example 1:

1,3,5

Example 2:

1-4,7,9

Example 3:

1-10:2
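
The index patterns above can be expanded as in the following Python sketch. The function name expand_index_pattern is hypothetical, and two details are assumptions rather than documented behaviour: the stop value is treated as inclusive, and indices are returned in order of first appearance.

```python
def expand_index_pattern(pattern):
    """Expand e.g. '1-4,7,9' or '1-10:2' into a list of 1-based indices."""
    indices = []
    for token in pattern.split(","):
        span, _, step = token.partition(":")
        start, _, stop = span.partition("-")
        first = int(start)
        last = int(stop) if stop else first   # assumption: stop is inclusive
        stride = int(step) if step else 1
        if first < 1:
            raise ValueError("indices are 1-based")
        for i in range(first, last + 1, stride):
            if i not in indices:              # repeated indices are dropped
                indices.append(i)
    return indices
```

Under these assumptions, the three examples expand to 1 3 5, then 1 2 3 4 7 9, then 1 3 5 7 9.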

FIXWEIGHT

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Values for weights to be fixed in optimization. Must be used with FIXINDEX. dim(FIXINDEX) = dim(FIXWEIGHT). The i-th value in FIXWEIGHT corresponds to the i-th index in FIXINDEX.

Example 1:

0.12 1.0 2.00
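
The pairing rule dim(FIXINDEX) = dim(FIXWEIGHT) can be sketched in Python as follows. The helper pair_fixed_weights is hypothetical and expects the indices to be already expanded into a list of integers.

```python
def pair_fixed_weights(indices, values):
    """Map each 1-based weight index to the value it is fixed at."""
    if len(indices) != len(values):
        raise ValueError("FIXINDEX and FIXWEIGHT must have the same length")
    return dict(zip(indices, values))


# FIXINDEX 1,3,5 combined with FIXWEIGHT 0.12 1.0 2.00
fixed = pair_fixed_weights([1, 3, 5], [0.12, 1.0, 2.00])
```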

FWEIGHT

[DOUBLE] <double>

Max number of values: 1 Default: 1.0

Description:

Global force scaling factor. Each force component is scaled by 1/(#atoms)/3. Additional config-level scaling factors can apply. Combined factor = FWEIGHT*(config fweight)/(#atoms)/3.

Example 1:

FWEIGHT 1e-2

HEALTH_LOG

[STRING] <string>

Max number of values: 1 Default: summary

Description:

HPO per-iteration health monitor verbosity. ‘summary’ (default) rate-limits warnings to about 1 per 100 evaluations per (block, kind) pair. ‘full’ emits a warning for every offending iteration. ‘off’ disables the monitor.

Example 1:

summary

Example 2:

full

Example 3:

off

INIT2B

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

If set to true, the two-body descriptor will be calculated.

Example 1:

INIT2B true

INITMB

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

If set to true, the many-body descriptor will be calculated.

Example 1:

INITMB true

LAMBDA

[INT] <int>, [DOUBLE] <double>, [INT DOUBLE] <int> <double>

Max number of values: 2 Default: 0

Description:

Controls the regularization parameter λ for BLR and KRR. Let N denote the first value. If N=0, no regularization is applied. If N>0, λ is set to N. If N<0, the evidence approximation is used to estimate λ. With LAMBDA 0, an optional second value (double) sets the effective rank threshold (default 1e-8).

Example 1:

LAMBDA -1

Example 2:

LAMBDA 1e-4

Example 3:

LAMBDA 0

Example 4:

LAMBDA 0 1e-12
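
The three cases described above can be summarised in a small Python sketch. The function interpret_lambda and its return format are hypothetical and serve only to make the branching explicit.

```python
def interpret_lambda(values):
    """Describe how a list of LAMBDA values (strings) is interpreted."""
    n = float(values[0])
    if n < 0:
        return {"mode": "evidence approximation"}
    if n > 0:
        return {"mode": "fixed", "lambda": n}
    # LAMBDA 0: an optional second value sets the effective rank threshold.
    threshold = float(values[1]) if len(values) > 1 else 1e-8
    return {"mode": "none", "rank_threshold": threshold}
```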

MBLOCK

[UINT] <unsigned integer>

Max number of values: 1 Default: 64

Description:

ScaLAPACK row block size MB.

Example 1:

20

MODEL

[STRING STRING] MODEL FUNCTION, [STRING STRING STRING] MODEL FUNCTION OPTION, [STRING STRING UINT] MODEL FUNCTION OPTION

Max number of values: 3

Description:

Defines the training model and function. MODEL can be any class inheriting from M_Base (e.g., M_KRR, M_BLR). FUNCTION must be a valid child class of Function_Base (e.g., Kern_Linear, BF_Linear, BF_Polynomial2). Various combinations (KRR with different kernels, BLR with various basis functions) are possible.

Example 1:

MODEL M_BLR BF_Linear

Example 2:

MODEL M_BLR BF_Polynomial2

Example 3:

MODEL M_KRR Kern_Linear

MPARAMS

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Specifies additional numeric parameters for certain models. Some models require extra parameters. Refer to the model-specific documentation for details. Many models do not need any extra parameters.

Example 1:

MPARAMS 0.1

Example 2:

MPARAMS 0.1 0.2 0.3

MPIWPCKG

[UINT] <unsigned integer>

Max number of values: 1 Default: 50

Description:

The number of structures in a single MPI work package.

Example 1:

20

NBLOCK

[UINT] <unsigned integer>

Max number of values: 1 Default: 64

Description:

ScaLAPACK column block size NB.

Example 1:

20

NMEAN

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Mean values from descriptor normalization. Obtained after standardizing the columns of the DesignMatrix (see NORM).

Example 1:

2.0 -4.65 0.4

Example 2:

-1.0

NORM

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Standardize descriptors. Set to true to standardize descriptors, typically relevant if energies are used for fitting.

Example 1:

true

Example 2:

false

NSTDEV

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Standard deviations from descriptor normalization. Obtained after standardizing the columns of the DesignMatrix (see NORM). The vector size equals the number of columns.

Example 1:

2.0 -4.65 0.4

Example 2:

-1.0

OALGO

[INT] <option>, [INT DOUBLE] <option> <value>

Max number of values: 2 Default: 1

Description:

This key controls the optimization algorithm used to train the model. The default algorithm is Option 1 (GELSD) with a conditioning parameter (\(\text{rcond}\)) of \(1 \times 10^{-8}\). The regularization parameter is handled separately by the LAMBDA key.

Available options:

1 - GELSD: Utilizes the LAPACK routine DGELSD to solve linear least squares problems using the Singular Value Decomposition (SVD). This method is robust and can handle rank-deficient systems. It computes the minimum-norm solution and is suitable for ill-conditioned problems. The parameter M sets the reciprocal of the condition number (\(\text{rcond}\)), which acts as the cutoff for small singular values. It defaults to \(1 \times 10^{-8}\) if not specified. Setting M = -1 uses machine precision, which may be unstable for ill-conditioned problems.

2 - GELS: Uses the LAPACK routine DGELS to solve linear least squares problems via QR or LQ factorization. This method assumes that the matrix has full rank and is the fastest among the options. It is suitable for well-conditioned problems but less robust for ill-conditioned or rank-deficient matrices.

3 - Custom Implementation Similar to DGELS (Uses SVD): Employs a custom algorithm similar to DGELS but utilizes SVD. This method allows the use of the evidence approximation algorithm and computes the covariance matrix \(\Sigma\), providing additional statistical information about the solution. Like option 1, the parameter M controls the reciprocal of the condition number (\(\text{rcond}\)), defaulting to \(1 \times 10^{-8}\). Setting M = -1 uses machine precision, which might be unstable for ill-conditioned problems.

4 - Cholesky Decomposition: Solves the normal equations \(A^\top A x = A^\top b\) using Cholesky decomposition. This method is efficient for well-conditioned, full-rank matrices but may be less stable for ill-conditioned or rank-deficient problems because it squares the condition number of the matrix. It does not require additional parameters.
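
The squaring of the condition number mentioned for option 4 follows from the singular value decomposition. As a short worked derivation (standard linear algebra, not specific to Tadah!MLIP), assume \(A\) has full column rank and write \(A = U S V^\top\), where \(S\) is the diagonal matrix of singular values \(\sigma_{\max} \ge \dots \ge \sigma_{\min} > 0\) and \(U^\top U = I\):

```latex
\[
A^\top A = V S^\top U^\top U S V^\top = V \,\mathrm{diag}(\sigma_i^2)\, V^\top,
\qquad
\kappa_2(A^\top A) = \frac{\sigma_{\max}^2}{\sigma_{\min}^2} = \kappa_2(A)^2 .
\]
```

For example, \(\kappa_2(A) = 10^{5}\) gives \(\kappa_2(A^\top A) = 10^{10}\), which is why the normal equations lose accuracy sooner than the SVD-based options.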

Performance Comparison:

  • Option 2 is the fastest but assumes a full-rank matrix and is less robust for ill-conditioned problems.

  • Option 4 is also efficient but may suffer from numerical instability in ill-conditioned or rank-deficient problems due to squaring of the condition number.

  • Option 1 offers a balance between speed and robustness, handling rank-deficient and ill-conditioned problems better than options 2 and 4.

  • Option 3 is the slowest due to the additional computations but provides valuable extra information like the covariance matrix.

Usage:

  • To select an algorithm, set the OALGO key followed by the option number.

  • For options 1 and 3, you can specify the conditioning parameter M after the option number.

  • Option 4 does not require any additional parameters.

Regularization Parameter:

  • The regularization parameter is controlled by the LAMBDA key, which should be set separately to apply regularization to the model.

Examples:

  • OALGO 1 # Uses GELSD with default conditioning parameter (M = 1e-8)

  • OALGO 1 -1 # Uses GELSD with machine precision (M = -1)

  • OALGO 2 # Uses GELS

  • OALGO 3 1e-10 # Uses custom implementation with M = 1e-10

  • OALGO 4 # Uses Cholesky Decomposition

Notes:

  • Conditioning Parameter M: Controls the reciprocal of the condition number (\(\text{rcond}\)). A smaller M (closer to machine precision) includes more singular values in the solution, which may be necessary for certain problems but can introduce instability if the matrix is ill-conditioned.

  • Machine Precision (M = -1): Using machine precision can lead to numerical instability in ill-conditioned problems. It is recommended to use a positive M value to exclude negligible singular values that could adversely affect the solution.

  • Covariance Matrix \(\Sigma\): Option 3 provides the covariance matrix, which can be useful for statistical analyses and understanding the uncertainty in the estimated parameters.

  • Cholesky Decomposition: Option 4 solves the normal equations using Cholesky decomposition. It is efficient but may be numerically unstable for ill-conditioned or rank-deficient problems due to squaring the condition number. It is best used when the matrix is well-conditioned and of full rank.
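
The effect of the conditioning parameter M can be illustrated with a Python sketch of the cutoff rule documented for LAPACK DGELSD: singular values not larger than rcond times the largest singular value are treated as zero, and a negative rcond selects machine precision. The function effective_rank is hypothetical. Whether option 3 applies exactly the same rule is an assumption.

```python
import sys


def effective_rank(singular_values, rcond=1e-8):
    """Count singular values kept when those <= rcond * largest are dropped."""
    if rcond < 0:
        # Negative rcond: fall back to machine precision, as DGELSD does.
        rcond = sys.float_info.epsilon
    cutoff = rcond * max(singular_values)
    return sum(1 for s in singular_values if s > cutoff)
```

With singular values 10, 1 and 1e-9, the default M = 1e-8 keeps two of them, while M = -1 keeps all three, including the one most likely to amplify noise.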

Additional Information:

  • Regularization with LAMBDA:

    • The LAMBDA key controls the regularization parameter used in the training process. It should be specified separately to apply regularization to the model.

    • Regularization helps prevent overfitting by adding a penalty for larger parameter values.

  • LAPACK Routines:

    • DGELSD: Computes the minimum-norm solution to a real linear least squares problem using the SVD of the coefficient matrix.

    • DGELS: Solves overdetermined or underdetermined real linear systems involving an \(M \times N\) matrix, using a QR or LQ factorization of the matrix.

    • DPOTRF and DPOTRS: Used in the Cholesky decomposition to factorize a symmetric positive-definite matrix and solve the resulting linear system.

  • When to Use Each Option:

    • Use Option 1 (GELSD) when you need robustness against rank deficiency and moderate performance.

    • Use Option 2 (GELS) for the fastest performance on well-conditioned, full-rank matrices where robustness is less of a concern.

    • Use Option 3 (Custom SVD Implementation) when you require additional outputs like the covariance matrix and are willing to trade off performance for more comprehensive results.

    • Use Option 4 (Cholesky Decomposition) when you have a well-conditioned, full-rank matrix and need an efficient solution, but be cautious of potential numerical instability in ill-conditioned or rank-deficient problems.

Example 1:

OALGO 1

Example 2:

OALGO 1 -1

Example 3:

OALGO 2

Example 4:

OALGO 3 1e-10

Example 5:

OALGO 4

RCTYPE2B

[{STRING}] Cut_<name>

Max number of values: 2147483647

Description:

Specifies the cutoff function type(s) for two-body descriptor(s). Provide a single cutoff type (e.g., Cut_Cos) for one descriptor or multiple types corresponding to each descriptor when using the D2_mJoin meta descriptor.

Example 1:

RCTYPE2B Cut_Cos

Example 2:

RCTYPE2B Cut_Cos Cut_Tanh

RCTYPEMB

[{STRING}] Cut_<name>

Max number of values: 2147483647

Description:

Specifies the cutoff function type(s) for many-body descriptor(s). Provide a single cutoff type (e.g., Cut_Cos) for one descriptor or a series of types—each for a corresponding descriptor when using the DM_mJoin meta descriptor.

Example 1:

RCTYPEMB Cut_Cos

Example 2:

RCTYPEMB Cut_Cos Cut_Tanh

RCUT2B

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Specifies the cutoff distance(s) for two-body descriptor(s). Provide a single value for one descriptor or multiple values—one for each descriptor when using the D2_mJoin meta descriptor.

Example 1:

RCUT2B 3.0

Example 2:

RCUT2B 3.0 7.5

RCUTENV

[DOUBLE] <double>

Max number of values: 1

Description:

Envelope cutoff distance used by F_mEnv (DM_mBlipEnv). Width of the inverse-cutoff envelope (Cut_SinInv) wrapped around the F_Blip embedding by the DM_mBlipEnv descriptor: the envelope ramps from 0 at rho=0 to 1 at rho=RCUTENV, suppressing the embedding contribution where the underlying Gaussian basis is largest.

Example 1:

RCUTENV 1.5

RCUTMB

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Specifies the cutoff distance(s) for many-body descriptor(s). Provide a single value for a standalone descriptor or multiple values corresponding to each descriptor when using the DM_mJoin meta descriptor.

Example 1:

RCUTMB 4.9

Example 2:

RCUTMB 4.9 8.0

SBASIS

[UINT] <unsigned integer>

Max number of values: 1

Description:

Number of basis functions for the DesignMatrix. Many models do not require this. If specified, it sets the number of basis functions used in the design matrix.

Example 1:

10

Example 2:

102

SEMBFUNC

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Shape parameters for an embedding function. Used by certain many-body descriptors (e.g., F_RLR). When using the DM_mJoin descriptor, provide lists of parameters corresponding to each descriptor, ensuring consistency with CEMBFUNC.

Example 1:

SEMBFUNC 0.14 0.45 1.00 1.1

SETFL

[STRING] <string>

Max number of values: 1

Description:

Path to the setfl file with the EAM potential.

Example 1:

Ta1_Ravelo_2013.eam.alloy

SGRID2B

[{DOUBLE}] <double> [<double> …], [{STRING INT DOUBLE DOUBLE}] (<algorithm> <n> <start> <stop>) […]

Max number of values: 2147483647

Description:

Specifies the width parameters for two-body radial basis functions. These widths can be supplied manually or auto-generated. When using the meta descriptor D2_mJoin, provide one or more lists of width values for each concatenated descriptor. The number of widths must match the number of centers (CGRID2B). Alternatively, specify the algorithm keyword with parameters to generate widths automatically (e.g., LOG or LIN).

Example 1:

SGRID2B LIN 3 0 1.0

Example 2:

SGRID2B GEOM 6 0.1 10

Example 3:

SGRID2B   0.01 0.02 0.03
SGRID2B   GEOM 6 0.1 10

SGRIDMB

[{DOUBLE}] <width> …, [{STRING INT DOUBLE DOUBLE}] <algorithm> <N> <START> <STOP>

Max number of values: 2147483647

Description:

Specifies the width parameters for many-body radial basis functions. Values may be provided manually or generated automatically. When using the DM_mJoin meta descriptor, provide one or more lists of widths for each descriptor. Ensure consistency with the centers defined in CGRIDMB. Alternatively, use an algorithm (e.g., LOG or LIN) and its parameters to generate widths automatically.

Example 1:

SGRIDMB LIN 3 0 1.0

Example 2:

SGRIDMB 0.01 0.02 0.03

Example 3:

SGRIDMB   0.01 0.02 0.03
SGRIDMB   0.02 0.03 0.04

SIGMA

[INT {DOUBLE}] <integer> <double> …

Max number of values: 2147483647

Description:

The Σ matrix used in Bayesian Linear Regression. Applies to M_BLR. The first value is the matrix dimension N, followed by the entries of the N×N matrix in column-major order.

Example 1:

2 1.2 2.2 2.3 3.3
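
As an illustration of the column-major layout, the following Python sketch turns the example values into rows. The function parse_sigma is hypothetical, and reading the leading integer as the matrix dimension N is an assumption based on the [INT {DOUBLE}] signature and the example above.

```python
def parse_sigma(values):
    """Turn 'N a11 a21 ... aNN' (column-major) into a list of rows."""
    n = int(values[0])
    entries = [float(v) for v in values[1:]]
    if len(entries) != n * n:
        raise ValueError(f"expected {n * n} matrix entries, got {len(entries)}")
    # Column-major: entry (row, col) is stored at position col * n + row.
    return [[entries[col * n + row] for col in range(n)] for row in range(n)]


sigma = parse_sigma("2 1.2 2.2 2.3 3.3".split())
```

Under this reading the example describes the matrix with rows (1.2, 2.3) and (2.2, 3.3).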

SWEIGHT

[DOUBLE] <double>

Max number of values: 1 Default: 1.0

Description:

Global stress scaling factor. Each stress component is scaled by 1/6. Additional config-level scaling can apply. Combined factor = SWEIGHT*(config sweight)/6.

Example 1:

SWEIGHT 1e-1
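
The combined scaling factors stated in the EWEIGHT, FWEIGHT and SWEIGHT entries can be written out as a Python sketch. The function names are hypothetical, and the formulas are copied from the descriptions in this reference.

```python
def energy_factor(eweight, config_eweight, n_atoms):
    """Combined energy factor: EWEIGHT * (config eweight) / (#atoms)."""
    return eweight * config_eweight / n_atoms


def force_factor(fweight, config_fweight, n_atoms):
    """Combined force factor: FWEIGHT * (config fweight) / (#atoms) / 3."""
    return fweight * config_fweight / n_atoms / 3


def stress_factor(sweight, config_sweight):
    """Combined stress factor: SWEIGHT * (config sweight) / 6."""
    return sweight * config_sweight / 6
```

For a four-atom configuration with all weights left at 1.0, the energy factor is 0.25, while the stress factor is 1/6 regardless of the number of atoms.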

TYPE2B

[STRING {[INT | DOUBLE]} {STRING STRING}] D2_<name> {[param]} {ELEMENT ELEMENT}, [STRING {STRING STRING}], [STRING]

Max number of values: 2147483647

Description:

Specifies the two-body descriptor type(s) to be used. For a single descriptor, provide its type (e.g., D2_LJ). To concatenate multiple descriptors, use the meta descriptor D2_mJoin followed by the individual descriptor parameters. Elements should be provided in pairs.

Example 1:

TYPE2B D2_LJ Kr Kr

Example 2:

TYPE2B    D2_mJoin
  TYPE2B    D2_MIE 11 6 Ti Ti
  TYPE2B    D2_Blip 6 6 Ti Nb Nb Nb

TYPEMB

[STRING UINT UINT {STRING STRING}] DM_<name> {[param]} {ELEMENT ELEMENT}, [STRING UINT UINT UINT {STRING STRING}], [STRING UINT UINT UINT UINT UINT {STRING STRING}], [STRING]

Max number of values: 2147483647

Description:

Specifies the many-body descriptor type(s) to be used. For a single descriptor, provide its type (e.g., DM_EAD). To combine multiple descriptors, use the meta descriptor DM_mJoin followed by the individual descriptor parameters. Elements should be provided in pairs.

Example 1:

TYPEMB DM_Blip 0 6 6 Ti Ti

Example 2:

TYPEMB    DM_mJoin
  TYPEMB    DM_Blip 1 6 6 Ti Ti
  TYPEMB    DM_Blip 0 6 6 Ti Nb Nb Nb

WATOM

[{DOUBLE}] <double> [<double> …]

Max number of values: 118

Description:

Weights sorted by atomic number, from lowest Z to highest. The number of WATOM values must match the number of ATOM values.

Example 1:

2.0 -4.65 0.4

Example 2:

-1.0

WEIGHTS

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Machine-learned coefficients for the model. These are species-dependent weights, obtained during optimization. Defaults to atomic numbers if unspecified.

Example 1:

WEIGHTS 0.12 1.2 0.3

analytics

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Perform analytics.

Example 1:

true

Example 2:

false

append

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Append to the existing file.

Example 1:

true

Example 2:

false

atompair

[{STRING}] <element1> <element2>

Max number of values: 2

Description:

Pair of chemical elements.

Example 1:

"Kr Kr"

bias

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Per-species intercept. When true, adds N=size(ATOM) one-hot intercept columns to the design matrix; the regression learns one constant energy per species and LAMMPS adds it back per atom. Required when NORM=true with MODEL=BF_Linear.

Example 1:

true

Example 2:

false

bondenergy

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Calculate the bond energy instead of the per-atom value.

Example 1:

true

Example 2:

false

chunk

[{UINT}] <unsigned integer> [<unsigned integer> …]

Max number of values: 2147483647

Description:

Specify chunk sizes.

Example 1:

20 5 3

Example 2:

10

config

[STRING] <file>

Max number of values: 1

Description:

Path to a configuration file.

Example 1:

config.tadah

Example 2:

../config.tadah

Example 3:

/path/to/config.tadah

dbfile

[{STRING}] <string> [<string> …]

Max number of values: 2147483647

Description:

Absolute or relative path(s) to the Tadah! database file(s). Relative paths are interpreted relative to the current working directory. Multiple dataset paths can be provided either as space-separated tokens or by repeating this key.

Example 1:

dbfile /path/to/dbfile

Example 2:

dbfile /path/to/dbfile1 /path/to/dbfile2

derivative

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Calculate the derivative of the function.

Example 1:

true

Example 2:

false

dft-file

[{STRING}] <string> [<string> …]

Max number of values: 2147483647

Description:

Input DFT file(s). A single file or multiple files (space-separated). Used to extract reference data for training. Supported formats: VASP (OUTCAR, vasprun.xml), CASTEP (.castep, .md, .geom).

Example 1:

run1.outcar

Example 2:

run1.outcar run2.outcar

efilter

[DOUBLE DOUBLE] <E_min_per_atom> <E_max_per_atom>

Max number of values: 2 / 2

Description:

Drop configurations whose per-atom energy is outside [E_min, E_max] (eV). Outlier filter applied at load time before any energy-shift derivation or training-weight assignment, so outliers do not poison ESHIFT_ATOM / ESHIFT_DBATOM / EWEIGHT_TEMP. The threshold is compared against E/N_atoms (per-atom energy). Both bounds must be supplied. To disable, omit the key.

Example 1:

-12.0 -2.0
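
The filter criterion can be sketched in Python as follows. The function passes_efilter is hypothetical, and treating both bounds as inclusive is an assumption based on the closed-interval notation [E_min, E_max].

```python
def passes_efilter(total_energy, n_atoms, e_min, e_max):
    """Keep a configuration only if E/N_atoms lies inside [e_min, e_max]."""
    return e_min <= total_energy / n_atoms <= e_max
```

With the example bounds, a four-atom configuration with a total energy of -16 eV (-4 eV per atom) is kept, while one with -4 eV (-1 eV per atom) is dropped.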

error

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Generate error estimates.

Example 1:

true

Example 2:

false

eshift

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Per-element reference energies to subtract from each configuration. If there are multiple species, the number of values must match the number of species (sorted by Z). At load time the total energy of each configuration is reduced by sum_Z N_Z * ESHIFT[Z], so an isolated-atom config with energy E_atom and ESHIFT[Z]=E_atom yields a post-shift energy of zero. Used by tadah train, tadah predict, tadah hpo, and tadah data balance. Persisted into pot.tadah for prediction round-trip.

Example 1:

0.5

Example 2:

0.5 -0.1
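
The shift formula above, E minus sum_Z N_Z * ESHIFT[Z], can be written as a Python sketch. The function shifted_energy and the dictionary layout keyed by atomic number are hypothetical.

```python
def shifted_energy(total_energy, species_counts, eshift):
    """Subtract sum_Z N_Z * ESHIFT[Z] from a configuration's total energy.

    species_counts maps atomic number Z to the atom count N_Z,
    eshift maps atomic number Z to its reference energy.
    """
    return total_energy - sum(n * eshift[z] for z, n in species_counts.items())
```

An isolated Ti atom (Z = 22) with energy -1.5 eV and ESHIFT[22] = -1.5 ends up at zero, as described above.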

eshift_atom

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Derive ESHIFT from isolated-atom configurations in the dataset (mean per Z). Scans the loaded dataset for single-atom configurations (natoms == 1), groups them by atomic number, and sets ESHIFT[Z] to the mean per-Z energy. If a species has no isolated-atom config in the dataset, ESHIFT[Z] = 0 for that species and a WARNING is logged. If multiple isolated-atom configs of the same Z disagree by more than 1e-3 eV, an INFO line records the spread. Mutually exclusive with explicit ESHIFT and ESHIFT_DBATOM.

Example 1:

true
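
The derivation described above can be sketched in Python. The function eshift_from_isolated_atoms and its input layout (a list of atomic-number lists paired with total energies) are hypothetical, and the warning and spread logging are left out.

```python
def eshift_from_isolated_atoms(configs, species):
    """Return {Z: mean energy of single-atom configs of Z}, or 0.0 if none.

    configs is a list of (atomic_numbers, total_energy) pairs.
    """
    by_z = {z: [] for z in species}
    for atomic_numbers, energy in configs:
        # Only single-atom configurations contribute.
        if len(atomic_numbers) == 1 and atomic_numbers[0] in by_z:
            by_z[atomic_numbers[0]].append(energy)
    return {z: (sum(e) / len(e) if e else 0.0) for z, e in by_z.items()}
```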

eshift_dbatom

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Derive ESHIFT by least-squares atomic-energy fit over the database. Fits per-element reference energies by least squares: minimise ||y - M beta||^2 where y[i] is the total energy of configuration i and M[i, k] is the count of species k in configuration i. The fitted beta_k becomes ESHIFT[Z(k)]. More robust than ESHIFT_ATOM when the dataset has no isolated-atom configs but does have compositional diversity. Mutually exclusive with explicit ESHIFT and ESHIFT_ATOM.

Example 1:

true
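
The least-squares problem above can be illustrated with a small pure-Python sketch. It solves the normal equations M^T M beta = M^T y by Gauss-Jordan elimination, which is a simple stand-in chosen for brevity and not necessarily the solver Tadah!MLIP uses. The function name fit_atomic_energies is hypothetical.

```python
def fit_atomic_energies(counts, energies):
    """Solve min ||y - M beta||^2 through the normal equations.

    counts[i][k] is the number of atoms of species k in configuration i,
    energies[i] is the total energy of configuration i.
    """
    n = len(counts[0])
    # Augmented system [M^T M | M^T y].
    aug = [[sum(row[a] * row[b] for row in counts) for b in range(n)]
           + [sum(row[a] * y for row, y in zip(counts, energies))]
           for a in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[k][n] / aug[k][k] for k in range(n)]


# Two species, three configurations: A2, B2 and AB.
beta = fit_atomic_energies([[2, 0], [0, 2], [1, 1]], [-2.0, -4.0, -3.0])
```

The sketch fails with a division by zero when the compositions do not determine every species, which mirrors the requirement for compositional diversity noted above.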

even

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Equal-size partition.

Example 1:

true

Example 2:

false

eweight_temp

[DOUBLE] <double>

Max number of values: 1

Description:

Boltzmann reweighting temperature in Kelvin (multiplies eweight). After ESHIFT is applied, multiplies each configuration’s eweight by exp(-(E/N - E_min)/(kB * T)) where E_min is the minimum per-atom energy in the dataset and kB = 8.617333262e-5 eV/K. Emphasises low-energy configurations. Composes multiplicatively with the per-structure eweight already in the dataset file. Omit the key to disable.

Example 1:

300

Example 2:

1000
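
The reweighting factor exp(-(E/N - E_min)/(kB * T)) can be evaluated as in the following Python sketch. The function boltzmann_factors is hypothetical. The constant is the kB value quoted in the description.

```python
import math

KB_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K, as quoted above


def boltzmann_factors(per_atom_energies, temperature):
    """Return exp(-(E/N - E_min) / (kB * T)) for each per-atom energy E/N."""
    e_min = min(per_atom_energies)
    return [math.exp(-(e - e_min) / (KB_EV_PER_K * temperature))
            for e in per_atom_energies]
```

The lowest-energy configuration always receives a factor of 1. At 300 K, a configuration 0.1 eV per atom above the minimum is weighted down to roughly 0.02.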

ffilter

[DOUBLE] <double>

Max number of values: 1

Description:

Drop configurations where any atomic force magnitude exceeds this value (eV/Å). Outlier filter applied at load time. A configuration is dropped if any single atom has ‖F‖ > FFILTER. Useful for catching unconverged SCF or otherwise broken DFT runs.

Example 1:

20.0
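
The drop criterion can be sketched in Python as follows. The function passes_ffilter and the input layout (one (fx, fy, fz) tuple per atom, in eV/Å) are hypothetical.

```python
import math


def passes_ffilter(forces, ffilter):
    """Keep a configuration only if no atom has |F| greater than ffilter."""
    return all(math.sqrt(fx * fx + fy * fy + fz * fz) <= ffilter
               for fx, fy, fz in forces)
```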

force

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Include forces.

Example 1:

true

Example 2:

false

format

[STRING] <fmt>

Max number of values: 1

Description:

Output format (e.g., vasp, castep, lammps).

Example 1:

castep

Example 2:

lammps

Example 3:

vasp

hpotarget

[STRING] <file>

Max number of values: 1

Description:

HPO target file.

Example 1:

hpotargets.txt

index

[{INDEX_PATTERN}] <index>[,<index>…], [**] <start>-<stop>, [**] <start>-<stop>:<step>

Max number of values: 2147483647

Description:

Index pattern. Allows flexible selection of dataset indices. Supports single indices, ranges (e.g., start-stop), lists, or intervals (start-stop:step). Indices are 1-based. Repeated indices are removed automatically.

Example 1:

1,3,5

Example 2:

1-4,7,9

Example 3:

1-10:2

lscale

[DOUBLE] <double>

Max number of values: 1 Default: 1.0

Description:

Uniform length rescale factor applied to atomic positions, cell, and reference forces at load time. Multiplies atomic positions and cell vectors by this factor at the moment a dataset is loaded for training, prediction, or HPO. Reference forces are divided by the factor (chain rule on E(r)); stresses (stored as virial in energy units) are invariant under uniform length rescaling. The chosen factor is persisted into pot.tadah so future tadah predict and tadah hpo runs apply the same transformation. Use --no-lscale at predict time to override.

LSCALE is a training-side concept: the LAMMPS pair_style does NOT re-apply LSCALE. The user is expected to provide LAMMPS positions at the scale that matches the trained model (e.g. experimental lattice).

Example 1:

1.0030
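
The transformation described above (lengths multiplied by the factor, reference forces divided by it) can be sketched in Python. The function apply_lscale and the nested-list layout are hypothetical. Stresses are left out because they are described as invariant.

```python
def _scale(vectors, factor):
    """Multiply every component of every vector by factor."""
    return [[x * factor for x in vector] for vector in vectors]


def apply_lscale(positions, cell, forces, lscale):
    """Scale positions and cell by lscale and divide forces by lscale."""
    return (_scale(positions, lscale),
            _scale(cell, lscale),
            _scale(forces, 1.0 / lscale))
```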

merge

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Merge deduplication results into one file.

Example 1:

true

Example 2:

false

no_eshift

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

(predict) Ignore any ESHIFT recorded in the loaded potential file. At predict time, override the ESHIFT values stored in pot.tadah. Use when the dataset you are predicting on is already at the shifted baseline (or you just want raw model output without any reference energy subtraction).

Example 1:

true

no_lscale

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

(predict) Ignore any LSCALE recorded in the loaded potential file. At predict time, override the LSCALE value stored in pot.tadah. Use when the dataset you are predicting on is already at the trained-model scale.

Example 1:

true

numeric

[UINT] <unsigned integer>

Max number of values: 1 Default: 12

Description:

Numeric output precision. Sets the number of decimal places for output.

Example 1:

12

option

[STRING] <arg>

Max number of values: 1

Description:

Positional argument.

Example 1:

arg

outfile

[{STRING}] <string> [<string> …]

Max number of values: 2147483647

Description:

Output file. The output file to be written. Multiple files can be specified if the command produces more than one output.

Example 1:

output.tadah

percent

[{UINT}] <unsigned integer> [<unsigned integer> …]

Max number of values: 2147483647

Description:

Specify percentage partition.

Example 1:

20 5 3

Example 2:

10

potential

[STRING] <file>

Max number of values: 1

Description:

Trained model file.

Example 1:

pot.tadah

quantity

[STRING] <string>

Max number of values: 1

Description:

Generic quantity.

Example 1:

validString

Example 2:

/path/to/file

random

[UINT] <unsigned integer>

Max number of values: 1

Description:

Randomly sample N entries.

Example 1:

5

range

[DOUBLE DOUBLE INT] <START> <STOP> <NPOINTS>

Max number of values: 3 / 3

Description:

Plotting range [start stop npoints].

Example 1:

0.1 9.5 100
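As a rough illustration, the three values can be read as an evenly spaced grid. The sketch below assumes that both START and STOP are included in the grid; the reference above does not state this, so treat it as an assumption. The function name make_range is hypothetical.

```python
def make_range(start, stop, npoints):
    """Return npoints evenly spaced values from start to stop.

    Assumption: both endpoints are included. Tadah!MLIP may build its
    plotting grid differently.
    """
    if npoints < 2:
        return [start]
    step = (stop - start) / (npoints - 1)
    return [start + i * step for i in range(npoints)]


grid = make_range(0.1, 9.5, 100)
```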

rescale

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Enable rescaling of training weights.

Example 1:

true

Example 2:

false

shuffle

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Randomize entries before splitting.

Example 1:

true

Example 2:

false

stress

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Include stresses.

Example 1:

true

Example 2:

false

structure

[{STRING}] <string> [<string> …]

Max number of values: 2147483647

Description:

Unified structural input(s). Supported file formats: .cif (Crystallographic Information File), VASP (POSCAR/CONTCAR), and CASTEP (.cell). The online option fetches structures from databases (MP, COD, NOMAD). Multiple structures can be space-separated or repeated. A mix of files and online sources is allowed.

Example 1:

crystal.cif

Example 2:

crystal1.cif crystal2.cell

Example 3:

mp-42 crystal.cif

task

Description:

A file containing task(s) to be executed. The task file is a convenient way to specify multiple tasks without having to provide all the command-line arguments for each task. It uses the same format as the configuration file, with additional lines that name each task and carry its parameters. A task begins with the keyword 'TASK' followed by the task name. The task name is the command to execute, optionally followed by a subcommand. The lines following the TASK keyword contain the parameters required for that task. For example, the CLI option --verbose 2 is written as 'VERBOSE 2' in the task file.

# Example TASK file containing two tasks:
# Global options
NUMERIC 14   # output precision
VERBOSE 2    # verbosity level

TASK predict
DBFILE db1.tadah db2.tadah db3.tadah
DBFILE db4.tadah db5.tadah db6.tadah
FORCE true
ANALYTICS true

TASK data print
STRUCTURE crystal1.cif crystal2.cif

Example 1:

path/to/tasks.tadah
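To make the layout of a task file concrete, the following Python sketch splits such a file into global options and per-task options. It is not the parser used by Tadah!MLIP; the function name parse_task_file and the returned data structure are hypothetical, and the sketch only assumes what the description states: one KEY/VALUE pair per line, # comments, and a TASK line opening each task.

```python
def parse_task_file(text):
    """Split task-file text into (global_options, tasks).

    Illustrative only: each option is stored as (KEY, [values]).
    Lines before the first TASK line are treated as global options.
    """
    global_options = []
    tasks = []
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop inline and full-line comments
        if not line:
            continue
        key, *values = line.split()
        if key.upper() == "TASK":
            # The task name is a command, optionally with a subcommand.
            current = {"command": values, "options": []}
            tasks.append(current)
        elif current is None:
            global_options.append((key.upper(), values))
        else:
            current["options"].append((key.upper(), values))
    return global_options, tasks


example = """\
# Global options
NUMERIC 14   # output precision
VERBOSE 2    # verbosity level

TASK predict
DBFILE db1.tadah db2.tadah db3.tadah
DBFILE db4.tadah db5.tadah db6.tadah
FORCE true
ANALYTICS true

TASK data print
STRUCTURE crystal1.cif crystal2.cif
"""

global_options, tasks = parse_task_file(example)
```

Here the two global options apply before any task starts, the first task carries four option lines (DBFILE appears twice), and the second task name consists of a command and a subcommand.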

threshold

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Floating point comparison threshold.

Example 1:

1e-4

type

[{STRING}] <string> [<string> …]

Max number of values: 2147483647

Description:

Generic types.

Example 1:

string1 string2

Example 2:

string1 /path/to/file

uncertainty

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Output uncertainty estimates.

Example 1:

true

Example 2:

false

uniform

[UINT] <unsigned integer>

Max number of values: 1

Description:

Sample uniformly every N-th entry.

Example 1:

10
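Taking every N-th entry corresponds to a strided selection. The Python sketch below assumes that sampling starts at the first entry; the reference does not say which entry is taken first, so this is an assumption, and the function name sample_uniform is hypothetical.

```python
def sample_uniform(entries, n):
    """Keep every n-th entry, starting from the first one (assumed)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return entries[::n]


entries = list(range(1, 26))           # 25 entries labelled 1..25
picked = sample_uniform(entries, 10)   # entries 1, 11 and 21
```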

validation

[{STRING}] <string> [<string> …]

Max number of values: 2147483647

Description:

Validation dataset file(s).

Example 1:

valid.tadah

verbose

[UINT] <unsigned integer>

Max number of values: 1 Default: 1

Description:

Verbosity level. 0: ERROR, 1: WARNING, 2: INFO. The verbosity level controls the amount of information printed during execution. Higher levels provide more detailed output.

Example 1:

2

wdbfile

[{DOUBLE}] <double> [<double> …]

Max number of values: 2147483647

Description:

Per-dataset weight multipliers, one per DBFILE entry. Multiplies eweight, fweight, and sweight of every configuration in the corresponding DBFILE by the given factor. Use to bias training toward or away from particular datasets. Composes multiplicatively with WDBFILE_AUTO.

Example 1:

1.0 0.5 0.1
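The effect of the multipliers can be illustrated with a short Python sketch. It is not Tadah!MLIP code: the function name apply_wdbfile and the dictionary layout of a configuration are assumptions; only the weight names eweight, fweight, and sweight come from the description above.

```python
def apply_wdbfile(datasets, multipliers):
    """Scale the weights of every configuration in each dataset.

    datasets: one list of configurations per DBFILE entry; each
    configuration is a dict with eweight, fweight and sweight.
    multipliers: one factor per dataset, in DBFILE order.
    """
    if len(datasets) != len(multipliers):
        # Assumption for this sketch: exactly one multiplier per dataset.
        raise ValueError("need one multiplier per DBFILE entry")
    scaled = []
    for configs, factor in zip(datasets, multipliers):
        scaled.append([
            {name: cfg[name] * factor
             for name in ("eweight", "fweight", "sweight")}
            for cfg in configs
        ])
    return scaled


# Three toy datasets with identical starting weights.
base = {"eweight": 1.0, "fweight": 2.0, "sweight": 4.0}
datasets = [[dict(base), dict(base)], [dict(base)], [dict(base)]]

scaled = apply_wdbfile(datasets, [1.0, 0.5, 0.1])
```

With these multipliers the first dataset keeps its weights, the second counts half as much, and the third a tenth as much.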

wdbfile_auto

[DOUBLE] <double>

Max number of values: 1 Default: 0.0

Description:

Auto size-balance datasets: per-config weight multiplied by 1/N_i^alpha. Rebalances per-dataset contributions to the training loss by multiplying each configuration's weight by N_i^(-alpha), where N_i is the number of (post-filter) configurations in dataset i. alpha=0 disables (default). alpha=0.5 is the recommended starting point (sqrt-inverse, soft balance). alpha=1 fully equalises aggregate dataset contribution. Composes multiplicatively with user-given WDBFILE.

Example 1:

0.5

Example 2:

1.0
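The size-balancing factor N_i^(-alpha) is easy to check numerically. The Python sketch below is illustrative only; the function names and the dataset sizes are made up, and the multiplicative combination with WDBFILE follows the description above.

```python
def auto_balance_factors(counts, alpha):
    """Return the factor N_i^(-alpha) for each dataset size N_i."""
    return [n ** (-alpha) for n in counts]


def combined_factors(counts, alpha, wdbfile=None):
    """Combine the auto factors multiplicatively with WDBFILE values."""
    auto = auto_balance_factors(counts, alpha)
    if wdbfile is None:
        wdbfile = [1.0] * len(counts)
    return [w * a for w, a in zip(wdbfile, auto)]


counts = [100, 400, 10000]                 # hypothetical dataset sizes

off = auto_balance_factors(counts, 0.0)    # alpha=0: every factor is 1
soft = auto_balance_factors(counts, 0.5)   # 1/sqrt(N_i)
full = auto_balance_factors(counts, 1.0)   # 1/N_i

# Aggregate contribution of each dataset is N_i times its factor.
aggregate_soft = [n * f for n, f in zip(counts, soft)]
aggregate_full = [n * f for n, f in zip(counts, full)]

combined = combined_factors(counts, 0.5, wdbfile=[1.0, 0.5, 0.1])
```

With alpha=1 every dataset contributes the same aggregate weight regardless of its size, while alpha=0.5 narrows the gap between the smallest and largest dataset from a factor of 100 to a factor of 10.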

zero_com_force

[BOOL] <boolean>

Max number of values: 1 Default: false

Description:

Subtract per-config mean force so each configuration has zero net force. Per configuration, subtracts the mean force from each atom so that the sum of forces over the configuration is exactly zero. Standard DFT post-processing trick to remove residual translational forces from incomplete relaxation/SCF.

Example 1:

true
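The mean-force subtraction can be sketched in Python as follows. The function name zero_net_force and the list-of-lists force layout are assumptions for the example, not part of Tadah!MLIP.

```python
def zero_net_force(forces):
    """Subtract the per-configuration mean force from every atom.

    forces: list of [fx, fy, fz] for the atoms of one configuration.
    After subtraction the forces sum to zero, up to floating-point
    rounding.
    """
    n_atoms = len(forces)
    mean = [sum(f[k] for f in forces) / n_atoms for k in range(3)]
    return [[f[k] - mean[k] for k in range(3)] for f in forces]


# Toy configuration with a small residual net force along x.
forces = [[0.12, 0.00, -0.05], [-0.10, 0.03, 0.05], [0.01, -0.03, 0.00]]
corrected = zero_net_force(forces)
net = [sum(f[k] for f in corrected) for k in range(3)]
```

Only the common offset is removed, so differences between the forces on any two atoms are preserved.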