Configuration File

This section describes the format of the configuration file used by Tadah!.

The configuration file controls the training process, specifying one or more datasets for use during the training stage. It defines cutoff functions and corresponding radii along with the regression model and descriptor choices.

Tadah! supports two-body and many-body descriptors, each of which can have its own cutoff radius.

Key/Value Pairs

The primary structure in a configuration file is the KEY/VALUE pair. Each KEY/VALUE pair must be on a separate line, with the KEY appearing first. The KEY is always a string, followed by its VALUE. The format and type of a VALUE depend on the specific KEY.

Common Usage

Typically, only a subset of KEYS is needed to train a model. Tadah! will use default values for some keys. An error will occur if a required KEY with no default value is missing:

[user@host:~] $ tadah train -c config.train
terminate called after throwing an instance of 'std::runtime_error'
  what():  Key not found: DBFILE
Aborted (core dumped)

This message indicates that the DBFILE KEY was not specified in the config.train file. To resolve this, add the DBFILE key and its corresponding value to config.train.
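
A missing required KEY can also be caught before a training run is started. The following is a minimal Python sketch, not part of Tadah!: the helper name `missing_keys` is hypothetical, and the only KEY assumed to be required is DBFILE, as the message above shows. It also assumes that `#` starts a comment.

```python
def missing_keys(config_text, required=("DBFILE",)):
    """Return the required KEYS that do not appear in the configuration text."""
    present = set()
    for raw_line in config_text.splitlines():
        # Assumption: everything after '#' is a comment and is ignored.
        tokens = raw_line.split("#", 1)[0].split()
        if tokens:
            present.add(tokens[0])  # the KEY is the first token on the line
    return [key for key in required if key not in present]


incomplete = "RCUT2B 6.7\nINIT2B true\n"
complete = incomplete + "DBFILE /path/to/dbfile\n"

print(missing_keys(incomplete))
print(missing_keys(complete))
```

Extend the `required` tuple with any further KEYS that the chosen model or descriptor needs.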

Key Specifics

The meaning of some KEYS may change based on the model or descriptor in use. Refer to Supported KEYS for each model or descriptor to see which KEYS it requires and what they mean.

Comments

Use the # symbol to add comments in the configuration file.

Multiple Values

Some KEYS can have multiple values, specified in one of two ways:

  • Single line:

    KEY VALUE1 VALUE2 VALUE3
    
  • Multiple lines:

    KEY VALUE1
    KEY VALUE2
    KEY VALUE3
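
The rules above (one KEY per line, `#` comments, values given on one line or by repeating the KEY) can be illustrated with a small reader. This is a minimal Python sketch rather than Tadah!'s own parser: the name `parse_config` is hypothetical, and it assumes whitespace-separated tokens and that `#` may start a comment anywhere on a line.

```python
def parse_config(text):
    """Collect KEY -> list of VALUE strings from configuration text."""
    config = {}
    for raw_line in text.splitlines():
        # Assumption: everything after '#' is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        key, *values = line.split()
        # Repeating a KEY appends to the values already collected for it.
        config.setdefault(key, []).extend(values)
    return config


example = """
# two equivalent ways of listing datasets
DBFILE /path/to/dbfile1 /path/to/dbfile2
RCUT2B 6.7
DBFILE /path/to/dbfile3   # added on a separate line
"""

config = parse_config(example)
print(config["DBFILE"])
```

Both styles end up in the same list, so the single-line and multi-line forms are interchangeable in this sketch.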
    

Value Limits

If the number of values exceeds the maximum allowed, an error will occur. For example, specifying RCTYPE2B twice when only one value is allowed will result in:

[user@host:~] $ tadah train -c config.train
terminate called after throwing an instance of 'std::runtime_error'
  what():  Repeated key RCTYPE2B Cut_Cos
Aborted (core dumped)
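
The same kind of check can be sketched outside Tadah!. The Python snippet below is illustrative only: `check_value_limits` is a hypothetical helper, the error text is not the message Tadah! prints, and the limits are copied from the "Max number of values" entries of a few KEYS in this document.

```python
def check_value_limits(config, limits):
    """Raise ValueError if a KEY holds more values than its stated maximum."""
    for key, values in config.items():
        max_values = limits.get(key)
        if max_values is not None and len(values) > max_values:
            raise ValueError(
                f"{key}: {len(values)} values given, at most {max_values} allowed"
            )


# Limits taken from the reference entries for ALPHA, BETA and LAMBDA.
limits = {"ALPHA": 1, "BETA": 1, "LAMBDA": 2}

check_value_limits({"LAMBDA": ["0", "1e-12"]}, limits)  # within the limit

try:
    check_value_limits({"ALPHA": ["0.23", "0.5"]}, limits)
except ValueError as error:
    print(error)
```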

Supported KEYS

This section contains all KEYS currently used by Tadah!.

ALPHA

ALPHA [double] N

Max number of values: 1

Default: 1.0

Description:

Weight precision hyper-parameter. Starting guess used in the evidence approximation algorithm.

Example: 1

ALPHA 0.23

ATOMS

ATOMS [string] N

Max number of values: 2147483647

Description:

List of unique atoms sorted by Z. This key is set by the library. Users should not set it.

BASIS

BASIS [double] N1 N2 N3 …

Max number of values: 2147483647

Description:

Basis vectors used by the non-linear Kernel Ridge Regression model. In this context, they represent the features or functions used to map input data into a higher-dimensional feature space to capture non-linear relationships.

BETA

BETA [double] N

Max number of values: 1

Default: 1.0

Description:

Noise precision hyper-parameter. Starting guess used in the evidence approximation algorithm.

Example: 1

BETA 0.0001

BIAS

BIAS [bool] true | false

Max number of values: 1

Default: true

Description:

Controls whether to append 1 to every descriptor. Increases DSIZE by 1.

Example: 1

BIAS false

CEMBFUNC

CEMBFUNC [double] N1 N2 N3 …

Max number of values: 2147483647

Description:

Position parameters of the embedding function. Used by some many-body descriptors; in F_RLR, they control where the x-intercept lies.

Example: 1

CEMBFUNC 0.14 0.45 1.00 1.1

CGRID2B

Max number of values: 2147483647

Description:

This KEY controls the position parameters used by the radial basis functions of a two-body descriptor, e.g., the position of a Gaussian function. The parameter list can be provided manually or generated automatically. This KEY is often used together with SGRID2B, and CGRID2B and SGRID2B must usually be the same size. In most cases, the maximum value should be smaller than the cutoff distance used for the two-body descriptor. Note that not all descriptors use this parameter, e.g., D2_LJ has a fixed grid.
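
The two rules of thumb in this description (equal sizes of CGRID2B and SGRID2B, and a largest position below the cutoff) can be checked for a manually provided grid. The Python sketch below is illustrative: `check_grids` is a hypothetical helper, and because the text says "usually" and "in most cases", it reports warnings instead of raising errors.

```python
def check_grids(cgrid, sgrid, rcut):
    """Return a list of warnings for a manually specified two-body grid."""
    warnings = []
    if len(cgrid) != len(sgrid):
        warnings.append(
            f"CGRID2B has {len(cgrid)} values but SGRID2B has {len(sgrid)}"
        )
    if cgrid and max(cgrid) >= rcut:
        warnings.append(
            f"largest CGRID2B value {max(cgrid)} is not below RCUT2B {rcut}"
        )
    return warnings


# Hypothetical grids, checked against the cutoff from the RCUT2B example.
print(check_grids([1.0, 2.5, 4.0], [0.5, 0.5, 0.5], 6.7))
print(check_grids([1.0, 2.5, 7.0], [0.5, 0.5], 6.7))
```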

CGRIDMB

Max number of values: 2147483647

Description:

See CGRID2B for a description.

DBFILE

DBFILE [string] /path/to/dbfile …

Max number of values: 2147483647

Description:

Absolute or relative path to the database file. A relative path is taken relative to the working directory of the script. More than one dataset can be included, either by listing paths on the same line separated by spaces or by repeating the KEY on multiple lines.

Example: 1

DBFILE /path/to/dbfile

Example: 2

DBFILE /path/to/dbfile1 /path/to/dbfile2

DIMER

DIMER [bool] F [double] BOND_LENGTH [bool] B

Max number of values: 3

Default: false 0 false

Description:

Control for DIMER models. Users should not modify this key.

Example: 1

DIMER true 1.104 true

DSIZE

DSIZE [int] N

Max number of values: 1

Description:

The total size of the descriptor: 2B + 3B + MB + bias.
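
As a worked illustration of this sum, the Python sketch below assumes that the 2B, 3B and MB terms correspond to SIZE2B, SIZE3B and SIZEMB, and that the bias term contributes 1 when BIAS is true. The function name `descriptor_size` is hypothetical.

```python
def descriptor_size(size2b, size3b, sizemb, bias=True):
    """Total descriptor size: 2B + 3B + MB, plus 1 if a bias term is appended."""
    return size2b + size3b + sizemb + (1 if bias else 0)


# Hypothetical sizes: 10 two-body and 20 many-body components, no three-body part.
print(descriptor_size(10, 0, 20))              # bias included
print(descriptor_size(10, 0, 20, bias=False))  # bias switched off
```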

ESTDEV

ESTDEV [int] N

Max number of values: 1

Description:

Tadah internal key.

EWEIGHT

EWEIGHT [double] N

Max number of values: 1

Default: 1.0

Description:

Global energy scaling factor for all configurations used in the training process. Note that energies are always scaled by 1/(number of atoms). Individual scaling factors for every configuration can be set in a dataset file. The combined scaling factor is: EWEIGHT*(configuration eweight)/(number of atoms)

Example: 1

EWEIGHT 0.96
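
The combined factor can be worked through numerically. The Python sketch below follows the formula in the description; the function name and the per-configuration weight and atom count in the example are made up for illustration.

```python
def energy_scale(eweight, config_eweight, n_atoms):
    """Combined energy scaling: EWEIGHT * (configuration eweight) / (number of atoms)."""
    return eweight * config_eweight / n_atoms


# EWEIGHT 0.96, a hypothetical configuration weight of 2.0 and 8 atoms.
scale = energy_scale(0.96, 2.0, 8)
print(scale)
```

Doubling the per-configuration weight doubles the factor, while doubling the number of atoms halves it.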

FIXINDEX

FIXINDEX [int] N1 N2 N3 …

Max number of values: 2147483647

Description:

Indices of weights to be fixed during the optimisation process. Requires FIXWEIGHT.

Example: 1

FIXINDEX 1 4 5

FIXWEIGHT

FIXWEIGHT [double] w1 w2 w3 …

Max number of values: 2147483647

Description:

Values for weights to be fixed during the optimisation process. Requires FIXINDEX.

Example: 1

FIXWEIGHT 1.0 1.0 0.5
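
Since FIXINDEX and FIXWEIGHT depend on each other, it can help to view them as pairs. The Python sketch below assumes that the two lists are matched position by position and must have equal length; neither this pairing rule nor the index base (0 or 1) is stated here, so treat both as assumptions. The helper name `fixed_weights` is hypothetical.

```python
def fixed_weights(fixindex, fixweight):
    """Pair each fixed weight index with its value (assumed one-to-one, in order)."""
    if len(fixindex) != len(fixweight):
        raise ValueError("FIXINDEX and FIXWEIGHT need the same number of values")
    return dict(zip(fixindex, fixweight))


# Values from the two examples above.
pairs = fixed_weights([1, 4, 5], [1.0, 1.0, 0.5])
print(pairs)
```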

FORCE

FORCE [bool] true | false

Max number of values: 1

Default: false

Description:

Set to true to calculate force descriptors and/or use forces during the training process.

FSTDEV

FSTDEV [int] N

Max number of values: 1

Description:

Tadah internal key.

FWEIGHT

FWEIGHT [double] N

Max number of values: 1

Default: 1.0

Description:

Global force scaling factor for all configurations used in the training process. Note that each force component is always scaled by 1/(number of atoms)/3. Individual scaling factors for every configuration can be set in a dataset file. The combined scaling factor is: FWEIGHT*(configuration fweight)/(number of atoms)/3

Example: 1

FWEIGHT 1e-2

INIT2B

INIT2B [bool] true | false

Max number of values: 1

Default: false

Description:

If set to true, the two-body descriptor will be calculated.

Example: 1

INIT2B true

INIT3B

INIT3B [bool] true | false

Max number of values: 1

Default: false

Description:

This is a dummy flag as Tadah! does not calculate three-body descriptors. Three-body interactions can be included with some of the many-body descriptors.

INITMB

INITMB [bool] true | false

Max number of values: 1

Default: false

Description:

If set to true, the many-body descriptor will be calculated.

Example: 1

INITMB true

LAMBDA

LAMBDA [int | double | int double] N | N M

Max number of values: 2

Default: 0

Description:

This KEY controls the regularisation parameter \(\lambda\) for both M_BLR and M_KRR. If N=0, no regularisation is applied. If N>0, \(\lambda\) is set to this value. If N<0, an evidence-approximation algorithm estimates the value of \(\lambda\). With LAMBDA 0, a second number M (double) can be specified, e.g. LAMBDA 0 1e-12; it helps to determine the effective rank of the matrix \(\Phi\) (default 1e-8).

Example: 1

LAMBDA -1

Example: 2

LAMBDA 1e-4

Example: 3

LAMBDA 0

Example: 4

LAMBDA 0 1e-12
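
The three cases can be summarised as a small decision function. This Python sketch only mirrors the description above and is not Tadah!'s implementation; the function name and the dictionary layout of the result are hypothetical.

```python
def interpret_lambda(values):
    """Classify the VALUE(S) of LAMBDA according to the rules in the description."""
    n = float(values[0])
    if n == 0:
        # Optional second number M: tolerance for the effective rank of Phi.
        tolerance = float(values[1]) if len(values) > 1 else 1e-8
        return {"mode": "no regularisation", "rank_tolerance": tolerance}
    if n > 0:
        return {"mode": "fixed", "lambda": n}
    return {"mode": "evidence approximation"}


# The four examples above, given as the tokens that follow the KEY.
for tokens in (["-1"], ["1e-4"], ["0"], ["0", "1e-12"]):
    print(tokens, interpret_lambda(tokens))
```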

MBLOCK

MBLOCK [int] N

Max number of values: 2147483647

Default: 64

Description:

ScaLAPACK row block size MB.

MODEL

MODEL [string] MODEL [string] FUNCTION

Max number of values: 3

Description:

This key defines the MODEL to be used for training. MODEL can be any class which inherits from M_Base. FUNCTION can be any child class of Function_Base.

Example: 1

MODEL M_BLR BF_Linear

Example: 2

MODEL M_BLR BF_Polynomial2

Example: 3

MODEL M_KRR Kern_Linear

MPARAMS

MPARAMS [double] N1 N2 N3 …

Max number of values: 2147483647

Description:

List of parameters used by some models. See model description for more details. Note that many models do not require this parameter at all.

Example: 1

MPARAMS 0.1

MPIWPCKG

MPIWPCKG [int] N

Max number of values: 2147483647

Default: 50

Description:

The number of structures in a single MPI work package.
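
To give a feel for what the package size means, the Python sketch below splits a list of structures into packages of at most N entries. How Tadah! actually assigns packages to MPI ranks is not described here, so the chunking scheme and the name `work_packages` are assumptions made for illustration.

```python
def work_packages(structures, package_size=50):
    """Split structures into consecutive packages of at most package_size items."""
    return [
        structures[start:start + package_size]
        for start in range(0, len(structures), package_size)
    ]


# 120 placeholder structures with the default package size of 50.
packages = work_packages(list(range(120)))
print([len(package) for package in packages])
```

In this sketch the last package simply holds whatever structures remain.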

NBLOCK

NBLOCK [int] N

Max number of values: 2147483647

Default: 64

Description:

ScaLAPACK column block size NB.

NMEAN

NMEAN [double] N1 N2 N3 …

Max number of values: 2147483647

Description:

Mean values for the columns of the DesignMatrix. This vector is obtained during standardisation (see NORM).

NORM

NORM [bool] true | false

Max number of values: 1

Default: false

Description:

Set to true to standardise descriptors. Note that this usually makes sense only when energies are used for fitting.

Example: 1

NORM true
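
Standardisation is what produces the vectors stored in NMEAN and NSTDEV. The Python sketch below shows the usual column-wise procedure on a small matrix. It assumes the population standard deviation and leaves constant columns at zero; Tadah! may differ in both respects, and the function name is hypothetical.

```python
import statistics


def standardise(design_matrix):
    """Standardise each column; return the scaled rows, column means and stdevs."""
    columns = list(zip(*design_matrix))
    means = [statistics.fmean(column) for column in columns]
    # Assumption: population standard deviation (divide by the number of rows).
    stdevs = [statistics.pstdev(column) for column in columns]
    scaled = [
        # A constant column has zero spread; map it to 0.0 to avoid dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stdevs)]
        for row in design_matrix
    ]
    return scaled, means, stdevs


scaled, means, stdevs = standardise([[1.0, 10.0], [3.0, 30.0]])
print(means, stdevs)
print(scaled)
```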

NSTDEV

NSTDEV [double] N1 N2 N3 …

Max number of values: 2147483647

Description:

Standard deviations obtained during standardisation (see NORM) of the columns of the DesignMatrix. The size of the vector is equal to the number of columns.

OUTPREC

OUTPREC [int] N

Max number of values: 1

Default: 10

Description:

Number of decimal places used when dumping a potential file.

Example: 1

OUTPREC 12

RCTYPE2B

RCTYPE2B [string] Cut_NAME

Max number of values: 2147483647

Description:

Cutoff type to be used with a two-body descriptor.

Example: 1

RCTYPE2B Cut_Cos
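
For orientation, a cosine cutoff is commonly defined as 0.5 * (1 + cos(pi * r / rc)) inside the cutoff radius rc and 0 beyond it. The Python sketch below implements that common form; whether Cut_Cos in Tadah! uses exactly this expression is an assumption and should be checked against the cutoff documentation.

```python
import math


def cut_cos(r, rcut):
    """Common cosine cutoff: decays smoothly from 1 at r = 0 to 0 at r = rcut."""
    if r >= rcut:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * r / rcut))


# Evaluate at a few distances for the cutoff used in the RCUT2B example.
for r in (0.0, 3.35, 6.7, 8.0):
    print(r, cut_cos(r, 6.7))
```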

RCTYPE3B

RCTYPE3B [string] Cut_NAME

Max number of values: 2147483647

Description:

Dummy. See INIT3B.

RCTYPEMB

RCTYPEMB [string] Cut_NAME

Max number of values: 2147483647

Description:

Cutoff type to be used with a many-body descriptor.

Example: 1

RCTYPEMB Cut_Cos

RCUT2B

RCUT2B [double] N

Max number of values: 2147483647

Description:

Cutoff distance used by the two-body descriptor.

Example: 1

RCUT2B 6.7

RCUT2BMAX

RCUT2BMAX [int] N

Max number of values: 1

Description:

Tadah internal key.

RCUT3B

RCUT3B [double] N

Max number of values: 2147483647

Description:

Dummy. See INIT3B.

RCUT3BMAX

RCUT3BMAX [int] N

Max number of values: 1

Description:

Tadah internal key.

RCUTMAX

RCUTMAX [int] N

Max number of values: 1

Description:

Tadah internal key.

RCUTMB

RCUTMB [double] N

Max number of values: 2147483647

Description:

Cutoff distance used by the many-body descriptor.

Example: 1

RCUTMB 4.9

RCUTMBMAX

RCUTMBMAX [int] N

Max number of values: 1

Description:

Tadah internal key.

SBASIS

SBASIS [int] N

Max number of values: 1

Description:

The number of basis functions to use when constructing the DesignMatrix. Note that many models do not require this parameter at all.

Example: 1

SBASIS 10

SEMBFUNC

SEMBFUNC [double] N1 N2 N3 …

Max number of values: 2147483647

Description:

Shape parameters of the embedding function. Used by some many-body descriptors; in F_RLR, they control the depth of the function.

Example: 1

SEMBFUNC 0.14 0.45 1.00 1.1

SETFL

SETFL [string] /path/to/dbfile

Max number of values: 1

Description:

Path to the setfl file with the EAM potential.

Example: 1

SETFL Ta1_Ravelo_2013.eam.alloy

SGRID2B

Max number of values: 2147483647

Description:

This KEY controls the shape parameters used by the radial basis functions of a two-body descriptor, e.g., the width of a Gaussian function. As with CGRID2B, the parameter list can be provided manually or generated automatically. This KEY is usually used together with CGRID2B.

SGRIDMB

Max number of values: 2147483647

Description:

See SGRID2B for a description.

SIGMA

SIGMA [int] N [double] D …

Max number of values: 2147483647

Description:

The \(\Sigma\) matrix used in Bayesian Linear Regression M_BLR. The matrix is \(N\times N\) and stored in column-major order.

Example: 1

SIGMA 2 1.2 2.2 2.3 3.3
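
Column-major storage means the values fill the matrix one column at a time. The Python sketch below unpacks the example values into a nested list; the helper name `parse_sigma` is hypothetical, and the input is the list of tokens that follow the KEY.

```python
def parse_sigma(values):
    """Rebuild an N x N matrix from [N, d1, d2, ...] stored in column-major order."""
    n = int(values[0])
    entries = [float(value) for value in values[1:]]
    if len(entries) != n * n:
        raise ValueError(f"expected {n * n} matrix entries, got {len(entries)}")
    # Column-major: the entry in row i, column j sits at position j * n + i.
    return [[entries[j * n + i] for j in range(n)] for i in range(n)]


sigma = parse_sigma("2 1.2 2.2 2.3 3.3".split())
for row in sigma:
    print(row)
```

The first column is therefore (1.2, 2.2) and the second is (2.3, 3.3).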

SIZE2B

SIZE2B [int] N

Max number of values: 1

Description:

Size of 2B descriptor.

SIZE3B

SIZE3B [int] N

Max number of values: 1

Description:

Size of 3B descriptor.

SIZEMB

SIZEMB [int] N

Max number of values: 1

Description:

Size of MB descriptor.

SSTDEV

SSTDEV [int] N

Max number of values: 6

Description:

Tadah internal key.

STRESS

STRESS [bool] true | false

Max number of values: 1

Default: false

Description:

Set to true to calculate stress descriptors and/or use virial stress during the training process.

SWEIGHT

SWEIGHT [double] N

Max number of values: 1

Default: 1.0

Description:

Global stress scaling factor for all configurations used in the training process. Note that each stress component is always scaled by 1/6. Individual scaling factors for every configuration can be set in a dataset file. The combined scaling factor is: SWEIGHT*(configuration sweight)/6

Example: 1

SWEIGHT 1e-1
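
The stress factor can be compared with the energy and force factors given under EWEIGHT and FWEIGHT. The Python sketch below evaluates all three formulas for one configuration. The function name is hypothetical, the global weights default to 1.0 as documented, and the per-configuration weights are assumed to default to 1.0 as well.

```python
def training_scales(n_atoms, eweight=1.0, fweight=1.0, sweight=1.0,
                    config_eweight=1.0, config_fweight=1.0, config_sweight=1.0):
    """Combined scaling factors for one configuration, following the formulas above."""
    return {
        "energy": eweight * config_eweight / n_atoms,
        "force": fweight * config_fweight / n_atoms / 3,
        "stress": sweight * config_sweight / 6,
    }


# A hypothetical two-atom configuration with SWEIGHT 1e-1.
scales = training_scales(2, sweight=1e-1)
print(scales)
```

Only the energy and force factors depend on the number of atoms; the stress factor is divided by 6 regardless of system size.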

TYPE2B

TYPE2B [string] D2_NAME

Max number of values: 2147483647

Description:

Type of a two-body descriptor to be used. Every two-body descriptor inherits from D2_Base.

Example: 1

TYPE2B D2_LJ

Example: 2

TYPE2B D2_Blip

TYPE3B

TYPE3B [string] D3_NAME

Max number of values: 2147483647

Description:

Dummy. See INIT3B.

TYPEMB

TYPEMB [string] DM_NAME

Max number of values: 2147483647

Description:

Type of a many-body descriptor to be used. Every many-body descriptor inherits from DM_Base.

Example: 1

TYPEMB DM_EAD

VERBOSE

VERBOSE [int] N

Max number of values: 1

Default: 0

Description:

Set verbosity level. For N>0, it provides detailed output for diagnostic purposes.

Example: 1

VERBOSE 1

WATOMS

WATOMS [double] N

Max number of values: 2147483647

Description:

Weights sorted by the atomic number of the corresponding atom, with the lowest Z value first. WATOMS must contain the same number of values as ATOMS.
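
The ordering convention can be illustrated by sorting element symbols by Z and pairing them with weights. In the Python sketch below, the lookup table covers only a handful of elements, and the function names and weight values are made up for illustration.

```python
# Atomic numbers for a few elements; extend as needed.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "O": 8, "Ti": 22, "Ta": 73}


def sort_by_z(symbols):
    """Unique element symbols ordered by atomic number, lowest Z first."""
    return sorted(set(symbols), key=lambda symbol: ATOMIC_NUMBERS[symbol])


def pair_weights(atoms, watoms):
    """Match each atom (already sorted by Z) with its weight."""
    if len(atoms) != len(watoms):
        raise ValueError("WATOMS and ATOMS must have the same number of values")
    return dict(zip(atoms, watoms))


atoms = sort_by_z(["Ta", "Ti", "Ta", "O"])
print(atoms)
print(pair_weights(atoms, [0.5, 1.0, 2.0]))
```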

WEIGHTS

WEIGHTS [double] N1 N2 N3 …

Max number of values: 2147483647

Description:

The machine-learned coefficients for the model, obtained during the optimisation process. These are species-dependent weights and can be set by the user. The default weight is an atomic number.

Example: 1

WEIGHTS 0.12 1.2 0.3