Configuration File
This section describes the format of the configuration file used by Tadah!.
The configuration file controls the training process, specifying one or more datasets for use during the training stage. It defines cutoff functions and corresponding radii along with the regression model and descriptor choices.
Tadah! supports two- and many-body descriptors, allowing for separate descriptors with their corresponding cutoff radii.
Key/Value Pairs
The primary structure in a configuration file is the KEY/VALUE pair. Each KEY/VALUE pair must be on a separate line, with the KEY appearing first. The KEY is always a string, followed by its VALUE. The format and type of a VALUE depend on the specific KEY.
Common Usage
Typically, only a subset of KEYS is needed to train a model. Tadah! will use default values for some keys. An error will occur if a required KEY with no default value is missing:
[user@host:~] $ tadah train -c config.train
terminate called after throwing an instance of 'std::runtime_error'
what(): Key not found: DBFILE
Aborted (core dumped)
This message indicates that the DBFILE KEY was not specified in the config.train file. To resolve this, add the DBFILE key and its corresponding value to config.train.
Key Specifics
The meaning of some KEYS may change based on the model or descriptor in use. Refer to Supported KEYS for each model or descriptor to understand the required KEYS and their explanations.
Multiple Values
Some KEYS can have multiple values, specified in one of two ways:
Single line:
KEY VALUE1 VALUE2 VALUE3
Multiple lines:
KEY VALUE1
KEY VALUE2
KEY VALUE3
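Both forms should yield the same list of values for a KEY. The Python sketch below illustrates that accumulation. It is not Tadah!'s own parser: the function name `collect_values` is hypothetical, and it assumes tokens are separated by whitespace, as in the examples in this section.

```python
from collections import defaultdict


def collect_values(text):
    """Group whitespace-separated VALUES under their KEY, preserving order."""
    values = defaultdict(list)
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        key, *rest = tokens
        values[key].extend(rest)  # a repeated KEY appends to the same list
    return dict(values)


single_line = "DBFILE /path/to/dbfile1 /path/to/dbfile2"
multi_line = "DBFILE /path/to/dbfile1\nDBFILE /path/to/dbfile2"
print(collect_values(single_line) == collect_values(multi_line))  # True
```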
Value Limits
If the number of values exceeds the maximum allowed, an error will occur. For example, specifying RCTYPE2B twice when only one value is allowed will result in:
[user@host:~] $ tadah train -c config.train
terminate called after throwing an instance of 'std::runtime_error'
what(): Repeated key RCTYPE2B Cut_Cos
Aborted (core dumped)
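A configuration can be checked against the documented limits before training starts. The following Python sketch mirrors the rule only. The helper `check_limit` and its error text are hypothetical and differ from the message Tadah! prints.

```python
def check_limit(key, values, max_values):
    """Raise if a KEY carries more VALUES than its documented maximum."""
    if len(values) > max_values:
        raise ValueError(
            f"Too many values for {key}: {len(values)} > {max_values}"
        )
    return values


check_limit("ALPHA", ["0.23"], max_values=1)  # within the limit, no error
try:
    check_limit("ALPHA", ["0.23", "0.5"], max_values=1)
except ValueError as err:
    print(err)
```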
Supported KEYS:
This section contains all KEYS currently used by Tadah!.
ALPHA
- ALPHA [double] N
Max number of values: 1
Default: 1.0
Description:
Weight precision hyper-parameter. Starting guess used in the evidence approximation algorithm.
Example: 1
ALPHA 0.23
ATOMS
- ATOMS [string] N
Max number of values: 2147483647
Description:
List of unique atoms sorted by Z. This key is set by the library. Users should not set it.
BASIS
- BASIS [double] N1 N2 N3 …
Max number of values: 2147483647
Description:
Basis vectors used by the non-linear Kernel Ridge Regression model. In this context, they represent the features or functions used to map input data into a higher-dimensional feature space to capture non-linear relationships.
BETA
- BETA [double] N
Max number of values: 1
Default: 1.0
Description:
Noise precision hyper-parameter. Starting guess used in the evidence approximation algorithm.
Example: 1
BETA 0.0001
BIAS
- BIAS [bool] true | false
Max number of values: 1
Default: true
Description:
Controls whether to append 1 to every descriptor. Increases DSIZE by 1.
Example: 1
BIAS false
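The effect of BIAS on the descriptor length can be pictured with a short Python sketch. The function `with_bias` is hypothetical, and placing the constant at the end of the vector is an assumption based on the word "append".

```python
def with_bias(descriptor, bias=True):
    """Return a copy of the descriptor with 1.0 appended when bias is on."""
    return descriptor + [1.0] if bias else list(descriptor)


d = [0.5, 0.25, 0.125]
print(len(with_bias(d, bias=True)))   # 4
print(len(with_bias(d, bias=False)))  # 3
```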
CEMBFUNC
- CEMBFUNC [double] N1 N2 N3 …
Max number of values: 2147483647
Description:
Position parameters of the embedding function. Used by some many-body descriptors, it controls where the x-intercept is in F_RLR.
Example: 1
CEMBFUNC 0.14 0.45 1.00 1.1
CGRID2B
Max number of values: 2147483647
Description:
This KEY controls the position parameters used by the radial basis functions of a two-body descriptor, e.g., the position of a Gaussian function. The parameter list can be provided manually or generated automatically. This key is often used together with SGRID2B. CGRID2B and SGRID2B usually must be the same size. In most cases, the maximum value should be smaller than the cutoff distance used for the two-body descriptor. Note that not all descriptors use this parameter, e.g., D2_LJ has a fixed grid.
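The two rules of thumb above (matching sizes, and grid positions inside the cutoff) can be checked before training. The Python sketch below is illustrative only. The helper `check_grid` is hypothetical, and it reports warnings rather than errors because the text says these conditions hold "usually" and "in most cases".

```python
def check_grid(cgrid, sgrid, rcut):
    """Return a list of warnings for a two-body grid; empty means none found."""
    warnings = []
    if len(cgrid) != len(sgrid):
        warnings.append(
            f"CGRID2B has {len(cgrid)} values but SGRID2B has {len(sgrid)}"
        )
    if cgrid and max(cgrid) >= rcut:
        warnings.append(
            f"largest CGRID2B value {max(cgrid)} is not below RCUT2B {rcut}"
        )
    return warnings


# Grid values are made up for illustration; 6.7 is the RCUT2B example value.
print(check_grid([1.0, 2.5, 4.0], [0.5, 0.5, 0.5], 6.7))  # []
```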
CGRIDMB
DBFILE
- DBFILE [string] /path/to/dbfile …
Max number of values: 2147483647
Description:
Absolute or relative path to the database file. A relative path is taken relative to the script working directory. More than one dataset can be included, either by listing paths on the same line separated by spaces or by repeating the KEY multiple times.
Example: 1
DBFILE /path/to/dbfile
Example: 2
DBFILE /path/to/dbfile1 /path/to/dbfile2
DIMER
- DIMER [bool] F [double] BOND_LENGTH [bool] B
Max number of values: 3
Default: false 0 false
Description:
Control for DIMER models. Users should not modify this key.
Example: 1
DIMER true 1.104 true
DSIZE
- DSIZE [int] N
Max number of values: 1
Description:
The total size of the descriptor. 2B + 3B + MB + bias.
ESTDEV
- ESTDEV [int] N
Max number of values: 1
Description:
Tadah internal key.
EWEIGHT
- EWEIGHT [double] N
Max number of values: 1
Default: 1.0
Description:
Global energy scaling factor for all configurations used in the training process. Note that energies are always scaled by 1/number of atoms. Individual scaling factors for every configuration can be set in a dataset file. The combined scaling factor is: EWEIGHT*(configuration eweight)/(number of atoms)
Example: 1
EWEIGHT 0.96
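The combined factor can be written out as a one-line function. This Python sketch only restates the formula above; the function name and the 32-atom configuration are made up for illustration.

```python
def energy_scale(eweight, config_eweight, n_atoms):
    """EWEIGHT * (configuration eweight) / (number of atoms)."""
    return eweight * config_eweight / n_atoms


# EWEIGHT 0.96, a configuration eweight of 1.0, and 32 atoms.
print(energy_scale(0.96, 1.0, 32))  # 0.03
```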
FIXINDEX
- FIXINDEX [int] N1 N2 N3 …
Max number of values: 2147483647
Description:
Indices of weights to be fixed during the optimisation process. Requires FIXWEIGHT.
Example: 1
FIXINDEX 1 4 5
FIXWEIGHT
- FIXWEIGHT [double] w1 w2 w3 …
Max number of values: 2147483647
Description:
Values for weights to be fixed during the optimisation process. Requires FIXINDEX.
Example: 1
FIXWEIGHT 1.0 1.0 0.5
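Since FIXINDEX and FIXWEIGHT each require the other, one plausible reading is that the two lists are paired by position. The Python sketch below assumes exactly that, and also assumes both lists must have the same length. Neither assumption is confirmed by this section, and the helper name is hypothetical.

```python
def fixed_weights(fixindex, fixweight):
    """Map each fixed weight index to its value, pairing the lists by position."""
    if len(fixindex) != len(fixweight):
        raise ValueError(
            "FIXINDEX and FIXWEIGHT must list the same number of values"
        )
    return dict(zip(fixindex, fixweight))


print(fixed_weights([1, 4, 5], [1.0, 1.0, 0.5]))  # {1: 1.0, 4: 1.0, 5: 0.5}
```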
FORCE
- FORCE [bool] true | false
Max number of values: 1
Default: false
Description:
Set to true to calculate force descriptors and/or use forces during the training process.
FSTDEV
- FSTDEV [int] N
Max number of values: 1
Description:
Tadah internal key.
FWEIGHT
- FWEIGHT [double] N
Max number of values: 1
Default: 1.0
Description:
Global force scaling factor for all configurations used in the training process. Note that each force component is always scaled by 1/(number of atoms)/3. Individual scaling factors for every configuration can be set in a dataset file. The combined scaling factor is: FWEIGHT*(configuration fweight)/(number of atoms)/3
Example: 1
FWEIGHT 1e-2
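Written as a formula, the combined force factor reads as follows. The symbols \(w_F\), \(f_c\) (the configuration fweight) and \(N\) (the number of atoms) are introduced here for illustration only, and the 10-atom configuration is a made-up example.

```latex
w_F = \frac{\mathrm{FWEIGHT} \times f_c}{3N},
\qquad
\text{e.g. } \mathrm{FWEIGHT} = 10^{-2},\ f_c = 1,\ N = 10
\;\Rightarrow\;
w_F = \frac{10^{-2}}{30} \approx 3.3 \times 10^{-4}
```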
INIT2B
- INIT2B [bool] true | false
Max number of values: 1
Default: false
Description:
If set to true, the two-body descriptor will be calculated.
Example: 1
INIT2B true
INIT3B
- INIT3B [bool] true | false
Max number of values: 1
Default: false
Description:
This is a dummy flag as Tadah! does not calculate three-body descriptors. Three-body interactions can be included with some of the many-body descriptors.
INITMB
- INITMB [bool] true | false
Max number of values: 1
Default: false
Description:
If set to true, the many-body descriptor will be calculated.
Example: 1
INITMB true
LAMBDA
- LAMBDA [int | double | int double] N | N M
Max number of values: 2
Default: 0
Description:
This KEY controls the regularisation parameter \(\lambda\) for both M_BLR and M_KRR. If N=0, no regularisation is applied. If N>0, \(\lambda\) is set to this value. If N<0, an evidence-approximation algorithm estimates the value of \(\lambda\). For LAMBDA 0, a second number, M (double), can be specified, e.g. LAMBDA 0 1e-12, which helps to determine the effective rank of the matrix \(\Phi\) (default 1e-8).
Example: 1
LAMBDA -1
Example: 2
LAMBDA 1e-4
Example: 3
LAMBDA 0
Example: 4
LAMBDA 0 1e-12
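The three regimes of LAMBDA can be summarised in a short Python sketch. It restates the rules in the description and does not reproduce Tadah!'s implementation. The function name and the wording "rank tolerance" for the second value M are choices made here.

```python
def describe_lambda(n, m=1e-8):
    """Summarise how a LAMBDA setting is interpreted."""
    if n == 0:
        return f"no regularisation; rank tolerance {m:g}"
    if n > 0:
        return f"lambda fixed at {n:g}"
    return "lambda estimated by the evidence approximation"


# The four example settings listed for this KEY.
for args in [(-1,), (1e-4,), (0,), (0, 1e-12)]:
    print(describe_lambda(*args))
```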
MBLOCK
- MBLOCK [int] N
Max number of values: 2147483647
Default: 64
Description:
ScaLAPACK row block size MB.
MODEL
- MODEL [string] MODEL [string] FUNCTION
Max number of values: 3
Description:
This key defines the MODEL to be used for training. MODEL can be any class which inherits from M_Base. FUNCTION can be any child class of Function_Base.
Example: 1
MODEL M_BLR BF_Linear
Example: 2
MODEL M_BLR BF_Polynomial2
Example: 3
MODEL M_KRR Kern_Linear
MPARAMS
- MPARAMS [double] N1 N2 N3 …
Max number of values: 2147483647
Description:
List of parameters used by some models. See model description for more details. Note that many models do not require this parameter at all.
Example: 1
MPARAMS 0.1
MPIWPCKG
- MPIWPCKG [int] N
Max number of values: 2147483647
Default: 50
Description:
The number of structures in a single MPI work package.
NBLOCK
- NBLOCK [int] N
Max number of values: 2147483647
Default: 64
Description:
ScaLAPACK column block size NB.
NMEAN
- NMEAN [double] N1 N2 N3 …
Max number of values: 2147483647
Description:
Mean values for the columns of the DesignMatrix. This vector is obtained during standardisation (see NORM).
NORM
- NORM [bool] true | false
Max number of values: 1
Default: false
Description:
Set to true to standardise descriptors. Note that this usually makes sense only when energies are used for fitting.
Example: 1
NORM true
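Standardisation in the usual sense subtracts the column mean and divides by the column standard deviation. The Python sketch below shows that operation on a tiny matrix and how per-column means and deviations (compare NMEAN and NSTDEV) arise from it. It assumes the population standard deviation, and does not handle constant columns, for which the deviation is zero. How Tadah! treats either case is not stated in this section.

```python
from statistics import mean, pstdev


def standardise(rows):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]  # population form (assumption)
    scaled = [
        [(x - m) / s for x, m, s in zip(row, means, stdevs)]
        for row in rows
    ]
    return scaled, means, stdevs


rows = [[1.0, 10.0], [3.0, 30.0]]
scaled, nmean, nstdev = standardise(rows)
print(nmean)   # [2.0, 20.0]
print(nstdev)  # [1.0, 10.0]
print(scaled)  # [[-1.0, -1.0], [1.0, 1.0]]
```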
NSTDEV
- NSTDEV [double] N1 N2 N3 …
Max number of values: 2147483647
Description:
Standard deviations obtained during standardisation (see NORM) of the columns of the DesignMatrix. The size of the vector is equal to the number of columns.
OUTPREC
- OUTPREC [int] N
Max number of values: 1
Default: 10
Description:
Number of decimal places used when dumping a potential file.
Example: 1
OUTPREC 12
RCTYPE2B
- RCTYPE2B [string] Cut_NAME
Max number of values: 2147483647
Description:
Cutoff type to be used with a two-body descriptor.
Example: 1
RCTYPE2B Cut_Cos
RCTYPE3B
- RCTYPE3B [string] Cut_NAME
Max number of values: 2147483647
Description:
Dummy. See INIT3B.
RCTYPEMB
- RCTYPEMB [string] Cut_NAME
Max number of values: 2147483647
Description:
Cutoff type to be used with a many-body descriptor.
Example: 1
RCTYPEMB Cut_Cos
RCUT2B
- RCUT2B [double] N
Max number of values: 2147483647
Description:
Cutoff distance used by the two-body descriptor.
Example: 1
RCUT2B 6.7
RCUT2BMAX
- RCUT2BMAX [int] N
Max number of values: 1
Description:
Tadah internal key.
RCUT3B
- RCUT3B [double] N
Max number of values: 2147483647
Description:
Dummy. See INIT3B.
RCUT3BMAX
- RCUT3BMAX [int] N
Max number of values: 1
Description:
Tadah internal key.
RCUTMAX
- RCUTMAX [int] N
Max number of values: 1
Description:
Tadah internal key.
RCUTMB
- RCUTMB [double] N
Max number of values: 2147483647
Description:
Cutoff distance used by the many-body descriptor.
Example: 1
RCUTMB 4.9
RCUTMBMAX
- RCUTMBMAX [int] N
Max number of values: 1
Description:
Tadah internal key.
SBASIS
- SBASIS [int] N
Max number of values: 1
Description:
The number of basis functions to use when constructing the DesignMatrix. Note that many models do not require this parameter at all.
Example: 1
SBASIS 10
SEMBFUNC
- SEMBFUNC [double] N1 N2 N3 …
Max number of values: 2147483647
Description:
Shape parameters of the embedding function. Used by some many-body descriptors, it controls the depth of the function in F_RLR.
Example: 1
SEMBFUNC 0.14 0.45 1.00 1.1
SETFL
- SETFL [string] /path/to/dbfile
Max number of values: 1
Description:
Path to the setfl file with the EAM potential.
Example: 1
SETFL Ta1_Ravelo_2013.eam.alloy
SGRID2B
Max number of values: 2147483647
Description:
Controls the number of shape parameters for the radial basis functions of a two-body descriptor, e.g., the width of a Gaussian function. As with CGRID2B, the parameter list can be provided manually or generated automatically. This KEY is usually employed together with CGRID2B.
SGRIDMB
SIGMA
- SIGMA [int] N [double] D …
Max number of values: 2147483647
Description:
The \(\Sigma\) matrix used in Bayesian Linear Regression M_BLR. The matrix is \(N\times N\) and stored in column-major order.
Example: 1
SIGMA 2 1.2 2.2 2.3 3.3
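Column-major order means the values after N fill the matrix one column at a time. The Python sketch below unpacks the example values into rows to make the layout explicit. The helper name is hypothetical and this is not how Tadah! stores the matrix internally.

```python
def sigma_from_values(values):
    """Rebuild an N x N matrix (list of rows) from 'N d1 d2 ...' tokens."""
    n = int(values[0])
    data = [float(v) for v in values[1:]]
    if len(data) != n * n:
        raise ValueError(f"expected {n * n} entries, got {len(data)}")
    # Column-major: entry (row, col) sits at index col * n + row.
    return [[data[col * n + row] for col in range(n)] for row in range(n)]


print(sigma_from_values("2 1.2 2.2 2.3 3.3".split()))
# [[1.2, 2.3], [2.2, 3.3]]
```

The first column is (1.2, 2.2) and the second is (2.3, 3.3), so the off-diagonal entries of the printed rows are swapped relative to a row-major reading.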
SIZE2B
- SIZE2B [int] N
Max number of values: 1
Description:
Size of 2B descriptor.
SIZE3B
- SIZE3B [int] N
Max number of values: 1
Description:
Size of 3B descriptor.
SIZEMB
- SIZEMB [int] N
Max number of values: 1
Description:
Size of MB descriptor.
SSTDEV
- SSTDEV [int] N
Max number of values: 6
Description:
Tadah internal key.
STRESS
- STRESS [bool] true | false
Max number of values: 1
Default: false
Description:
Set to true to calculate stress descriptors and/or use virial stress during the training process.
SWEIGHT
- SWEIGHT [double] N
Max number of values: 1
Default: 1.0
Description:
Global stress scaling factor for all configurations used in the training process. Note that each stress component is always scaled by 1/6. Individual scaling factors for every configuration can be set in a dataset file. The combined scaling factor is: SWEIGHT*(configuration sweight)/6
Example: 1
SWEIGHT 1e-1
TYPE2B
- TYPE2B [string] D2_NAME
Max number of values: 2147483647
Description:
Type of a two-body descriptor to be used. Every two-body descriptor inherits from D2_Base.
Example: 1
TYPE2B D2_LJ
Example: 2
TYPE2B D2_Blip
TYPE3B
- TYPE3B [string] D3_NAME
Max number of values: 2147483647
Description:
Dummy. See INIT3B.
TYPEMB
- TYPEMB [string] DM_NAME
Max number of values: 2147483647
Description:
Type of a many-body descriptor to be used. Every many-body descriptor inherits from DM_Base.
Example: 1
TYPEMB DM_EAD
VERBOSE
- VERBOSE [int] N
Max number of values: 1
Default: 0
Description:
Set verbosity level. For N>0, it provides detailed output for diagnostic purposes.
Example: 1
VERBOSE 1
WATOMS
- WATOMS [double] N
Max number of values: 2147483647
Description:
Weights sorted by the atomic number of the corresponding atom, with the lowest Z value first. The number of weights must match the number of atoms: WATOMS.size() == ATOMS.size()
WEIGHTS
- WEIGHTS [double] N1 N2 N3 …
Max number of values: 2147483647
Description:
The machine-learned coefficients for the model, obtained during the optimisation process. These are species-dependent weights and can be set by the user. The default weight is an atomic number.
Example: 1
WEIGHTS 0.12 1.2 0.3
Comments
Use the # symbol to add comments in the configuration file.
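When pre-processing a configuration file outside Tadah!, comments can be removed with a helper like the Python sketch below. It assumes a comment runs from the first # to the end of the line and that trailing comments after a KEY/VALUE pair are allowed. This section does not confirm the latter, and the function name is hypothetical.

```python
def strip_comment(line):
    """Drop everything from the first '#' to the end of the line."""
    return line.split("#", 1)[0].rstrip()


print(repr(strip_comment("RCUT2B 6.7  # two-body cutoff")))  # 'RCUT2B 6.7'
print(repr(strip_comment("# a full-line comment")))          # ''
```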