.. _nested_fitting:

Nested Fitting
==============

.. warning::

   This section is under construction. The content is incomplete and may
   still contain errors. We are working on it and will update it soon.

.. figure:: images/nested_fitting.jpg
   :alt: Nested Fitting Procedure
   :width: 95%
   :align: center

   Workflow of the :term:`nested fitting procedure`: the outer optimizer
   proposes new :term:`hyperparameters`, the model is retrained, and each
   trial potential is validated against :term:`performance constraints`
   that feed back into the global loss.

Nested fitting is Tadah!MLIP’s automated, two-level fitting workflow:

* **inner loop** – ordinary regression that determines the
  :term:`learned parameters` :math:`\mathbf w` for a fixed model definition;
* **outer loop** – a global optimizer that samples from the
  :term:`search space constraints` (SSH), retrains the model, and judges
  the result against user-defined :term:`performance constraints` (PC).

Letting a computer search the :term:`hyperparameter` space helps to

* escape the “good on the validation set, unstable in MD” trap;
* trade accuracy for speed (or vice versa) in a reproducible way;
* fold real-world performance constraints (elastic constants, phase
  stability, surface energies, …) directly into the loss function.

This page gives the necessary background and demonstrates how to set up
Tadah!MLIP for nested fitting.

Background
----------

A traditional MLIP workflow stops once regression converges on a
training/validation split. Unfortunately, a potential that interpolates
perfectly may still

* disintegrate during an MD run
* produce an incorrect equation of state
* predict the wrong ground-state crystal
* extrapolate poorly beyond the training set

The usual remedies, such as manual hyperparameter tuning or enlarging the
training set, are tedious and not guaranteed to succeed.

Nested fitting tackles the problem by *measuring* emergent properties during the fit.
Each trial potential is dropped into a short LAMMPS run; the resulting
quantities are compared to their targets and the discrepancy contributes to a
**global loss**

.. math::

   L_{\text{total}} = \sum_{i} w_i\; \mathcal L_i\!\bigl(|y_i-\hat y_i|\bigr),

where :math:`w_i` is the user-supplied weight, :math:`y_i` the target,
:math:`\hat y_i` the prediction, and :math:`\mathcal L_i` one of the built-in
loss functions (L1, L2, Huber, Tukey, …).

The outer optimizer explores the hyperparameter space :math:`\Theta` declared
with the ``OPTIM`` directive:

.. math::

   \theta^\star = \operatorname*{arg\,min}_{\theta\in\Theta} L_{\text{total}}(\theta).

Because the model is retrained for every candidate :math:`\theta`, the
procedure is computationally expensive but also powerful: it can expose
regions of hyperparameter space that yield stable, accurate, and fast
potentials. However, it is not a silver bullet: **success still depends on
sensible choices of performance constraints and search space limits**.

.. _hpo_quick_start:

Quick Start
-----------

Nested fitting requires three files: a configuration file that defines the
initial model (the same file as for regular training), a validation file (a
simple list of datasets to validate against), and an HPO targets file that
defines the outer loop. The command to run nested fitting is:

::

   tadah hpo --config --hpotarget --validation

See :ref:`cli_example_2` for a simple example of nested fitting.

More examples will be added in the future. In the meantime, you can contact
us for help with setting up nested fitting for your specific use case.

.. _hpo_manual:

Manual
------

This manual covers the format of the configuration file that drives the
nested fitting procedure, together with some general information about the
procedure itself.

In general, the following steps are required to run nested fitting:

#. Define the initial model in the :term:`training configuration` file.
#. List the validation dataset(s) in the validation file using the
   :ref:`DBFILE` keyword.
#. Define the :term:`nested fitting configuration` file with the following
   blocks:

   * ``LOSS`` to define the default loss function
   * ``OPTIMIZER`` block to define the outer optimizer
   * ``OPTIM`` line(s) to define the :term:`SSH`
   * ``PC_`` line(s) to define the :term:`PC`:

     * ``PC_ERMSE`` for validation RMSE of energies (see
       :ref:`performance_constraints` for more)
     * ``PC_LAMMPS`` line(s) to define the physics-informed performance
       constraints

.. _loss_functions:

Loss functions
..............

The keyword ``LOSS`` defines the default loss function for the nested
fitting procedure. Individual ``PC_LAMMPS`` simulations can override it.

::

   LOSS []

The currently supported loss functions are:

================ =============== ==================================================
Name             Extra params    Comment
================ =============== ==================================================
``L1``           none            :math:`|x|`
``L2``           none            :math:`x^2`
``HUBER``        ``δ``           quadratic near zero, linear beyond δ
``TUKEY``        ``c``           redescending, zero influence for :math:`|x| > c`
``LOG_COSH``     none            smooth alternative to L1
``RMSLE``        none            log-scaled L2 (non-negative targets)
================ =============== ==================================================

Choosing the optimizer
.......................

The optimizer for the outer loop is specified in the
:term:`nested fitting configuration` file. It can be global, local, or a
hybrid of the two; the hybrid mode is supported by only a handful of
optimizers. Tadah!MLIP supports optimizers from the following libraries:

* NLOPT https://nlopt.readthedocs.io

  We aim to support all NLOPT algorithms that do not require analytical
  derivatives, i.e. all algorithms whose names begin with either ``GN_`` or
  ``LN_``. See the `NLOPT documentation <https://nlopt.readthedocs.io>`_.

* Dlib http://dlib.net

  * Local numerical optimizers (BFGS, LBFGS, CG, BOBYQA) [Work in progress...]
  * Global MaxLIPO+TR aka Global Function Search (GFS)

* Tadah!MLIP provides basic random search and grid search optimizers.

The optimizer is specified in the ``OPTIMIZER`` block of the
:term:`nested fitting configuration` file.

.. code-block:: none

   OPTIMIZER
       ...
   ENDOPTIMIZER

If an optimizer is a hybrid of global and local, the nested ``LOCAL`` block
must be specified. The local block follows the same syntax as the global
block and configures the optimizer used for the local search.

.. code-block:: none

   OPTIMIZER
       ...
       ...
       LOCAL
           ...
           ...
       ENDLOCAL
   ENDOPTIMIZER

Note that most optimizers support only a subset of the keys listed below.

========================== ============================================================================
key [type]                 description [available options]
========================== ============================================================================
``LIB`` [string]           name of the optimization library [``DLIB``, ``NLOPT``, ``TADAH``]
``ALGO`` [string]          name of the algorithm from the selected library (see below)
``MAXEVAL`` [uint]         maximum number of evaluations of the loss function
``MAXTIME`` [uint]         maximum time in seconds to run the optimizer
``STOPVAL`` [uint]         stop the optimization when the loss is below this value
``FTOL_REL`` [float]       relative tolerance for the loss function
``FTOL_ABS`` [float]       absolute tolerance for the loss function
``XTOL_REL`` [float]       relative tolerance for the hyperparameters
``XTOL_ABS`` [float]       absolute tolerance for the hyperparameters
``POPULATION`` [uint]      number of individuals in the population (for population-based optimizers)
``STEP`` [uint]            step size for the optimizer
``SEED`` [uint]            random seed for the optimizer (default: current time)
``VECTOR_ARR`` [uint]      storage required by some NLOPT algorithms
``THREADS`` [uint]         number of threads for the Dlib MaxLIPO+TR optimizer (default: 1)
``PARAM`` [string float]   generic key-value pair for passing additional parameters to the optimizer
``LOG_HP`` [float]         threshold for the log-scaling of the search space (experimental)
========================== ============================================================================
========= ==========================================================================================
LIB       ALGO (algorithms from the selected library)
========= ==========================================================================================
``TADAH`` ``RANDOM``, ``GRID``
``NLOPT`` ``GN_DIRECT_L``, ``GN_CRS2_LM``, ``G_MLSL_LDS``, ... and many more, see the NLOPT docs
``DLIB``  ``BFGS``, ``LBFGS``, ``CG``, ``BOBYQA``, ``GFS`` (MaxLIPO+TR) [work in progress...]
========= ==========================================================================================

.. _search_space_constraints:

Search space constraints
........................

The search space constraints are defined in the
:term:`nested fitting configuration` file using the ``OPTIM`` directive. The
``OPTIM`` lines specify the hyperparameters that the outer optimizer will
explore. Each ``OPTIM`` line refers to a single KEY from the configuration
file that defines the initial model (*config.train*); that KEY is then
treated as an optimisation variable. Since a KEY can hold multiple values, an
``OPTIM`` line can also specify indices to select a subset of the values. The
``OPTIM`` line can be repeated for the same KEY if you want to supply
different lower and upper bounds for different indices.

``OPTIM`` syntax:

.. code-block:: none

   OPTIM (indices)

``indices`` must be given in brackets, ``(indices...)``, and follow the order
of the values in the configuration file that defines the initial model.
Indices can be comma-separated lists, ranges ``a-b`` or strides ``a-b:s``
(e.g. ``(1,4,7-10:2)``). Indices start from 1.

The lower and upper bounds are given as floating-point numbers.

.. _performance_constraints:

Performance constraints
.......................

The simplest performance constraints are computed for energies, forces and
stresses with the *mean absolute error* (MAE), *root-mean-squared error*
(RMSE) and the coefficient of determination (*R²*). They are evaluated on the
validation set. These constraints use the default :ref:`loss_functions`
selected in the :term:`nested fitting configuration` file with the ``LOSS``
keyword.

Currently available built-ins are

* **Energy** ``PC_ERMSE`` ``PC_ErRMSE`` ``PC_MAE`` ``PC_ERSQ``
* **Force** ``PC_FRMSE``
* **Stress** ``PC_SRMSE``

Physical constraints
^^^^^^^^^^^^^^^^^^^^

**Physics-informed performance constraints** are introduced with the
``PC_LAMMPS`` directive in the :term:`nested fitting configuration` file.
Tadah!MLIP uses LAMMPS as the workhorse that turns a trial potential into the
macroscopic quantities of interest; each ``PC_LAMMPS`` line launches a
separate LAMMPS run.

Example
~~~~~~~

.. code-block:: none

   PC_LAMMPS --script in.mysim \
             --varloss myVar 0 100 \
             --varloss myOtherVar 145 10

``in.mysim`` is a regular LAMMPS input script that defines one or more
*equal-style* variables. ``--varloss myVar 0 100`` tells Tadah! to include
the variable *myVar* in the global loss with target 0 and weight 100. If no
loss type is given, the default ``LOSS`` from the configuration file is used;
you can override it for an individual variable by appending the loss name and
its parameters.

A long ``PC_LAMMPS`` line may be split with a trailing backslash (``\``) for
readability.

Required options
^^^^^^^^^^^^^^^^

``--script``
    LAMMPS input file containing the variable definitions.

``--varloss [loss-type p₁ p₂ …]``
    Evaluate the loss for this variable and target and add it to the total
    loss with the given weight. Optional extra tokens pick a specific loss
    for that variable. The variable named in ``--varloss`` must be defined in
    the LAMMPS script. Its value is read from LAMMPS at the end of the run.

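To make the role of the target and weight in ``--varloss`` concrete, the
following Python sketch shows how such terms could be combined into the
global loss :math:`L_{\text{total}}`. It is an illustration only, not
Tadah!MLIP's implementation: the function names are hypothetical, the loss
definitions are the common textbook forms (Tadah!MLIP's normalisation may
differ), the default Tukey constant is an arbitrary choice, ``RMSLE`` is
omitted, and the "measured" values are invented.

.. code-block:: python

   import math

   def l1(x):
       return abs(x)

   def l2(x):
       return x * x

   def huber(x, delta=1.0):
       # Quadratic for |x| <= delta, linear beyond (textbook form).
       a = abs(x)
       return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

   def tukey(x, c=4.685):
       # Tukey biweight: saturates at c^2/6, so outliers stop contributing.
       a = abs(x)
       if a >= c:
           return c * c / 6.0
       return (c * c / 6.0) * (1.0 - (1.0 - (a / c) ** 2) ** 3)

   def log_cosh(x):
       # Numerically stable log(cosh(x)).
       a = abs(x)
       return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

   LOSSES = {"L1": l1, "L2": l2, "HUBER": huber, "TUKEY": tukey,
             "LOG_COSH": log_cosh}

   def total_loss(terms, default="L2"):
       """Sum w_i * L_i(|y_i - yhat_i|) over all terms.

       Each term is (value, target, weight, loss_name_or_None, params).
       """
       total = 0.0
       for value, target, weight, name, params in terms:
           loss = LOSSES[name or default]
           total += weight * loss(abs(value - target), *params)
       return total

   # The two --varloss terms from the example above; the values 0.5 and
   # 147.0 stand in for what LAMMPS might report and are invented.
   terms = [
       (0.5, 0.0, 100.0, None, ()),     # --varloss myVar 0 100
       (147.0, 145.0, 10.0, None, ()),  # --varloss myOtherVar 145 10
   ]
   print(total_loss(terms, default="L2"))  # 100*0.25 + 10*4 = 65.0

With the default switched to ``L1`` the same residuals give 70.0, which
illustrates how strongly the choice of loss and weight shapes what the outer
optimizer considers a good potential.
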
Additional options
^^^^^^^^^^^^^^^^^^

``--invar``
    Inject a variable (equivalent to LAMMPS’ ``-var``), allowing the same
    script to be reused for several structures or pressures.

``--outvar``
    Record the variable in ``outvar.tadah`` without affecting the loss;
    useful for diagnostic plots.

``--failscore``
    Override the global ``FAILSCORE``. If LAMMPS crashes or the loss exceeds
    this value, the optimiser receives *FAILSCORE* for that evaluation.

.. _output_files:

Output files
............

Unless you change it with ``OUTDIR`` [work in progress...], every artefact
produced by the outer optimiser is written to **the directory in which you
started** ``tadah hpo``. The following groups of files are created:

Main log files
~~~~~~~~~~~~~~

Each file begins with a header line that starts with ``#`` and describes the
columns, followed by one line per logged iteration. The first column is
always the step number.

===================== ============================================================================================
File                  Columns
===================== ============================================================================================
``loss.tadah``        Wall-time, all individual loss terms followed by total loss.
``params.tadah``      The current hyperparameter vector in the order defined by your ``OPTIM`` lines.
``outvar.tadah``      Additional variables requested with ``--outvar`` plus all variables used in ``--varloss``.
===================== ============================================================================================

The rate of logging is controlled by

``OUTPUT N``
    write a new line every *N* iterations (default: ``10``).

Best-only snapshots
~~~~~~~~~~~~~~~~~~~

Whenever a new minimum of the global loss is found, the corresponding rows
are copied to companion files:

* ``best_loss.tadah``
* ``best_params.tadah``
* ``best_outvar.tadah``

These contain **one single line**: the best result so far. Use them to
monitor progress in real time without parsing the full history.

``BOUTPUT N``
    write a new line every *N* iterations (default: ``1``).

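The log layout described above (a ``#`` header, one row per logged
iteration, step number first, total loss last in ``loss.tadah``) can be read
with a few lines of Python. This is a minimal sketch under stated
assumptions: columns are taken to be whitespace-separated, the step is taken
to be written as an integer, and the column names and numbers in the sample
are invented for illustration. The helper names are hypothetical and not
part of Tadah!MLIP.

.. code-block:: python

   def parse_tadah_log(lines):
       """Return (header_columns, rows); each row is (step, [values...])."""
       header, rows = None, []
       for line in lines:
           line = line.strip()
           if not line:
               continue
           if line.startswith("#"):
               if header is None:
                   header = line.lstrip("#").split()
               continue
           fields = line.split()
           rows.append((int(fields[0]), [float(v) for v in fields[1:]]))
       return header, rows

   def best_row(rows):
       """Row with the smallest value in the last column (the total loss)."""
       return min(rows, key=lambda row: row[1][-1])

   # Invented sample in the assumed layout of loss.tadah.
   SAMPLE = """\
   # step walltime ermse myVar total
   0 1.5 0.020 12.0 12.020
   10 14.2 0.015 3.5 3.515
   20 27.9 0.018 6.0 6.018
   """

   header, rows = parse_tadah_log(SAMPLE.splitlines())
   step, values = best_row(rows)
   print(header[-1], "is lowest at step", step)

For a real run, replace ``SAMPLE.splitlines()`` with an open file handle,
for example ``open("loss.tadah")``, after checking the header against the
assumptions above.
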
Potential archives
~~~~~~~~~~~~~~~~~~

The current best potential is always saved to ``best_pot.tadah``. Trial
potentials can also be archived automatically.

``DUMP``
    Save **every** trial potential, every *N* iterations, to *DIR*. Files are
    named ``pot_.tadah``. The directory is created if needed. Set *N* = 0 to
    disable (default).

``BDUMP``
    Same as ``DUMP`` but only for improvements of the best potential. Useful
    when you care only about the Pareto front.

Practical tips
~~~~~~~~~~~~~~

* Large optimisation runs can generate thousands of potentials. Keep an eye
  on disk usage when enabling ``DUMP``.

* The log files are plain text and append-only. They can be tailed while the
  optimisation is running:

  .. code-block:: bash

     tail -f loss.tadah

* Each row in ``params.tadah`` matches the order of the ``OPTIM`` directives
  exactly. This makes it straightforward to resume a run or replay a specific
  parameter set with ``tadah train``.

.. _performance_parallelism:

Performance & parallelism
.........................

Tadah!MLIP comes in two flavours, and the parallel strategy you can exploit
depends on which one you compiled.

Desktop build (OpenMP)
~~~~~~~~~~~~~~~~~~~~~~

* **Inner loop** – regression and descriptor evaluation are OpenMP parallel.
  Set the number of threads in the usual way:

  .. code-block:: bash

     export OMP_NUM_THREADS=

* **Outer loop** – some optimisers can evaluate several hyperparameter sets
  concurrently. Control this with the ``THREADS`` key in the ``OPTIMIZER``
  block (currently supported by Dlib MaxLIPO+TR). Rule of thumb::

     THREADS × OMP_NUM_THREADS ≤ number of physical cores

  Exceeding this limit will not crash the run, but the OS will oversubscribe
  cores and overall performance will drop.

* **LAMMPS runs** – every ``PC_LAMMPS`` script is executed with a *serial*
  LAMMPS binary; Tadah!MLIP attempts to run several of them in parallel, each
  on a separate core (at most ``OMP_NUM_THREADS`` at a time).

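The rule of thumb for the desktop build can be turned into a short list of
settings worth benchmarking. The Python sketch below enumerates
power-of-two combinations of ``THREADS`` and ``OMP_NUM_THREADS`` that stay
within the core count and cannot be doubled without exceeding it. The helper
name is hypothetical, and restricting the sweep to powers of two is an
assumption made to keep the list short.

.. code-block:: python

   def candidate_splits(physical_cores):
       """Return (THREADS, OMP_NUM_THREADS) pairs that respect
       THREADS * OMP_NUM_THREADS <= physical_cores and leave fewer than
       half of the cores idle."""
       powers = []
       p = 1
       while p <= physical_cores:
           powers.append(p)
           p *= 2
       return [
           (threads, omp)
           for threads in powers
           for omp in powers
           if threads * omp <= physical_cores < 2 * threads * omp
       ]

   for threads, omp in candidate_splits(8):
       print(f"THREADS={threads}  OMP_NUM_THREADS={omp}")

On an 8-core machine this yields the four splits 1×8, 2×4, 4×2 and 8×1;
which one is fastest depends on the model and the training set, so each
should be timed on a short run.
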
MPI build
~~~~~~~~~

The MPI variant parallelises the *inner* regression across **all** ranks
(host–client pattern). It must be linked against the MPI version of LAMMPS.

Each LAMMPS calculation is spawned independently from the main communicator:
``--ncpu`` on a ``PC_LAMMPS`` line (for example ``--ncpu 8``) requests that
the given number of ranks form a mini-MPI world and execute the script.

Several ``PC_LAMMPS`` lines may run side by side, each with its own
``--ncpu`` value. The sum of all requested ranks must not exceed
*available ranks – 1* (one rank is reserved for the host).

.. note::

   The MPI launcher is functional but still experimental; improved error
   handling and dynamic load balancing are in development.

Example::

   srun -n 64 tadah hpo …                     # 64 MPI ranks available
   …
   PC_LAMMPS --script in.elastic --ncpu 8 …   # spawns mpirun -n 8 lammps …

Practical advice
~~~~~~~~~~~~~~~~

* For inexpensive models you usually gain more by increasing ``THREADS`` than
  by adding OpenMP threads, because context-switch overhead is lower.

* For very large training sets the regression dominates; in that case set
  ``THREADS = 1`` and devote the cores to OpenMP (desktop build) or MPI.

* Measure, do not guess: a few short test runs sweeping ``OMP_NUM_THREADS``
  and ``THREADS`` over {1, 2, 4, …} will quickly reveal the sweet spot on your
  machine.
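The rank budget for the MPI build (the sum of all ``--ncpu`` requests must
not exceed the available ranks minus the one reserved for the host) is easy
to check before submitting a job. The following Python sketch does only that
arithmetic; the function name is hypothetical, and it does not reproduce any
check that Tadah!MLIP itself may perform.

.. code-block:: python

   def check_rank_budget(total_ranks, ncpu_requests):
       """Return (fits, spare_ranks) for a list of --ncpu values.

       One rank is reserved for the host, so only total_ranks - 1 ranks
       are available to the PC_LAMMPS runs.
       """
       available = total_ranks - 1
       requested = sum(ncpu_requests)
       return requested <= available, available - requested

   # 64 ranks and a single PC_LAMMPS line with --ncpu 8, as in the example.
   print(check_rank_budget(64, [8]))       # (True, 55)

   # Two lines asking for 32 ranks each do not fit: 64 > 63.
   print(check_rank_budget(64, [32, 32]))  # (False, -1)
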