Labeling, testing, consistency

Labeling model versions

The MLRepo lets you label a specific model version. Labels mark models of interest, e.g. the model that goes into production, or the model with the best error measure as a candidate for a future release. Labels are not just a convenient way to identify a model without remembering its version number; other pailab methods build on them as well. As we will see in this tutorial, consistency checks are applied to all labeled models, regression tests may be defined using labeled models, and plotting methods include the labeled models and explicitly highlight them.

Setting a label is quite simple using the pailab.MLRepo.set_label() method:

        from pailab import LAST_VERSION
        ml_repo.set_label('prod', 'DecisionTreeRegressor/model',
                          model_version=LAST_VERSION, message='we found our first production model')

Note that we have used the LAST_VERSION keyword. Instead of the exact version string, nearly all methods that need a version also accept the LAST_VERSION and FIRST_VERSION keywords.

The set_label method simply creates an object of type pailab.repo_objects.Label and stores it in the repository. Listing all labels in the repo can therefore be done by using get_names again.


We can see which model and model version a label refers to by getting the label object and checking its name and version attributes:

>>label = ml_repo.get('prod')
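
To make the idea concrete, here is a minimal sketch of a label as a small object that stores a model name and a version, together with one possible way of resolving keywords such as LAST_VERSION against a version history. This is not pailab's implementation: TinyRepo, Label, and the keyword values below are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for pailab's version keywords (the values are assumptions).
LAST_VERSION = 'last'
FIRST_VERSION = 'first'


@dataclass
class Label:
    """A label is only a named pointer to one version of one model."""
    label_name: str
    name: str      # name of the labeled model
    version: str   # concrete version the label points to


class TinyRepo:
    """Toy in-memory repository; not pailab's MLRepo."""

    def __init__(self):
        self._versions = {}  # object name -> list of versions, oldest first
        self._labels = {}    # label name -> Label

    def add_version(self, name, version):
        self._versions.setdefault(name, []).append(version)

    def resolve(self, name, version):
        """Translate a version keyword into a concrete version string."""
        history = self._versions[name]
        if version == LAST_VERSION:
            return history[-1]
        if version == FIRST_VERSION:
            return history[0]
        if version not in history:
            raise KeyError(f'unknown version {version!r} for {name!r}')
        return version

    def set_label(self, label_name, name, model_version=LAST_VERSION):
        # The keyword is resolved once, when the label is set.
        label = Label(label_name, name, self.resolve(name, model_version))
        self._labels[label_name] = label
        return label

    def get_label(self, label_name):
        return self._labels[label_name]


repo = TinyRepo()
repo.add_version('DecisionTreeRegressor/model', 'v1')
repo.add_version('DecisionTreeRegressor/model', 'v2')
repo.set_label('prod', 'DecisionTreeRegressor/model', model_version=LAST_VERSION)
label = repo.get_label('prod')
print(label.name, label.version)
```

In this sketch the keyword is resolved when the label is set, so the label keeps pointing to the same concrete version even if newer versions are added later.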

Automated testing

Whether unit testing or regression testing makes sense for ML is much debated. Each team has to decide whether it fits its own project and problems; pailab supports automated testing for those who want to apply it.

A test basically consists of two parts:

  • a definition of the test, containing the type of test and the data and models for which tests are created,
  • the tests themselves, which are also jobs executed by the MLRepo’s internal JobRunner.


As a user, you normally only define the test using the respective test definition; you do not instantiate an object of the test class yourself.

Regression tests

We define a set of regression tests using a regression test definition. Here, pailab’s RegressionTest compares specified error measures of a model with the error measures of a reference model (typically the one in production, maybe labeled ‘prod’).

        reg_test =
            reference='prod', models=None, data=None, labels=None,
            measures=[MeasureConfiguration.MAX],  tol=1000) = 'reg_test'
        ml_repo.add(reg_test, message='regression test definition')
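
The comparison behind such a regression test can be sketched in a few lines. The function below is a hypothetical illustration, not pailab's RegressionTest: it assumes that a test fails when an error measure of the new model exceeds the reference model's value by more than tol, which is one plausible reading of the tolerance parameter. The measure values are made up.

```python
def regression_test(measures, reference_measures, tol):
    """Compare error measures of a model with those of a reference model.

    Returns a dict mapping each measure that got worse by more than `tol`
    to the pair (reference value, new value). An empty dict means success.
    """
    failed = {}
    for name, reference_value in reference_measures.items():
        value = measures[name]
        if value - reference_value > tol:
            failed[name] = (reference_value, value)
    return failed


# Made-up error measures for a reference model and a candidate model.
reference = {'max': 20.0, 'mse': 9.5}
candidate = {'max': 24.0, 'mse': 9.1}

print(regression_test(candidate, reference, tol=1000))  # very loose tolerance
print(regression_test(candidate, reference, tol=1.0))   # strict tolerance
```

With a tolerance as loose as 1000 the candidate passes; with a tolerance of 1.0 the increase of the maximum error from 20.0 to 24.0 is reported.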

We may run the tests by calling pailab.ml_repo.MLRepo.run_tests():

        tests = ml_repo.run_tests()

where tests is a list of tuples, each containing the name of the test as well as the respective version:

[('DecisionTreeRegressor/tests/reg_test/test_data', '5b71ad5a-516f-11e9-bf7c-fc084a6691eb'),
('DecisionTreeRegressor/tests/reg_test/training_data', '5b8b46ca-516f-11e9-990d-fc084a6691eb')]
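
The test names in this output suggest that one test definition is expanded into one test per combination of model and data set. The following sketch reproduces that naming pattern; the helper expand_test_definition is hypothetical, and the naming scheme is only inferred from the output shown here, not taken from pailab's source.

```python
from itertools import product


def expand_test_definition(test_name, models, data_sets):
    """Return one test name per (model, data set) combination.

    The scheme '<model namespace>/tests/<test name>/<data set>' is an
    assumption inferred from the example output.
    """
    names = []
    for model, data in product(models, data_sets):
        # 'DecisionTreeRegressor/model' -> 'DecisionTreeRegressor'
        namespace = model.rsplit('/', 1)[0]
        names.append(f'{namespace}/tests/{test_name}/{data}')
    return names


test_names = expand_test_definition(
    'reg_test',
    models=['DecisionTreeRegressor/model'],
    data_sets=['test_data', 'training_data'],
)
for name in test_names:
    print(name)
```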

The result attribute of the test object tells whether the test was successful or not:

>>test = ml_repo.get('DecisionTreeRegressor/tests/reg_test/test_data')

Consistency checks

A dedicated pailab submodule provides functionality to check for consistency and quality issues as well as for outstanding tasks (such as rerunning a training after the training set has been changed).

Model consistency

There are different checks for model consistency, e.g. whether the tests of a model are up to date and succeeded, or whether the latest model is trained on the latest training data. All model checks are performed for labeled models and the latest model only.

The following checks are performed:

  • Is the latest model calibrated on the latest parameters and training data?
  • Are all labeled models (including the latest model) evaluated on the latest available training and test data?
  • Are all measures of all labeled models computed on the latest data?
  • Have all tests been run on the labeled models?

        import as checker
        inconsistencies =

The variable inconsistencies contains a list of all inconsistencies found. In our case the list is currently empty since there are no inconsistencies:


Now we change a model parameter but do not start a new training:

        param = ml_repo.get('DecisionTreeRegressor/model_param')
        param.sklearn_params['max_depth'] = 2
        version = ml_repo.add(param)

We run the consistency check again:

[{'DecisionTreeRegressor/model:last': {'latest model version not on latest inputs':
{'DecisionTreeRegressor/model_param': {'modifier version': 'cdc3fed4-5192-11e9-a7fd-fc084a6691eb',
        'latest version': 'cfe1b9fa-5192-11e9-b360-fc084a6691eb'}}}}]

Now we get a list containing one dictionary that contains the model inconsistencies. In our case, the dictionary shows one inconsistency: there are model inputs on which the latest version of 'DecisionTreeRegressor/model' has not yet been calibrated. It also shows that the model parameter 'DecisionTreeRegressor/model_param' is the input that is newer than the one used in the latest model version.
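
The logic of this check can be sketched as a comparison between the input versions a model was built from and the latest versions in the repository. The function below is a hypothetical illustration that only mimics the structure of the output above; the name check_inputs_up_to_date and both of its dictionary arguments are assumptions and do not reflect pailab's internals.

```python
def check_inputs_up_to_date(model_name, used_versions, latest_versions):
    """Report inputs whose latest version differs from the one the model used.

    used_versions:   input name -> version the latest model was built from
    latest_versions: input name -> latest version stored in the repository
    """
    stale = {}
    for input_name, used in used_versions.items():
        latest = latest_versions[input_name]
        if used != latest:
            stale[input_name] = {'modifier version': used,
                                 'latest version': latest}
    if not stale:
        return {}
    return {model_name + ':last':
            {'latest model version not on latest inputs': stale}}


used = {'DecisionTreeRegressor/model_param': 'cdc3fed4-5192-11e9-a7fd-fc084a6691eb'}
latest = {'DecisionTreeRegressor/model_param': 'cfe1b9fa-5192-11e9-b360-fc084a6691eb'}

result = check_inputs_up_to_date('DecisionTreeRegressor/model', used, latest)
print(result)
```

If the used and latest versions agree for every input, the sketch returns an empty dictionary, which corresponds to a consistent model.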

We can fix this issue by running a new training:


Rerunning the training fixes this issue but leads to new ones. After retraining, the evaluation of the new model on the data sets and the computation of the defined error measures are missing:

[{'DecisionTreeRegressor/model:last': {
    'evaluations missing': {
        'training_data': '429f88ba-524d-11e9-98f6-fc084a6691eb',
        'test_data': '429f88ba-524d-11e9-98f6-fc084a6691eb'},
    'measures not calculated':

We may now fix these issues by first calling pailab.repo.MLRepo.run_evaluation() and then pailab.repo.MLRepo.run_measures(), or we can simply call run_evaluation only, setting the parameter run_descendants to True. By doing so, the MLRepo resolves all steps of the build pipeline following the model evaluation.

>>print( []
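
The effect of run_descendants can be pictured as walking a dependency graph of pipeline steps. The sketch below is purely illustrative: the PIPELINE dictionary, its step names, and the function steps_to_run are hypothetical and only mirror the order training, evaluation, measures described in this section.

```python
from collections import deque

# Hypothetical build pipeline: each step maps to the steps that consume its output.
PIPELINE = {
    'training': ['evaluation'],
    'evaluation': ['measures'],
    'measures': [],
}


def steps_to_run(start, run_descendants=False, pipeline=PIPELINE):
    """Return the steps executed when `start` is triggered.

    Without run_descendants only the step itself runs; with it, all steps
    that depend on it (directly or indirectly) run as well. The breadth-first
    order is sufficient for a simple chain like the one above.
    """
    if not run_descendants:
        return [start]
    order, queue, seen = [], deque([start]), {start}
    while queue:
        step = queue.popleft()
        order.append(step)
        for child in pipeline[step]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


print(steps_to_run('evaluation'))
print(steps_to_run('evaluation', run_descendants=True))
```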

Training and test data consistency

pailab also performs checks w.r.t. the training and test data. One such check is whether test and training data overlap. To illustrate this, we add a second test data set to the repo which overlaps with the training data. Note that we first run the evaluation on the new data set so that we do not again see the errors about missing evaluations or error measures for this data:

        test_data_2 = DataSet('raw_data/boston_housing', 0, 50,
                              repo_info={RepoInfoKey.NAME: 'test_data_2',
                                         RepoInfoKey.CATEGORY: MLObjectType.TEST_DATA})

Now, performing the check reveals the overlap:

[{'test_data_2': {'training and test data overlap': {'test_data_2': 'ecdc36ee-5465-11e9-92e2-fc084a6691eb', 'training_data': 'e6ac4eba-5465-11e9-b956-fc084a6691eb'}}}]
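
An overlap check of this kind can be sketched as an interval intersection on data sets that refer to the same raw data. The code below is a hypothetical illustration, not pailab's implementation: DataSlice and overlap are invented names, the index ranges are assumed to be half-open, and the ranges used for training_data and test_data are made up for the example (only the range 0 to 50 of test_data_2 comes from the text above).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSlice:
    """Hypothetical view on raw data: rows start (inclusive) to end (exclusive)."""
    raw_data: str
    start: int = 0
    end: Optional[int] = None  # None means "up to the last row"


def overlap(a, b):
    """Return True if both slices refer to the same raw data and share rows."""
    if a.raw_data != b.raw_data:
        return False
    a_end = float('inf') if a.end is None else a.end
    b_end = float('inf') if b.end is None else b.end
    return max(a.start, b.start) < min(a_end, b_end)


# The ranges of training_data and test_data are assumptions for illustration.
training_data = DataSlice('raw_data/boston_housing', 0, 300)
test_data = DataSlice('raw_data/boston_housing', 300, None)
test_data_2 = DataSlice('raw_data/boston_housing', 0, 50)

print(overlap(training_data, test_data))
print(overlap(training_data, test_data_2))
```

Under these assumed ranges, test_data starts where training_data ends and is therefore disjoint, while test_data_2 lies completely inside the training range.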


We may also check the overall test status. Here we have to call

        'Test for model DecisionTreeRegressor/model, version 323c05e8-5483-11e9-88ea-fc084a6691eb on latest data test_data missing.',
        'Test for model DecisionTreeRegressor/model, version 323c05e8-5483-11e9-88ea-fc084a6691eb on latest data test_data_2 missing.',
        'Test for model DecisionTreeRegressor/model, version 323c05e8-5483-11e9-88ea-fc084a6691eb on latest data training_data missing.',
        'Test for model DecisionTreeRegressor/model, version 301bcc1e-5483-11e9-82a2-fc084a6691eb on latest data test_data_2 missing.'

We see that the latest model as well as the model labeled ‘prod’ give us some errors since the regression tests have not been run on all data sets. We can fix these messages by simply calling run_tests: