Basics

In this section we explain pailab’s basic building blocks. For a quick introduction to working with pailab and a first impression of its functionality, we refer to the tutorial Repo initialization, training, evaluation.

Overview

pailab’s core is the pailab.ml_repo.repo.MLRepo class which is what we call the machine learning repository. The repository stores and versions all objects needed in the machine learning development cycle. There are three fundamental differences to version control systems such as git or svn for classical software development:

  • Instead of source code, objects are checked into the repository. Here, each object must inherit from pailab.ml_repo.repo_objects.RepoObject, or at least implement the respective methods, so that it can be handled by the repository. Furthermore, each such object belongs to a certain category (pailab.ml_repo.repo.MLObjectType), so that the repository may perform certain checks and automatize the ML build pipeline.
  • Each object is split into a part with standard data and a part with (large) numerical data and both parts are stored separately in different storages. Here, the normal data is stored in a storage derived from pailab.ml_repo.repo_store.RepoStore whereas the numerical data is stored via a pailab.ml_repo.repo_store.NumpyStore.
  • The execution of different jobs such as model training and evaluation or error computation is triggered via the MLRepo. Here, the MLRepo simply uses a JobRunner to execute the jobs.

As we see, we need three ingredients to initialize an MLRepo:

  • RepoStore (handles the object data)
  • NumpyStore (handles numpy part of the object data)
  • JobRunner (runs the jobs such as training or evaluation)
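The storage split described above can be pictured with a small sketch (plain Python, not pailab’s actual implementation): the numerical arrays of an object go to the NumpyStore, everything else to the RepoStore.

```python
import numpy as np

# Toy illustration of the storage split (not pailab's actual code):
# numpy arrays are the 'big' part, all other attributes the 'small' part.
def split_object(obj_dict):
    small, big = {}, {}
    for name, value in obj_dict.items():
        if isinstance(value, np.ndarray):
            big[name] = value    # would go to the NumpyStore
        else:
            small[name] = value  # would go to the RepoStore
    return small, big

small, big = split_object({'name': 'training_data',
                           'x_data': np.zeros((1000, 3))})
```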

Basic functionality

The MLRepo offers four main functionalities: adding objects (add), retrieving objects (get), listing object names (get_names) and running jobs (run).

All other methods just use these four and provide a little more convenience for daily work.

Add objects

When adding an object to the repository, the MLRepo automatically adds additional information to the object, stored in the attribute repo_info. The repo_info attribute is an instance of pailab.ml_repo.repo_objects.RepoInfo and contains information such as the version number (autogenerated by the MLRepo), the object’s name, a description of the object, a commit message and the author, i.e. the user who added the object. So, to add an object to an instance ml_repo of the MLRepo class you just call:

ml_repo.add(obj, message = "just an example")

which adds the object; the message is attached to the object in the repo_info attribute as well. Note that the method returns the version that was assigned to the object, so that after calling:

version = ml_repo.add(obj, message = "just an example")

version contains the version number after the execution.

We can also add multiple objects at the same time by passing them in a list:

versions = ml_repo.add([obj1, obj2], message = "just an example")

where versions is now a list of the versions assigned to the different objects added.

Get objects

To retrieve objects from the storage one can use the pailab.ml_repo.repo.MLRepo.get() method. There are different possibilities to specify which object to get. If you want to retrieve a specific version of an object, you can call get with the object’s name and the desired version:

obj = ml_repo.get('obj_name', version = 'aasfdg-111-ezrhf')

If you want to retrieve the first or last version without typing the specific cryptic version string, you may use the keywords pailab.ml_repo.repo_store.RepoStore.FIRST_VERSION and pailab.ml_repo.repo_store.RepoStore.LAST_VERSION:

obj = ml_repo.get('obj_name', version = RepoStore.FIRST_VERSION)

returns the first version of the object with name obj_name. Instead of specifying a single version, we can also retrieve several versions at once by passing a list of versions:

objs = ml_repo.get('obj_name', version = ['aasfdg-111-ezrhf', RepoStore.FIRST_VERSION])

returns a list of the different versions of the object with name obj_name.
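The keyword resolution can be pictured with a toy sketch (hypothetical version strings, not pailab’s implementation): the keywords simply map to the oldest and newest entry of the object’s version history.

```python
# Toy sketch of FIRST_VERSION/LAST_VERSION resolution (not pailab's code).
FIRST_VERSION = 'first'
LAST_VERSION = 'last'

# version history of one object, oldest first (hypothetical strings)
versions = ['aasfdg-111-ezrhf', 'bbhuuu-123-ooo', 'cchhh-456-xyz']

def resolve(version):
    """Map a keyword (or a concrete version string) to a concrete version."""
    if version == FIRST_VERSION:
        return versions[0]
    if version == LAST_VERSION:
        return versions[-1]
    return version
```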

Another frequent use case is to retrieve an object that has been created from another object with a specific version. For example, you may be interested in the mean squared error (mse) of a specific model version. You can then use get to retrieve the mse object belonging to the specified model by passing the modifier_versions argument:

obj = ml_repo.get('obj_name', version=None,  modifier_versions={'obj2_name': 'aasfdg-111-ezrhf'})

which returns the object with name obj_name in the version that has been constructed using the object ‘obj2_name’ with the specified version. Analogously, one can also ask for all objects which have been constructed using an object whose version lies in a list of specified versions, i.e.:

objs = ml_repo.get('obj_name', version=None,  modifier_versions={'obj2_name': ['aasfdg-111-ezrhf', 'bbhuuu-123-ooo']})
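The modifier_versions filter can be pictured as follows (a toy sketch with hypothetical entries, not pailab’s implementation): each stored version remembers which object versions it was built from, and only the matching entries are returned.

```python
# Toy sketch of modifier_versions filtering (not pailab's code).
stored = [
    {'version': 'v1', 'modifiers': {'obj2_name': 'aasfdg-111-ezrhf'}},
    {'version': 'v2', 'modifiers': {'obj2_name': 'bbhuuu-123-ooo'}},
]

def get_by_modifier(entries, modifier_versions):
    """Keep entries whose modifiers match; values may be one version or a list."""
    result = []
    for entry in entries:
        ok = True
        for name, wanted in modifier_versions.items():
            allowed = wanted if isinstance(wanted, list) else [wanted]
            if entry['modifiers'].get(name) not in allowed:
                ok = False
        if ok:
            result.append(entry)
    return result
```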

Important

If an object does not exist, the method may either throw an exception or return an empty list, depending on the argument throw_error_not_exist, i.e.:

obj = ml_repo.get('obj_name', version = 'aasfdg-111-ezrhf', throw_error_not_exist = True)

throws an exception if an object with this name and version does not exist.

If we just want to check that exactly one object with a specified version or modification information exists, we may call the method with the argument throw_error_not_unique set to True, which means that an exception is thrown if there is more than one object satisfying the condition.

As we have discussed, an object is split into two parts that are stored separately: The ‘small’ data and the ‘big’ data of the object. By default, the get method returns the object leaving out the big data part. If one wants to get the complete object, one must set the argument full_object to True:

ml_repo.get('obj_name', version = 'aasfdg-111-ezrhf', full_object = True)

Getting names

To list all objects of a certain category stored in the repo, we can use pailab.ml_repo.repo.MLRepo.get_names(), e.g. to get the names of all test data objects in the repo:

names = ml_repo.get_names(MLObjectType.TEST_DATA)

Running Jobs

All functions which may be called from the MLRepo, e.g. the function to train a certain model, are also objects stored within the repo. Typically, these objects inherit from pailab.ml_repo.repo.MLRepo.Job. To run such a job using the repo, you can use the pailab.repo.MLRepo.run() method:

job_info = ml_repo.run(job)

This line of code will add the Job object job to the repo and then call the add method of the ml_repo’s internal JobRunner. The job_info variable will either be a tuple of the job’s name and the job’s version number in the repo, or a string containing a message that no input data has been changed since the last run of the job.

Note

Before the run method adds the job to the repo, it checks whether the job has a method check_rerun and, if so, calls it to decide whether the job really has to be executed. The job is run only if check_rerun returns True; if it returns False, the job is not executed.
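Schematically, the check_rerun dispatch described in the note looks like this (hypothetical job classes, not pailab’s implementation):

```python
# Sketch of the check_rerun contract (not pailab's actual code).
class AlwaysRunJob:
    def run(self):
        return 'executed'

class CachedJob:
    def check_rerun(self):
        return False  # pretend no input data changed since the last run

    def run(self):
        return 'executed'

def run_job(job):
    # only jobs providing check_rerun can be skipped
    if hasattr(job, 'check_rerun') and not job.check_rerun():
        return 'job skipped, no rerun necessary'
    return job.run()
```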

Tip

Note that for many situations, such as training or evaluating a model, there are respective convenience methods wrapping pailab.ml_repo.repo.MLRepo.run() which should be preferred. However, after creating the needed Job object, all these methods end up calling run.

Add a model

A model is defined by specifying

  • preprocessing methods (in a certain order, optional)
  • a function to evaluate the calibrated model on a given dataset
  • a function to train the model given data, training and model parameter
  • training parameter
  • model parameter (optional)

This specification is done by setting the identifiers of the respective objects in an instance of pailab.repo_objects.Model. Let us assume you have added a preprocessor named 'uniform_scaling', a function to evaluate the model named 'eval', a function to train the model named 'train' and training parameters named 'training_param'. You can then add a new model to the repo by:

model = repo_objects.Model(preprocessors = 'uniform_scaling',
                        eval_function='eval', train_function='train',
                        training_param='training_param')
model.repo_info.name = 'name_of_model'
ml_repo.add(model, message='my first model')

As this example shows, you do not have to specify model parameters if your training function does not need them. Likewise, you do not have to specify preprocessors if you do not want to apply any preprocessing technique. Another possibility is to use the method add_model, which may be more convenient:

ml_repo.add_model('name_of_model', model_eval = 'eval', model_training = 'train',
                training_param = 'training_param',
                preprocessors = 'uniform_scaling')

If there is only one evaluation function in the MLRepo and you want to use it, the method add_model will do the job for you, i.e. just do not specify the evaluation function:

ml_repo.add_model('name_of_model', model_training = 'train',
                training_param = 'training_param',
                preprocessors = 'uniform_scaling')

which now checks if there is only one eval function in the ml_repo and, in this case, uses this unique function. If there is more than one such function or none at all, an exception is thrown. This logic applies to all members except the preprocessors: if you do not define any preprocessing, no preprocessing will be applied. So, to specify a model, assuming that the respective components are in the repository and using the preprocessing from above, we may call:

ml_repo.add_model('name_of_model', preprocessors = 'uniform_scaling')
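The uniqueness rule used by add_model can be sketched as follows (illustration only, not pailab’s code): a member is filled in automatically only if exactly one candidate of the required category exists.

```python
# Sketch of the 'use the unique candidate or fail' rule (not pailab's code).
def get_unique(candidates, category):
    if len(candidates) != 1:
        raise Exception('expected exactly one %s, found %d'
                        % (category, len(candidates)))
    return candidates[0]

eval_functions = ['eval']  # names of eval functions in the repo
chosen = get_unique(eval_functions, 'eval function')
```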

Setting up an MLRepo

In-memory

The easiest way to start using pailab is to instantiate MLRepo using all defaults except the user, which must be specified (otherwise an exception is thrown):

        ml_repo = MLRepo(user='test_user')

This results in an MLRepo that handles everything in memory only, using pailab.ml_repo.memory_handler.RepoObjectMemoryStorage and pailab.ml_repo.memory_handler.NumpyMemoryStorage, so that after closing the MLRepo all data will be lost. Therefore, this should only be considered for testing or quick-and-dirty prototyping. Note that in this case the JobRunner used is the pailab.job_runner.job_runner.SimpleJobRunner, which simply runs all jobs sequentially on the local machine, in the same python thread in which the MLRepo has been constructed (synchronously).

Disk

To initialize an MLRepo so that the objects are stored on disk, we need to setup the respective storages within the MLRepo. One way to achieve this is to define the respective configurations in a dictionary and initialize the MLRepo with this dictionary. An example for such a configuration is given by

        config = {
            'user': 'test_user',
            'workspace': 'tmp',
            'repo_store':
            {
                'type': 'disk_handler',
                'config': {
                    'folder': 'tmp/objects',
                    'file_format': 'json'
                }
            },
            'numpy_store':
            {
                'type': 'hdf_handler',
                'config': {
                    'folder': 'tmp/repo_data',
                    'version_files': True
                }
            },
            'job_runner':
            {
                'type': 'simple',
                'config': {}
            }
        }

First we see that there is a user and also a workspace defined in the dictionary. The workspace is a directory where the configuration and settings are stored, so that when you instantiate the MLRepo again, you just need to specify the workspace and not the whole settings again. The RepoStore used within the MLRepo is defined via the dictionary belonging to the repo_store key. Here we see that the configuration consists of the type of store (here we use the disk_handler, which simply stores the objects on disk) and the settings for this storage. In our example the objects are stored in json format in the folder tmp/objects. The NumpyStore is configured so that the big data will be stored in hdf5 files in the folder tmp/repo_data.

Now we simply instantiate the MLRepo using this configuration.

        ml_repo = MLRepo(config=config)

To instantiate the MLRepo and directly save the respective config you have to set the parameter save_config

        ml_repo = MLRepo(config=config, save_config=True)

Having saved the config, you may instantiate the MLRepo another time simply by:

        ml_repo = MLRepo(workspace='tmp')

git

The previous example stored the objects simply as json files on disk. There is also the possibility to use git to manage the files: just change the type in the configuration dictionary above from disk_handler to git_handler. The MLRepo will then use the pailab.ml_repo.git_handler.RepoObjectGitStorage as its RepoStore.

If you have a remote git repository which you want to use, you have to clone the repository first and then specify the directory of the cloned repo as directory of the git_handler.

Integrate a new model

As we have seen in Add a model, a model needs

  • preprocessing methods (optional)
  • a function to evaluate the calibrated model on a given dataset
  • a function to train the model given data, training and model parameter
  • training parameter
  • model parameter (optional)

When a model is fully specified, you can call pailab.repo.MLRepo.run_training() to train the model. The calibrated model resulting from the training function is then stored within the repo under the category CALIBRATED_MODEL. Therefore, to integrate a new model, you have to specify the two functions to train and evaluate the model, the training parameter class (and, if needed, the model parameter class) and finally the object containing the calibrated model.

Example

In this section we show one way to get a custom model into pailab using a very simple, illustrative example. Let us assume that you have implemented a new ml algorithm encapsulated in a class SuperML:



import numpy as np

from pailab.repo_objects import repo_object_init


class SuperML:
    @repo_object_init()
    def __init__(self):
        self._value = None

    def train(self, data_x, data_y, median=True):
        if median:
            self._value = np.median(data_y)
        else:
            self._value = np.mean(data_y)

    def eval(self, data):
        return self._value

This algorithm uses either the median or the mean of the given targets as a forecast, depending on the Boolean argument median (a very simple algorithm :-) ). Note that we have already used the @repo_object_init() decorator from pailab.repo_objects to add the methods and attributes needed to use this class within the repo. Another possibility would have been to implement the respective methods and attributes within the class ourselves; see pailab.repo_objects.RepoObject, which defines the respective interface. A third alternative would have been to implement a wrapper class which simply contains our SuperML class and implements all RepoObject functionality (an example of this can be found in the pailab.externals.tensorflow_keras_interface module). Which way you choose depends on your taste as well as on whether the ML algorithm is your own development or whether you are just wrapping another ML module.

So, what is still missing to put the new model into the MLRepo are the eval and train functions as well as the class storing the training parameter (which is simply one Boolean):



class SuperMLTrainingParam:
    @repo_object_init()
    def __init__(self):
        self.median = True


def train(training_param, data_x, data_y):
    result = SuperML()
    result.train(data_x, data_y, training_param.median)
    return result


def eval(model, data):
    return model.eval(data)
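Putting the pieces together, the wiring can be checked end to end outside the repo (decorators and repo plumbing omitted; the free eval function is renamed eval_model here to avoid shadowing Python’s builtin):

```python
import numpy as np

# plain stand-ins for the classes and functions above (no repo plumbing)
class SuperML:
    def __init__(self):
        self._value = None

    def train(self, data_x, data_y, median=True):
        self._value = np.median(data_y) if median else np.mean(data_y)

    def eval(self, data):
        return self._value

class SuperMLTrainingParam:
    def __init__(self):
        self.median = True

def train(training_param, data_x, data_y):
    result = SuperML()
    result.train(data_x, data_y, training_param.median)
    return result

def eval_model(model, data):
    return model.eval(data)

param = SuperMLTrainingParam()
param.median = False  # use the mean instead of the median
model = train(param, None, np.array([1.0, 2.0, 3.0]))
print(eval_model(model, None))  # 2.0
```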

After we have defined the respective functions, we have to expose them to the MLRepo. Here, we could either construct the respective pailab.repo_objects.Function objects using the special categories MODEL_EVAL_FUNCTION and TRAINING_FUNCTION, or just use pailab.repo.MLRepo.add_eval_function() and pailab.repo.MLRepo.add_training_function():

            ml_repo.add_eval_function(eval,
                                      repo_name='my_eval_func')
            ml_repo.add_training_function(train, repo_name='my_training_func')

In addition, we add a first set of training parameters:

            training_param = SuperMLTrainingParam()
            training_param.median = True
            ml_repo.add(
                training_param, message='my first training parameter for my own super ml algorithm')

Finally, we call add_model to define the overall model. Since our repo contains only one eval function and one training function as well as one unique training parameter object, we may call add_model without specifying them:

            ml_repo.add_model('my_model')

Important

The MLRepo’s underlying RepoStore needs to store the objects of the model and training parameter classes. For this, the RepoStore calls the method to_dict to obtain a dictionary of the object’s attributes, which is then serialized. Here, one has to take care that all entries of this dictionary are serializable w.r.t. the RepoStore’s format. If your classes use only simple standard python types, you should not need to adjust anything.
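The idea can be illustrated with a minimal sketch (not pailab’s actual serializer): as long as to_dict returns only plain python types, serialization, e.g. to json, works out of the box.

```python
import json

# minimal to_dict illustration (not pailab's actual serialization code)
class SuperMLTrainingParam:
    def __init__(self):
        self.median = True

    def to_dict(self):
        # plain python types only, hence serializable
        return dict(vars(self))

param = SuperMLTrainingParam()
serialized = json.dumps(param.to_dict())
```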