Repo initialization, training, evaluation

Creating a new repository

We first create a new repository for our task. The repository is the central object around which all functionality is built. Similar to a repository used for source control in classical software development, it contains all data and algorithms needed for the machine learning task. The repository needs storages for

  • numerical objects such as arrays and matrices representing data, e.g. the input data or data from the evaluation of the models,
  • small objects (or the parts of objects that remain after the numerical data has been cut out), e.g. training parameters, model parameters.

To keep things simple, we may use the default constructor of the MLRepo, which creates in-memory storages.

        from pailab.ml_repo.repo import MLRepo

        ml_repo = MLRepo(user='test_user')

Note that the in-memory storages used in this tutorial are useful for testing or playing around but may not be your choice for real-life applications (unless you are willing to start your work again after your computer has been rebooted :-) ). Instead, you could use a simple storage keeping the small data in files on disk (e.g. in pickle or JSON format) while saving the numpy data in HDF5 files. In this case, you have to specify this in a respective configuration dictionary.

        config = {'user': 'test_user',
                  'workspace': repo_path,
                  'repo_store':
                  {
                      'type': 'disk_handler',
                      'config': {
                          'folder': repo_path,
                          'file_format': 'pickle'
                      }
                  },
                  'numpy_store':
                  {
                      'type': 'hdf_handler',
                      'config': {
                          'folder': repo_path,
                          'version_files': True
                      }
                  },
                  'job_runner':
                  {
                      'type': 'simple',
                      'config': {
                          'throw_job_error': True
                      }
                  }
                  }
        ml_repo = MLRepo(user='test_user', config=config)

In addition to the storages, the repository needs a reference to a JobRunner which the platform can use to execute the different jobs needed during your ML development process. As long as we do not specify another JobRunner, the MLRepo uses the most simple one, pailab.job_runner.job_runner.SimpleJobRunner, as default; it executes everything sequentially in the same thread the repository runs in. There are two possibilities to set the JobRunner. You may use the configuration settings as shown above; in this case, the pailab.job_runner.job_runner_factory.JobRunnerFactory is used to create the respective JobRunner within MLRepo’s constructor. Alternatively (e.g. if you implemented your own JobRunner and do not want to integrate it into the factory), you may simply instantiate the respective JobRunner and set it as the MLRepo’s job runner attribute:

        from pailab.job_runner.job_runner import SimpleJobRunner

        job_runner = SimpleJobRunner(None)
        job_runner.set_repo(ml_repo)
        ml_repo._job_runner = job_runner

Note

The MLRepo uses the pailab.job_runner.job_runner.SimpleJobRunner as default; you only have to set a JobRunner as shown above if you want to use a different one.

Adding training and test data

The data in the repository is handled by two different data objects:

  • pailab.ml_repo.repo_objects.RawData is the object containing the real data.
  • pailab.ml_repo.repo.DataSet is the object containing the logical data, i.e. a reference to a RawData object together with a specification of which data from the RawData will be used. Here, one can pin a fixed version of the underlying RawData object (then changes to the RawData will not affect the derived DataSet) and select a fixed or floating subset of the RawData by defining start and end indices that cut the derived data out of the original data.

Normally, we will use DataSet objects for training and testing. So, we first have to add the data in the form of a RawData object and then define the respective DataSets based on this RawData.

Adding RawData

We first read the data from a csv file using pandas.

        import pandas as pd
        data = pd.read_csv('./examples/boston_housing/housing.csv')

Now data holds a pandas DataFrame, and we have to extract the respective x and y values as numpy matrices in order to create the RawData object.

        input_variables = ['RM', 'LSTAT', 'PTRATIO']
        target_variables = ['MEDV']
        x = data.loc[:, input_variables].values
        y = data.loc[:, target_variables].values

Using the numpy objects we can now create the RawData object and add it to the repo.

        from pailab import RawData, RepoInfoKey

        raw_data = RawData(x, input_variables, y, target_variables, repo_info={
                           RepoInfoKey.NAME: 'raw_data/boston_housing'})
        ml_repo.add(raw_data)

Adding DataSet

Now, based on the RawData, we can add the training and test data sets.

        from pailab.ml_repo.repo import DataSet, MLObjectType

        # create DataSet objects for training and test data
        training_data = DataSet('raw_data/boston_housing', 0, 300,
                                repo_info={RepoInfoKey.NAME: 'training_data', RepoInfoKey.CATEGORY: MLObjectType.TRAINING_DATA})
        test_data = DataSet('raw_data/boston_housing', 301, None,
                            repo_info={RepoInfoKey.NAME: 'test_data', RepoInfoKey.CATEGORY: MLObjectType.TEST_DATA})
        # add the objects to the repository
        version_list = ml_repo.add(
            [training_data, test_data], message='add training and test data')
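
As described above, a DataSet may also be pinned to one fixed version of the underlying RawData. The following is a minimal, hypothetical sketch of this; the keyword raw_data_version and the repo_info.version attribute are assumptions about the pailab interfaces and may differ in your version:

        # hypothetical sketch: pin a DataSet to one fixed version of the raw data;
        # 'raw_data_version' is an assumed keyword, check your pailab version
        raw = ml_repo.get('raw_data/boston_housing')  # latest version of the raw data
        pinned_training_data = DataSet('raw_data/boston_housing', 0, 300,
                                       raw_data_version=raw.repo_info.version,
                                       repo_info={RepoInfoKey.NAME: 'training_data_pinned',
                                                  RepoInfoKey.CATEGORY: MLObjectType.TRAINING_DATA})
        ml_repo.add(pinned_training_data, message='training data pinned to a fixed raw data version')

Changes to the RawData will then no longer affect this pinned DataSet.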

Note

We have to define the type of the object by setting a value from pailab.ml_repo.repo.MLObjectType for the RepoInfoKey.CATEGORY key. The category is used by the MLRepo to support certain automations and checks.

The version_list variable is a dictionary that maps the names of the added objects to their versions.
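
For instance, it can be inspected directly (the concrete version strings will differ from run to run):

        # print the name and the newly created version of each added object
        for name, version in version_list.items():
            print(name, version)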

Adding a model

The next step in doing machine learning is to define the model which will be used in the repository. A model consists of the following pieces:

  • a function for evaluating the model
  • a function for model training
  • a model parameter object holding the model parameters
  • a training parameter object defining training parameters

We would like to use the DecisionTreeRegressor from sklearn in our example below. In this case, we do not have to define the pieces listed above ourselves, since pailab provides a simple interface to sklearn in the module pailab.externals.sklearn_interface. This interface provides a method pailab.externals.sklearn_interface.add_model() to add an arbitrary sklearn model as a model which can be handled by the repository. Internally, add_model() creates a pailab.repo_objects.Model object defining the objects listed above and adds it to the repository. We refer to Add a model for details on setting up the model object and to the section on integrating models for details on how to integrate your own algorithm or other external ML platforms.

        import pailab.externals.sklearn_interface as sklearn_interface
        from sklearn.tree import DecisionTreeRegressor
        sklearn_interface.add_model(
            ml_repo, DecisionTreeRegressor(), model_param={'max_depth': 5})
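
As a quick check, we can list the model definitions now stored in the repo; the category name MLObjectType.MODEL is an assumption, chosen consistently with the categories used elsewhere in this tutorial:

        # list all model definitions stored in the repo
        # (MLObjectType.MODEL is assumed to be the category of Model objects)
        print(ml_repo.get_names(MLObjectType.MODEL))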

Train the model

Now, model training is very simple, since we have defined training and test data, methods to evaluate and fit the model, and the model parameters. So, you can just call pailab.ml_repo.repo.MLRepo.run_training() on the repository, and the training is performed automatically. The training job is executed via the JobRunner you specified when setting up the repository. All methods of the repository that involve jobs return the job id when adding the job to the JobRunner, so that you can check the status of the task and see whether it finished successfully.

        job_id = ml_repo.run_training()

The variable job_id contains a tuple with the name under which the job is stored in the repo and the respective version:

>> print(job_id)
('DecisionTreeRegressor/jobs/training', '69c7ce4a-512a-11e9-ab9f-fc084a6691eb')

This information can be used to retrieve the underlying job object. The job object contains useful information such as the status of the job (i.e. whether it is waiting, running, or finished), the time the job was started, and messages of errors that occurred during execution:

>> job = ml_repo.get(job_id[0], job_id[1])
>> print(job.state)
finished
>> print(job.started)
2019-03-28 08:23:41.668922

Note

Jobs are only executed if they have not yet been run on the given input, so if we call run_training again, we get a message that the job has already been run:

>> job_id = ml_repo.run_training()
>> print(job_id)
No new training started: A model has already been trained on the latest data.

We can verify that the training was successful by checking whether a calibrated model object for the specified model has been created. For this, we simply list the names of all objects of the category MLObjectType.CALIBRATED_MODEL stored within the repo using the pailab.ml_repo.repo.MLRepo.get_names() method:

>> print(ml_repo.get_names(MLObjectType.CALIBRATED_MODEL))
['DecisionTreeRegressor/model']

As we see, an object with name 'DecisionTreeRegressor/model' has been created and stored in the repo.
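
If needed, the calibrated model can be retrieved like any other object. A minimal sketch, assuming (as in the examples below) that get() without an explicit version returns the latest one:

        # fetch the latest calibrated model and inspect how it was created
        calibrated_model = ml_repo.get('DecisionTreeRegressor/model')
        print(calibrated_model.repo_info.modification_info)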

Model evaluation and error measurement

Evaluate a model

To measure errors and to provide plots, the model must be evaluated on all test and training datasets. This can simply be accomplished by calling pailab.ml_repo.repo.MLRepo.run_evaluation().

        job_id = ml_repo.run_evaluation()

This method applies the model’s evaluation method to all test and training data stored in the repository and also stores the results. Similar to the model training, we may list all results using the get_names method:

>> print(ml_repo.get_names(MLObjectType.EVAL_DATA))
['DecisionTreeRegressor/eval/test_data', 'DecisionTreeRegressor/eval/training_data']

As we see, there are two objects containing the evaluation of the model, one for each dataset stored. Note that we can check which model and which data have been used to create these evaluations. We just have to look at the modification_info attribute of the repo_info data attached to each object stored in the MLRepo:

>> eval_data = ml_repo.get('DecisionTreeRegressor/eval/test_data')
>> print(eval_data.repo_info.modification_info)
{'DecisionTreeRegressor/model': '69c86a46-512a-11e9-b7bd-fc084a6691eb',
'DecisionTreeRegressor': '687c5da8-512a-11e9-b0b4-fc084a6691eb',
'test_data': '6554763b-512a-11e9-938e-fc084a6691eb',
'eval_sklearn': '687bc058-512a-11e9-8b3e-fc084a6691eb',
'DecisionTreeRegressor/model_param': '687c5da7-512a-11e9-99c4-fc084a6691eb'}

The modification_info attribute is a dictionary that maps all objects involved in the creation of the respective object to the versions that were used to derive it. We can directly see the version of the calibrated model 'DecisionTreeRegressor/model' as well as the version of the underlying data set 'test_data'.
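
This can also be used programmatically, e.g. to look up the version of the test data that entered the evaluation:

        # version of the test data used to compute this evaluation
        test_data_version = eval_data.repo_info.modification_info['test_data']
        print(test_data_version)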

Define error measures

Now we may add certain error measures to the repository:

        from pailab import MeasureConfiguration  # assumed top-level import, as for RawData above

        ml_repo.add_measure(MeasureConfiguration.MAX)
        ml_repo.add_measure(MeasureConfiguration.R2)

These measures can then be computed by

        job_ids = ml_repo.run_measures()

As before, we get an overview of all measures computed and stored in the repository (as repo objects, see pailab.ml_repo.repo_objects.measure) using the get_names method:

>> print(ml_repo.get_names(MLObjectType.MEASURE))
['DecisionTreeRegressor/measure/test_data/max',
'DecisionTreeRegressor/measure/test_data/r2',
'DecisionTreeRegressor/measure/training_data/max',
'DecisionTreeRegressor/measure/training_data/r2']

Retrieving measures

The computed value is stored in the measurement object in the attribute value. The following code from the tutorial test file prints the value of the measurement:

.. literalinclude:: ../../tests/tutorial_test.py
   :language: python
   :start-after: get measures
   :end-before: end getting measures
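
If you do not have the test file at hand, the following minimal sketch shows the same idea; the measure name is taken from the get_names output above, and we assume again that get() returns the latest version:

        # retrieve the maximum error on the test data and print its value
        measure = ml_repo.get('DecisionTreeRegressor/measure/test_data/max')
        print(measure.value)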

Creating a list of all objects

One can simply get an overview of all objects stored in the repository by calling pailab.ml_repo.repo.MLRepo.get_names() to retrieve a list of names of all objects of a specific category (see pailab.ml_repo.repo.MLObjectType). The following lines loop over all categories and print the names of all objects within each category contained in the repository.
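
A minimal sketch of such a loop, assuming that MLObjectType is a standard Python Enum and can therefore be iterated:

        # loop over all object categories and print the objects in each one
        for category in MLObjectType:
            names = ml_repo.get_names(category)
            if len(names) > 0:
                print(str(category) + ': ' + str(names))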