ml_repo

class MLRepo(workspace=None, user=None, config=None, save_config=False)

Repository for doing machine learning

The repository and its extensions provide a solid foundation for machine learning work, supporting features such as:
  • auditing/versioning of the data, models and tests
  • best practice standardized plotting for investigating model performance and model behaviour
  • automated quality checks
Parameters:
  • workspace ([type]) – [description]. Defaults to None.
  • user (str) – the user. Defaults to None.
  • config (dict) – the configuration to use. Defaults to None.
  • save_config (bool) – determines whether to save the configuration or not. Defaults to False.
add(repo_object, message='', category=None)

Add a repo_object or list of repo objects to the repository.

Raises an exception if the category is defined neither in the object itself nor via the category argument. It also raises an exception if an object with this id already exists.

Parameters:
  • repo_object (RepoObject) – repo_object or list of repo_objects to be added, will be modified so that it contains the version number
  • message (str) – commit message. Defaults to ‘’.
  • category (MLObjectType) – category of repo_object, which overrides the object’s own category. Defaults to None.
Returns:

str or dictionary – version number of object added or dictionary of names and versions of objects added

add_eval_function(f, repo_name=None)

Add the function to evaluate the model

Parameters:
  • module_name (str) – module where function is located
  • function_name (str) – function name
  • repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.
add_measure(measure, coordinates=None)

Add a measure to the repository

If the measure already exists, a message is returned.
Parameters:
  • measure (str) – string defining the measure, e.g. MAX, …
  • coordinates (list of str) – list of strings defining the coordinates (by name) used for the measure, if None, all coordinates will be used. Defaults to None.
add_model(model_name, model_eval=None, model_training=None, model_param=None, training_param=None, preprocessors=None)

Add a new model to the repo

Parameters:
  • model_name (str) – identifier of the model
  • model_eval (str) – identifier of the evaluation function in the repo to evaluate the model, if None and there is only one evaluation function in the repo, this function will be used
  • model_training (str) – identifier of the training function in the repo to train the model, if None and there is only one training function in the repo, this function will be used
  • model_param (str) – identifier of the model parameter in the repo, if None and there is exactly one ModelParameter in the repo, this will be used, otherwise it is assumed that no model_params are needed. Defaults to None.
  • training_param (str) – identifier of the training parameter, if None and there is only one training_parameter object in the repo, this will be used. If an empty string is given as training parameter, it is assumed that the algorithm does not need a training parameter. Defaults to None.
  • preprocessors (list) – list of preprocessors to be executed, given as a list of strings. Defaults to None.
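
The "if None and there is only one ... in the repo" rule used by several of the parameters above can be pictured with a small stand-alone sketch. The helper `resolve_default` and the identifiers passed to it are hypothetical names chosen for illustration, not part of pailab. Raising an error in the ambiguous case is an assumption as well, since the text above does not say what happens then.

```python
def resolve_default(requested, available, kind):
    """Return `requested`, or the single available name if `requested` is None."""
    if requested is not None:
        return requested
    if len(available) == 1:
        return available[0]
    # Assumption: an ambiguous or empty choice is treated as an error.
    raise ValueError(
        "cannot pick a default {}: {} candidates available".format(kind, len(available))
    )


# Exactly one evaluation function is registered, so None resolves to it.
chosen_eval = resolve_default(None, ["eval_sklearn"], "evaluation function")

# An explicit identifier always wins over the default lookup.
chosen_train = resolve_default("train_b", ["train_a", "train_b"], "training function")
```
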
add_preprocessing_fitting_function(f, repo_name=None)

Add function to fit a preprocessor

Parameters:
  • module_name (str) – module where function is located
  • function_name (str) – function name
  • repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.
add_preprocessing_transforming_function(f, repo_name=None)

Add function to transform the data by a preprocessor

Parameters:
  • module_name (str) – module where function is located
  • function_name (str) – function name
  • repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.
add_preprocessor(preprocessor_name, transforming_function=None, fitting_function=None, preprocessor_param=None)

Add a new preprocessor to the repo

Parameters:
  • preprocessor_name (str) – identifier of the preprocessor
  • transforming_function (str) – identifier of the transforming function in the repo, if None and there is only one transforming function in the repo, this function will be used
  • fitting_function (str) – identifier of the fitting function in the repo to fit the preprocessor, if None the preprocessor does not need to be fitted
  • preprocessor_param (str) – identifier of the preprocessor parameter. Defaults to None.
Raises:

Exception – Raises an error if the preprocessing transforming function is not in repo

add_raw_data(name, data, input_names=None, data_y=None, target_names=None, file_format=None, axis=1)

Adds a RawData object to the repository.

This method creates a RawData object from the given data or file and adds it to the repository.

Examples

Read data from csv file ‘test_data.csv’ and use columns with headers ‘x0’, ‘x1’ as input data and column with label ‘x2’ as target, store the results under name ‘my_data’:

>>> ml_repo.add_raw_data('my_data', 'test_data.csv', ['x0', 'x1'], target_names=['x2'], file_format='csv')

Create data from a DataFrame test where the columns ‘x0’, ‘x1’ are used as input and no target is specified:

>>> ml_repo.add_raw_data('my_data', test, ['x0', 'x1'])
Parameters:
  • name (str) – Name of RawData in repository (if name does not start with ‘raw_data/’, this prefix is added).
  • data (str, numpy ndarray or pandas DataFrame) – Either a pandas DataFrame, a numpy ndarray or a string that is interpreted as the filename of the underlying data.
  • input_names (iterable of str, optional) – List of the input variable names. Defaults to None.
  • data_y (str or numpy ndarray, optional) – Either a numpy ndarray or a string defining the filename of the y-data (not valid if file_format==’csv’). Defaults to None.
  • target_names (iterable of str, optional) – List of the target variable names. Defaults to None.
  • file_format ('csv' or 'numpy', optional) – File type which can be either csv or numpy (numpy means an ndarray stored with numpy.save). Defaults to None.
  • axis (int, optional) – If only an ndarray is given but target variables are defined, this array will be split into two arrays (one for input, one for target) along this axis. Defaults to 1.
Returns:

version number of RawData object added

Return type:

str
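
The role of the axis parameter can be illustrated with plain NumPy. This is only a sketch of the idea of splitting one array into an input part and a target part. The sample values are made up, and the assumption that the input columns come first and the target columns last is not confirmed by the text above.

```python
import numpy as np

# Hypothetical data: 4 samples, columns x0, x1 (inputs) and x2 (target).
data = np.array([
    [0.0, 1.0, 10.0],
    [1.0, 2.0, 20.0],
    [2.0, 3.0, 30.0],
    [3.0, 4.0, 40.0],
])
input_names = ["x0", "x1"]
target_names = ["x2"]

# Split along axis=1 (columns): the first len(input_names) columns are taken
# as inputs and the remaining ones as targets. This ordering is an assumption.
x_data, y_data = np.split(data, [len(input_names)], axis=1)
```

With axis=0 the same call would cut the array between rows instead of between columns.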

add_test_data(name, raw_data_name, start_index=0, end_index=None, raw_data_version='last')

Add test data as a DataSet to the repository.

This method defines a DataSet and adds it to the repository. A DataSet is a logical unit based on a RawData object and defines the range of data that is taken from the respective RawData data.
Parameters:
  • name (str) – Name of respective object in repository.
  • raw_data_name (str) – Name of the underlying RawData object that is used as basis.
  • start_index (int, optional) – Start index where test data starts from underlying RawData. Defaults to 0.
  • end_index (int, optional) – End index where test data ends. Defaults to None.
  • raw_data_version (str) – Version of underlying RawData (if ‘last’, always the latest RawData will be used to derive the respective DataSet). Defaults to ‘last’
add_training_data(name, raw_data_name, start_index=0, end_index=None, raw_data_version='last')

Add training data as a DataSet to the repository.

This method defines a DataSet and adds it to the repository. A DataSet is a logical unit based on a RawData object and defines the range of data that is taken from the respective RawData data.
Parameters:
  • name (str) – Name of respective object in repository.
  • raw_data_name (str) – Name of the underlying RawData object that is used as basis.
  • start_index (int, optional) – Start index where training data starts from underlying RawData. Defaults to 0.
  • end_index (int, optional) – End index where training data ends. Defaults to None.
  • raw_data_version (str) – Version of underlying RawData (if ‘last’, always the latest RawData will be used to derive the respective DataSet). Defaults to ‘last’
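
The idea of a DataSet as an index range over one RawData object can be sketched without pailab. The helper `take_range` is hypothetical, and the sketch assumes Python slice semantics (end index exclusive, None meaning "up to the end"). Whether pailab treats end_index as inclusive or exclusive is not stated here.

```python
# Hypothetical raw data: ten rows, stored here as a plain list.
raw_rows = list(range(10))


def take_range(rows, start_index=0, end_index=None):
    """Return the rows selected by a start/end index pair (Python slice semantics)."""
    if end_index is not None and end_index < start_index:
        raise ValueError("end_index must not be smaller than start_index")
    return rows[start_index:end_index]


# Training and test data are two ranges over the same underlying raw data.
training_rows = take_range(raw_rows, 0, 8)
test_rows = take_range(raw_rows, 8, None)
```
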
add_training_function(f, repo_name=None)

Add function to train a model

Parameters:
  • module_name (str) – module where function is located
  • function_name (str) – function name
  • repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.
delete(name, version)

Delete a specific object.

Deletes the given version of the object. If other objects depend on this object, an exception is thrown stating that the dependent objects must be deleted first.

Parameters:
  • name (str) – name of the object
  • version (str) – version of the object
Raises:

Exception – If the object has dependent objects, it cannot be deleted and an error is thrown.
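
The deletion rule can be pictured with a toy store that remembers which objects were built from which. Everything in this sketch (the class `ToyStore`, its attributes and the object names) is hypothetical. It only illustrates the documented behaviour and says nothing about how pailab tracks dependencies internally.

```python
class ToyStore:
    """Hypothetical in-memory store that tracks which objects were built from which."""

    def __init__(self):
        self.objects = {}      # (name, version) -> payload
        self.dependents = {}   # (name, version) -> set of (name, version) built from it

    def add(self, key, payload, built_from=()):
        self.objects[key] = payload
        for parent in built_from:
            self.dependents.setdefault(parent, set()).add(key)

    def delete(self, key):
        # Refuse to delete while other stored objects still depend on this one.
        remaining = [d for d in self.dependents.get(key, ()) if d in self.objects]
        if remaining:
            raise RuntimeError("delete dependent objects first: {}".format(sorted(remaining)))
        del self.objects[key]


store = ToyStore()
store.add(("raw_data/my_data", "v1"), "rows")
store.add(("training_data", "v1"), "range 0-8", built_from=[("raw_data/my_data", "v1")])

try:
    store.delete(("raw_data/my_data", "v1"))
    blocked = False
except RuntimeError:
    blocked = True

# After removing the dependent object, the raw data can be deleted as well.
store.delete(("training_data", "v1"))
store.delete(("raw_data/my_data", "v1"))
```
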

get(name, version='last', full_object=False, modifier_versions=None, obj_fields=None, repo_info_fields=None, throw_error_not_exist=True, throw_error_not_unique=True)

Get repo objects. It throws an exception, if an object with the name does not exist.

Parameters:
  • name (str) – the object name
  • version (str) – object version, default is latest (-1). If the fields are nested (an element of a dictionary which is itself an element of a dictionary), use path notation to the element, e.g. p/elem1/elem2 to get p[elem1][elem2]. Defaults to repo_store.RepoStore.LAST_VERSION.
  • full_object (bool) – flag to determine whether the numpy objects are loaded (True->load). Defaults to False.
  • modifier_versions ([type]) – [description]. Defaults to None.
  • obj_fields ([type]) – [description]. Defaults to None.
  • repo_info_fields ([type]) – [description]. Defaults to None.
  • throw_error_not_exist (bool) – if True, throw an error if the object does not exist, else return []. Defaults to True.
  • throw_error_not_unique (bool) – if True, throw an error if the item is not unique, else return []. Defaults to True.
Raises:

Exception – raises an exception if no object with the specific name is found

Returns:

RepoObject or list thereof – The repo object
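
The path notation mentioned in the parameter list (p/elem1/elem2 standing for p[elem1][elem2]) can be resolved with a few lines of plain Python. The helper `get_by_path` is a hypothetical illustration of the notation, not a pailab function.

```python
def get_by_path(obj, path, sep="/"):
    """Follow a path such as 'p/elem1/elem2' through nested dictionaries."""
    current = obj
    for key in path.split(sep):
        current = current[key]  # raises KeyError if a level is missing
    return current


# Hypothetical object fields with two levels of nesting.
fields = {"p": {"elem1": {"elem2": 42}}}
value = get_by_path(fields, "p/elem1/elem2")  # same as fields["p"]["elem1"]["elem2"]
```
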

static get_calibrated_model_name(model_name)

For a model name the calibrated model name is returned

Parameters:model_name (str) – model name
Returns:string – the calibrated model name
get_commits(version_start='first', version_end='last')

Get the commits.

Parameters:
  • version_start (str) – only display versions after version_start. Defaults to repo_store.RepoStore.FIRST_VERSION.
  • version_end (str) – only display versions up to version_end. Defaults to repo_store.RepoStore.LAST_VERSION.
Returns:

list of commit infos – returns a list of commit infos

static get_eval_name(model, data)

Return name of the object containing evaluation results

Parameters:
  • model (ModelDefinition object or str) –
  • data (RawData or DataSet object or str) –
Returns:

string – name of evaluation results

get_history(name, repo_info_fields=None, obj_member_fields=None, version_start='first', version_end='last')

Return a list of histories of object member variables without big objects

Parameters:
  • name (str) – the object name
  • repo_info_fields (list of strings) – List of fields from repo_info which will be returned in the dictionary. If the list contains the flag ‘ALL’, all fields will be returned. Defaults to None.
  • obj_member_fields (list of strings) – List of member attributes from repo_object which will be returned in the dictionary. If the list contains the flag ‘ALL’, all attributes will be returned. Defaults to None.
  • version_start (str) – only display versions after version_start. Defaults to repo_store.RepoStore.FIRST_VERSION.
  • version_end (str) – only display versions up to version_end. Defaults to repo_store.RepoStore.LAST_VERSION.
Returns:

str or list of strings – returns a list of the objects

get_ml_repo_store()

Return the storage for the ml repo

Returns:RepoStore – the storage for the RepoObjects
get_names(ml_obj_type)

Get the list of names of all repo_objects from a given repo_object_type in the repository.

Parameters:ml_obj_type (MLObjectType) – MLObjectType specifying the types of objects names are returned for.
Returns:list of strings – list of object names for the given category.
get_numpy_data_store()

Return the numpy data store of the ml repo

Returns:numpy_handler – the numpy repo
get_training_data(version='last', full_object=True, model=None, model_version='last')

Returns training data for a model.

It returns the training data in the repo for a specified model. If there is only one set of training data in the repo, this set will be returned. Otherwise, the model is loaded and the training data is used as defined in the model. If in this case a model is not specified the method throws an exception.

Parameters:
  • version (str) – version of data object. Defaults to repo_store.RepoStore.LAST_VERSION.
  • full_object (bool) – if True, the complete data is returned including numpy data. Defaults to True.
  • model (str) – Name of model definition for which the training data will be returned.
  • model_version (str) – Version of model definition for which the training data will be returned.
pull()

Pull changes from an external repo

push()

Push changes to an external repo.

run(job)

Executes a job

Parameters:job (Job) – The job object to be executed
Returns:[type] – Return the name and version of the job or a message that the job does not need to be rerun
run_evaluation(model=None, message=None, model_version='last', datasets={}, predecessors=[], run_descendants=False, labels=None)

Evaluate the model on all datasets.

Parameters:
  • model (str) – name of model to evaluate, if None and only one model exists, this model is used. Defaults to None.
  • message (str) – message inserted into commit, if None, an automated message is created. Defaults to None.
  • model_version (str) – version of model to be evaluated. Defaults to repo_store.RepoStore.LAST_VERSION.
  • datasets (dict) – dictionary of datasets (names and version numbers) on which the model is evaluated. Defaults to {}.
  • predecessors (list) – list of jobs which shall have been completed successfully before the evaluation is started. Default is all datasets from testdata on latest version. Defaults to [].
  • run_descendants (bool) – if True also run all descendant jobs. Defaults to False.
  • labels ([type]) – [description]. Defaults to None.
Returns:

list of strings – a list of the job ids

run_measures(model=None, message=None, model_version='last', datasets={}, measures={}, predecessors=[], labels=None)

Run the measures

Parameters:
  • model (str) – name of model to evaluate, if None and only one model exists, this model is used. Defaults to None.
  • message (str) – message inserted into commit, if None, an automated message is created. Defaults to None.
  • model_version (str) – version of model to be evaluated. Defaults to repo_store.RepoStore.LAST_VERSION.
  • datasets (dict) – dictionary of datasets (names and version numbers) on which the model is evaluated. Defaults to {}.
  • predecessors (list) – list of jobs which shall have been completed successfully before the evaluation is started. Default is all datasets from testdata on latest version. Defaults to [].
  • run_descendants (bool) – if True also run all descendant jobs. Defaults to False.
  • labels ([type]) – [description]. Defaults to None.
Returns:

list of strings – a list of the job ids

run_tests(test_definitions=None, predecessors=[])

Run tests for a specific model version.

Parameters:
  • test_definitions (list or set) – List or set of names of the test definitions which shall be executed. If None, all test definitions are executed. Defaults to None.
  • predecessors (list) – list of jobs which shall have been completed successfully before the evaluation is started. Default is all datasets from testdata on latest version. Defaults to [].
Returns:

str – ticket number of job

run_training(model=None, message=None, model_version='last', training_function_version='last', training_data_version='last', training_param_version='last', model_param_version='last', run_descendants=False)

Run the training algorithm.

Parameters:
  • model (str) – the identifier of the model. Defaults to None.
  • message (str) – the commit message. Defaults to None.
  • model_version (str) – the version of the model. Defaults to repo_store.RepoStore.LAST_VERSION.
  • training_function_version (str) – the version of the training function. Defaults to repo_store.RepoStore.LAST_VERSION.
  • training_data_version (str) – the version of the training data. Defaults to repo_store.RepoStore.LAST_VERSION.
  • training_param_version (str) – the version of the training parameter. Defaults to repo_store.RepoStore.LAST_VERSION.
  • model_param_version (str) – the version of the model parameter. Defaults to repo_store.RepoStore.LAST_VERSION.
  • run_descendants (bool) – if True also run all descendant jobs. Defaults to False.
Returns:

[type] – return name and version or message

set_label(label_name, model=None, model_version='last', message='')

Label a certain model version.

Checks whether a model with this version exists and throws an exception if it does not; otherwise the given model version is labeled.
Parameters:
  • label_name (str) – the label name
  • model (str) – the identifier of the model. Defaults to None.
  • model_version (str) – model version for which the label is set. Defaults to repo_store.RepoStore.LAST_VERSION.
  • message (str) – commit message. Defaults to ‘’.

repo_objects

This module contains a bunch of different RepoObjects.

In principle, all objects that can be stored within pailab’s MLRepo are called RepoObjects. So, if you need a new object apart from those documented here, you just have to implement the respective interfaces so that the object can be processed by pailab. This may be accomplished in three different ways:

  • Inherit your class from the pailab.repo_objects.RepoObject class. This may not be very pythonic, but it easily shows you which interfaces you definitely have to implement.
  • If you have a very simple object you may use the decorator pailab.repo_objects.repo_object_init in conjunction with your class’s constructor to make your class a RepoObject.
  • Just implement the methods needed (again look at pailab.repo_objects.RepoObject to see what has to be defined).
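
As a rough illustration of the third option, the following stand-alone class provides the four methods that the RepoObject entry in this module lists (to_dict, from_dict, numpy_to_dict, numpy_from_dict) and carries a repo_info member. The class `MyParameters`, its attributes and the plain-dictionary repo_info are hypothetical. The sketch does not import pailab, and it is an assumption that these members are all pailab needs to handle such an object.

```python
class MyParameters:
    """Hypothetical small object exposing the four methods listed for RepoObject."""

    def __init__(self, learning_rate, weights, repo_info=None):
        # Assumption: a plain dict stands in for a RepoInfo instance here.
        self.repo_info = repo_info if repo_info is not None else {}
        self.learning_rate = learning_rate   # small value, goes into to_dict()
        self.weights = weights               # "big" data, kept out of to_dict()

    def to_dict(self):
        # Plain data without the big objects.
        return {"learning_rate": self.learning_rate}

    def from_dict(self, repo_obj_dict):
        self.learning_rate = repo_obj_dict["learning_rate"]

    def numpy_to_dict(self):
        # Big objects only.
        return {"weights": self.weights}

    def numpy_from_dict(self, repo_numpy_dict):
        self.weights = repo_numpy_dict["weights"]


# Round trip: serialize one object and restore its state into another.
original = MyParameters(0.1, [1.0, 2.0, 3.0])
restored = MyParameters(None, None)
restored.from_dict(original.to_dict())
restored.numpy_from_dict(original.numpy_to_dict())
```
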
class CommitInfo(message, author, objects, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)

Stores each commit including the commit message and the objects committed.

Parameters:
  • message (string) – commit message
  • author (string) – author
  • objects (dictionary) – dictionary of names of committed objects and version numbers

class DataSet(raw_data, start_index=0, end_index=None, raw_data_version='last', repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)

Class used to define data used e.g. for training or testing.

This class refers to some RawData object and a start and end index. The repository

Parameters:
  • raw_data (str) – id of raw_data the dataset refers to
  • start_index (int) – index of first entry of the raw data used in the dataset. Defaults to 0.
  • end_index (int or None) – index of last entry of the raw data used in the dataset (if None, all entries up to and including the last element are used). Defaults to None.
  • raw_data_version (str) – version of RawData object the DataSet refers to. Defaults to ‘last’.
  • repo_info (RepoInfo) – dictionary of the repo info. Defaults to RepoInfo().
Raises:

Exception – raises an exception if the start index is after the end index

set_data(raw_data)

Set the data from the given raw_data.

Parameters:raw_data (RawData) – the raw data used to set the data from
Raises:Exception – if end_index is less than start_index
class Function(f, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)
create()

Returns the function object

Returns:the function object
Return type:function object
get_version()

returns the version

Returns:[type] – the module version
class Label(model_name, model_version, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)

RepoObject to label a certain model version

class Measure(value, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)

the measure repo object

class MeasureConfiguration(measures, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)

RepoObject defining a configuration for all measures which shall be computed.

L2 = 'l2'
MAX = 'max'
MSE = 'mse'
R2 = 'r2'
add_measure(measure, coords=None)

add a measure to the repo object

Parameters:
  • measure ([type]) – the measure
  • coords ([type]) – the coordinates. Defaults to None.
static get_name(measure_def)

function to return a name of the measure

Parameters:measure_def (MeasureConfiguration) – the measure definition
Returns:str – the name of the measure
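
For orientation, the four measure keys listed above can be related to their usual textbook definitions. The sketch below computes those textbook quantities on made-up numbers. It is not pailab's implementation, and whether pailab normalises or aggregates any of them differently is not stated in this document.

```python
import math

# Hypothetical targets and model predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

errors = [p - t for p, t in zip(y_pred, y_true)]
ss_res = sum(e * e for e in errors)                  # sum of squared errors
mean_true = sum(y_true) / len(y_true)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares

measures = {
    "max": max(abs(e) for e in errors),   # largest absolute error
    "l2": math.sqrt(ss_res),              # Euclidean norm of the error vector
    "mse": ss_res / len(errors),          # mean squared error
    "r2": 1.0 - ss_res / ss_tot,          # coefficient of determination
}
```
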
class Model(preprocessors=None, eval_function=None, train_function=None, train_param=None, model_param=None, training_data=None, test_data=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)
get_test_data(ml_repo)

Returns all test data in the repo relevant for this model.

Parameters:ml_repo (MLRepo) – The repository from which the test data is taken
Returns:list of names of the test data that applied to this model
class Preprocessor(transforming_function, fitting_function=None, preprocessing_param=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)

Preprocessor class

class RawData(x_data, x_coord_names, y_data=None, y_coord_names=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)

Class to store numpy data.

class RepoInfo(kwargs={}, name=None, version=None, category=None, modification_info=None)

Contains all repo relevant information

This class contains all repo relevant information such as version, name, descriptions. It must be a member of all objects which are handled by the repo.

get_dictionary()

Return repo info as dictionary

set_fields(kwargs)

Set repo info fields from a dictionary

Parameters:kwargs (dict) – additional arguments
class RepoInfoKey

Enums to describe all possible repository information.

AUTHOR = 'author'
BIG_OBJECTS = 'big_objects'
CATEGORY = 'category'
CLASSNAME = 'classname'
COMMIT_DATE = 'commit_date'
COMMIT_MESSAGE = 'commit_message'
DESCRIPTION = 'description'
MODIFICATION_INFO = 'modification_info'
NAME = 'name'
VERSION = 'version'
class RepoObject(repo_info)

Base class for objects which are handled by the repository.

from_dict(repo_obj_dict)

set object from a dictionary

Parameters:repo_object_dict (dict) – dictionary with the object data
numpy_from_dict(repo_numpy_dict)

sets the attributes of the numpy dictionary

Parameters:repo_numpy_dict (dict) – dictionary with the object data
numpy_to_dict()

function to get the attributes as a dictionary

Returns:dict – dictionary of the attributes
to_dict()

Return a data dictionary for a given repo_object without the big data objects

Returns:dict – dictionary of data
class Result(data, big_data=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)

the result repo object

numpy_from_dict(repo_numpy_dict)

sets the attributes of the numpy dictionary

Parameters:repo_numpy_dict (dict) – dictionary with the object data
numpy_to_dict()

returns the big data object

Returns:[type] – the big data
create_repo_obj(obj)

Create a repo_object from a dictionary.

This function creates a repo object from a dictionary in a factory-like fashion. It uses the obj[‘repo_info’][‘classname’] within the dictionary and constructs the class using get_object_from_classname. It throws an exception if the dictionary does not contain a ‘repo_info’ key.

Parameters:obj (dict) – dictionary containing all information for a repo_object.
Raises:Exception – raises an exception if the dictionary is not a repo dictionary
Returns:[type] – the object of the specified class
create_repo_obj_dict(obj)

Create from a repo_object a dictionary with all values to handle the object within the repo

Parameters:obj (RepoObject) – repository object
Returns:dict – returns the dictionary
get_object_from_classname(classname, data)

Returns an object instance for given classname and data dictionary.

Parameters:
  • classname (str) – Full classname as string including the modules, e.g. repo.Y if class Y is defined in module repo.
  • data (dict) – dictionary of data used to initialize the object instance.
Returns:

[type] – Instance object of class.
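
Building an instance from a fully qualified class name is a common factory pattern and can be sketched with the standard library. The helper `object_from_classname` below is a hypothetical stand-in, not the pailab function. In particular, passing the data dictionary straight to the constructor is an assumption, and a standard-library class is used in place of a repo object class.

```python
import importlib


def object_from_classname(classname, data):
    """Import the module part of `classname` and build the class from `data`."""
    module_name, _, class_name = classname.rpartition(".")
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    return cls(**data)  # assumption: `data` maps directly onto constructor arguments


# Standard-library class used as a stand-in for a repo object class.
obj = object_from_classname("datetime.date", {"year": 2020, "month": 1, "day": 2})
```
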

class repo_object_init(big_objects=[])

Decorator class to modify a constructor so that the class can be used within the ml repository as repo_object.

from_dict(repo_obj_dict)

set object from a dictionary

Parameters:
  • repo_object (RepoObject) – repo_object which will be set from the dictionary
  • repo_object_dict (dict) – dictionary with the object data
init_repo_object(init_self, repo_info)

initialiser for repo objects

Parameters:
  • init_self ([type]) – [description]
  • repo_info (dict) – the repository info
numpy_from_dict(repo_numpy_dict)

function to transform a dictionary to a numpy

Parameters:
  • repo_obj (RepoObject) – the repo object
  • repo_numpy_dict (numpy dict) – the repo numpy dictionary
numpy_to_dict()

function to get the attributes as a dictionary

Parameters:repo_obj (RepoObject) – the repo object
Returns:dict – dictionary of the attributes
to_dict()

Return a data dictionary for a given repo_object

Parameters:repo_obj (RepoObject) – A repo_object, i.e. object which provides the repo_object interface
Returns:dict – dictionary of data

object types

class MLObjectType

Enum describing all ml object types.

The MLObjectType is assigned to each object in the MLRepo. It is used to structure all objects and to support consistency checks and automatic pipelines. The following types are defined:

  • EVAL_DATA: evaluation data (result from evaluation of a model)
  • RAW_DATA: raw data, i.e. simple numpy structures most often used to derive test or training data from the RawData
  • TRAINING_DATA: training data used for model training
  • TEST_DATA: data used for model testing
  • TEST: concrete test
  • TEST_DEFINITION: definition of a test which is applied to respective data and models to obtain a test
  • MODEL_PARAM: model parameter
  • TRAINING_PARAM: training parameter
  • TRAINING_FUNCTION: function to train a certain model
  • MODEL_EVAL_FUNCTION: function to evaluate a certain model
  • PREPROCESSOR_PARAM: preprocessing parameter
  • PREPROCESSOR: definition of a preprocessor
  • PREPROCESSING_FITTING_FUNCTION: function to fit the preprocessor
  • PREPROCESSING_TRANSFORMING_FUNCTION: function to apply preprocessing to data
  • LABEL: model label
  • MODEL: definition of a model
  • CALIBRATED_MODEL: object containing a calibrated instance of a model
  • COMMIT_INFO: internally used to store commit messages
  • MAPPING: internally used mapping object to map an object’s name to the object’s category
  • MEASURE: computed measure (e.g. norm of error)
  • MEASURE_CONFIGURATION: the configuration of all measures applied to the model
  • RESULT: object holding results
  • JOB: a job
  • TRAINING_STATISTIC: object holding training statistics, e.g. training history
  • CACHED_VALUE: cached return values of time consuming functions

repo stores

Base classes

class NumpyStore

class to handle big objects

add(name, version, numpy_dict)

Add numpy data from an object to the storage.

Parameters:
  • name (str) – Name (as string) of object
  • version (str) – object version
  • numpy_dict (numpy dict) – numpy dictionary
append(name, version_old, version_new, numpy_dict)

Append data to an existing object

Parameters:
  • name (str) – name of data object to be returned
  • version_old (str) – version of the object where the data will be appended
  • version_new (str) – version of the new object after appending the data
  • numpy_dict (dict) – dictionary containing the values
get(name, version, from_index=0, to_index=None)

get the numpy object for a name and a version, rows can be used

Parameters:
  • name (str) – identifier of the object
  • version (str) – version of the object
  • from_index (int) – the index from which the data should be taken. Defaults to 0.
  • to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Returns:

numpy array – the numpy object to return

pull()

Pull changes from an external repo

push()

Push changes to an external repo.

class RepoScriptStore
add(script_file)

Add a script to the storage.

Parameters:script_file (string) – file (including path) of script
get(name, versions=None)
class RepoStore
LAST_VERSION = 'last'

Abstract base class for all storages which can be used in the ML repository

add(obj)

Add an object to the storage.

Parameters:obj (RepoObject|list(RepoObject)) – repository object or list of repository objects
Raises:Exception if an object with same name already exists.
get(name, versions=None, modifier_versions=None, obj_fields=None, repo_info_fields=None, throw_error_not_exist=True, throw_error_not_unique=True)

Get a dictionary/list of dictionaries fulfilling the conditions.

Returns a list of objects matching the name and whose
  • version is in the given list of versions
  • modifiers match the version number/are in the list of version numbers of the given modifiers
Parameters:
  • name (str) – object id
  • versions (list, version_number, tuple) – either a list of versions or a single version of the objects to be returned; if None, the condition on version is ignored. If a tuple is given, this tuple defines a version interval, i.e. all versions between the first and last entry (both inclusive) are returned. In addition, FIRST_VERSION and LAST_VERSION can be used to access the first/last version. Defaults to None.
  • modifier_versions (dictionary) – modifier ids together with version specs which are matched by the returned object. Defaults to None.
  • obj_fields (list of str or str) – list of strings identifying the fields which will be returned in the dictionary, if None, no fields are returned, if set to ‘all’, all fields will be returned. Defaults to None.
  • repo_info_fields (list of str or str) – list of strings identifying the fields of the repo_info dict which will be returned in the dictionary, if None, no fields are returned, if set to ‘all’, all fields will be returned. Defaults to None.
  • throw_error_not_exist (bool) – if True, throw an error if the object does not exist, else return []. Defaults to True.
  • throw_error_not_unique (bool) – if True, throw an error if the item is not unique, else return []. Defaults to True.
Returns:

RepoObject or list thereof – The repo object
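
The different forms accepted by the versions argument (None, a single version, a list, or a tuple describing an inclusive interval, each possibly using FIRST_VERSION or LAST_VERSION) can be sketched as a filter over an ordered version list. The function `select_versions`, the version ids and the module-level constants are hypothetical stand-ins. A concrete storage may resolve versions quite differently.

```python
FIRST_VERSION = "first"   # stand-ins for the RepoStore constants
LAST_VERSION = "last"


def select_versions(all_versions, versions=None):
    """Filter an ordered list of version ids by a single id, a list, or a (start, end) tuple."""
    def resolve(v):
        if v == FIRST_VERSION:
            return all_versions[0]
        if v == LAST_VERSION:
            return all_versions[-1]
        return v

    if versions is None:                      # no condition on the version
        return list(all_versions)
    if isinstance(versions, tuple):           # interval, both ends included
        start = all_versions.index(resolve(versions[0]))
        end = all_versions.index(resolve(versions[1]))
        return all_versions[start:end + 1]
    if isinstance(versions, list):            # explicit list of versions
        wanted = {resolve(v) for v in versions}
        return [v for v in all_versions if v in wanted]
    return [resolve(versions)]                # single version


history = ["v1", "v2", "v3", "v4"]            # hypothetical ids, oldest first
interval = select_versions(history, ("v2", LAST_VERSION))
single = select_versions(history, LAST_VERSION)
picked = select_versions(history, [FIRST_VERSION, "v3"])
```
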

get_first_version(name, throw_error_not_exist=True)

Return version number of first (in a temporal sense) object in storage

Parameters:
  • name (str) – object name for which the version is returned
  • throw_error_not_exist (bool) – if True, throw an error if the object does not exist, else return []. Defaults to True.
Raises:

NotImplementedError – [description]

get_latest_version(name, throw_error_not_exist=True)

Return latest version number of object in the storage

Parameters:
  • name (str) – object name
  • throw_error_not_exist (bool) – if True, throw an error if the object does not exist, else return []. Defaults to True.
Returns:

str – latest version number

get_names(ml_obj_type)

Return the names of all objects belonging to the given category.

Parameters:ml_obj_type (str) – Value of MLObjectType-Enum specifying the category for which all names will be returned
get_version(name, offset, throw_error_not_exist=True)

Return version number for the given offset

If offset >= 0 it returns the version number of the offset version, if <0 it returns according to the python list logic the version number of the (offset-1)-last version

Parameters:
  • name (str) – name of object
  • offset (int) – offset
  • throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
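
The offset rule can be illustrated without a repository. The sketch below is a minimal stand-in, assuming the versions of one object are kept in temporal order in a plain list; the helper version_at and the version identifiers are hypothetical and are not part of pailab.

```python
def version_at(versions, offset, throw_error_not_exist=True):
    """Pick a version from a chronologically ordered list by offset."""
    try:
        # Python list indexing: 0 is the first version, -1 the latest.
        return versions[offset]
    except IndexError:
        if throw_error_not_exist:
            raise
        return []


history = ["v0", "v1", "v2"]  # hypothetical version identifiers, oldest first
first = version_at(history, 0)
latest = version_at(history, -1)
previous = version_at(history, -2)
missing = version_at(history, 5, throw_error_not_exist=False)
```

Here first is the oldest entry, latest the newest, and missing falls back to [] because the error is suppressed.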
object_exists(name, version='last')

Returns True if an object with the given name and version exists.

Parameters:
  • name (string) – object name
  • version (version number) – version number. Defaults to LAST_VERSION.
pull()

Pull changes from an external repo

push()

Push changes to an external repo.

replace(obj)

Overwrite existing object without incrementing version

Parameters:

obj (RepoObject) – repo object to be overwritten

Memory storages

class NumpyMemoryStorage

Bases: pailab.ml_repo.repo_store.NumpyStore

add(name, version, numpy_dict)

Add numpy data from an object to the storage.

Parameters:
  • name (str) – identifier (as string) of object
  • version (str) – object version
  • numpy_dict (numpy dict) – numpy dictionary
append(name, version_old, version_new, numpy_dict)

Append a numpy dictionary to an existing object

Parameters:
  • name (str) – identifier of the object
  • version_old (str) – the old version of the object
  • version_new (str) – the new version of the object
  • numpy_dict (numpy dict) – the numpy dictionary to append
Raises:

Exception – raises an exception if the object does not exist

get(name, version, from_index=0, to_index=None)

Get the numpy object for a name and a version; from_index and to_index select a range of rows

Parameters:
  • name (str) – identifier of the object
  • version (str) – version of the object
  • from_index (int) – the index from which the data should be taken. Defaults to 0.
  • to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Raises:
  • Exception – raises an exception if no object with the name exists
  • Exception – raises an exception if no object with the given version exists
Returns:

numpy array – the numpy object to return
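
The row selection through from_index and to_index can be pictured with a small in-memory stand-in. This is only a sketch: ToyNumpyStore is hypothetical, plain Python lists stand in for numpy arrays so the snippet runs without dependencies, and it assumes that to_index is exclusive as in Python slicing, which the description above does not state.

```python
class ToyNumpyStore:
    """Hypothetical in-memory stand-in; not pailab's implementation."""

    def __init__(self):
        self._data = {}  # name -> version -> dict of column name -> rows

    def add(self, name, version, data_dict):
        self._data.setdefault(name, {})[version] = data_dict

    def get(self, name, version, from_index=0, to_index=None):
        if name not in self._data:
            raise Exception("No object with name " + name)
        if version not in self._data[name]:
            raise Exception("No object " + name + " with version " + version)
        # Slice every entry to the requested rows; to_index=None means "to the end".
        return {key: rows[from_index:to_index]
                for key, rows in self._data[name][version].items()}


store = ToyNumpyStore()
store.add("training_data", "v1",
          {"x": [[0.0], [1.0], [2.0], [3.0]], "y": [0, 1, 2, 3]})
subset = store.get("training_data", "v1", from_index=1, to_index=3)
tail = store.get("training_data", "v1", from_index=2)
```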

class RepoObjectMemoryStorage

Bases: pailab.ml_repo.repo_store.RepoStore

The repo object memory storage. This class stores repo objects (excluding large objects) in memory and is mainly intended for testing.

get_first_version(name, throw_error_not_exist=True)

Determine the first version of the object

Parameters:
  • name (str) – identifier of the object
  • throw_error_not_exist (bool) – if True, an error is raised if the object does not exist; otherwise [] is returned. Defaults to True.
Raises:

Exception – Raises an exception if the object does not exist

Returns:

str – the first version string of the object

get_latest_version(name, throw_error_not_exist=True)

Determine the latest version of the object

Parameters:
  • name (str) – identifier of the object
  • throw_error_not_exist (bool) – if True, an error is raised if the object does not exist; otherwise [] is returned. Defaults to True.
Raises:

Exception – Raises an exception if the object does not exist

Returns:

str – the latest version string of the object

get_names(category)

Return the names of all objects belonging to the given category.

Parameters:

ml_obj_type (str) – Value of MLObjectType-Enum specifying the category for which all names will be returned

Returns:

list of str – a list of all objects in the category

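
As a rough picture of a lookup by category, the sketch below filters a name-to-category mapping by the string value of an enum member. ToyObjectType, its members and the categories mapping are hypothetical stand-ins chosen for illustration; they do not reproduce the actual MLObjectType enum or the storage layout.

```python
from enum import Enum


class ToyObjectType(Enum):
    """Hypothetical stand-in for the MLObjectType enum."""
    TRAINING_DATA = "TRAINING_DATA"
    MODEL = "MODEL"


def get_names(categories, ml_obj_type):
    """Return the names of all objects whose category value matches."""
    return [name for name, category in categories.items()
            if category.value == ml_obj_type]


categories = {
    "train_set": ToyObjectType.TRAINING_DATA,
    "tree_model": ToyObjectType.MODEL,
    "validation_set": ToyObjectType.TRAINING_DATA,
}
names = get_names(categories, ToyObjectType.TRAINING_DATA.value)
```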
get_version(name, offset, throw_error_not_exist=True)

Return the newest version up to offset versions

Parameters:
  • name (str) – the identifier of the object
  • offset (int) – the offset
  • throw_error_not_exist (bool) – if True, an error is raised if the object does not exist; otherwise [] is returned. Defaults to True.
Raises:
  • Exception – raises an error if the offset is higher than the number of versions available
  • Exception – raises an exception if the object does not exist and throw_error_not_exist == True
Returns:

str – the version

replace(obj)

Overwrite existing object without incrementing version

Parameters:

obj (RepoObject) – repo object to be overwritten

RepoObjectDiskStorage

class RepoObjectDiskStorage(folder, file_format='pickle')

The RepoObjectDiskStorage class

check_integrity()

Checks if files are missing or have not yet been added

Returns:

dictionary – contains sets of missing files and/or set of files not yet added

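
One way to picture such a check is a set comparison between the files that are registered and the files found on disk. The sketch below is an assumption-laden illustration: the function check_files, the registered set and the dictionary keys "missing" and "not_added" are hypothetical and are not taken from RepoObjectDiskStorage.

```python
import os
import tempfile


def check_files(folder, registered):
    """Compare files registered in an index with files present in a folder."""
    on_disk = set(os.listdir(folder))
    return {
        "missing": registered - on_disk,    # registered but absent on disk
        "not_added": on_disk - registered,  # on disk but never registered
    }


with tempfile.TemporaryDirectory() as folder:
    # Create two files, only one of which is registered below.
    for filename in ("model.pkl", "extra.pkl"):
        open(os.path.join(folder, filename), "w").close()
    report = check_files(folder, registered={"model.pkl", "data.pkl"})
```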
close_connection()

Closes the database connection

get_config()

Return the configuration

Returns:

dict – a dictionary of the configuration

get_first_version(name, throw_error_not_exist=True)

Determine the first version of the object

Parameters:
  • name (str) – identifier of the object
  • throw_error_not_exist (bool) – if True, an error is raised if the object does not exist; otherwise [] is returned. Defaults to True.
Raises:

Exception – Raises an exception if the object does not exist

Returns:

str – the first version string of the object

get_latest_version(name, throw_error_not_exist=True)

Determine the latest version of the object

Parameters:
  • name (str) – identifier of the object
  • throw_error_not_exist (bool) – if True, an error is raised if the object does not exist; otherwise [] is returned. Defaults to True.
Raises:

Exception – Raises an exception if the object does not exist

Returns:

str – the latest version string of the object

get_names(ml_obj_type)

Return the names of all objects belonging to the given category.

Parameters:

ml_obj_type (str) – Value of MLObjectType-Enum specifying the category for which all names will be returned

Returns:

list of str – a list of all objects in the category

get_version(name, offset, throw_error_not_exist=True)

Return the newest version up to offset versions

Parameters:
  • name (str) – the identifier of the object
  • offset (int) – the offset
  • throw_error_not_exist (bool) – if True, an error is raised if the object does not exist; otherwise [] is returned. Defaults to True.
Raises:

Exception – raises an exception if the object does not exist and throw_error_not_exist == True

Returns:

str – the version

get_version_condition(name, versions, version_column, time_column)

Return the version condition for the SQL statement

Parameters:
  • name (str) – not used
  • versions (str or list of str) – the version or list of versions to condition on
  • version_column (str) – version column name
  • time_column (str) – time column name
Returns:

str – the condition for the versions

replace(obj)

Overwrite existing object without incrementing version

Parameters:

obj (RepoObject) – repo object to be overwritten

RepoObjectGitStorage

class RepoObjectGitStorage(remote=None, **kwargs)

Object storage with git support.

This storage stores all objects on disk in a local git repository. It provides functionality to push to and pull from another git repo. Note that it handles all files in the same way as RepoObjectDiskStorage does (using methods from this storage).

commit(message, force=True)

Commits the changes

Parameters:
  • message (str) – Commit message
  • force (bool) – If False, objects will only be committed if the integrity check succeeds.
Raises:

Exception – raises an exception if the integrity check fails
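
The role of force can be sketched as a guard in front of the actual commit. The snippet below is a minimal illustration under assumptions: guarded_commit, the callables passed to it and the shape of the integrity report are hypothetical and do not mirror the internals of RepoObjectGitStorage.

```python
def guarded_commit(message, check_integrity, do_commit, force=True):
    """Commit only if forced or if the integrity report is empty."""
    if not force:
        report = check_integrity()
        problems = {key: files for key, files in report.items() if files}
        if problems:
            raise Exception("Integrity check failed: " + repr(problems))
    do_commit(message)


log = []  # collects the messages of commits that went through


def broken_check():
    # Hypothetical integrity report with one missing file.
    return {"missing": {"data.pkl"}, "not_added": set()}


guarded_commit("forced commit", broken_check, log.append, force=True)
try:
    guarded_commit("checked commit", broken_check, log.append, force=False)
    raised = False
except Exception:
    raised = True
```

With force=True the check is skipped entirely, so only the forced commit ends up in log.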

pull(remote_name='origin')

Pull from the remote git repository

Parameters:

remote_name (str) – the name of the remote git repository. Defaults to ‘origin’.

Raises:
  • Exception – raises an exception if the remote name is not available
  • Exception – raises an error if the pull fails
push(remote_name='origin')

Push the changes to the remote git repository

Parameters:

remote_name (str) – name of the remote repository. Defaults to ‘origin’.

Raises:

Exception – raises an exception if the remote does not exist

replace(obj)

Overwrite existing object without incrementing version

Parameters:

obj (RepoObject) – the repo object to be overwritten

NumpyHDFStorage

Module defining classes to store numpy data in hdf5 files.

This module provides implementations of the pailab.ml_repo.repo_store.NumpyStore using hdf5 file format.

class NumpyHDFRemoteStorage(folder, remote_store=None, sync_get=False, sync_add=False)

Storage working like NumpyHDFStorage locally but in addition provides synchronization with a remote.

This storage stores numpy data in hdf5 files in a directory. It works much like NumpyHDFStorage, with the difference that it synchronizes the data with a given remote (downloading and uploading the respective files).

Example

This example shows how to set up the storage so that the data is stored in a local directory and can be synchronized with Google Cloud Storage (the path is written as a raw string so that \t is not interpreted as a tab):

>>> numpy = NumpyHDFRemoteStorage(r'C:\tmp\data')
>>> from pailab.ml_repo.remote_gcs import RemoteGCS
>>> remote = RemoteGCS(bucket='my_data')
>>> numpy.set_remote(remote)
Parameters:
  • folder (str) – folder where data is stored
  • remote_store (obj or dict) – object representing a remote storage (e.g. pailab.ml_repo.remote_gcs.RemoteGCS for the google cloud storage) or dictionary defining the remote params so that it can be created
  • sync_get (bool) – If True, tries to download data automatically if it does not exist locally, otherwise it checks only locally
  • sync_add (bool) – If True, added data will be directly uploaded to the remote
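
The effect of sync_get and sync_add can be sketched with dictionaries standing in for the local directory and the remote. ToySyncStore is a hypothetical illustration of the two flags only; it does not use hdf5 files or any real remote, and it is not the logic of NumpyHDFRemoteStorage.

```python
class ToySyncStore:
    """Hypothetical local store with an optional remote; dicts stand in for files."""

    def __init__(self, remote=None, sync_get=False, sync_add=False):
        self.local = {}
        self.remote = remote
        self.sync_get = sync_get
        self.sync_add = sync_add

    def add(self, name, version, data):
        self.local[(name, version)] = data
        if self.sync_add and self.remote is not None:
            self.remote[(name, version)] = data  # upload immediately

    def get(self, name, version):
        key = (name, version)
        if key not in self.local and self.sync_get and self.remote is not None:
            if key in self.remote:
                self.local[key] = self.remote[key]  # download on demand
        if key not in self.local:
            raise Exception("No object " + name + " with version " + version)
        return self.local[key]


remote = {("prices", "v1"): [1.0, 2.0]}

lazy = ToySyncStore(remote=remote, sync_get=True)
fetched = lazy.get("prices", "v1")  # not local, so it is fetched from the remote

offline = ToySyncStore(remote=remote, sync_get=False)
try:
    offline.get("prices", "v1")  # only the local store is checked
    found_offline = True
except Exception:
    found_offline = False

eager = ToySyncStore(remote=remote, sync_add=True)
eager.add("prices", "v2", [3.0])  # stored locally and uploaded
```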
add(name, version, numpy_dict)

Add numpy data from an object to the storage.

Parameters:
  • name (str) – the identifier of the object to add
  • version (str) – the object version
  • numpy_dict (numpy dict) – the numpy dictionary to add
get(name, version, from_index=0, to_index=None)

Get the numpy object for a name and a version; from_index and to_index select a range of rows

Parameters:
  • name (str) – identifier of the object
  • version (str) – version of the object
  • from_index (int) – the index from which the data should be taken. Defaults to 0.
  • to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Raises:
  • Exception – raises an exception if no object with the name exists
  • Exception – raises an exception if no object with the given version exists
Returns:

numpy array – the numpy object to return

pull()

Pull changes from an external repo

push()

Push changes to an external repo.

class NumpyHDFStorage(folder, version_files=False)

Storage using hdf5 files to store numpy data.

Example

Set up storage using folder C:\temp\data (written as a raw string so that \t is not interpreted as a tab):

>>> store = NumpyHDFStorage(r'C:\temp\data')
Parameters:
  • folder (str) – main directory where the files will be stored
  • version_files (bool) – If True, each version is contained in a separate file, otherwise all versions are in one file. If you work in a distributed environment (e.g. multiple users working in parallel), you should set this parameter to True so that no file merge is necessary. Defaults to False.
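
Why separate files avoid merges can be seen from the file names alone. The sketch below uses a hypothetical naming scheme (hdf_file_for and the .hdf5 names are assumptions for illustration, not the names NumpyHDFStorage actually writes).

```python
from pathlib import PurePosixPath


def hdf_file_for(folder, name, version, version_files=False):
    """Return a hypothetical file path for an object's numpy data."""
    if version_files:
        # One file per version: parallel writers never touch the same file.
        filename = f"{name}_{version}.hdf5"
    else:
        # One file per object: all versions share it.
        filename = f"{name}.hdf5"
    return PurePosixPath(folder) / filename


shared = hdf_file_for("data", "training_data", "v1")
separate_v1 = hdf_file_for("data", "training_data", "v1", version_files=True)
separate_v2 = hdf_file_for("data", "training_data", "v2", version_files=True)
```

With version_files=True, two users adding different versions write to different files, whereas with the default both would modify the same file.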
add(**kw)

Add numpy data from an object to the storage.

Parameters:
  • name (str) – the identifier of the object to add
  • version (str) – the object version
  • numpy_dict (numpy dict) – the numpy dictionary to add
append(**kw)

Append data to an existing object

Parameters:
  • name (str) – the object identifier
  • version_old (str) – the previous object version
  • version_new (str) – the next object version
  • numpy_dict (numpy dict) – the data to add as a numpy dictionary
get(**kw)

Get the numpy object for a name and a version; from_index and to_index select a range of rows

Parameters:
  • name (str) – identifier of the object
  • version (str) – version of the object
  • from_index (int) – the index from which the data should be taken. Defaults to 0.
  • to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Raises:
  • Exception – raises an exception if no object with the name exists
  • Exception – raises an exception if no object with the given version exists
Returns:

numpy array – the numpy object to return

object_exists(name, version)

Check whether the object exists

Parameters:
  • name (str) – the identifier of the object
  • version (str) – the version of the object
Returns:

bool – returns true if the object exists

trace(aFunc)

Trace entry, exit and exceptions.
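
A decorator that traces entry, exit and exceptions commonly looks like the sketch below. It is a generic pattern built on functools.wraps and the logging module, given here only for orientation; the logger name and the example function divide are hypothetical, and the actual trace implementation may differ.

```python
import functools
import logging

logger = logging.getLogger("trace_sketch")  # hypothetical logger name


def trace(func):
    """Log entry, exit and exceptions of the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("Entering %s", func.__name__)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Log the traceback, then re-raise so callers still see the error.
            logger.exception("Exception in %s", func.__name__)
            raise
        logger.debug("Exiting %s", func.__name__)
        return result
    return wrapper


@trace
def divide(a, b):
    return a / b
```

Because the exception is re-raised after logging, decorating a function does not change its behaviour for callers.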