ml_repo¶

class MLRepo(workspace=None, user=None, config=None, save_config=False, name='NONE')¶

Repository for doing machine learning

The repository and its extensions provide a solid foundation for machine learning work, supporting features such as:

auditing/versioning of the data, models and tests
best practice standardized plotting for investigating model performance and model behaviour
automated quality checks

Parameters:	workspace ([type]) – [description]. Defaults to None. user (str) – the user. Defaults to None. config (dict) – the configuration to use. Defaults to None. save_config (bool) – determines whether to save the configuration or not. Defaults to False.

add(repo_object, message='', category=None)¶

Add a repo_object or list of repo objects to the repository.

Raises an exception if the category is defined neither in the object itself nor by the category argument. It also raises an exception if an object with this id already exists.

Parameters:

repo_object (RepoObject) – repo_object or list of repo_objects to be added; it will be modified so that it contains the version number
message (str) – commit message. Defaults to ‘’.
category (MLObjectType) – category of repo_object, which overwrites the object’s category. Defaults to None.

Returns:	str or dictionary – version number of object added or dictionary of names and versions of objects added

add_eval_function(f, repo_name=None)¶

Add the function to evaluate the model

Parameters:	module_name (str) – module where function is located function_name (str) – function name repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.

add_measure(measure, coordinates=None)¶

Add a measure to the repository

If the measure already exists, it returns the message

Parameters:	measure (str) – string defining the measure, e.g. MAX, … coordinates (list of str) – list of strings defining the coordinates (by name) used for the measure, if None, all coordinates will be used. Defaults to None.

add_model(model_name, model_eval=None, model_training=None, model_param=None, training_param=None, preprocessors=None)¶

Add a new model to the repo

Parameters:

model_name (str) – identifier of the model
model_eval (str) – identifier of the evaluation function in the repo to evaluate the model, if None and there is only one evaluation function in the repo, this function will be used
model_training (str) – identifier of the training function in the repo to train the model, if None and there is only one training function in the repo, this function will be used
model_param (str) – identifier of the model parameter in the repo. If None and there is exactly one ModelParameter in the repo, this will be used; otherwise it is assumed that no model_params are needed. Defaults to None.
training_param (str) – identifier of the training parameter. If None and there is only one training_parameter object in the repo, this will be used. If an empty string is given, it is assumed that the algorithm does not need a training parameter. Defaults to None.
preprocessors (list) – list of strings naming the preprocessors to be executed. Defaults to None.
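
The "if None and there is only one ... in the repo" rule described above can be sketched as follows. This is only an illustration of the documented behaviour, not pailab's implementation: the helper `resolve_single_default`, the example identifiers, and the choice to raise a `ValueError` when no unique candidate exists are all assumptions.

```python
def resolve_single_default(requested, available, kind):
    """Return `requested`, or the only entry of `available` if `requested` is None.

    Hypothetical helper: raising ValueError when there is no unique
    candidate is an assumption, the reference above does not specify it.
    """
    if requested is not None:
        return requested
    if len(available) == 1:
        return available[0]
    raise ValueError(
        f"cannot pick a default {kind}: found {len(available)} candidates"
    )


# Hypothetical identifiers standing in for objects already stored in a repo.
eval_functions = ["eval_function"]
training_functions = ["train_fast", "train_accurate"]

# Only one evaluation function exists, so None resolves to it.
chosen_eval = resolve_single_default(None, eval_functions, "evaluation function")
# Two training functions exist, so the identifier has to be given explicitly.
chosen_training = resolve_single_default(
    "train_fast", training_functions, "training function"
)
print(chosen_eval, chosen_training)
```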

add_preprocessing_fitting_function(f, repo_name=None)¶

Add function to fit a preprocessor

Parameters:	module_name (str) – module where function is located function_name (str) – function name repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.

add_preprocessing_transforming_function(f, repo_name=None)¶

Add function to transform the data by a preprocessor

Parameters:	module_name (str) – module where function is located function_name (str) – function name repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.

add_preprocessor(preprocessor_name, transforming_function=None, fitting_function=None, preprocessor_param=None)¶

Add a new preprocessor to the repo

Parameters:

preprocessor_name (str) – identifier of the preprocessor
transforming_function (str) – identifier of the transforming function in the repo, if None and there is only one transforming function in the repo, this function will be used
fitting_function (str) – identifier of the fitting function in the repo to fit the preprocessor, if None the preprocessor does not need to be fitted
preprocessor_param (str) – identifier of the preprocessor parameter. Defaults to None.

Raises:

Exception – Raises an error if the preprocessing transforming function is not in repo

add_raw_data(name, data, input_names=None, data_y=None, target_names=None, file_format=None, axis=1)¶

Adds a RawData object to the repository.

This method creates a RawData object from the given data or file and adds it to the repository.

Examples

Read data from csv file ‘test_data.csv’ and use columns with headers ‘x0’, ‘x1’ as input data and column with label ‘x2’ as target, store the results under name ‘my_data’:

>>> ml_repo.add_raw_data('my_data', 'test_data.csv', ['x0', 'x1'], target_names=['x2'], file_format='csv')

Create data from a DataFrame test where the columns ‘x0’, ‘x1’ are used as input and no target is specified:

>>> ml_repo.add_raw_data('my_data', test, ['x0', 'x1'])

Parameters:

name (str) – Name of RawData in repository (if name does not start with ‘raw_data/’, this prefix is added).
data (str, numpy ndarray or pandas DataFrame) – Either a pandas DataFrame, a numpy ndarray or a string that is interpreted as the filename of the underlying data.
input_names (iterable of str, optional) – List of the input variables names. Defaults to None.
data_y (str or numpy ndarray, optional) – Either a numpy ndarray or a string defining the filename of the y-data (not valid if file_format==’csv’). Defaults to None.
target_names (iterable of str, optional) – List of the target variables names. Defaults to None.
file_format ('csv' or 'numpy', optional) – File type which can be either csv or numpy (numpy means an ndarray stored with numpy.save). Defaults to None.
axis (int, optional) – If only an ndarray is given but target variables are defined, this array will be split into two arrays (one for input, one for target) along this axis. Defaults to 1.

Returns:	version number of RawData object added
Return type:	str
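
Two of the behaviours described in the parameter list can be illustrated without pailab: the automatic ‘raw_data/’ prefix and the split of a single array into input and target parts along axis 1. The sketch below uses plain Python lists instead of numpy arrays; both helper names are hypothetical, and the assumption that the input columns come first and the target columns last is not confirmed by this reference.

```python
def normalize_raw_data_name(name):
    """Prepend 'raw_data/' unless the name already starts with it."""
    prefix = "raw_data/"
    return name if name.startswith(prefix) else prefix + name


def split_columns(rows, n_inputs):
    """Split each row into input and target values (a split along axis 1).

    Assumption for illustration: the first `n_inputs` columns are inputs,
    all remaining columns are targets.
    """
    x = [row[:n_inputs] for row in rows]
    y = [row[n_inputs:] for row in rows]
    return x, y


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
x, y = split_columns(rows, n_inputs=2)
print(normalize_raw_data_name("my_data"), x, y)
```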

add_test_data(name, raw_data_name, start_index=0, end_index=None, raw_data_version='last')¶

Add test data as a DataSet to the repository.

This method defines a DataSet and adds it to the repository. A DataSet is a logical unit based on a RawData object and defines the range of data that is taken from the respective RawData data.

Parameters:

name (str) – Name of respective object in repository.
raw_data_name (str) – Name of the underlying RawData object that is used as basis.
start_index (int, optional) – Start index where test data starts from underlying RawData. Defaults to 0.
end_index (int, optional) – End index where test data ends. Defaults to None.
raw_data_version (str) – Version of underlying RawData (if ‘last’, always the latest RawData will be used to derive the respective DataSet). Defaults to ‘last’

add_training_data(name, raw_data_name, start_index=0, end_index=None, raw_data_version='last')¶

Add training data as a DataSet to the repository.

This method defines a DataSet and adds it to the repository. A DataSet is a logical unit based on a RawData object and defines the range of data that is taken from the respective RawData data.

Parameters:

name (str) – Name of respective object in repository.
raw_data_name (str) – Name of the underlying RawData object that is used as basis.
start_index (int, optional) – Start index where training data starts from underlying RawData. Defaults to 0.
end_index (int, optional) – End index where training data ends. Defaults to None.
raw_data_version (str) – Version of underlying RawData (if ‘last’, always the latest RawData will be used to derive the respective DataSet). Defaults to ‘last’

add_training_function(f, repo_name=None)¶

Add function to train a model

Parameters:	module_name (str) – module where function is located function_name (str) – function name repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.

delete(name, version)¶

Delete a specific object.

It deletes the object. If other objects were modified by this object, it throws an exception stating that the modified objects must be deleted first.

Parameters:	name (str) – name of the object version (str) – version of the object
Raises:	`Exception` – If the object has depending objects, it can not be deleted and an error is thrown.

get(name, version='last', full_object=False, modifier_versions=None, obj_fields=None, repo_info_fields=None, throw_error_not_exist=True, throw_error_not_unique=True)¶

Get repo objects. It throws an exception, if an object with the name does not exist.

Parameters:

name (str) – the object name
version (str) – object version, default is latest (-1). If the fields are nested (an element of a dictionary which is itself an element of a dictionary), use path notation to the element, e.g. p/elem1/elem2 to get p[elem1][elem2]. Defaults to repo_store.RepoStore.LAST_VERSION.
full_object (bool) – flag to determine whether the numpy objects are loaded (True->load). Defaults to False.
modifier_versions ([type]) – [description]. Defaults to None.
obj_fields ([type]) – [description]. Defaults to None.
repo_info_fields ([type]) – [description]. Defaults to None.
throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
throw_error_not_unique (bool) – true - throw error if item is not unique, else return []. Defaults to True.

Raises:	`Exception` – raises an exception if no object with the specific name is found
Returns:	RepoObject or list thereof – The repo object
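
The path notation for nested fields (p/elem1/elem2 addressing p[elem1][elem2]) can be pictured with a small resolver over nested dictionaries. This is a sketch of the notation only; `get_by_path` is a hypothetical helper and says nothing about how pailab resolves such paths internally.

```python
def get_by_path(obj, path, sep="/"):
    """Follow a 'p/elem1/elem2' style path through nested dictionaries."""
    current = obj
    for key in path.split(sep):
        current = current[key]  # raises KeyError if a path element is missing
    return current


# Hypothetical object fields: 'p' is a dictionary containing a dictionary.
fields = {"p": {"elem1": {"elem2": 42}}}
value = get_by_path(fields, "p/elem1/elem2")
print(value)
```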

static get_calibrated_model_name(model_name)¶

For a model name the calibrated model name is returned

Parameters:	model_name (str) – model name
Returns:	string – the calibrated model name

get_commits(version_start='first', version_end='last')¶

Get the commits.

Parameters:	version_start (str) – only display versions after version_start. Defaults to repo_store.RepoStore.FIRST_VERSION. version_end (str) – only display versions up to version_end. Defaults to repo_store.RepoStore.LAST_VERSION.
Returns:	list of commit infos – returns a list of commit infos

static get_eval_name(model, data)¶

Return name of the object containing evaluation results

Parameters:	model (ModelDefinition object or str) – the model data (RawData or DataSet object or str) – the data
Returns:	string – name of evaluation results

get_history(name, repo_info_fields=None, obj_member_fields=None, version_start='first', version_end='last')¶

Return a list of histories of object member variables without big objects

Parameters:

name (str) – the object name
repo_info_fields (list of strings) – List of fields from repo_info which will be returned in the dictionary. If the list contains the flag ‘ALL’, all fields will be returned. Defaults to None.
obj_member_fields (list of strings) – List of member attributes from repo_object which will be returned in the dictionary. If the list contains the flag ‘ALL’, all attributes will be returned. Defaults to None.
version_start (str) – only display versions after version_start. Defaults to repo_store.RepoStore.FIRST_VERSION.
version_end (str) – only display versions up to version_end. Defaults to repo_store.RepoStore.LAST_VERSION.

Returns:

str or list of strings – returns a list of the objects

get_ml_repo_store()¶

Return the storage for the ml repo

Returns:	RepoStore – the storage for the RepoObjects

get_names(ml_obj_type)¶

Get the list of names of all repo_objects from a given repo_object_type in the repository.

Parameters:	ml_obj_type (MLObjectType) – MLObjectType specifying the types of objects names are returned for.
Returns:	list of strings – list of object names for the given category.

get_numpy_data_store()¶

Return the numpy data store of the ml repo

Returns:	numpy_handler – the numpy repo

get_training_data(version='last', full_object=True, model=None, model_version='last')¶

Returns training data for a model.

It returns the training data in the repo for a specified model. If there is only one set of training data in the repo, this set will be returned. Otherwise, the model is loaded and the training data is used as defined in the model. If in this case a model is not specified the method throws an exception.

Parameters:

version (str) – version of data object. Defaults to repo_store.RepoStore.LAST_VERSION.
full_object (bool) – if True, the complete data is returned including numpy data. Defaults to True.
model (str) – Name of model definition for which the training data will be returned.
model_version (str) – Version of model definition for which the training data will be returned.
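
The selection rule described above (a single training data set is returned directly, otherwise the model decides, and a missing model is an error) might look like this in outline. The function, the dictionary-based stand-ins for the repo and the model, and the use of `ValueError` are assumptions made for illustration only.

```python
def select_training_data(training_sets, model=None):
    """Pick training data following the rule described above.

    `training_sets` maps names to data, `model` is a dict with a
    'training_data' entry; both are hypothetical stand-ins for repo objects.
    """
    if len(training_sets) == 1:
        # Only one set in the repo: no model is needed to decide.
        return next(iter(training_sets.values()))
    if model is None:
        raise ValueError("no unique training data set; a model must be specified")
    return training_sets[model["training_data"]]


single = select_training_data({"training_data_1": [1, 2, 3]})
several = {"training_data_1": [1, 2, 3], "training_data_2": [4, 5]}
chosen = select_training_data(several, model={"training_data": "training_data_2"})
print(single, chosen)
```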

pull()¶: Pull changes from an external repo

push()¶: Push changes to an external repo.

run(job)¶

Executes a job

Parameters:	job (Job) – The job object to be executed
Returns:	[type] – Return the name and version of the job or a message that the job does not need to be rerun

run_evaluation(model=None, message=None, model_version='last', datasets={}, predecessors=[], run_descendants=False, labels=None)¶

Evaluate the model on all datasets.

Parameters:

model (str) – name of model to evaluate, if None and only one model exists, that model is used. Defaults to None.
message (str) – message inserted into commit, if None: an automated message is created. Defaults to None.
model_version (str) – version of model to be evaluated. Defaults to repo_store.RepoStore.LAST_VERSION.
datasets (dict) – dictionary of datasets (names and version numbers) on which the model is evaluated. Defaults to {}.
predecessors (list) – list of jobs which must have completed successfully before the evaluation is started. Default is all datasets from testdata on latest version. Defaults to [].
run_descendants (bool) – if True also run all descendant jobs. Defaults to False.
labels ([type]) – [description]. Defaults to None.

Returns:

list of strings – a list of the job ids

run_measures(model=None, message=None, model_version='last', datasets={}, measures={}, predecessors=[], labels=None)¶

Run the measures

Parameters:

model (str) – name of model to evaluate, if None and only one model exists, that model is used. Defaults to None.
message (str) – message inserted into commit, if None: an automated message is created. Defaults to None.
model_version (str) – version of model to be evaluated. Defaults to repo_store.RepoStore.LAST_VERSION.
datasets (dict) – dictionary of datasets (names and version numbers) on which the model is evaluated. Defaults to {}.
predecessors (list) – list of jobs which must have completed successfully before the evaluation is started. Default is all datasets from testdata on latest version. Defaults to [].
run_descendants (bool) – if True also run all descendant jobs. Defaults to False.
labels ([type]) – [description]. Defaults to None.

Returns:

list of strings – a list of the job ids

run_tests(test_definitions=None, predecessors=[])¶

Run tests for a specific model version.

Parameters:	test_definitions (list or set) – List or set of names of the test definitions which shall be executed. If None, all test definitions are executed. Defaults to None. predecessors (list) – list of jobs which must have completed successfully before the evaluation is started. Default is all datasets from testdata on latest version. Defaults to [].
Returns:	str – ticket number of job

run_training(model=None, message=None, model_version='last', training_function_version='last', training_data_version='last', training_param_version='last', model_param_version='last', run_descendants=False)¶

Run the training algorithm.

Parameters:

model (str) – the identifier of the model. Defaults to None.
message (str) – the commit message. Defaults to None.
model_version (str) – the version of the model. Defaults to repo_store.RepoStore.LAST_VERSION.
training_function_version (str) – the version of the training function. Defaults to repo_store.RepoStore.LAST_VERSION.
training_data_version (str) – the version of the training data. Defaults to repo_store.RepoStore.LAST_VERSION.
training_param_version (str) – the version of the training parameter. Defaults to repo_store.RepoStore.LAST_VERSION.
model_param_version (str) – the version of the model parameter. Defaults to repo_store.RepoStore.LAST_VERSION.
run_descendants (bool) – if True also run all descendant jobs. Defaults to False.

Returns:

[type] – return name and version or message

set_label(label_name, model=None, model_version='last', message='')¶

Label a certain model version.

This method labels a certain model version. It checks whether a model with this version really exists and throws an exception if it does not.

Parameters:	label_name (str) – the label name model (str) – the identifier of the model. Defaults to None. model_version (str) – model version for which the label is set. Defaults to repo_store.RepoStore.LAST_VERSION. message (str) – commit message. Defaults to ‘’.

repo_objects¶

This module contains a bunch of different RepoObjects.

In principle, all objects that can be stored within pailab’s MLRepo are called a RepoObject. So, if you need a new object apart from those documented here, you just have to implement the respective interfaces, so that the object can be processed by pailab. This may be accomplished in three different ways:

Inherit your class from the pailab.repo_objects.RepoObject class. This may not be very pythonic, but it easily shows you which interfaces you definitely have to implement.

If you have a very simple object you may use the decorator pailab.repo_objects.repo_object_init in conjunction with your class’s constructor to make your class a RepoObject.

Just implement the methods needed (again look at pailab.repo_objects.RepoObject to what has to be defined).
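
To make the decorator option more concrete, here is a sketch of the general technique of wrapping a constructor so that every instance carries repository metadata. It is deliberately not pailab's `repo_object_init`: the decorator `repo_object_sketch`, the `repo_info` keyword handling, and the `to_dict` layout are all hypothetical and only show the idea.

```python
import functools


def repo_object_sketch(init):
    """Hypothetical decorator: wrap __init__ so instances store `repo_info`."""

    @functools.wraps(init)
    def wrapper(self, *args, repo_info=None, **kwargs):
        # Keep the repository metadata separate from the object's own fields.
        self.repo_info = dict(repo_info or {})
        init(self, *args, **kwargs)

    return wrapper


class MyParameters:
    @repo_object_sketch
    def __init__(self, learning_rate, epochs):
        self.learning_rate = learning_rate
        self.epochs = epochs

    def to_dict(self):
        # Assumed layout: plain attributes plus a nested 'repo_info' entry.
        data = {k: v for k, v in vars(self).items() if k != "repo_info"}
        data["repo_info"] = dict(self.repo_info)
        return data


params = MyParameters(0.01, epochs=5, repo_info={"name": "my_params"})
print(params.to_dict())
```

The class author only writes an ordinary constructor; the wrapper adds the metadata handling, which is the appeal of the decorator route for very simple objects.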

class CommitInfo(message, author, objects, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶

Stores each commit including the commit message and the objects committed.

Parameters:

message (string) – commit message
author (string) – author
objects (dictionary) – dictionary of names of committed objects and version numbers

class DataSet(raw_data, start_index=0, end_index=None, raw_data_version='last', repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶

Class used to define data used e.g. for training or testing.

This class refers to a RawData object together with a start index and an end index. The repository

Parameters:

raw_data (str) – id of raw_data the dataset refers to
start_index (int) – index of first entry of the raw data used in the dataset. Defaults to 0.
end_index (int or None) – end_index of last entry of the raw data used in the dataset (if None, all including last element are used). Defaults to None.
raw_data_version (str) – version of RawData object the DataSet refers to. Defaults to ‘last’.
repo_info (RepoInfo) – dictionary of the repo info. Defaults to RepoInfo().

Raises:

Exception – raises an exception if the start index is after the end index

set_data(raw_data)¶

Set the data from the given raw_data.

Parameters:	raw_data (RawData) – the raw data used to set the data from
Raises:	`Exception` – if end_index is less than start_index
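
The idea of a DataSet as a range over some RawData, including the index check, can be sketched with a small dataclass. `DataSetSketch` is a hypothetical stand-in, not the pailab class; it assumes non-negative indices and Python slice semantics (end_index exclusive), neither of which is confirmed by this reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSetSketch:
    """Hypothetical stand-in for a DataSet: a named range over raw data."""

    raw_data: str
    start_index: int = 0
    end_index: Optional[int] = None

    def __post_init__(self):
        # Mirrors the documented check; assumes non-negative indices.
        if self.end_index is not None and self.end_index < self.start_index:
            raise ValueError("end_index must not be less than start_index")

    def select(self, rows):
        # Assumption: Python slice semantics, i.e. end_index is exclusive;
        # end_index=None takes everything up to and including the last row.
        return rows[self.start_index:self.end_index]


raw_rows = [10, 11, 12, 13, 14]
training = DataSetSketch("raw_data/my_data", start_index=0, end_index=3)
test = DataSetSketch("raw_data/my_data", start_index=3)
print(training.select(raw_rows), test.select(raw_rows))
```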

class Function(f, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶

create()¶

Returns the function object

Returns:	the function object
Return type:	function object

get_version()¶

returns the version

Returns:	[type] – the module version

class Label(model_name, model_version, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶: RepoObject to label a certain model version

class Measure(value, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶: the measure repo object

class MeasureConfiguration(measures, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶

RepoObject defining a configuration for all measures which shall be computed.

F1 = 'f1'¶

L2 = 'l2'¶

MAX = 'max'¶

MSE = 'mse'¶

PRECISION = 'precision'¶

R2 = 'r2'¶

RECALL = 'recall'¶

ROC_AUC = 'roc_auc'¶

add_measure(measure, coords=None)¶

add a measure to the repo object

Parameters:	measure ([type]) – the measure coords ([type]) – the coordinates. Defaults to None.

static get_name(measure_def)¶

function to return a name of the measure

Parameters:	measure_def (MeasureConfiguration) – the measure definition
Returns:	str – the name of the measure

class Model(preprocessors=None, eval_function=None, train_function=None, train_param=None, model_param=None, training_data=None, test_data=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶

get_test_data(ml_repo)¶

Returns all test data in the repo relevant for this model.

Parameters:	ml_repo (MLRepo) – The repository from which the test data is taken
Returns:	list of names of the test data that applied to this model

class Preprocessor(transforming_function, fitting_function=None, preprocessing_param=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶: Preprocessor class

class RawData(x_data, x_coord_names, y_data=None, y_coord_names=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶: Class to store numpy data.

class RepoInfo(kwargs={}, name=None, version=None, category=None, modification_info=None)¶

Contains all repo relevant information

This class contains all repo relevant information such as version, name, descriptions. It must be a member of all objects which are handled by the repo.

get_dictionary()¶: Return repo info as dictionary

set_fields(kwargs)¶

Set repo info fields from a dictionary

Parameters:	kwargs (dict) – additional arguments

class RepoInfoKey¶

Enums to describe all possible repository information.

AUTHOR = 'author'¶

BIG_OBJECTS = 'big_objects'¶

CATEGORY = 'category'¶

CLASSNAME = 'classname'¶

COMMIT_DATE = 'commit_date'¶

COMMIT_MESSAGE = 'commit_message'¶

DESCRIPTION = 'description'¶

MODIFICATION_INFO = 'modification_info'¶

NAME = 'name'¶

VERSION = 'version'¶

class RepoObject(repo_info)¶

Base class for objects which are handled by the repository.

from_dict(repo_obj_dict)¶

set object from a dictionary

Parameters:	repo_object_dict (dict) – dictionary with the object data

numpy_from_dict(repo_numpy_dict)¶

sets the attributes of the numpy dictionary

Parameters:	repo_numpy_dict (dict) – dictionary with the object data

numpy_to_dict()¶

function to get the attributes as a dictionary

Returns:	dict – dictionary of the attributes

to_dict()¶

Return a data dictionary for a given repo_object without the big data objects

Returns:	dict – dictionary of data

class Result(data, big_data=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶

the result repo object

numpy_from_dict(repo_numpy_dict)¶

sets the attributes of the numpy dictionary

Parameters:	repo_numpy_dict (dict) – dictionary with the object data

numpy_to_dict()¶

returns the big data object

Returns:	[type] – the big data

create_repo_obj(obj)¶

Create a repo_object from a dictionary.

This function creates a repo object from a dictionary in a factory-like fashion. It uses the obj[‘repo_info’][‘classname’] within the dictionary and constructs the class using get_object_from_classname. It throws an exception if the dictionary does not contain a ‘repo_info’ key.

Parameters:	obj (dict) – dictionary containing all information for a repo_object.
Raises:	`Exception` – raises an exception if the dictionary is not a repo dictionary
Returns:	[type] – the object of the specified class

create_repo_obj_dict(obj)¶

Create a dictionary from a repo_object, containing all values needed to handle the object within the repo

Parameters:	obj (RepoObject) – repository object
Returns:	dict – returns the dictionary

get_object_from_classname(classname, data)¶

Returns an object instance for given classname and data dictionary.

Parameters:	classname (str) – Full classname as string including the modules, e.g. repo.Y if class Y is defined in module repo. data (dict) – dictionary of data used to initialize the object instance.
Returns:	[type] – Instance object of class.
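
Constructing an instance from a full classname string is commonly done with `importlib`. The following sketch shows that general technique; `object_from_classname` is a hypothetical name, and passing the data dictionary as keyword arguments is an assumption, since this reference does not say how pailab initializes the instance.

```python
import importlib


def object_from_classname(classname, data):
    """Import 'package.module.Class' and build an instance from `data`.

    Assumption for illustration: `data` is passed as keyword arguments.
    """
    module_name, _, class_name = classname.rpartition(".")
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    return cls(**data)


# A standard-library class is used so the sketch runs without pailab.
delta = object_from_classname("datetime.timedelta", {"days": 1, "hours": 12})
print(delta.total_seconds())
```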

class repo_object_init(big_objects=[])¶

Decorator class to modify a constructor so that the class can be used within the ml repository as repo_object.

from_dict(repo_obj_dict)¶

set object from a dictionary

Parameters:	repo_object (RepoObject) – repo_object which will be set from the dictionary repo_object_dict (dict) – dictionary with the object data

init_repo_object(init_self, repo_info)¶

initialiser for repo objects

Parameters:	init_self ([type]) – [description] repo_info (dict) – the repository info

numpy_from_dict(repo_numpy_dict)¶

function to transform a dictionary to a numpy

Parameters:	repo_obj (RepoObject) – the repo object repo_numpy_dict (numpy dict) – the repo numpy dictionary

numpy_to_dict()¶

function to get the attributes as a dictionary

Parameters:	repo_obj (RepoObject) – the repo object
Returns:	dict – dictionary of the attributes

to_dict()¶

Return a data dictionary for a given repo_object

Parameters:	repo_obj (RepoObject) – A repo_object, i.e. object which provides the repo_object interface
Returns:	dict – dictionary of data

object types¶

class MLObjectType¶

Enum describing all ml object types.

The MLObjectType is assigned to each object in the MLRepo. It is used to structure all objects and to support consistency checks and automatic pipelines, the following types are defined:

EVAL_DATA: evaluation data (result from evaluation of a model)

RAW_DATA: raw data, i.e. simple numpy structures most often used to derive test or training data from the RawData

TRAINING_DATA: training data used for model training

TEST_DATA: data used for model testing

TEST: concrete test

TEST_DEFINITION: definition of a test which is applied to respective data and models to obtain a test

MODEL_PARAM: model parameter

TRAINING_PARAM: training parameter

TRAINING_FUNCTION: function to train a certain model

MODEL_EVAL_FUNCTION: function to evaluate a certain model

PREPROCESSOR_PARAM: preprocessing parameter

PREPROCESSOR: definition of a preprocessor

PREPROCESSING_FITTING_FUNCTION: function to fit the preprocessor

PREPROCESSING_TRANSFORMING_FUNCTION: function to apply preprocessing to data

LABEL: model label

MODEL: definition of a model

CALIBRATED_MODEL: object containing a calibrated instance of a model

COMMIT_INFO: internally used to store commit messages

MAPPING: internally used mapping object to map an object’s name to the object’s category

MEASURE: computed measure (e.g. norm of error)

MEASURE_CONFIGURATION: the configuration of all measures applied to the model

RESULT: object holding results

JOB: a job

TRAINING_STATISTIC: object holding training statistics, e.g. training history

CACHED_VALUE: cached return values of time consuming functions

repo stores¶

Base classes¶

class NumpyStore¶

class to handle big objects

add(name, version, numpy_dict)¶

Add numpy data from an object to the storage.

Parameters:	name (str) – Name (as string) of object version (str) – object version numpy_dict (numpy dict) – numpy dictionary

append(name, version_old, version_new, numpy_dict)¶

Append data to an existing object

Parameters:	name (str) – name of data object to be returned version_old (str) – version of the object where the data will be appended version_new (str) – version of the new object after appending the data numpy_dict (dict) – dictionary containing the values

get(name, version, from_index=0, to_index=None)¶

Get the numpy object for a name and a version; a range of rows can be selected with from_index and to_index

Parameters:	name (str) – identifier of the object version (str) – version of the object from_index (int) – the index from which the data should be taken. Defaults to 0. to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Returns:	numpy array – the numpy object to return

pull()¶: Pull changes from an external repo

push()¶: Push changes to an external repo.

class RepoScriptStore¶

add(script_file)¶

Add a script to the storage.

Parameters:	script_file (string) – file (including path) of script

get(name, versions=None)¶

class RepoStore¶

Abstract base class for all storages which can be used in the ML repository

LAST_VERSION = 'last'¶

add(obj)¶

Add an object to the storage.

Parameters:	obj (RepoObject | list(RepoObject)) – repository object or list of repository objects
Raises:	Exception if an object with same name already exists.

get(name, versions=None, modifier_versions=None, obj_fields=None, repo_info_fields=None, throw_error_not_exist=True, throw_error_not_unique=True)¶

Get a dictionary/list of dictionaries fulfilling the conditions.

Returns a list of objects matching the name and whose

version is in the given list of versions
modifiers match the version number/are in the list of version numbers of the given modifiers

Parameters:

name (str) – object id
versions (list, version_number, tuple) – either a list of versions or a single version of the objects to be returned. If None, the condition on version is ignored. If a tuple is given, it defines a version interval, i.e. all versions between the first and last entry (both inclusive) are returned. In addition, FIRST_VERSION and LAST_VERSION can be used to access the first/last version. Defaults to None.
modifier_versions (dictionary) – modifier ids together with version specs which are matched by the returned object. Defaults to None.
obj_fields (list of str or str) – list of strings identifying the fields which will be returned in the dictionary, if None, no fields are returned, if set to ‘all’, all fields will be returned. Defaults to None.
repo_info_fields (list of str or str) – list of strings identifying the fields of the repo_info dict which will be returned in the dictionary, if None, no fields are returned, if set to ‘all’, all fields will be returned. Defaults to None.
throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
throw_error_not_unique (bool) – true - throw error if item is not unique, else return []. Defaults to True.

Returns:

RepoObject or list thereof – The repo object
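
The different forms accepted for `versions` (None, a single version, a list, or a tuple describing an inclusive interval) can be illustrated on an ordered list of version identifiers. The helper names, the assumption that versions are ordered from oldest to newest, and the example identifiers are hypothetical; the strings 'first' and 'last' follow the defaults shown in the signatures on this page.

```python
FIRST_VERSION = "first"
LAST_VERSION = "last"


def _resolve(all_versions, version):
    """Translate the 'first'/'last' keywords into concrete versions."""
    if version == FIRST_VERSION:
        return all_versions[0]
    if version == LAST_VERSION:
        return all_versions[-1]
    return version


def select_versions(all_versions, versions=None):
    """Filter `all_versions` (assumed ordered oldest to newest)."""
    if versions is None:
        return list(all_versions)  # no condition on the version
    if isinstance(versions, tuple):
        # A tuple defines an interval, both ends inclusive.
        start = all_versions.index(_resolve(all_versions, versions[0]))
        end = all_versions.index(_resolve(all_versions, versions[1]))
        return all_versions[start:end + 1]
    if isinstance(versions, list):
        wanted = {_resolve(all_versions, v) for v in versions}
        return [v for v in all_versions if v in wanted]
    single = _resolve(all_versions, versions)
    return [v for v in all_versions if v == single]


history = ["v1", "v2", "v3", "v4"]  # hypothetical version identifiers
print(select_versions(history, ("v2", LAST_VERSION)))
```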

get_first_version(name, throw_error_not_exist=True)¶

Return version number of first (in a temporal sense) object in storage

Parameters:	name (str) – object name for which the version is returned throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
Raises:	NotImplementedError – [description]

get_latest_version(name, throw_error_not_exist=True)¶

Return latest version number of object in the storage

Parameters:	name (str) – object name throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
Returns:	str – latest version number

get_names(ml_obj_type)¶

Return the names of all objects belonging to the given category.

Parameters:	ml_obj_type (str) – Value of MLObjectType-Enum specifying the category for which all names will be returned

get_version(name, offset, throw_error_not_exist=True)¶

Return version number for the given offset

If offset >= 0, it returns the version number at position offset. If offset < 0, it follows Python list indexing and counts from the end, e.g. an offset of -1 returns the latest version.

Parameters:	name (str) – name of object offset (int) – offset throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
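
Since the offset follows Python list logic, the behaviour can be pictured with an ordinary list of versions. `version_at_offset` and the version identifiers are hypothetical, and the sketch assumes the versions are kept in temporal order.

```python
def version_at_offset(versions, offset, throw_error_not_exist=True):
    """Return the version at `offset`, using ordinary list indexing."""
    try:
        return versions[offset]
    except IndexError:
        if throw_error_not_exist:
            raise
        return []  # documented fallback when errors are suppressed


versions = ["v1", "v2", "v3"]  # hypothetical, ordered oldest to newest
first = version_at_offset(versions, 0)
latest = version_at_offset(versions, -1)
previous = version_at_offset(versions, -2)
missing = version_at_offset(versions, 10, throw_error_not_exist=False)
print(first, latest, previous, missing)
```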

object_exists(name, version='last')¶

Returns True if an object with the given name and version exists.

Parameters:	name (string) – object name version (version number) – version number. Defaults to LAST_VERSION.

pull()¶: Pull changes from an external repo

push()¶: Push changes to an external repo.

replace(obj)¶

Overwrite existing object without incrementing version

Parameters:	obj (RepoObject) – repo object to be overwritten

Memory storages¶

class NumpyMemoryStorage¶

Bases: pailab.ml_repo.repo_store.NumpyStore

add(name, version, numpy_dict)¶

Add numpy data from an object to the storage.

Parameters:	name (str) – identifier (as string) of object version (str) – object version numpy_dict (numpy dict) – numpy dictionary

append(name, version_old, version_new, numpy_dict)¶

Append a numpy dictionary to an existing object.

Parameters:	name (str) – identifier of the object version_old (str) – the old version of the object version_new (str) – the new version of the object numpy_dict (numpy dict) – the numpy dictionary to append
Raises:	`Exception` – raises an exception if the object does not exist

get(name, version, from_index=0, to_index=None)¶

Get the numpy object for a name and a version. A range of rows can be selected with from_index and to_index.

Parameters:	name (str) – identifier of the object version (str) – version of the object from_index (int) – the index from which the data should be taken. Defaults to 0. to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Raises:	`Exception` – raises an exception if no object with the name exists `Exception` – raises an exception if no object with the version exists
Returns:	numpy array – the numpy object to return
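
The interplay of add, append and get can be sketched with a small dictionary-based store. This is only an illustration under stated assumptions, not the pailab implementation: `ToyMemoryStorage` is a hypothetical class, plain lists stand in for numpy arrays, `to_index` is assumed to be exclusive as in Python slicing, and the old version is assumed to stay available after an append.

```python
class ToyMemoryStorage:
    """Hypothetical in-memory store used for illustration only."""

    def __init__(self):
        # name -> {version -> {key: list of rows}}
        self._store = {}

    def add(self, name, version, data_dict):
        self._store.setdefault(name, {})[version] = {
            key: list(rows) for key, rows in data_dict.items()
        }

    def append(self, name, version_old, version_new, data_dict):
        if name not in self._store or version_old not in self._store[name]:
            raise Exception(f"No object {name!r} with version {version_old!r}")
        old = self._store[name][version_old]
        # Assumption: the new version holds the old rows plus the appended ones.
        self._store[name][version_new] = {
            key: old[key] + list(data_dict[key]) for key in old
        }

    def get(self, name, version, from_index=0, to_index=None):
        if name not in self._store:
            raise Exception(f"No object with name {name!r}")
        if version not in self._store[name]:
            raise Exception(f"No object {name!r} with version {version!r}")
        data = self._store[name][version]
        # to_index=None slices to the end of the data.
        return {key: rows[from_index:to_index] for key, rows in data.items()}


store = ToyMemoryStorage()
store.add("train", "v1", {"x": [1, 2, 3]})
store.append("train", "v1", "v2", {"x": [4, 5]})
rows = store.get("train", "v2", from_index=1, to_index=4)
```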

class RepoObjectMemoryStorage¶

Bases: pailab.ml_repo.repo_store.RepoStore

The repo object memory storage. This class stores repo objects (excluding large objects) in memory. It is mainly intended for testing purposes.

get_first_version(name, throw_error_not_exist=True)¶

Determine the first version of the object

Parameters:	name (str) – identifier of the object throw_error_not_exist (bool) – If True, raise an error if the object does not exist; otherwise return []. Defaults to True.
Raises:	`Exception` – raises an exception if the object does not exist
Returns:	str – the first version string of the object

get_latest_version(name, throw_error_not_exist=True)¶

Determine the latest version of the object

Parameters:	name (str) – identifier of the object throw_error_not_exist (bool) – If True, raise an error if the object does not exist; otherwise return []. Defaults to True.
Raises:	`Exception` – raises an exception if the object does not exist
Returns:	str – the latest version string of the object

get_names(category)¶

Return the names of all objects belonging to the given category.

Parameters:	category (str) – Value of MLObjectType-Enum specifying the category for which all names will be returned
Returns:	list of str – a list of all objects in the category

get_version(name, offset, throw_error_not_exist=True)¶

Return the newest version up to offset versions

Parameters:	name (str) – the identifier of the object offset (int) – the offset throw_error_not_exist (bool) – If True, raise an error if the object does not exist; otherwise return []. Defaults to True.
Raises:	`Exception` – raises an error if the offset is higher than the number of versions available `Exception` – raises an exception if the object does not exist and throw_error_not_exist == True
Returns:	str – the version

replace(obj)¶

Overwrite existing object without incrementing version

Parameters:	obj (RepoObject) – repo object to be overwritten

RepoObjectDiskStorage¶

class RepoObjectDiskStorage(folder, file_format='pickle')¶

The RepoObjectDiskStorage class

check_integrity()¶

Checks if files are missing or have not yet been added

Returns:	dictionary – contains sets of missing files and/or set of files not yet added
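
An integrity check of this kind boils down to two set differences. The sketch below is a hedged illustration, not the library's code: `check_integrity_sketch` and the dictionary keys `missing` and `not_added` are assumed names, and the file lists are made up.

```python
def check_integrity_sketch(files_on_disk, files_registered):
    """Compare files found on disk with files known to the storage.

    Hypothetical helper; the real method may use different keys
    and obtain both file sets by other means.
    """
    on_disk = set(files_on_disk)
    registered = set(files_registered)
    return {
        "missing": registered - on_disk,    # registered but no file on disk
        "not_added": on_disk - registered,  # on disk but never registered
    }


report = check_integrity_sketch(
    files_on_disk=["a.pkl", "b.pkl"],
    files_registered=["b.pkl", "c.pkl"],
)
```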

close_connection()¶: Closes the database connection

get_config()¶

return the configuration

Returns:	dict – a dictionary of the configuration

get_first_version(name, throw_error_not_exist=True)¶

Determine the first version of the object

Parameters:	name (str) – identifier of the object throw_error_not_exist (bool) – If True, raise an error if the object does not exist; otherwise return []. Defaults to True.
Raises:	`Exception` – raises an exception if the object does not exist
Returns:	str – the first version string of the object

get_latest_version(name, throw_error_not_exist=True)¶

Determine the latest version of the object

Parameters:	name (str) – identifier of the object throw_error_not_exist (bool) – If True, raise an error if the object does not exist; otherwise return []. Defaults to True.
Raises:	`Exception` – raises an exception if the object does not exist
Returns:	str – the latest version string of the object

get_names(ml_obj_type)¶

Return the names of all objects belonging to the given category.

Parameters:	ml_obj_type (str) – Value of MLObjectType-Enum specifying the category for which all names will be returned
Returns:	list of str – a list of all objects in the category

get_version(name, offset, throw_error_not_exist=True)¶

Return the newest version up to offset versions

Parameters:	name (str) – the identifier of the object offset (int) – the offset throw_error_not_exist (bool) – If True, raise an error if the object does not exist; otherwise return []. Defaults to True.
Raises:	`Exception` – raises an exception if the object does not exist and throw_error_not_exist == True
Returns:	str – the version

get_version_condition(name, versions, version_column, time_column)¶

Return the part of the SQL statement's condition that restricts the versions.

Parameters:	name (str) – not used versions (str or list of str) – the version or versions to condition on version_column (str) – version column name time_column (str) – time column name
Returns:	str – the condition for the versions

replace(obj)¶

Overwrite existing object without incrementing version

Parameters:	obj (RepoObject) – repo object to be overwritten

RepoObjectGitStorage¶

class RepoObjectGitStorage(remote=None, **kwargs)¶

Object storage with git support.

This storage stores all objects on disk in a local git repository. It provides functionality to push to and pull from another git repo. Note that it handles all files in the same way as RepoObjectDiskStorage does (using methods from this storage).

commit(message, force=True)¶

Commits the changes

Parameters:	message (str) – Commit message force (bool) – If False, objects will only be committed if the integrity check succeeds.
Raises:	`Exception` – raises an exception if the integrity check fails

pull(remote_name='origin')¶

Pull from the remote git repository

Parameters:	remote_name (str) – the name of the remote git repository. Defaults to ‘origin’.
Raises:	`Exception` – raises an exception if the remote name is not available `Exception` – raises an error if the pull fails

push(remote_name='origin')¶

pushes the changes to the remote git repository

Parameters:	remote_name (str) – name of the remote repository. Defaults to ‘origin’.
Raises:	`Exception` – raises an exception if the remote does not exist

replace(obj)¶

Overwrite existing object without incrementing version

Parameters:	obj (RepoObject) – the repo object to be overwritten

NumpyHDFStorage¶

Module defining classes to store numpy data in hdf5 files.

This module provides implementations of the pailab.ml_repo.repo_store.NumpyStore using hdf5 file format.

class NumpyHDFRemoteStorage(folder, remote_store=None, sync_get=False, sync_add=False)¶

Storage working like NumpyHDFStorage locally but in addition provides synchronization with a remote.

This storage stores numpy data in hdf5 files in a directory. It works much like NumpyHDFStorage, with the difference that it synchronizes the data with a given remote (downloading and uploading the respective files).

Example

This example shows how to set up the storage so that the data is stored in a local directory and can be synchronized with Google Cloud Storage:

>>> store = NumpyHDFRemoteStorage(r'C:\tmp\data')  # raw string, so '\t' is not read as a tab
>>> from pailab.ml_repo.remote_gcs import RemoteGCS
>>> remote = RemoteGCS(bucket='my_data')
>>> store.set_remote(remote)

Parameters:

folder (str) – folder where data is stored
remote_store (obj or dict) – object representing a remote storage (e.g. pailab.ml_repo.remote_gcs.RemoteGCS for the google cloud storage) or dictionary defining the remote params so that it can be created
sync_get (bool) – If True, tries to download data automatically if it does not exist locally, otherwise it checks only locally
sync_add (bool) – If True, added data will be directly uploaded to the remote
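
The effect of the two synchronization flags can be sketched with a toy store in which plain dictionaries stand in for the local directory and the remote. Everything here is hypothetical (`ToySyncedStore`, its attributes, the keys); it only mirrors the behaviour described for sync_get and sync_add and is not the library's implementation.

```python
class ToySyncedStore:
    """Hypothetical local store with an optional remote fallback."""

    def __init__(self, remote, sync_get=False, sync_add=False):
        self.local = {}       # stands in for the local folder
        self.remote = remote  # plain dict standing in for a remote storage
        self.sync_get = sync_get
        self.sync_add = sync_add

    def add(self, key, value):
        self.local[key] = value
        if self.sync_add:
            # Upload directly when data is added.
            self.remote[key] = value

    def get(self, key):
        if key not in self.local and self.sync_get and key in self.remote:
            # Download on demand when the data is not available locally.
            self.local[key] = self.remote[key]
        return self.local[key]  # KeyError if the data cannot be found


remote = {"old": 1}
synced = ToySyncedStore(remote, sync_get=True, sync_add=True)
synced.add("new", 2)
fetched = synced.get("old")
```

With both flags left at False, the toy store would neither upload "new" nor find "old", which corresponds to checking only locally.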

add(name, version, numpy_dict)¶

Add numpy data from an object to the storage.

Parameters:	name (str) – the identifier of the object to add version (str) – the object version numpy_dict (numpy dict) – the numpy dictionary to add

get(name, version, from_index=0, to_index=None)¶

Get the numpy object for a name and a version. A range of rows can be selected with from_index and to_index.

Parameters:	name (str) – identifier of the object version (str) – version of the object from_index (int) – the index from which the data should be taken. Defaults to 0. to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Raises:	`Exception` – raises an exception if no object with the name exists `Exception` – raises an exception if no object with the version exists
Returns:	numpy array – the numpy object to return

pull()¶: Pull changes from an external repo

push()¶: Push changes to an external repo.

class NumpyHDFStorage(folder, version_files=False)¶

Storage using hdf5 files to store numpy data.

Example

Set up the storage using the folder C:\temp\data:

>>> store = NumpyHDFStorage(r'C:\temp\data')  # raw string, so '\t' is not read as a tab

Parameters:	folder (str) – main directory where the files will be stored version_files (bool) – If True, each version is contained in a separate file, otherwise all versions are in one file. If you work in a distributed environment (e.g. multiple users working in parallel), you should set this parameter to True so that no file merge is necessary. Defaults to False.
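
A note on Windows folder names in Python: inside an ordinary string literal, the sequence `\t` is interpreted as a tab character, so a path such as C:\temp is silently corrupted unless it is written as a raw string (or with escaped backslashes). The short check below uses only built-in string behaviour and makes no assumptions about the storage classes.

```python
plain = 'C:\temp'   # '\t' is parsed as a single tab character
raw = r'C:\temp'    # the raw string keeps the backslash and the 't'

plain_has_tab = '\t' in plain
raw_has_tab = '\t' in raw
lengths = (len(plain), len(raw))
```

Here `plain` holds a tab where the backslash and the `t` were meant to be, so it is one character shorter than `raw`.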

add(**kw)¶

Add numpy data from an object to the storage.

Parameters:	name (str) – the identifier of the object to add version (str) – the object version numpy_dict (numpy dict) – the numpy dictionary to add

append(**kw)¶

Append data to an existing object.

Parameters:	name (str) – the object identifier version_old (str) – the previous object version version_new (str) – the next object version numpy_dict (numpy dict) – the data to add as a numpy dictionary

get(**kw)¶

Get the numpy object for a name and a version. A range of rows can be selected with from_index and to_index.

Parameters:	name (str) – identifier of the object version (str) – version of the object from_index (int) – the index from which the data should be taken. Defaults to 0. to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Raises:	`Exception` – raises an exception if no object with the name exists `Exception` – raises an exception if no object with the version exists
Returns:	numpy array – the numpy object to return

object_exists(name, version)¶

checks whether the object exists

Parameters:	name (str) – the identifier of the object version (str) – the version of the object
Returns:	bool – returns true if the object exists

trace(aFunc)¶: Trace entry, exit and exceptions.
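
A decorator that traces entry, exit and exceptions could look roughly like the sketch below. It is a generic illustration built on the standard library, not the module's own `trace`: the names `trace_sketch`, `add_one` and the logger name are hypothetical.

```python
import functools
import logging

logger = logging.getLogger("trace_sketch")  # hypothetical logger name


def trace_sketch(func):
    """Log entry, exit and exceptions of the wrapped function."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        logger.debug("entering %s", func.__name__)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Log the traceback, then let the exception propagate unchanged.
            logger.exception("exception in %s", func.__name__)
            raise
        logger.debug("leaving %s", func.__name__)
        return result

    return wrapper


@trace_sketch
def add_one(x):
    return x + 1


value = add_one(1)
```

Because of `functools.wraps`, the decorated function still reports its original name, which keeps the log messages and any introspection meaningful.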