ml_repo¶
-
class
MLRepo
(workspace=None, user=None, config=None, save_config=False, name='NONE')¶ Repository for doing machine learning
- The repository and its extensions provide a solid foundation for machine learning work, supporting features such as:
- auditing/versioning of the data, models and tests
- best practice standardized plotting for investigating model performance and model behaviour
- automated quality checks
Parameters: - workspace ([type]) – [description]. Defaults to None.
- user (str) – the user. Defaults to None.
- config (dict) – the configuration to use. Defaults to None.
- save_config (bool) – determines whether to save the configuration or not. Defaults to False.
-
add
(repo_object, message='', category=None)¶ Add a repo_object or list of repo objects to the repository.
Raises an exception if the category is defined neither in the object itself nor via the category argument. It also raises an exception if an object with this id already exists.
Parameters: - repo_object (RepoObject) – repo_object or list of repo_objects to be added, will be modified so that it contains the version number
- message (str) – commit message. Defaults to ‘’.
- category (MLObjectType) – Category of repo_object which overwrites the object's category. Defaults to None.
Returns: str or dictionary – version number of object added or dictionary of names and versions of objects added
-
add_eval_function
(f, repo_name=None)¶ Add the function to evaluate the model
Parameters: - module_name (str) – module where function is located
- function_name (str) – function name
- repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.
-
add_measure
(measure, coordinates=None)¶ Add a measure to the repository
If the measure already exists, it returns a message.
Parameters: - measure (str) – string defining the measure, e.g. MAX,…
- coordinates (list of str) – list of strings defining the coordinates (by name) used for the measure, if None, all coordinates will be used. Defaults to None.
-
add_model
(model_name, model_eval=None, model_training=None, model_param=None, training_param=None, preprocessors=None)¶ Add a new model to the repo
Parameters: - model_name (str) – identifier of the model
- model_eval (str) – identifier of the evaluation function in the repo to evaluate the model, if None and there is only one evaluation function in the repo, this function will be used
- model_training (str) – identifier of the training function in the repo to train the model, if None and there is only one training function in the repo, this function will be used
- model_param (str) – identifier of the model parameter in the repo; if None and there is exactly one ModelParameter in the repo, this will be used, otherwise it is assumed that no model_params are needed. Defaults to None.
- training_param (str) – identifier of the training parameter; if None and there is only one training_parameter object in the repo, this will be used. If an empty string is given as training parameter, we assume that the algorithm does not need a training parameter. Defaults to None.
- preprocessors (list) – list of preprocessors (given as strings) to be executed. Defaults to None.
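The defaulting rule used by several of the arguments above ("if None and there is exactly one candidate in the repo, use it") can be sketched in a few lines. This is only an illustration of the rule, not pailab's implementation: `resolve_default` and the example identifiers are hypothetical, and returning None when no unique candidate exists is an assumption.

```python
def resolve_default(requested, available):
    """Return the identifier to use for an optional add_model argument.

    Hypothetical helper: `requested` is the value passed by the caller,
    `available` holds the identifiers of matching objects in the repo.
    """
    if requested is not None:
        return requested
    if len(available) == 1:
        # exactly one candidate: it is used implicitly
        return available[0]
    # zero or several candidates: nothing can be chosen automatically
    # (assumption: the sketch signals this with None)
    return None


print(resolve_default(None, ["eval_sklearn"]))  # the single candidate is used
print(resolve_default("my_eval", ["a", "b"]))   # an explicit value wins
```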
-
add_preprocessing_fitting_function
(f, repo_name=None)¶ Add function to fit a preprocessor
Parameters: - module_name (str) – module where function is located
- function_name (str) – function name
- repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.
-
add_preprocessing_transforming_function
(f, repo_name=None)¶ Add function to transform the data by a preprocessor
Parameters: - module_name (str) – module where function is located
- function_name (str) – function name
- repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.
-
add_preprocessor
(preprocessor_name, transforming_function=None, fitting_function=None, preprocessor_param=None)¶ Add a new preprocessor to the repo
Parameters: - preprocessor_name (str) – identifier of the preprocessor
- transforming_function (str) – identifier of the transforming function in the repo, if None and there is only one transforming function in the repo, this function will be used
- fitting_function (str) – identifier of the fitting function in the repo to fit the preprocessor, if None the preprocessor does not need to be fitted
- preprocessor_param (str) – identifier of the preprocessor parameter. Defaults to None.
Raises: Exception
– Raises an error if the preprocessing transforming function is not in repo
-
add_raw_data
(name, data, input_names=None, data_y=None, target_names=None, file_format=None, axis=1)¶ Adds a RawData object to the repository.
This method creates a RawData object from the given data (or reads it from the given file) and adds it to the repository.
Examples
Read data from csv file ‘test_data.csv’ and use columns with headers ‘x0’, ‘x1’ as input data and column with label ‘x2’ as target, store the results under name ‘my_data’:
>>> ml_repo.add_raw_data('my_data', 'test_data.csv', ['x0', 'x1'], target_names=['x2'], file_format='csv')
Create data from a DataFrame test where the columns ‘x0’, ‘x1’ are used as input and no target is specified:
>>> ml_repo.add_raw_data('my_data', test, ['x0', 'x1'])
Parameters: - name (str) – Name of RawData in repository (if name does not start with ‘raw_data/’ this is added).
- data (str, numpy ndarray or pandas DataFrame) – Either a pandas DataFrame, a numpy ndarray or a string that is interpreted as filename of the underlying data.
- input_names (iterable of str, optional) – List of the input variables names. Defaults to None.
- data_y (str or numpy ndarray, optional) – Either a numpy ndarray or a string defining the filename of the y-data (not valid if file_format==’csv’). Defaults to None.
- target_names (iterable of str, optional) – List of the target variables names. Defaults to None.
- file_format ('csv' or 'numpy', optional) – File type which can be either csv or numpy (numpy means an ndarray stored with numpy.save). Defaults to None.
- axis (int, optional) – If only an ndarray is given but target variables are defined, this array will be split into two arrays (one for input, one for target) along this axis. Defaults to 1.
Returns: version number of RawData object added
Return type: str
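Two details of add_raw_data, the ‘raw_data/’ name prefix and the split of a single array into input and target parts, can be illustrated without the library. The sketch below uses plain lists instead of numpy arrays. Both helper names are hypothetical, and the column order (inputs first, targets last) is an assumption rather than documented behaviour.

```python
def normalize_raw_data_name(name):
    """Prefix the name with 'raw_data/' unless it already starts with it."""
    prefix = "raw_data/"
    return name if name.startswith(prefix) else prefix + name


def split_columns(rows, n_input):
    """Split each row into an input and a target part (the axis=1 case).

    Assumption: the first n_input columns are inputs, the rest targets.
    """
    x = [row[:n_input] for row in rows]
    y = [row[n_input:] for row in rows]
    return x, y


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
x, y = split_columns(rows, n_input=2)
print(normalize_raw_data_name("my_data"), x, y)
```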
-
add_test_data
(name, raw_data_name, start_index=0, end_index=None, raw_data_version='last')¶ Add test data as a DataSet to the repository.
This method defines a DataSet and adds it to the repository. A DataSet is a logical unit based on a RawData object and defines the range of data that is taken from the respective RawData object.
Parameters: - name (str) – Name of respective object in repository.
- raw_data_name (str) – Name of the underlying RawData object that is used as basis.
- start_index (int, optional) – Index in the underlying RawData at which the test data starts. Defaults to 0.
- end_index (int, optional) – Index at which the test data ends. Defaults to None.
- raw_data_version (str) – Version of underlying RawData (if ‘last’, always the latest RawData will be used to derive the respective DataSet). Defaults to ‘last’.
-
add_training_data
(name, raw_data_name, start_index=0, end_index=None, raw_data_version='last')¶ Add training data as a DataSet to the repository.
This method defines a DataSet and adds it to the repository. A DataSet is a logical unit based on a RawData object and defines the range of data that is taken from the respective RawData object.
Parameters: - name (str) – Name of respective object in repository.
- raw_data_name (str) – Name of the underlying RawData object that is used as basis.
- start_index (int, optional) – Index in the underlying RawData at which the training data starts. Defaults to 0.
- end_index (int, optional) – Index at which the training data ends. Defaults to None.
- raw_data_version (str) – Version of underlying RawData (if ‘last’, always the latest RawData will be used to derive the respective DataSet). Defaults to ‘last’.
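Because training and test DataSets are only index ranges over the same RawData, their relationship can be sketched with ordinary slicing. The helper `dataset_rows` is hypothetical, and treating end_index as exclusive (as in Python slices) is an assumption made for the illustration.

```python
raw = list(range(10))  # stand-in for the rows of a RawData object


def dataset_rows(raw_rows, start_index=0, end_index=None):
    """Return the rows a DataSet-like range would select.

    Assumption: end_index is exclusive, as in Python slicing; None means
    that everything up to and including the last row is used.
    """
    if end_index is not None and start_index > end_index:
        raise ValueError("start_index must not be after end_index")
    return raw_rows[start_index:end_index]


training = dataset_rows(raw, 0, 8)  # first eight rows
test = dataset_rows(raw, 8)         # remaining rows
print(training, test)
```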
-
add_training_function
(f, repo_name=None)¶ Add function to train a model
Parameters: - module_name (str) – module where function is located
- function_name (str) – function name
- repo_name (str) – identifier of the repo object used to store the information, if None, the name is set to module_name.function_name. Defaults to None.
-
delete
(name, version)¶ Delete a specific object.
It deletes the object. If other objects were modified by this object, it throws an exception stating that the modified objects must be deleted first.
Parameters: - name (str) – name of the object
- version (str) – version of the object
Raises: Exception
– If the object has dependent objects, it cannot be deleted and an error is thrown.
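The dependency check described for delete can be pictured as follows. This is a minimal sketch under simplified assumptions: the repo is reduced to a set of (name, version) pairs plus a hypothetical `modified_by` mapping, and `delete_object` is not a pailab function.

```python
def delete_object(objects, modified_by, name, version):
    """Delete (name, version) unless other stored objects depend on it.

    objects: set of (name, version) pairs currently in the repo.
    modified_by: maps a (name, version) pair to the pairs it modified.
    Both structures are hypothetical simplifications.
    """
    key = (name, version)
    dependents = [d for d in modified_by.get(key, []) if d in objects]
    if dependents:
        raise RuntimeError("delete dependent objects first: %s" % dependents)
    objects.discard(key)


objects = {("data", "v1"), ("model", "v1")}
modified_by = {("data", "v1"): [("model", "v1")]}

delete_object(objects, modified_by, "model", "v1")  # no dependents
delete_object(objects, modified_by, "data", "v1")   # dependent is gone now
print(objects)
```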
-
get
(name, version='last', full_object=False, modifier_versions=None, obj_fields=None, repo_info_fields=None, throw_error_not_exist=True, throw_error_not_unique=True)¶ Get repo objects. It throws an exception, if an object with the name does not exist.
Parameters: - name (str) – the object name
- version (str) – object version, default is latest (-1). If the fields are nested (an element of a dictionary which is itself an element of a dictionary), use path notation to the element, e.g. p/elem1/elem2 to get p[elem1][elem2]. Defaults to repo_store.RepoStore.LAST_VERSION.
- full_object (bool) – flag to determine whether the numpy objects are loaded (True->load). Defaults to False.
- modifier_versions ([type]) – [description]. Defaults to None.
- obj_fields ([type]) – [description]. Defaults to None.
- repo_info_fields ([type]) – [description]. Defaults to None.
- throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
- throw_error_not_unique (bool) – true - throw error if item is not unique, else return []. Defaults to True.
Raises: Exception
– raises an exception if no object with the specific name is found
Returns: RepoObject or list thereof – The repo object
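The path notation for nested fields (p/elem1/elem2 for p[elem1][elem2]) amounts to walking through nested dictionaries key by key. The function `resolve_path` below is a hypothetical illustration of that lookup, not part of pailab.

```python
def resolve_path(obj, path):
    """Follow a slash-separated path through nested dictionaries."""
    current = obj
    for key in path.split("/"):
        current = current[key]
    return current


# 'p' plays the role of an object field holding a nested dictionary
fields = {"p": {"elem1": {"elem2": 42}}}
print(resolve_path(fields, "p/elem1/elem2"))
```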
-
static
get_calibrated_model_name
(model_name)¶ For a model name the calibrated model name is returned
Parameters: model_name (str) – model name
Returns: string – the calibrated model name
-
get_commits
(version_start='first', version_end='last')¶ gets the commits
Parameters: - version_start (str) – only display versions after version_start. Defaults to repo_store.RepoStore.FIRST_VERSION.
- version_end (str) – only display versions up to version_end. Defaults to repo_store.RepoStore.LAST_VERSION.
Returns: list of commit infos – returns a list of commit infos
-
static
get_eval_name
(model, data)¶ Return name of the object containing evaluation results
Parameters: - model (ModelDefinition object or str) –
- data (RawData or DataSet object or str) –
Returns: string – name of evaluation results
-
get_history
(name, repo_info_fields=None, obj_member_fields=None, version_start='first', version_end='last')¶ Return a list of histories of object member variables without big objects
Parameters: - name (str) – the object name
- repo_info_fields (list of strings) – List of fields from repo_info which will be returned in the dictionary. If the list contains the flag ‘ALL’, all fields will be returned. Defaults to None.
- obj_member_fields (list of strings) – List of member attributes from repo_object which will be returned in the dictionary. If the list contains the flag ‘ALL’, all attributes will be returned. Defaults to None.
- version_start (str) – only display versions after version_start. Defaults to repo_store.RepoStore.FIRST_VERSION.
- version_end (str) – only display versions up to version_end. Defaults to repo_store.RepoStore.LAST_VERSION.
Returns: str or list of strings – returns a list of the objects
-
get_ml_repo_store
()¶ Return the storage for the ml repo
Returns: RepoStore – the storage for the RepoObjects
-
get_names
(ml_obj_type)¶ Get the list of names of all repo_objects from a given repo_object_type in the repository.
Parameters: ml_obj_type (MLObjectType) – MLObjectType specifying the types of objects names are returned for.
Returns: list of strings – list of object names for the given category.
-
get_numpy_data_store
()¶ Return the numpy data store of the ml repo
Returns: numpy_handler – the numpy repo
-
get_training_data
(version='last', full_object=True, model=None, model_version='last')¶ Returns training data for a model.
It returns the training data in the repo for a specified model. If there is only one set of training data in the repo, this set will be returned. Otherwise, the model is loaded and the training data is used as defined in the model. If in this case a model is not specified the method throws an exception.
Parameters: - version (str) – version of data object. Defaults to repo_store.RepoStore.LAST_VERSION.
- full_object (bool) – if True, the complete data is returned including numpy data. Defaults to True.
- model (str) – Name of model definition for which the training data will be returned.
- model_version (str) – Version of model definition for which the training data will be returned.
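The selection rule described for get_training_data (a single training set is returned directly, otherwise the model decides, otherwise an exception) can be written down compactly. The function below is a hypothetical simplification that works on names only and does not load anything from a repository.

```python
def select_training_data(training_sets, model_training_data=None):
    """Pick the training data name following the rule described above.

    training_sets: names of all training data sets in the repo.
    model_training_data: name stored in the model definition, or None
    if no model was specified (hypothetical simplification).
    """
    if len(training_sets) == 1:
        return training_sets[0]
    if model_training_data is None:
        raise ValueError("several training sets exist; specify a model")
    return model_training_data


print(select_training_data(["training_data_1"]))
print(select_training_data(["train_a", "train_b"], "train_b"))
```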
-
pull
()¶ Pull changes from an external repo
-
push
()¶ Push changes to an external repo.
-
run
(job)¶ Executes a job
Parameters: job (Job) – The job object to be executed
Returns: [type] – Return the name and version of the job or a message that the job does not need to be rerun
-
run_evaluation
(model=None, message=None, model_version='last', datasets={}, predecessors=[], run_descendants=False, labels=None)¶ Evaluate the model on all datasets.
Parameters: - model (str) – name of model to evaluate; if None and only one model exists, that model is used. Defaults to None.
- message (str) – message inserted into commit, if None: an automated message is created. Defaults to None.
- model_version (str) – version of model to be evaluated. Defaults to repo_store.RepoStore.LAST_VERSION.
- datasets (dict) – dictionary of datasets (names and version numbers) on which the model is evaluated. Defaults to {}.
- predecessors (list) – list of jobs which shall have been completed successfully before the evaluation is started. Default is all datasets from testdata on latest version. Defaults to [].
- run_descendants (bool) – if True also run all descendant jobs. Defaults to False.
- labels ([type]) – [description]. Defaults to None.
Returns: list of strings – a list of the job ids
-
run_measures
(model=None, message=None, model_version='last', datasets={}, measures={}, predecessors=[], labels=None)¶ Run the measures
Parameters: - model (str) – name of model to evaluate; if None and only one model exists, that model is used. Defaults to None.
- message (str) – message inserted into commit, if None: an automated message is created. Defaults to None.
- model_version (str) – version of model to be evaluated. Defaults to repo_store.RepoStore.LAST_VERSION.
- datasets (dict) – dictionary of datasets (names and version numbers) on which the model is evaluated. Defaults to {}.
- predecessors (list) – list of jobs which shall have been completed successfully before the evaluation is started. Default is all datasets from testdata on latest version. Defaults to [].
- run_descendants (bool) – if True also run all descendant jobs. Defaults to False.
- labels ([type]) – [description]. Defaults to None.
Returns: list of strings – a list of the job ids
-
run_tests
(test_definitions=None, predecessors=[])¶ Run tests for a specific model version.
Parameters: - test_definitions (list or set) – List or set of names of the test definitions which shall be executed. If None, all test definitions are executed. Defaults to None.
- predecessors (list) – list of jobs which shall have been completed successfully before the evaluation is started. Default is all datasets from testdata on latest version. Defaults to [].
Returns: str – ticket number of job
-
run_training
(model=None, message=None, model_version='last', training_function_version='last', training_data_version='last', training_param_version='last', model_param_version='last', run_descendants=False)¶ Run the training algorithm.
Parameters: - model (str) – the identifier of the model. Defaults to None.
- message (str) – the commit message. Defaults to None.
- model_version (str) – the version of the model. Defaults to repo_store.RepoStore.LAST_VERSION.
- training_function_version (str) – the version of the training function. Defaults to repo_store.RepoStore.LAST_VERSION.
- training_data_version (str) – the version of the training data. Defaults to repo_store.RepoStore.LAST_VERSION.
- training_param_version (str) – the version of the training parameter. Defaults to repo_store.RepoStore.LAST_VERSION.
- model_param_version (str) – the version of the model parameter. Defaults to repo_store.RepoStore.LAST_VERSION.
- run_descendants (bool) – if True also run all descendant jobs. Defaults to False.
Returns: [type] – return name and version or message
-
set_label
(label_name, model=None, model_version='last', message='')¶ Label a certain model version.
It checks if a model with this version really exists and throws an exception if such a model does not exist.
Parameters: - label_name (str) – the label name
- model (str) – the identifier of the model. Defaults to None.
- model_version (str) – model version for which the label is set. Defaults to repo_store.RepoStore.LAST_VERSION.
- message (str) – commit message. Defaults to ‘’.
repo_objects¶
This module contains a bunch of different RepoObjects.
In principle, all objects that can be stored within pailab’s MLRepo are called a RepoObject. So, if you need a new object apart from those documented here, you just have to implement the respective interfaces, so that the object can be processed by pailab. This may be accomplished in three different ways:
- Inherit your class from the pailab.repo_objects.RepoObject class. This may not be very pythonic, but it easily shows you which interfaces you definitely have to implement.
- If you have a very simple object you may use the decorator pailab.repo_objects.repo_object_init in conjunction with your class’s constructor to make your class a RepoObject.
- Just implement the methods needed (again look at pailab.repo_objects.RepoObject to see what has to be defined).
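The third option, implementing the needed methods directly, can be pictured with a small stand-alone class. The sketch assumes that the four methods listed for RepoObject in this reference (to_dict, from_dict, numpy_to_dict, numpy_from_dict) together with a repo_info member make up the interface; the class `Point` and its attributes are hypothetical, and plain lists stand in for numpy data.

```python
class Point:
    """Hypothetical class offering the four RepoObject-style methods."""

    def __init__(self, x, y, samples=None, repo_info=None):
        self.x = x
        self.y = y
        self.samples = samples      # treated as the 'big' data
        self.repo_info = repo_info  # name, version, ... of the object

    def to_dict(self):
        # small attributes only, without the big data
        return {"x": self.x, "y": self.y}

    def from_dict(self, repo_obj_dict):
        self.x = repo_obj_dict["x"]
        self.y = repo_obj_dict["y"]

    def numpy_to_dict(self):
        # large data that would be stored separately
        return {"samples": self.samples}

    def numpy_from_dict(self, repo_numpy_dict):
        self.samples = repo_numpy_dict["samples"]


original = Point(1, 2, samples=[0.1, 0.2])
restored = Point(0, 0)
restored.from_dict(original.to_dict())
restored.numpy_from_dict(original.numpy_to_dict())
print(restored.x, restored.y, restored.samples)
```

Keeping the small attributes and the large data in separate dictionaries mirrors the split between the object store and the numpy store described in this reference.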
-
class
CommitInfo
(message, author, objects, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶ Stores each commit including the commit message and the objects committed.
Parameters: - message (string) – commit message
- author (string) – author
- objects (dictionary) – dictionary of names of committed objects and version numbers
-
class
DataSet
(raw_data, start_index=0, end_index=None, raw_data_version='last', repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶ Class used to define data used e.g. for training or testing.
This class refers to some RawData object and a start- and endindex. The repository
Parameters: - raw_data (str) – id of raw_data the dataset refers to
- start_index (int) – index of first entry of the raw data used in the dataset. Defaults to 0.
- end_index (int or None) – end_index of last entry of the raw data used in the dataset (if None, all including last element are used). Defaults to None.
- raw_data_version (str) – version of RawData object the DataSet refers to. Defaults to ‘last’.
- repo_info (RepoInfo) – dictionary of the repo info. Defaults to RepoInfo().
Raises: Exception
– raises an exception if the start index is after the end index
-
class
Function
(f, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶ -
create
()¶ Returns the function object
Returns: the function object
Return type: function object
-
get_version
()¶ returns the version
Returns: [type] – the module version
-
-
class
Label
(model_name, model_version, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶ RepoObject to label a certain model version
-
class
Measure
(value, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶ the measure repo object
-
class
MeasureConfiguration
(measures, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶ RepoObject defining a configuration for all measures which shall be computed.
-
F1
= 'f1'¶
-
L2
= 'l2'¶
-
MAX
= 'max'¶
-
MSE
= 'mse'¶
-
PRECISION
= 'precision'¶
-
R2
= 'r2'¶
-
RECALL
= 'recall'¶
-
ROC_AUC
= 'roc_auc'¶
-
add_measure
(measure, coords=None)¶ add a measure to the repo object
Parameters: - measure ([type]) – the measure
- coords ([type]) – the coordinates. Defaults to None.
-
static
get_name
(measure_def)¶ function to return a name of the measure
Parameters: measure_def (MeasureConfiguration) – the measure definition
Returns: str – the name of the measure
-
-
class
Model
(preprocessors=None, eval_function=None, train_function=None, train_param=None, model_param=None, training_data=None, test_data=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶
-
class
Preprocessor
(transforming_function, fitting_function=None, preprocessing_param=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶ Preprocessor class
-
class
RawData
(x_data, x_coord_names, y_data=None, y_coord_names=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶ Class to store numpy data.
-
class
RepoInfo
(kwargs={}, name=None, version=None, category=None, modification_info=None)¶ Contains all repo relevant information
This class contains all repo relevant information such as version, name, descriptions. It must be a member of all objects which are handled by the repo.
-
get_dictionary
()¶ Return repo info as dictionary
-
set_fields
(kwargs)¶ Set repo info fields from a dictionary
Parameters: kwargs (dict) – additional arguments
-
-
class
RepoInfoKey
¶ Enum describing all possible repository information fields.
-
AUTHOR
= 'author'¶
-
BIG_OBJECTS
= 'big_objects'¶
-
CATEGORY
= 'category'¶
-
CLASSNAME
= 'classname'¶
-
COMMIT_DATE
= 'commit_date'¶
-
COMMIT_MESSAGE
= 'commit_message'¶
-
DESCRIPTION
= 'description'¶
-
MODIFICATION_INFO
= 'modification_info'¶
-
NAME
= 'name'¶
-
VERSION
= 'version'¶
-
-
class
RepoObject
(repo_info)¶ Base class for objects which are handled by the repository.
-
from_dict
(repo_obj_dict)¶ set object from a dictionary
Parameters: repo_obj_dict (dict) – dictionary with the object data
-
numpy_from_dict
(repo_numpy_dict)¶ sets the attributes of the numpy dictionary
Parameters: repo_numpy_dict (dict) – dictionary with the object data
-
numpy_to_dict
()¶ function to get the attributes as a dictionary
Returns: dict – dictionary of the attributes
-
to_dict
()¶ Return a data dictionary for a given repo_object without the big data objects
Returns: dict – dictionary of data
-
-
class
Result
(data, big_data=None, repo_info=<pailab.ml_repo.repo_objects.RepoInfo object>)¶ the result repo object
-
numpy_from_dict
(repo_numpy_dict)¶ sets the attributes of the numpy dictionary
Parameters: repo_numpy_dict (dict) – dictionary with the object data
-
numpy_to_dict
()¶ returns the big data object
Returns: [type] – the big data
-
-
create_repo_obj
(obj)¶ Create a repo_object from a dictionary.
This function creates a repo object from a dictionary in a factory-like fashion. It uses the obj[‘repo_info’][‘classname’] within the dictionary and constructs the class using get_object_from_classname. It throws an exception if the dictionary does not contain a ‘repo_info’ key.
Parameters: obj (dict) – dictionary containing all information for a repo_object.
Raises: Exception
– raises an exception if the dictionary is not a repo dictionary
Returns: [type] – the object of the specified class
-
create_repo_obj_dict
(obj)¶ Create from a repo_object a dictionary with all values to handle the object within the repo
Parameters: obj (RepoObject) – repository object
Returns: dict – returns the dictionary
-
get_object_from_classname
(classname, data)¶ Returns an object instance for given classname and data dictionary.
Parameters: - classname (str) – Full classname as string including the modules, e.g. repo.Y if class Y is defined in module repo.
- data (dict) – dictionary of data used to initialize the object instance.
Returns: [type] – Instance object of class.
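The factory idea behind create_repo_obj and get_object_from_classname, building an instance from a full class name stored in the dictionary, can be sketched with importlib. The functions below are hypothetical stand-ins, a standard library class is used for the demonstration, and passing the remaining dictionary entries as keyword arguments is an assumption (pailab may initialize objects differently).

```python
import importlib


def object_from_classname(classname, data):
    """Instantiate `classname` (e.g. 'fractions.Fraction') from `data`.

    Assumption: the data dictionary can be passed as keyword arguments
    to the constructor.
    """
    module_name, _, class_name = classname.rpartition(".")
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    return cls(**data)


def create_obj(obj):
    """Factory: read obj['repo_info']['classname'] and build the object."""
    if "repo_info" not in obj:
        raise ValueError("not a repo dictionary: missing 'repo_info'")
    data = {k: v for k, v in obj.items() if k != "repo_info"}
    return object_from_classname(obj["repo_info"]["classname"], data)


half = create_obj({
    "repo_info": {"classname": "fractions.Fraction"},
    "numerator": 1,
    "denominator": 2,
})
print(half)
```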
-
class
repo_object_init
(big_objects=[])¶ Decorator class to modify a constructor so that the class can be used within the ml repository as repo_object.
-
from_dict
(repo_obj_dict)¶ set object from a dictionary
Parameters: - repo_object (RepoObject) – repo_object which will be set from the dictionary
- repo_object_dict (dict) – dictionary with the object data
-
init_repo_object
(init_self, repo_info)¶ initialiser for repo objects
Parameters: - init_self ([type]) – [description]
- repo_info (dict) – the repository info
-
numpy_from_dict
(repo_numpy_dict)¶ function to set the numpy attributes of the object from a dictionary
Parameters: - repo_obj (RepoObject) – the repo object
- repo_numpy_dict (numpy dict) – the repo numpy dictionary
-
numpy_to_dict
()¶ function to get the attributes as a dictionary
Parameters: repo_obj (RepoObject) – the repo object
Returns: dict – dictionary of the attributes
-
to_dict
()¶ Return a data dictionary for a given repo_object
Parameters: repo_obj (RepoObject) – A repo_object, i.e. object which provides the repo_object interface
Returns: dict – dictionary of data
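How a decorator on a constructor can turn a plain class into something repository-aware may be easier to see in a reduced form. The following is a sketch of the general technique only: `repo_object_init_sketch` and `small_attributes` are hypothetical names, and the real repo_object_init may behave differently in detail.

```python
import functools


def repo_object_init_sketch(big_objects=()):
    """Hypothetical decorator factory mimicking the idea described above."""

    def decorate(init):
        @functools.wraps(init)
        def wrapper(self, *args, repo_info=None, **kwargs):
            # store the repo info, then run the original constructor
            self.repo_info = repo_info if repo_info is not None else {}
            self._big_objects = tuple(big_objects)
            init(self, *args, **kwargs)

        return wrapper

    return decorate


def small_attributes(obj):
    """Attributes that are neither big objects nor bookkeeping fields."""
    skip = set(obj._big_objects) | {"repo_info", "_big_objects"}
    return {k: v for k, v in vars(obj).items() if k not in skip}


class Polynomial:
    @repo_object_init_sketch(big_objects=["samples"])
    def __init__(self, coefficients, samples):
        self.coefficients = coefficients
        self.samples = samples


poly = Polynomial([1, 0, 2], [0.5] * 1000, repo_info={"name": "poly"})
print(small_attributes(poly))
```

The constructor itself stays unaware of repo_info; the wrapper consumes that keyword argument before delegating to the original __init__.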
-
object types¶
-
class
MLObjectType
¶ Enum describing all ml object types.
The MLObjectType is assigned to each object in the MLRepo. It is used to structure all objects and to support consistency checks and automatic pipelines. The following types are defined:
- EVAL_DATA: evaluation data (result from evaluation of a model)
- RAW_DATA: raw data, i.e. simple numpy structures most often used to derive test or training data from the RawData
- TRAINING_DATA: training data used for model training
- TEST_DATA: data used for model testing
- TEST: concrete test
- TEST_DEFINITION: definition of a test which is applied to respective data and models to obtain a test
- MODEL_PARAM: model parameter
- TRAINING_PARAM: training parameter
- TRAINING_FUNCTION: function to train a certain model
- MODEL_EVAL_FUNCTION: function to evaluate a certain model
- PREPROCESSOR_PARAM: preprocessing parameter
- PREPROCESSOR: definition of a preprocessor
- PREPROCESSING_FITTING_FUNCTION: function to fit the preprocessor
- PREPROCESSING_TRANSFORMING_FUNCTION: function to apply preprocessing to data
- LABEL: model label
- MODEL: definition of a model
- CALIBRATED_MODEL: object containing a calibrated instance of a model
- COMMIT_INFO: internally used to store commit messages
- MAPPING: internally used mapping object to map an object’s name to the object’s category
- MEASURE: computed measure (e.g. norm of error)
- MEASURE_CONFIGURATION: the configuration of all measures applied to the model
- RESULT: object holding results
- JOB: a job
- TRAINING_STATISTIC: object holding training statistics, e.g. training history
- CACHED_VALUE: cached return values of time consuming functions
repo stores¶
Base classes¶
-
class
NumpyStore
¶ class to handle big objects
-
add
(name, version, numpy_dict)¶ Add numpy data from an object to the storage.
Parameters: - name (str) – Name (as string) of object
- version (str) – object version
- numpy_dict (numpy dict) – numpy dictionary
-
append
(name, version_old, version_new, numpy_dict)¶ Append data to an existing object
Parameters: - name (str) – name of data object to be returned
- version_old (str) – version of the object where the data will be appended
- version_new (str) – version of the new object after appending the data
- numpy_dict (dict) – dictionary containing the values
-
get
(name, version, from_index=0, to_index=None)¶ get the numpy object for a name and a version, rows can be used
Parameters: - name (str) – identifier of the object
- version (str) – version of the object
- from_index (int) – the index from which the data should be taken. Defaults to 0.
- to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Returns: numpy array – the numpy object to return
-
pull
()¶ Pull changes from an external repo
-
push
()¶ Push changes to an external repo.
-
-
class
RepoScriptStore
¶ -
add
(script_file)¶ Add a script to the storage.
Parameters: script_file (string) – file (including path) of script
-
get
(name, versions=None)¶
-
-
class
RepoStore
¶ -
LAST_VERSION
= 'last'¶ Abstract base class for all storages which can be used in the ML repository
-
add
(obj)¶ Add an object to the storage.
Parameters: obj (RepoObject|list(RepoObject)) – repository object or list of repository objects
Raises: Exception if an object with same name already exists.
-
get
(name, versions=None, modifier_versions=None, obj_fields=None, repo_info_fields=None, throw_error_not_exist=True, throw_error_not_unique=True)¶ Get a dictionary/list of dictionaries fulfilling the conditions.
Returns a list of objects matching the name and whose
- version is in the given list of versions
- modifiers match the version number/are in the list of version numbers of the given modifiers
Parameters: - name (str) – object id
- versions (list, version_number, tuple) – either a list of versions or a single version of the objects to be returned; if None, the condition on version is ignored. If a tuple is given, this tuple defines a version interval, i.e. all versions between the first and last entry (both inclusive) are returned. In addition FIRST_VERSION and LAST_VERSION can be used for versions to access the first/last version. Defaults to None.
- modifier_versions (dictionary) – modifier ids together with version specs which are matched by the returned object. Defaults to None.
- obj_fields (list of str or str) – list of strings identifying the fields which will be returned in the dictionary, if None, no fields are returned, if set to ‘all’, all fields will be returned. Defaults to None.
- repo_info_fields (list of str or str) – list of strings identifying the fields of the repo_info dict which will be returned in the dictionary, if None, no fields are returned, if set to ‘all’, all fields will be returned. Defaults to None.
- throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
- throw_error_not_unique (bool) – true - throw error if item is not unique, else return []. Defaults to True.
Returns: RepoObject or list thereof – The repo object
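The three forms accepted for versions (a single version, a list, or a tuple defining a closed interval) can be illustrated on an ordered list of version identifiers. `select_versions` is a hypothetical helper, and the strings 'first' and 'last' stand in for FIRST_VERSION and LAST_VERSION; the sketch assumes the stored versions are ordered oldest first.

```python
def select_versions(all_versions, versions=None):
    """Filter an ordered version list (oldest first) by a version spec."""

    def resolve(v):
        if v == "first":
            return all_versions[0]
        if v == "last":
            return all_versions[-1]
        return v

    if versions is None:             # no condition on the version
        return list(all_versions)
    if isinstance(versions, tuple):  # closed interval (start, end)
        start = all_versions.index(resolve(versions[0]))
        end = all_versions.index(resolve(versions[1]))
        return all_versions[start:end + 1]
    if not isinstance(versions, list):
        versions = [versions]        # single version
    wanted = {resolve(v) for v in versions}
    return [v for v in all_versions if v in wanted]


history = ["v1", "v2", "v3", "v4"]
print(select_versions(history, ("v2", "last")))
```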
-
get_first_version
(name, throw_error_not_exist=True)¶ Return version number of first (in a temporal sense) object in storage
Parameters: - name (str) – object name for which the version is returned
- throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
Raises: NotImplementedError – [description]
-
get_latest_version
(name, throw_error_not_exist=True)¶ Return latest version number of object in the storage
Parameters: - name (str) – object name
- throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
Returns: str – latest version number
-
get_names
(ml_obj_type)¶ Return the names of all objects belonging to the given category.
Parameters: ml_obj_type (str) – Value of MLObjectType-Enum specifying the category for which all names will be returned
-
get_version
(name, offset, throw_error_not_exist=True)¶ Return version number for the given offset
If offset >= 0 it returns the version number at position offset; if offset < 0 it follows the Python list logic, so that offset -1 refers to the last version.
Parameters: - name (str) – name of object
- offset (int) – offset
- throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
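The Python list logic referred to for the offset is ordinary list indexing on the versions ordered from oldest to newest. The helper below is a hypothetical illustration; returning [] when the offset does not exist and no error is requested follows the description of throw_error_not_exist.

```python
def version_at_offset(versions, offset, throw_error_not_exist=True):
    """Return the version at `offset` in a list ordered oldest first."""
    try:
        return versions[offset]
    except IndexError:
        if throw_error_not_exist:
            raise
        return []


history = ["v1", "v2", "v3"]
print(version_at_offset(history, 0), version_at_offset(history, -1))
```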
-
object_exists
(name, version='last')¶ Returns True if an object with the given name and version exists.
Parameters: - name (string) – object name
- version (version number) – version number. Defaults to LAST_VERSION.
-
pull
()¶ Pull changes from an external repo
-
push
()¶ Push changes to an external repo.
-
replace
(obj)¶ Overwrite existing object without incrementing version
Parameters: obj (RepoObject) – repo object to be overwritten
-
Memory storages¶
-
class
NumpyMemoryStorage
¶ Bases:
pailab.ml_repo.repo_store.NumpyStore
-
add
(name, version, numpy_dict)¶ Add numpy data from an object to the storage.
Parameters: - name (str) – identifier (as string) of object
- version (str) – object version
- numpy_dict (numpy dict) – numpy dictionary
-
append
(name, version_old, version_new, numpy_dict)¶ Append a numpy dictionary to an existing object
Parameters: - name (str) – identifier of the object
- version_old (str) – the old version of the object
- version_new (str) – the new version of the object
- numpy_dict (numpy dict) – the numpy dictionary to append
Raises: Exception
– raises an exception if the object does not exist
-
get
(name, version, from_index=0, to_index=None)¶ Get the numpy object for a name and a version. A row range can be selected with from_index and to_index.
Parameters: - name (str) – identifier of the object
- version (str) – version of the object
- from_index (int) – the index from which the data should be taken. Defaults to 0.
- to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Raises: Exception
– raises an exception if no object with the name exists
Exception
– raises an exception if no object with the version exists
Returns: numpy array – the numpy object to return
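A row-range lookup of this kind can be sketched as follows. The sketch is not the storage's actual code: get_rows is a hypothetical helper, plain Python lists stand in for numpy arrays so that it has no dependencies, and it assumes that to_index is exclusive, as in a Python slice (the reference above does not state this).

```python
def get_rows(numpy_dict, from_index=0, to_index=None):
    """Slice every entry of the dictionary to the rows from_index..to_index.

    Assumption: to_index is exclusive, and None means "up to the end".
    """
    return {key: values[from_index:to_index]
            for key, values in numpy_dict.items()}


# Made-up data; lists stand in for numpy arrays.
data = {'x_data': [[1, 2], [3, 4], [5, 6]], 'y_data': [10, 20, 30]}
tail = get_rows(data, from_index=1)
middle = get_rows(data, from_index=1, to_index=2)
```

The same slice expression works unchanged on numpy arrays, which slice along the first axis.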
-
-
class
RepoObjectMemoryStorage
¶ Bases:
pailab.ml_repo.repo_store.RepoStore
The repo object memory storage. This class stores repo objects (excluding large objects) in memory. It is mainly intended for testing.
-
get_first_version
(name, throw_error_not_exist=True)¶ Determine the first version of the object
Parameters: - name (str) – identifier of the object
- throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
Raises: Exception
– raises an exception if the object does not exist
Returns: str – the first version string of the object
-
get_latest_version
(name, throw_error_not_exist=True)¶ Determine the latest version of the object
Parameters: - name (str) – identifier of the object
- throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
Raises: Exception
– raises an exception if the object does not exist
Returns: str – the latest version string of the object
-
get_names
(category)¶ Return the names of all objects belonging to the given category.
Parameters: ml_obj_type (str) – Value of MLObjectType-Enum specifying the category for which all names will be returned
Returns: list of str – a list of all objects in the category
-
get_version
(name, offset, throw_error_not_exist=True)¶ Return the newest version up to offset versions
Parameters: - name (str) – the identifier of the object
- offset (int) – the offset
- throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
Raises: Exception
– raises an error if the offset is higher than the number of versions available
Exception
– raises an exception if the object does not exist and throw_error_not_exist == True
Returns: str – the version
-
replace
(obj)¶ Overwrite existing object without incrementing version
Parameters: obj (RepoObject) – repo object to be overwritten
-
RepoObjectDiskStorage¶
-
class
RepoObjectDiskStorage
(folder, file_format='pickle')¶ The RepoObjectDiskStorage class
-
check_integrity
()¶ Checks if files are missing or have not yet been added
Returns: dictionary – contains sets of missing files and/or set of files not yet added
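An integrity check of this kind amounts to two set differences. The following is a minimal sketch, not the storage's actual code: the function, the dictionary keys 'missing' and 'not_added', and the file names are all hypothetical, and the set of registered files is passed in directly instead of being read from a database.

```python
import os
import tempfile


def check_integrity(folder, registered_files):
    """Compare the files in `folder` with the files the storage knows about."""
    on_disk = {f for f in os.listdir(folder)
               if os.path.isfile(os.path.join(folder, f))}
    registered = set(registered_files)
    return {
        'missing': registered - on_disk,    # registered but not on disk
        'not_added': on_disk - registered,  # on disk but never registered
    }


with tempfile.TemporaryDirectory() as folder:
    # Create two empty files with made-up names.
    for filename in ('a.pckl', 'b.pckl'):
        with open(os.path.join(folder, filename), 'w') as f:
            f.write('')
    report = check_integrity(folder, ['a.pckl', 'c.pckl'])
```

Here c.pckl is reported as missing and b.pckl as not yet added.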
-
close_connection
()¶ Closes the database connection
-
get_config
()¶ return the configuration
Returns: dict – a dictionary of the configuration
-
get_first_version
(name, throw_error_not_exist=True)¶ Determine the first version of the object
Parameters: - name (str) – identifier of the object
- throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
Raises: Exception
– raises an exception if the object does not exist
Returns: str – the first version string of the object
-
get_latest_version
(name, throw_error_not_exist=True)¶ Determine the latest version of the object
Parameters: - name (str) – identifier of the object
- throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
Raises: Exception
– raises an exception if the object does not exist
Returns: str – the latest version string of the object
-
get_names
(ml_obj_type)¶ Return the names of all objects belonging to the given category.
Parameters: ml_obj_type (str) – Value of MLObjectType-Enum specifying the category for which all names will be returned
Returns: list of str – a list of all objects in the category
-
get_version
(name, offset, throw_error_not_exist=True)¶ Return the newest version up to offset versions
Parameters: - name (str) – the identifier of the object
- offset (int) – the offset
- throw_error_not_exist (bool) – true - throw error if not exists, else return []. Defaults to True.
Raises: Exception
– raises an exception if the object does not exist and throw_error_not_exist == True
Returns: str – the version
-
get_version_condition
(name, versions, version_column, time_column)¶ Return the part of the SQL statement that restricts the result to the given versions
Parameters: - name (str) – not used
- versions (str or list of str) – a single version or a list of versions to condition on
- version_column (str) – version column name
- time_column (str) – time column name
Returns: str – the condition for the versions
-
replace
(obj)¶ Overwrite existing object without incrementing version
Parameters: obj (RepoObject) – repo object to be overwritten
-
RepoObjectGitStorage¶
-
class
RepoObjectGitStorage
(remote=None, **kwargs)¶ Object storage with git support.
This storage stores all objects on disk in a local git repository and can push to and pull from another git repo. It handles all files in the same way as RepoObjectDiskStorage does (using methods from that storage).
-
commit
(message, force=True)¶ Commits the changes
Parameters: - message (str) – Commit message
- force (bool) – If False, objects will only be committed if the integrity check succeeds.
Raises: Exception
– raises an exception if the integrity check fails
-
pull
(remote_name='origin')¶ Pull from the remote git repository
Parameters: remote_name (str) – the name of the remote git repository. Defaults to ‘origin’.
Raises: Exception
– raises an exception if the remote name is not available
Exception
– raises an error if the pull fails
-
push
(remote_name='origin')¶ Push the changes to the remote git repository
Parameters: remote_name (str) – name of the remote repository. Defaults to ‘origin’.
Raises: Exception
– raises an exception if the remote does not exist
-
replace
(obj)¶ Overwrite existing object without incrementing version
Parameters: obj (RepoObject) – the repo object to be overwritten
-
NumpyHDFStorage¶
Module defining classes to store numpy data in hdf5 files.
This module provides implementations of the pailab.ml_repo.repo_store.NumpyStore
using hdf5 file format.
-
class
NumpyHDFRemoteStorage
(folder, remote_store=None, sync_get=False, sync_add=False)¶ Storage working like NumpyHDFStorage locally but in addition provides synchronization with a remote.
This storage stores numpy data in hdf5 files in a directory. It works very similar to the
NumpyHDFStorage
with the difference that it synchronizes the data with a given remote (downloads and uploads the respective files).
Example
This example shows how to set up the storage so that the data is stored in a local directory and can be synchronized with Google Cloud Storage:
>>> # raw string, so that \t in the path is not read as a tab
>>> numpy = NumpyHDFRemoteStorage(r'C:\tmp\data')
>>> from pailab.ml_repo.remote_gcs import RemoteGCS
>>> remote = RemoteGCS(bucket='my_data')
>>> numpy.set_remote(remote)
Parameters: - folder (str) – folder where data is stored
- remote_store (obj or dict) – object representing a remote storage (e.g.
pailab.ml_repo.remote_gcs.RemoteGCS
for Google Cloud Storage) or dictionary defining the remote parameters so that it can be created
- sync_get (bool) – If True, tries to download data automatically if it does not exist locally, otherwise it checks only locally
- sync_add (bool) – If True, added data will be directly uploaded to the remote
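The effect of the two flags can be sketched with dictionaries standing in for the local files and the remote. This is a sketch of the behaviour described above, not pailab's implementation: the class SyncedStore, its attributes and the sample data are hypothetical.

```python
class SyncedStore:
    """Hypothetical stand-in for a local store that is backed by a remote."""

    def __init__(self, remote, sync_get=False, sync_add=False):
        self.local = {}       # stands in for the local hdf5 files
        self.remote = remote  # stands in for the remote storage
        self.sync_get = sync_get
        self.sync_add = sync_add

    def add(self, name, version, data):
        self.local[(name, version)] = data
        if self.sync_add:
            # Upload directly when data is added.
            self.remote[(name, version)] = data

    def get(self, name, version):
        key = (name, version)
        if key not in self.local and self.sync_get and key in self.remote:
            # Download on demand when the data is not available locally.
            self.local[key] = self.remote[key]
        if key not in self.local:
            raise Exception('No object ' + name + ' with version ' + version)
        return self.local[key]


remote = {('prices', 'v1'): [1.0, 2.0]}  # made-up remote content

lazy = SyncedStore(remote, sync_get=True)
fetched = lazy.get('prices', 'v1')       # not local, so it is downloaded

eager = SyncedStore(remote, sync_add=True)
eager.add('prices', 'v2', [3.0])         # uploaded as part of add
uploaded = ('prices', 'v2') in remote
```

With both flags left at False, the sketch only ever looks at local data, and a remote would have to be synchronized explicitly, for example through pull and push.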
-
add
(name, version, numpy_dict)¶ Add numpy data from an object to the storage.
Parameters: - name (str) – the identifier of the object to add
- version (str) – the object version
- numpy_dict (numpy dict) – the numpy dictionary to add
-
get
(name, version, from_index=0, to_index=None)¶ Get the numpy object for a name and a version. A row range can be selected with from_index and to_index.
Parameters: - name (str) – identifier of the object
- version (str) – version of the object
- from_index (int) – the index from which the data should be taken. Defaults to 0.
- to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Raises: Exception
– raises an exception if no object with the name exists
Exception
– raises an exception if no object with the version exists
Returns: numpy array – the numpy object to return
-
pull
()¶ Pull changes from an external repo
-
push
()¶ Push changes to an external repo.
-
class
NumpyHDFStorage
(folder, version_files=False)¶ Storage using hdf5 files to store numpy data.
Example
Set up the storage using the folder C:\temp\data (a raw string keeps \t from being read as a tab):
>>> store = NumpyHDFStorage(r'C:\temp\data')
Parameters: - folder (str) – main directory where the files will be stored
- version_files (bool) – If True, each version is contained in a separate file, otherwise all versions are in one file. If you work in a distributed environment (e.g. multiple users working in parallel), set this parameter to True so that no file merge is necessary. Defaults to False.
-
add
(**kw)¶ Add numpy data from an object to the storage.
Parameters: - name (str) – the identifier of the object to add
- version (str) – the object version
- numpy_dict (numpy dict) – the numpy dictionary to add
-
append
(**kw)¶ Append data to an existing object
Parameters: - name (str) – the object identifier
- version_old (str) – the previous object version
- version_new (str) – the next object version
- numpy_dict (numpy dict) – the data to add as a numpy dictionary
-
get
(**kw)¶ Get the numpy object for a name and a version. A row range can be selected with from_index and to_index.
Parameters: - name (str) – identifier of the object
- version (str) – version of the object
- from_index (int) – the index from which the data should be taken. Defaults to 0.
- to_index (int or None) – the index to which the data is returned (None means till the end). Defaults to None.
Raises: Exception
– raises an exception if no object with the name exists
Exception
– raises an exception if no object with the version exists
Returns: numpy array – the numpy object to return
-
object_exists
(name, version)¶ checks whether the object exists
Parameters: - name (str) – the identifier of the object
- version (str) – the version of the object
Returns: bool – returns true if the object exists
-
trace
(aFunc)¶ Trace entry, exit and exceptions.