steppy package

steppy.adapter module

class steppy.adapter.Adapter(adapting_recipes: Dict[str, Any])

Bases: object

Translates outputs from parent steps into the inputs of the current step.

adapting_recipes

The recipes that the adapter was initialized with.

Example

Normally Adapter is used together with a Step. In the following example, RandomForestTransformer follows the sklearn convention of calling its arguments X and y; however, the names passed to the Step are different. We use Adapter to map the received names to the expected ones.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from steppy.base import BaseTransformer, Step
from steppy.adapter import Adapter, E

iris = load_iris()

pipeline_input = {
    'train_data': {
        'target': iris.target,
        'data': iris.data
    }
}

class RandomForestTransformer(BaseTransformer):
    def __init__(self, random_state=None):
        self.estimator = RandomForestClassifier(random_state=random_state)

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X, **kwargs):
        y_proba = self.estimator.predict_proba(X)
        return {'y_proba': y_proba}

random_forest = Step(
    name="random_forest",
    transformer=RandomForestTransformer(),
    input_data=['train_data'],
    adapter=Adapter({
        'X': E('train_data', 'data'),
        'y': E('train_data', 'target')
    }),
    experiment_directory='./working_dir'
)

result = random_forest.fit_transform(pipeline_input)
print(log_loss(y_true=iris.target, y_pred=result['y_proba']))
adapt(all_ouputs: Dict[str, Dict[str, Any]]) → Dict[str, Any]

Adapt inputs for the transformer included in the step.

Parameters:all_ouputs – Dict of outputs from parent steps. The keys should match the names of these steps and the values should be their respective outputs.
Returns:Dictionary with the same keys as adapting_recipes and values constructed according to the respective recipes.
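
As an illustration, adapt can also be called directly (normally a Step does this for you); the data values below are made up:

from steppy.adapter import Adapter, E

adapter = Adapter({'X': E('train_data', 'data'),
                   'y': E('train_data', 'target')})

all_outputs = {'train_data': {'data': [[1, 2], [3, 4]],
                              'target': [0, 1]}}

adapted = adapter.adapt(all_outputs)
# adapted == {'X': [[1, 2], [3, 4]], 'y': [0, 1]}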
exception steppy.adapter.AdapterError

Bases: Exception

class steppy.adapter.E(input_name, key)

Bases: tuple

input_name

Alias for field number 0

key

Alias for field number 1
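
Since E is a namedtuple, a recipe entry can be inspected by field name; a small illustration:

from steppy.adapter import E

e = E('train_data', 'data')
assert e.input_name == 'train_data'
assert e.key == 'data'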

steppy.base module

class steppy.base.BaseTransformer

Bases: object

Abstraction over two-stage fit and transform execution.

Base transformer is an abstraction strongly inspired by the sklearn Transformer and Estimator. The two main concepts are:

1. Every action that can be performed on data (transformation, model training) can be performed in two stages: fitting (where trainable parameters are estimated) and transforming (where previously estimated parameters are used to transform the data into the desired state).

2. Every transformer knows how it should be persisted and loaded, which is especially useful when working with Keras/PyTorch and sklearn models in one pipeline.
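
A minimal sketch of a custom transformer following these two concepts; the class name and scaling logic are illustrative, not part of the library:

import numpy as np
from steppy.base import BaseTransformer

class MeanScaler(BaseTransformer):  # hypothetical example class
    def fit(self, X):
        # fitting: estimate the trainable parameter from the data
        self.mean_ = np.mean(X)
        return self

    def transform(self, X, **kwargs):
        # transforming: apply the previously estimated parameter
        return {'X_scaled': X - self.mean_}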

fit(*args, **kwargs)

Performs estimation of trainable parameters.

Models built with sklearn, Keras, or PyTorch, as well as some preprocessing techniques (e.g. normalization), estimate parameters from the training data. Those parameters are estimated during fit execution and are persisted for future use. This method should contain only the estimation logic, nothing else.

Parameters:
  • args – positional arguments (can be anything)
  • kwargs – keyword arguments (can be anything)
Returns:

self object

Return type:

BaseTransformer

fit_transform(*args, **kwargs)

Performs fit followed by transform.

This method simply combines fit and transform.

Parameters:
  • args – positional arguments (can be anything)
  • kwargs – keyword arguments (can be anything)
Returns:

outputs

Return type:

dict

load(filepath)

Loads the trainable parameters of the transformer.

Specific implementation of loading persisted model parameters should be implemented here. In the case of transformers that do not learn any parameters, one can leave this method as is.

Parameters:filepath (str) – filepath from which the transformer should be loaded
Returns:self instance
Return type:BaseTransformer
persist(filepath)

Saves the trainable parameters of the transformer.

Specific implementation of model parameter persistence should be implemented here. In case of transformers that do not learn any parameters one can leave this method as is.

Parameters:filepath (str) – filepath where the transformer parameters should be persisted
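
For a transformer wrapping an sklearn estimator, persist and load could be implemented with joblib, for example (an illustrative choice, not mandated by steppy):

import joblib
from steppy.base import BaseTransformer

class PersistableTransformer(BaseTransformer):  # hypothetical example class
    def persist(self, filepath):
        joblib.dump(self.estimator, filepath)

    def load(self, filepath):
        self.estimator = joblib.load(filepath)
        return self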
transform(*args, **kwargs)

Performs transformation of data.

All data transformations, including prediction with deep learning/machine learning models, can be performed here. No parameters should be estimated in this method or stored as class attributes. This method should contain only the transformation logic, nothing else.

Parameters:
  • args – positional arguments (can be anything)
  • kwargs – keyword arguments (can be anything)
Returns:

outputs

Return type:

dict

class steppy.base.IdentityOperation

Bases: steppy.base.BaseTransformer

Transformer that performs identity operation, f(x)=x.
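
A small illustration of the pass-through behavior:

from steppy.base import IdentityOperation

out = IdentityOperation().transform(a=1, b=2)
# out == {'a': 1, 'b': 2}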

transform(**kwargs)

Performs the identity transformation: returns the input data unchanged.

Parameters:
  • kwargs – keyword arguments (can be anything)
Returns:

outputs

Return type:

dict

class steppy.base.Step(name, transformer, experiment_directory, input_data=None, input_steps=None, adapter=None, cache_output=False, persist_output=False, load_persisted_output=False, force_fitting=False, persist_upstream_pipeline_structure=False)

Bases: object

Step is a building block of steppy pipelines.

It is an execution wrapper over the transformer (see BaseTransformer), which realizes a single operation on data. With Step you can:

  1. design multiple input/output data flows and connections between Steps.
  2. handle persistence and caching of transformers and intermediate results.

Step executes the fit_transform method (inspired by sklearn) on every step recursively, starting from the very last Step and recursing through the input_steps. One can easily debug the data flow by plotting the pipeline graph (see persist_as_png()) or by returning the step in a Jupyter notebook cell.

name

str – Step name. Each step in a pipeline must have a unique name. This name is used to persist or cache transformers and outputs of this Step.

transformer

obj – object that inherits from BaseTransformer, or a Step instance. When a Step instance is passed, the transformer from that Step will be copied and used to perform transformations. This is useful when both train and valid data are passed in one pipeline (a common situation in deep learning).

experiment_directory

str – path to the directory where all execution artifacts will be stored. The following sub-directories will be created, if they were not created by other Steps:

  • transformers: transformer objects are persisted in this folder
  • outputs: step output dictionaries are persisted in this folder (if persist_output=True)
  • cache: step output dictionaries are cached in this folder (if cache_output=True).
input_data

list – Elements of this list are keys in the data dictionary that is passed to the Step’s fit_transform and transform methods. List of str; default is an empty list.

Example

data_train = {'input': {'images': X_train,
                        'labels': y_train}
             }

my_step = Step(name='random_forest',
               transformer=RandomForestTransformer(),
               input_data=['input']
               )

my_step.fit_transform(data_train)

data_train is a dictionary where:

  • keys are names of data packets,
  • values are data packets, that is, dictionaries that describe the dataset. In this example the keys in the data packet are images and labels, and the values are the actual data of any type.

Step.input_data takes the keys from data_train (the names must match!) and extracts the actual data that will be passed to the fit_transform and transform methods of self.transformer.

input_steps

list – List of input Steps that the current Step uses as its input. List of Step instances; default is an empty list. The current Step will combine the outputs from input_steps and input_data using the adapter, then pass them to the transformer’s fit_transform and transform methods.

Example

self.input_steps=[cnn_step, rf_step, ensemble_step, guesses_step]

Each element of the list is a Step instance. A minimal sketch of chaining two steps is shown below.
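
A minimal chaining sketch; the transformer classes here are hypothetical placeholders:

preprocessing = Step(name='preprocessing',
                     transformer=PreprocessingTransformer(),  # hypothetical
                     input_data=['input'],
                     experiment_directory='./working_dir')

model = Step(name='model',
             transformer=ModelTransformer(),  # hypothetical
             input_steps=[preprocessing],
             experiment_directory='./working_dir')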

adapter

obj – Renames and arranges the inputs that are passed to the transformer (see BaseTransformer). Default is None. If not None, it must be an instance of the Adapter class.

Example

self.adapter=Adapter({'X': E('input', 'images'),
                      'y': E('input', 'labels')}
                     )

Adapter simplifies the renaming and combining of inputs from multiple steps. In this example, after the adaptation:

  • X is key to the data stored under the images key

  • y is key to the data stored under the labels key

    where both the images and labels keys come from input (see input_data)

cache_output

bool – If True, the Step output dictionary will be cached to <experiment_directory>/cache/<name> when the transform method of the Step transformer completes. If the same Step is used multiple times, the transform method is invoked only once; further invocations simply load the output from the <experiment_directory>/cache/<name> directory. Default False: do not cache outputs.

Warning

One should always run pipeline.clean_cache() before executing pipeline.fit_transform(data) or pipeline.transform(data). When working with large datasets, the cache might become very large.
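
For example, assuming pipeline is the final Step of your pipeline:

pipeline.clean_cache()
output = pipeline.fit_transform(data)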

persist_output

bool – If True, the Step output dictionary will be persisted to the <experiment_directory>/outputs/<name> directory after the transform method of the Step transformer completes. Default False: do not persist any files to disk. The output is persisted after every run of the transformer’s transform method, which means that Step overwrites files. See also the load_persisted_output parameter.

Warning

When working with large datasets, persisted outputs might be very large.

load_persisted_output

bool – If True, the Step output dictionary already persisted to <experiment_directory>/outputs/<name> will be loaded when the Step is called. Default False: do not load persisted output. Useful when debugging, and when working with ensemble models or time-consuming feature extraction. One can easily persist already-computed pieces of the pipeline and save time by loading them instead of recomputing.

Warning

Re-running the same pipeline on new data with load_persisted_output set to True may lead to errors: outputs computed from the old data are loaded, while the user would expect the pipeline to use the new data instead.

force_fitting

bool – If True, the Step transformer will be fitted (via fit_transform) even if <experiment_directory>/transformers/<step_name> exists. Default False: do not force fitting of the transformer. Helpful when one wants to use persist_output=True and load_persisted_output=True on a previous Step and fit the current Step multiple times. This is a typical scenario when tuning hyperparameters of an ensemble model trained on the outputs of first-level models, or of a model built on features that are time-consuming to compute.

persist_upstream_pipeline_structure

bool – If True, the upstream pipeline structure (with regard to the current Step) will be persisted as a json file in the experiment_directory. Default False: do not persist the upstream pipeline structure.

all_steps

Builds a dictionary with all Step instances that are upstream of self.

Returns:dictionary where keys are Step names (str) and values are Step instances (obj)
Return type:all_steps (dict)
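
An illustrative use, assuming final_step is the last Step of a pipeline:

steps = final_step.all_steps
for name, step in steps.items():
    print(name, type(step.transformer))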
clean_cache()

Removes everything from the directory <experiment_directory>/cache.

fit_transform(data)

Fit the model and transform data or load already processed data.

Loads cached or persisted outputs, or adapts the data for the current transformer and executes transformer.fit_transform.

Parameters:data (dict) –

data dictionary with keys as input names and values as dictionaries of key-value pairs that can be passed to the self.transformer.fit_transform method. Example:

data = {'input_1': {'X': X,
                    'y': y},
        'input_2': {'X': X,
                    'y': y}
        }
Returns:Step outputs from the self.transformer.fit_transform method
Return type:dict
get_step(name)

Extracts a step by name from the pipeline.

The extracted step is a fully functional pipeline as well. This method can be used to port parts of the pipeline between problems.

Parameters:name (str) – name of the step to be fetched
Returns:extracted step
Return type:Step (obj)
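
An illustrative use, assuming the pipeline contains a step named 'random_forest':

sub_pipeline = pipeline.get_step('random_forest')
output = sub_pipeline.fit_transform(pipeline_input)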
output_is_cached

(bool) – True if the step outputs exist under the <experiment_directory>/cache/<name>. See cache_output.

output_is_persisted

(bool) – True if the step outputs exist under the <experiment_directory>/outputs/<name>. See persist_output.

persist_pipeline_diagram(filepath)

Creates pipeline diagram and persists it to disk as png file.

A pydot graph is created and persisted to disk as a png file at the given filepath.

Parameters:filepath (str) – filepath to which the png with pipeline visualization should be persisted
transform(data)

Transforms data or loads already processed data.

Loads cached or persisted outputs, or adapts the data for the current transformer and executes its transform method.

Parameters:data (dict) –

data dictionary with keys as input names and values as dictionaries of key:value pairs that can be passed to the step.transformer.transform method

Example

data = {'input_1': {'X': X,
                    'y': y},
        'input_2': {'X': X,
                    'y': y}
        }
Returns:step outputs from the transformer.transform method
Return type:dict
transformer_is_cached

(bool) – True if the transformer exists under the directory <experiment_directory>/transformers/<step_name>.

upstream_pipeline_structure

Builds a dictionary with the entire upstream pipeline structure (with regard to the current Step).

Returns:dictionary describing the upstream pipeline structure. It has two keys: 'edges' and 'nodes', where:
  • the value of 'edges' is a set of tuples (input_step.name, self.name)
  • the value of 'nodes' is a set of all step names upstream of this Step
Return type:dict
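
For the two-step sketch from the input_steps example, the structure would look roughly like this (illustrative):

structure = model.upstream_pipeline_structure
# {'nodes': {'preprocessing', 'model'},
#  'edges': {('preprocessing', 'model')}}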
exception steppy.base.StepsError

Bases: Exception

steppy.base.make_transformer(func)
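
No description is given for this function; a hedged sketch of its likely use, wrapping a plain function into a transformer whose transform method applies that function (treat the exact semantics as an assumption):

from steppy.base import Step, make_transformer

square_step = Step(name='square',
                   transformer=make_transformer(lambda X: {'X_squared': X ** 2}),
                   input_data=['input'],
                   experiment_directory='./working_dir')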

steppy.utils module

steppy.utils.display_pipeline(structure_dict)

Displays the pipeline structure in a Jupyter notebook.

Parameters:structure_dict (dict) – dict returned by upstream_pipeline_structure().
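
For example, inside a Jupyter notebook cell, assuming step is the final Step of a pipeline:

from steppy.utils import display_pipeline

display_pipeline(step.upstream_pipeline_structure)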
steppy.utils.get_logger()

Fetch existing steppy logger.

Example

initialize_logger()
logger = get_logger()
logger.info('My message inside pipeline')

The result looks like this:

2018-06-02 12:33:48 steppy >>> My message inside pipeline
Returns:logger object formatted in the steppy style
Return type:logging.Logger
steppy.utils.initialize_logger()

Initialize steppy logger.

This logger is used throughout the steppy library to report computation progress.

Example

Simple use of steppy logger:

initialize_logger()
logger = get_logger()
logger.info('My message inside pipeline')

The result looks like this:

2018-06-02 12:33:48 steppy >>> My message inside pipeline
Returns:logger object formatted in the steppy style
Return type:logging.Logger
steppy.utils.persist_as_png(structure_dict, filepath)

Saves the pipeline diagram to disk as a png file.

Parameters:
  • structure_dict (dict) – dict returned by upstream_pipeline_structure()
  • filepath (str) – filepath to which the png with pipeline visualization should be persisted
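
An illustrative call, assuming step is the final Step of a pipeline:

from steppy.utils import persist_as_png

persist_as_png(step.upstream_pipeline_structure, './pipeline.png')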