steppy package
steppy.adapter module
class steppy.adapter.Adapter(adapting_recipes: Dict[str, Any])
Bases: object
Translates outputs from parent steps to inputs to the current step.
adapting_recipes
The recipes that the adapter was initialized with.
Example
Normally Adapter is used with a Step. In the following example RandomForestTransformer follows the sklearn convention of naming its arguments X and y, but the names passed to the Step are different. We use Adapter to map the received names to the expected names.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from steppy.base import BaseTransformer, Step
from steppy.adapter import Adapter, E

iris = load_iris()

pipeline_input = {
    'train_data': {
        'target': iris.target,
        'data': iris.data
    }
}

class RandomForestTransformer(BaseTransformer):
    def __init__(self, random_state=None):
        self.estimator = RandomForestClassifier(random_state=random_state)

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X, **kwargs):
        y_proba = self.estimator.predict_proba(X)
        return {'y_proba': y_proba}

random_forest = Step(
    name="random_forest",
    transformer=RandomForestTransformer(),
    input_data=['train_data'],
    adapter=Adapter({
        'X': E('train_data', 'data'),
        'y': E('train_data', 'target')
    }),
    experiment_directory='./working_dir'
)

result = random_forest.fit_transform(pipeline_input)
print(log_loss(y_true=iris.target, y_pred=result['y_proba']))
adapt(all_ouputs: Dict[str, Dict[str, Any]]) → Dict[str, Any]
Adapt inputs for the transformer included in the step.
Parameters: all_ouputs – Dict of outputs from parent steps. The keys should match the names of these steps and the values should be their respective outputs.
Returns: Dictionary with the same keys as adapting_recipes and values constructed according to the respective recipes.
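To make the mapping concrete, the following is a minimal sketch of what an adapter conceptually does, written in plain Python without importing steppy. The function name adapt_sketch and the use of (source_name, key) tuples as recipes are assumptions made for illustration; steppy's own recipes are built with the E extractor used in the example above, and its internals may differ.

```python
from typing import Any, Dict, Tuple

# Hypothetical stand-in for a recipe: (name of the parent output, key inside it).
Recipe = Tuple[str, str]


def adapt_sketch(recipes: Dict[str, Recipe],
                 all_outputs: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Build the transformer's keyword arguments from the parents' outputs."""
    adapted = {}
    for target_name, (source_name, key) in recipes.items():
        try:
            adapted[target_name] = all_outputs[source_name][key]
        except KeyError as err:
            # steppy defines its own AdapterError exception; this sketch
            # simply re-raises a KeyError with a clearer message.
            raise KeyError(
                "cannot build {!r} from {!r}[{!r}]".format(target_name, source_name, key)
            ) from err
    return adapted


all_outputs = {'train_data': {'data': [[5.1, 3.5], [4.9, 3.0]], 'target': [0, 0]}}
recipes = {'X': ('train_data', 'data'), 'y': ('train_data', 'target')}
adapted = adapt_sketch(recipes, all_outputs)
print(sorted(adapted))
```

The result has the same keys as the recipes (X and y), which is the property the Returns description above states for adapt.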
exception steppy.adapter.AdapterError
Bases: Exception
steppy.base module
class steppy.base.BaseTransformer
Bases: object
Abstraction of two-step execution: fit and transform.
Base transformer is an abstraction strongly inspired by sklearn.Transformer and sklearn.Estimator. Two main concepts are:
1. Every action that can be performed on data (transformation, model training) can be performed in two steps: fitting (where trainable parameters are estimated) and transforming (where previously estimated parameters are used to transform the data into the desired state).
2. Every transformer knows how it should be persisted and loaded (especially useful when working with Keras/PyTorch and sklearn in one pipeline).
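The two-step idea can be sketched with a toy class that does not depend on steppy. MeanCenterer is a hypothetical name; in real code the class would inherit from BaseTransformer, which is left out here so the sketch stays self-contained.

```python
class MeanCenterer:
    """Toy transformer: fit estimates a mean, transform subtracts it."""

    def __init__(self):
        self.mean_ = None

    def fit(self, X):
        # Estimation only: learn the trainable parameter from training data.
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        # Transformation only: reuse the stored parameter, learn nothing new.
        return {'X': [x - self.mean_ for x in X]}

    def fit_transform(self, X):
        # Fit first, then transform the same data.
        return self.fit(X).transform(X)


centerer = MeanCenterer()
train_out = centerer.fit_transform([1.0, 2.0, 3.0])
test_out = centerer.transform([10.0])
```

Note that transform on new data reuses the mean estimated during fit (2.0 here) rather than re-estimating it.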
fit(*args, **kwargs)
Performs estimation of trainable parameters.
Models built with sklearn, Keras or PyTorch, as well as some preprocessing techniques (e.g. normalization), estimate parameters from training data. Those parameters are estimated during fit execution and are persisted for the future. This method should contain only the estimation logic, nothing else.
Parameters: - args – positional arguments (can be anything)
- kwargs – keyword arguments (can be anything)
Returns: self object
Return type:
fit_transform(*args, **kwargs)
Performs fit followed by transform.
This method simply combines fit and transform.
Parameters: - args – positional arguments (can be anything)
- kwargs – keyword arguments (can be anything)
Returns: outputs
Return type: dict
load(filepath)
Loads the trainable parameters of the transformer.
Specific implementation of loading persisted model parameters should be implemented here. In case of transformers that do not learn any parameters one can leave this method as is.
Parameters: filepath (str) – filepath from which the transformer should be loaded
Returns: self instance
Return type: BaseTransformer
persist(filepath)
Saves the trainable parameters of the transformer.
Specific implementation of model parameter persistence should be implemented here. In case of transformers that do not learn any parameters one can leave this method as is.
Parameters: filepath (str) – filepath where the transformer parameters should be persisted
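As an illustration of the persist/load contract, here is a minimal sketch of a transformer with one learned parameter that saves and restores it as JSON. ScaleTransformer and the JSON format are assumptions chosen for the example; steppy does not prescribe a storage format in this documentation, and a real implementation would subclass BaseTransformer.

```python
import json
import os
import tempfile


class ScaleTransformer:
    """Toy transformer with one learned parameter and explicit persistence."""

    def __init__(self):
        self.scale_ = None

    def fit(self, X):
        self.scale_ = max(abs(x) for x in X)
        return self

    def transform(self, X):
        return {'X': [x / self.scale_ for x in X]}

    def persist(self, filepath):
        # Save only what is needed to restore the fitted state.
        with open(filepath, 'w') as f:
            json.dump({'scale': self.scale_}, f)

    def load(self, filepath):
        with open(filepath) as f:
            self.scale_ = json.load(f)['scale']
        return self  # mirrors the documented "self instance" return value


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scale_transformer.json')
    ScaleTransformer().fit([2.0, -4.0]).persist(path)
    restored = ScaleTransformer().load(path)

restored_out = restored.transform([1.0, 2.0])
```

A freshly constructed instance becomes usable for transform after load, without calling fit again.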
transform(*args, **kwargs)
Performs transformation of data.
All data transformation including prediction with deep learning/machine learning models can be performed here. No parameters should be estimated in this method nor stored as class attributes. Only the transformation logic, nothing else.
Parameters: - args – positional arguments (can be anything)
- kwargs – keyword arguments (can be anything)
Returns: outputs
Return type: dict
class steppy.base.IdentityOperation
Bases: steppy.base.BaseTransformer
Transformer that performs identity operation, f(x)=x.
transform(**kwargs)
Performs transformation of data.
All data transformation including prediction with deep learning/machine learning models can be performed here. No parameters should be estimated in this method nor stored as class attributes. Only the transformation logic, nothing else.
Parameters: - args – positional arguments (can be anything)
- kwargs – keyword arguments (can be anything)
Returns: outputs
Return type: dict
class steppy.base.Step(name, transformer, experiment_directory, input_data=None, input_steps=None, adapter=None, cache_output=False, persist_output=False, load_persisted_output=False, force_fitting=False, persist_upstream_pipeline_structure=False)
Bases: object
Step is a building block of steppy pipelines.
It is an execution wrapper over the transformer (see BaseTransformer), which performs a single operation on data. With Step you can:
- design multiple input/output data flows and connections between Steps.
- handle persistence and caching of transformers and intermediate results.
Step executes the fit_transform method, inspired by sklearn, on every step recursively, starting from the very last Step and making its way through the input_steps. One can easily debug the data flow by plotting the pipeline graph (see: persist_as_png()) or by returning the step in a Jupyter notebook cell.
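The recursive execution described above can be sketched with a simplified stand-in. ToyStep is a hypothetical class written for this illustration: it has no adapter, caching or persistence, and a plain function plays the role of the transformer, so it should not be read as steppy's implementation.

```python
class ToyStep:
    """Simplified stand-in for Step: no adapter, caching or persistence."""

    def __init__(self, name, func, input_data=None, input_steps=None):
        self.name = name
        self.func = func  # plays the role of the transformer
        self.input_data = input_data or []
        self.input_steps = input_steps or []

    def fit_transform(self, data):
        inputs = {}
        # Recurse into parent steps first, so their outputs exist before ours.
        for step in self.input_steps:
            inputs[step.name] = step.fit_transform(data)
        # Pick the requested data packets from the pipeline input.
        for key in self.input_data:
            inputs[key] = data[key]
        return self.func(inputs)


double = ToyStep('double',
                 lambda inp: {'x': [2 * v for v in inp['raw']['x']]},
                 input_data=['raw'])
total = ToyStep('total',
                lambda inp: {'sum': sum(inp['double']['x'])},
                input_steps=[double])

result = total.fit_transform({'raw': {'x': [1, 2, 3]}})
```

Calling fit_transform on the last step (total) is enough: it triggers double first and then consumes its output.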
name
str – Step name. Each step in a pipeline must have a unique name. This name is used to persist or cache transformers and outputs of this Step.
transformer
obj – object that inherits from BaseTransformer, or a Step instance. When a Step instance is passed, the transformer from that Step will be copied and used to perform transformations. It is useful when both train and valid data are passed in one pipeline (a common situation in deep learning).
experiment_directory
str – path to the directory where all execution artifacts will be stored. The following sub-directories will be created, if they were not created by other Steps:
- transformers: transformer objects are persisted in this folder
- outputs: step output dictionaries are persisted in this folder (if persist_output=True)
- cache: step output dictionaries are cached in this folder (if cache_output=True).
input_data
list – Elements of this list are keys in the data dictionary that is passed to the Step’s fit_transform and transform methods. List of str, default is empty list.
Example
data_train = {'input': {'images': X_train, 'labels': y_train}}
my_step = Step(name='random_forest',
               transformer=RandomForestTransformer(),
               experiment_directory='./working_dir',
               input_data=['input'])
my_step.fit_transform(data_train)
data_train is a dictionary where:
- keys are names of data packets,
- values are data packets, that is, dictionaries that describe a dataset. In this example the keys in the data packet are images and labels, and the values are actual data of any type.
Step.input_data takes the key from data_train (values must match!) and extracts the actual data that will be passed to the fit_transform and transform methods of self.transformer.
input_steps
list – List of input Steps that the current Step uses as its input. List of Step instances, default is empty list. The current Step combines outputs from input_steps and input_data using the adapter, then passes the result to the transformer methods fit_transform and transform.
Example
self.input_steps=[cnn_step, rf_step, ensemble_step, guesses_step]
Each element of the list is a Step instance.
adapter
obj – It renames and arranges inputs that are passed to the transformer (see BaseTransformer). Default is None. If not None, it must be an instance of the Adapter class.
Example
self.adapter=Adapter({'X': E('input', 'images'), 'y': E('input', 'labels')})
Adapter simplifies the renaming and combining of inputs from multiple steps. In this example, after the adaptation:
X is the key to the data stored under the images key
y is the key to the data stored under the labels key
where both the images and labels keys come from input (see input_data)
cache_output
bool – If True, the Step output dictionary will be cached to <experiment_directory>/cache/<name> when the transform method of the Step transformer is completed. If the same Step is used multiple times, the transform method is invoked only once. Further invocations simply load the output from the <experiment_directory>/cache/<name> directory. Default False: do not cache outputs.
Warning
One should always run pipeline.clean_cache() before executing pipeline.fit_transform(data) or pipeline.transform(data). When working with large datasets, the cache might be very large.
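The caching behaviour, and the reason for the warning, can be sketched in plain Python. The helper cached_transform, the use of pickle and the file layout are assumptions made for illustration; this documentation does not state how steppy serializes cached outputs.

```python
import os
import pickle
import tempfile


def cached_transform(transform, inputs, cache_path):
    """Run `transform` once; later calls load the pickled output instead."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    outputs = transform(inputs)
    with open(cache_path, 'wb') as f:
        pickle.dump(outputs, f)
    return outputs


calls = []


def expensive_transform(inputs):
    calls.append(1)  # record every real invocation
    return {'y': [v + 1 for v in inputs['x']]}


with tempfile.TemporaryDirectory() as experiment_directory:
    cache_dir = os.path.join(experiment_directory, 'cache')
    os.makedirs(cache_dir)
    cache_path = os.path.join(cache_dir, 'my_step')

    first = cached_transform(expensive_transform, {'x': [1, 2]}, cache_path)
    # The new inputs are ignored because a cached file already exists,
    # which is why the cache should be cleaned between runs.
    second = cached_transform(expensive_transform, {'x': [10, 20]}, cache_path)
```

In this sketch the second call returns the output computed for the first inputs, which is exactly the stale-result hazard that cleaning the cache avoids.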
persist_output
bool – If True, the Step output dictionary will be persisted to the <experiment_directory>/outputs/<name> directory after the transform method of the Step transformer is completed. Default False: do not persist any files to disk. Step persists the output to disk after every run of the transformer’s transform method, which means that Step overwrites existing files. See also the load_persisted_output parameter.
Warning
When working with large datasets, the persisted outputs might be very large.
load_persisted_output
bool – If True, the Step output dictionary already persisted to <experiment_directory>/outputs/<name> will be loaded when the Step is called. Default False: do not load persisted output. Useful when debugging and when working with ensemble models or time-consuming feature extraction. One can easily persist already computed pieces of the pipeline and save time by loading them instead of recalculating.
Warning
Re-running the same pipeline on new data with load_persisted_output set to True may lead to errors when outputs from old data are loaded while the user would expect the pipeline to use the new data instead.
force_fitting
bool – If True, the Step transformer will be fitted (via fit_transform) even if <experiment_directory>/transformers/<step_name> exists. Default False: do not force fitting of the transformer. Helpful when one wants to use persist_output=True and load_persisted_output=True on a previous Step and fit the current Step multiple times. This is a typical scenario when tuning hyperparameters for an ensemble model trained on the outputs from first-level models, or for a model built on features that are time-consuming to compute.
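The decision that force_fitting influences can be summarized in a small sketch. The function choose_action and its return strings are hypothetical, and the logic is inferred from the description above (a persisted transformer is reused unless fitting is forced) rather than taken from steppy's source.

```python
def choose_action(transformer_is_persisted, force_fitting):
    """Return which path a step would take, under the stated assumptions."""
    if transformer_is_persisted and not force_fitting:
        # A fitted transformer already exists on disk: reuse it.
        return 'load + transform'
    # Nothing persisted yet, or the user explicitly asked to refit.
    return 'fit_transform'


decisions = {
    (persisted, forced): choose_action(persisted, forced)
    for persisted in (False, True)
    for forced in (False, True)
}
```

Only the combination "transformer persisted, force_fitting=False" skips fitting; every other combination refits.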
persist_upstream_pipeline_structure
bool – If True, the upstream pipeline structure (with regard to the current Step) will be persisted as a json file in the experiment_directory. Default False: do not persist upstream pipeline structure.
all_steps
Build dictionary with all Step instances that are upstream to self.
Returns: dictionary where keys are Step names (str) and values are Step instances (obj)
Return type: all_steps (dict)
clean_cache()
Removes everything from the directory <experiment_directory>/cache.
fit_transform(data)
Fit the model and transform data or load already processed data.
Loads cached or persisted outputs or adapts data for the current transformer and executes transformer.fit_transform.
Parameters: data (dict) – data dictionary with keys as input names and values as dictionaries of key-value pairs that can be passed to the self.transformer.fit_transform method.
Example:
data = {'input_1': {'X': X, 'y': y},
        'input_2': {'X': X, 'y': y}}
Returns: Step outputs from the self.transformer.fit_transform method
Return type: dict
get_step(name)
Extracts step by name from the pipeline.
Extracted step is a fully functional pipeline as well. This method can be used to port parts of the pipeline between problems.
Parameters: name (str) – name of the step to be fetched
Returns: extracted step
Return type: Step (obj)
output_is_cached
(bool) – True if step outputs exist under <experiment_directory>/cache/<name>. See cache_output.
output_is_persisted
(bool) – True if step outputs exist under <experiment_directory>/outputs/<name>. See persist_output.
persist_pipeline_diagram(filepath)
Creates pipeline diagram and persists it to disk as png file.
Pydot graph is created and persisted to disk as png file under the filepath directory.
Parameters: filepath (str) – filepath to which the png with pipeline visualization should be persisted
transform(data)
Transforms data or loads already processed data.
Loads cached or persisted outputs or adapts data for the current transformer and executes its transform method.
Parameters: data (dict) – data dictionary with keys as input names and values as dictionaries of key:value pairs that can be passed to the step.transformer.transform method
Example
data = {'input_1': {'X': X, 'y': y},
        'input_2': {'X': X, 'y': y}}
Returns: step outputs from the transformer.transform method
Return type: dict
transformer_is_cached
(bool) – True if transformer exists under the directory <experiment_directory>/transformers/<step_name>
upstream_pipeline_structure
Build dictionary with entire upstream pipeline structure (with regard to the current Step).
Returns: dictionary describing the upstream pipeline structure. It has two keys: 'edges' and 'nodes', where:
- value of 'edges' is set of tuples (input_step.name, self.name)
- value of 'nodes' is set of all step names upstream to this Step
Return type: dict
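A structure with 'nodes' and 'edges' keys can be built by a simple recursive walk over input_steps. The sketch below uses a hypothetical Node class instead of Step, and it assumes that the step itself is included among the nodes; steppy's actual traversal may differ in such details.

```python
class Node:
    """Hypothetical minimal step: just a name and its parent steps."""

    def __init__(self, name, input_steps=()):
        self.name = name
        self.input_steps = list(input_steps)


def upstream_structure(step):
    """Collect step names and (parent, child) edges upstream of `step`."""
    structure = {'nodes': set(), 'edges': set()}

    def visit(current):
        structure['nodes'].add(current.name)
        for parent in current.input_steps:
            structure['edges'].add((parent.name, current.name))
            visit(parent)

    visit(step)
    return structure


features = Node('features')
model_a = Node('model_a', [features])
model_b = Node('model_b', [features])
ensemble = Node('ensemble', [model_a, model_b])

structure = upstream_structure(ensemble)
```

Because sets are used, the shared parent features appears once in 'nodes' even though it is reached through two paths.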
exception steppy.base.StepsError
Bases: Exception
steppy.base.make_transformer(func)
steppy.utils module
steppy.utils.display_pipeline(structure_dict)
Displays pipeline structure in the jupyter notebook.
Parameters: structure_dict (dict) – dict returned by upstream_pipeline_structure().
steppy.utils.get_logger()
Fetch existing steppy logger.
Example
from steppy.utils import get_logger, initialize_logger

initialize_logger()
logger = get_logger()
logger.info('My message inside pipeline')
result looks like this:
2018-06-02 12:33:48 steppy >>> My message inside pipeline
Returns: logger object formatted in the steppy style
Return type: logging.Logger
steppy.utils.initialize_logger()
Initialize steppy logger.
This logger is used throughout the steppy library to report computation progress.
Example
Simple use of steppy logger:
from steppy.utils import get_logger, initialize_logger

initialize_logger()
logger = get_logger()
logger.info('My message inside pipeline')
result looks like this:
2018-06-02 12:33:48 steppy >>> My message inside pipeline
Returns: logger object formatted in the steppy style
Return type: logging.Logger
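For readers who want a similar output format without steppy, the sketch below reproduces the line layout shown above with the standard logging module. The function make_steppy_style_logger, the logger name and the format string are assumptions derived from the sample output; they are not taken from steppy's source.

```python
import io
import logging


def make_steppy_style_logger(stream):
    """Create a logger whose lines look like 'DATE TIME steppy >>> message'."""
    logger = logging.getLogger('steppy_style_sketch')  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        fmt='%(asctime)s steppy >>> %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'))
    logger.handlers = [handler]  # replace handlers to avoid duplicate lines
    return logger


buffer = io.StringIO()
sketch_logger = make_steppy_style_logger(buffer)
sketch_logger.info('My message inside pipeline')
line = buffer.getvalue().strip()
```

Writing to an in-memory stream keeps the example easy to inspect; passing sys.stderr instead would print to the console.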
steppy.utils.persist_as_png(structure_dict, filepath)
Saves pipeline diagram to disk as png file.
Parameters:
- structure_dict (dict) – dict returned by upstream_pipeline_structure()
- filepath (str) – filepath to which the png with pipeline visualization should be persisted