planning 2018.05.07

notebooks

What is pipeline? (definition)
What problem Steps solves? Useful reading for this:
- Machine Learning: The High-Interest Credit Card of Technical Debt NIPS 2014
- Hidden Technical Debt in Machine Learning Systems NIPS 2015
- What’s your ML Test Score? A rubric for ML production systems NIPS 2016
new repo name:
- steps, ml-steps, vulcan
How do we handle partitioning into training and test datasets? (just use separate data nodes?)
Why do we have to define a caching directory for each and every transformer? Is there a better way to do it (Per project? Context handler? Something else?)
Input has complicated notation.

data = {'input':
          {
               'X': X_train,
               'y': y_train,
           }
        }

Installation remove graphviz from the projects -> use plotly (refactor) steps-core only few dependencies

API-design nested dicts in input -> simplify interface -> DataStep should merge input_step and input_data into one API piece.

Write down difference and relation between Step and transformer (part of Step) You add input that is never used -> you see it on the graph.

cache_transformer -> persisted_transformer

Do Not transform twice!!!

Question: add single step as input to ensembling.