Imagine flipping a switch on a set of train tracks. A mungebit behaves like this: once the switch is flipped (that is, once the mungebit has been trained), it can only run the predict_function; until then, it runs the train_function.

The train function is responsible for performing a munging step and storing metadata that can replicate the munging step in a live production environment without the full reference data set.

The predict function is responsible for performing a munging step using metadata it computed during an earlier training step. This is usually done in a live production environment setting.
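
For example, the first call to run executes the train_function and flips the switch; every subsequent call executes the predict_function. A minimal sketch, assuming the mungebit constructor accepts a single function (as in the Examples below) in which the injected trained flag distinguishes the two phases:

mb <- mungebit(function(data) {
  if (!isTRUE(trained)) {
    cat("Training run\n")     # executed only on the first call
  } else {
    cat("Prediction run\n")   # executed on every subsequent call
  }
  data
})

mb$run(iris) # > Training run
mb$run(iris) # > Prediction run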

The majority of data projects are weighed down by the burden of excessive data wrangling. Part of the problem is that when new data drawn from the same source as the original (such as the training set for a statistical model) is introduced, different code must be written to achieve the same transformations. Mungebits solve this problem by forcing the user to determine how to correctly munge out-of-sample data (such as live streaming data in the form of one-row data.frames) at "munge-time", when the reason for the wrangling is still apparent. A frequent source of data errors is treating this process as an afterthought.

Usage

mungebit_run(data, ...)

mungebit_train(data, ..., `_envir` = parent.frame())

mungebit_predict(data, ..., `_envir` = parent.frame())

mungebit

Arguments

data

environment or data.frame. Essentially an environment containing a data variable. In that case, the data variable will be modified in place as a side effect. If a data.frame, the return value will be the modified data.frame, and the mungebit will record in its input any results it must memorize.

...

additional arguments to the mungebit's train_function or predict_function.

_envir

environment. Internal argument used for determining the execution context of the invoked train_function or predict_function.

Value

For mungebit_run: the modified data, whether it is an environment or data.frame.

For mungebit_train: the modified data, whether it is an environment or data.frame. Side effects on the input local variable provided to the train_function will be recorded on the mungebit object.

For mungebit_predict: the modified data, whether it is an environment or data.frame. Side effects on the input local variable provided to the predict_function will be recorded on the mungebit object.

Details

The purpose of the train function is to

  1. Perform some munging on the data set, such as renaming columns, creating derived features, performing principal component analysis, replacing some values, removing outliers, etc.

  2. Store the metadata necessary to replicate the munging operation after the original training set is no longer available. For example, if we are imputing a variable, we would need to remember its mean so we can use it later to replace NA values.

The purpose of the predict function is to

  1. Perform some munging on the data set, such as renaming columns, creating derived features, performing principal component analysis, replacing some values, removing outliers, etc.

  2. Use the metadata computed during the train step to correctly perform this munging.

Consider the following problem. Imagine we wish to discretize a variable, with cutoffs determined algorithmically, say [0, 0.5), [0.5, 1.5), [1.5, 3). When we apply the same transformation to a new data set, we cannot run the same discretization code, since it may produce new cutoffs, and hence invalidate the results if, for example, we had trained a model on the prior cutoffs. To ensure the exact same mathematical transformation is performed on new data--whether a new test set derived from recent data or a one-row data.frame representing a single record streaming through a production system--we must run different code on the "original" set versus the new set, as in the sketch below.
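
A minimal sketch of such a train/predict pair, assuming input is a mutable environment shared by the two functions and using quantile-based cutoffs purely for illustration:

input <- new.env()

discretize_train <- function(data, variable) {
  # Derive the cutoffs from the training data and memorize them.
  input$breaks <- unique(quantile(data[[variable]],
                                  probs = c(0, 1/3, 2/3, 1), na.rm = TRUE))
  data[[variable]] <- cut(data[[variable]], breaks = input$breaks,
                          include.lowest = TRUE, right = FALSE)
  data
}

discretize_predict <- function(data, variable) {
  # Re-use the memorized cutoffs; never recompute them on new data.
  data[[variable]] <- cut(data[[variable]], breaks = input$breaks,
                          include.lowest = TRUE, right = FALSE)
  data
}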

Mathematically speaking, a transformation of a data set can be represented by a single mathematical function that is implemented differently during "training" versus "prediction." Here, "training" refers to the first time the transformation is performed, and "prediction" refers to subsequent times, such as on newly obtained data or a one-row data.frame representing a single new record in a production system.

Therefore, the correct approach to data preparation, if you wish to re-use it in the future on new data sets or in a live production environment, is to treat it as a collection of tuples (train_function, predict_function, input), where train_function represents the original code, input represents an arbitrary R object such as a list, used for storing "metadata" necessary to re-create the original transformation, and the predict_function takes this input metadata and produces the identical transformation on an out-of-sample data set.

For example, if we wish to impute a data set, the train_function might compute the mean, store it in input$mean, replace the NA values with the mean, and return the data set. Meanwhile, the predict_function simply replaces the NA values with the cached input$mean, as in the sketch below.
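
A minimal sketch of this tuple, assuming input is an environment shared by the pair (illustrative only, not the package's internal representation):

input <- new.env()

impute_train <- function(data, column) {
  # Memorize the training mean so it can be re-used at predict time.
  input$mean <- mean(data[[column]], na.rm = TRUE)
  data[[column]][is.na(data[[column]])] <- input$mean
  data
}

impute_predict <- function(data, column) {
  # Replace NAs with the cached training mean, never a freshly computed one.
  data[[column]][is.na(data[[column]])] <- input$mean
  data
}

bit <- list(train_function   = impute_train,
            predict_function = impute_predict,
            input            = input)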

Usually, these steps live in disjoint code bases: the modeler performs the ad hoc munging while exploring the data set, and a software engineer takes the computed input$mean and hard-codes it into a "data pipeline". It would be infeasible to recompute the mean on the fly, since it depends on the original data set, which may be prohibitively large. However, while computing the original input may require substantial space and time, since it is potentially parameterized by a very large data set, the input itself is usually small and the resulting predict_function is inexpensive.

The fundamental problem of data preparation, and the reason data scientists reportedly spend the vast majority of their time on it, is a lack of respect for this dichotomy. Using mungebits makes this duality blatantly apparent in all circumstances and will hopefully reduce the amount of time wasted on cumbersome wrangling.

Examples

not_run({
  mb <- mungebit(column_transformation(function(col, scale = NULL) {
    if (isTRUE(trained)) { # trained is an injected keyword
      cat("Column scaled by", input$scale, "\n")
    } else {
      input$scale <- scale
    }
    col * input$scale
  }))

  iris2 <- mb$run(iris, "Sepal.Length", 2)
  # iris2 now contains a copy of iris with Sepal.Length doubled.
  iris3 <- mb$run(iris2, "Sepal.Length")
  # > Column scaled by 2
  head(iris3[[1]] / iris[[1]])
  # > [1] 4 4 4 4 4 4
})