The munge function allows a convenient format for applying a sequence of mungepiece objects to a dataset.
munge(data, mungelist, stagerunner = FALSE, list = FALSE, parse = TRUE)
| Argument | Description |
|---|---|
| data | data.frame. Raw, uncleaned data. |
| mungelist | list. A list of lists which will be translated to a list of mungepieces. It is also possible to pass a list of mungepieces, but often the special syntax is more convenient. See the examples section. |
| stagerunner | logical or list. Either [...] One can also provide a list with a [...] |
| list | logical. Whether or not to return the list of mungepieces instead of executing them on the data. |
| parse | logical. Whether or not to pre-parse the mungelist. |
A cleaned data.frame, the result of applying each mungepiece constructed from the mungelist.
The munge helper accepts a raw, pre-munged (pre-cleaned) dataset and a list of lists. Each sublist represents the code and hyperparameters necessary to clean the dataset. For example, the first sublist could consist of an imputation function and a list of variables to apply the imputation to. It is important to understand what a mungebit and a mungepiece do before using the munge helper, as it constructs these objects on the fly for its operation.
The end result of calling munge is a fully cleaned data set (i.e., one to which all the mungepieces have been applied and trained) adjoined with a "mungepieces" attribute: the list of trained mungepieces.
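
The idea of carrying the trained steps along with the cleaned data can be sketched in base R, without the package installed. In this sketch the "pieces" are plain functions standing in for trained mungepieces, which is an assumption made purely for illustration and not how munge is implemented.

```r
# Hypothetical stand-in for trained mungepieces: a list of plain functions.
pieces <- list(
  function(d) d[, setdiff(names(d), "Species"), drop = FALSE]
)

# Apply each step in order to the raw data.
cleaned <- Reduce(function(d, f) f(d), pieces, iris)

# Attach the steps to the result, mimicking the "mungepieces" attribute.
attr(cleaned, "mungepieces") <- pieces

# Later, replay exactly the same steps on a single new row.
new_row  <- iris[1, ]
replayed <- Reduce(function(d, f) f(d), attr(cleaned, "mungepieces"), new_row)
```

Because the steps travel with the data, the code that processes a new row does not need to know how the original pipeline was specified.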
For each sublist in the list of pre-mungepieces passed to munge, the following formats are available. See the examples for a more hands-on illustration.
list(train_fn, ...) -- If the first element of args is a function followed by other arguments, the constructed mungepiece will use train_fn as both the train and predict function for the mungebit, and list(...) (that is, the remaining elements in the list) will be used as both the train and predict arguments in the mungepiece. In other words, this format specifies that you would like exactly the same behavior in training as in prediction. This is appropriate for mungebits that operate in place and do not need information obtained from the training set, such as simple value replacement or column removal.
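
As a rough illustration of the first format, the sketch below applies such a sublist by hand in base R, so it runs without the package. Both drop_vars and apply_piece are hypothetical helpers introduced here; the sketch only mimics the idea that one function and one argument list serve both training and prediction.

```r
# Hypothetical helper: drop the named columns from a data.frame.
drop_vars <- function(data, vars) {
  data[, setdiff(names(data), vars), drop = FALSE]
}

# First format: a function followed by its arguments.
piece <- list(drop_vars, c("Sepal.Width", "Petal.Width"))

# Hand-rolled stand-in for running the sublist: call the function on the
# data, passing the remaining list elements as additional arguments.
apply_piece <- function(data, piece) {
  do.call(piece[[1]], c(list(data), piece[-1]))
}

train_result   <- apply_piece(iris, piece)       # "training" run
predict_result <- apply_piece(iris[1, ], piece)  # "prediction" run, same behavior
```

Both calls go through the same function with the same arguments, which is the defining property of this format.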
list(list(train_fn, predict_fn), ...) -- If the first element of args is a two-element pair, it must be a pair of either NULLs or functions, and the two elements must not both be NULL. If train_fn or predict_fn is NULL, the mungepiece has no effect during training or prediction, respectively. The remaining arguments, that is list(...), will be used as both the training and prediction arguments.

This structure is ideal if the behavior during training and prediction has an identical parametrization but a very different implementation, such as imputation, so you can pass two different functions. It is also useful if you wish to have no effect during prediction, such as removing faulty rows during training, or no effect during training, such as making a few transformations that are only necessary on raw production data rather than the training data.
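
The imputation case can be sketched in base R as follows. The state environment, impute_train, and impute_predict are all hypothetical names; in particular, sharing state through an explicit environment is an assumption of this sketch and not a description of how the package stores trained values.

```r
# Hypothetical shared state for values learned during training.
state <- new.env()

# Training: learn each column mean from the data, then fill missing values.
impute_train <- function(data, cols) {
  for (col in cols) {
    col_mean <- mean(data[[col]], na.rm = TRUE)
    assign(col, col_mean, envir = state)
    data[[col]][is.na(data[[col]])] <- col_mean
  }
  data
}

# Prediction: reuse the stored means instead of recomputing them.
impute_predict <- function(data, cols) {
  for (col in cols) {
    data[[col]][is.na(data[[col]])] <- get(col, envir = state)
  }
  data
}

# Second format: a (train_fn, predict_fn) pair followed by shared arguments.
piece <- list(list(impute_train, impute_predict), "Sepal.Length")

train_data <- iris
train_data$Sepal.Length[1:3] <- NA
trained <- do.call(piece[[1]][[1]], c(list(train_data), piece[-1]))

new_row <- iris[150, ]
new_row$Sepal.Length <- NA
predicted <- do.call(piece[[1]][[2]], c(list(new_row), piece[-1]))
```

The two functions take the same arguments (the columns to impute) but differ in implementation: only the training function computes the mean, so a single incoming row is filled with the value learned from the training set.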
list(train = list(train_fn, ...), predict = list(predict_fn, ...)) -- If args consists of exactly two named elements with names "train" and "predict", then the first format will be used for each of the respective fields. In other words, a mungepiece will be constructed consisting of a mungebit with train_fn as the training function and predict_fn as the predict function; the mungepiece train arguments will be the additional arguments list(...) from the train list, and similarly the predict arguments will be the additional arguments list(...) from the predict list.

Note that train_fn and predict_fn must both be functions and not NULL, since otherwise we could simply use the second format described above.

This format is ideal when the parametrization differs during training and prediction. In this case, train_fn usually should be the same as predict_fn, but the additional arguments in each list can be used to express the parametrized discrepancies. For example, to sanitize a dataset one may wish to drop unnecessary variables. During training, the list of dropped variables excludes the dependent variable, but during prediction we may wish to drop the dependent variable as well.

This format can also be used to perform totally different behavior on the dataset during training and prediction (different functions and parameters), but mungebits should by definition achieve the same operation during training and prediction, so this use case is rare and should be handled carefully.
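
The variable-dropping example can be sketched in base R like this. drop_vars and run_phase are hypothetical helpers, and the hand-written dispatch on "train" versus "predict" only imitates the behavior described above; it is not the package's implementation.

```r
# Hypothetical helper: drop the named columns from a data.frame.
drop_vars <- function(data, vars) {
  data[, setdiff(names(data), vars), drop = FALSE]
}

# Third format: named "train" and "predict" sublists, each written in the
# first format (a function followed by its arguments).
piece <- list(
  train   = list(drop_vars, "Sepal.Width"),
  predict = list(drop_vars, c("Sepal.Width", "Species"))
)

# Hand-rolled stand-in: pick the sublist for the phase and call its function.
run_phase <- function(data, piece, phase) {
  spec <- piece[[phase]]
  do.call(spec[[1]], c(list(data), spec[-1]))
}

trained   <- run_phase(iris, piece, "train")         # keeps Species
predicted <- run_phase(iris[1, ], piece, "predict")  # also drops Species
```

Here the same function is used in both phases and only the argument lists differ, which matches the common use of this format.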
mungebit, mungepiece, parse_mungepiece
    # First, we show off the various formats that the parse_mungepiece
    # helper accepts. For this exercise, we can use dummy train and
    # predict functions and arguments.
    train_fn <- predict_fn <- function(x, ...) { x }
    train_arg1 <- predict_arg1 <- dual_arg1 <- TRUE # Can be any parameter value.

    # The typical way to construct mungepieces would be using the constructor.
    piece <- mungepiece$new(
      mungebit$new(train_fn, predict_fn),
      list(train_arg1), list(predict_arg1)
    )

    # This is tedious and can be simplified with the munge syntax, which
    # allows one to specify a nested list that defines all the mungebits
    # and enclosing mungepieces at once.
    raw_data <- iris
    munged_data <- munge(raw_data, list(
      # If the train function with train args is the same as the predict function
      # with predict args, we use this syntax. The first element should be
      # the function we use for both training and prediction. The remaining
      # arguments will be used as both the `train_args` and `predict_args`
      # for the resulting mungepiece.
      "Same train and predict" = list(train_fn, train_arg1, train_arg2 = "blah"),

      # If the train and predict arguments to the mungepiece match, but we
      # wish to use a different train versus predict function for the mungebit.
      "Different functions, same args" =
        list(list(train_fn, predict_fn), dual_arg1, dual_arg2 = "blah"),

      # If we wish to only run this mungepiece during training.
      "Only run in train" = list(list(train_fn, NULL), train_arg1, train_arg2 = "blah"),

      # If we wish to only run this mungepiece during prediction.
      "Only run in predict" = list(list(NULL, predict_fn), predict_arg1, predict_arg2 = "blah"),

      # If we wish to run different arguments but the same function during
      # training versus prediction.
      "Totally different train and predict args, but same functions" =
        list(train = list(train_fn, train_arg1),
             predict = list(train_fn, predict_arg1)),

      # If we wish to run different arguments with different functions during
      # training versus prediction.
      "Totally different train and predict function and args" =
        list(train = list(train_fn, train_arg1),
             predict = list(predict_fn, predict_arg1))
    )) # End the call to munge()

    # This is an abstract example that was purely meant to illustrate syntax.
    # The munged_data variable will have the transformed data set along
    # with a "mungepieces" attribute recording a list of trained mungepieces
    # derived from the above syntax.

    # A slightly more real-life example.
    not_run({
      munged_data <- munge(raw_data, list(
        # Named train/predict sublists, since the arguments differ by phase.
        "Drop useless vars" = list(
          train   = list(drop_vars, vector_of_variables),
          predict = list(drop_vars, c(vector_of_variables, "dep_var"))
        ),
        "Impute variables" = list(imputer, imputed_vars),
        "Discretize vars"  = list(list(discretize, restore_levels), discretized_vars)
      ))

      # Here, we have requested to munge the raw_data by dropping useless variables
      # (including the dependent variable dep_var after model training),
      # imputing a static list of imputed_vars, and discretizing a static list
      # of discretized_vars, being careful to use separate logic when merely
      # using the computed discretization cuts to bin the numeric features into
      # categorical features. The end result is a munged_data set with an
      # attribute "mungepieces" that holds the list of mungepieces used for
      # munging the data, and can be used to perform the exact same set of
      # operations on a single row dataset coming through in a real-time production
      # system.
      munged_single_row_of_data <- munge(single_row_raw_data, munged_data)
    })
    # The munge function uses the attached "mungepieces" attribute, a list of
    # trained mungepieces.