Constructing mungepieces and mungebits by hand is a little tedious. To simplify the process, we introduce a tiny DSL that allows for easier construction of mungebits. This function is intended to be used in conjunction with a list passed to the munge helper.

parse_mungepiece(args)

Arguments

args

list. A list of arguments. This can be in one of the following formats:

  1. list(train_fn, ...) -- If the first element of args is a function followed by other arguments, the constructed mungepiece will use the train_fn as both the train and predict function for the mungebit, and list(...) (that is, the remaining elements in the list) will be used as both the train and predict arguments in the mungepiece. In other words, using this format specifies you would like exactly the same behavior in training as in prediction. This is appropriate for mungebits that operate in place and do not need information obtained from the training set, such as simple value replacement or column removal.

  2. list(list(train_fn, predict_fn), ...) -- If the first element of args is a two-element list, it must be a pair of either NULLs or functions, with not both elements NULL. If train_fn is NULL, the mungepiece will have no effect during training; if predict_fn is NULL, it will have no effect during prediction. The remaining arguments, that is list(...), will be used as both the training and prediction arguments. This structure is ideal if the behavior during training and prediction has an identical parametrization but a very different implementation, such as imputation, so you can pass two different functions. It is also useful if you wish to have no effect during prediction, such as removing faulty rows during training, or no effect during training, such as making a few transformations that are only necessary on raw production data rather than the training data.

  3. list(train = list(train_fn, ...), predict = list(predict_fn, ...)) -- If args is a list consisting of exactly two named elements with names "train" and "predict", then the first format will be used for the respective fields. In other words, a mungepiece will be constructed consisting of a mungebit with train_fn as the training function and predict_fn as the predict function; the mungepiece train arguments will be the additional arguments list(...) in the train list, and similarly the predict arguments will be the additional arguments list(...) in the predict list. Note that train_fn and predict_fn must both be functions and not NULL; if either were NULL, the second format described above would suffice. This format is ideal when the parametrization differs during training and prediction. In this case, train_fn usually should be the same as predict_fn, but the additional arguments in each list can be used to identify the parametrized discrepancies. For example, to sanitize a dataset one may wish to drop unnecessary variables. During training, the dropped variables must not include the dependent variable, but during prediction we may wish to drop the dependent variable as well. This format can also be used to perform totally different behavior on the dataset during training and prediction (different functions and parameters), but mungebits should by definition achieve the same operation during training and prediction, so this use case is rare and should be handled carefully.
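As a rough illustration of the third format, the sketch below mimics in plain R how the train and predict lists could each supply a function plus its own arguments. The helper apply_part, the toy drop_vars function, and the example data frame are all hypothetical and are not part of the package; the actual mungepiece machinery is more involved.

```r
# Hypothetical stand-in for a "drop variables" function.
drop_vars <- function(data, vars) {
  data[, setdiff(names(data), vars), drop = FALSE]
}

# A specification in the third format: same function, different arguments.
spec <- list(
  train   = list(drop_vars, "id"),
  predict = list(drop_vars, c("id", "dep_var"))
)

# Hypothetical helper: the first element is the function, the remaining
# elements are passed as arguments after the data.
apply_part <- function(data, part) {
  do.call(part[[1]], c(list(data), part[-1]))
}

toy <- data.frame(id = 1:3, x = c(0.1, 0.2, 0.3), dep_var = c(0, 1, 0))

train_result   <- apply_part(toy, spec$train)    # keeps x and dep_var
predict_result <- apply_part(toy, spec$predict)  # keeps only x
```

The point of the sketch is only that the function stays the same while the argument lists differ between training and prediction.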

Note that the above trichotomy is exhaustive: any mungepiece can be constructed using this helper, regardless of its mungebit's train or predict function or its own train or predict arguments. In the first two formats, the first unnamed list element is always reserved and will never belong to the train_args or predict_args of the mungepiece.

Also note that in the first two formats, the first list element must be unnamed.
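The three formats can be told apart by inspecting only the names of args and its first element. The following is a minimal sketch of such a check, assuming the rules stated above; classify_mungepiece_args is a hypothetical name and does not reflect how parse_mungepiece is actually implemented.

```r
# Hypothetical helper, not part of the package: reports which of the three
# documented formats a list of arguments matches.
classify_mungepiece_args <- function(args) {
  stopifnot(is.list(args), length(args) > 0)
  arg_names <- names(args)

  # Format 3: exactly two elements named "train" and "predict".
  if (length(args) == 2 && setequal(arg_names, c("train", "predict"))) {
    return("train/predict lists")
  }

  # Formats 1 and 2 require the first element to be unnamed.
  if (!is.null(arg_names) && nzchar(arg_names[1])) {
    stop("The first element must be unnamed.")
  }

  first <- args[[1]]

  # Format 1: a single function shared by training and prediction.
  if (is.function(first)) {
    return("single function")
  }

  # Format 2: a pair of functions or NULLs, not both NULL.
  is_fn_or_null <- function(x) is.null(x) || is.function(x)
  if (is.list(first) && length(first) == 2 &&
      all(vapply(first, is_fn_or_null, logical(1))) &&
      !all(vapply(first, is.null, logical(1)))) {
    return("function pair")
  }

  stop("Unrecognized format.")
}

f <- function(x, ...) x
format1 <- classify_mungepiece_args(list(f, TRUE, other = "blah"))
format2 <- classify_mungepiece_args(list(list(f, NULL), TRUE))
format3 <- classify_mungepiece_args(list(train = list(f, 1), predict = list(f, 2)))
```

Here format1, format2, and format3 hold the labels "single function", "function pair", and "train/predict lists" respectively.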

Value

The constructed mungepiece.

Note

To understand the documentation of this helper, please read the documentation on mungebit and mungepiece first.

See also

mungepiece, mungebit.

Examples

# First, we show off the various formats that the parse_mungepiece
# helper accepts. For this exercise, we can use dummy train and
# predict functions and arguments.
train_fn <- predict_fn <- function(x, ...) { x }
train_arg1 <- predict_arg1 <- dual_arg1 <- TRUE # Can be any parameter value.

# If the train function with train args is the same as the predict function
# with predict args.
piece <- parse_mungepiece(list(train_fn, train_arg1, train_arg2 = "blah"))

# If the train and predict arguments to the mungepiece match, but we
# wish to use a different train versus predict function for the mungebit.
piece <- parse_mungepiece(list(list(train_fn, predict_fn), dual_arg1, dual_arg2 = "blah"))

# If we wish to only run this mungepiece during training.
piece <- parse_mungepiece(list(list(train_fn, NULL), train_arg1, train_arg2 = "blah"))

# If we wish to only run this mungepiece during prediction.
piece <- parse_mungepiece(list(list(NULL, predict_fn), predict_arg1, predict_arg2 = "blah"))

# If we wish to run different arguments but the same function during
# training versus prediction.
piece <- parse_mungepiece(list(train = list(train_fn, train_arg1),
                               predict = list(train_fn, predict_arg1)))

# If we wish to run different arguments with different functions during
# training versus prediction.
piece <- parse_mungepiece(list(train = list(train_fn, train_arg1),
                               predict = list(predict_fn, predict_arg1)))

# The munge function uses the format defined in parse_mungepiece to create
# and execute a list of mungepieces on a dataset.
not_run({
  munged_data <- munge(raw_data, list(
    "Drop useless vars" = list(train   = list(drop_vars, vector_of_variables),
                               predict = list(drop_vars, c(vector_of_variables, "dep_var"))),
    "Impute variables"  = list(imputer, imputed_vars),
    "Discretize vars"   = list(list(discretize, restore_levels), discretized_vars)
  ))

  # Here, we have requested to munge the raw_data by dropping useless variables,
  # including the dependent variable dep_var after model training,
  # imputing a static list of imputed_vars, and discretizing a static list
  # of discretized_vars, being careful to use separate logic when merely
  # using the computed discretization cuts to bin the numeric features into
  # categorical features. The end result is a munged_data set with an
  # attribute "mungepieces" that holds the list of mungepieces used for
  # munging the data, and can be used to perform the exact same set of
  # operations on a single row dataset coming through in a real-time production
  # system.
  munged_single_row_of_data <- munge(single_row_raw_data, munged_data)
})
# The munge function uses the attached "mungepieces" attribute, a list of
# trained mungepieces.