The munge function provides a convenient format for applying a sequence of mungepiece objects to a dataset.

munge(data, mungelist, stagerunner = FALSE, list = FALSE, parse = TRUE)

Arguments

data

data.frame. Raw, uncleaned data.

mungelist

list. A list of lists which will be translated to a list of mungepieces. It is also possible to pass a list of mungepieces, but often the special syntax is more convenient. See the examples section.

stagerunner

logical or list. Either TRUE or FALSE, by default the latter. If TRUE, a stagerunner object will be returned whose context will contain a key data after being run, namely the munged data set (with a "mungepieces" attribute).

One can also provide a list with a remember parameter, which will be used to construct a stagerunner with the same value for its remember parameter.

list

logical. Whether or not to return the list of mungepieces instead of executing them on the data. By default FALSE.

parse

logical. Whether or not to pre-parse the mungelist using parse_mungepiece. Note that if this is TRUE, any trained mungepieces will be duplicated and marked as untrained. By default, TRUE.

Value

A cleaned data.frame, the result of applying each mungepiece constructed from the mungelist.

Details

The munge helper accepts a raw, pre-munged (pre-cleaned) dataset and a list of lists. Each sublist represents the code and hyperparameters necessary to clean the dataset. For example, the first sublist could consist of an imputation function and a list of variables to apply the imputation to. It is important to understand what mungebits and mungepieces do before using the munge helper, as it constructs these objects on-the-fly for its operation.

The end result of calling munge is a fully cleaned data set (i.e., one to which all the mungepieces have been applied, training them in the process) adjoined with a "mungepieces" attribute: the list of trained mungepieces.
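To make the train/predict distinction concrete, here is a minimal base R sketch of the idea. It is not the mungebits implementation: `make_toy_piece`, `impute_train`, `impute_predict`, and the `"toy_pieces"` attribute are hypothetical names invented for this illustration. The first run learns a value from the data; later runs reuse it.

```r
# Hypothetical illustration in base R; this is NOT how mungebits is implemented.
# A toy "piece" bundles a train function, a predict function, and learned state.
make_toy_piece <- function(train_fn, predict_fn) {
  state <- new.env()   # holds whatever was learned during training
  trained <- FALSE
  list(
    run = function(data, ...) {
      if (!trained) {
        result <- train_fn(data, state, ...)
        trained <<- TRUE   # every later call takes the predict branch
        result
      } else {
        predict_fn(data, state, ...)
      }
    },
    trained = function() trained
  )
}

# Training computes the column mean and stores it; prediction only reuses it.
impute_train <- function(data, state, col) {
  state$mean <- mean(data[[col]], na.rm = TRUE)
  data[[col]][is.na(data[[col]])] <- state$mean
  data
}
impute_predict <- function(data, state, col) {
  data[[col]][is.na(data[[col]])] <- state$mean
  data
}

piece <- make_toy_piece(impute_train, impute_predict)

train_data <- data.frame(x = c(1, NA, 3))
cleaned <- piece$run(train_data, "x")        # first call trains the piece
attr(cleaned, "toy_pieces") <- list(piece)   # mimic carrying pieces with the data

new_row <- data.frame(x = NA_real_)
scored <- attr(cleaned, "toy_pieces")[[1]]$run(new_row, "x")  # reuses stored mean
```

The single new row cannot supply a mean of its own, which is why the trained state has to travel with the cleaned data.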

For each sublist in the list of pre-mungepieces passed to munge, the following formats are available. See the examples section for hands-on usage.

  1. list(train_fn, ...) -- If the first element of args is a function followed by other arguments, the constructed mungepiece will use the train_fn as both the train and predict function for the mungebit, and list(...) (that is, the remaining elements in the list) will be used as both the train and predict arguments in the mungepiece. In other words, using this format specifies you would like exactly the same behavior in training as in prediction. This is appropriate for mungebits that operate in place and do not need information obtained from the training set, such as simple value replacement or column removal.

  2. list(list(train_fn, predict_fn), ...) -- If the first element of args is a two-element pair, it must be a pair of either NULLs or functions, with not both elements NULL. If train_fn is NULL, the mungepiece has no effect during training; if predict_fn is NULL, it has no effect during prediction. The remaining arguments, that is list(...), will be used as both the training and prediction arguments. This structure is ideal if the behavior during training and prediction has an identical parametrization but very different implementation, such as imputation, so you can pass two different functions. It is also useful if you wish to have no effect during prediction, such as removing faulty rows during training, or no effect during training, such as making a few transformations that are only necessary on raw production data rather than the training data.

  3. list(train = list(train_fn, ...), predict = list(predict_fn, ...)) -- If args is a list of exactly two named elements with names "train" and "predict", then the first format will be used for the respective fields. In other words, a mungepiece will be constructed consisting of a mungebit with train_fn as the training function and predict_fn as the predict function; the mungepiece train arguments will be the additional arguments list(...) in the train list, and similarly the predict arguments will be the additional arguments list(...) in the predict list. Note train_fn and predict_fn must both be functions and not NULL, since otherwise the second format described above would suffice. This format is ideal when the parametrization differs during training and prediction. In this case, train_fn usually should be the same as predict_fn, but the additional arguments in each list can be used to identify the parametrized discrepancies. For example, to sanitize a dataset one may wish to drop unnecessary variables. During training, the variables to drop exclude the dependent variable, but during prediction we may wish to drop the dependent variable as well. This format can also be used to perform totally different behavior on the dataset during training and prediction (different functions and parameters), but mungebits should by definition achieve the same operation during training and prediction, so this use case is rare and should be handled carefully.
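The three formats above can be summarized as a dispatch on the shape of the sublist. The following base R sketch is only an illustration of that dispatch: `normalize_piece_spec` is a hypothetical helper written for this page, not parse_mungepiece, and the real parser may differ in its checks and return value.

```r
# Hypothetical helper, for illustration only; it is not parse_mungepiece.
# It maps a sublist in one of the three formats to explicit train/predict parts.
normalize_piece_spec <- function(spec) {
  is_fn_or_null <- function(x) is.function(x) || is.null(x)

  # Format 3: exactly two elements named "train" and "predict".
  if (length(spec) == 2 && setequal(names(spec), c("train", "predict"))) {
    stopifnot(is.function(spec$train[[1]]), is.function(spec$predict[[1]]))
    return(list(
      train_fn   = spec$train[[1]],   train_args   = spec$train[-1],
      predict_fn = spec$predict[[1]], predict_args = spec$predict[-1]
    ))
  }

  first <- spec[[1]]
  rest <- spec[-1]

  # Format 1: one function, used for both training and prediction.
  if (is.function(first)) {
    return(list(train_fn = first, train_args = rest,
                predict_fn = first, predict_args = rest))
  }

  # Format 2: a (train_fn, predict_fn) pair; either may be NULL, but not both.
  stopifnot(is.list(first), length(first) == 2,
            is_fn_or_null(first[[1]]), is_fn_or_null(first[[2]]),
            !(is.null(first[[1]]) && is.null(first[[2]])))
  list(train_fn = first[[1]], train_args = rest,
       predict_fn = first[[2]], predict_args = rest)
}

# Dummy functions standing in for real cleaning logic.
f <- function(x, ...) x
g <- function(x, ...) x

s1 <- normalize_piece_spec(list(f, "a", n = 1))                 # format 1
s2 <- normalize_piece_spec(list(list(f, NULL), "a"))            # format 2
s3 <- normalize_piece_spec(list(train = list(f, "a"),
                                predict = list(g, "b")))        # format 3
```

In this sketch, formats 1 and 2 always share one argument list between training and prediction, while only format 3 lets the arguments diverge.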

See also

mungebit, mungepiece, parse_mungepiece

Examples

# First, we show off the various formats that the parse_mungepiece
# helper accepts. For this exercise, we can use dummy train and
# predict functions and arguments.
train_fn <- predict_fn <- function(x, ...) { x }
train_arg1 <- predict_arg1 <- dual_arg1 <- TRUE # Can be any parameter value.

# The typical way to construct mungepieces would be using the constructor.
piece <- mungepiece$new(
  mungebit$new(train_fn, predict_fn),
  list(train_arg1), list(predict_arg1)
)

# This is tedious and can be simplified with the munge syntax, which
# allows one to specify a nested list that defines all the mungebits
# and enclosing mungepieces at once.
raw_data <- iris
munged_data <- munge(raw_data, list(
  # If the train function with train args is the same as the predict function
  # with predict args, we use this syntax. The first element should be
  # the function we use for both training and prediction. The remaining
  # arguments will be used as both the `train_args` and `predict_args`
  # for the resulting mungepiece.
  "Same train and predict" = list(train_fn, train_arg1, train_arg2 = "blah"),

  # If the train and predict arguments to the mungepiece match, but we
  # wish to use a different train versus predict function for the mungebit.
  "Different functions, same args" =
    list(list(train_fn, predict_fn), dual_arg1, dual_arg2 = "blah"),

  # If we wish to only run this mungepiece during training.
  "Only run in train" =
    list(list(train_fn, NULL), train_arg1, train_arg2 = "blah"),

  # If we wish to only run this mungepiece during prediction.
  "Only run in predict" =
    list(list(NULL, predict_fn), predict_arg1, predict_arg2 = "blah"),

  # If we wish to run different arguments but the same function during
  # training versus prediction.
  "Totally different train and predict args, but same functions" =
    list(train = list(train_fn, train_arg1),
         predict = list(train_fn, predict_arg1)),

  # If we wish to run different arguments with different functions during
  # training versus prediction.
  "Totally different train and predict function and args" =
    list(train = list(train_fn, train_arg1),
         predict = list(predict_fn, predict_arg1))
)) # End the call to munge()

# This is an abstract example that was purely meant to illustrate syntax.
# The munged_data variable will have the transformed data set along
# with a "mungepieces" attribute recording a list of trained mungepieces
# derived from the above syntax.

# A slightly more real-life example.
not_run({
  munged_data <- munge(raw_data, list(
    # Named train/predict elements, as required by the third format
    # described in the Details section.
    "Drop useless vars" =
      list(train = list(drop_vars, vector_of_variables),
           predict = list(drop_vars, c(vector_of_variables, "dep_var"))),
    "Impute variables" = list(imputer, imputed_vars),
    "Discretize vars" = list(list(discretize, restore_levels), discretized_vars)
  ))

  # Here, we have requested to munge the raw_data by dropping useless variables,
  # including the dependent variable dep_var after model training,
  # imputing a static list of imputed_vars, and discretizing a static list
  # of discretized_vars, being careful to use separate logic when merely
  # using the computed discretization cuts to bin the numeric features into
  # categorical features. The end result is a munged_data set with an
  # attribute "mungepieces" that holds the list of mungepieces used for
  # munging the data, and can be used to perform the exact same set of
  # operations on a single row dataset coming through in a real-time production
  # system.
  munged_single_row_of_data <- munge(single_row_raw_data, munged_data)
})
# The munge function uses the attached "mungepieces" attribute, a list of
# trained mungepieces.