A mungebit defines an atomic data transformation of an arbitrary data set. In order to specify the parameters that may be relevant for a particular data set (such as restricting its effect to specific columns, fixing certain parameters such as the imputation method, or providing information that may contain domain knowledge) one uses a mungepiece. A mungepiece defined a domain-specific atomic transformation of a data set.

mungepiece_initialize(mungebit = NULL, train_args = list(),
  predict_args = train_args)

Arguments

mungebit

mungebit. A mungebit mungebit representing an abstract transformation of a data set, such as type conversion, imputation, outlier detection, dimensionality reduction, value replacement, and so on.

train_args

list. Arguments to pass to the mungebit when it is run for the first time, i.e., on a training set that will be fed to a predictive model and may be quite large. These arguments, passed directly to the mungebit's train_function, should contain domain-specific metadata that is necessary to apply the mungebit to this specific data set.

For example, if the modeler knows certain columns do not contain missing values, they might pass a character vector of column names to an imputation mungebit that avoids attempting to impute the columns guaranteed to be fully present. Doing this heuristically might require an unnecessary pass over the data, potentially expensive if the data consists of thousands of features; domain-specific knowledge might be used to pinpoint the few features that require imputation.

predict_args

list. Arguments to pass to the mungebit when it is run for the second or subsequent time, i.e., on a prediction set that will usually be coming from model validation code or a real-time production environment. After the mungebit has been trained on the training set, it should be capable of predicting on single row data sets, i.e., new "points" coming through in a live production setting.

Usually, the prediction arguments will be the same as the training arguments for the mungepiece.

Details

A mungepiece is defined by the collection of

  1. A mungebit. The mungebit determines the qualitative nature of the data transformation. The mungebit may represent a discretization method, principal component analysis, replacement of outliers or special values, and so on. If a training set represents automobile data and there are variables like "weight" or "make," these variables should not be hardcoded in the mungebit's train and predict functions. The mungebit should only represent that abstract mathematical operation performed on the data set.

  2. Training arguments. While the mungebit represents the code necessary for performing some abstract mathematical operation on the data set, the training arguments record the metadata necessary to perform the operation on a particular data set. For example, if we have an automobile data set and know the "weight" column has some missing values, we might pass a vector of column names that includes "weight" to an imputation mungebit and create an imputation-for-this-automobile-data mungepiece. If we have a medical data set that includes special patient type codes and some of the codes were mistyped during data entry or are synonyms for the same underlying "type," we could pass a list of character vectors to a "grouper" mungebit that would condense the categorical feature by grouping like types. If we know that some set of variables is predictive for modeling a particular statistical question but are unsure about the remaining variables, we could use this intuition to pass the list of known variables as exceptions to a "sure independence screening" mungebit. The mungebit would run a univariate regression against each variable not contained in the exceptions list and drop those totally uncorrelated with the dependent variable. This is a typical technique for high dimensionality reduction. Knowledge of the exceptions would reduce the computation time necessary for recording which variables are nonpredictive, an operation that may be very computationally expensive. In short, the mungebit records what we are doing to the data set from an abstract level and does not contain any domain knowledge. The training arguments, the arguments passed to the mungebit's train_function, record the details that pinpoint the abstract transformation to a particular training set intended for use with a predictive model.

  3. Prediction arguments. It is important to understand the train-predict dichotomy of the mungebit. If we are performing an imputation, the mungebit will record the means computed from the variables in the training set for the purpose of replacing NA values. The training arguments might be used for specifying the columns to which the imputation should be restricted. The prediction arguments, by default the same as the training arguments, are metadata passed to the mungebit's predict_function, such as again the variables the imputation applies to. Sometimes the prediction arguments may differ slightly from the training arguments, such as when handling the dependent variable (which will not be present during prediction) or when the code used for prediction needs some further parametrization to replicate the behavior of the train_function on one-row data sets (i.e., real-time points in a production setting).

In short, mungepieces parametrize a single transformation of a data set for that particular data set. While a mungebit is abstract and domain-independent and may represent computations like imputation, outlier detection, and dimensionality reduction, a mungepiece records the human touch and domain knowledge that is necessary for ensuring the mungebits operate on the appropriate features and optimize space-time tradeoffs (for example, the modeler may know that certain columns do not contain missing values and avoid passing them to the imputation mungebit).

Informally speaking, you can think of a mungebit as the raw mold for a transformation and a mungepiece as the cemented product constructed from the mold that is specific to a particular data set. A mungepiece affixes a mungebit so it works on a specific data set, and domain knowledge may be necessary to construct the mungepiece optimally.

Examples

not_run({ doubler <- mungebit$new(column_transformation(function(x) x * 2)) cols <- c("Sepal.Length", "Petal.Length") mp <- mungepiece$new(doubler, list(cols)) iris2 <- mp$run(iris) stopifnot(identical(iris2[cols], 2 * iris[cols])) })