A mungebit defines an atomic data transformation of an arbitrary data set. To specify the parameters that are relevant for a particular data set (such as restricting its effect to specific columns, fixing certain parameters such as the imputation method, or providing information that encodes domain knowledge), one uses a mungepiece. A mungepiece defines a domain-specific atomic transformation of a data set.
mungepiece_initialize(mungebit = NULL, train_args = list(), predict_args = train_args)
mungebit | mungebit. A mungebit |
---|---|
train_args | list. Arguments to pass to the mungebit when it is run for the first time, i.e., on a training set that will be fed to a predictive model and may be quite large. These arguments are passed directly to the mungebit's train_function. For example, if the modeler knows certain columns do not contain missing values, they might pass a character vector of column names to an imputation mungebit so that it avoids attempting to impute the columns guaranteed to be fully present. Doing this heuristically might require an unnecessary pass over the data, potentially expensive if the data consists of thousands of features; domain-specific knowledge might be used to pinpoint the few features that require imputation. |
predict_args | list. Arguments to pass to the mungebit when it is run for the second or subsequent time, i.e., at prediction time. Usually, the prediction arguments will be the same as the training arguments for the mungepiece. |
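The way the two argument lists are dispatched can be illustrated with a small base R sketch. It is not the mungebits2 implementation: `toy_piece`, `scale_cols`, and the `by` parameter are hypothetical names introduced only for illustration. The sketch assumes the first call counts as training and every later call as prediction, and that `predict_args` falls back to `train_args` when omitted, mirroring the signature above.

```r
# Hypothetical stand-in for a mungepiece (not the mungebits2 implementation).
toy_piece <- function(train_fn, predict_fn,
                      train_args = list(), predict_args = train_args) {
  force(train_args)
  force(predict_args)
  trained <- FALSE
  function(data) {
    if (!trained) {
      # First run: treat the data as the training set.
      trained <<- TRUE
      do.call(train_fn, c(list(data), train_args))
    } else {
      # Second or subsequent run: treat the data as a prediction set.
      do.call(predict_fn, c(list(data), predict_args))
    }
  }
}

# Hypothetical transformation: multiply the chosen columns by a constant.
scale_cols <- function(data, cols, by = 2) {
  data[cols] <- data[cols] * by
  data
}

piece  <- toy_piece(scale_cols, scale_cols, train_args = list("Sepal.Length"))
first  <- piece(iris)  # training run, uses train_args
second <- piece(iris)  # prediction run, uses predict_args (same as train_args here)
```

Because `predict_args` defaults to `train_args`, both runs in this sketch restrict the transformation to the same column.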
A mungepiece is defined by the collection of three components: a mungebit, training arguments, and prediction arguments.
A mungebit. The mungebit determines the qualitative nature
of the data transformation. The mungebit may represent
a discretization method, principal component analysis,
replacement of outliers or special values, and so on.
If a training set represents automobile data and there are
variables like "weight" or "make," these variables should not be
hardcoded in the mungebit's train and predict functions.
The mungebit should only represent the abstract
mathematical operation performed on the data set.
Training arguments. While the mungebit represents the code
necessary for performing some abstract mathematical operation
on the data set, the training arguments record the metadata
necessary to perform the operation on a particular
data set.
For example, if we have an automobile data set and know the
"weight" column has some missing values, we might pass a vector
of column names that includes "weight" to an imputation mungebit
and create an imputation-for-this-automobile-data mungepiece.
If we have a medical data set that includes special patient type
codes and some of the codes were mistyped during data entry or
are synonyms for the same underlying "type," we could pass a list
of character vectors to a "grouper" mungebit that would condense
the categorical feature by grouping like types.
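As a rough illustration of the grouper idea, the following base R sketch condenses a categorical vector. The function `group_levels`, the example codes, and the choice of a named list (where each name is the label that replaces its synonyms) are assumptions made for this example, not part of the package.

```r
# Hypothetical grouper: `groups` is a named list of character vectors.
# Every value found in groups[[label]] is replaced by `label`.
group_levels <- function(x, groups) {
  x <- as.character(x)
  for (label in names(groups)) {
    x[x %in% groups[[label]]] <- label
  }
  factor(x)
}

# Made-up patient type codes with mistyped or synonymous entries.
codes <- c("inpatient", "in-patient", "INPT", "outpatient", "outpt")
groups <- list(
  inpatient  = c("in-patient", "INPT"),
  outpatient = c("outpt")
)
grouped <- group_levels(codes, groups)
```

Here the list of synonyms is the domain knowledge; the grouping function itself knows nothing about patient types.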
If we know that some set of variables is predictive for modeling a
particular statistical question but are unsure about the remaining
variables, we could use this intuition to pass the list of known
variables as exceptions to a "sure independence screening" mungebit.
The mungebit would run a univariate regression against each variable
not contained in the exceptions list and drop those totally uncorrelated
with the dependent variable. This is a typical technique for
reducing the dimensionality of high-dimensional data. Knowledge of
the exceptions reduces the computation time necessary for determining
which variables are nonpredictive, an operation that may be very
computationally expensive.
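A minimal sketch of such a screening step, in base R, might look as follows. It is only an approximation of the idea: `screen_variables`, the `alpha` threshold, and the use of the slope's p-value as the "uncorrelated" criterion are assumptions of this example, and it assumes numeric, non-constant candidate columns.

```r
# Hypothetical screening step: run a univariate regression of the response on
# each candidate variable and keep only those with a significant slope.
# Variables listed in `exceptions` are kept without being tested.
screen_variables <- function(data, response, exceptions = character(0),
                             alpha = 0.05) {
  candidates <- setdiff(names(data), c(response, exceptions))
  keep <- vapply(candidates, function(var) {
    fit <- lm(data[[response]] ~ data[[var]])
    p_value <- summary(fit)$coefficients[2, 4]  # p-value of the slope
    p_value < alpha
  }, logical(1))
  data[c(response, exceptions, candidates[keep])]
}

# Small deterministic data set: y depends on x; z and w are uncorrelated with y.
x <- 1:8
e <- 0.5 * c(1, 1, -1, -1, 1, 1, -1, -1)
d <- data.frame(
  y = 2 * x + e,
  x = x,
  z = c(1, -1, -1, 1, 1, -1, -1, 1),
  w = c(1, 1, -1, -1, -1, -1, 1, 1)
)

# `w` is passed as an exception, so no regression is run for it.
screened <- screen_variables(d, "y", exceptions = "w")
```

In this toy data the screened result keeps `y`, the exception `w`, and the predictive `x`, while `z` is dropped; the saving from the exceptions list is one regression per excepted variable.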
In short, the mungebit records what we are doing to the data set
from an abstract level and does not contain any domain knowledge.
The training arguments, the arguments passed to the mungebit's
train_function, record the details that pinpoint the
abstract transformation to a particular training set intended for
use with a predictive model.
Prediction arguments. It is important to understand the
train-predict dichotomy of the mungebit. If we are performing an
imputation, the mungebit will record the means computed from the
variables in the training set for the purpose of replacing NA
values. The training arguments might be used for specifying the columns
to which the imputation should be restricted.
The prediction arguments, by default the same as the training arguments,
are metadata passed to the mungebit's predict_function, such as
again the variables the imputation applies to. Sometimes the prediction
arguments may differ slightly from the training arguments, such as when
handling the dependent variable (which will not be present during
prediction) or when the code used for prediction needs some further
parametrization to replicate the behavior of the train_function
on one-row data sets (i.e., real-time points in a production setting).
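The train-predict dichotomy for imputation can be sketched in base R with a closure that stores the training means. This is an illustrative approximation rather than the package's own code: `make_mean_imputer`, `train_fn`, `predict_fn`, and the `autos` data frame are hypothetical names, and the `cols` argument plays the role of the training and prediction arguments.

```r
# Hypothetical mean imputer: training records column means, prediction reuses them.
make_mean_imputer <- function() {
  means <- NULL  # state learned from the training set

  predict_fn <- function(data, cols) {
    for (col in cols) {
      missing <- is.na(data[[col]])
      data[[col]][missing] <- means[[col]]
    }
    data
  }

  train_fn <- function(data, cols) {
    # Record the means of the selected columns, then impute the training set.
    means <<- vapply(data[cols], mean, numeric(1), na.rm = TRUE)
    predict_fn(data, cols)
  }

  list(train = train_fn, predict = predict_fn)
}

autos <- data.frame(weight = c(1000, NA, 2000), make = c("a", "b", "c"))
imputer <- make_mean_imputer()

# Training run: the argument restricts imputation to the "weight" column.
trained <- imputer$train(autos, "weight")

# Prediction run on a one-row data set: the stored training mean is reused,
# since a mean cannot be computed from a single missing value.
new_row <- data.frame(weight = NA_real_, make = "d")
scored <- imputer$predict(new_row, "weight")
```

The one-row case shows why the prediction step must rely on what was recorded during training instead of recomputing statistics from the incoming data.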
In short, mungepieces parametrize a single transformation of a data set for that particular data set. While a mungebit is abstract and domain-independent and may represent computations like imputation, outlier detection, and dimensionality reduction, a mungepiece records the human touch and domain knowledge that is necessary for ensuring the mungebits operate on the appropriate features and optimize space-time tradeoffs (for example, the modeler may know that certain columns do not contain missing values and avoid passing them to the imputation mungebit).
Informally speaking, you can think of a mungebit as the raw mold for a transformation and a mungepiece as the cemented product constructed from the mold that is specific to a particular data set. A mungepiece affixes a mungebit so it works on a specific data set, and domain knowledge may be necessary to construct the mungepiece optimally.
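not_run({
  # A mungebit that doubles the values of whichever columns it is given.
  doubler <- mungebit$new(column_transformation(function(x) x * 2))
  cols <- c("Sepal.Length", "Petal.Length")
  # The mungepiece fixes the columns the doubler applies to.
  mp <- mungepiece$new(doubler, list(cols))
  iris2 <- mp$run(iris)
  stopifnot(identical(iris2[cols], 2 * iris[cols]))
})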