In this guide, we have worked our way through all the high-level features of the Syberia modeling engine. Hopefully you have a better idea of the tools the engine provides and are ready to do some modeling work. As the Syberia ecosystem matures, we hope to provide more guides for other yet-to-be-built engines that simplify the various domains of expertise covered by R and CRAN.
The modeling engine provides basic capabilities for the following simplified workflow.
Note that validation and cross-validation are not built into this workflow but can be incorporated through additional tools and functions. Each model and academic or business question may require its own validation strategy, so simple tools like AUC and classification error may not be sufficient. However, the final model object should be sufficient for validating on out-of-sample data and can be used to tune the parameters laid out earlier in the process.
Armed with the example modeling project, we learned that the modeling engine gives us several abstractions:
- write and read variables providing functions for reading and writing R objects in the desired format.
- train and predict variables giving the functions that must be used during training versus validation and production, respectively.
- train and predict variables just like in mungebits, except that the arguments have different meanings.
The models directory ties everything together. By loose convention, models in development should be placed in models/dev and models in production (or finalized for other business uses or publication) should be placed in models/prod.
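
To make the train versus predict distinction concrete, here is a minimal sketch in plain R of a mean imputer that learns a value during training and reuses it during prediction. The names make_imputer and imputer_sketch are hypothetical, and the code deliberately avoids the mungebits2 API; it only illustrates the idea, not how the engine implements it.

```r
# Toy mean imputer with separate train and predict functions.
# All names are illustrative; this is not the mungebits2 API.
make_imputer <- function() {
  learned_mean <- NULL
  list(
    train = function(x) {
      # Learn the replacement value from the training data and store it.
      learned_mean <<- mean(x, na.rm = TRUE)
      x[is.na(x)] <- learned_mean
      x
    },
    predict = function(x) {
      # Reuse the stored value; nothing is recomputed from new data.
      x[is.na(x)] <- learned_mean
      x
    }
  )
}

imputer_sketch <- make_imputer()
train_age   <- imputer_sketch$train(c(20, NA, 40))
predict_age <- imputer_sketch$predict(c(NA, 55))
```

The point of the split is that the prediction-time function never looks at the statistics of the new data, which keeps production behavior consistent with what the model saw during training.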
Files in the models directory follow the grammar specified throughout this
guide. The last expression in each model file should be a static list object
with keys c("import", "data", "model", "export")
outlining the hyper-parameters for the end-to-end modeling process. Calling
project$resource
on a model resource produces
a stagerunner R6 object.
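
As a rough sketch of the shape described above, a model file ends in a list with the four keys. The values below (file names, the "lm" label) are placeholders invented for illustration and are not taken from the guide; in an actual model file the list would be the last expression rather than being assigned to a variable.

```r
# Hypothetical model specification; every value is a placeholder.
model_spec <- list(
  import = list(file = "example.csv"),        # where the data comes from (assumed)
  data   = list(),                            # munge steps would go here
  model  = list("lm"),                        # classifier and its parameters (assumed)
  export = list(file = "example_model.rds")   # where the result is stored (assumed)
)

spec_keys <- names(model_spec)
```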
Mastery of the modeling engine should give you the ability to produce production-ready solutions to arbitrary supervised learning questions on medium-sized data sets. In the future, the modeling engine may do more.
In every Syberia project and engine we find a config/engines.R
file. This file specifies which engines are mounted by the project. Most
engines should make use of the base engine, which defines
the role of the routes file and the lib/controllers directory. To summarize:
- Tests live in the test directory and have a resource helper injected that builds an example of the tested resource.
As previously mentioned, model files live in the models directory and can be executed using the run helper. Resourcing a model file constructs a stagerunner, which can be used to replay portions of the modeling process for examination and debugging.
Mastering the shortcuts in the run helper leads to faster iteration workflows. For example, we can re-run everything up to the end of the data stage using run(, to = "data") and just one munge step using, e.g., run(, "2/2"). The first argument is left blank to indicate we are continuing to iterate on the last-executed model file.
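
The blank first argument is ordinary R syntax: a function can detect an omitted argument with missing(). The toy stand-in below shows one way a helper could fall back to the last-executed model. The names run_sketch and run_state, and the model name "dev/example", are hypothetical; this is not Syberia's implementation of run.

```r
# Toy stand-in for a run-style helper; NOT Syberia's implementation.
run_state <- new.env()

run_sketch <- function(model, to = NULL) {
  if (missing(model)) {
    # Blank first argument: continue with the last-executed model.
    if (is.null(run_state$last)) stop("No model has been run yet.")
    model <- run_state$last
  }
  run_state$last <- model

  stages <- c("import", "data", "model", "export")
  if (!is.null(to)) {
    # Keep every stage up to and including the requested one.
    stages <- stages[seq_len(match(to, stages))]
  }
  list(model = model, stages = stages)
}

first  <- run_sketch("dev/example")
second <- run_sketch(, to = "data")
```

Here the second call reuses "dev/example" and stops after the data stage, mirroring the iteration pattern described above.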
The modeling engine comes with a compact grammar for defining in-memory
feature engineering pipelines. Under the hood, this simply calls out to the
mungebits2::munge
function. Using one line per
munge step with a descriptive label encourages us to move most of the logic
to helpers or intelligently-parametrized mungebit files.
list(data = list(
  "Impute the age variable" =
    list(imputer, "age"),
  "Impute all numeric variables" =
    list(imputer, is.numeric),
  "Create a dependent variable during training" =
    list(renamer ~ NULL, "Sepal.Length", "dep_var"),
  "Remove the dependent variable during prediction" =
    list(NULL ~ drop_variables, "dep_var"),
  "Drop different variable in train vs predict" =
    list(
      train   = list(drop_variables, c("one")),
      predict = list(drop_variables, c("one", "dep_var"))
    ),
  "Impute all numeric vars but the dep_var" =
    list(imputer, list(is.numeric, except("dep_var"))),
  "Apply the as.numeric function to each character column" =
    list(column_transformation(as.numeric), is.character),
  "An alias for the same operation as above" =
    list(!as.numeric, is.character),
  "Apply as.factor during train but not during predict" =
    list(!as.factor ~ NULL, is.character)
))
The above does not define any sort of real munge procedure but curates some examples of how to use the modeling grammar to define dataframe transformations.
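
The train ~ predict notation in these examples relies on the fact that a two-sided formula in R is an unevaluated expression with a left and a right side. The sketch below shows one way such a pair could be decoded in plain R. The helper split_train_predict and the function double_it are hypothetical, and this is not necessarily how mungebits2 handles the notation internally.

```r
# Illustrative decoder for a `train ~ predict` pair; an assumption-laden
# sketch, not the mungebits2 implementation.
split_train_predict <- function(step) {
  if (inherits(step, "formula") && length(step) == 3L) {
    env <- environment(step)
    # Left-hand side applies during training, right-hand side during prediction.
    list(train = eval(step[[2]], env), predict = eval(step[[3]], env))
  } else {
    # A plain function is used in both phases.
    list(train = step, predict = step)
  }
}

double_it <- function(x) x * 2

train_only   <- split_train_predict(double_it ~ NULL)
predict_only <- split_train_predict(NULL ~ double_it)
both         <- split_train_predict(double_it)
```

A NULL on either side then reads naturally as "do nothing in this phase", which matches the labels used in the munge examples.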