Summary and review

In this guide, we have worked our way through all the high-level features of the Syberia modeling engine. Hopefully you have a better idea of the tools the engine provides and are ready to do some modeling work. As the Syberia ecosystem matures, we hope to provide more guides for other yet-to-be-built engines that simplify the various domains of expertise covered by R and CRAN.

The modeling engine provides basic capabilities for the following simplified workflow.

  • Importing data sets from arbitrary sources.
  • Cleaning the data by picking the relevant parameters for mungebits.
  • Applying a statistical method to the cleaned data.
  • Exporting the model object to a persistent storage layer for later use.

Note that validation and cross-validation are not built into this workflow but can be incorporated through additional tools and functions. Each model and academic or business question may require its own validation strategy, so simple tools like AUC and classification error may not be sufficient. However, the final model object should be sufficient for validating on out-of-sample data and can be used to tune the parameters laid out earlier in the process.

The modeling engine

Working through the example modeling project, we learned that the modeling engine provides the following abstractions.

  • The lib/adapters directory for importing and exporting R objects such as data and model objects. Files here should contain read and write variables that provide the functions for reading and writing R objects in the desired format.
  • The lib/mungebits directory for defining feature engineering templates. These files should have a train and predict variable giving the function that must be used during training versus validation and production, respectively.
  • The lib/stages directory for specifying custom stages. By default, the modeling engine comes bundled with the import, data, model, and export stages. Typically you won't need to touch this directory until your modeling projects require more advanced functionality.
  • The lib/classifiers directory for specifying statistical methods on cleaned data sets. Files here come with a train and predict variable just like in mungebits, except that the arguments have different meanings.
  • The models directory ties everything together. By loose convention, models in development should be placed in models/dev and models in production (or finalized for other business uses or publication) should be placed in models/prod.

    Files in the models directory follow the grammar specified throughout this guide. The last expression in each model file should be a static list object with keys c("import", "data", "model", "export") outlining the hyper-parameters for the end-to-end modeling process. Calling project$resource on a model resource produces a stagerunner R6 object.
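As a minimal sketch of the shape described above, a model file might end in a list like the following. The adapter name R, the classifier name "lm", the object names, and the stand-in imputer are assumptions made for illustration; they are not values taken from the example project. In a real model file the list would be the last expression rather than being assigned to a variable.

```r
# Hypothetical model file, e.g. models/dev/example.R.
# Stand-in so this sketch evaluates outside a Syberia project; in a real
# project the modeling engine resolves mungebits from lib/mungebits.
imputer <- function(...) NULL

model_spec <- list(
  import = list(R = "iris"),             # assumed adapter name and source object
  data   = list(
    "Impute the age variable" = list(imputer, "age")
  ),
  model  = list("lm"),                   # assumed classifier name
  export = list(R = "example_model")     # assumed adapter name and target object
)

# The four top-level keys match the stages bundled with the modeling engine.
stopifnot(identical(names(model_spec), c("import", "data", "model", "export")))
```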

Mastery of the modeling engine should give you the ability to produce production-ready solutions to arbitrary supervised learning questions on medium-sized data sets. In the future, the modeling engine may do more.

The base engine

In every Syberia project and engine we find a config/engines.R file. This file specifies which engines are mounted by the project. Most engines should make use of the base engine, which defines the role of the routes file and the lib/controllers directory. To summarize:

  • The config/routes.R file is a list whose keys are directory names and whose values are strings defining which file in "lib/controllers" specifies the DSL (domain-specific language) for that directory. This allows us to have different definitions for adapters, mungebits, stages, et cetera.
  • The lib/controllers directory allows us to specify custom controllers if we uncover new abstractions that apply to our project. We can also override the modeling engine's default controllers if we wish to add features or customizability.
  • The test controller specifies how to parse test files. By default, tests reside in the test directory and have a resource helper injected that builds an example of the tested resource.
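To make the routes idea concrete, here is a sketch of what such a list could look like, together with a toy lookup. The specific entries are assumptions, and controller_for is a hypothetical helper written for this illustration; it is not how the base engine actually resolves routes.

```r
# Hypothetical config/routes.R content: directory name -> name of the
# controller file in lib/controllers. The entries are assumptions.
routes <- list(
  "lib/adapters"    = "adapters",
  "lib/mungebits"   = "mungebits",
  "lib/stages"      = "stages",
  "lib/classifiers" = "classifiers",
  "models"          = "models"
)

# Toy lookup: find the controller whose directory prefixes a resource path.
controller_for <- function(path, routes) {
  hits <- names(routes)[startsWith(path, paste0(names(routes), "/"))]
  if (length(hits) == 0L) return(NULL)
  # Prefer the longest (most specific) matching directory.
  routes[[hits[which.max(nchar(hits))]]]
}

controller_for("lib/mungebits/imputer", routes)
```

With a mapping like this, each directory can be parsed by its own controller, which is what lets adapters, mungebits, and stages follow different conventions.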

The modeling grammar

As previously mentioned, model files live in the models directory and can be executed using the run helper. Resourcing a model file constructs a stagerunner, which can be used to replay portions of the modeling process for examination and debugging.

Mastering the shortcuts in the run helper leads to faster iteration workflows. For example, we can re-run everything up to the end of the data stage using run(, to = "data") and just one munge step using, e.g., run(, "2/2"). The first argument is left blank to indicate we are continuing to iterate on the last-executed model file.
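To illustrate the idea of replaying a model up to a given stage, here is a toy sketch in base R. It does not use the stagerunner package or the run helper; stages, run_to, and ctx are hypothetical names, and the data is made up for the example.

```r
# Toy illustration only: NOT the stagerunner package API.
# Each stage is a function that mutates a shared context environment.
stages <- list(
  import = function(ctx) {
    ctx$data <- data.frame(x = c(1, NA, 3), y = c(2, 4, 6))
  },
  data = function(ctx) {
    # Fill missing x values with the mean of the observed ones.
    ctx$data$x[is.na(ctx$data$x)] <- mean(ctx$data$x, na.rm = TRUE)
  },
  model = function(ctx) {
    ctx$model <- lm(y ~ x, data = ctx$data)
  },
  export = function(ctx) {
    ctx$exported <- TRUE
  }
)

# Run stages in order, stopping after the stage named in `to`.
run_to <- function(stages, ctx, to = names(stages)[length(stages)]) {
  last <- match(to, names(stages))
  if (is.na(last)) stop("Unknown stage: ", to)
  for (name in names(stages)[seq_len(last)]) stages[[name]](ctx)
  invisible(ctx)
}

ctx <- new.env()
run_to(stages, ctx, to = "data")  # import and data run; model and export do not
```

Stopping after the data stage leaves the cleaned data in ctx for inspection before any model is fit, which is the kind of iteration the run shortcuts are meant to speed up.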

Mungebits

The modeling engine comes with a compact grammar for defining in-memory feature engineering pipelines. Under the hood, this simply calls out to the mungebits2::munge function. Using one line per munge step with a descriptive label encourages us to move most of the logic to helpers or intelligently-parametrized mungebit files.

    
list(data = list(
  "Impute the age variable" =
    list(imputer, "age"),
  "Impute all numeric variables" =
    list(imputer, is.numeric),
  "Create a dependent variable during training" =
    list(renamer ~ NULL, "Sepal.Length", "dep_var"),
  "Remove the dependent variable during prediction" =
    list(NULL ~ drop_variables, "dep_var"),
  "Drop different variable in train vs predict" =
    list(
      train   = list(drop_variables, c("one")),
      predict = list(drop_variables, c("one", "dep_var"))
    ),
  "Impute all numeric vars but the dep_var" =
    list(imputer, list(is.numeric, except("dep_var"))),
  "Apply the as.numeric function to each character column" =
    list(column_transformation(as.numeric), is.character),
  "An alias for the same operation as above" =
    list(!as.numeric, is.character),
  "Apply as.factor during train but not during predict" =
    list(!as.factor ~ NULL, is.character)
))
    
  

The above is not a realistic munge procedure; it simply collects examples of how to use the modeling grammar to define data frame transformations.
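For intuition about why mungebits distinguish train from predict, here is a toy imputer in base R. It is not the mungebits2 API; make_imputer and its internal state environment are hypothetical names. The point is only that a value learned during training is stored and then reused at prediction time.

```r
# Toy illustration of the train/predict split; not the mungebits2 API.
make_imputer <- function() {
  state <- new.env()  # holds whatever was learned during training
  list(
    train = function(df, col) {
      state$mean <- mean(df[[col]], na.rm = TRUE)  # learn from training data
      df[[col]][is.na(df[[col]])] <- state$mean
      df
    },
    predict = function(df, col) {
      df[[col]][is.na(df[[col]])] <- state$mean    # reuse the stored value
      df
    }
  )
}

imputer  <- make_imputer()
train_df <- imputer$train(data.frame(age = c(20, NA, 40)), "age")
new_df   <- imputer$predict(data.frame(age = c(NA, 65)), "age")
```

Here the missing age in new_df is filled with the mean learned from the training data rather than a mean recomputed on the new data, so training and production see the same transformation.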