Usage

Ensure you have set up Syberia prior to proceeding.

Syberia is a meta-framework: it allows you to build engines which form the basis of your R projects. If your work consists primarily of machine learning models you intend to push to a production system, use the modeling engine.

If it consists primarily of visualization dashboards, you will learn how to modularize your code so that building dashboards becomes faster, and then extract that work into a dashboard engine for future projects. Almost any project has abstractions lurking around the corner that will help with similar projects later.

Throughout this tutorial, we will focus our efforts on the modeling engine.

Running a model

Recall our command from earlier.

    run("example")

This command executed the model file "models/dev/example1":

    list(
      import = list(R = "iris"),
      data   = list(
        "Create dep var" = list(renamer, c("Sepal.Length" = "dep_var")),
        "Create ID var"  = list(multi_column_transformation(seq_along), "dep_var", "id")
      ),
      model  = list("lm", .id_var = "id"),
      export = list(R = "model")
    )

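For intuition only, the two data steps above amount to roughly the following base R transformations on the imported iris data frame (a conceptual sketch, not what the injected renamer and multi_column_transformation helpers literally do):

    # Hypothetical base R equivalent of the data stage above.
    df <- iris
    names(df)[names(df) == "Sepal.Length"] <- "dep_var"  # "Create dep var"
    df$id <- seq_along(df$dep_var)                       # "Create ID var"
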
Let's briefly traverse through what happened:

  1. When the modeling engine loaded, it attached some global helpers through the boot file to make it easier to experiment in the console. One of these was run.
  2. The run function found the closest matching model, in this case "models/dev/example1", using fuzzy matching (so example really means the regular expression .*e.*x.*a.*m.*p.*l.*e.*, that is, any name containing an "e", followed by some characters, then an "x", then some characters, then an "a", and so on); the first sketch after this list makes this concrete.
  3. The underlying syberia_engine object "compiled" the resource. Note the local variable renamer in the model file: it is not actually defined anywhere! As we'll see later, the models controller has injected it for us. It is possible to define and parametrize new feature engineering on the fly, and the models controller will automatically inject your helpers. We'll go into that and more on controllers down the road.
  4. In this case, the final "compiled" resource is a stagerunner, an object capable of tracking our progress in the modeling pipeline from start to finish. If, during later munging or model training, we noticed a typo in the parameters to renamer, we could halt execution, fix the typo, and type run(, "2/1"), meaning "re-run the first substage of the second stage of the last executed model." Then we could finish the model using run(, "2/2", "export") ("run from 'Create ID var' to 'export'") without losing our work or having to start from scratch with the raw data. More on this later; the second sketch after this list walks through this workflow.
  5. The compiled stagerunner uses a method called $run to execute the stages we ask for. The result of calling the method is a list of two environments, before and after, one representing what existed prior to running the step and the other representing the result afterward. Go ahead and try comparing ls(run(, "2/1")$before$data) and ls(run(, "2/1")$after$data). Do you notice the column names changing? The modeling engine mounts shortcuts for inspecting data before and after the last run, so this is equivalent to ls(B) and ls(A), respectively. Try it! Having easy interactive access to the data before and after munging is intended to speed up feature engineering inspection and debugging, typically a cumbersome part of data science.
  6. The keys of the list, c("import", "data", "model", "export"), were stitched together into stages by the models controller. The modeling engine comes with those four pre-built stages, but it is trivial to extend them or define more if you're doing visualization, report generation, more elaborate ensemble or stacking pipelines, or any other task. Under the hood, think of the stagerunner as executing a chain of functions operating on a context, a designated R environment object that holds all the objects necessary to complete the modeling process. Later, it will be possible to generalize the interactive Syberia approach to arbitrary DAG (directed acyclic graph) runners.
  7. After the run has completed, the last stage export = list(R = "model") says "export the model to this R session" under the global variable model. Both the import and export stages are fully customizable if we want to read the data from or write the model to a variety of backends.
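
To make the fuzzy matching in step 2 concrete, the lookup behaves roughly as if the generated regular expression were matched against the available model paths (a plain base R illustration, not the engine's actual implementation):

    # "example" expands to a pattern that matches its letters in order.
    grepl(".*e.*x.*a.*m.*p.*l.*e.*", "models/dev/example1")
    #> [1] TRUE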

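The interactive workflow described in steps 4 and 5 might look like the following sketch (the stage indices refer to the example model above, and B and A are the shortcuts mounted by the modeling engine):

    # Re-run only the "Create dep var" substage after fixing a typo.
    out <- run(, "2/1")

    # Compare the data before and after that munging step.
    ls(out$before$data)   # column names prior to renaming
    ls(out$after$data)    # column names after renaming
    ls(B); ls(A)          # equivalent shortcuts for the last run

    # Finish the pipeline from "Create ID var" through export
    # without re-importing or re-munging the raw data.
    run(, "2/2", "export")
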
After running the model (i.e., the full modeling pipeline) we have a ready-to-deploy model object. We call this object deployable because it has everything it needs to replicate both the feature engineering pipeline and the final score prediction on a single row of streaming data, and it is trivial to serialize.

If the model predicted the probability that a user will click on a link from a very raw, unclean dataset, we could send the raw one-row dataset for a new user to a simple server that calls model$predict(one_row) and returns the final score. The model object takes care of reproducing the entire data pipeline on the single row of data through mungebits.
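
A minimal sketch of that setup, assuming the exported model lives in the global variable model and one_row is a raw single-row data frame shaped like the training data (the file name here is illustrative):

    # Serialize the deployable model object for a serving process.
    saveRDS(model, "model.rds")

    # In the serving process: load the model and score one raw row.
    model <- readRDS("model.rds")
    score <- model$predict(one_row)  # replays the munging pipeline, then predicts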

The end goal of the Syberia modeling engine is to make building any model, no matter how complex, stacked, or multi-staged, a rather trivial combination of pre-built, modular, and testable units that yields an object capable of executing the entire streaming prediction, from assembling raw data sets to producing final scores: we want to see CRAN come alive.

Next Steps