Ensure you have set up Syberia prior to proceeding.
Syberia is a meta-framework: it allows you to build engines that form the basis of your R projects. If your project consists primarily of machine learning models you intend to push to a production system, use the modeling engine.
If it consists primarily of visualization dashboards, you will learn how to modularize your code to make the process of building dashboards faster and then pull it out into a dashboard engine for future projects. Almost any project has abstractions lurking around the corner that will help you with similar future projects.
Throughout this tutorial, we will focus our efforts on the modeling engine.
Recall our command from earlier.
This command executed the model file "models/dev/example1":
```r
list(
  import = list(R = "iris"),
  data   = list(
    "Create dep var" = list(renamer, c("Sepal.Length" = "dep_var")),
    "Create ID var"  = list(multi_column_transformation(seq_along), "dep_var", "id")
  ),
  model  = list("lm", .id_var = "id"),
  export = list(R = "model")
)
```
Let's briefly walk through what happened:
The `run` function found the closest matching model, in this case "models/dev/example1", using fuzzy matching (so `example` really means the regular expression `.*e.*x.*a.*m.*p.*l.*e.*`, that is, any instance of an "e" followed by some characters, then an "x" followed by some characters, then an "a", and so on).
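As a rough illustration of the idea (this is a sketch, not Syberia's actual implementation, and `fuzzy_pattern` is a made-up helper name), the key-to-regex transformation could look like:

```r
# Sketch of fuzzy matching: interleave ".*" between the characters of
# the search key. `fuzzy_pattern` is a hypothetical name for illustration.
fuzzy_pattern <- function(key) {
  paste0(".*", paste(strsplit(key, "")[[1]], collapse = ".*"), ".*")
}

fuzzy_pattern("example")
# [1] ".*e.*x.*a.*m.*p.*l.*e.*"
grepl(fuzzy_pattern("example"), "models/dev/example1")
# [1] TRUE
```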
The `syberia_engine` object "compiled" the resource. Note the local variable `renamer` in the model file: it is not actually defined anywhere! As we'll see later, the models controller has injected it for us. It is possible to define and parametrize new feature engineering on the fly, and the models controller will automatically inject your helpers. We'll go into that and more on controllers down the road.
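To make the injected helper feel less magical, here is a minimal stand-in for what a column renamer does, assuming (as the model file's `c("Sepal.Length" = "dep_var")` argument suggests) that it maps old column names to new ones. This is a sketch, not the engine's actual code, and `rename_columns` is a hypothetical name:

```r
# Hypothetical sketch of a renamer: rename columns according to a named
# vector of the form c(old_name = new_name). Not Syberia's implementation.
rename_columns <- function(data, replacements) {
  idx <- match(names(replacements), names(data))
  names(data)[idx] <- unname(replacements)
  data
}

names(rename_columns(iris, c("Sepal.Length" = "dep_var")))[1]
# [1] "dep_var"
```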
If the second stage had crashed because of a typo in `renamer`, we could halt execution, fix the typo, and type `run(, "2/1")`, meaning "re-run the first substage of the second stage of the last executed model." Then we could finish the model using `run(, "2/2", "export")` ("run from 'Create ID var' to 'export'") without losing our work or having to start from scratch with the raw data. More on this later.
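Put together, the interactive recovery workflow described above looks like the following session sketch (it assumes a Syberia console with the example model available):

```r
run("example1")          # run the full model (fuzzy-matched to models/dev/example1)
# ...the second stage crashes because of a typo...
run(, "2/1")             # re-run the first substage of the second (data) stage
run(, "2/2", "export")   # resume from "Create ID var" through the export stage
```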
Under the hood, the models controller parsed the model file into a stagerunner and called its `$run` method to execute the stages we asked for. The result of calling the method is a list of two environments, `before` and `after`, one representing what existed prior to running the step and the other what the result was afterward. Go ahead and try comparing `ls(run(, "2/1")$before$data)` with `ls(run(, "2/1")$after$data)`. Do you notice the column names changing? The modeling engine mounts shortcuts for inspecting the data before and after the last run, so this is equivalent to `ls(B)` and `ls(A)`, respectively. Try it! Having easy interactive access to the data before and after munging is intended to speed up feature-engineering inspection and debugging, typically a cumbersome component of data science.
The four stages, `c("import", "data", "model", "export")`, were stitched together by the models controller. The modeling engine comes with those four pre-built stages, but it is trivial to extend them or define new ones if you're doing visualization, report generation, more ambitious ensemble or stacking pipelines, or any other tasks. Under the hood, think of the stagerunner as executing a chain of functions operating on a context, a designated R environment object holding all the objects necessary to complete the modeling process. Later, it will be possible to generalize the interactive Syberia approach to arbitrary DAG (directed acyclic graph) runners.
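The "chain of functions operating on a context" idea can be sketched in a few lines of base R. This is a toy illustration of the concept, not the actual stagerunner package, and `run_stages` is a hypothetical name:

```r
# Toy sketch of the stagerunner idea: run a named chain of functions,
# each of which reads and mutates a shared environment (the "context").
run_stages <- function(stages, context = new.env()) {
  for (stage in stages) stage(context)
  context
}

context <- run_stages(list(
  import = function(ctx) ctx$data <- iris,
  data   = function(ctx) names(ctx$data)[1] <- "dep_var",
  model  = function(ctx) ctx$model <- lm(dep_var ~ Sepal.Width, data = ctx$data)
))
names(context$data)[1]
# [1] "dep_var"
```

Because each stage sees the same environment, re-running a single stage against a saved copy of the context is what makes the `run(, "2/1")`-style replays possible.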
The line `export = list(R = "model")` says "export the model to this R session" under the global variable `model`. Both the import and export stages are fully customizable if we want to read the data from, or write the model to, a variety of backends.
After running the model (i.e., the full modeling pipeline) we have a ready-to-deploy model object. We call this object deployable because it has everything it needs to replicate both the feature engineering pipeline and the final score prediction on a single row of streaming data, and it is trivial to serialize.
If the model was predicting the probability that a user will click on a link from a very raw, unclean dataset, we could send the raw one-row dataset for a new user to a simple server that called the model object's predict method and returned the final score. The model object takes care of reproducing the entire data pipeline on the single row of data through mungebits.
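As an entirely hypothetical sketch of that serving step, assuming the deployed object exposes a predict method that accepts a one-row data.frame (the actual method name and signature depend on the modeling engine's API):

```r
# Hypothetical sketch: score one raw row with a deployed model object.
# `model$predict` is an assumed interface, not confirmed Syberia API.
score_one <- function(model, raw_row) {
  stopifnot(is.data.frame(raw_row), nrow(raw_row) == 1)
  model$predict(raw_row)  # the object replays its munging, then predicts
}
```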
The end goal of the Syberia modeling engine is to make the process of building any model, no matter how complex, stacked, or multi-staged, a rather trivial combination of pre-built, modular, and testable units that yields an object capable of executing the entire streaming prediction, from assembling raw data sets to producing final scores: we want to see CRAN come alive.