Ensure you have set up Syberia prior to proceeding.
Syberia is a meta-framework: it allows you to build engines which form the basis of your R projects. If your project consists primarily of machine learning models you intend to push to a production system, use the modeling engine.
If it consists primarily of visualization dashboards, you will learn how to modularize your code to make building dashboards faster, and then pull that code out into a dashboard engine for future projects. Almost any project has abstractions lurking around the corner that will help you with similar future projects.
Throughout this tutorial, we will focus our efforts on the modeling engine.
Recall our command from earlier:

```r
run("example")
```

This command executed the model file "models/dev/example1":
```r
list(
  import = list(R = "iris"),
  data = list(
    "Create dep var" = list(renamer, c("Sepal.Length" = "dep_var")),
    "Create ID var" = list(multi_column_transformation(seq_along), "dep_var", "id")
  ),
  model = list("lm", .id_var = "id"),
  export = list(R = "model")
)
```
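To make the two munging steps concrete, here is a plain base-R sketch of what they do to `iris` (illustrative only; in Syberia these are wrapped in reusable, testable mungebits rather than written inline):

```r
# Plain base-R equivalent of the "data" stage above (illustrative sketch).
data <- iris

# "Create dep var": rename Sepal.Length to dep_var.
names(data)[names(data) == "Sepal.Length"] <- "dep_var"

# "Create ID var": derive an id column from dep_var using seq_along.
data$id <- seq_along(data$dep_var)

head(data)
```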
Let's briefly traverse through what happened.

The `run` function found the closest matching model, in this case "models/dev/example1", using fuzzy matching (so `example` really means the regular expression `.*e.*x.*a.*m.*p.*l.*e.*`, that is, any instance of an "e" followed by some characters, then an "x" followed by some characters, then an "a", and so on). The `syberia_engine` object then "compiled" the resource.
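The fuzzy matcher can be sketched in a few lines of base R (a hypothetical helper, not Syberia's actual implementation):

```r
# Turn a query like "example" into the fuzzy regular expression described
# above: the query's characters, joined and surrounded by ".*".
fuzzy_pattern <- function(key) {
  paste0(".*", paste(strsplit(key, "")[[1]], collapse = ".*"), ".*")
}

fuzzy_pattern("example")
# [1] ".*e.*x.*a.*m.*p.*l.*e.*"
grepl(fuzzy_pattern("example"), "models/dev/example1")
# [1] TRUE
```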
Note the local variable `renamer` in the model file: it is not actually defined anywhere! As we'll see later, the models controller has injected it for us. It is possible to define and parametrize new feature engineering on the fly, and the models controller will automatically inject your helpers. We'll go into that and more on controllers down the road.
If we had made a typo, say misspelling `renamer`, we could halt execution, fix the typo, and type `run(, "2/1")`, meaning "re-run the first substage of the second stage of the last executed model." Then we could finish the model using `run(, "2/2", "export")` ("run from 'Create ID var' to 'export'") without losing our work or having to start from scratch with the raw data. More on this later.
Under the hood, `run` delegates to a stagerunner, calling its `$run` method to execute the stages we're asking for. The result of calling the method is a list of two environments, `before` and `after`: one representing what existed prior to running the step, the other what resulted afterward. Go ahead and try comparing `ls(run(, "2/1")$before$data)` and `ls(run(, "2/1")$after$data)`. Do you notice the column names changing? The modeling engine mounts shortcuts for inspecting the data before and after the last run, so this is equivalent to `ls(B)` and `ls(A)`, respectively. Try it! Having easy interactive access to the data before and after munging is intended to speed up the feature engineering inspection and debugging process, typically a cumbersome component of data science.
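The before/after mechanism can be imitated in plain R with a hypothetical helper that snapshots a context environment around a step (a sketch of the idea, not the modeling engine's internals):

```r
# Snapshot a context before and after running a step, mirroring the
# before/after pair of environments returned by run.
with_snapshots <- function(step, context) {
  before <- as.environment(as.list(context))  # copy of current bindings
  step(context)
  after <- as.environment(as.list(context))
  list(before = before, after = after)
}

ctx <- new.env()
ctx$data <- iris
res <- with_snapshots(function(ctx) {
  names(ctx$data)[names(ctx$data) == "Sepal.Length"] <- "dep_var"
}, ctx)

# Which column names appeared during the step?
setdiff(ls(res$after$data), ls(res$before$data))
# [1] "dep_var"
```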
The four keys in the model file, `c("import", "data", "model", "export")`, were stitched together through stages by the models controller. The modeling engine comes with those four pre-built stages, but it is trivial to extend them or define more if you're doing visualization, report generation, more impressive ensemble or stacking pipelines, or any other tasks. Under the hood, think of the stagerunner as executing a chain of functions operating on a context: a designated R environment object holding all the objects necessary to complete the modeling process. Later, it will be possible to generalize the interactive Syberia approach to arbitrary DAG (directed acyclic graph) runners.
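The "chain of functions operating on a context" idea can be sketched in a dozen lines of base R (hypothetical names; the real stagerunner adds nesting, replaying substages, and the before/after bookkeeping):

```r
# A minimal stagerunner-like sketch: each stage is a function that reads
# from and writes to a shared context environment.
run_stages <- function(stages, context = new.env()) {
  for (stage in stages) stage(context)
  invisible(context)
}

stages <- list(
  import = function(ctx) ctx$data <- iris,
  munge  = function(ctx) names(ctx$data)[1] <- "dep_var",
  model  = function(ctx) ctx$model <- lm(dep_var ~ Petal.Width, data = ctx$data)
)

ctx <- run_stages(stages)
class(ctx$model)
# [1] "lm"
```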
The final line, `export = list(R = "model")`, says "export the model to this R session" under the global variable `model`. Both the import and export stages are fully customizable if we want to read the data from, or write the model to, a variety of backends.
After running the model (i.e., the full modeling pipeline) we have a ready-to-deploy model object. We call this object deployable because it has everything it needs to replicate both the feature engineering pipeline and the final score prediction on a single row of streaming data, and it is trivial to serialize.
If the model was predicting the probability that a user will click on a link from a very raw, unclean dataset, we could send the raw one-row dataset for a new user to a simple server that called `model$predict(one_row)` and returned the final score. The model object takes care of reproducing the entire data pipeline on the single row of data through mungebits.
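As a rough sketch of what makes such an object deployable, imagine a closure that bundles the munging pipeline with the fitted model (hypothetical helper names; Syberia's actual `model$predict` is assembled from mungebits):

```r
# A deployable-model sketch: one object carrying both the feature pipeline
# and the fitted model, so a single raw row can be scored end to end.
munge <- function(df) {
  names(df)[names(df) == "Sepal.Length"] <- "dep_var"
  df
}

make_deployable <- function(fit, munge) {
  list(predict = function(one_row) predict(fit, newdata = munge(one_row)))
}

fit   <- lm(dep_var ~ Petal.Width, data = munge(iris))
model <- make_deployable(fit, munge)
model$predict(iris[1, ])  # scores one raw, unmunged row
```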
The end goal of the Syberia modeling engine is to make building any model, no matter how complex, stacked, or multi-staged, a rather trivial combination of pre-built, modular, testable units, yielding an object that can execute the entire streaming prediction, from assembling raw data sets to producing final scores: we want to see CRAN come alive.