Syberia comes bundled with director, an R6 package aimed at simplifying management of large R projects.
By default, most R projects are collections of loosely organized scripts. A newcomer to the project would have to rely on very thorough documentation to understand how the executable tasks fit together and how to replicate and expand on any analysis. Using notebooks like Jupyter just makes the problem worse: hard to inspect in code reviews and encouraging messy watch-and-see-what-you-get scripting, notebooks suck the developer right out of our analyst bones.
If you just want to roll up your sleeves and get to work, feel free to skip this section and move on to testing (a software developer wrote this guide; we won't let you skip testing!). A high-level understanding of how Syberia projects are managed under the hood is helpful for expanding your horizons beyond the functionality packaged with the modeling engine.
In Syberia, we rely on convention over configuration, a common design pattern in software frameworks that removes some leeway in the organizational structure in exchange for something incredibly powerful: all R developers working in Syberia projects will share common knowledge about how work is organized.
In particular, the modeling engine that currently comes bundled as the default engine for structured supervised learning projects such as regression and classification sets up a few conventions which all modeling engine projects conform to:
The models directory holds all models for a given project. Each file is an R script that ends with a list object which hyper-parameterizes the data science process into a linear sequence of stages. All the conventions below are the recipe ingredients that turn model files from static lists to living, breathing objects defining a dynamic modeling pipeline.
In the current modeling engine incarnation, these are stagerunners, R6 objects that tie together all aspects of the modeling process—but in the future we may release support for distributed backends and engines such as Spark or Tensorflow. Let's humbly learn how to walk before we run.
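To make this concrete, a model file might look something like the following sketch. The stage names follow the conventions above, but the specific parameters, mungebit names, and classifier options here are illustrative assumptions, not an exact recipe from the modeling engine:

```r
# models/dev/example1.R -- a hypothetical model file.
# The file's final expression is a list that hyper-parameterizes
# the pipeline into a linear sequence of stages.
list(
  import = list(
    # Read the training data through an adapter (CSV, here).
    adapter = "csv",
    file    = "data/train.csv"
  ),
  data = list(
    # A sequence of mungebits for feature engineering.
    # `imputer` and `drop_variables` are hypothetical mungebit names.
    "Impute age"  = list(imputer, "age"),
    "Drop id"     = list(drop_variables, "id")
  ),
  model = list(
    # Which classifier to train, along with its hyperparameters.
    "gbm", distribution = "bernoulli", number_of_trees = 100
  ),
  export = list(
    # Where to serialize the trained model object.
    adapter = "file", file = "models/example1.rds"
  )
)
```

The stages consume this list left to right; everything dynamic lives in lib, while the model file itself stays a static, reviewable description of the pipeline.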
The lib/adapters directory defines IO (input and output) mechanisms for data and serialized model objects. For example, data might be read in CSV format, from a database or data warehouse, or simply reside in-memory. Data and model objects may be written to a file, to a cloud storage service, or pushed via HTTP to an API.
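An adapter can be imagined as a paired read and write mechanism for one IO medium. As a minimal sketch, assuming an adapter is expressed as a list of two functions keyed by an options list (the actual adapter interface in the modeling engine may differ):

```r
# lib/adapters/csv.R -- hypothetical sketch of a CSV adapter.
list(
  read = function(options) {
    # Pull a dataframe from disk given an options list with a file path.
    utils::read.csv(options$file, stringsAsFactors = FALSE)
  },
  write = function(object, options) {
    # Persist a dataframe (or serialized model) back to disk.
    utils::write.csv(object, options$file, row.names = FALSE)
  }
)
```

Swapping CSV for a database or cloud storage then means swapping one adapter file, with no change to the model files that reference it.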
The lib/mungebits directory holds mungebits, already familiar from an earlier lesson, which define feature engineering templates that transform dataframes to dataframes. In the future, we may extend the notion of mungebits to distributed settings such as Spark, Tensorflow or Hadoop and even explore graphical or other forms of data manipulation and computation—but development will stay nice and cozy within R.
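Recall from the earlier lesson that a mungebit pairs a train function with a predict function, so the same transformation replays faithfully on new data. A sketch of a mean-imputation mungebit, assuming the convention that `input` is the mungebit's persistent store shared between training and prediction:

```r
# lib/mungebits/imputer.R -- illustrative sketch of a mungebit.
train <- function(data, column) {
  # Record the training-set mean so prediction can reuse it verbatim.
  input$mean <- mean(data[[column]], na.rm = TRUE)
  data[[column]][is.na(data[[column]])] <- input$mean
  data
}

predict <- function(data, column) {
  # Replay the stored mean on new data -- no re-computation allowed,
  # which is what keeps scoring consistent with training.
  data[[column]][is.na(data[[column]])] <- input$mean
  data
}
```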
The lib/stages directory is a collection of modeling stages, by default import, data, model, and export. The complex logic that defines how to process the parameters in the modeling file lives in these stages: each R file in this directory is a functional, a function that returns another function, namely, a function that takes in the parameters from the model file under the given stage key and spits out a transformation of an environment that performs the appropriate operations of the stage.
For example, the import stage may read from a CSV file, whereas the model stage may construct a tundra container and then call its train method.
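In code, a stage file is a functional: it takes the parameters under that stage's key in the model file and returns the function that actually runs. A rough sketch, where the shared environment key (`modeling_environment$data`) is an assumption for illustration:

```r
# lib/stages/import.R -- hypothetical sketch of a stage functional.
function(import_parameters) {
  # The returned closure is what executes during the pipeline:
  # it reads the data and deposits it into the shared environment
  # that subsequent stages (data, model, export) will transform.
  function(modeling_environment) {
    modeling_environment$data <-
      utils::read.csv(import_parameters$file, stringsAsFactors = FALSE)
  }
}
```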
The lib/classifiers directory is responsible for housing all the training and prediction functions for various statistical classifiers: linear regression, random forest, gradient boosting machines, support vector machines, et cetera. These are responsible for the construction of a model object, a serializable R object that represents all meta-data required to replicate the scoring of outputs on new validation sets or real-time data.
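A classifier file, under this convention, boils down to a train/predict pair whose state survives serialization. A hedged sketch using base R's lm, assuming (as with mungebits) a persistent store named `output` and a hypothetical `dependent_variable` column; the exact tundra container interface is not reproduced here:

```r
# lib/classifiers/linear_regression.R -- illustrative sketch only.
train <- function(dataframe) {
  # Fit the model and stash everything needed for later scoring
  # in the serializable `output` store.
  output$model <- stats::lm(dependent_variable ~ ., data = dataframe)
}

predict <- function(dataframe) {
  # Score new rows using only what was serialized at training time.
  stats::predict(output$model, newdata = dataframe)
}
```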
Note that while in the current convention modeling remains in-memory and finitely bound, this is not a given. If the syntax is sharp and compact, we may replace backends from the current in-memory R approach to acting as a distributed command center for millions of machines, but experimentation on local slices of the data can remain in one of the most powerful interactive consoles we know: the R interpreter.
The above conventions are not exhaustive. Several are inherited from the parent engine of the modeling engine: the base engine. The conventions established in the list below are defined by the base engine, which is more general than the modeling engine and serves as the clay for almost all future Syberia engines. In case you are wondering, do not worry, at the present moment the topology is simple and linear: your project makes use of the modeling engine, which makes use of the base engine, which relies on the Syberia package.
Future larger projects may have complex intertwined engines for a variety of machine learning and business tasks, but for the moment, once again, let us walk before we run.
For example, unsupervised learning, natural language processing, scientific research, dashboarding and business intelligence, trading and finance, may all require different conventions which differ somewhat or drastically from the structured supervised learning problem. Building on top of the base engine allows for some very powerful meta-conventions:
The lib/controllers directory forms the heart of Syberia's expansive configurability. In an effort to strike a balance between convention and configuration, while recognizing that the taxonomy of computing tasks may require different conventions to iterate optimally on a given set of problems, the notion of a controller allows us to generalize how R files are executed. In the old way, we would use base::source to execute an R file and store the value of its last expression in a local variable.
Controllers allow us to define preprocessors and parsers. A preprocessor is a function that specifies everything before we execute an R file and a parser is a function that specifies everything after we execute an R file—in a given directory. In effect, controllers allow us to establish heterogeneous DSLs (domain-specific languages) in a hierarchical directory structure in a way that most optimally conforms to how we wish to solve a particular problem, whether it be production-ready machine learning or experimental business intelligence dashboarding.
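As a sketch, a controller file for a hypothetical "people" directory could pair the two pieces like this (the exact controller DSL of the base engine is not shown; the preprocessor and parser here mirror the director methods demonstrated later in this chapter):

```r
# lib/controllers/people.R -- hypothetical controller sketch.

# Everything *before* the R file executes: inject helpers into
# the environment the file will be sourced in.
preprocessor <- function(source_env, source) {
  source_env$capitalize <- function(x) {
    paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
  }
  source()
}

# Everything *after* the R file executes: convert the file's output
# (a character vector of names) into a person object.
parser <- function(output) {
  do.call(utils::person, as.list(unname(output)))
}
```

Every file under the corresponding directory then speaks this little DSL for free, without repeating the boilerplate.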
The Syberia team looks forward to the creativity and imagination the R community will display when fully unlocking the power of controllers: let the domain-specific languages flood out the rigidity of CRAN and the constraints of packages, thus establishing R as a true player in the "general purpose programming language" sector. It is, after all, just a LISP—a rather powerful one at that.
Philosophical asides aside, let's examine how a director object helps us generalize our thinking about R projects with some more concrete examples.
R --vanilla
devtools::install_github("syberia/director") # Install director into the global library.
library(director)
dir <- tempdir()
project <- director(dir)
project # Director object monitoring 'temporary director'.
We have in our hands a director object! Note the output of class(project).
Under the hood, Syberia manages the creation of a director object when your R session starts and syberia::syberia_project() is called: this last expression returns a syberia_engine R6 object which inherits from the director R6 class and expands on the vanilla functionality of a director object.
Let's see what our object can do: ls(project). The noteworthy methods are exists, filename, find, root and resource.
Let's create an R file in our temporary project and see what happens.
# Write "cat('Hello world!')" into hello.R at the root of the project.
# You can also do this using a text editor.
# Note: `project$root()` is equivalent to `dir` from earlier, if not a little clearer.
writeLines("cat('Hello world!')", file.path(project$root(), "hello.R"))
# Our project now recognizes hello.R as a "hello" resource.
project$find() # [1] "hello"
# Executing the resource gives us what we expect.
project$resource("hello") # Hello world!NULL
# We have a NULL at the end because the return value of the cat function is NULL.
By default, the resource method acts as a slightly more powerful source.
devtools::install_github("robertzk/Ramd") # Install the Ramd package for the 'define' helper.
# Create dir/world
dir.create(file.path(project$root(), "world"))
# Write to dir/world/world.R
writeLines("Ramd::define('hello', function(hello) { cat(hello, 'world!') })",
file.path(project$root(), "world", "world.R"))
writeLines("'hello'", file.path(project$root(), "world", "hello.R"))
project$find() # [1] "world" "hello"
Note that find does not produce results in alphabetical order. The resources are ordered by which ones were modified most recently.
In the above example, we illustrated one of the most powerful concepts in director: idempotent resources. Many dynamic scripting languages like JavaScript and R have found themselves at the forefront of popular usage without their designers anticipating the growth in complexity. In R projects, we typically find it hard to organize our work because, like JavaScript, R was simply not designed with large architectures in mind.
However, there is a neat solution. We take a page out of the books of require.js and node.js and allow each director object a very specific convention: if a file (ignoring extension) has the same name as its parent directory, it gets treated as an idempotent resource, meaning that the other files in that directory are helper files which become invisible to the director object.
These helper files are implicitly understood to be available to the main file for its final construction, but otherwise should not be treated as Syberia resources, that is, should not be made available as modular pieces for re-use within the rest of the project. If you find yourself needing to make use of them, it is a good indication that they deserve the title of their own distinct resource.
list.files(project$root(), recursive = TRUE)
# "hello.R" "world/hello.R" "world/world.R"
project$filename("world")
# [1] "world/world.R"
The director object keeps track of the fact that the "world" resource is in fact at "world/world.R". This is a virtual construction which maps either to "world.R" or "world/world.R" depending on the name of the parent directory of "world.R".
project$resource("world")
# hello world!NULL
Our resource has been pieced together from the world directory using both "world/hello.R" and "world/world.R".
We used the Ramd package for the Ramd::define function, which provides a convenient syntax for sourcing adjacent files. Note that these files are truly sourced using base::source and are pure helper R scripts: they do not benefit from the preprocessor or parser magic in director that we are about to introduce! (This is a frequent source of bizarre bugs and stack traces and merits explicit mention.)
Continuing with the example from the previous section, let's add some flair to our project.
john <- utils::person("John", "Doe")
john # "John Doe"
class(john) # "person"
We use the simplest illustrative structure we know of: the utils package's definition of a person S3 class. Imagine we want to define a convention that all files in the "people" directory must return a c("first name", "last name") character vector and create a person object. It would be annoying and repetitive to write utils::person("first name", "last name") in each file, and anyway we may wish to replace the base definition with another class down the road. (See don't repeat yourself.)
dir.create(file.path(project$root(), "people"))
writeLines('c("Jane", "Smith")', file.path(project$root(), "people/jane.R"))
jane <- project$resource("people/jane")
jane # [1] "Jane" "Smith"
class(jane) # "character"
So far, we have not accomplished our objective: jane is still a lowly character vector, not a person. To animate her like Pinocchio, we shall define a parser on the director object.
project$register_parser("people", function(output) {
do.call(utils::person, as.list(unname(output)))
})
Now let's try again.
jane <- project$resource("people/jane")
jane # [1] "Jane Smith"
class(jane) # "person"
What a lovely lady jane has now become! By calling register_parser on the project, we have implicitly told our director to, going forward, treat all files in the "people" directory with the convention: take the output of the file and pass it as arguments to utils::person.
dir.create(file.path(project$root(), "people", "maryjane"))
writeLines("mary <- 'Mary'; jane <- 'Jane'; paste(mary, jane, sep = '-')",
file.path(project$root(), "people", "maryjane", "mj.R"))
writeLines("Ramd::define('mj', function(mj) { c(mj, 'Jones') })",
file.path(project$root(), "people", "maryjane", "maryjane.R"))
maryjane <- project$resource("people/maryjane")
maryjane # [1] "Mary-Jane Jones"
class(maryjane) # "person"
As you can see, maryjane is more complex than mary, but she is still a great example of a person.
So far, we have encountered an example of a parser. We can also do additional work before sourcing the file, such as injecting helper functions (maybe even recursively composed of other Syberia resources).
project$register_preprocessor("people", function(source_env, source) {
source_env$capitalize <- function(x) {
x <- strsplit(x, "")[[1]]
paste(c(toupper(x[1]), tolower(x[-1])), collapse = "")
}
source()
})
We can now use the capitalize helper in files in the "people" directory.
writeLines("sapply(c('jimmy', 'JOHNSON'), capitalize)",
file.path(project$root(), "people", "jimmy.R"))
jimmy <- project$resource("people/jimmy")
jimmy # [1] "Jimmy Johnson"
class(jimmy) # "person"
Before Hadley Wickham pops out of thin air to remind you never to use sapply, note that we never defined capitalize anywhere in the file: it got injected by the preprocessor.
The full list of magic provided by the preprocessor and parser definitions is given in the appendix. The Syberia base engine (the foundation for the modeling engine) bootstraps a vanilla director object with some very basic conventions: everything in "lib/controllers" gives us shortcuts for calling project$register_preprocessor and project$register_parser under the hood.
In other words, Syberia uses director recursively to bootstrap the definition of "controllers" into existence: even controllers are just another Syberia resource and can be rewritten at will; and even director objects can be replaced with more powerful tools for managing R projects.
cat("I know of these people:", paste(collapse = ", ", project$find("people")), "\n")
# I know of these people: people/jimmy, people/maryjane, people/jane
vapply(project$find("people"), function(p) { as.character(project$resource(p)) }, "")
# people/jimmy people/maryjane people/jane
# "Jimmy Johnson" "Mary-Jane Jones" "Jane Smith"
# By default, `find` matches using a fuzzy wildcard: "ppl" is equivalent to
# the regex .*p.*p.*l.*
project$find("ppl") # [1] "people/jimmy" "people/maryjane" "people/jane"
# This is mostly because the author is lazy and likes convenience. Feel
# free to suggest a change in a pull request or issue in the director package.
# You can target just subdirectories:
project$find("ppl", method = "partial") # character(0)
project$find("people", method = "partial") # [1] "people/jimmy" "people/maryjane" "people/jane"
# You can also target for an exact match, at most one result.
project$find("people", method = "exact") # character(0)
project$find("people/jimmy", method = "exact") # [1] "people/jimmy"
# You can use `filename` to discover the filename behind a resource.
unname(vapply(project$find("people"), project$filename, character(1)))
# [1] "people/jimmy.R" "people/maryjane/maryjane.R" "people/jane.R"
# You can use `exists` to check for resource existence.
project$exists("people/jimmy") # [1] TRUE
project$exists("people/jimmyjohns") # [1] FALSE
Some additional, more advanced functionality is provided by the syberia_engine object, which extends the director object, but that is beyond the scope of this introduction. It involves handling all communication and coordination between the root project and its dependent engines, in addition to locating resources across the topology of engines mounted to the root project. If all that sounds complicated, that's because it is, so don't worry about it for now.
In practice, you don't have to worry about most of the explicit examples in this chapter: the Syberia modeling engine handles almost all uses of director objects under the hood. The only real usage you should remember is to call project$resource("foo") when you want resource "foo" (e.g., "models/model1", "lib/mungebits/imputer", et cetera). The project global helper is just a shortcut for syberia::syberia_project(), which builds your project from the current working directory into a syberia_engine R6 object and then memoizes (caches) it in memory.
You are so close! You can skip straight to the summary, but any great aspiring developer will read the section on testing.
Preprocessors and parsers offer a powerful capability for extending the vanilla base::source function.