Managing large projects

Syberia comes bundled with director, an R6-based package aimed at simplifying the management of large R projects.

By default, most R projects are collections of loosely organized scripts. A newcomer to the project would have to rely on very thorough documentation to understand how the executable tasks fit together and how to replicate and expand on any analysis. Notebooks like Jupyter only make the problem worse: hard to inspect in code reviews and encouraging messy watch-and-see-what-you-get scripting, they suck the developer right out of our analyst bones.

If you just want to roll up your sleeves and get to work, feel free to skip this section and move on to testing (a software developer wrote this guide; we won't let you skip testing!). Still, a high-level understanding of how Syberia projects are managed under the hood is helpful for expanding your horizons beyond the functionality packaged with the modeling engine.

In Syberia, we rely on convention over configuration, a common design pattern in software frameworks that removes some leeway in the organizational structure in exchange for something incredibly powerful: all R developers working in Syberia projects will share common knowledge about how work is organized.

The modeling engine conventions

In particular, the modeling engine, which currently comes bundled as the default engine for structured supervised learning projects such as regression and classification, establishes a few conventions to which all modeling engine projects conform:

  • The models directory holds all models for a given project. Each file is an R script that ends with a list object which hyper-parameterizes the data science process into a linear sequence of stages (a sketch of such a file follows this list). All the conventions below are the recipe ingredients that turn model files from static lists into living, breathing objects defining a dynamic modeling pipeline.

    In the current modeling engine incarnation, these are stagerunners, R6 objects that tie together all aspects of the modeling process—but in the future we may release support for distributed backends and engines such as Spark or TensorFlow. Let's humbly learn how to walk before we run.

  • The lib/adapters directory defines IO (input and output) mechanisms for data and serialized model objects. For example, data might be read in CSV format, from a database or data warehouse, or simply reside in-memory. Data and model objects may be written to a file, to a cloud storage service, or pushed via HTTP to an API.

  • The lib/mungebits directory holds mungebits, already familiar from an earlier lesson, which define feature engineering templates that transform dataframes to dataframes. In the future, we may extend the notion of mungebits to distributed settings such as Spark, TensorFlow, or Hadoop and even explore graphical or other forms of data manipulation and computation—but development will stay nice and cozy within R.

  • The lib/stages directory is a collection of modeling stages, by default import, data, model, and export. The complex logic that defines how to process the parameters in the modeling file lives in these stages: each R file in this directory is a functional, that is, a function that returns another function. It takes in the parameters from the model file under the given stage key and spits out a transformation of an environment that performs the appropriate operations of the stage (see the second sketch following this list).

    For example, the import stage may read from a CSV file, whereas the model stage may construct a tundra container and then call its train method.

  • The lib/classifiers directory is responsible for housing all the training and prediction functions for various statistical classifiers: linear regression, random forest, gradient boosting machines, support vector machines, et cetera. These are responsible for the construction of a model object, a serializable R object that represents all meta-data required to replicate the scoring of outputs on new validation sets or real-time data.
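
To make the first convention concrete, here is a minimal sketch of what a model file might look like. The stage names mirror the defaults above (import, data, model, export), but the adapters, mungebit names, classifier, and paths are invented for illustration and will vary by project:

    # models/dev/example.R -- a hypothetical model file (all values illustrative).
    # The script ends with a list keyed by stage name; each value
    # hyper-parameterizes that stage.
    list(
      import = list(adapter = "csv", file = "data/training.csv"),
      data   = list(
        "Impute age"     = list("imputer", "age"),
        "Drop id column" = list("drop_variables", "id")
      ),
      model  = list("gbm", distribution = "bernoulli"),
      export = list(adapter = "file", file = "models/trained/example.rds")
    )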
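
And here is a minimal sketch of the lib/stages convention, assuming a simplified environment-based pipeline; the real modeling engine's stage machinery is richer than this:

    # lib/stages/import.R -- a hypothetical stage functional.
    function(import_parameters) {
      # `import_parameters` is the value under the `import` key of a model
      # file, e.g. list(adapter = "csv", file = "data/training.csv").
      function(modeling_environment) {
        # Transform the environment by performing the stage's work: here,
        # reading raw data into it (assuming a local CSV for simplicity).
        modeling_environment$data <- read.csv(import_parameters$file)
        invisible(modeling_environment)
      }
    }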

Note that while in the current convention modeling remains in-memory and finitely bound, this is not a given. If the syntax is sharp and compact, we may swap the backend from the current in-memory R approach to a distributed command center for millions of machines, while experimentation on local slices of the data remains in one of the most powerful interactive consoles we know: the R interpreter.

The base engine

The above conventions are not exhaustive. Several are inherited from the parent engine of the modeling engine: the base engine. The conventions established in the list below are defined by the base engine, which is more general than the modeling engine and serves as the clay for almost all future Syberia engines. In case you are wondering, do not worry: at the present moment the topology is simple and linear. Your project makes use of the modeling engine, which makes use of the base engine, which relies on the Syberia package.

Future larger projects may have complex intertwined engines for a variety of machine learning and business tasks, but for the moment, once again, let us walk before we run.

For example, unsupervised learning, natural language processing, scientific research, dashboarding and business intelligence, and trading and finance may all require conventions that differ somewhat or drastically from those of the structured supervised learning problem. Building on top of the base engine allows for some very powerful meta-conventions:

  • The config/routes resource (typically just a routes.R file) establishes the link between the lib/controllers directory and the rest of the project: it tethers together which directories are processed by which controllers (a sketch follows this list).
  • The lib/controllers directory forms the heart of Syberia's expansive configurability. In an effort to strike a balance between establishing convention over configuration and recognizing that different computing tasks may require different conventions to iterate optimally on their given set of problems, the notion of a controller allows us to generalize how R executes files. The old way is to use base::source to execute an R file and store the value of its last expression in a local variable.

    Controllers allow us to define preprocessors and parsers. A preprocessor is a function that specifies everything that happens before we execute an R file, and a parser is a function that specifies everything that happens after we execute an R file—in a given directory. In effect, controllers allow us to establish heterogeneous DSLs (domain-specific languages) in a hierarchical directory structure in a way that most optimally conforms to how we wish to solve a particular problem, whether it be production-ready machine learning or experimental business intelligence dashboarding.

    The Syberia team looks forward to the creativity and imagination the R community will display when fully unlocking the power of controllers: let the domain-specific languages flood out the rigidity of CRAN and the constraints of packages, thus establishing R as a true player in the "general purpose programming language" sector. It is, after all, just a LISP—a rather powerful one at that.
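
For example, the config/routes resource of a modeling project might look like the following sketch: a named list tethering directory prefixes to controller names. The exact entries below are illustrative and depend on the engine and project:

    # config/routes.R -- a hypothetical routes file.
    # Keys are directories; values are controllers in lib/controllers.
    list(
      "lib/adapters"    = "adapters",
      "lib/classifiers" = "classifiers",
      "lib/mungebits"   = "mungebits",
      "lib/stages"      = "stages",
      "models"          = "models"
    )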

Philosophical asides aside, let's examine how a director object helps us generalize our thinking about R projects with some more concrete examples.

The most basic director object

Let us start from the beginning and open a vanilla R session from the terminal: R --vanilla.
    
    devtools::install_github("syberia/director") # Install director into the global library.
    library(director)
    dir <- tempdir()
    project <- director(dir)
    project # Director object monitoring 'temporary directory'.

We have in our hands a director object! Note the output of class(project). Under the hood, Syberia manages the creation of a director object when your R session starts and syberia::syberia_project() is called: that expression returns a syberia_engine R6 object, which inherits from the director R6 class and expands on the vanilla functionality of a director object.

Let's see what our object can do: ls(project). The noteworthy methods are exists, filename, find, root and resource.

Let's create an R file in our temporary project and see what happens.

    
    # Write "cat('Hello world!')" into hello.R at the root of the project.
    # You can also do this using a text editor.
    # Note: `project$root()` is equivalent to `dir` from earlier, if a little clearer.
    writeLines("cat('Hello world!')", file.path(project$root(), "hello.R"))
    # Our project now recognizes hello.R as a "hello" resource.
    project$find() # [1] "hello"
    # Executing the resource gives us what we expect.
    project$resource("hello") # Hello world!NULL
    # We get a NULL at the end because the return value of cat is NULL.

By default, the resource method acts as a slightly more powerful source.

    
    devtools::install_github("robertzk/Ramd") # Install the Ramd package for the 'define' helper.
    # Create dir/world.
    dir.create(file.path(project$root(), "world"))
    # Write to dir/world/world.R.
    writeLines("Ramd::define('hello', function(hello) { cat(hello, 'world!') })",
               file.path(project$root(), "world", "world.R"))
    writeLines("'hello'", file.path(project$root(), "world", "hello.R"))
    project$find() # [1] "world" "hello"

Note that find does not produce results in alphabetical order: resources are ordered by most recent modification.

In the above example, we illustrated one of the most powerful concepts in director: idempotent resources. Many dynamic scripting languages like JavaScript or R have found themselves at the forefront of popular usage without the language designers anticipating their growth in complexity. In R projects, we often find it hard to organize our work because, like JavaScript, R was simply not designed with large architectures in mind.

However, there is a neat solution. We take a page out of the books of require.js and node.js and give each director object a very specific convention: if a file (ignoring extension) has the same name as its parent directory, it gets treated as an idempotent resource, meaning that the other files in that directory are helper files which become invisible to the director object.

These helper files are implicitly understood to be available to the main file for its final construction, but otherwise should not be treated as Syberia resources, that is, should not be made available as modular pieces for re-use within the rest of the project. If you find yourself needing to make use of them, it is a good indication that they deserve the title of their own distinct resource.

    
    list.files(project$root(), recursive = TRUE)
    # [1] "hello.R"       "world/hello.R" "world/world.R"
    project$filename("world")
    # [1] "world/world.R"

The director object keeps track of the fact that the "world" resource in fact lives at "world/world.R". The resource name is a virtual construction: it maps to either "world.R" or "world/world.R", depending on whether the file sits at the project root or inside a directory sharing its name.

    
    project$resource("world")
    # hello world!NULL

Our resource has been pieced together from the world directory using both "world/hello.R" and "world/world.R".

We used the Ramd package for its Ramd::define function, which provides a convenient syntax for sourcing adjacent files. Note that these files are truly sourced using base::source and are pure helper R scripts—they do not benefit from the preprocessor or parser magic in director that we are about to introduce! (This is a frequent source of bizarre bugs and stack traces and merits explicit mention.)

Preprocessors and parsers

Continuing with the example from the previous section, let's add some flair to our project.

    
    john <- utils::person("John", "Doe")
    john # "John Doe"
    class(john) # "person"

We use the simplest illustrative structure we know of: the utils package's definition of the person S3 class. Imagine we want to define a convention that all files in the "people" directory must return a c("first name", "last name") character vector and create a person object. It would be annoying and repetitive to write utils::person("first name", "last name") in each file, and in any case we may wish to replace the base definition with another class down the road. (See don't repeat yourself.)

    
dir.create(file.path(project$root(), "people"))
writeLines('c("Jane", "Smith")', file.path(project$root(), "people/jane.R"))
jane <- project$resource("people/jane")
jane # [1] "Jane" "Smith"
class(jane) # "character"
    
  

So far, we have not accomplished our objective: jane is still a lowly character vector, not a person. To animate her like Pinocchio, we shall define a parser on the director object.

    
    project$register_parser("people", function(output) {
      do.call(utils::person, as.list(unname(output)))
    })

Now let's try again.

    
    jane <- project$resource("people/jane")
    jane # [1] "Jane Smith"
    class(jane) # "person"

What a lovely lady jane has now become! By calling register_parser on the project, we have told our director that, going forward, all files in the "people" directory follow this convention: take the output of the file and pass it as arguments to utils::person.

    
dir.create(file.path(project$root(), "people", "maryjane"))
writeLines("mary <- 'Mary'; jane <- 'Jane'; paste(mary, jane, sep = '-')",
           file.path(project$root(), "people", "maryjane", "mj.R"))
writeLines("Ramd::define('mj', function(mj) { c(mj, 'Jones') })",
           file.path(project$root(), "people", "maryjane", "maryjane.R"))
maryjane <- project$resource("people/maryjane")
maryjane # [1] "Mary-Jane Jones"
class(maryjane) # "person"
    
  

As you can see, maryjane is more complex than jane, but she is still a great example of a person.

So far, we have only encountered an example of a parser. We can also do additional work before sourcing the file, such as injecting helper functions (maybe even recursively composed of other Syberia resources).

    
    project$register_preprocessor("people", function(source_env, source) {
      source_env$capitalize <- function(x) {
        x <- strsplit(x, "")[[1]]
        paste(c(toupper(x[1]), tolower(x[-1])), collapse = "")
      }
      source()
    })

We can now use the capitalize helper in files in the "people" directory.

    
writeLines("sapply(c('jimmy', 'JOHNSON'), capitalize)",
           file.path(project$root(), "people", "jimmy.R"))
jimmy <- project$resource("people/jimmy")
jimmy # [1] "Jimmy Johnson"
class(jimmy) # "person"
    
  

Before Hadley Wickham pops out of thin air to remind you never to use sapply, note that we never defined capitalize anywhere in the file: it got injected by the preprocessor.

The full list of magic provided by preprocessor and parser definitions is given in the appendix. The Syberia base engine (the foundation for the modeling engine) bootstraps a vanilla director object with some very basic conventions: everything in "lib/controllers" gives us shortcuts for calling project$register_preprocessor and project$register_parser under the hood.

In other words, Syberia uses director recursively to bootstrap the definition of "controllers" into existence: even controllers are just another Syberia resource and can be rewritten at will; and even director objects can be replaced with more powerful tools for managing R projects.
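
For instance, the "people" convention we assembled by hand above could be packaged as a controller. Below is a hypothetical sketch, assuming a controller file exposes local preprocessor and parser functions that the base engine registers for us; the function bodies are exactly our earlier register_* calls:

    # lib/controllers/people.R -- a hypothetical controller replacing our
    # manual register_preprocessor and register_parser calls.
    preprocessor <- function(source_env, source) {
      source_env$capitalize <- function(x) {
        x <- strsplit(x, "")[[1]]
        paste(c(toupper(x[1]), tolower(x[-1])), collapse = "")
      }
      source()
    }
    parser <- function(output) {
      do.call(utils::person, as.list(unname(output)))
    }

Meanwhile, our hand-registered convention continues to serve the project: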

    
cat("I know of these people:", paste(collapse = ", ", project$find("people")), "\n")
# I know of these people:  people/jimmy, people/maryjane, people/jane
vapply(project$find("people"), function(p) { as.character(project$resource(p)) }, "")
#      people/jimmy   people/maryjane       people/jane
#   "Jimmy Johnson" "Mary-Jane Jones"      "Jane Smith"
    
  

Additional director functionality

    
    # By default, `find` matches using a fuzzy wildcard; searching for
    # "ppl" is equivalent to the regex .*p.*p.*l.*
    project$find("ppl") # [1] "people/jimmy"    "people/maryjane" "people/jane"
    # This is mostly because the author is lazy and likes convenience. Feel
    # free to suggest a change in a pull request or issue in the director package.
    # You can target just subdirectories:
    project$find("ppl", method = "partial") # character(0)
    project$find("people", method = "partial") # [1] "people/jimmy"    "people/maryjane" "people/jane"
    # You can also ask for an exact match, which yields at most one result.
    project$find("people", method = "exact") # character(0)
    project$find("people/jimmy", method = "exact") # [1] "people/jimmy"

    # You can use `filename` to discover the filename behind a resource.
    unname(vapply(project$find("people"), project$filename, character(1)))
    # [1] "people/jimmy.R" "people/maryjane/maryjane.R" "people/jane.R"

    # You can use `exists` to check for resource existence.
    project$exists("people/jimmy") # [1] TRUE
    project$exists("people/jimmyjohns") # [1] FALSE
    
  

Some additional, more advanced functionality is provided by the syberia_engine object, which extends the director object, but that is beyond the scope of this introduction. It involves handling all communication and coordination between the root project and its dependent engines, in addition to locating resources across the topology of engines mounted to the root project. If all that sounds complicated, that's because it is, so don't worry about it for now.

How director interacts with your project

In practice, you don't have to worry about most of the explicit examples in this chapter: the Syberia modeling engine handles almost all uses of director objects under the hood. The only real usage you should remember is to call project$resource("foo") when you want to obtain resource "foo" (e.g., "models/model1", "lib/mungebits/imputer", et cetera).

The project global helper is just a shortcut for syberia::syberia_project(), which builds your project from the current working directory into a syberia_engine R6 object and then memoizes (caches) it in memory.
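
In day-to-day work this boils down to something like the following sketch (both resource names are hypothetical examples):

    # From an R session started at the root of a Syberia project:
    project <- syberia::syberia_project() # Built once, then cached.
    # Fetch resources by name; these names are illustrative.
    model_parameters <- project$resource("models/dev/model1")
    imputer          <- project$resource("lib/mungebits/imputer")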

Next steps

You are so close! You can skip straight to the summary, but any great aspiring developer will read the section on testing.

Appendix: Available arguments for preprocessor and parser definitions

Preprocessors and parsers offer a powerful mechanism for extending the vanilla base::source function.
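
As a preview, the sketch below recaps the injected arguments already demonstrated in this chapter; the full set is catalogued in the remainder of this appendix. The route name "some/route" is a placeholder:

    # A recap of the injected arguments demonstrated in this chapter.
    # Preprocessors can request the environment the file will be sourced
    # into (`source_env`) and a function that performs the sourcing (`source`):
    project$register_preprocessor("some/route", function(source_env, source) {
      source_env$helper <- function(x) x # Inject helpers before sourcing.
      source()
    })
    # Parsers can request the value of the file's last expression (`output`):
    project$register_parser("some/route", function(output) {
      output # Post-process the sourced value here.
    })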