You are never going to waste time janitoring data again!
You can get back to doing real work and actual math.
OK, that's my two-line elevator pitch. Let's get you up and running with Syberia!
If you already know what Syberia is and what you can do with it, feel free to skip to the next subsection. Otherwise, hopefully this brief section will answer some questions.
**Will Syberia help me if I am trying to productionize data science models?** Yes, but primarily if those models are written in R. Syberia is a framework designed specifically to productionize the R programming language with rapid iteration cycles, although hybrid models consisting of Python, Scala, and other tools will also be possible to deploy.

**Will Syberia help me if I am trying to quantitatively understand a business question?** While this is not its primary design goal, the packages that compose Syberia can be used to augment interactive data analysis and introduce the benefits of solid developer hygiene, such as testability and modularity.

**Will Syberia help me if I am solving an academic problem?** Syberia was designed in an industry context but is general enough to be useful in other domains, including academia. Although no out-of-the-box Syberia engine currently exists for reproducible research and publication, this is a possible path to explore in the future.

**What if I need to solve a big data problem or deploy low-latency machine learning models?** Syberia is designed to act as a command center for all data science problems and to eventually interact smoothly with high-performance and large-scale data requirements. Currently, you should expect to be able to productionize any R code (10-1000ms latency SLAs on data problems with inputs of millions of rows and thousands of features).

**What about dashboards? Unstructured or audiovisual data? Deep learning?** The world is at your feet, but you will have to do some custom FFI work. For the current alpha release, productionization of medium-size structured data problems is supported out of the box.
Before we get started, let's take a look at what a Syberia model file looks like.
To avoid scaring you off with a real example, let's begin with the hello world of data science: predicting the sepal length of flowers in the iris data set given the sepal width, petal length, petal width, and species type.
```r
# The Hello World program of data science.
list( # A great example of a useless Syberia model.
  import = list(R = "iris"), # Load the data from R's global namespace.
  data = list( # Add a primary key and dep_var column.
    "Make dep var" = list(renamer, c("Sepal.Length" = "dep_var")),
    "Example var"  = list(example), # Silly example of a custom step that lives elsewhere.
    "Make ID var"  = list(multi_column_transformation(seq_along), "dep_var", "id")
  ),
  model  = list("lm", .id_var = "id"), # Run a linear regression.
  export = list(R = "model") # Make a global R variable called `model`.
)
```
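To bring it to life, you would save this file in a Syberia modeling project (for example under `models/dev/iris`; the exact path is a convention of your project, not a requirement) and build it from the Syberia console using the global `run()` helper:

```r
# Hypothetical console session; assumes the file above is saved as a
# model in your project (e.g., models/dev/iris).
run("iris") # Executes the import, data, model, and export stages in order.
model       # The export stage bound the result to this global variable.
```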
The vanilla method of writing this model in R is one line of code:
```r
model <- stats::lm(Sepal.Length ~ ., data = iris)
```
Noticing this fact is a great way to convince yourself that Syberia is useless and blissfully close the browser tab.
As with most powerful abstractions, the trivial examples require a few extra lines of boilerplate. Yet a great hidden accomplishment is achieved: the level of complexity becomes fixed for all problems solvable by the abstraction, rather than growing linearly or exponentially as the number of features and the complexity of requirements increase.
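To make that trade-off concrete, here is a sketch of the same preprocessing as a vanilla script (illustrative only; it skips the custom "Example var" step). Every new feature-engineering idea adds ad hoc lines to a script like this, whereas the Syberia file above grows by exactly one self-documenting list entry:

```r
# A vanilla-R sketch of the same preprocessing, for comparison.
df <- iris
names(df)[names(df) == "Sepal.Length"] <- "dep_var" # "Make dep var"
df$id <- seq_along(df$dep_var)                      # "Make ID var"
model <- stats::lm(dep_var ~ . - id, data = df)
```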
Let's try a real example. On Kaggle, the machine learning competition website, the warm-up predictive analytics challenge is to determine who survived the sinking of the Titanic, given information about the passengers on board. Typically, a great solution requires several hundred lines of clever code. Using the Syberia modeling engine, we condensed such a collection of scripts into a single file:
```r
# Some of the constants used here live in the main file:
# https://github.com/syberia/examples/blob/master/models/dev/titanic/titanic.R
list( # R's a LISP, so we represent the modeling process with a list. :)
  import = list(
    # For this simple example, we'll use Kaggle's example Titanic data set
    # to build a relatively thorough logistic regression analysis.
    url = list(
      "https://raw.githubusercontent.com/haven-jeon/introduction_to_most_usable_pkgs_in_project/master/bicdata/data/titanic.csv",
      # Treat strings as strings, not factors.
      stringsAsFactors = FALSE
    )
  ), # You can use a variety of data import adapters, like csv or AWS's S3.

  # Now that we have downloaded a data set, let's clean it up so it can be fed
  # into a logistic regression classifier.
  data = list(
    # One line per feature engineering step seems undue, but it's for a
    # good cause: commenting out to try with and without a step is trivial,
    # and as the parametrization grows longer it is a clear indication
    # you need a new mungebit! Using descriptive keys makes it self-documenting.
    "Has paren in name" = list(multi_column_transformation(function(name) grepl("(", fixed = TRUE, name)), "name", "has_paren")
    ,"Name length variable" = list(new_variable, function(name) nchar(name), "name_length")
    ,"Formal title" = list(regex_factor, "name", "title", cases = titles, fixed = fixed_titles)
    ,"Ticket type" = list(regex_factor, "ticket", "ticket_type", cases = tickets, fixed = fixed_tickets)
    ,"title_fare variable" = list(new_variable, function(title, fare) { stats::ave(fare, title, FUN = mean) }, "title_fare")
    ,"class_fare" = list(multi_column_transformation(function(klass, fare) { stats::ave(fare, klass, FUN = mean) }), c("pclass", "fare"), "class_fare")
    ,"Some simple derivations" = list(atransform, alist(fare_diff = fare - title_fare, fare_pct = fare / title_fare, fare_diff_class = fare - class_fare, fare_pct_class = fare / class_fare))
    ,"Derived cabin variables" = list(atransform, cabin_derivations)
    ,"Cabin diff and pct" = list(atransform, alist(fare_diff_cabin = fare - cabin_fare, fare_pct_cabin = fare / cabin_fare))
    ,"cabin_single_letter" = list(new_variable, function(cabin_letter) factor(gsub("^(.).*$", "\\1", cabin_letter)), "cabin_single_letter")
    ,"Logical to factor" = list(!as.factor, is.logical)
    ,"Drop character vars" = list(drop_variables, is.character)
    ,"Restore levels" = list(restore_categorical_variables, is.factor)
    ,"Rename dep_var" = list(renamer, c("survived" = "dep_var"))
    # English please! Let's translate each step to a vague description:
    #
    # Parse out a binary feature that indicates if their name has a parenthesis.
    # Turn the string length of the name on the ticket into an integer feature.
    # Parse out title feature: c("mr", "ms", "mrs", "master", "rev", "dr", "other")
    # Parse out ticket type feature: c("pc", "sc", "ca", "sp", "w", "soc", "ston", "line", "paris", "other")
    # Make a new title_fare variable using stats::ave.
    # Use a helper from the mungebits2 package to define a class_fare feature.
    # Use base::transform to derive differences and quotients.
    # Run some more derivations -- see the full file in the syberia/examples repo.
    # Add one last logical feature because the first letter of cabin seems significant.
    # Automatically convert our earlier binary feature to a factor.
    # Drop all other character features: the juice has been squeezed!
    # A key idea: store the categorical levels under the hood for later production use.
    # Finally, the convention for now is that the dependent variable is "dep_var".
  ),

  # Now let's build an ML model.
  model = list(
    "gbm" # The name of our classifier. This is defined elsewhere.
    , .id_var = "X" # The URL adapter gave each row a primary key "X".
    , distribution = "bernoulli" # Here we try a binary classification.
    , number_of_trees = 3000 # Pick some parameters.
    , shrinkage_factor = 0.005
    , depth = 5
    , min_observations = 6
    , train_fraction = 1
    , bag_fraction = 0.5
    , cv = TRUE
    , cv_folds = 5
    , number_of_cores = 4
    , perf_method = "cv"
    , prediction_type = "response"
  ),

  # Finally, let's export our model so we can use it later to predict on
  # validation sets when we need to make a powerpoint for our boss on the
  # latest month of performance data. Or if we want to get it published.
  # Or if we want to win the Kaggle competition. Or if we want to run it on
  # billions of predictions in batch mode or in a stored SQL procedure.
  #
  # In more advanced uses, the object is also already "productionized":
  # serializable and ready-to-go for live production systems on raw, messy data streams.
  export = list( # Exporting the model is done in 1 line of code.
    R = "titanic_model" # Export the model object to a global variable.
    # , s3 = "path/titanic" # It's also possible to export to Amazon's S3.
    # , file = "~/path/titanic" # Or a file...
    # Or any other custom export format you define.
  )
) # And we're done! The model comes to life using the global `run()` helper.
```
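Building the model again takes one call in the Syberia console. Here is a hedged sketch of a session (the examples repo keeps this file at `models/dev/titanic`; `raw_validation_data` is a hypothetical data frame of unprocessed passenger rows):

```r
run("titanic") # Builds every stage and exports the global `titanic_model`.

# The exported object bundles its feature-engineering steps, so in principle
# it can score raw, messy rows directly. The `$predict` call below assumes a
# tundra-style model container; treat it as a sketch rather than gospel.
titanic_model$predict(raw_validation_data)
```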
This example showcases the primary principle behind the Syberia approach to modeling. Rather than using collections of loosely organized scripts, Syberia lets you prescribe what should happen in the model construction process, from importing raw production data to delivering a ready-to-deploy object.
Syberia also allows you to "roll back" to earlier points in the process without repeating previous steps when something goes wrong, such as when you're debugging or testing your feature engineering, as sketched below.
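Here is a minimal sketch of that idea (illustrative only; the real stagerunner package is far more general): snapshot the data after each stage so a later stage can be tweaked and re-run without repeating the earlier ones.

```r
# A toy version of the stagerunner space/time tradeoff: cache the data
# after every stage so you can resume from the middle of the pipeline.
run_stages <- function(stages, data, from = 1, cache = new.env()) {
  if (from > 1) {
    data <- get(paste0("after_", from - 1), envir = cache) # roll back
  }
  for (i in seq(from, length(stages))) {
    data <- stages[[i]](data)
    assign(paste0("after_", i), data, envir = cache) # snapshot this stage
  }
  data
}

stages <- list(
  function(df) { names(df)[names(df) == "Sepal.Length"] <- "dep_var"; df },
  function(df) { df$id <- seq_len(nrow(df)); df }
)
cache <- new.env()
out1 <- run_stages(stages, iris, cache = cache)
# Change stage 2 and re-run from stage 2 only, without repeating stage 1:
stages[[2]] <- function(df) { df$id <- rev(seq_len(nrow(df))); df }
out2 <- run_stages(stages, iris, from = 2, cache = cache)
```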
With Syberia, the approach to modeling is the same as the software engineering approach to development: your work should be testable and modular, and each component you build should have a clear purpose. Throughout this guide, we will see how the high-level file above is an end-to-end parametrization of the modeling process that breaks down into constructs like I/O adapters, feature-engineering mungebits, exportable classifiers, and re-runnable stagerunners, which achieve a space/time tradeoff so that you can quickly roll back to halfway through your workflow and experiment with different parameters.
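For intuition on the mungebit half of that story, here is a conceptual sketch (not the actual mungebits2 API): a feature-engineering step is a train/predict pair that records metadata during training, such as the categorical levels mentioned above, and replays it on raw production data.

```r
# A conceptual mungebit: a train/predict pair where train records state
# (here, categorical levels) and predict replays it on new data.
# This is an illustrative closure, not the real mungebits2 interface.
make_restore_levels <- function() {
  levels_seen <- NULL
  list(
    train = function(df, column) {
      levels_seen <<- levels(factor(df[[column]])) # remember training levels
      df[[column]] <- factor(df[[column]], levels = levels_seen)
      df
    },
    predict = function(df, column) {
      # Coerce raw production data to the training-time levels.
      df[[column]] <- factor(df[[column]], levels = levels_seen)
      df
    }
  )
}

bit <- make_restore_levels()
training <- bit$train(data.frame(title = c("mr", "ms", "mrs")), "title")
scoring  <- bit$predict(data.frame(title = "mr"), "title")
levels(scoring$title) # c("mr", "mrs", "ms") -- same as at training time
```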
The current use case of Syberia is rapidly experimenting with different feature engineering steps and machine learning classifiers and producing deployable model objects so that you do not have to recode your work into Python, Ruby, C++ or Java. Many statisticians and academics have worked hard to produce a flourishing ecosystem of statistical and machine learning methods in R and we think their work belongs in the real world without laborious translation efforts.
The long-term vision for Syberia is to act as a tool for rapidly constructing DSLs (domain-specific languages) to solve problems in a variety of domains: constructing visualization dashboards, performing business intelligence and summarization work, conducting reproducible research, and acting as a command center for all big data frameworks.
If you like reading detailed guides end-to-end, feel free to continue with the download section below. If you prefer a hands-on approach and diving in head-first, check out the links below for additional examples and resources.
The modeling engine outlined in this guide is built on top of the following packages.
| Name | Status |
|------|--------|
| Mungebits2 | |
| Stagerunner | |
| Tundra | |
| Director | |
Syberia is currently released with the following engines.
| Name | Status | Description |
|------|--------|-------------|
| Base | | The base engine that defines routes and controllers. |
| Modeling | | The modeling engine for deploying structured learning problems. |
| Example | | The hello world of modeling projects. Can be used for new projects. |
| Examples | | Some examples from Kaggle and other sources in Syberia. |