Issuing Corrections (Benchmarks are Hard)
My work at Saturn Cloud involves using and learning a lot of different technologies, particularly around scalable and accelerated data science in Python. This can involve benchmarking different tools for specific workloads to show why a customer would want to use Saturn Cloud’s platform over another platform. If I’ve learned anything through this, it’s that benchmarks are hard! This post covers one correction I had to make recently and some reflection on benchmarks in general.
The correction
I recently found a bug in one of these benchmarks I wrote in July comparing the performance of a machine learning grid search across a few platforms: scikit-learn on a single machine, Apache Spark on a cluster, and Dask on a cluster. The bug was in the Dask portion which made it run 10x faster than it should have run 😱. Fortunately the conclusions of the article remain the same as Dask is still faster than the other two options, but the magnitude was initially overstated before I found and fixed the bug. The tagline initially was:
Dask improves scikit-learn parameter search speed by over 100x, and Spark by over 40x.
and then needed to be revised to:
Dask improves scikit-learn parameter search speed by over 16x, and Spark by over 4x.
You can see more of the differences in the runtime plots here:
All of this due to one missing argument! More on that later.
The numbers do make more sense now that I’ve had more experience with Dask (I’ve been working with Dask daily for the last few months after the article was initially published). You would expect the speedups when moving from single-node scikit-learn to Dask in parallel to be roughly equivalent to the number of machines because this is one of those “embarassingly parallel” problems — just train the models across more cores. Spark is a different situation because its grid search trains one model at a time and only the individual model training is parallelized across the cluster.
What went wrong
I used a scikit-learn Pipeline
with several preprocessing steps. One step was using a ColumnTransformer
to scale some numeric features and one-hot encode other categorical features. Here I’m just isolating the part where I found the bug, but you can refer to the article for the full code.
This is how you do it with scikit-learn:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
pipeline = Pipeline(steps=[
('preprocess', ColumnTransformer(transformers=[
('num', StandardScaler(), numeric_feat),
('cat', OneHotEncoder(handle_unknown='ignore',
sparse=False), categorical_feat),
])),
('clf', ElasticNet(normalize=False, max_iter=100)),
])
With Dask, I had to use dask-ml’s Categorizer
and DummyEncoder
for one-hot encoding, because dask-ml’s OneHotEncoder
doesn’t currently support ignoring unknown categories when passing through new data (the handle_unknown
argument).
The pipeline had to be re-arranged a bit with Dask because Categorizer
and DummyEncoder
need to be called outside of the ColumnTransformer
:
from dask_ml.compose import ColumnTransformer
from dask_ml.preprocessing import StandardScaler, DummyEncoder, Categorizer
pipeline = Pipeline(steps=[
('categorize', Categorizer(columns=categorical_feat)),
('onehot', DummyEncoder(columns=categorical_feat)),
('scale', ColumnTransformer(
transformers=[('num', StandardScaler(), numeric_feat)],
)),
('clf', ElasticNet(normalize=False, max_iter=100)),
])
Can you spot the bug here? It’s rather subtle but very important. The key is in this part:
('scale', ColumnTransformer(
transformers=[('num', StandardScaler(), numeric_feat)],
))
The ColumnTransformer class in dask-ml, which mirrors the functionality of the scikit-learn class, by default only includes the columns specified in the transformers list. This means that all of my categorical features were being excluded 😭. No wonder it was super fast! The fix is to use the remainder
argument like so:
('scale', ColumnTransformer(
transformers=[('num', StandardScaler(), numeric_feat)],
remainder='passthrough',
))
This tells dask-ml (or scikit-learn) to still keep all the other features, even if they weren’t explictly listed for transformation.
In the end it was this one argument that caused my runtime to be 10x faster than it should have been. Hence I will throw in another adage here:
Read the docstrings!!!
Had I thoroughly read through all the arguments of ColumnTransformer
, I probably would have realized I needed to set remainder='passthrough'
.
All benchmarks are wrong, but some are useful
My experiences reinforce an idea I had already held, which is adapted from the famous George Box quote:
All benchmarks are wrong, but some are useful
And by this I mean that benchmarks and comparisons are always written with a bias, no matter how careful the author is to remain unbiased. So when I saw that Dask was 40x faster than Spark, I sort of rationalized why in my head rather than taking a more critical look at how this could be the case. I recently gave a talk at the PyData Global 2020 conference comparing various parallel computing tools, and I made sure to mention at the beginning that my attempt is to give a fair shake, but I work for a company built around Dask, so even my opinions will be biased to some degree.
I still believe benchmarks are useful, because even with the fixed version of this grid search example Dask was shown to be faster than Spark. But any reader of a benchmark study should critically evaluate the code and runtimes — or even better try to run it themselves! This is why we have always made sure to published reproducible notebooks with each benchmark, and why I felt it was important to explain what happened with this specific correction.
Happy benchmarking!