Chris Fregly on the PANCAKE STACK Workshop and Data Pipelines

Chris Fregly's Advanced Apache Spark PANCAKE STACK workshop, Advanced Spark and TensorFlow Meetup, videos, and presentations span a wide range of data science and engineering components one might evaluate or adopt in a Spark-based end-to-end recommendation engine and data pipeline. The workshop is timely for data scientists and engineers looking for an integrated perspective on how to build a Spark data pipeline. The source code is packaged into a single Docker image so attendees can set up quickly and immerse themselves in the interactive workshop rather than spending time on operations and infrastructure. This integrated sandbox environment, or playground, forms the core of the workshop, which has been hosted in several countries around the world, and is available on GitHub. Workshop attendance and feedback, along with continuous refinement of the workshop material, became the basis for Pipeline.IO, Fregly's start-up. The company's concept is to productionize what he calls the last mile of the Apache Spark machine learning pipeline: model refinement, testing, and deployment, handled in a heavily metered and automated manner so data scientists can quickly measure the efficacy of their models and reduce the overall time to optimal model discovery.

InfoQ spoke with Chris Fregly about the PANCAKE STACK workshop and some of the trends occurring in the data pipeline engineering space that he captures in the playground code and the Advanced Spark workshop.

InfoQ: How did the PANCAKE STACK workshop come about, what does it mean, and what were your motivations?

Fregly: The PANCAKE STACK is the next generation of the SMACK Stack, which I co-developed back in 2015 with some colleagues from Mesosphere ('M'esos), Typesafe ('A'kka), Datastax ('C'assandra), and Confluent ('K'afka). I was working for Databricks at the time, so I provided the content for Apache 'S'park as well as the baseline reference app that we used for all demos, based on my playground repo which is now the foundation of my new company, PipelineIO. The extended PANCAKE STACK acronym started off as a joke between me and some old BEA Systems co-workers at a bar in Boulder, Colorado earlier this year. High-altitude drinking is a funny thing sometimes! Plus, the domain name pancake-stack.com was available, so I snagged it and the rest is history!

The workshop is very dynamic. In addition to new demos and datasets that we're continuously adding, we try to adapt each session to the flow and knowledge level of the group. Sometimes we go beyond the tech represented by the acronym; sometimes we spend a half-day on a single letter like 'T'ensorFlow. Sometimes I bring in guest speakers; sometimes I do the whole thing myself. It all depends on the background of the attendees, new release announcements, current trends, the city, speaker availability, and even the venue arrangement.

As far as motivations go, the workshop is very timely and relevant because of the often bleeding-edge technologies used in the stack and people's desire to learn and play with them in an integrated environment. Fortunately, attendees are very resilient and understanding when demos crash and burn. We've all been there, of course.

InfoQ: What is Spark exceptionally good at, what is it really not good at, and how does your workshop demonstrate these attributes?

Fregly: Apache Spark has made big data processing, machine learning, and advanced analytics accessible to the masses. This is awesome. For me personally, Spark has activated neurons in my brain that haven't been activated in years. Topics like statistics and probabilistic data structures are extremely fascinating to me, so I have many machine learning and streaming-approximation demos in the PANCAKE STACK workshop. Other key factors in Spark's success include the accessibility of examples, training videos, demos, meetups, and so on. Early on, the Berkeley AMPLab created AMPCamp, a series of workshops highlighting the various libraries of Spark including Spark SQL, ML, GraphX, and Streaming. Continuing this accessibility trend, Databricks, founded by the creators of Spark, recently released a Community Edition of their Databricks Notebook product, including many Apache Spark training videos and notebooks. This not only increases signups for their product, but also helps the Spark community as a whole.
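To give a concrete feel for the kind of streaming-approximation techniques the workshop demos cover, here is a minimal PySpark sketch of an approximate distinct count, which Spark backs with a HyperLogLog++ sketch. The dataset, column names, and error tolerance are illustrative assumptions, not taken from the workshop materials.

```python
# Minimal PySpark sketch of an approximate distinct-count query.
# The sample data and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("approx-demo").getOrCreate()

# Hypothetical clickstream sample: (user_id, item_id) pairs.
events = spark.createDataFrame(
    [("u1", "i1"), ("u2", "i1"), ("u1", "i2"), ("u3", "i3"), ("u1", "i1")],
    ["user_id", "item_id"],
)

# approx_count_distinct uses a HyperLogLog++ sketch under the hood, trading a
# small, configurable relative error (rsd) for constant memory per group.
approx = events.groupBy("item_id").agg(
    F.approx_count_distinct("user_id", rsd=0.05).alias("approx_unique_users")
)
approx.show()
```

On a toy dataset like this the approximate and exact counts coincide; the payoff comes at scale, where the sketch keeps memory bounded regardless of cardinality.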

InfoQ: You mentioned in a recent workshop that Spark Streaming has some limitations and may not satisfy all use cases. Can you expand on this?

Fregly: Spark Streaming, by nature, is a mini-batch system, unlike Apache Storm, Apache Flink, and other event-based (per-record) streaming systems. As data flows into Spark Streaming, a batch of received data is submitted to the Spark engine as a job. This is similar to a batch job, but with much less data, of course. The argument is that a mini-batch system will provide higher throughput than an event-based system like Storm, at the expense of higher latency for a single event to make its way through the system. Always a fan of giving the developer control, I would prefer Spark Streaming to let developers choose their own tradeoff between throughput and latency by enabling both mini-batch and per-record implementations. And as we know from physics, it's much easier to combine atoms (combine events into mini-batches) than to split atoms (break up mini-batches into individual events). Mini-batches also limit the Complex Event Processing (CEP) functionality offered by event-based streaming systems like Apache Flink, for example. I've also found it difficult to reason through the delivery guarantees and state management of Spark Streaming. Every time I think I have it right, an edge case pops up that invalidates my thinking. This isn't Spark Streaming's fault, per se, as many other systems such as Apache Kafka (Event Source) and Apache Cassandra (Event Sink) are involved in the complete end-to-end streaming workflow, each with its own transactional and replay characteristics.
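To make the mini-batch point concrete, the sketch below shows the classic PySpark DStream pattern: the batch interval passed to StreamingContext sets the latency floor, because received records are buffered and submitted to the engine as one small job per interval rather than being processed as they arrive. The socket source and the 2-second interval are illustrative assumptions, not part of the workshop code.

```python
# Minimal sketch of Spark Streaming's mini-batch model using the DStream API.
# The socket source and batch interval are hypothetical choices.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "mini-batch-demo")
ssc = StreamingContext(sc, batchDuration=2)  # one small Spark job every 2 seconds

# Each 2-second window of received lines becomes a single mini-batch job,
# rather than each record being processed the moment it arrives.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Shrinking the batch interval lowers per-event latency but raises scheduling overhead, which is exactly the throughput-versus-latency tradeoff Fregly describes.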

The last limitation is a lack of consistent support. Unfortunately, a lot of Spark Streaming questions are left unanswered on the Spark mailing list. Again, this is partially due to the complexity of the overall, end-to-end nature of a production-ready Spark Streaming workflow involving many disparate event sources and sinks.

On the positive side, the creator of Spark SQL from Databricks,
