The Quest for Hidden Treasure: An Apache Spark Connector for the Riak NoSQL data ...

This is a guest blog from our friends at Basho. Pavel Hardak is a director of product management at Basho.

This article introduces the Riak Spark Connector, an open source library that bridges the gap between Apache Spark and the Riak NoSQL database. It brings the full power of Apache Spark to operational data managed in distributed Riak clusters. With the Riak Spark Connector, Riak users now have an integrated, scalable solution for Big Data analytics, and Spark users now have a resilient, highly available datastore.

About Riak

Riak is an open source, distributed NoSQL database developed and supported by Basho Technologies. Basho offers two major products: Riak KV (Key Value) and Riak TS (Time Series). Both products share the same core codebase but are tuned for different use cases. Riak KV is a highly resilient, scalable key-value store. It is known for its ability to scale up and down in a linear fashion and to handle huge volumes of reads, updates and writes with low latency while remaining extremely reliable and fault tolerant. More recently, Riak TS was introduced, optimized specifically for time series data. It adds very fast bulk writes and very efficient “time slice” read queries, and it supports a subset of the SQL language over Riak TS tables.

Introducing the Riak Connector for Apache Spark

We have found that many leading organizations use a mix of NoSQL and SQL database products in their infrastructure, as each one has specific advantages depending on the use case. In the past, some databases were used more for analytical workloads while others were used for operational ones. As modern NoSQL databases like Riak gain new capabilities, they are being adopted for additional use cases such as IoT, metrics, and edge-device analytics. To make such tasks easier, Basho has created the Riak Spark Connector, as we believe that Apache Spark is currently the best technology choice to use alongside Riak. Basho selected Spark for this development effort not only because of customer and market demand but also because Spark and Riak share major design principles: high performance, scalability, resiliency and operational simplicity.

Implementing the Apache Spark Connector for Riak

Modeled using principles from the “AWS Dynamo” paper, Riak KV buckets are good for scenarios that require frequent, small operations in near real-time, especially workloads that mix reads, writes, and updates: something which might cause data corruption in some distributed databases or slow them to a “crawl” under bigger workloads. In Riak, each data item is replicated on several nodes, which allows the database to process a huge number of operations with very low latency while having unique anti-corruption and conflict-resolution mechanisms. However, integration with Apache Spark requires a very different mode of operation: extracting large amounts of data in bulk, so that Spark can do its “magic” in memory over the whole data set. One approach to this challenge is to create a myriad of Spark workers, each asking for several data items. This approach works well with Riak, but it creates unacceptable overhead on the Spark side.

Another option is using Riak’s built-in secondary index query (2i). In this approach, the user’s application contacts any Riak node with a query. That node becomes the “coordinating node”: it queries all other relevant Riak nodes, collects the required keys and streams them back to the user application. The user application then loops over the keys to retrieve the values. Alas, we found that queries with a bigger result set could overload the coordinating node. Again, not a good result, so we had to teach Riak new tricks.

The solution was found in enhancing the 2i query with a smart Coverage Plan and Parallel Extract APIs. In the new approach, the user application still contacts the coordinating node, but this time, instead of doing all the work, the node returns the locations of the data using cluster replication and availability information. Then “N” Spark workers open “N” parallel connections to different nodes, which allows the application to retrieve the desired dataset “N” times faster without generating “hot spots”. To make it even faster, we implemented a special type of bulk query, called a “full bucket read”, which extracts the whole logical bucket without the need for a query condition. It also returns both keys and values, saving another round-trip to the server.
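
To make the idea concrete, here is a minimal toy model of a coverage plan followed by parallel reads. It is only a sketch under stated assumptions: the cluster is an in-memory map, partitions are assigned to nodes round-robin, and every name in it (CoveragePlanSketch, CoverageEntry, coveragePlan, readEntry, fullBucketRead) is hypothetical rather than part of the Riak client or the connector API.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Toy model only: all names are hypothetical, not the Riak or connector API.
object CoveragePlanSketch {
  // One entry of the plan: which node should serve which partition.
  final case class CoverageEntry(node: String, partition: Int)

  // In-memory stand-in for a cluster: partition -> (key -> value).
  val cluster: Map[Int, Map[String, String]] = Map(
    0 -> Map("k1" -> "a", "k2" -> "b"),
    1 -> Map("k3" -> "c"),
    2 -> Map("k4" -> "d", "k5" -> "e")
  )

  // The coordinating node only answers "where is the data?".
  // Assumption: partitions are spread over the nodes round-robin.
  def coveragePlan(nodes: Vector[String]): Vector[CoverageEntry] =
    cluster.keys.toVector.sorted.zipWithIndex.map { case (partition, i) =>
      CoverageEntry(nodes(i % nodes.size), partition)
    }

  // A worker reads one plan entry and gets keys and values together,
  // so no second round-trip per key is needed.
  def readEntry(entry: CoverageEntry): Vector[(String, String)] =
    cluster(entry.partition).toVector

  // "Full bucket read": one parallel read per plan entry, no query condition.
  def fullBucketRead(nodes: Vector[String]): Vector[(String, String)] = {
    val reads = coveragePlan(nodes).map(entry => Future(readEntry(entry)))
    Await.result(Future.sequence(reads), 10.seconds).flatten
  }
}
```

Calling CoveragePlanSketch.fullBucketRead(Vector("n1", "n2")) returns all five key-value pairs. The point of the design is that the coordinator hands out locations only, so the data itself never funnels through a single node.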

The strength of a Riak KV bucket is its ability to store unstructured data in a schema-less architecture, with the “values” being opaque. But for many Spark use cases, data must be mapped into a record with Scala or Java types. Fortunately, many Riak applications use JSON, which allows Spark developers to easily convert it into a Spark DataFrame by providing a user-defined schema. The conversion happens “on the fly” and makes it easier for a Spark programmer to work with the retrieved data.

import com.basho.riak.client.core.query.Namespace
import com.basho.riak.spark._

// Create the SQLContext first; its implicits (needed for toDF) can only be imported afterwards
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Target type for the JSON values stored in the bucket
case class UserData(user_id: String, name: String, age: Int, category: String)

val kv_bucket_name = new Namespace("test-data")
// Full bucket read, with each value mapped to UserData
val riakRdd = sc.riakBucket[UserData](kv_bucket_name).queryAll()
val df = riakRdd.toDF()

// Column names follow the UserData fields
df.where(df("age") >= 50).select("user_id", "name")
df.groupBy("category").count

Riak TS meets Spark SQL

Being a distributed, master-less, highly available and linearly scalable NoSQL datastore, Riak TS adds a number of SQL-like capabilities. It includes a DDL for tables (yeah, CREATE TABLE …) with named attributes and data types, primary keys (used both for local indexing and clustering), a subset of SQL query language with filters and
