In the last few weeks, we’ve seen a lot of activity and momentum centered on Apache Spark. At the Spark Summit in San Francisco, we announced an enterprise-grade Spark distribution that runs on the MapR Platform, and we received a lot of interest at this event. Customers are flocking to Spark as their primary compute engine for big data use cases, and we received further proof of this last week when we ran an “Ask Us Anything about Spark” forum in the Converge Community. There were some great discussions that took place, where our Spark experts answered questions from customers and partners. Here is a summary of some of these discussions:
Question 1: How do I write back into MapR Streams using Spark (Java)?
I am now able to read from MapR Streams using Spark. But now I want to write back into them using Spark (and Java). There is barely any documentation available online for Scala, and there isn’t any available for Java. I did find a “sendToKafka” function mentioned in some Scala code, but the same isn’t working for Java (because it writes a DStream and I am working with a JavaDStream). All I am looking for is a Java doc for MapR Streams and Spark, or just a function that lets me write a JavaDStream into MapR Streams, preferably using Java.
Answer 1: There isn’t a direct method to send a complete DStream to Kafka. The design pattern is to run a .foreach() against each RDD from the incoming DStream (via .foreachRDD()), use the MapR Streams (Kafka) API to instantiate a Producer, and then (typically) call Producer.send() on each record inside that .foreach(). In Java, you iterate over the records in the same way, calling Producer.send() for each message.
Answer 2: At this time (Spark 1.6.1), there is only a Scala producer in the org.apache.spark.streaming.kafka.producer package.
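For reference, here is a minimal Java sketch of that pattern. It assumes Java 8 lambdas, String keys/values, and a hypothetical MapR Streams topic path (/sample-stream:my-topic); it creates one producer per RDD partition and calls send() for each record:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.streaming.api.java.JavaDStream;

public class StreamWriter {

    // Hypothetical topic path; MapR Streams topics are addressed as /<stream-path>:<topic-name>.
    private static final String TOPIC = "/sample-stream:my-topic";

    public static void writeToStream(JavaDStream<String> messages) {
        messages.foreachRDD(rdd ->
            rdd.foreachPartition(records -> {
                // Create one producer per partition rather than per record.
                Properties props = new Properties();
                props.put("key.serializer",
                        "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer",
                        "org.apache.kafka.common.serialization.StringSerializer");
                // bootstrap.servers is not needed when writing to MapR Streams;
                // add it here if you are targeting an Apache Kafka cluster instead.
                KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                while (records.hasNext()) {
                    producer.send(new ProducerRecord<>(TOPIC, records.next()));
                }
                producer.close();
            })
        );
    }
}
```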
Question 2: We are configuring security for Spark 1.5.2 and are facing a few challenges, as outlined below:
None of the Spark web UIs are moving from HTTP to HTTPS (e.g., port 4040, the Spark History Server, etc.). While launching Spark SQL and spark-shell, we are facing a lot of issues with respect to the SQLContext, the Hive metastore, Sentry configuration, etc. Please provide detailed instructions/steps to follow when enabling security for Spark. The MapR 5.1 cluster is a 3-node secure cluster with native security enabled for all components. Our Spark cluster is running in Spark-on-YARN mode on MapR 5.1.
Answer: If the cluster has been configured as a secure cluster, there is no additional configuration you must change. The “configure.sh” command you (or the MapR installer) executed during the installation will configure YARN security for you. By extension, Spark executed on YARN will be secure as well.
Question 3: In our application, we have a large amount of “streaming data” (i.e., CSV files arriving at five-minute intervals). We want to store all the data up to some age limit and form real-time RDD views that the visualization layer will access via Drill.
For the Spark app, what is the best storage method: a Spark DataFrame, a MapR-DB table, or just a Parquet file? All are accessible to Spark and Drill, but if we are just doing regular column lookups based on a few sub-keys, which one is preferable?
Answer 1: Based on the limited information in this thread, and given that your common lookups are based on a set of known columns, it would appear that you may want to store these as Parquet files; Drill is optimized for reading Parquet. Please share more about your SLAs and whether you plan to keep the data in memory (for example, can it be held in memory for an extended period of time?), which may point to using Spark DataFrames instead.
Answer 2: If you need to update your data, then HBase will be faster for updates. If you mainly want to read your data and not update it, then Parquet is optimized for fast columnar reads.
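As a rough illustration of the Parquet route, here is a minimal Java sketch using the Spark 1.5/1.6-style DataFrame API. The input and output paths are hypothetical, and reading CSV this way assumes the spark-csv package is on the classpath:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

public class CsvToParquet {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("csv-to-parquet");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the incoming five-minute CSV batches into a DataFrame
        // (requires the spark-csv package; paths are hypothetical).
        DataFrame events = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load("/data/incoming/*.csv");

        // Persist as columnar Parquet on MapR-FS so Drill can query it directly.
        events.write()
              .mode(SaveMode.Overwrite)
              .parquet("/data/parquet/events");

        sc.stop();
    }
}
```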
Question 4: We would like to use Spark from a standalone Java application. The Java application should generate a temporary table and start the Hive Thrift Server. Which classes should we use to connect from the Java application to Spark? SparkLauncher? Is there any other way (other than SparkLauncher) to succeed without using spark-submit?
Answer: This will very much depend on where the Spark application will be executed. I assume the Spark application will be executed on the MapR cluster. If so, spark-submit is the method to do this for Java applications. Java applications are not as flexible in how they are executed on a Spark cluster.
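If you do want to drive the job from your standalone Java application, a hedged sketch using SparkLauncher is shown below; note that SparkLauncher assembles and runs a spark-submit child process under the hood, so spark-submit is still involved. The Spark home, jar path, and main class shown are hypothetical:

```java
import org.apache.spark.launcher.SparkLauncher;

public class LaunchFromJava {
    public static void main(String[] args) throws Exception {
        // SparkLauncher programmatically builds and runs a spark-submit child process.
        Process spark = new SparkLauncher()
                .setSparkHome("/opt/mapr/spark/spark-1.6.1")   // hypothetical Spark install path
                .setAppResource("/apps/my-spark-app.jar")       // hypothetical application jar
                .setMainClass("com.example.ThriftServerApp")    // hypothetical main class
                .setMaster("yarn-client")
                .setConf("spark.executor.memory", "2g")
                .launch();

        int exitCode = spark.waitFor();
        System.out.println("Spark application exited with code " + exitCode);
    }
}
```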
Question 5: We want to load data from an HBase table and convert it into a DataFrame to perform aggregations. Since Scala case classes have a limitation of 22 parameters, how can we create the schema when the number of columns is more than 22? Currently, we have created a Hive external table and query it using HiveContext in order to get a DataFrame. Is there any way to create a DataFrame from an RDD by directly scanning HBase?
Answer: When you have more than 22 columns, you can programmatically specify the schema. A DataFrame can be created programmatically in three steps:
1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in Step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
You cannot currently create a DataFrame by scanning HBase directly with the current release, and there are no released modules that provide this yet.
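A minimal Java sketch of those three steps is below. It assumes the source has already been read into an RDD of delimited strings (directly scanning HBase is not covered, per the answer above), and the column names and all-string types are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ProgrammaticSchema {

    // Builds a DataFrame with an arbitrary number of columns;
    // the 22-field case-class limit does not apply here.
    public static DataFrame toDataFrame(SQLContext sqlContext,
                                        JavaRDD<String> lines,
                                        String[] columnNames) {
        // Step 1: create an RDD of Rows from the original RDD.
        JavaRDD<Row> rows = lines.map(line -> RowFactory.create((Object[]) line.split(",")));

        // Step 2: create a StructType matching the structure of those Rows.
        List<StructField> fields = new ArrayList<>();
        for (String name : columnNames) {
            fields.add(DataTypes.createStructField(name, DataTypes.StringType, true));
        }
        StructType schema = DataTypes.createStructType(fields);

        // Step 3: apply the schema via createDataFrame.
        return sqlContext.createDataFrame(rows, schema);
    }
}
```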
Question 6: When I do aggregations like sum using DataFrames, I encounter a double-precision issue. For example, instead of 913.76 it returns 913.7600000000001, and instead of 6796.25 it returns 6796.249999999995. I am using BigDecimal’s setScale(2, BigDecimal.RoundingMode.HALF_UP) method to round off the value. Is there any way to handle this precision problem without applying another round-off function?
Answer: The loss of precision is related to the conversion between data types in Java. This is a common occurrence, and mathematically rounding the number is the most common solution. I would check the real data type of the field that produced the RDD, and match it exactly in the DataFrame schema in order to reduce precision errors.
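If the underlying values are really fixed-scale decimals (for example, currency amounts), one option, sketched here rather than taken from the thread, is to cast the column to a DecimalType before aggregating, so the sum is carried out in exact decimal arithmetic instead of binary floating point. The column name "amount" and the precision/scale are hypothetical:

```java
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;

public class ExactSums {
    // Assumes a DataFrame with a double column named "amount" (hypothetical).
    public static DataFrame sumExactly(DataFrame df) {
        // Cast to a fixed-precision decimal, then aggregate on the decimal column.
        return df.withColumn("amount_dec",
                        df.col("amount").cast(DataTypes.createDecimalType(12, 2)))
                 .agg(sum("amount_dec").alias("total"));
    }
}
```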
Question 7: We have a cluster that was created a few years ago, before Spark became popular and widely used. Now we are facing a problem with limited disk space for the Spark scratch directory (following the MapR documentation’s suggestion of a small amount of disk for the OS and the rest for MapR-FS is not so good for Spark). As we already have data on MapR-FS, it’s very slow/expensive to take disks away from MapR-FS for Spark scratch/tmp. Can we use MapR-FS local volumes as scratch space on MapR Community Edition 5.1?
There is documentation on how to configure a scratch directory for Spark Standalone (MapR 5.1 Documentation). Is there any corresponding documentation for Spark on YARN?
Answer: Yes, you can use MapR-FS for Spark local directories, as documented in the 5.1 documentation. In the MapR Community Edition, this