In the last few weeks, we’ve seen a lot of activity and momentum centered on Apache Spark. At the Spark Summit in San Francisco, we announced an enterprise-grade Spark distribution that runs on the MapR Platform, and we received a lot of interest at this event. Customers are flocking to Spark as their primary compute engine for big data use cases, and we received further proof of this last week when we ran an “Ask Us Anything about Spark” forum in the Converge Community. There were some great discussions that took place, where our Spark experts answered questions from customers and partners. Here is a summary of some of these discussions:
Question 1: How do I write back into MapR Streams using Spark (Java)?
I am now able to read from MapR Streams using Spark. But now I want to write back into them using Spark (and Java). There is barely any documentation available online for Scala, and there isn’t any available for Java. I did find a “sendToKafka” function mentioned in some Scala code, but the same isn’t working for Java (because it writes a DStream and I am working with a JavaDStream). All I am looking for is a Java doc for MapR Streams and Spark, or just a function that lets me write a JavaDStream into MapR Streams, preferably using Java.
Answer 1: There isn’t a direct method to send a complete DStream to Kafka. The design pattern is to run a .foreach() against each RDD from the incoming DStream (via .foreachRDD()), use the MapR Streams (Kafka) API to instantiate a Producer, and then (typically) call Producer.send() on each record inside that .foreach(). In Java, you iterate over the records in the same way, calling Producer.send() for each message.
Answer 2: At this time (Spark 1.6.1), there is only a Scala producer in the org.apache.spark.streaming.kafka.producer package.
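For reference, here is a minimal Java sketch of that pattern. It assumes Java 8 lambdas, String keys/values, and a hypothetical MapR Streams topic path (/sample-stream:my-topic); it creates one producer per RDD partition and calls send() for each record:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.streaming.api.java.JavaDStream;

public class StreamWriter {

    // Hypothetical topic path; MapR Streams topics are addressed as /<stream-path>:<topic-name>.
    private static final String TOPIC = "/sample-stream:my-topic";

    public static void writeToStream(JavaDStream<String> messages) {
        messages.foreachRDD(rdd ->
            rdd.foreachPartition(records -> {
                // Create one producer per partition rather than per record.
                Properties props = new Properties();
                props.put("key.serializer",
                        "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer",
                        "org.apache.kafka.common.serialization.StringSerializer");
                // bootstrap.servers is not needed when writing to MapR Streams;
                // add it here if you are targeting an Apache Kafka cluster instead.
                KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                while (records.hasNext()) {
                    producer.send(new ProducerRecord<>(TOPIC, records.next()));
                }
                producer.close();
            })
        );
    }
}
```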
Question 2: We are configuring security for Spark 1.5.2 and are facing a few challenges, as outlined below:
None of the Spark web UIs are moving from HTTP to HTTPS (e.g., port 4040, the Spark History Server, etc.). While launching Spark SQL and spark-shell, we are facing a lot of issues with respect to the SQLContext, the Hive metastore, Sentry configuration, etc. Please provide detailed instructions/steps to follow when enabling security for Spark. The MapR 5.1 cluster is a 3-node secure cluster with native security enabled for all components. Our Spark cluster is running in Spark-on-YARN mode on MapR 5.1.
Answer: If the cluster has been configured as a secure cluster, there is no additional configuration you must change. The “configure.sh” command you (or the MapR installer) executed during the installation will configure YARN security for you. By extension, Spark executed on YARN will be secure as well.
Question 3: In our application, we have a large amount of “streaming data” (i.e., CSV files arriving at five-minute intervals). We want to store all the data up to some age limit and form real-time RDD views that the visualization layer will access via Drill.
For the Spark app, what is the best storage method: a Spark DataFrame, a MapR-DB table, or just a Parquet file? All are accessible to Spark and Drill, but if we are just doing regular column lookups based on a few sub-keys, which one is preferable?
Answer 1: Based on the limited information in this thread, and given that your common lookups are based on a set of known columns, it would appear that you may want to store these as Parquet files; Drill is optimized for reading Parquet. Please share more about your SLAs and whether you plan to keep the data in memory (for example, can it be held in memory for an extended period of time?), which may point to using Spark DataFrames instead.
Answer 2: If you need to update your data, then HBase will be faster for updates. If you mainly want to read your data and not update it, then Parquet is optimized for fast columnar reads.
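As a rough illustration of the Parquet route, here is a minimal Java sketch using the Spark 1.5/1.6-style DataFrame API. The input and output paths are hypothetical, and reading CSV this way assumes the spark-csv package is on the classpath:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

public class CsvToParquet {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("csv-to-parquet");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the incoming five-minute CSV batches into a DataFrame
        // (requires the spark-csv package; paths are hypothetical).
        DataFrame events = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load("/data/incoming/*.csv");

        // Persist as columnar Parquet on MapR-FS so Drill can query it directly.
        events.write()
              .mode(SaveMode.Overwrite)
              .parquet("/data/parquet/events");

        sc.stop();
    }
}
```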
Question 4: We would like to use Spark from a standalone Java application. The Java application should generate a temporary table and start the Hive Thrift Server. Which classes should we use to connect from the Java application to Spark? SparkLauncher? Is there any other way (other than SparkLauncher) to succeed without using spark-submit?
Answer: This will very much depend on where the Spark application will be executed. I assume the Spark application will be executed on the MapR cluster. If so, spark-submit is the method to do this for Java applications. Java applications are not as flexible in how they are executed on a Spark cluster.
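If you do want to drive the job from your standalone Java application, a hedged sketch using SparkLauncher is shown below; note that SparkLauncher assembles and runs a spark-submit child process under the hood, so spark-submit is still involved. The Spark home, jar path, and main class shown are hypothetical:

```java
import org.apache.spark.launcher.SparkLauncher;

public class LaunchFromJava {
    public static void main(String[] args) throws Exception {
        // SparkLauncher programmatically builds and runs a spark-submit child process.
        Process spark = new SparkLauncher()
                .setSparkHome("/opt/mapr/spark/spark-1.6.1")   // hypothetical Spark install path
                .setAppResource("/apps/my-spark-app.jar")       // hypothetical application jar
                .setMainClass("com.example.ThriftServerApp")    // hypothetical main class
                .setMaster("yarn-client")
                .setConf("spark.executor.memory", "2g")
                .launch();

        int exitCode = spark.waitFor();
        System.out.println("Spark application exited with code " + exitCode);
    }
}
```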
Question 5: We want to load data from an HBase table and convert it into a DataFrame to perform aggregations. Since Scala case classes have a limitation of 22 parameters, how can we create the schema when the number of columns is more than 22? Currently, we have created a Hive external table and query it using HiveContext in order to get a DataFrame. Is there any way to create a DataFrame from an RDD by directly scanning HBase?
Answer: When you have more than 22 columns, you can programmatically specify the schema. A DataFrame can be created programmatically in three steps:
1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in Step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
You cannot currently create a DataFrame by scanning HBase directly with the current release, and there are no released modules that provide this yet.
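A minimal Java sketch of those three steps is below. It assumes the source has already been read into an RDD of delimited strings (directly scanning HBase is not covered, per the answer above), and the column names and all-string types are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ProgrammaticSchema {

    // Builds a DataFrame with an arbitrary number of columns;
    // the 22-field case-class limit does not apply here.
    public static DataFrame toDataFrame(SQLContext sqlContext,
                                        JavaRDD<String> lines,
                                        String[] columnNames) {
        // Step 1: create an RDD of Rows from the original RDD.
        JavaRDD<Row> rows = lines.map(line -> RowFactory.create((Object[]) line.split(",")));

        // Step 2: create a StructType matching the structure of those Rows.
        List<StructField> fields = new ArrayList<>();
        for (String name : columnNames) {
            fields.add(DataTypes.createStructField(name, DataTypes.StringType, true));
        }
        StructType schema = DataTypes.createStructType(fields);

        // Step 3: apply the schema via createDataFrame.
        return sqlContext.createDataFrame(rows, schema);
    }
}
```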
Question 6: When I do aggregations like sum using DataFrames, I encounter a double-precision issue. For example, instead of 913.76 it returns 913.7600000000001, and instead of 6796.25 it returns 6796.249999999995. I am using BigDecimal’s setScale(2, BigDecimal.RoundingMode.HALF_UP) method to round off the value. Is there any way to handle this precision problem without applying another round-off function?
Answer: The loss of precision is related to the conversion between data types in Java. This is a common occurrence, and mathematically rounding the number is the most common solution. I would check the real data type of the field that produced the RDD, and match it exactly in the DataFrame schema in order to reduce precision errors.
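If the underlying values are really fixed-scale decimals (for example, currency amounts), one option, sketched here rather than taken from the thread, is to cast the column to a DecimalType before aggregating, so the sum is carried out in exact decimal arithmetic instead of binary floating point. The column name "amount" and the precision/scale are hypothetical:

```java
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;

public class ExactSums {
    // Assumes a DataFrame with a double column named "amount" (hypothetical).
    public static DataFrame sumExactly(DataFrame df) {
        // Cast to a fixed-precision decimal, then aggregate on the decimal column.
        return df.withColumn("amount_dec",
                        df.col("amount").cast(DataTypes.createDecimalType(12, 2)))
                 .agg(sum("amount_dec").alias("total"));
    }
}
```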
Question 7: We have a cluster that was created a few years ago, before Spark became popular and widely used. Now we are facing a problem with limited disk space for the Spark scratch directory (following the MapR documentation’s suggestion of a small amount of disk for the OS and the rest for MapR-FS is not so good for Spark). As we already have data on MapR-FS, it’s very slow/expensive to take disks away from MapR-FS for Spark scratch/tmp. Can we use MapR-FS local volumes as scratch space on MapR Community Edition 5.1?
There is documentation on how to configure a scratch directory for Spark Standalone (MapR 5.1 Documentation). Is there any corresponding documentation for Spark on YARN?
Answer: Yes, you can use MapR-FS for Spark local directories, as documented in the 5.1 documentation. In the MapR Community Edition, this