Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine. Some of the popular tools that help scale and improve functionality are Pig, Hive, Oozie, and Spark. Hive continues to work on MapReduce and Tez as-is on clusters that don't have Spark, and the default value of the relevant configuration is unchanged, so existing deployments are not affected. In addition, plugging in Spark at the execution layer keeps code sharing at a maximum and contains the maintenance cost, so the Hive community does not need to make specialized investments for Spark. So, after multiple configuration trials, I was able to configure Hive on Spark, and below are the steps that I followed.

Spark is an open-source data analytics cluster computing framework that's built outside of Hadoop's two-stage MapReduce paradigm but on top of HDFS. Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). By applying a series of transformations such as groupBy and filter, or actions such as count and save, RDDs can be processed and analyzed to accomplish what MapReduce jobs do, without intermediate stages. In fact, many primitive transformations and actions are SQL-oriented. Spark also has accumulators, which are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. For background, see http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/ and http://spark.apache.org/docs/1.0.0/api/java/index.html.

Basic "job succeeded/failed" status as well as progress will be reported as discussed in "Job monitoring". The monitoring class provides functions similar to HadoopJobExecHelper used for MapReduce processing, or TezJobMonitor used for Tez job processing, and will also retrieve and print the top-level exception thrown at execution time in case of job failure. If Spark is run on Mesos or YARN, it is still possible to reconstruct the UI of a finished application through Spark's history server, provided that the application's event logs exist. For more information about Spark monitoring, visit http://spark.apache.org/docs/latest/monitoring.html.

Finally, it seems that the Spark community is in the process of improving/changing the shuffle-related APIs, so this part of the design is subject to change; it can be further investigated and evaluated down the road. In fact, Tez has already deviated from MapReduce practice with respect to union.

On the planning side, MapReduceCompiler compiles a graph of MapReduceTasks and other helper tasks (such as MoveTask) from the logical operator plan. Work units such as MapWork and ReduceWork are MapReduce-oriented concepts, and implementing them with Spark requires traversing the plan and generating Spark constructs (RDDs, functions). Thus, we will have SparkTask, depicting a job that will be executed in a Spark cluster, and SparkWork, describing the plan of such a task. Explain statements will be similar to what Hive users see today, and Hive keeps its reduce-side join as well as map-side join (including map-side hash lookup and map-side sorted merge). With the Spark context object, RDDs corresponding to Hive tables are created, and MapFunction and ReduceFunction (more details below), built from Hive's MapWork and ReduceWork respectively, are applied to the RDDs. To Spark, ReduceFunction is no different from MapFunction, but the function's implementation will be different, made of the operator chain starting from ExecReducer.reduce().
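As a rough illustration of that translation, here is a minimal Scala sketch, not Hive's actual classes: the map side is expressed as a mapPartitions() over the input RDD, the shuffle as a key-based transformation, and the reduce side as another mapPartitions(), with a dummy foreach() triggering execution. The mapFunction and reduceFunction below are hypothetical word-count stand-ins for the functions Hive would build from its operator chains, and the input path is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Sketch of how a MapWork/ReduceWork pair could map onto Spark constructs.
// These are illustrative stand-ins, not Hive on Spark's real implementation.
object SparkPlanSketch {

  // Hypothetical "map side": in Hive on Spark this would wrap the operator
  // chain rooted at the map-side operators; here it is plain word counting.
  def mapFunction(records: Iterator[String]): Iterator[(String, Long)] =
    records.flatMap(_.split("\\s+")).map(word => (word, 1L))

  // Hypothetical "reduce side": consumes grouped records, one group per key.
  def reduceFunction(groups: Iterator[(String, Iterable[Long])]): Iterator[(String, Long)] =
    groups.map { case (key, values) => (key, values.sum) }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("plan-sketch").setMaster("local[*]"))

    val input: RDD[String] = sc.textFile("hdfs:///tmp/input") // table data as an RDD (placeholder path)
    val mapped   = input.mapPartitions(mapFunction)           // "MapWork"
    val shuffled = mapped.groupByKey()                        // shuffle between map and reduce
    val reduced  = shuffled.mapPartitions(reduceFunction)     // "ReduceWork"

    // Job execution is triggered by an action; as described above, a foreach()
    // with a dummy function serves this purpose.
    reduced.foreach(_ => ())

    sc.stop()
  }
}
```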
Hive and Spark are both immensely popular tools in the big data world, and Hive is the best option for performing data analytics on large volumes of data using SQL. Lately I have been working on updating the default execution engine of Hive configured on our EMR cluster. The default execution engine on Hive there is "tez", and I wanted to update it to "spark", which means Hive queries would be submitted as Spark applications, also called Hive on Spark. Running Hive on Spark requires no changes to user queries, and future features (such as new data types, UDFs, logical optimizations, etc.) added to Hive should be automatically available to those users without any customization work in Hive's Spark execution engine. Finally, allowing Hive to run on Spark also has performance benefits. Note that the host from which the Spark application is submitted, or on which spark-shell or pyspark runs, must have a Hive gateway role defined in Cloudera Manager and client configurations deployed.

As discussed above, SparkTask will use SparkWork, which describes the task plan that the Spark job is going to execute. We will keep Hive's existing implementations, although a handful of Hive optimizations are not included in Spark, and there are of course other functional pieces, miscellaneous yet indispensable, such as monitoring, counters, statistics, etc. Reusing the operator trees and putting them in a shared JVM with each other will more than likely cause concurrency and thread-safety issues, so each function will have to perform initialization, row processing, and cleanup in a single method. Presently, a fetch operator is used on the client side to fetch rows from the temporary file (produced by FileSink in the query plan); it is also possible to generate an in-memory RDD instead so that the fetch operator can directly read rows from the RDD. While it's mentioned above that we will use MapReduce primitives to implement SQL semantics in the Spark execution engine, union is one exception: there is an existing UnionWork where a union operator is translated to a work unit, and while this comes for "free" for MapReduce and Tez, we will need to provide an equivalent for Spark. (Tez probably had the same situation.) The "explain" command will show a pattern that Hive users are familiar with.

Secondly, we expect that the integration between Hive and Spark will not always be smooth; functional gaps may be identified and problems may arise. However, this work should not have any impact on other execution engines, and further optimization can be done down the road in an incremental manner as we gain more knowledge and experience with Spark.

Spark provides a faster, more modern alternative to … RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Fortunately, Spark provides a few transformations that are suitable substitutes for MapReduce's shuffle capability, such as partitionBy, groupByKey, and sortByKey; having the capability of selectively choosing the exact shuffling behavior provides opportunities for optimization. groupByKey, for example, clusters the keys in a collection, which naturally fits MapReduce's reducer interface. Some of these are currently not available in the Spark Java API; we expect they will be made available soon with help from the Spark community, which also tracks the details of the shuffle-related improvements.
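To make those shuffle choices concrete, the following Scala sketch (toy key/value data rather than Hive's real operator output; the object name is illustrative only) shows what each transformation guarantees: partitionBy repartitions by key only, groupByKey additionally clusters the values per key, and sortByKey sorts by key without grouping.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Small demonstration of the three shuffle-like transformations named above.
object ShuffleChoices {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-choices").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("b", 3), ("a", 4)))

    // Pure shuffling: rows with the same key land in the same partition,
    // but values are neither grouped nor sorted.
    val partitioned = pairs.partitionBy(new HashPartitioner(2))

    // Shuffling plus grouping: one (key, values) record per key,
    // which resembles the MapReduce reducer interface.
    val grouped = pairs.groupByKey(2) // ("a", [1, 4]), ("b", [2, 3])

    // Shuffling plus sorting by key, but no grouping of values.
    val sorted = pairs.sortByKey(ascending = true, numPartitions = 2)

    println(grouped.collect().map { case (k, v) => s"$k -> ${v.toList}" }.mkString(", "))
    println(sorted.collect().mkString(", "))
    println(partitioned.glom().map(_.length).collect().mkString(", ")) // rows per partition

    sc.stop()
  }
}
```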
There are two related projects in the Spark ecosystem that provide Hive QL support on Spark: Shark and Spark SQL. Spark SQL also supports reading and writing data stored in Apache Hive and can interact with different versions of the Hive metastore, though some Hive features, such as block-level bitmap indexes and virtual columns (used to build indexes), are not covered. Moving to Hive on Spark enabled Seagate to continue processing petabytes of data at scale with significantly lower total cost of ownership, and Hive queries, especially those involving multiple reducer stages, will run faster, thus improving user experience as Tez does.

Currently, for a given user query, Hive's semantic analyzer generates an operator plan that's composed of a graph of logical operators such as TableScanOperator, ReduceSink, FileSink, GroupByOperator, etc. Neither the semantic analyzer nor the logical optimizations are expected to change; Hive will simply display a task execution plan that's similar to the one shown today by "explain", and in fact only a few of Spark's primitives will be used in this design. Similarly to MapFunction, ReduceFunction will be made of the ReduceWork instance from SparkWork. However, the above-mentioned transformations may not behave exactly as Hive needs; again, this can be investigated and implemented as future work. This work should not have any impact on other execution engines, and for other existing components that aren't named out, such as UDFs and custom SerDes, we expect that special considerations are either not needed or insignificant.

Job submission happens through a context object that's instantiated with the user's configuration, and a job can also be run locally. Job execution is triggered by applying a foreach() transformation on the RDDs with a dummy function. Spark's Standalone Mode cluster manager also has its own web UI, and this project will certainly benefit from that. Spark's transformation operators are functional, whereas Hive's operators need to be initialized before being called to process rows and closed when done processing; moreover, if multiple mapper instances exist in a single JVM, one mapper that finishes earlier will prematurely terminate the other. The shuffle transformation, which is used to connect mapper-side operations to reducer-side operations, also requires the keys and values to be serializable.
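To illustrate the lifecycle point, here is a small Scala sketch under stated assumptions: HiveStyleOperator is a made-up stand-in for an operator that must be initialized, fed rows, and closed, and the upper-casing it performs merely simulates row processing. Driving all three steps from one mapPartitions() call gives each partition its own operator instance, which also sidesteps the shared-JVM concerns mentioned earlier.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: run an initialize -> processRow -> close lifecycle entirely inside
// a single mapPartitions() call, one operator instance per partition.
object OperatorLifecycleSketch {

  // Toy operator with the lifecycle described above (not a real Hive class).
  class HiveStyleOperator {
    private var initialized = false
    private val buffer = scala.collection.mutable.ArrayBuffer.empty[String]

    def initialize(): Unit = { initialized = true }
    def processRow(row: String): Unit = {
      require(initialized, "operator used before initialization")
      buffer += row.toUpperCase // simulated row processing
    }
    def close(): Iterator[String] = buffer.iterator
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("operator-lifecycle").setMaster("local[2]"))
    val rows = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

    // All three lifecycle steps happen inside one function, once per partition.
    val processed = rows.mapPartitions { it =>
      val op = new HiveStyleOperator()
      op.initialize()
      it.foreach(op.processRow)
      op.close()
    }

    // As in the design, a dummy foreach() triggers the actual execution.
    processed.foreach(_ => ())
    sc.stop()
  }
}
```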
On the differences between Spark SQL and this design: Spark SQL is built around HiveContext, which inherits from SQLContext, and once it has read Hive's metadata from the metastore it can access the data of all Hive tables. Hive and Spark remain different products built for different purposes, with Hive handling concerns such as heterogeneous input formats and schema evolution, and the Hive project allows multiple backends to coexist.

Functionally, Hive on Spark should be equivalent to Hive running on MapReduce or Tez, or at least near to it. There will be a lot of common logic between Tez and Spark, so we will extract the common code into a shareable form. The shuffle transformations take a number of partitions, which basically dictates the parallelism of the reduce side. Hive's groupBy doesn't require the key to be sorted (sorting is needed only for constructs such as SQL order by); among the Spark primitives, groupByKey provides grouping without requiring sorted keys, while sortByKey does shuffling plus sorting but provides no grouping. Tez, for its part, generates a TezTask that combines otherwise multiple MapReduce tasks (a union, for instance, can take up to three MapReduce jobs today), and Spark will need similar treatment. While the needed extension seems easy in Scala, it is harder with the current Java API. A few issues like these have surfaced in the initial prototyping and have been identified for follow-up.

Counters (as in the MapReduce world) and sums can be implemented with Spark's accumulators, since accumulators are only "added" to through an associative operation and can therefore be efficiently supported in parallel; the resulting metrics can be monitored via the SparkListener APIs, displayed in the WebUI that Spark provides for each SparkContext while the application is running, or persisted for a finished application.

For the hands-on steps: upload the Spark jars to a directory on HDFS (for example: hdfs:///xxxx:8020/spark-jars), set hive.execution.engine=spark (this can also be done temporarily, for a specific query, from within the session), then run a query and check that it is being submitted as a Spark application. Now, when we have our Metastore running, let's define some trivial Spark job example.
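A minimal sketch of such a job follows, assuming a Spark 1.x style deployment (the API era this document's links point to) launched with spark-submit on a host where hive-site.xml is on the classpath. It uses Spark SQL's HiveContext, which inherits from SQLContext; newer Spark versions would use SparkSession with Hive support instead. The table name src and the application name are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Trivial Spark job against the Hive metastore using the Spark 1.x HiveContext.
// "src" is a placeholder table name; substitute a table from your metastore.
object HiveMetastoreExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-metastore-example"))
    val hiveContext = new HiveContext(sc) // inherits from SQLContext

    // Once the metadata has been read from the metastore, Hive tables can be
    // queried like any other data source.
    hiveContext.sql("SHOW TABLES").collect().foreach(println)
    hiveContext.sql("SELECT key, count(*) AS cnt FROM src GROUP BY key")
      .collect()
      .take(10)
      .foreach(println)

    sc.stop()
  }
}
```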