Hadoop Spark Interview Questions

Mention one important benefit of using Spark over MapReduce.

Hadoop MapReduce writes intermediate results to disk between steps, which makes multi-step jobs slow. Spark can keep intermediate data in memory and run the whole pipeline in a single application, so results are usually available much sooner. A unified interface for batch processing, SQL, streaming and machine learning is another advantage of Spark over MapReduce.


Is Apache Spark faster than MapReduce?

Yes, in most cases. Spark keeps intermediate data in memory, and the Spark project has cited speed-ups of up to 100 times over Hadoop MapReduce for in-memory workloads and up to 10 times when the data is on disk. The actual gain depends on the workload.


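To run Apache Spark on YARN, does Spark need to be installed on every node of the YARN cluster?

No. Spark runs on top of YARN, so it does not have to be installed on every node of the cluster. It only needs to be available on the machine that submits the application; the Spark runtime is shipped to the YARN containers when the job is submitted.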


What is an RDD?

An RDD (Resilient Distributed Dataset) is Spark's core data abstraction: a collection of elements distributed across the nodes of a cluster. An RDD has the following properties:

Immutable – An RDD cannot be altered once it is created. Operating on an RDD produces a new RDD instead.

Partitioned – The data in an RDD is split into partitions, and the partitions are processed in parallel.

Resilient – If a partition is lost because a node fails, Spark recomputes it on another node from the RDD's lineage (the recorded sequence of transformations that produced it).
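
The three properties can be illustrated with a small toy model in plain Python. This is only a sketch: the ToyRDD class and its methods are hypothetical names invented for this example, they are not part of Spark's API, and real Spark distributes partitions across machines rather than holding them in one process.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self,
                 source: Optional[List[List[int]]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[int], int]] = None) -> None:
        # A root dataset holds its source partitions.
        # A derived dataset holds only its lineage: a parent and a function.
        self._source = source
        self._parent = parent
        self._fn = fn
        if source is not None:
            self.num_partitions = len(source)
        else:
            self.num_partitions = parent.num_partitions

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Immutable: the existing dataset is untouched; a new one is returned.
        return ToyRDD(parent=self, fn=fn)

    def compute_partition(self, index: int) -> List[int]:
        # Partitioned: each partition is computed independently of the others.
        # Resilient: a lost partition can be rebuilt by calling this again,
        # because the lineage (parent + fn) is all that is needed.
        if self._source is not None:
            return list(self._source[index])
        return [self._fn(x) for x in self._parent.compute_partition(index)]

    def collect(self) -> List[int]:
        # Gather every partition into a single list.
        result: List[int] = []
        for i in range(self.num_partitions):
            result.extend(self.compute_partition(i))
        return result


numbers = ToyRDD(source=[[1, 2], [3, 4], [5]])
doubled = numbers.map(lambda x: x * 2)

print(numbers.collect())             # the original data is unchanged
print(doubled.collect())             # the new dataset holds the doubled values
print(doubled.compute_partition(1))  # rebuild one partition from lineage
```

Because a derived ToyRDD stores only its parent and a function, any single partition can be recomputed on demand, which is the idea behind lineage-based recovery.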


Explain transformation functions.

Transformation functions are functions applied to an RDD (Resilient Distributed Dataset) that return a new RDD. Transformations are lazy: Spark only records them and does not compute anything until an action requests a result.


What is the use of the map() and filter() functions in transformations?

The map() function applies a given function to each element of an RDD and returns a new RDD containing the results. The filter() function returns a new RDD containing only those elements of the existing RDD for which a given condition is true.
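
The difference between the two can be shown on an ordinary Python list. This is a minimal sketch of the semantics only: rdd_map and rdd_filter are hypothetical helper names for this example and run locally, not on a cluster. Assuming a PySpark SparkContext named sc is available, the corresponding calls would be sc.parallelize(words).map(len) and sc.parallelize(words).filter(lambda w: len(w) > 4).

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def rdd_map(elements: List[T], fn: Callable[[T], U]) -> List[U]:
    # map: one output element for every input element.
    return [fn(x) for x in elements]


def rdd_filter(elements: List[T], predicate: Callable[[T], bool]) -> List[T]:
    # filter: keep only the elements for which the predicate is true.
    return [x for x in elements if predicate(x)]


words = ["spark", "yarn", "hadoop", "rdd"]

lengths = rdd_map(words, len)
long_words = rdd_filter(words, lambda w: len(w) > 4)

print(lengths)     # [5, 4, 6, 3]
print(long_words)  # ['spark', 'hadoop']
print(words)       # the input list is left unchanged
```

map() always yields as many elements as it receives, while filter() yields the same number or fewer and never changes the elements themselves.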


What are the responsibilities of the Apache Spark Engine?

The Spark Engine is responsible for the following:

Scheduling the application's tasks

Distributing the data and computation across the cluster

Monitoring the running application


What is Streaming in Apache Spark?

Spark Streaming is an extension of the core Spark API that enables the processing of live data streams from different sources. It can ingest data from sources such as Flume and HDFS, and push the processed results to file systems, databases and live dashboards.


Explain the Spark Executor.

When the SparkContext connects to the cluster manager, it acquires executors on the worker nodes. An executor is a process that runs computations and stores data for the application. Once the executors are available, the SparkContext sends tasks to them for execution.
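
The division of labour between the code that hands out tasks and the executors that run them can be sketched with Python's standard thread pool. This is only an analogy under stated assumptions: run_task and run_job are hypothetical names, the threads stand in for executor processes on worker nodes, and nothing here uses Spark.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_task(partition: List[int]) -> int:
    # One task: the work an executor does on a single partition.
    return sum(x * x for x in partition)


def run_job(partitions: List[List[int]], num_executors: int = 2) -> int:
    # The driver side: send one task per partition to the pool of
    # "executors", then combine the partial results.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)


total = run_job([[1, 2], [3, 4], [5]])
print(total)  # 1 + 4 + 9 + 16 + 25 = 55
```

Each partition becomes one task, so the number of partitions, not the number of executors, determines how many tasks are scheduled.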


What is the use of the Spark Driver?

The Spark Driver is the program that creates the SparkContext and connects to the relevant Spark Master. It is responsible for declaring the transformations and actions on RDDs. Depending on the deploy mode, the driver runs either on the machine that submits the application or on a node inside the cluster.