SPARK Interview Preparation Questions

  1. Understand the Spark architecture: To be productive with Spark, it’s essential to understand the Spark architecture and its components like RDD, DataFrame, Dataset, and Spark SQL. You should also understand the difference between transformation and action, which will help you optimize your code.
  2. Set up a Spark cluster: To run Spark applications, you need to set up a Spark cluster. You should know how to configure and manage a cluster, allocate resources, and ensure optimal performance.
  3. Spark APIs: Spark provides APIs for multiple programming languages, including Scala, Java, Python, and R. You should be familiar with at least one of these languages and their corresponding APIs to work with Spark.
  4. Data ingestion: You should know how to ingest data into Spark from different sources like HDFS, S3, or Kafka. It’s essential to choose the right file format and serialization method for efficient processing.
  5. Data processing: Spark provides multiple APIs to process data, including RDD, DataFrame, and Dataset. You should know how to use these APIs to perform common data processing tasks like filtering, sorting, aggregating, and joining data (a minimal ingestion-and-processing sketch follows this list).
  6. Optimization techniques: Spark is designed for distributed processing, and it’s essential to optimize your code to take advantage of its parallel processing capabilities. You should be familiar with techniques like partitioning, caching, and broadcasting to improve performance.
  7. Debugging: Debugging distributed systems like Spark can be challenging. You should be familiar with Spark’s logging and monitoring features to identify and fix issues in your code.
  8. Machine learning: Spark provides a machine learning library, MLlib, that includes several algorithms for classification, regression, clustering, and recommendation. You should be familiar with these algorithms and their APIs to perform machine learning tasks in Spark.
  9. Streaming: Spark Streaming provides APIs to process data in real time from different sources like Kafka or Flume. You should know how to use these APIs to build real-time streaming applications (see the streaming sketch after this list).
  10. Integration with other tools: Spark integrates with multiple tools like Hadoop, Hive, and Pig. You should be familiar with these tools and their APIs to work with Spark in a Hadoop ecosystem. Additionally, knowledge of other distributed systems like Cassandra or Elasticsearch can also be helpful in integrating Spark with them.
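
As a concrete starting point for items 4 and 5 above (ingestion and processing), here is a minimal PySpark sketch. The paths, bucket name, and column names are illustrative placeholders, not references to a real dataset.

```python
# Minimal sketch: ingest data and run common DataFrame operations.
# All paths and column names below are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest-and-process").getOrCreate()

# Ingestion: the same reader API works for local, HDFS, or S3 paths.
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3a://example-bucket/orders.csv"))

customers = spark.read.parquet("hdfs:///warehouse/customers")

# Processing: filter, aggregate, join, and sort.
summary = (orders
           .filter(F.col("status") == "COMPLETE")
           .groupBy("customer_id")
           .agg(F.sum("amount").alias("total_amount"))
           .join(customers, "customer_id")
           .orderBy(F.desc("total_amount")))

# Write the result back out in an efficient columnar format.
summary.write.mode("overwrite").parquet("s3a://example-bucket/order_summary")
```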
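
For item 9 (streaming), the sketch below reads from Kafka with Structured Streaming. The broker address and topic name are placeholders, and the spark-sql-kafka connector package must be on the classpath for the kafka source to resolve.

```python
# Minimal Structured Streaming sketch: read events from Kafka and print them.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
          .option("subscribe", "events")                      # placeholder topic
          .load())

# Kafka delivers key/value as binary; cast the value to a string before processing.
parsed = events.select(F.col("value").cast("string").alias("raw_event"))

query = (parsed.writeStream
         .format("console")
         .outputMode("append")
         .start())

query.awaitTermination()
```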

Sample Questions and Answers

  1. What is an application driver? (Each application has its own driver, and each application can have only one driver; if you run two application programs, you have two drivers.)
  2. What are executors? (Executors are the JVM processes that run tasks and hold data in memory or on disk; worker nodes are the machines that host the executors.)
  3. What are the 4 ways executors allocate memory? (1. Overhead memory 2. Heap memory 3. Off-heap memory 4. PySpark memory.) YARN caps the total per executor container via its maximum-allocation-mb setting.
  4. What are 5 Spark memory configs? (1. spark.executor.memoryOverhead 2. spark.executor.memory 3. spark.memory.fraction 4. spark.memory.storageFraction 5. spark.executor.cores; see the memory-configuration sketch after this Q&A list.)
  5. What is overhead memory used for? (Network buffers and the shuffle exchange; PySpark workers also consume some of the overhead memory. You can get out-of-memory exceptions if these are not sized properly.)
  6. User memory is used by RDD conversion operations and for tracking RDD lineage and dependencies.
  7. Within the unified Spark memory, the space is further divided 50/50 (by default) into a storage memory pool for caching DataFrames and an execution memory pool for buffering DataFrame operations.
  8. What is memory fraction? (spark.memory.fraction is the fraction of (heap space - 300 MB) used for the execution and storage pools.)
  9. What are the 2 ways drivers allocate memory? (1. Driver memory and 2. driver memory overhead; their sum is the total that can be allocated to the driver.)
  10. What are some parameters for spark-submit? (--class: the main class of the Spark application to execute. --master: the URL of the cluster manager, which specifies where the application runs; this can be local for local execution or the URL of the cluster manager in standalone or YARN mode. --deploy-mode: client or cluster; in client mode the driver runs on the machine where spark-submit is executed, while in cluster mode the driver runs on one of the worker nodes. --num-executors: the number of executor processes to launch. --executor-memory: the amount of memory allocated to each executor. --executor-cores: the number of CPU cores allocated to each executor. --driver-memory: the amount of memory allocated to the driver. --files: a comma-separated list of files to ship with the application. --jars: a comma-separated list of JAR files to include. --conf: a configuration option in key=value format. E.g. spark-submit --class com.HelloSpark --master yarn --deploy-mode cluster --driver-cores 2 --driver-memory 8G --num-executors 4 --executor-cores 4 --executor-memory 16G hello-spark.jar)
  11. What are the different deployment modes? (Local mode: Spark runs on a single machine using the available CPU cores; suitable for testing and developing applications on a laptop or desktop. Standalone mode: Spark runs on a cluster of machines managed by the built-in Spark standalone cluster manager; suitable for a dedicated Spark cluster. Apache Hadoop YARN mode: Spark runs on a Hadoop cluster managed by the YARN resource manager, which lets Spark share resources with other Hadoop applications. Apache Mesos mode: Spark runs on a cluster managed by the Mesos cluster manager, which can also manage other distributed applications; suitable for a shared cluster of machines. Kubernetes mode: Spark runs on a Kubernetes cluster, which allows fine-grained resource management and dynamic scaling; suitable for containerized environments. Each mode has its own advantages and disadvantages, depending on the requirements of the application, the size of the dataset, and the available resources.)
  12. What is a fat vs. a lean executor? (A fat executor is configured with a large amount of memory and many CPU cores, so it can execute many tasks in parallel and hold more data in memory; fat executors are typically used when there are fewer worker nodes and the workload must be spread across fewer, larger executors. A lean executor is configured with less memory and fewer cores, so more executors can run on a given worker node; lean executors are typically used when there are many worker nodes and the workload should be distributed across a larger number of smaller executors.)
  13. What is a broadcast variable? (When a broadcast variable is created, it is serialized on the driver and sent to every worker node, where it is deserialized and kept in memory. A task that needs the value simply reads it from local memory instead of fetching it over the network. Explicit broadcast variables belong to the low-level RDD API and are rarely created directly when working with DataFrames, but broadcast hash joins still use the same mechanism to cache the small table in memory and avoid a shuffle. See the broadcast sketch after this Q&A list.)
  14. What is a broadcast join? (A broadcast hash join copies the small table to every executor to avoid a shuffle. It is important to keep table statistics up to date so Spark knows when a broadcast hash join is applicable. It is the most performant join strategy, so it pays to filter the smaller side down under the broadcast threshold and let AQE dynamically switch to a broadcast join instead of a sort-merge join.)
  15. What is the difference between a closure and a broadcast? A closure is a function passed to another function that can access variables in the enclosing scope; in Spark, closures are how user-defined functions (UDFs) are shipped to the transformations and actions executed on RDDs. When a closure runs on a worker node, Spark serializes every variable it references and sends them along with each task, which becomes expensive when the closure captures large data structures. A broadcast, by contrast, shares a large read-only data structure by transferring it once per worker node and caching it there, so every task on that node can reuse it. In short: a closure is serialized as many times as there are tasks, while a broadcast is cached once per worker node, which is why broadcasts are far more efficient for large read-only data and why broadcast joins build on broadcast variables.
  16. What type of hint takes the highest priority? (Plan-level hints take the highest priority: they are specified in the query plan itself, either with the hint() method of the DataFrame/Dataset API or by embedding a comment in the SQL query in the form /*+ HINT_NAME(...) */. Session-level hints are configuration settings applied at the session level, such as spark.sql.session.timeZone. Catalog-level hints are configuration settings at the catalog level, such as spark.sql.catalogImplementation. System-level hints are configuration settings at the system level, such as spark.sql.shuffle.partitions. If you put join hints on both sides of a join, the broadcast hint always takes priority.)
  17. What are transformations vs. actions? Transformations turn one RDD (or DataFrame) into another, while actions trigger the execution of the accumulated transformations and produce a final result. Transformations are lazy and do not execute until an action is called; actions are eager and execute immediately (see the lazy-evaluation sketch after this Q&A list).
  18. What is a wide dependency? Give examples of wide dependencies. (A wide dependency, or shuffle dependency, is a dependency between RDDs (Resilient Distributed Datasets) that requires data to be shuffled across the cluster, i.e. anything that groups data from multiple partitions, e.g. groupBy(), join(), cube(), rollup(), agg().)
  19. What is a narrow transformation? Give some examples of narrow transformations. (A narrow transformation does not require shuffling data across cluster nodes; it operates on data within a single partition, the logical subset of a larger dataset that can be processed on one machine. Narrow transformations are efficient because they run in parallel on each partition without moving data between nodes, e.g. select(), filter(), union(), withColumn(), map(), flatMap().)
  20. What is a Spark job vs. an action? (A job is a set of transformations and actions executed as a single unit of work in a Spark application. A job is typically triggered by an action, which causes all the transformations needed to produce the output to run as a batch. Roughly, each action submits a job, so X actions create X jobs; however, calls like show() or read() can trigger more than one job, so an action can launch one or more jobs depending on the RDDs and transformations involved.)
  21. What is the difference between a job and a stage? (A stage is a sequence of transformations that can be executed without shuffling data. Each stage consists of one or more tasks, the smallest units of work that can be scheduled and executed on the cluster. The action defines the job, the wide and narrow dependencies carve the job into stages, and the stages run serially.)
  22. What is a shuffle sort operation? (The write and read exchange used between stages is called a shuffle sort operation; the shuffle must compare keys in order to move data across partitions.)
  23. What forces a shuffle? (Wide dependency operations. The first part of the shuffle repartitions by key and writes the data to the stage's write exchange (the map-side repartition); the second part redistributes that partition data from the write exchange to the read exchange of the next stage (the partition aggregation on the reduce side of the shuffle).)
  24. What is the difference between a stage and a task? (A stage is broken into tasks, typically one per input partition; tasks that share only narrow dependencies within the same stage can run in parallel on the available slots at the same time.)
  25. Why are there multiple stages during processing? (Every wide dependency forces a shuffle, and every shuffle boundary starts a new stage, so a job has as many stages as it has shuffle boundaries plus one.)
  26. How can you increase processing performance? (Reduce wide dependencies: perform operations that require shuffling as close to the end of the computation as possible, so that subsequent stages can reuse the partitioning of the already-shuffled data. Filtering early, broadcasting small tables, and caching DataFrames that are reused also help.)
  27. What is slot capacity? (Slot capacity is the number of tasks that can run concurrently on an executor. It is based on the number of cores dedicated to the executor; slots are threads in the same JVM, so 4 cores means 4 slots. E.g., if an executor has 4 slots and the application needs a parallelism of 8, the 8 tasks run in two waves of 4.)
  28. What is the difference between cache() and persist()? Both cache() and persist() cache the data, but persist() can take a StorageLevel argument describing where to store it (useDisk, useMemory, useOffHeap, deserialized, replication=1). The default storage level for persist() is memory-and-disk deserialized in PySpark and MEMORY_AND_DISK in Scala. For RDDs, cache() is an alias for persist(StorageLevel.MEMORY_ONLY), which stores the data only in memory and not on disk (for DataFrames, cache() uses the default MEMORY_AND_DISK level). persist() can specify a more fine-grained storage level, such as MEMORY_ONLY, MEMORY_AND_DISK, or MEMORY_ONLY_SER. Both cache() and persist() are lazy operations (see the caching sketch after this Q&A list).
  29. When do you cache, and how do you flush the cache? (To uncache, use unpersist(). You should only cache, i.e. pin in memory, large DataFrames that you reuse often. Caching or persisting data can have a significant impact on memory and disk usage, so manage the storage level and eviction carefully to avoid out-of-memory errors or excessive disk usage.)
  30. How do you repartition? (To repartition an RDD or DataFrame, use the repartition() or coalesce() method. repartition() shuffles the data across all partitions, while coalesce() merges existing partitions to reduce their total number. Use repartition(numPartitions, cols...) for hash partitioning, or repartitionByRange() for range buckets determined by data sampling; see the repartition sketch after this Q&A list.)
  31. How many jobs are created by Spark SQL? (Each SQL query is compiled into Spark jobs. The compilation stages are: consult the catalog and create a logical plan >> optimize the plan >> generate physical plans >> apply the cost model >> select a physical plan >> generate Java bytecode that runs over RDDs; it is a sophisticated compiler. A Spark SQL query can create multiple jobs, each consisting of one or more stages; for example, a join between two large tables may require multiple jobs, with each job corresponding to a stage that performs a shuffle operation.)
  32. What is Adaptive Query Execution (AQE)? (It can coalesce small shuffle partitions and reduce the number of partitions. It dynamically determines the best shuffle partition number, which greatly simplifies tuning of the shuffle partition setting.)
  33. When does AQE coalesce with regard to a shuffle operation? (AQE can only coalesce partitions after a shuffle, never before, because it relies on the runtime statistics produced by the shuffle.)
  34. What is partition pruning? (Reading only the partitions that are needed; predicate pushdown then filters rows as the data is being read in.)
  35. What is DPP? (Dynamic partition pruning injects a subquery predicate onto the partitioned fact table when the partition column is used as a join key, forcing the scan to read only the required partitions. DPP works best when the large fact table is partitioned and the small dimension is broadcast; it does not depend on the filter predicates, so you can have no filter and still get DPP. E.g., broadcast the calendar dimension table and enable dynamic partition pruning; this assumes the fact table is partitioned and the calendar table is small. See the DPP sketch after this Q&A list.)
  36. What is the difference between an RDD and a DataFrame? (Spark DataFrames are built on top of RDDs; a DataFrame adds a schema and is optimized by Spark SQL's Catalyst engine, while an RDD is the lower-level distributed collection API.)
  37. What is an accumulator, and where is it stored? (The driver keeps a global copy into which each executor's updates are merged. The driver should be network-close to the executors for peak performance. Accumulators belong to the low-level Spark API and are now rarely used. Although an accumulator can be updated inside a transformation, it is better to update it inside an action, because an action guarantees the value is counted exactly once, whereas a transformation can multi-count if tasks are retried. See the accumulator sketch after this Q&A list.)
  38. collect() sends data from executors to the driver
  39. What is logical optimization? (1. Constant folding 2. Predicate pushdown 3. Partition pruning 4. Null propagation 5. Boolean expression simplification)
  40. What are the key configs for AQE? AQE has 5 key configs: 1. spark.sql.adaptive.enabled = on or off; if on, consider 2-5. 2. spark.sql.adaptive.coalescePartitions.enabled 3. spark.sql.adaptive.coalescePartitions.initialPartitionNum 4. spark.sql.adaptive.coalescePartitions.minPartitionNum 5. spark.sql.adaptive.advisoryPartitionSizeInBytes. See the AQE sketch after this Q&A list.
  41. What is skew? Skew is an uneven distribution of data across partitions. To manage skew dynamically, enable spark.sql.adaptive.enabled and spark.sql.adaptive.skewJoin.enabled, which turns on AQE skew handling (see the skew sketch after this Q&A list). repartition(n) is a uniform repartition and will not introduce skew; coalesce(n) merges partitions and, if not checked carefully, can create skew because you are forcing the merge.
  42. When does AQE perform a skew partition split? Both thresholds must be broken for AQE to split a skewed partition: 1. the partition must exceed the minimum size threshold, and 2. it must be more than 5x the size of the median partition.
  43. What is a speculative task? A speculative task is a duplicate of a slow task started on another node to see which copy finishes faster. Tuning speculation: 1. spark.speculation.interval 2. spark.speculation.multiplier = 1.5 (how many times slower than typical a task must be) 3. spark.speculation.quantile = 0.75 (only after 75% of tasks have finished) 4. spark.speculation.minTaskRuntime = 100ms 5. spark.speculation.task.duration.threshold = None. See the speculation sketch after this Q&A list.
  44. SPARK SETTINGS: Static allocation is first come, first served, but if spark.dynamicAllocation.enabled and spark.dynamicAllocation.shuffleTracking.enabled are set to true, idle executors are released back to the cluster manager for reuse. spark.dynamicAllocation.executorIdleTimeout = 60s, and spark.dynamicAllocation.schedulerBacklogTimeout = 1s controls how long pending tasks wait before new executors are requested.
  45. schedulerMode = FAIR allocates slots to jobs on a round-robin basis; this is for multi-threaded applications that submit several jobs concurrently.
  46. What should you consider for memory tuning? For memory tuning, check garbage collection (GC) overhead and the memory consumed by user objects.
  47. How is memory allocated at the executor? If you ask for 8 GB, the 300 MB reserved for the Spark engine is taken first; the rest is split 60% Spark memory for DataFrame execution and caching, and 40% user memory for user-defined data structures, internal metadata, UDFs, and RDD operations.
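
Code Sketches for Selected Answers

The sketches below illustrate some of the answers above. They are minimal, hedged examples rather than production recommendations. First, the memory-related settings from questions 4, 9, and 47, supplied when the session is created; the values are illustrative only, and executor settings must be set before the application launches (e.g. here or on spark-submit).

```python
# Sketch: memory-related configuration (illustrative values, not recommendations).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-config-demo")
         .config("spark.executor.memory", "8g")           # executor JVM heap
         .config("spark.executor.memoryOverhead", "1g")   # network buffers, shuffle exchange, PySpark workers
         .config("spark.executor.cores", "4")             # 4 cores -> 4 task slots per executor
         .config("spark.memory.fraction", "0.6")          # fraction of (heap - 300 MB) for the unified pool
         .config("spark.memory.storageFraction", "0.5")   # storage vs. execution split inside the pool
         .config("spark.driver.memory", "4g")
         .config("spark.driver.memoryOverhead", "512m")
         .getOrCreate())
```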
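
For questions 13 and 14, a sketch of an explicit broadcast variable (RDD-era API) and a broadcast hash join hint. The table paths, and the assumption that the calendar dimension is small enough to broadcast, are hypothetical.

```python
# Sketch: broadcast variable and broadcast hash join.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: serialized once on the driver, cached once per executor.
country_names = sc.broadcast({"US": "United States", "DE": "Germany"})
codes = sc.parallelize(["US", "DE", "US"])
print(codes.map(lambda c: country_names.value[c]).collect())

# Broadcast hash join: ship the small dimension to every executor to avoid a shuffle.
fact = spark.read.parquet("/data/fact_sales")       # hypothetical large fact table
dim = spark.read.parquet("/data/dim_calendar")      # assumed small dimension
joined = fact.join(broadcast(dim), "date_key")

# Equivalent SQL hint: SELECT /*+ BROADCAST(d) */ ... FROM fact f JOIN dim d ON f.date_key = d.date_key
```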
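
For questions 17 and 20, a sketch showing that transformations are lazy and only the action at the end triggers a job.

```python
# Sketch: transformations build a plan; the action executes it.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.range(1_000_000)                         # transformation: nothing runs yet
evens = df.filter(F.col("id") % 2 == 0)             # transformation: still nothing runs
doubled = evens.withColumn("x2", F.col("id") * 2)   # transformation

print(doubled.count())                              # action: triggers the job that runs all of the above
```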
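
For questions 28 and 29, a sketch of cache(), persist() with an explicit storage level, and unpersist().

```python
# Sketch: caching and persisting a DataFrame.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
df = spark.range(10_000_000)

df.cache()                                  # lazy: materialized by the first action
df.count()
df.unpersist()                              # flush the cache when it is no longer needed

df.persist(StorageLevel.MEMORY_AND_DISK)    # explicit level: spill to disk if memory is short
df.count()
df.unpersist()
```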
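
For question 30, a sketch of repartition(), repartitionByRange(), and coalesce(); the partition counts are arbitrary.

```python
# Sketch: changing the number and layout of partitions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()
df = spark.range(1_000_000)

by_hash = df.repartition(16, "id")           # full shuffle, hash-partitioned on a column
by_range = df.repartitionByRange(16, "id")   # range buckets chosen by sampling the data
fewer = df.coalesce(4)                       # merges existing partitions, avoids a full shuffle

print(by_hash.rdd.getNumPartitions(),
      by_range.rdd.getNumPartitions(),
      fewer.rdd.getNumPartitions())
```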
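
For questions 32, 33, and 40, a sketch of the AQE settings; the values are illustrative rather than tuned recommendations.

```python
# Sketch: enabling AQE and its shuffle-partition coalescing knobs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "200")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "8")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
```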
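
For question 35, a sketch of dynamic partition pruning. The setting is on by default in Spark 3.x; the table layout (a fact table partitioned by date_key joined to a small calendar dimension) is an assumption for illustration.

```python
# Sketch: a join shape that lets Spark apply dynamic partition pruning.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("dpp-demo").getOrCreate()
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

sales = spark.read.parquet("/data/sales")             # hypothetical fact table, partitioned by date_key
calendar = spark.read.parquet("/data/dim_calendar")   # hypothetical small dimension

result = sales.join(broadcast(calendar), "date_key")
result.explain()   # a pruned scan shows a dynamicpruningexpression in the physical plan
```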
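
For question 37, a sketch of an accumulator updated inside an action (foreach), which is what gives the guaranteed-once accuracy mentioned above.

```python
# Sketch: counting records with an accumulator inside an action.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)

def check(value):
    if value % 7 == 0:          # pretend multiples of 7 are "bad" records
        bad_records.add(1)

sc.parallelize(range(100)).foreach(check)   # foreach is an action, so the count is reliable
print("bad records:", bad_records.value)    # the driver holds the merged global value
```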
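
For questions 41 and 42, a sketch of the AQE skew-join settings; the factor matches the 5x rule described above and the byte threshold is Spark's documented default.

```python
# Sketch: letting AQE split skewed partitions at join time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-demo").getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition is split only if BOTH conditions below hold:
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")  # minimum absolute size
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")               # 5x the median partition size
```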
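
For question 43, a sketch of the speculative-execution settings; the values match the defaults cited in the answer above (spark.speculation.minTaskRuntime is only available in newer Spark releases).

```python
# Sketch: enabling and tuning speculative execution.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("speculation-demo")
         .config("spark.speculation", "true")                  # off by default
         .config("spark.speculation.interval", "100ms")        # how often to check for slow tasks
         .config("spark.speculation.multiplier", "1.5")        # a task must be 1.5x slower than the median
         .config("spark.speculation.quantile", "0.75")         # only after 75% of tasks have finished
         .config("spark.speculation.minTaskRuntime", "100ms")  # ignore very short tasks
         .getOrCreate())
```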