Prepare for the Databricks Certified Associate Developer for Apache Spark 3.0 exam with our extensive collection of questions and answers. These practice Q&A are updated according to the latest syllabus, providing you with the tools needed to review and test your knowledge.
QA4Exam focuses on the latest syllabus and exam objectives; our practice Q&A are designed to help you identify key topics and solidify your understanding. By focusing on the core curriculum, these Questions & Answers help you cover all the essential topics, ensuring you're well prepared for every section of the exam. Each question comes with a detailed explanation, offering valuable insights and helping you learn from your mistakes. Whether you're looking to assess your progress or dive deeper into complex topics, our updated Q&A will provide the support you need to confidently approach the Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam and achieve success.
The code block displayed below contains an error. The code block should read the CSV file located at path data/transactions.csv into DataFrame transactionsDf, using the first row as the column header and casting the columns to the most appropriate types. Find the error.
First 3 rows of transactions.csv:
transactionId;storeId;productId;name
1;23;12;green grass
2;35;31;yellow sun
3;23;12;green grass
Code block:
transactionsDf = spark.read.load("data/transactions.csv", sep=";", format="csv", header=True)
Correct code block:
transactionsDf = spark.read.load('data/transactions.csv', sep=';', format='csv', header=True, inferSchema=True)
By default, Spark does not infer the schema of the CSV (since this usually takes some time). So, you need to add the inferSchema=True option to the code block.
More info: pyspark.sql.DataFrameReader.csv --- PySpark 3.1.2 documentation
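For illustration, a minimal sketch of the corrected read, assuming an active SparkSession named spark and the semicolon-separated file at data/transactions.csv; the printSchema() call is only there to verify that Spark inferred numeric types for transactionId, storeId and productId:
# Sketch: equivalent to the corrected code block, using the DataFrameReader's csv() shortcut.
transactionsDf = (spark.read
    .option("sep", ";")
    .option("header", True)
    .option("inferSchema", True)  # without this option, every column is read as a string
    .csv("data/transactions.csv"))
transactionsDf.printSchema()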
The code block displayed below contains multiple errors. The code block should return a DataFrame that contains only columns transactionId, predError, value and storeId of DataFrame
transactionsDf. Find the errors.
Code block:
transactionsDf.select([col(productId), col(f)])
Sample of transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
Correct code block: transactionsDf.drop('productId', 'f')
This question requires a lot of thinking to get right. For solving it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code block includes multiple errors. In the test, you are usually confronted with a code block that only contains a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.
The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as
strings without being wrapped in a col() operator.
Correct! Here, you need to figure out the many, many things that are wrong with the initial code block. While the question can be solved by using a select statement, a drop statement, given the answer options, is the correct one. Then, you can read in the documentation that drop does not take a list as an argument, but just the column names that should be dropped. Finally, the column names should be expressed as strings and not as Python variable names as in the original code block.
The column names should be listed directly as arguments to the operator and not as a list.
Incorrect. While this is a good first step and part of the correct solution (see above), this modification is insufficient to solve the question.
The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f
should be replaced by transactionId, predError, value and storeId.
Wrong. If you use the same pattern as in the original code block (col(productId), col(f)), you are still making a mistake. col(productId) will trigger Python to search for the content of a variable named
productId instead of telling Spark to use the column productId - for that, you need to express it as a string.
The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
No. This still leaves you with Python trying to interpret the column names as Python variables (see above).
The select operator should be replaced by a drop operator.
Wrong, this is not enough to solve the question. If you do this, you will still face problems since you are passing a Python list to drop and the column names are still interpreted as Python variables
(see above).
More info: pyspark.sql.DataFrame.drop --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 30 (Databricks import instructions)
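As a quick sketch of the corrected call and an equivalent select-based variant, assuming transactionsDf is the DataFrame shown in the sample above:
# Corrected version: drop() takes the column names directly as string arguments, not as a list.
result = transactionsDf.drop('productId', 'f')
# Equivalent select-based version, with the wanted columns expressed as strings.
result = transactionsDf.select('transactionId', 'predError', 'value', 'storeId')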
The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where insufficient
executor memory is available, in a fault-tolerant way. Find the error.
Code block:
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)
The storage level is inappropriate for fault-tolerant storage.
Correct. Typically, when thinking about fault tolerance and storage levels, you would want to store redundant copies of the dataset. This can be achieved by using a storage level such as
StorageLevel.MEMORY_AND_DISK_2.
The code block uses the wrong command for caching.
Wrong. In this case, DataFrame.persist() needs to be used, since this operator supports passing a storage level. DataFrame.cache() does not support passing a storage level.
Caching is not supported in Spark, data are always recomputed.
Incorrect. Caching is an important component of Spark, since it can help to accelerate Spark programs to a great extent. Caching is often a good idea for datasets that need to be accessed repeatedly.
Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
No. Caching is either accessed through DataFrame.cache() or DataFrame.persist().
The DataFrameWriter needs to be invoked.
Wrong. The DataFrameWriter can be accessed via DataFrame.write and is used to write data to external data stores, mostly on disk. Here, we find keywords such as 'cache' and 'executor
memory' that point us away from using external data stores. We aim to save data to memory to accelerate the reading process, since reading from disk is comparatively slower. The
DataFrameWriter does not write to memory, so we cannot use it here.
More info: Best practices for caching in Spark SQL | by David Vrba | Towards Data Science
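A short sketch of a fault-tolerant caching setup, assuming transactionsDf already exists in the session; the count() action is only there to materialize the cache:
from pyspark import StorageLevel
# Replicated storage level: each partition is kept on two executors,
# in memory where possible and spilled to disk otherwise.
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK_2)
transactionsDf.count()  # action that triggers the actual caching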
Which of the following describes how Spark achieves fault tolerance?
Due to the mutability of DataFrames after transformations, Spark reproduces them using observed lineage in case of worker node failure.
Wrong. Between transformations, DataFrames are immutable. Given that Spark also records the lineage, Spark can reproduce any DataFrame in case of failure. These two aspects are the key to understanding fault tolerance in Spark.
Spark builds a fault-tolerant layer on top of the legacy RDD data system, which by itself is not fault tolerant.
Wrong. RDD stands for Resilient Distributed Dataset and it is at the core of Spark and not a 'legacy system'. It is fault-tolerant by design.
Spark helps fast recovery of data in case of a worker fault by providing the MEMORY_AND_DISK storage level option.
This is not true. For supporting recovery in case of worker failures, Spark provides '_2', '_3', and so on, storage level options, for example MEMORY_AND_DISK_2. These storage levels are
specifically designed to keep duplicates of the data on multiple nodes. This saves time in case of a worker fault, since a copy of the data can be used immediately, vs. having to recompute it first.
Spark is only fault-tolerant if this feature is specifically enabled via the spark.fault_recovery.enabled property.
No, Spark is fault-tolerant by design.
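To illustrate the immutability and lineage points above, a small sketch assuming an active SparkSession named spark:
df = spark.range(5)                       # base DataFrame
doubled = df.selectExpr('id * 2 AS id')   # transformation returns a new DataFrame; df itself is unchanged
# The recorded lineage (query plan) is what Spark replays to recompute
# lost partitions after a worker failure.
doubled.explain()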
Which of the following statements about the differences between actions and transformations is correct?
Actions can trigger Adaptive Query Execution, while transformations cannot.
Correct. Adaptive Query Execution optimizes queries at runtime. Since transformations are evaluated lazily, Spark does not have any runtime information to optimize the query until an action is
called. If Adaptive Query Execution is enabled, Spark will then try to optimize the query based on the feedback it gathers while it is evaluating the query.
Actions can be queued for delayed execution, while transformations can only be processed immediately.
No, there is no such concept as 'delayed execution' in Spark. Actions cannot be evaluated lazily, meaning that they are executed immediately.
Actions are evaluated lazily, while transformations are not evaluated lazily.
Incorrect, it is the other way around: Transformations are evaluated lazily and actions trigger their evaluation.
Actions generate RDDs, while transformations do not.
No. Transformations change the data and, since RDDs are immutable, generate new RDDs along the way. Actions produce outputs in Python data types or files (integers, lists, text files, ...) based on the RDDs, but they do not generate them.
Here is a great tip on how to differentiate actions from transformations: If an operation returns a DataFrame, Dataset, or an RDD, it is a transformation. Otherwise, it is an action.
Actions do not send results to the driver, while transformations do.
No. Actions send results to the driver. Think about running DataFrame.count(). The result of this command will return a number to the driver. Transformations, however, do not send results back to
the driver. They produce RDDs that remain on the worker nodes.
More info: What is the difference between a transformation and an action in Apache Spark? | Bartosz Mikulski, How to Speed up SQL Queries with Adaptive Query Execution
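To make the lazy-evaluation point concrete, a minimal sketch assuming an active SparkSession named spark (spark.sql.adaptive.enabled is the standard configuration key for Adaptive Query Execution):
spark.conf.set('spark.sql.adaptive.enabled', True)  # enable Adaptive Query Execution
df = spark.range(100)
filtered = df.filter('id % 2 = 0')  # transformation: lazy, only recorded in the query plan
# The action triggers evaluation (and, with AQE enabled, runtime re-optimization)
# and returns a plain Python integer to the driver.
print(filtered.count())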