Prepare for the Databricks Certified Associate Developer for Apache Spark 3.0 exam with our extensive collection of questions and answers. These practice Q&A are updated according to the latest syllabus, providing you with the tools needed to review and test your knowledge.
QA4Exam focuses on the latest syllabus and exam objectives; our practice Q&A are designed to help you identify key topics and solidify your understanding. By concentrating on the core curriculum, these Questions & Answers help you cover all the essential topics, ensuring you're well prepared for every section of the exam. Each question comes with a detailed explanation, offering valuable insights and helping you learn from your mistakes. Whether you're looking to assess your progress or dive deeper into complex topics, our updated Q&A will provide the support you need to confidently approach the Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam and achieve success.
The code block displayed below contains an error. The code block should return a DataFrame where all entries in column supplier contain the letter combination et in this order. Find the error.
Code block:
itemsDf.filter(Column('supplier').isin('et'))
Correct code block:
itemsDf.filter(col('supplier').contains('et'))
It is easy to mix up isin and contains here. Since we want to check whether the column contains the letter combination et, contains is the method to use. Note that both are methods of Spark's Column object; see below for documentation links.
A specific Column object is accessed through the col() function, not through Column() or col[], which is the essential point in this question. In PySpark, Column refers to a generic column object. To use it in a query, the generic column reference has to be tied to a specific DataFrame, which is what col() accomplishes, for example.
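To make this concrete, here is a minimal sketch of the corrected filter. The sample rows are assumed, borrowed from the itemsDf excerpt shown in the next question, so only the itemId and supplier columns appear:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("contains-example").getOrCreate()

# Assumed sample data mirroring the itemsDf excerpt from the next question.
itemsDf = spark.createDataFrame(
    [(1, "Sports Company Inc."), (2, "YetiX"), (3, "Sports Company Inc.")],
    ["itemId", "supplier"],
)

# col("supplier") resolves the name to a Column of itemsDf; contains("et")
# keeps rows whose supplier includes the substring "et".
itemsDf.filter(col("supplier").contains("et")).show()
# +------+--------+
# |itemId|supplier|
# +------+--------+
# |     2|   YetiX|
# +------+--------+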
More info:
- isin documentation: pyspark.sql.Column.isin --- PySpark 3.1.1 documentation
- contains documentation: pyspark.sql.Column.contains --- PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1, Question: 51 (Databricks import instructions)
Which of the following code blocks reorders the values inside the arrays in column attributes of DataFrame itemsDf from last to first one in the alphabet?
DataFrame itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
Output of correct code block:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[winter, cozy, blue]         |Sports Company Inc.|
|2     |[summer, red, fresh, cooling]|YetiX              |
|3     |[travel, summer, green]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
It can be confusing to differentiate between the different sorting functions in PySpark. In this case, a particularity of sort_array has to be considered: the sort direction is given by its second argument (asc), not by the desc() method. This is spelled out in the documentation (link below). To solve this question you also need to understand the difference between sort and sort_array: with sort, you cannot sort values inside arrays, and sort is a method of DataFrame, while sort_array is a function in pyspark.sql.functions.
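For illustration, here is a minimal sketch of the descending sort; the DataFrame is rebuilt from the itemsDf sample shown above:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sort_array

spark = SparkSession.builder.appName("sort-array-example").getOrCreate()

# Rebuild the sample itemsDf shown above.
itemsDf = spark.createDataFrame(
    [
        (1, ["blue", "winter", "cozy"], "Sports Company Inc."),
        (2, ["red", "summer", "fresh", "cooling"], "YetiX"),
        (3, ["green", "summer", "travel"], "Sports Company Inc."),
    ],
    ["itemId", "attributes", "supplier"],
)

# sort_array sorts within each array; the direction comes from the second
# argument (asc=False), not from a desc() call. This reproduces the
# "Output of correct code block" shown above.
itemsDf.withColumn("attributes", sort_array("attributes", asc=False)).show(truncate=False)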
More info: pyspark.sql.functions.sort_array --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, Question: 32 (Databricks import instructions)
Which of the following code blocks produces the following output, given DataFrame transactionsDf?
Output:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
The output is the typical output of a DataFrame.printSchema() call. The DataFrame's RDD representation has neither a printSchema nor a formatSchema method (the available methods are listed in the RDD documentation linked below). The output of print(transactionsDf.schema) is StructType(List(StructField(transactionId,IntegerType,true), StructField(predError,IntegerType,true), StructField(value,IntegerType,true), StructField(storeId,IntegerType,true), StructField(productId,IntegerType,true), StructField(f,IntegerType,true))); it contains the same information as the nicely formatted original output, but is not nicely formatted itself. Lastly, the DataFrame's schema attribute does not have a print() method.
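For illustration, a minimal sketch that rebuilds transactionsDf from the sample above (the DDL schema string is an assumption chosen to reproduce the integer, nullable columns) and prints the schema both ways:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("print-schema-example").getOrCreate()

# Rebuild transactionsDf from the sample above; None produces null values.
transactionsDf = spark.createDataFrame(
    [
        (1, 3, 4, 25, 1, None),
        (2, 6, 7, 2, 2, None),
        (3, 3, None, 25, 3, None),
    ],
    "transactionId INT, predError INT, value INT, storeId INT, productId INT, f INT",
)

# printSchema() renders the tree shown in the question's output.
transactionsDf.printSchema()

# By contrast, printing the schema attribute gives the unformatted StructType representation.
print(transactionsDf.schema)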
More info:
- pyspark.RDD: pyspark.RDD --- PySpark 3.1.2 documentation
- DataFrame.printSchema(): pyspark.sql.DataFrame.printSchema --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, Question: 52 (Databricks import instructions)
Which of the following code blocks returns a copy of DataFrame transactionsDf in which column productId has been renamed to productNumber?
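The correct option relies on DataFrame.withColumnRenamed, which returns a copy of the DataFrame with the column renamed (see the documentation link below). A minimal sketch, reusing an assumed transactionsDf-like sample based on the previous question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rename-example").getOrCreate()

# A small transactionsDf-like sample (schema borrowed from the previous question).
transactionsDf = spark.createDataFrame(
    [(1, 3, 4, 25, 1, None), (2, 6, 7, 2, 2, None)],
    "transactionId INT, predError INT, value INT, storeId INT, productId INT, f INT",
)

# withColumnRenamed returns a new DataFrame with productId renamed to
# productNumber; the original transactionsDf is left unchanged.
renamedDf = transactionsDf.withColumnRenamed("productId", "productNumber")
print(renamedDf.columns)
# ['transactionId', 'predError', 'value', 'storeId', 'productNumber', 'f']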
More info: pyspark.sql.DataFrame.withColumnRenamed --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, Question: 35 (Databricks import instructions)
Which of the following is a characteristic of the cluster manager?
The cluster manager receives input from the driver through the SparkContext.
Correct. In order to contact the cluster manager, the driver launches a SparkContext. The driver then asks the cluster manager for resources to launch executors (see the sketch after this question's explanations).
In client mode, the cluster manager runs on the edge node.
No. In client mode, the cluster manager is independent of the edge node and runs in the cluster.
The cluster manager does not exist in standalone mode.
Wrong, the cluster manager exists even in standalone mode. Remember, standalone mode is an easy way to deploy Spark across a whole cluster, with some limitations; for example, in standalone mode, no other frameworks can run in parallel with Spark. In standalone deployments, however, the cluster manager is part of Spark itself and helps launch and maintain resources across the cluster.
The cluster manager transforms jobs into DAGs.
No, transforming jobs into DAGs is the task of the Spark driver.
Each cluster manager works on a single partition of data.
No. Cluster managers do not work on partitions directly. Their job is to coordinate cluster resources so that they can be requested by and allocated to Spark drivers.
More info: Introduction to Core Spark Concepts * BigData
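To illustrate the correct option above, here is a minimal, hedged sketch. It runs against the local master so it is self-contained; the comments point out where a real cluster manager URL would go in a standalone deployment:
from pyspark.sql import SparkSession

# Building a SparkSession on the driver creates the underlying SparkContext.
# In a real standalone deployment the master URL would point at the cluster
# manager, e.g. spark://<master-host>:7077 (placeholder); local[*] is used
# here only so the sketch runs without a cluster.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("cluster-manager-illustration")
    .getOrCreate()
)

# The SparkContext is the driver's handle for talking to the cluster manager:
# it registers the application and requests resources for executors.
print(spark.sparkContext.master)   # the endpoint the driver is connected to
print(spark.sparkContext.appName)  # the application name registered on connection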