Most Recent Databricks-Certified-Data-Engineer-Associate Exam Questions & Answers

Prepare for the Databricks Certified Data Engineer Associate exam with our extensive collection of questions and answers. These practice Q&A are updated according to the latest syllabus, providing you with the tools needed to review and test your knowledge.

QA4Exam focuses on the latest syllabus and exam objectives, and our practice Q&A are designed to help you identify key topics and solidify your understanding. By concentrating on the core curriculum, these questions and answers help you cover all the essential topics so that you are well prepared for every section of the exam. Each question comes with a detailed explanation that offers insight and helps you learn from your mistakes. Whether you want to assess your progress or dive deeper into complex topics, our updated Q&A will give you the support you need to approach the Databricks-Certified-Data-Engineer-Associate exam with confidence.

The questions for Databricks-Certified-Data-Engineer-Associate were last updated on Jan 21, 2025.


Question No. 1

A data analyst has a series of queries in a SQL program. The data analyst wants this program to run every day. They only want the final query in the program to run on Sundays. They ask for help from the data engineering team to complete this task.

Which of the following approaches could be used by the data engineering team to complete this task?

A. They could submit a feature request with Databricks to add this functionality.

B. They could wrap the queries using PySpark and use Python's control flow system to determine when to run the final query.

C. They could only run the entire program on Sundays.

D. They could automatically restrict access to the source table in the final query so that it is only accessible on Sundays.

E. They could redesign the data model to separate the data used in the final query into a new table.


Correct Answer: B

This approach would allow the data engineering team to use the existing SQL program and add some logic to control the execution of the final query based on the day of the week. They could use the datetime module in Python to get the current date and check whether it is a Sunday. If so, they could run the final query; otherwise they could skip it. This way, they could schedule the program to run every day without changing the data model or the source table. Reference: PySpark SQL Module, Python datetime Module, Databricks Jobs
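The control flow described in this explanation can be sketched in a few lines of Python. This is a minimal illustration and not the analyst's actual program: the function name run_daily_program, its parameters, and the placeholder queries are hypothetical. The run_sql argument is assumed to be any callable that executes one SQL statement; in a Databricks notebook that would typically be spark.sql.

```python
from datetime import date
from typing import Callable, Optional, Sequence

SUNDAY = 6  # date.weekday() counts Monday as 0, so Sunday is 6


def run_daily_program(
    daily_queries: Sequence[str],
    final_query: str,
    run_sql: Callable[[str], object],
    today: Optional[date] = None,
) -> int:
    """Run every daily query, then run the final query only on Sundays.

    Returns the number of statements that were executed.
    """
    today = today or date.today()
    executed = 0

    # These statements run on every scheduled (daily) execution.
    for query in daily_queries:
        run_sql(query)
        executed += 1

    # Plain Python control flow decides whether the last statement runs.
    if today.weekday() == SUNDAY:
        run_sql(final_query)
        executed += 1

    return executed
```

In a notebook scheduled as a daily Databricks job, a call might look like run_daily_program(["SELECT 1", "SELECT 2"], "SELECT 3", spark.sql), where the query strings stand in for the analyst's real statements.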

Question No. 2

A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.

Which of the following describes why Auto Loader inferred all of the columns to be of the string type?

A. There was a type mismatch between the specific schema and the inferred schema

B. JSON data is a text-based format

C. Auto Loader only works with string data

D. All of the fields had at least one null value

E. Auto Loader cannot infer the schema of ingested data


Correct Answer: B

JSON data is a text-based format that represents data as a collection of name-value pairs. By default, when Auto Loader infers the schema of JSON data, it treats all columns as strings. This is because JSON data can have varying data types for the same column across different files or records, and Auto Loader does not attempt to reconcile these differences. For example, a column named "age" may have integer values in some files but string values in others. To avoid data loss or errors, Auto Loader infers the column as a string type. However, Auto Loader also provides an option to infer more precise column types based on the sample data. This option is called cloudFiles.inferColumnTypes, and it can be set to true or false. When set to true, Auto Loader tries to infer the exact data types of the columns, such as integers, floats, booleans, or nested structures. When set to false, Auto Loader infers all columns as strings. The default value of this option is false. Reference: Configure schema inference and evolution in Auto Loader, Schema inference with auto loader (non-DLT and DLT), Using and Abusing Auto Loader's Inferred Schema, Explicit path to data or a defined schema required for Auto loader.
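As a rough sketch of how the cloudFiles.inferColumnTypes option mentioned above might be set, the Python below builds the Auto Loader options for a JSON source and passes them to a streaming read. The helper names, the source path, and the schema location are hypothetical; spark is assumed to be an active SparkSession on Databricks, since the cloudFiles source is only available there.

```python
from typing import Dict


def autoloader_json_options(schema_location: str, infer_column_types: bool = True) -> Dict[str, str]:
    """Build Auto Loader options for a JSON source.

    With inferColumnTypes left at its default of false, every column is
    inferred as a string; setting it to true asks Auto Loader to infer
    more precise types from the sampled data.
    """
    return {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true" if infer_column_types else "false",
    }


def read_json_with_autoloader(spark, source_path: str, schema_location: str):
    """Start a streaming read of JSON files with column type inference enabled."""
    options = autoloader_json_options(schema_location)
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)
```

Keeping the options in a plain dictionary makes it easy to compare the default behavior (all strings) with the typed behavior by flipping a single argument.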

Question No. 3

A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:

DROP TABLE IF EXISTS my_table;

After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.

Which of the following describes why all of these files were deleted?

A. The table was managed

B. The table's data was smaller than 10 GB

C. The table's data was larger than 10 GB

D. The table was external

E. The table did not have a location


Correct Answer: A

All of the data files and metadata files were deleted from the file system after dropping the table because the table was managed. A managed table is a table that is created and managed by Spark SQL. It stores both the data and the metadata in the default location specified by the spark.sql.warehouse.dir configuration property. When a managed table is dropped, both the data and the metadata are deleted from the file system.

Option B is not correct, as the size of the table's data does not affect the behavior of dropping the table. Whether the table's data is smaller or larger than 10 GB, the data files and metadata files will be deleted if the table is managed, and will be preserved if the table is external.

Option C is not correct, for the same reason as option B.

Option D is not correct, as an external table is a table that is created and managed by the user. It stores the data in a user-specified location, and only stores the metadata in the Spark SQL catalog. When an external table is dropped, only the metadata is deleted from the catalog, but the data files are preserved in the file system.

Option E is not correct, as a table must have a location to store the data. If the location is not specified by the user, it will use the default location for managed tables. Therefore, a table without a location is a managed table, and dropping it will delete both the data and the metadata.

Reference: Managing Tables

[Databricks Data Engineer Professional Exam Guide]
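The difference between managed and external tables described in this explanation can be illustrated with a short sketch. The table names managed_demo and external_demo, the function name, and the storage path are hypothetical; run_sql is assumed to be a callable that executes one SQL statement, such as spark.sql in a notebook.

```python
from typing import Callable, List


def managed_vs_external_demo(run_sql: Callable[[str], object], external_path: str) -> List[str]:
    """Create one managed and one external table, then drop both.

    Returns the SQL statements in the order they were executed.
    """
    statements = [
        # No LOCATION clause: the table is managed and its files live
        # under the default warehouse directory.
        "CREATE TABLE IF NOT EXISTS managed_demo (id INT)",
        # Explicit LOCATION clause: the table is external and its files
        # live at the user-specified path.
        f"CREATE TABLE IF NOT EXISTS external_demo (id INT) LOCATION '{external_path}'",
        # Dropping the managed table removes its metadata and its data files.
        "DROP TABLE IF EXISTS managed_demo",
        # Dropping the external table removes only the catalog entry;
        # the files at external_path are left in place.
        "DROP TABLE IF EXISTS external_demo",
    ]
    for statement in statements:
        run_sql(statement)
    return statements
```

After running this against a real workspace, listing the contents of external_path should still show the external table's files, while the managed table's directory is gone.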

Question No. 4

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?

A. trigger('5 seconds')

B. trigger()

C. trigger(once='5 seconds')

D. trigger(processingTime='5 seconds')

E. trigger(continuous='5 seconds')


Correct Answer: D

The processingTime option specifies a time-based trigger interval for fixed-interval micro-batches. This means that the query will execute a micro-batch to process data every 5 seconds, regardless of how much data is available. This option is suitable for near-real-time processing workloads that require low latency and a consistent processing frequency. The other options do not fit: A and C are invalid syntax, B specifies no interval at all (without a trigger setting, the default is to start each micro-batch as soon as the previous one finishes), and E selects the experimental continuous processing mode. Reference: Databricks Documentation - Configure Structured Streaming trigger intervals, Databricks Documentation - Trigger.
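The engineer's original code block is not reproduced here, so the following is only a sketch of where a processingTime trigger might sit in a streaming write. The function name, the target table name, and the checkpoint path are hypothetical; stream_df is assumed to be a streaming DataFrame that has already been read and transformed.

```python
def start_micro_batch_query(stream_df, target_table: str, checkpoint_path: str, seconds: int = 5):
    """Start a streaming write that runs one micro-batch every `seconds` seconds."""
    return (
        stream_df.writeStream
        # Fixed-interval micro-batches: a new batch is triggered every N seconds.
        .trigger(processingTime=f"{seconds} seconds")
        .outputMode("append")
        # Structured Streaming tracks progress in the checkpoint location.
        .option("checkpointLocation", checkpoint_path)
        # Write the stream into the target table and return the running query.
        .toTable(target_table)
    )
```

Passing the interval as the processingTime keyword matters: the trigger method takes keyword arguments only, which is why a bare positional string such as '5 seconds' is rejected.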

Question No. 5

A dataset has been defined using Delta Live Tables and includes an expectations clause:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW

What is the expected behavior when a batch of data containing data that violates these constraints is processed?

A. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.

B. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.

C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.

D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.

E. Records that violate the expectation cause the job to fail.


Correct Answer: C

Delta Live Tables expectations are optional clauses that apply data quality checks on each record passing through a query. An expectation consists of a description, a boolean statement, and an action to take when a record fails the expectation. The available actions are warn (the default when no ON VIOLATION clause is given), drop (ON VIOLATION DROP ROW), and fail (ON VIOLATION FAIL UPDATE). The drop action means that invalid records are dropped before data is written to the target dataset. The failure is reported as a metric for the dataset, which can be viewed by querying the Delta Live Tables event log. The event log contains information such as the number of records that violate an expectation, the number of records dropped, and the number of records written to the target dataset. Reference:

Manage data quality with Delta Live Tables

Monitor Delta Live Tables pipelines

Delta Live Tables SQL language reference
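To make the drop behavior concrete, here is a plain-Python simulation of what ON VIOLATION DROP ROW does to a batch. It is an illustration of the semantics only and does not use the Delta Live Tables API; the function name, the metric names, and the sample records are all hypothetical.

```python
from datetime import date
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]


def expect_or_drop_rows(
    records: Iterable[Record],
    predicate: Callable[[Record], bool],
) -> Tuple[List[Record], Dict[str, int]]:
    """Keep records that satisfy the expectation and count the ones dropped."""
    kept: List[Record] = []
    dropped = 0
    for record in records:
        if predicate(record):
            kept.append(record)  # valid rows reach the target dataset
        else:
            dropped += 1  # invalid rows are only counted, never written
    metrics = {"passed_records": len(kept), "dropped_records": dropped}
    return kept, metrics


batch = [
    {"id": 1, "timestamp": date(2021, 6, 1)},
    {"id": 2, "timestamp": date(2019, 12, 31)},
    {"id": 3, "timestamp": date(2020, 1, 1)},  # not strictly greater, so it is dropped
]

# Mirrors the constraint: timestamp > '2020-01-01'
kept, metrics = expect_or_drop_rows(batch, lambda r: r["timestamp"] > date(2020, 1, 1))
```

Here only the record with id 1 is kept, and the metrics dictionary plays the role of the counts that a real pipeline would record in its event log.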
