Most Recent Databricks-Certified-Professional-Data-Engineer Exam Dumps

Prepare for the Databricks Certified Data Engineer Professional exam with our extensive collection of questions and answers. These practice Q&A are updated according to the latest syllabus, providing you with the tools needed to review and test your knowledge.

QA4Exam focuses on the latest syllabus and exam objectives. Our practice Q&A are designed to help you identify key topics and solidify your understanding. By focusing on the core curriculum, these questions and answers help you cover all the essential topics so that you are well prepared for every section of the exam. Each question comes with a detailed explanation, offering valuable insights and helping you learn from your mistakes. Whether you want to assess your progress or dive deeper into complex topics, our updated Q&A provide the support you need to approach the Databricks-Certified-Professional-Data-Engineer exam with confidence.

The questions for Databricks-Certified-Professional-Data-Engineer were last updated on Mar 30, 2025.


Question No. 1

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.

Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

A. Stage's detail screen and Executor's files

B. Stage's detail screen and Query's detail screen

C. Driver's and Executor's log files

D. Executor's detail screen and Executor's log files

Correct Answer: B

In the Spark UI, two of the primary indicators of spill are found on the Stage's detail screen and the Query's detail screen. The Stage's detail screen reports Spill (Memory) and Spill (Disk) in its task metrics when any task in the stage has spilled, and the Query's detail screen (under the SQL tab) shows spill size on the affected operators. A task spills when the data it is processing exceeds the available execution memory, so Spark writes part of it to disk to free memory. Excessive spill can significantly slow processing, which makes these metrics worth checking.

Apache Spark Monitoring and Instrumentation: Spark Monitoring Guide

Spark UI Explained: Spark UI Documentation

Question No. 2

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

A. Cluster: New Job Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: Unlimited

B. Cluster: New Job Cluster;
Retries: None;
Maximum Concurrent Runs: 1

C. Cluster: Existing All-Purpose Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: 1

D. Cluster: New Job Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: 1

E. Cluster: Existing All-Purpose Cluster;
Retries: None;
Maximum Concurrent Runs: 1

Correct Answer: D

The configuration that automatically recovers from query failures and keeps costs low is to use a new job cluster, set retries to unlimited, and set maximum concurrent runs to 1. This configuration has the following advantages:

A new job cluster is a cluster that is created and terminated for each job run. This means that the cluster resources are only used when the job is running, and no idle costs are incurred. This also ensures that the cluster is always in a clean state and has the latest configuration and libraries for the job.

Setting retries to unlimited means that the job will automatically restart the query in case of any failure, such as network issues, node failures, or transient errors. This improves the reliability and availability of the streaming job, and avoids data loss or inconsistency.

Setting maximum concurrent runs to 1 means that only one instance of the job can run at a time. This prevents multiple queries from competing for the same resources or writing to the same output location, which can cause performance degradation or data corruption.

Therefore, this configuration is the best practice for scheduling Structured Streaming jobs for production, as it ensures that the job is resilient, efficient, and consistent.
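As a rough sketch of how the three settings described above might look in a job definition, the following Python dictionary uses field names from the Databricks Jobs API 2.1 (`max_concurrent_runs`, `new_cluster`, and `max_retries`, where -1 means retry indefinitely). The job name, task key, notebook path, runtime version, node type, and worker count are placeholder assumptions, not values taken from the question.

```python
import json

# Sketch of a Jobs API 2.1 style job definition.
# All names, paths, and cluster sizes below are hypothetical placeholders.
job_settings = {
    "name": "streaming-ingest",          # hypothetical job name
    "max_concurrent_runs": 1,            # only one run of the job at a time
    "tasks": [
        {
            "task_key": "run_stream",    # hypothetical task key
            "notebook_task": {"notebook_path": "/Jobs/streaming_ingest"},
            # A job cluster is created for the run and terminated afterwards.
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",  # placeholder runtime
                "node_type_id": "i3.xlarge",          # placeholder node type
                "num_workers": 2,                     # placeholder size
            },
            "max_retries": -1,           # -1 = retry indefinitely
        }
    ],
}

print(json.dumps(job_settings, indent=2))
```

The retry setting sits on the task while the concurrency limit sits on the job, which is why the two values appear at different levels of the dictionary.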

Question No. 3

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.

The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.

The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.

Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

A. Because the vacuum command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.

B. Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the vacuum job is run the following day.

C. Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.

D. Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.

E. Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the vacuum job is run 8 days later.

Correct Answer: E

https://learn.microsoft.com/en-us/azure/databricks/delta/vacuum
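The timing behind answer E can be checked with a little date arithmetic. The sketch below assumes the default 7-day retention threshold for deleted data files and uses hypothetical calendar dates (a Sunday 1am delete followed by weekly Monday 3am VACUUM runs). The helper function name is illustrative only and is not part of any Delta Lake API.

```python
from datetime import datetime, timedelta

# Assumption: default Delta Lake retention threshold for removed data files.
RETENTION = timedelta(days=7)


def first_effective_vacuum(delete_time, first_vacuum,
                           interval=timedelta(days=7),
                           retention=RETENTION):
    """Return the first scheduled VACUUM run at which files removed at
    delete_time are older than the retention threshold."""
    run = first_vacuum
    # Keep stepping to the next scheduled run until the files are old enough.
    while run - delete_time <= retention:
        run += interval
    return run


# Hypothetical dates: 2025-03-30 is a Sunday, 2025-03-31 is a Monday.
delete_time = datetime(2025, 3, 30, 1, 0)    # Sunday 1am delete job
first_vacuum = datetime(2025, 3, 31, 3, 0)   # Monday 3am VACUUM job

purge_run = first_effective_vacuum(delete_time, first_vacuum)
gap = purge_run - delete_time

print("Files purged by VACUUM run at:", purge_run)
print("Time after delete:", gap)
```

The Monday run one day after the delete leaves the files in place because they are only about 26 hours old. The run a week later, 8 days and 2 hours after the delete, is the first one that can remove them, so until then the deleted records remain reachable through time travel.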

Question No. 4

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores and only one Executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

A.
* Total VMs: 1
* 400 GB per Executor
* 160 Cores / Executor

B.
* Total VMs: 8
* 50 GB per Executor
* 20 Cores / Executor

C.
* Total VMs: 4
* 100 GB per Executor
* 40 Cores / Executor

D.
* Total VMs: 2
* 200 GB per Executor
* 80 Cores / Executor

Correct Answer: B

This configuration gives the best performance for a job with at least one wide transformation. A wide transformation requires shuffling data across partitions, as in join, groupBy, or orderBy. Shuffling can be expensive and time-consuming, especially if there are too many or too few partitions, so the cluster configuration should balance parallelism against network overhead. With 8 VMs at 50 GB and 20 cores per executor, the work is spread across 8 executors, each with enough memory and CPU to handle its share of the shuffle efficiently. Fewer VMs with more memory and cores per executor concentrate the shuffle on fewer, larger executors, while more VMs with less memory and fewer cores per executor increase the network overhead and the number of shuffle files. Note that the number of executors does not set the number of partitions: the shuffle partition count is controlled by spark.sql.shuffle.partitions (200 by default). Reference: Databricks Certified Data Engineer Professional, "Performance Tuning" section; Databricks documentation, "Cluster configurations" section.
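The premise that all four options offer the same total resources can be confirmed with quick arithmetic. This sketch multiplies the per-executor figures from the options by the VM count (one executor per VM); the dictionary layout is illustrative only.

```python
# Each entry: option letter -> (VMs, GB per executor, cores per executor),
# copied from the four answer options. One executor runs per VM.
configs = {
    "A": (1, 400, 160),
    "B": (8, 50, 20),
    "C": (4, 100, 40),
    "D": (2, 200, 80),
}

# Total RAM and total cores for each option.
totals = {
    letter: (vms * gb, vms * cores)
    for letter, (vms, gb, cores) in configs.items()
}

for letter, (total_gb, total_cores) in totals.items():
    print(f"{letter}: {total_gb} GB RAM, {total_cores} cores")
```

Since the totals match, the options differ only in how the same memory and cores are divided among executors.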

Question No. 5

A DLT pipeline includes the following streaming tables:

raw_iot ingests raw device measurement data from a heart rate tracking device.

bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.

How can the data engineer configure this pipeline to be able to retain manually deleted or updated records in the raw_iot table while recomputing the downstream table when a pipeline update is run?

A. Set the skipChangeCommits flag to true on bpm_stats

B. Set the skipChangeCommits flag to true on raw_iot

C. Set the pipelines.reset.allowed property to false on bpm_stats

D. Set the pipelines.reset.allowed property to false on raw_iot

Correct Answer: D

To retain manually deleted or updated records in the raw_iot table while recomputing downstream tables when a pipeline update is run, set the table property pipelines.reset.allowed to false on raw_iot. This property prevents a full refresh from resetting the table, so its existing data and any manual deletes or updates are kept, while downstream tables such as bpm_stats can still be recomputed.

Databricks documentation on DLT pipelines: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-overview.html
