Question # 1 A small company based in the United States has recently contracted a consulting firm in
India to implement several new data engineering pipelines to power artificial intelligence
applications. All the company's data is stored in regional cloud storage in the United States.
The workspace administrator at the company is uncertain about where the Databricks
workspace used by the contractors should be deployed.
Assuming that all data governance considerations are accounted for, which statement
accurately informs this decision?
A. Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.
B. Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.
C. Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.
D. Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.
E. Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.
Answer:
C. Cross-region reads and writes can incur significant costs and latency; whenever
possible, compute should be deployed in the same region the data is stored.
Explanation: This is the correct answer because it accurately informs this decision. The
decision is about where the Databricks workspace used by the contractors should be
deployed. The contractors are based in India, while all the company’s data is stored in
regional cloud storage in the United States. When choosing a region for deploying a
Databricks workspace, one of the important factors to consider is the proximity to the data
sources and sinks. Cross-region reads and writes can incur significant costs and latency
due to network bandwidth and data transfer fees. Therefore, whenever possible, compute
should be deployed in the same region the data is stored to optimize performance and
reduce costs. Verified References: [Databricks Certified Data Engineer Professional], under
“Databricks Workspace” section; Databricks Documentation, under “Choose a region”
section.
Question # 2 The data engineering team maintains a table of aggregate statistics through batch nightly
updates. This includes total sales for the previous day alongside totals and averages for a
variety of time periods including the 7 previous days, year-to-date, and quarter-to-date.
This table is named store_sales_summary and the schema is as follows:
The table daily_store_sales contains all the information needed to update
store_sales_summary. The schema for this table is:
store_id INT, sales_date DATE, total_sales FLOAT
If daily_store_sales is implemented as a Type 1 table and the total_sales column might
be adjusted after manual data auditing, which approach is the safest to generate accurate
reports in the store_sales_summary table?
A. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each update.
B. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.
C. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
D. Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
E. Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.
Answer:
E. Use Structured Streaming to subscribe to the change data feed for daily_store_sales
and apply changes to the aggregates in the store_sales_summary table with each update.
Explanation: The daily_store_sales table contains all the information needed to update
store_sales_summary. The schema of the table is:
store_id INT, sales_date DATE, total_sales FLOAT
The daily_store_sales table is implemented as a Type 1 table, which means that old values
are overwritten by new values and no history is maintained. The total_sales column might
be adjusted after manual data auditing, which means that the data in the table may change
over time.
The safest approach to generate accurate reports in the store_sales_summary table is to
use Structured Streaming to subscribe to the change data feed for daily_store_sales and
apply changes to the aggregates in the store_sales_summary table with each update.
Structured Streaming is a scalable and fault-tolerant stream processing engine built on
Spark SQL. Structured Streaming allows processing data streams as if they were tables or
DataFrames, using familiar operations such as select, filter, groupBy, or join. Structured
Streaming also supports output modes that specify how to write the results of a streaming
query to a sink, such as append, update, or complete. Structured Streaming can handle
both streaming and batch data sources in a unified manner.
The change data feed is a feature of Delta Lake that provides structured streaming sources
that can subscribe to changes made to a Delta Lake table. The change data feed records
row-level changes (inserts, updates, and deletes) as ordered events that can be processed
by downstream applications or services. A streaming read of the change data feed can be
configured with options such as starting from a specific version or timestamp, and
consumers can filter the resulting rows by the _change_type column to process only the
operations they care about.
By using Structured Streaming to subscribe to the change data feed for daily_store_sales,
one can capture and process any changes made to the total_sales column due to manual
data auditing. By applying these changes to the aggregates in the store_sales_summary
table with each update, one can ensure that the reports are always consistent and accurate
with the latest data. Verified References: [Databricks Certified Data Engineer Professional],
under “Spark Core” section; Databricks Documentation, under “Structured Streaming”
section; Databricks Documentation, under “Delta Change Data Feed” section.
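To make this concrete, the following PySpark sketch illustrates the pattern described above: it streams the change data feed of daily_store_sales and merges recomputed aggregates into store_sales_summary using foreachBatch. Because the store_sales_summary schema is not reproduced in this document, the aggregate column, checkpoint path, and trigger choice are illustrative assumptions rather than the exam's reference solution.

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Stream row-level changes from the source table (requires the table property
# delta.enableChangeDataFeed = true on daily_store_sales).
changes = (
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .table("daily_store_sales")
)

def upsert_summary(batch_df, batch_id):
    # Recompute aggregates for the stores touched in this micro-batch from the
    # current state of the source table, then MERGE them into the summary table.
    touched = [row["store_id"] for row in batch_df.select("store_id").distinct().collect()]
    fresh = (
        spark.table("daily_store_sales")
        .where(F.col("store_id").isin(touched))
        .groupBy("store_id")
        .agg(F.sum("total_sales").alias("total_sales_ytd"))  # hypothetical summary column
    )
    (
        DeltaTable.forName(spark, "store_sales_summary")
        .alias("t")
        .merge(fresh.alias("s"), "t.store_id = s.store_id")
        .whenMatchedUpdateAll()    # assumes fresh matches the summary schema
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    changes.writeStream
    .foreachBatch(upsert_summary)
    .option("checkpointLocation", "/tmp/checkpoints/store_sales_summary")  # hypothetical path
    .trigger(availableNow=True)  # consume all available changes, then stop (nightly batch style)
    .start()
)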
Question # 3 Each configuration below is identical to the extent that each cluster has 400 GB total of
RAM, 160 total cores and only one Executor per VM.
Given a job with at least one wide transformation, which of the following cluster
configurations will result in maximum performance?
A. • Total VMs: 1
• 400 GB per Executor
• 160 Cores / Executor
B. • Total VMs: 8
• 50 GB per Executor
• 20 Cores / Executor
C. • Total VMs: 4
• 100 GB per Executor
• 40 Cores / Executor
D. • Total VMs: 2
• 200 GB per Executor
• 80 Cores / Executor
Answer:
B. • Total VMs: 8
• 50 GB per Executor
• 20 Cores / Executor
Explanation: This is the correct answer because it is the cluster configuration that will
result in maximum performance for a job with at least one wide transformation. A wide
transformation is a type of transformation that requires shuffling data across partitions,
such as join, groupBy, or orderBy. Shuffling can be expensive and time-consuming,
especially if there are too many or too few partitions. Therefore, it is important to choose a
cluster configuration that can balance the trade-off between parallelism and network
overhead. In this case, 8 VMs with 50 GB and 20 cores per executor provide 8 moderately
sized executors, each with enough memory per core to handle its share of the shuffle
efficiently. Concentrating the same 400 GB and 160 cores into one or two very large
executors increases garbage-collection pressure and makes the loss of a single executor
far more disruptive, while splitting the cluster into many very small executors increases
network overhead and the number of shuffle files exchanged between
executors. Verified References: [Databricks Certified Data Engineer
Professional], under “Performance Tuning” section; Databricks Documentation, under
“Cluster configurations” section.
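As a rough illustration of how the shuffle interacts with the cluster's 160 task slots, the sketch below reuses the daily_store_sales table from Question 2 purely as an example; the output table name is hypothetical.

# With 8 executors x 20 cores there are 160 task slots; setting the shuffle
# partition count to a small multiple of that keeps every core busy during the
# exchange triggered by the wide transformation below.
spark.conf.set("spark.sql.shuffle.partitions", 160 * 2)

daily = spark.table("daily_store_sales")                  # example source table
totals = daily.groupBy("store_id").sum("total_sales")     # wide transformation -> shuffle
totals.write.mode("overwrite").saveAsTable("store_totals_example")  # hypothetical output table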
Question # 4 A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.
Which code block attempts to perform an invalid stream-static join?
A. userLookup.join(streamingDF, ["userid"], how="inner")
B. streamingDF.join(userLookup, ["user_id"], how="outer")
C. streamingDF.join(userLookup, ["user_id"], how="left")
D. streamingDF.join(userLookup, ["userid"], how="inner")
E. userLookup.join(streamingDF, ["user_id"], how="right")
Answer:
B. streamingDF.join(userLookup, ["user_id"], how="outer")
Explanation: In Spark Structured Streaming, certain joins between a static DataFrame and a
streaming DataFrame are not supported. Outer joins are only allowed when the streaming
DataFrame is the preserved (outer) side, because Spark cannot know whether a static row
will ever be matched by a future streaming row without waiting for the unbounded stream to
end. A full outer join between a streaming and a static DataFrame is therefore invalid, which
makes option B the answer. The other code blocks are valid stream-static joins: inner joins
are always supported, a left outer join with the stream on the left preserves the streaming
side, and a right outer join with the static table on the left and the stream on the right also
preserves the streaming side.
References:
Structured Streaming Programming Guide: Join Operations
Databricks Documentation on Stream-Static Joins: Databricks Stream-Static Joins
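The sketch below illustrates the distinction; the source names user_lookup and user_events and the user_id column are hypothetical.

# Hypothetical sources used only to illustrate the join rules.
userLookup = spark.table("user_lookup")               # static Delta table
streamingDF = spark.readStream.table("user_events")   # streaming source

# Supported stream-static joins: the streaming side is preserved.
inner_join = streamingDF.join(userLookup, ["user_id"], how="inner")
left_join = streamingDF.join(userLookup, ["user_id"], how="left")
right_join = userLookup.join(streamingDF, ["user_id"], how="right")

# Not supported: a full outer join would have to emit unmatched static rows,
# which can never be decided on an unbounded stream. Spark raises an
# AnalysisException when a query containing this join is started.
full_outer = streamingDF.join(userLookup, ["user_id"], how="outer")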
Question # 5 A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used. Which strategy will yield the best performance without shuffling data?
A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.
Answer:
A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
Explanation:
The key to efficiently converting a large JSON dataset to Parquet files of a specific size without shuffling data lies in controlling the size of the output files directly.
• Setting spark.sql.files.maxPartitionBytes to 512 MB configures Spark to process data in chunks of 512 MB. This setting directly influences the size of the part-files in the output, aligning with the target file size.
• Narrow transformations (which do not involve shuffling data across partitions) can then be applied to this data.
• Writing the data out to Parquet will result in files that are approximately the size specified by spark.sql.files.maxPartitionBytes, in this case, 512 MB.
• The other options involve unnecessary shuffles or repartitions (B, C, D) or an incorrect setting for this specific requirement (E).
References:
• Apache Spark Documentation: Configuration - spark.sql.files.maxPartitionBytes
• Databricks Documentation on Data Sources: Databricks Data Sources Guide
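A minimal sketch of the selected approach follows; the input path, output path, and filter column are placeholders.

# Target roughly 512 MB per input partition; with only narrow transformations
# the write preserves that partitioning, so each task writes one part-file of
# approximately the target size (before Parquet encoding and compression).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

raw = spark.read.json("/mnt/raw/events/")                 # hypothetical 1 TB JSON location
cleaned = raw.filter("event_id IS NOT NULL")              # narrow transformation: no shuffle
cleaned.write.mode("overwrite").parquet("/mnt/curated/events/")  # hypothetical target path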
Question # 6 The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org. Which of the following solutions addresses the situation while emphasizing simplicity?
A. Create a view on the marketing table selecting only those fields approved for the sales team, aliasing the names of any fields that should be standardized to the sales naming conventions.
B. Use a CTAS statement to create a derivative table from the marketing table and configure a production job to propagate changes.
C. Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.
D. Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync changes committed to one table to the corresponding table.
Answer:
A. Create a view on the marketing table selecting only those fields approved for the sales team, aliasing the names of any fields that should be standardized to the sales naming conventions.
Explanation:
Creating a view is a straightforward solution that can address the need for field name standardization and selective field sharing between departments. A view allows for presenting a transformed version of the underlying data without duplicating it. In this scenario, the view would only include the approved fields for the sales team and rename any fields as per their naming conventions.
References:
• Databricks documentation on using SQL views in Delta Lake: https://docs.databricks.com/delta/quick-start.html#sql-views
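A hedged example of such a view, issued from a notebook with spark.sql; every table and column name below is invented for illustration.

# Expose only the approved fields, renamed to the sales naming conventions,
# without copying any data out of the marketing table.
spark.sql("""
    CREATE OR REPLACE VIEW sales.customer_agg_v AS
    SELECT
        mkt_customer_id AS customer_id,
        mkt_region      AS sales_region,
        total_spend
    FROM marketing.customer_agg
""")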
Question # 7 The data governance team has instituted a requirement that all tables containing Personally
Identifiable Information (PII) must be clearly annotated. This includes adding column
comments, table comments, and setting the custom table property "contains_pii" = true.
The following SQL DDL statement is executed to create a new table:
Which command allows manual confirmation that these three requirements have been met?
A. DESCRIBE EXTENDED dev.pii_test
B. DESCRIBE DETAIL dev.pii_test
C. SHOW TBLPROPERTIES dev.pii_test
D. DESCRIBE HISTORY dev.pii_test
E. SHOW TABLES dev
Answer:
A. DESCRIBE EXTENDED dev.pii_test
Explanation: This is the correct answer because it allows manual confirmation that these
three requirements have been met. The requirements are that all tables containing
Personal Identifiable Information (PII) must be clearly annotated, which includes adding
column comments, table comments, and setting the custom table property “contains_pii” =
true. The DESCRIBE EXTENDED command is used to display detailed information about a
table, such as its schema, location, properties, and comments. By using this command on
the dev.pii_test table, one can verify that the table has been created with the correct
column comments, table comment, and custom table property as specified in the SQL DDL
statement. Verified References: [Databricks Certified Data Engineer Professional], under
“Lakehouse” section; Databricks Documentation, under “DESCRIBE EXTENDED” section.
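For reference, the check can be run from a notebook as shown below; the table name follows the question.

# Column comments appear alongside each column; the table comment and the
# contains_pii entry (under Table Properties) appear in the extended metadata
# rows at the bottom of the output.
spark.sql("DESCRIBE EXTENDED dev.pii_test").show(truncate=False)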
Question # 8 Although the Databricks Utilities Secrets module provides tools to store sensitive
credentials and avoid accidentally displaying them in plain text, users should still be careful
with which credentials are stored here and which users have access to using these secrets.
Which statement describes a limitation of Databricks Secrets?
A. Because the SHA256 hash is used to obfuscate stored secrets, reversing this hash will display the value in plain text.
B. Account administrators can see all secrets in plain text by logging on to the Databricks Accounts console.
C. Secrets are stored in an administrators-only table within the Hive Metastore; database administrators have permission to query this table by default.
D. Iterating through a stored secret and printing each character will display secret contents in plain text.
E. The Databricks REST API can be used to list secrets in plain text if the personal access token has proper credentials.
Answer:
E. The Databricks REST API can be used to list secrets in plain text if the personal access
token has proper credentials.
Explanation: This is the correct answer because it describes a limitation of Databricks
Secrets. Databricks Secrets is a module that provides tools to store sensitive credentials
and avoid accidentally displaying them in plain text. Databricks Secrets allows creating
secret scopes, which are collections of secrets that can be accessed by users or groups.
Databricks Secrets also allows creating and managing secrets using the Databricks CLI or
the Databricks REST API. However, a limitation of Databricks Secrets is that the
Databricks REST API can be used to list secrets in plain text if the personal access token
has proper credentials. Therefore, users should still be careful with which credentials are
stored in Databricks Secrets and which users have access to using these secrets. Verified
References: [Databricks Certified Data Engineer Professional], under “Databricks
Workspace” section; Databricks Documentation, under “List secrets” section.
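As a usage reminder, the sketch below shows the intended notebook workflow with Databricks secrets; the scope and key names are made up for illustration.

# Fetch a credential from a secret scope; the value can be passed to connectors
# and libraries, but Databricks redacts the literal value if it is displayed in
# notebook output. Scope and key names below are hypothetical.
jdbc_password = dbutils.secrets.get(scope="prod-warehouse", key="jdbc-password")

print(jdbc_password)  # renders as [REDACTED] in the notebook output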