Top 100 Azure Data Engineer Interview Questions
Level-2
51. What are the differences between ADF and Synapse Analytics?
| ADF | Synapse Analytics |
|---|---|
| Uses ADF Studio | Uses Synapse Studio |
| Stores pipelines in ADF Store / GitHub | Stores pipelines in ADLS / GitHub |
| Used to create Pipelines | Used to create Pipelines, Notebooks |
| Can implement CDC for Incremental Loads | Cannot implement CDC for Incremental Loads |
| Cannot create Dedicated SQL Pool | Can create Dedicated SQL Pool |
| Cannot run TSQL Scripts directly on SQL Pool | Can run the TSQL Scripts directly on SQL Pool |
| Cannot deploy Apache Spark Pools for Staging | Can deploy Apache Spark Pools for Staging |
| Slower | Faster |
| Cheaper | Costlier |
| Cannot be used to create Notebooks | Can be used to create Notebooks |
(As of this writing)
52. In what scenarios do we need to use the Copy Data Tool only, and not the Data Flow activity?
- i. To implement Basic ETL avoiding Spark Clusters, using DIUs. Applicable for Small Data Loads.
- ii. To implement File Incremental Loads
- iii. To implement Table Level Incremental Load using UPSERT
53. In what scenarios do we need to use the Data Flow activity and not the Copy Data Tool?
- i. To implement ETL with Spark Clusters
- ii. To implement Big Data Loads
- iii. To implement flexible transformations like Joins, Group By, Pivot, Derived Columns, Lookup, etc.
- iv. To work with ETL operations on more than one source or more than one destination
- v. To perform active / LIVE Debugging during Pipeline design
54. In what scenarios do we need to use Staging as a mandatory resource?
- i. To load data into SQL Pool (DWH) using Copy Data Tool
- ii. To load data into SQL Pool (DWH) using Data Flow
Recommended scenarios to use staging for ADF Pipelines:
- Multi Level Transformations
- During implementation of Pipeline partitions
- During implementation of Broadcast options
55. How can you optimize an ADF Pipeline? Give step by step approach?
Step 1: Validate the Pipeline.
Step 2: Debug the Pipeline. Identify the transformations that take the longest time.
Step 3: Verify and implement Pipeline Tuning Options if needed:
- Staging
- Increase Capacity of DIU or Spark Cluster
- Implement Compression
- Implement Best Effort Logging
Step 4: Verify and implement SQL Pool Tuning options as needed:
- Distributions
- Partitions
- Broadcast options
56. What techniques do you use to optimize SQL Pool?
- Distributions
- Partitions
- Indexing
57. What techniques do you use to optimize Copy Data activity?
- Staging
- Best effort logging
- Compressions
- DIU (Data Integration Units)
- DOCP (Degree of Copy Parallelism)
- Concurrent connections
58. What techniques do you use to optimize Data Flow activity?
- Staging
- Best effort logging
- Spark cluster size
- Partitioning
- Broadcast
59. What are the limitations of the Upsert Copy method?
- A common key column must be specified, and the same key must exist in both source and sink
60. What are the types of Spark clusters involved in ADF?
- Debug cluster
- Execution cluster
61. Where are the ADF artifacts (pipelines, data flows, datasets, linked services, etc.) published to?
ADF Store
62. What are the limitations of Copy Data activity?
- We cannot use more than one source or destination
- We cannot implement flexible ETL
- Compared with Data Flow Spark clusters, this activity is slower unless we use PolyBase
- Debug is not possible
63. What are the limitations of Data Flow activity?
- PolyBase is not supported
- File incremental loads are not supported
64. You are a Data Engineer and have been given access to an Azure PaaS server that hosts a few databases. How do you identify or differentiate between Azure SQL Databases (for OLTP workloads) and SQL Pools (for DWH workloads)?
Answer:
From the SSMS tool: once connected to the Azure SQL Server, run:
- `select * from sys.databases`
- From the result, if we see DTU compute, it is an Azure SQL Database.
- If we see DWH (DWU) compute, it is a SQL Pool.
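The check above can be made more explicit. A hedged sketch, assuming you can query the logical server's `master` database (where `sys.database_service_objectives` is available):

```sql
-- Run against the logical server's master database.
-- service_objective values like 'S0' or 'P1' follow the DTU model (Azure SQL Database);
-- values like 'DW100c' follow the DWU model (dedicated SQL pool).
SELECT d.name, o.edition, o.service_objective
FROM sys.databases AS d
JOIN sys.database_service_objectives AS o
  ON d.database_id = o.database_id;
```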
65. You are a Data Engineer for an online travel company. You performed data loads using a COPY INTO T-SQL script, but you experienced very slow data loads due to a huge volume of data. However, due to certain client restrictions, you do not have the option to use pipelines or notebooks. What are your next steps? How can you address or troubleshoot this slow data load problem?
Answer:
- Partition the Target Table
- Set Distribution for the Target Table
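A sketch of these two steps as DDL; the table and column names are hypothetical. Hash distribution plus date partitioning lets COPY INTO loads parallelize across distributions:

```sql
CREATE TABLE dbo.FactBooking
(
    BookingId   BIGINT        NOT NULL,
    BookingDate DATE          NOT NULL,
    Amount      DECIMAL(18,2) NULL
)
WITH
(
    DISTRIBUTION = HASH (BookingId),   -- spread rows across the 60 distributions
    CLUSTERED COLUMNSTORE INDEX,       -- default; suits large fact tables
    PARTITION ( BookingDate RANGE RIGHT FOR VALUES
                ('2023-01-01', '2024-01-01') )
);
```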
66. How do you optimize source data extraction if you need to get data from an Azure SQL Database?
Answer:
- Allocate more DIUs for the Pipeline
- Define indexes on the source table
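For example, if the extraction query filters on a modified-date column, a supporting index at the source helps. A sketch with hypothetical table and column names:

```sql
-- Speeds up incremental extraction queries that filter on ModifiedDate
CREATE NONCLUSTERED INDEX IX_SalesOrder_ModifiedDate
ON dbo.SalesOrder (ModifiedDate)
INCLUDE (SalesOrderId, CustomerId, Amount);  -- covers the extraction query's columns
```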
67. How do you optimize destination data loads if you need to write data to an Azure SQL Pool?
Answer:
- Allocate more DIUs for the Pipeline
- Allocate more DWUs for the DWH
- Create the destination tables with Partitions and Distributions
68. What is the difference between Azure SQL Server deployment and regular (on-premise) SQL Server installation?
| Azure SQL Server Deployment | On-Premise SQL Server Deployment |
|---|---|
| From Cloud Account, Browser. No OS involved | OS mandatory |
| No option for Instance, Instance Name, Location | Option to choose Instance |
| No concept of Service, Collation | Option to set Service, Collation |
| No option to control system db locations | Option to control System DB Locations |
69. What is the difference between Azure SQL Database in the cloud and regular (on-premise) SQL Database creation?
| Azure SQL Database | On-Premise SQL Database |
|---|---|
| No option to choose File Locations | Flexible File Location |
| No option to add new Files | Option to add new Files (Secondary Files) |
| Flexible Pricing Plans | Free, unlimited Storage with SQL Server License |
70. What are the precautions to establish a connection to Azure SQL Server from on-premise tools?
- Configure Firewall
- Allow services and resources to access the server
- Port #1433 should be allowed
71. What is the use of Azure SQL Database? What is the use of Dedicated SQL Pool? When should you use which?
Azure SQL Database:
- Used to store OLTP Data
- License based on Size, DTU [Database Transaction Unit]
Azure SQL Pool:
- Used to store old, Historical Data
- License based on DWU [Data Warehouse Unit]; Unlimited Size.
72. How to identify if a given database is an Azure SQL DB or a Synapse DB?
From SSMS Connections:
- A Synapse DB shows a yellow-coloured icon, while a SQL DB's icon is black
From T-SQL Script:
- `select * from sys.databases`
From Azure Portal:
All Resources > SQL Server Resource > SQL Databases >
- DTUs: Azure SQL Database
- DWUs: Synapse Databases
73. When to use which Distribution?
Hash Distribution:
- Each row of the table is assigned to one distribution, spread across the compute nodes
- Used for joins and aggregations on large tables
Round-robin:
- Distributes rows evenly across all distributions
- Delivers fast performance when used as a staging table for loads.
Replicated tables:
- A table that is replicated caches a full copy of the table on each compute node
- Provides the fastest query performance for small tables.
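A sketch of the three options in T-SQL; table and column names are hypothetical:

```sql
-- Large fact table: hash-distribute on a join/aggregation key
CREATE TABLE dbo.FactSales (SaleId BIGINT, CustomerId INT, Amount DECIMAL(18,2))
WITH (DISTRIBUTION = HASH (CustomerId), CLUSTERED COLUMNSTORE INDEX);

-- Staging table: round-robin for fast, even loads
CREATE TABLE dbo.StageSales (SaleId BIGINT, CustomerId INT, Amount DECIMAL(18,2))
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Small dimension table: replicate a full copy to each compute node
CREATE TABLE dbo.DimRegion (RegionId INT, RegionName NVARCHAR(50))
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);
```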
74. What is the difference between DOP and DOCP?
- Degree of Parallelism (DOP): the number of processors used to execute a single query plan in parallel
- Degree of Copy Parallelism (DOCP): the number of connections that can read from your source or write to your sink in parallel
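In a Copy activity's JSON definition, DOCP and DIUs map to the `parallelCopies` and `dataIntegrationUnits` properties; the fragment below is illustrative, with hypothetical source/sink types and values:

```json
"typeProperties": {
    "source": { "type": "AzureSqlSource" },
    "sink":   { "type": "SqlDWSink" },
    "parallelCopies": 8,
    "dataIntegrationUnits": 16
}
```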
75. What are the limitations of Azure Synapse Database?
- Not good for smaller loads
- Not good for Low volume of data loads
- Costlier compared to Azure SQL DB
76. How to perform Data Loads into Azure Synapse Database? Explain all possible options.
- Using the **COPY INTO** T-SQL statement, load data from Azure Storage (Blob / ADLS) into Synapse
- Using **ADF** with the PolyBase option, load data from Blob Storage, Data Lake, and Azure SQL DB
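A minimal sketch of the COPY statement, assuming a CSV source in a hypothetical storage account and a managed-identity credential:

```sql
COPY INTO dbo.FactSales
FROM 'https://mystorageacct.blob.core.windows.net/sales/2024/*.csv'
WITH
(
    FILE_TYPE  = 'CSV',
    FIRSTROW   = 2,                                  -- skip the header row
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```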
77. What are the uses of ADF?
- To perform ETL Operations from any source to any destination
- To perform Incremental Loads to DWH
- To perform Aggregated Data Loads to DWH
78. Where do you store the ADF Pipelines?
- ADF Store
- GitHub
79. What are the Activities in ADF?
Activities are operations we perform in ADF for our day-to-day ETL needs.
Important Activities in ADF:
- Copy Data Activity
- Data Flow Activity
- Stored Procedure Activity
- SQL Pool Stored Procedure Activity
- Lookup Activity, etc..
80. What is “Triggering” in ADF?
Triggering is the mechanism used to execute / invoke an ADF Pipeline.
81. What is DIU?
DIU stands for **Data Integration Unit**.
This is a measure of Processor and Memory allocation for ETL Executions.
82. What are the Prerequisites for ADF Pipeline Creations?
- Need to setup linked services for both source and sink
- Create Datasets
- Design the pipeline with at least one activity (SP, Copy Data, Data Flow, etc..)
83. What are different ways to create ADF Pipelines?
- ADF Studio > Home > Ingest
- ADF Studio > Home > Orchestrate
- ADF Studio > Left: Author > +New > Pipeline
- ADF Studio > Left: Author > +New > Copy Data
- ADF Studio > Left: Author > +New > Template > Specify Template Inputs > Save.
84. What is a Linked Service? How to verify or test a Linked Service?
Linked Service is an ADF Object used to connect with Data Source(s) and Sink(s).
During the creation of a linked service in ADF, we specify the authentication details and click Test Connection.
Additionally, during Dataset creation, the linked service is tested automatically.
85. What are the different types of Integration Runtimes? Purpose of each?
- **Azure (Auto-Resolve) IR:** Used to establish connections to Azure data platforms
- **Self-Hosted IR (SHIR):** Used to establish connections to non-Azure (e.g., on-premises) data platforms
86. How Integration Runtime is linked / associated with Pricing?
- In ADF, IRs are free to configure.
- However, SHIR involves on-premises connections (outside the cloud), which add latency, so more DIUs are consumed.
- Hence such pipelines cost more.
87. Consider a real-world environment with a group of clients connected to a Data Center for OLTP Data Operations. You are a Data Engineer to perform ETL Operations. Where do you need to configure the Self Hosted IR?
Answer: On a dedicated **Middle Tier (MT) server**, rather than on the Data Center server or a client machine.
88. What are the limitations of SHIR?
- Cannot be reused
- Only applicable for Copy Data Activity. Not applicable for Data Flow Activity.
89. What are the uses of Copy Data Tool in ADF?
- To perform Aggregated Data Loads
- To perform File Incremental Loads
- To perform Table Incremental Loads (Upserts)
- To perform BULK INSERTS
90. What are the limitations of Copy Data Tool?
- We cannot use transformations or activities such as Lookup, Derived Column, Surrogate Key, Sort, Stored Procedure, etc.
- We cannot have more than one source linked service
- We cannot have more than one destination linked service
- We cannot reuse the Datasets
91. What is “Binary Copy” Option?
- We used this in our File Incremental Loads example.
- With this option, the data is transferred as-is from source to sink, skipping parsing and validation within ADF.
- Structural changes at the source are automatically detected and carried over to the sink.
92. How to deploy a Copy Data Pipeline?
- Deploy means to PUBLISH a pipeline.
- We cannot publish a new pipeline manually.
- The pipeline will be auto published / auto deployed once we complete the Copy Data Wizard.
- Additionally, we can perform changes to the resulting pipeline. We need to manually publish the changes.
93. What are the steps to EDIT a Pipeline resulted from Copy Data Tool?
- From ADF Studio: Left > Author > Right : We see the pipeline
- Click on Copy Data Activity > Perform changes > Publish.
94. How to perform File Level Incremental Loads in ADF?
- Use the Binary Copy option with the Copy Data Wizard, and schedule the pipeline with a cadence to get the Incremental Loads option.
- Please refer to the Incremental example in our classes.
95. How to perform Table Level Incremental Loads in ADF?
- Option 1: Using the Copy Data Tool with the Upsert copy method
- Option 2: Using a watermark table (Control Table) with a Stored Procedure activity
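A sketch of the watermark (control table) pattern; all object and parameter names are hypothetical:

```sql
-- 1. A Lookup activity reads the last watermark:
SELECT WatermarkValue
FROM   dbo.WatermarkControl
WHERE  TableName = 'SalesOrder';

-- 2. The Copy activity's source query pulls only rows changed since then:
SELECT *
FROM   dbo.SalesOrder
WHERE  ModifiedDate > @LastWatermark;

-- 3. A Stored Procedure activity advances the watermark after a successful load:
UPDATE dbo.WatermarkControl
SET    WatermarkValue = @NewWatermark
WHERE  TableName = 'SalesOrder';
```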
96. How to manage connections in ADF?
From ADF Studio: Left: Manage > Right: Linked services > Click @ Required Linked Service > Perform Edits > Save.
97. How to Dynamically Control the connections in ADF?
Using Data Factory Connection Parameters [Ex: Template @ Control Tables]
98. What are the precautions for Incremental Loads?
The Business Key at the source (OLTP) should be read-only. For this, you can define it as an Identity column.
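For example, defining the business key as an IDENTITY column keeps it system-generated and effectively read-only; the table and column names below are hypothetical:

```sql
CREATE TABLE dbo.Customer
(
    CustomerId   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- read-only business key
    CustomerName NVARCHAR(100)     NOT NULL,
    ModifiedDate DATETIME2         NOT NULL DEFAULT SYSUTCDATETIME()
);
```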
99. What are the different Tuning Options in ADF?
- DIU & DOCP
100. How are DIUs different from DOCP?
- DIUs: a combination of memory, CPU, and network resources allocated to a Copy activity run
- DOCP: the number of parallel connections used to read from the source or write to the sink