Top 100 Azure Data Engineer Interview Questions
Level-2
51. What are the differences between ADF and Synapse Analytics?
| ADF | Synapse Analytics |
|---|---|
| Uses ADF Studio | Uses Synapse Studio |
| Stores pipelines in ADF Store / GitHub | Stores pipelines in ADLS / GitHub |
| Used to create Pipelines | Used to create Pipelines, Notebooks |
| Can implement CDC for Incremental Loads | Cannot implement CDC for Incremental Loads |
| Cannot create Dedicated SQL Pool | Can create Dedicated SQL Pool |
| Cannot run TSQL Scripts directly on SQL Pool | Can run the TSQL Scripts directly on SQL Pool |
| Cannot deploy Apache Spark Pools for Staging | Can deploy Apache Spark Pools for Staging |
| Slower | Faster |
| Cheaper | Costlier |
| Cannot be used to create Notebooks | Can be used to create Notebooks |
(As of this writing)
52. In what scenarios do we need to use the Copy Data Tool only, and not the Data Flow activity?
- i. To implement Basic ETL avoiding Spark Clusters, using DIUs. Applicable for Small Data Loads.
- ii. To implement File Incremental Loads
- iii. To implement Table Level Incremental Load using UPSERT
53. In what scenarios do we need to use the Data Flow activity and not the Copy Data Tool?
- i. To implement ETL with Spark Clusters
- ii. To implement Big Data Loads
- iii. To implement flexible transformations like Joins, Group By, Pivot, Derived Columns, Lookup, etc.
- iv. To work with ETL operations on more than one source or more than one destination
- v. To perform active / LIVE Debugging during Pipeline design
54. In what scenarios do we need to use Staging as a mandatory resource?
- i. To load data into SQL Pool (DWH) using Copy Data Tool
- ii. To load data into SQL Pool (DWH) using Data Flow
Recommended scenarios to use staging for ADF Pipelines:
- Multi Level Transformations
- During implementation of Pipeline partitions
- During implementation of Broadcast options
55. How can you optimize an ADF Pipeline? Give step by step approach?
Step 1: Validate the Pipeline.
Step 2: Debug the Pipeline. Identify the transformations that take the longest time.
Step 3: Verify and implement Pipeline Tuning Options if needed:
- Staging
- Increase Capacity of DIU or Spark Cluster
- Implement Compression
- Implement Best Effort Logging
Step 4: Verify and implement SQL Pool Tuning options as needed:
- Distributions
- Partitions
- Broadcast options
56. What techniques do you use to optimize SQL Pool?
- Distributions
- Partitions
- Indexing
57. What techniques do you use to optimize Copy Data activity?
- Staging
- Best effort logging
- Compressions
- DIU (Data Integration Units)
- DOCP (Degree of Copy Parallelism)
- Concurrent connections
58. What techniques do you use to optimize Data Flow activity?
- Staging
- Best effort logging
- Spark cluster size
- Partitioning
- Broadcast
59. What are the limitations of the Upsert Copy method?
- A common key column must be specified, and the same key must exist in both source and sink
60. What are the types of Spark clusters involved in ADF?
- Debug cluster
- Execution cluster
61. Where are the ADF artifacts (pipelines, data flows, datasets, linked services, etc.) published to?
ADF Store
62. What are the limitations of Copy Data activity?
- We cannot use more than one source or destination
- We cannot implement flexible ETL
- Compared with Data Flow Spark clusters, this activity is slower unless we use PolyBase
- Debug is not possible
63. What are the limitations of Data Flow activity?
- PolyBase is not supported
- File incremental loads are not supported
64. You are a Data Engineer and have been given access to an Azure PaaS server that hosts a few databases. How do you identify or differentiate between Azure SQL Databases (for OLTP workloads) and SQL Pools (for DWH workloads)?
Answer:
From the SSMS tool: once connected to the Azure SQL Server, run:
- `select * from sys.databases`
- From the result, if we see DTU compute, it is an Azure SQL Database.
- If we see DWH (DWU) compute, it is a SQL Pool.
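The check above can be made more explicit. A hedged sketch, assuming you can query the logical server's `master` database (where `sys.database_service_objectives` is available):

```sql
-- Run against the logical server's master database.
-- service_objective values like 'S0' or 'P1' follow the DTU model (Azure SQL Database);
-- values like 'DW100c' follow the DWU model (dedicated SQL pool).
SELECT d.name, o.edition, o.service_objective
FROM sys.databases AS d
JOIN sys.database_service_objectives AS o
  ON d.database_id = o.database_id;
```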
65. You are a Data Engineer for an online travel company. You performed data loads using a COPY INTO T-SQL script, but you experienced very slow data loads due to a huge volume of data. However, due to certain client restrictions, you do not have the option to use pipelines or notebooks. What are your next steps? How can you address or troubleshoot this slow data load problem?
Answer:
- Partition the Target Table
- Set Distribution for the Target Table
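A sketch of these two steps as DDL; the table and column names are hypothetical. Hash distribution plus date partitioning lets COPY INTO loads parallelize across distributions:

```sql
CREATE TABLE dbo.FactBooking
(
    BookingId   BIGINT        NOT NULL,
    BookingDate DATE          NOT NULL,
    Amount      DECIMAL(18,2) NULL
)
WITH
(
    DISTRIBUTION = HASH (BookingId),   -- spread rows across the 60 distributions
    CLUSTERED COLUMNSTORE INDEX,       -- default; suits large fact tables
    PARTITION ( BookingDate RANGE RIGHT FOR VALUES
                ('2023-01-01', '2024-01-01') )
);
```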
66. How do you optimize source data extraction if you need to get data from an Azure SQL Database?
Answer:
- Allocate more DIUs for the Pipeline
- Define indexes on the source table
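For example, if the extraction query filters on a modified-date column, a supporting index at the source helps. A sketch with hypothetical table and column names:

```sql
-- Speeds up incremental extraction queries that filter on ModifiedDate
CREATE NONCLUSTERED INDEX IX_SalesOrder_ModifiedDate
ON dbo.SalesOrder (ModifiedDate)
INCLUDE (SalesOrderId, CustomerId, Amount);  -- covers the extraction query's columns
```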
67. How do you optimize destination data loads if you need to write data to an Azure SQL Pool?
Answer:
- Allocate more DIUs for the Pipeline
- Allocate more DWUs for the DWH
- Create the destination tables with Partitions and Distributions
68. What is the difference between Azure SQL Server deployment and regular (on-premise) SQL Server installation?
| Azure SQL Server Deployment | On-Premise SQL Server Deployment |
|---|---|
| From Cloud Account, Browser. No OS involved | OS mandatory |
| No option for Instance, Instance Name, Location | Option to choose Instance |
| No concept of Service, Collation | Option to set Service, Collation |
| No option to control system db locations | Option to control System DB Locations |
69. What is the difference between Azure SQL Database in the cloud and regular (on-premise) SQL Database creation?
| Azure SQL Database | On-Premise SQL Database |
|---|---|
| No option to choose File Locations | Flexible File Location |
| No option to add new Files | Option to add new Files (Secondary Files) |
| Flexible Pricing Plans | Free, unlimited Storage with SQL Server License |
70. What are the precautions to establish a connection to Azure SQL Server from on-premise tools?
- Configure Firewall
- Allow services and resources to access the server
- Port #1433 should be allowed
71. What is the use of Azure SQL Database? What is the use of Dedicated SQL Pool? When should you use which?
Azure SQL Database:
- Used to store OLTP Data
- License based on Size, DTU [Database Transaction Unit]
Azure SQL Pool:
- Used to store old, Historical Data
- License based on DWU [Data Warehouse Unit]; Unlimited Size.
72. How to identify if a given database is an Azure SQL DB or a Synapse DB?
From SSMS Connections:
- A Synapse DB shows a yellow-coloured icon, while a SQL DB's icon is black
From T-SQL Script:
- `select * from sys.databases`
From Azure Portal:
All Resources > SQL Server Resource > SQL Databases >
- DTUs: Azure SQL Database
- DWUs: Synapse Databases
73. When to use which Distribution?
Hash Distribution:
- Each row of the table is assigned to one distribution, spread across the compute nodes
- Used for joins and aggregations on large tables
Round-robin:
- Distributes rows evenly across all distributions
- Delivers fast performance when used as a staging table for loads.
Replicated tables:
- A table that is replicated caches a full copy of the table on each compute node
- Provides the fastest query performance for small tables.
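A sketch of the three options in T-SQL; table and column names are hypothetical:

```sql
-- Large fact table: hash-distribute on a join/aggregation key
CREATE TABLE dbo.FactSales (SaleId BIGINT, CustomerId INT, Amount DECIMAL(18,2))
WITH (DISTRIBUTION = HASH (CustomerId), CLUSTERED COLUMNSTORE INDEX);

-- Staging table: round-robin for fast, even loads
CREATE TABLE dbo.StageSales (SaleId BIGINT, CustomerId INT, Amount DECIMAL(18,2))
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Small dimension table: replicate a full copy to each compute node
CREATE TABLE dbo.DimRegion (RegionId INT, RegionName NVARCHAR(50))
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);
```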
74. What is the difference between DOP and DOCP?
- Degree of Parallelism (DOP): the number of processors used to execute a single query plan in parallel
- Degree of Copy Parallelism (DOCP): the number of connections that can read from your source or write to your sink in parallel
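In a Copy activity's JSON definition, DOCP and DIUs map to the `parallelCopies` and `dataIntegrationUnits` properties; the fragment below is illustrative, with hypothetical source/sink types and values:

```json
"typeProperties": {
    "source": { "type": "AzureSqlSource" },
    "sink":   { "type": "SqlDWSink" },
    "parallelCopies": 8,
    "dataIntegrationUnits": 16
}
```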
75. What are the limitations of Azure Synapse Database?
- Not good for smaller loads
- Not good for Low volume of data loads
- Costlier compared to Azure SQL DB
76. How to perform Data Loads into Azure Synapse Database? Explain all possible options.
- Using the **COPY INTO** T-SQL statement, load data from Azure Storage (Blob / ADLS) into Synapse
- Using **ADF** with the PolyBase option, load data from Blob Storage, Data Lake, and Azure SQL DB
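A minimal sketch of the COPY statement, assuming a CSV source in a hypothetical storage account and a managed-identity credential:

```sql
COPY INTO dbo.FactSales
FROM 'https://mystorageacct.blob.core.windows.net/sales/2024/*.csv'
WITH
(
    FILE_TYPE  = 'CSV',
    FIRSTROW   = 2,                                  -- skip the header row
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```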
77. What are the uses of ADF?
- To perform ETL Operations from any source to any destination
- To perform Incremental Loads to DWH
- To perform Aggregated Data Loads to DWH
78. Where do you store the ADF Pipelines?
- ADF Store
- GitHub
79. What are the Activities in ADF?
Activities are operations we perform in ADF for our day-to-day ETL needs.
Important Activities in ADF:
- Copy Data Activity
- Data Flow Activity
- Stored Procedure Activity
- SQL Pool Stored Procedure Activity
- Lookup Activity, etc..
80. What is “Triggering” in ADF?
Triggering is the mechanism used to execute / invoke an ADF Pipeline.
81. What is DIU?
DIU stands for **Data Integration Unit**.
This is a measure of Processor and Memory allocation for ETL Executions.
82. What are the Prerequisites for ADF Pipeline Creations?
- Need to setup linked services for both source and sink
- Create Datasets
- Design the pipeline with at least one activity (SP, Copy Data, Data Flow, etc..)
83. What are different ways to create ADF Pipelines?
- ADF Studio > Home > Ingest
- ADF Studio > Home > Orchestrate
- ADF Studio > Left: Author > +New > Pipeline
- ADF Studio > Left: Author > +New > Copy Data
- ADF Studio > Left: Author > +New > Template > Specify Template Inputs > Save.
84. What is a Linked Service? How to verify or test a Linked Service?
Linked Service is an ADF Object used to connect with Data Source(s) and Sink(s).
During the creation of a linked service in ADF, we specify the authentication details and click Test Connection.
Additionally, during Dataset creation, the linked service is tested automatically.
85. What are the different types of Integration Runtimes? Purpose of each?
- **Azure (Auto-Resolve) IR:** Used to establish connections to Azure data platforms
- **Self-Hosted IR (SHIR):** Used to establish connections to non-Azure (e.g., on-premises) data platforms
86. How Integration Runtime is linked / associated with Pricing?
- In ADF, IRs are free to configure.
- However, SHIR involves on-premises connections (outside the cloud), which add latency, so more DIUs are consumed.
- Hence such pipelines cost more.
87. Consider a real-world environment with a group of clients connected to a Data Center for OLTP Data Operations. You are a Data Engineer to perform ETL Operations. Where do you need to configure the Self Hosted IR?
Answer: On a dedicated **Middle Tier (MT) server**, rather than on the Data Center server or a client machine.
88. What are the limitations of SHIR?
- Cannot be reused
- Only applicable for Copy Data Activity. Not applicable for Data Flow Activity.
89. What are the uses of Copy Data Tool in ADF?
- To perform Aggregated Data Loads
- To perform File Incremental Loads
- To perform Table Incremental Loads (Upserts)
- To perform BULK INSERTS
90. What are the limitations of Copy Data Tool?
- We cannot use transformations or activities such as Lookup, Derived Column, Surrogate Key, Sort, Stored Procedure, etc.
- We cannot have more than one source linked service
- We cannot have more than one destination linked service
- We cannot reuse the Datasets
91. What is “Binary Copy” Option?
- We used this in our File Incremental Loads example.
- With this option, the data is transferred as-is from source to sink, skipping parsing and validation within ADF.
- Structural changes at the source are automatically detected and carried over to the sink.
92. How to deploy a Copy Data Pipeline?
- Deploy means to PUBLISH a pipeline.
- We cannot publish a new pipeline manually.
- The pipeline will be auto published / auto deployed once we complete the Copy Data Wizard.
- Additionally, we can perform changes to the resulting pipeline. We need to manually publish the changes.
93. What are the steps to EDIT a Pipeline resulted from Copy Data Tool?
- From ADF Studio: Left > Author > Right : We see the pipeline
- Click on Copy Data Activity > Perform changes > Publish.
94. How to perform File Level Incremental Loads in ADF?
- Use the Binary Copy option with the Copy Data Wizard, and schedule the pipeline with a cadence to get the Incremental Loads option.
- Please refer to the Incremental example in our classes.
95. How to perform Table Level Incremental Loads in ADF?
- Option 1: Using the Copy Data Tool with the Upsert copy method
- Option 2: Using a watermark table (Control Table) with a Stored Procedure activity
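A sketch of the watermark (control table) pattern; all object and parameter names are hypothetical:

```sql
-- 1. A Lookup activity reads the last watermark:
SELECT WatermarkValue
FROM   dbo.WatermarkControl
WHERE  TableName = 'SalesOrder';

-- 2. The Copy activity's source query pulls only rows changed since then:
SELECT *
FROM   dbo.SalesOrder
WHERE  ModifiedDate > @LastWatermark;

-- 3. A Stored Procedure activity advances the watermark after a successful load:
UPDATE dbo.WatermarkControl
SET    WatermarkValue = @NewWatermark
WHERE  TableName = 'SalesOrder';
```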
96. How to manage connections in ADF?
From ADF Studio: Left: Manage > Right: Linked services > Click @ Required Linked Service > Perform Edits > Save.
97. How to Dynamically Control the connections in ADF?
Using Data Factory Connection Parameters [Ex: Template @ Control Tables]
98. What are the precautions for Incremental Loads?
The Business Key at the source (OLTP) should be read-only. For this, you can define it as an Identity column.
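For example, defining the business key as an IDENTITY column keeps it system-generated and effectively read-only; the table and column names below are hypothetical:

```sql
CREATE TABLE dbo.Customer
(
    CustomerId   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- read-only business key
    CustomerName NVARCHAR(100)     NOT NULL,
    ModifiedDate DATETIME2         NOT NULL DEFAULT SYSUTCDATETIME()
);
```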
99. What are the different Tuning Options in ADF?
- DIU & DOCP
100. How are DIUs different from DOCP?
- DIUs: a combination of memory, CPU, and network resources allocated to a Copy activity run
- DOCP: the number of parallel connections used to read from the source or write to the sink