
Top 100 Azure Data Engineer Interview Questions 

Level-2

51. List the differences between ADF and Synapse Analytics.

| ADF | Synapse Analytics |
| --- | --- |
| Uses ADF Studio | Uses Synapse Studio |
| Pipelines stored in ADF Store / GitHub | Pipelines stored in ADLS / GitHub |
| Used to create pipelines | Used to create pipelines and notebooks |
| Can implement CDC for incremental loads | Cannot implement CDC for incremental loads |
| Cannot create a Dedicated SQL Pool | Can create a Dedicated SQL Pool |
| Cannot run T-SQL scripts directly on a SQL Pool | Can run T-SQL scripts directly on a SQL Pool |
| Cannot deploy Apache Spark pools for staging | Can deploy Apache Spark pools for staging |
| Slower | Faster |
| Cheaper | Costlier |
| Cannot be used to create notebooks | Can be used to create notebooks |

(As of this writing)

52. In what scenarios should we use only the Copy Data Tool, and not the Data Flow activity?

  • i. To implement basic ETL using DIUs, avoiding Spark clusters; applicable for small data loads
  • ii. To implement file-level incremental loads
  • iii. To implement table-level incremental loads using UPSERT

53. In what scenarios should we use the Data Flow activity and not the Copy Data Tool?

  • i. To implement ETL with Spark Clusters
  • ii. To implement Big Data Loads
  • iii. To implement flexible transformations like Joins, Group By, Pivot, Derived Columns, Lookup, etc.
  • iv. To work with ETL operations on more than one source or more than one destination
  • v. To perform active / LIVE Debugging during Pipeline design

54. In what scenarios is staging a mandatory resource?

  • i. To load data into SQL Pool (DWH) using Copy Data Tool
  • ii. To load data into SQL Pool (DWH) using Data Flow

Recommended scenarios to use staging in ADF pipelines:

  1. Multi Level Transformations
  2. During implementation of Pipeline partitions
  3. During implementation of Broadcast options

55. How can you optimize an ADF pipeline? Give a step-by-step approach.

Step 1: Validate the Pipeline.

Step 2: Debug the pipeline. Identify the transformations that take the longest time.

Step 3: Verify and implement Pipeline Tuning Options if needed:

  • Staging
  • Increase Capacity of DIU or Spark Cluster
  • Implement Compression
  • Implement Best Effort Logging

Step 4: Verify and implement SQL Pool Tuning options as needed:

  • Distributions
  • Partitions
  • Broadcast options

56. What techniques do you use to optimize SQL Pool?

  • Distributions
  • Partitions
  • Indexing

57. What techniques do you use to optimize Copy Data activity?

  • Staging
  • Best effort logging
  • Compressions
  • DIU (Data Integration Units)
  • DOCP (Degree of Copy Parallelism)
  • Concurrent connections

58. What techniques do you use to optimize Data Flow activity?

  • Staging
  • Best effort logging
  • Spark cluster size
  • Partitioning
  • Broadcast

59. What are the limitations of the Upsert Copy method?

  • A common key column must be specified: the source and sink need matching key columns for the upsert to work

60. What are the types of Spark clusters involved in ADF?

  • Debug cluster
  • Execution cluster

61. Where are the ADF artifacts (pipelines, data flows, datasets, linked services, etc.) published to?

  • ADF Store

62. What are the limitations of Copy Data activity?

  1. We cannot use more than one source or destination
  2. We cannot implement flexible ETL
  3. Compared with Data Flow Spark clusters, this activity is slower unless we use PolyBase
  4. Live debugging during design is not possible

63. What are the limitations of Data Flow activity?

  1. PolyBase is not supported
  2. File incremental loads are not supported

64. You are a Data Engineer. You have been given access to an Azure PaaS server that hosts a few databases. How do you identify or differentiate between Azure SQL Databases (for OLTP workloads) and SQL Pools (for DWH workloads)?

Answer:

From SSMS: once connected to the Azure SQL Server, run:

select * from sys.databases
  • From the result, if we see DTU compute, it is an Azure SQL Database.
  • If we see DWH (DWU) compute, it is a SQL Pool.
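As a concrete sketch, the service objective makes this distinction explicit. Assuming you can query the logical server's master database, a DTU/vCore tier such as 'S0' indicates an Azure SQL Database, while a 'DW…c' value indicates a dedicated SQL Pool:

```sql
-- Run against the logical server's master database.
SELECT d.name,
       so.edition,              -- e.g. 'Standard' vs 'DataWarehouse'
       so.service_objective     -- e.g. 'S0' (SQL DB) vs 'DW100c' (SQL Pool)
FROM sys.databases AS d
JOIN sys.database_service_objectives AS so
  ON d.database_id = so.database_id;
```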

65. You are a Data Engineer for an online travel company. You performed data loads using a COPY INTO T-SQL script, but you experienced very slow data loads due to a huge volume of data. However, due to certain client restrictions, you do not have the option to use pipelines or notebooks. What are your next steps? How can you address or troubleshoot this slow data load problem?

Answer:

  1. Partition the Target Table
  2. Set Distribution for the Target Table
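A hedged sketch of those two steps, assuming a dedicated SQL Pool target (table and column names are illustrative):

```sql
-- Recreate the target with a hash distribution and date-range partitions
-- so COPY INTO can write in parallel across the distributions.
CREATE TABLE dbo.FactBookings
(
    BookingId   BIGINT        NOT NULL,
    CustomerId  INT           NOT NULL,
    BookingDate DATE          NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (BookingId),
    PARTITION ( BookingDate RANGE RIGHT FOR VALUES
                ('2023-01-01', '2024-01-01', '2025-01-01') )
);
```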

66. How do you optimize source data extraction if you need to get data from an Azure SQL Database?

Answer:

  1. Set a higher number of DIUs for the pipeline
  2. Define indexes on the source table

67. How do you optimize destination data loads if you need to write data to an Azure SQL Pool?

Answer:

  1. Set a higher number of DIUs for the pipeline
  2. Allocate more DWUs to the SQL Pool
  3. Create the destination tables with partitions and distributions

68. What is the difference between Azure SQL Server deployment and regular (on-premise) SQL Server installation?

| Azure SQL Server Deployment | On-Premise SQL Server Deployment |
| --- | --- |
| Created from a cloud account in the browser; no OS involved | OS is mandatory |
| No option for instance, instance name, or location | Option to choose an instance |
| No concept of service or collation | Option to set service and collation |
| No option to control system DB locations | Option to control system DB locations |

69. What is the difference between Azure SQL Database in the cloud and regular (on-premise) SQL Database creation?

| Azure SQL Database | On-Premise SQL Database |
| --- | --- |
| No option to choose file locations | Flexible file locations |
| No option to add new files | Option to add new (secondary) files |
| Flexible pricing plans | Free, unlimited storage with a SQL Server license |

70. What are the precautions to establish a connection to Azure SQL Server from on-premise tools?

  1. Configure the firewall
  2. Allow Azure services and resources to access the server
  3. Port 1433 must be allowed

71. What is the use of Azure SQL Database? What is the use of Dedicated SQL Pool? When should you use which?

Azure SQL Database:

  • Used to store OLTP data
  • License based on size and DTUs [Database Transaction Units]

Azure SQL Pool:

  • Used to store old / historical data (DWH workloads)
  • License based on DWUs [Data Warehouse Units]; unlimited size

72. How to identify if a given database is an Azure SQL DB or a Synapse DB?

From SSMS Connections:

  • A Synapse DB is shown with a yellow icon, whereas a SQL DB is shown with a black icon

From T-SQL Script:

select * from sys.databases

From the Azure Portal:

All Resources > SQL Server Resource > SQL Databases >

  • DTUs: Azure SQL Database
  • DWUs: Synapse Database

73. When to use which Distribution?

Hash Distribution:

  • Each row is assigned to one distribution based on a hash of the distribution column, spreading the rows across the compute nodes.
  • Used for joins and aggregations on large tables.

Round-robin:

  • Distributes rows evenly across the distributions.
  • Delivers fast performance when used as a staging table for loads.

Replicated tables:

  • A replicated table caches a full copy of the table on each compute node.
  • Provides the fastest query performance for small tables.
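The three options above can be sketched in DDL (table and column names are illustrative):

```sql
-- Large fact table: hash-distribute on a high-cardinality join key.
CREATE TABLE dbo.FactSales (SaleId BIGINT, ProductId INT, Amount MONEY)
WITH (DISTRIBUTION = HASH (ProductId));

-- Staging table: round-robin heap gives the fastest loads.
CREATE TABLE stg.Sales (SaleId BIGINT, ProductId INT, Amount MONEY)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Small dimension table: replicate a full copy to every compute node.
CREATE TABLE dbo.DimProduct (ProductId INT, ProductName NVARCHAR(100))
WITH (DISTRIBUTION = REPLICATE);
```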

74. What is the difference between DOP and DOCP?

  • Degree of Parallelism (DOP): the number of processors used to execute a single query plan in parallel
  • Degree of Copy Parallelism (DOCP): the number of connections that can read from your source or write to your sink in parallel

75. What are the limitations of Azure Synapse Database?

  • Not suitable for smaller loads
  • Not suitable for low volumes of data
  • Costlier compared to Azure SQL DB

76. How to perform Data Loads into Azure Synapse Database? Explain all possible options.

  1. Using the **COPY INTO** T-SQL statement, load data from files in Azure Storage into Synapse
  2. Using **ADF**, the PolyBase option can load data from Blob Storage, Data Lake, and Azure SQL DB
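A minimal COPY INTO sketch, assuming a managed-identity credential; the storage account, path, and table name are illustrative:

```sql
-- Load CSV files from ADLS Gen2 into a Synapse table.
COPY INTO dbo.FactSales
FROM 'https://mydatalake.dfs.core.windows.net/raw/sales/*.csv'
WITH
(
    FILE_TYPE  = 'CSV',
    FIRSTROW   = 2,    -- skip the header row
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```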

77. What are the uses of ADF?

  1. To perform ETL Operations from any source to any destination
  2. To perform Incremental Loads to DWH
  3. To perform Aggregated Data Loads to DWH

78. Where do you store the ADF Pipelines?

  1. ADF Store
  2. GitHub

79. What are the Activities in ADF?

Activities are operations we perform in ADF for our day-to-day ETL needs.

Important Activities in ADF:

  1. Copy Data Activity
  2. Data Flow Activity
  3. Stored Procedure Activity
  4. SQL Pool Stored Procedure Activity
  5. Lookup Activity, etc.

80. What is “Triggering” in ADF?

Triggering is the mechanism used to execute / invoke an ADF pipeline.

81. What is DIU?

DIU stands for **Data Integration Unit**.

This is a measure of Processor and Memory allocation for ETL Executions.

82. What are the Prerequisites for ADF Pipeline Creations?

  1. Set up linked services for both source and sink
  2. Create Datasets
  3. Design the pipeline with at least one activity (Stored Procedure, Copy Data, Data Flow, etc.)

83. What are different ways to create ADF Pipelines?

  1. ADF Studio > Home > Ingest
  2. ADF Studio > Home > Orchestrate
  3. ADF Studio > Left: Author > +New > Pipeline
  4. ADF Studio > Left: Author > +New > Copy Data
  5. ADF Studio > Left: Author > +New > Template > Specify Template Inputs > Save.

84. What is a Linked Service? How to verify or test a Linked Service?

Linked Service is an ADF Object used to connect with Data Source(s) and Sink(s).

During the creation of a linked service in ADF, we specify the authentication details and click Test Connection.

Additionally, during dataset creation, the linked service is auto-tested.

85. What are the different types of Integration Runtimes? Purpose of each?

  1. **Auto-Resolve (Azure) IR:** used to establish connections to Azure data platforms
  2. **Self-Hosted IR (SHIR):** used to establish connections to non-Azure (e.g., on-premises) data platforms

86. How is the Integration Runtime associated with pricing?

  • In ADF, IRs are free to configure.
  • However, SHIR involves on-premises connections (outside the cloud), which add latency, so more DIUs are consumed.
  • Hence such pipelines cost more.

87. Consider a real-world environment with a group of clients connected to a Data Center for OLTP data operations. You are the Data Engineer responsible for ETL operations. Where do you need to configure the Self-Hosted IR?

Answer: In a dedicated **Middle Tier (MT) server**

88. What are the limitations of SHIR?

  1. Cannot be reused
  2. Only applicable for Copy Data Activity. Not applicable for Data Flow Activity.

89. What are the uses of Copy Data Tool in ADF?

  • To perform Aggregated Data Loads
  • To perform File Incremental Loads
  • To perform Table Incremental Loads (Upserts)
  • To perform BULK INSERTS

90. What are the limitations of Copy Data Tool?

  • We can’t use transformations/activities such as Lookup, Derived Column, Surrogate Key, Sort, Stored Procedure, etc. with this tool
  • We cannot have more than one source linked service
  • We cannot have more than one destination linked service
  • We cannot reuse the Datasets

91. What is “Binary Copy” Option?

  • We used this in our file incremental loads example.
  • With this option, the dataset is transferred as-is from source to sink, skipping parsing and validation within ADF.
  • As a result, structural changes at the source are automatically carried through to the sink.

92. How to deploy a Copy Data Pipeline?

  • Deploy means to PUBLISH a pipeline.
  • We cannot publish a new pipeline manually.
  • The pipeline will be auto published / auto deployed once we complete the Copy Data Wizard.
  • Additionally, we can make changes to the resulting pipeline; those changes must be published manually.

93. What are the steps to EDIT a Pipeline resulted from Copy Data Tool?

  • From ADF Studio: Left > Author > on the right we see the pipeline
  • Click the Copy Data activity > perform the changes > Publish

94. How to perform File Level Incremental Loads in ADF?

  • Use the Binary Copy option with the Copy Data wizard, and schedule the pipeline with a cadence to get the incremental loads option.
  • Please refer to the incremental load example from our classes.

95. How to perform Table Level Incremental Loads in ADF?

  1. Option 1: Using the Copy Data Tool with the Upsert copy method
  2. Option 2: Using a watermark (control) table with a Stored Procedure activity
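Option 2 can be sketched in T-SQL as below; the table, column, and parameter names (e.g. dbo.Orders, LastModified, @OldWatermark) are illustrative, and the @ parameters would be supplied by the pipeline:

```sql
-- Control table that records the last value loaded per source table.
CREATE TABLE dbo.WatermarkTable
(
    TableName      SYSNAME   NOT NULL,
    WatermarkValue DATETIME2 NOT NULL
);

-- Lookup activity: fetch the old watermark.
SELECT WatermarkValue
FROM dbo.WatermarkTable
WHERE TableName = 'dbo.Orders';

-- Copy activity source query: only rows changed since the old watermark.
SELECT *
FROM dbo.Orders
WHERE LastModified > @OldWatermark;

-- Stored Procedure activity: advance the watermark after a successful copy.
UPDATE dbo.WatermarkTable
SET WatermarkValue = @NewWatermark
WHERE TableName = 'dbo.Orders';
```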

96. How to manage connections in ADF?

From ADF Studio: Left: Manage > Right: Linked services > Click @ Required Linked Service > Perform Edits > Save.

97. How to Dynamically Control the connections in ADF?

Using Data Factory Connection Parameters [Ex: Template @ Control Tables]

98. What are the precautions for Incremental Loads?

The business key at the source (OLTP) should be read-only. For this, you can define it as an Identity column.
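A minimal sketch of such a source table (names are illustrative; the LastModified column is assumed for watermark-based loads):

```sql
-- The business key is generated by the database and never updated,
-- so it is safe to use as the upsert/watermark key.
CREATE TABLE dbo.Orders
(
    OrderId      INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerId   INT               NOT NULL,
    LastModified DATETIME2         NOT NULL DEFAULT SYSUTCDATETIME()
);
```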

99. What are the different Tuning Options in ADF?

  • DIU & DOCP

100. How DIUs are different from DoCP?

  • DIUs: a combination of memory, CPU, and network resources allocated to a copy activity
  • DOCP: the number of parallel connections used to read from the source or write to the sink