Saturday, May 18, 2019

Down the ETL / ELT Lane

This post is inspired by ETL / ELT coaching that I am doing for one of my colleagues. I figured if I am going to do this for her, why not others benefit from this too! 

ETL / ELT is an abbreviation for Extract, Transform, Load / Extract, Load, Transform. This is a procedure which lies at the core of many important areas like Data Warehousing, Analytics, Sales Performance Management etc. This concept has been there for years now.

I am not going to explain the concepts of ETL / ELT, because those will be found by a simple google search. I might provide with few one liners here and there but mainly what I intend to do here is to take you through the ETL / ELT journey using examples.

Few one liners to begin with

Extract - Extracting data from source - Source can be a enterprise system (Transaction system, Warehouse etc) or Source can simply be a flat file

Transform - Transforming the data into intended format by using various operations like aggregate, merge fields, merge records, rank, calculate etc.

Load - Loading transformed data in the target system


As part of "Transform", you can use various operations like - 
  • Calculate - To derive new values from existing using arithmetic, string, logic, date etc functions
  • Merge field - Merging columns data from different sources
  • Merge Records - Merging rows data from different sources
  • Aggregate - Finding sum, min, max, last, average from values in a column
  • Rank - Ranking values in a column
  • Roll-up using hierarchy - Roll-up figures to parent entity eg. Sales Rep to Manager

This is not an exhaustive list of operations, but most commonly used and should be enough for beginners rather than overwhelming ones self with lot of information.


Also, please note that this post is not going to be a tool specific exercise but an attempt to get to understand the operations with simple examples. Do remember there are multiple right ways to achieve the same output!

We will start with simple example and build on the example as we progress through our journey.


Let's take an example of a telecom company - say ABC Telecom. 

We have sample transaction data of this company.

We also have necessary master data - Employee, Role, Region, Product, Hierarchy.

In this file - Here


Based on the transaction data, here is our 1st Problem Statement / RequirementLet's say we want to find out and create a report for - Who sold the highest qty everyday?

Following would be our steps to achieve that.


1) Merge transaction data with employee data

This merge will be on the basis of common fields/columns in both the data sets. Also, this will be the left merge with transaction data being on the left input/initial input. We will only take relevant fields forward, dropping unnecessary fields for this requirement.

Merge condition will be - 
Transaction Data.Employee ID = Employees.Employee ID

Output - Date, Employee ID, Employee Name, Product, Qty, Amt

   
2) Aggregate data to find out total qty sold     

2nd step will use output of 1st data as input and use aggregate operation wherein data will be aggregated by Date and Employee ID, Employee Name and Calculation will be - Sum of Qty. We will store Sum of Qty in another field called Tot_Qty.

Output - Date, Employee ID, Employee Name, Tot_Qty

Note that we have dropped Product and Amt columns as well in the output of this step.


3) Rank to find out highest qty sold

In this step, we will use output from 2nd step and rank the Tot_Qty in Descending order. In doing so, we will also group the data by Date as we want highest quantity seller for everyday.
  
Output - Date, Employee ID, Employee Name, Tot_Qty, Rank


4) Post/Insert this data to a table

From the output of step 3, we will filter all the data with Rank=1.

Output - Date, Employee ID, Employee Name, Tot_Qty


In 4 simple steps you would get the required report!



2nd Problem Statement/Requirement - Who brought in least revenue?

Explanation here will be similar to above mentioned requirement.

1) Merge transaction data with Employee master

Merge condition will be - 
Transaction Data.Employee ID = Employees.Employee ID

Output - Date, Employee ID, Employee Name, Product, Qty, Amt

   
2) Aggregate data to find out total revenue

Aggregate by - Employee ID, Employee Name, Calculation - Sum of Amt to be stored in Tot_Amt. If you noticed, we have dropped Date column as well, as we want Employee with least revenue from entire data set rather than everyday.

Output - Employee ID, Employee Name, Tot_Amt


3) Rank to find out least revenue     

Rank by - Tot_Amt in ascending order as least revenue is required     

Output - Employee ID, Employee Name, Tot_Amt, Rank


4) Post this to a table     


3rd Problem Statement/Requirement - Which region has revenue > $10000?

1) Merge transaction data with Employees-Region table

On a closer look on the data set, you will realize that Employees-Region has only Region ID, hence we will also have to merge the Employees-Region table with Region master. By using this 2nd merge, we can get required region name from Region master.

The conditions will be - 
1st - Transaction Data.Employee ID = Employees-Region.Employee ID
2nd - Employees-Region.Region ID = Regions.Region ID

Output - Date, Employee ID, Employee Name, Region ID, Region Name, Product, Qty, Amt

   
2) Aggregate data to find out total Amt

Aggregate by - Region ID, Region Name, Calculation - Sum of Amt to be stored in Tot_Amt. We have dropped all the other columns, as we want Region and revenue.

Output - Region ID, Region Name, Tot_Amt


3) Filter out the Regions with Tot_Amt < 10000.

Calculate operation can be used for this purpose.

Output - Employee ID, Employee Name, Tot_Amt, Rank


4) Post this to a table     
-----------------------------------------------------------------------------------------


There are few more Problem Statements / Requirements which you might want to try for practice. I will provide hints for these, however for solutions keep watching this space. Another blog coming soon.

  • Best selling product - Based on Highest Qty sold?
    • Can be achieved in similar manner as 3rd requirement

  • Who was highest quantity seller for most number of days?
    • Use output of 1st requirement and build up on that using aggregate operation

  • Who got the least revenue on most number of days?
    • Use output of 2nd requirement and build up on that using aggregate operation

  • Which product stood best selling product most days?
    • Use output of 1st unsolved requirement and build up on that using aggregate operation

  • Which DSM has highest revenue?
    • Use Employee hierarchy and Roll-up operation

  • Which DSM has highest qty?
    • Use Employee hierarchy and Roll-up operation

Let me know in comments section if this helps or if you have questions.

No comments:

Post a Comment