Saturday, May 18, 2019

Down the ETL / ELT Lane

This post is inspired by ETL / ELT coaching that I am doing for one of my colleagues. I figured if I am going to do this for her, why not others benefit from this too! 

ETL / ELT is an abbreviation for Extract, Transform, Load / Extract, Load, Transform. This is a procedure which lies at the core of many important areas like Data Warehousing, Analytics, Sales Performance Management etc. This concept has been there for years now.

I am not going to explain the concepts of ETL / ELT, because those will be found by a simple google search. I might provide with few one liners here and there but mainly what I intend to do here is to take you through the ETL / ELT journey using examples.

Few one liners to begin with

Extract - Extracting data from source - Source can be a enterprise system (Transaction system, Warehouse etc) or Source can simply be a flat file

Transform - Transforming the data into intended format by using various operations like aggregate, merge fields, merge records, rank, calculate etc.

Load - Loading transformed data in the target system


As part of "Transform", you can use various operations like - 
  • Calculate - To derive new values from existing using arithmetic, string, logic, date etc functions
  • Merge field - Merging columns data from different sources
  • Merge Records - Merging rows data from different sources
  • Aggregate - Finding sum, min, max, last, average from values in a column
  • Rank - Ranking values in a column
  • Roll-up using hierarchy - Roll-up figures to parent entity eg. Sales Rep to Manager

This is not an exhaustive list of operations, but most commonly used and should be enough for beginners rather than overwhelming ones self with lot of information.


Also, please note that this post is not going to be a tool specific exercise but an attempt to get to understand the operations with simple examples. Do remember there are multiple right ways to achieve the same output!

We will start with simple example and build on the example as we progress through our journey.


Let's take an example of a telecom company - say ABC Telecom. 

We have sample transaction data of this company.

We also have necessary master data - Employee, Role, Region, Product, Hierarchy.

In this file - Here


Based on the transaction data, here is our 1st Problem Statement / RequirementLet's say we want to find out and create a report for - Who sold the highest qty everyday?

Following would be our steps to achieve that.


1) Merge transaction data with employee data

This merge will be on the basis of common fields/columns in both the data sets. Also, this will be the left merge with transaction data being on the left input/initial input. We will only take relevant fields forward, dropping unnecessary fields for this requirement.

Merge condition will be - 
Transaction Data.Employee ID = Employees.Employee ID

Output - Date, Employee ID, Employee Name, Product, Qty, Amt

   
2) Aggregate data to find out total qty sold     

2nd step will use output of 1st data as input and use aggregate operation wherein data will be aggregated by Date and Employee ID, Employee Name and Calculation will be - Sum of Qty. We will store Sum of Qty in another field called Tot_Qty.

Output - Date, Employee ID, Employee Name, Tot_Qty

Note that we have dropped Product and Amt columns as well in the output of this step.


3) Rank to find out highest qty sold

In this step, we will use output from 2nd step and rank the Tot_Qty in Descending order. In doing so, we will also group the data by Date as we want highest quantity seller for everyday.
  
Output - Date, Employee ID, Employee Name, Tot_Qty, Rank


4) Post/Insert this data to a table

From the output of step 3, we will filter all the data with Rank=1.

Output - Date, Employee ID, Employee Name, Tot_Qty


In 4 simple steps you would get the required report!



2nd Problem Statement/Requirement - Who brought in least revenue?

Explanation here will be similar to above mentioned requirement.

1) Merge transaction data with Employee master

Merge condition will be - 
Transaction Data.Employee ID = Employees.Employee ID

Output - Date, Employee ID, Employee Name, Product, Qty, Amt

   
2) Aggregate data to find out total revenue

Aggregate by - Employee ID, Employee Name, Calculation - Sum of Amt to be stored in Tot_Amt. If you noticed, we have dropped Date column as well, as we want Employee with least revenue from entire data set rather than everyday.

Output - Employee ID, Employee Name, Tot_Amt


3) Rank to find out least revenue     

Rank by - Tot_Amt in ascending order as least revenue is required     

Output - Employee ID, Employee Name, Tot_Amt, Rank


4) Post this to a table     


3rd Problem Statement/Requirement - Which region has revenue > $10000?

1) Merge transaction data with Employees-Region table

On a closer look on the data set, you will realize that Employees-Region has only Region ID, hence we will also have to merge the Employees-Region table with Region master. By using this 2nd merge, we can get required region name from Region master.

The conditions will be - 
1st - Transaction Data.Employee ID = Employees-Region.Employee ID
2nd - Employees-Region.Region ID = Regions.Region ID

Output - Date, Employee ID, Employee Name, Region ID, Region Name, Product, Qty, Amt

   
2) Aggregate data to find out total Amt

Aggregate by - Region ID, Region Name, Calculation - Sum of Amt to be stored in Tot_Amt. We have dropped all the other columns, as we want Region and revenue.

Output - Region ID, Region Name, Tot_Amt


3) Filter out the Regions with Tot_Amt < 10000.

Calculate operation can be used for this purpose.

Output - Employee ID, Employee Name, Tot_Amt, Rank


4) Post this to a table     
-----------------------------------------------------------------------------------------


There are few more Problem Statements / Requirements which you might want to try for practice. I will provide hints for these, however for solutions keep watching this space. Another blog coming soon.

  • Best selling product - Based on Highest Qty sold?
    • Can be achieved in similar manner as 3rd requirement

  • Who was highest quantity seller for most number of days?
    • Use output of 1st requirement and build up on that using aggregate operation

  • Who got the least revenue on most number of days?
    • Use output of 2nd requirement and build up on that using aggregate operation

  • Which product stood best selling product most days?
    • Use output of 1st unsolved requirement and build up on that using aggregate operation

  • Which DSM has highest revenue?
    • Use Employee hierarchy and Roll-up operation

  • Which DSM has highest qty?
    • Use Employee hierarchy and Roll-up operation

Let me know in comments section if this helps or if you have questions.

Visualizing end product is not always possible, that does not mean you shouldn't start..!!

So happens many times with individual developers / freelancers and sometimes even with companies / clients / businesses that -

  •          I have an idea in mind and it would be very useful if I implement it / create a portal / an app / a product about it, but I don’t know what all it should contain.
  •          I/We have been thinking about this idea for quite a long time, however I/We cannot finalize what all functionalities I/We should include so that it becomes a complete solution.
  •          We have had many meetings to brainstorm on that but we couldn’t come to conclusion about how we should go about that until we have all the pieces to put together.
  •          Etc etc.
In my opinion, even if you have broad level idea of what problem your app / portal / product should solve and some basic functionalities, you should just draft them and start working on it instead of going on planning / brainstorming about it. Till the time you complete what you have drafted, new ideas about the functionalities / requirements do pop-up in mind. And even if you are not able to think of what more to add, you can just release this version to your internal team / colleagues / friends and take feedback from them, that surely does help you think more and to better your product.

Better yet, if the users are going to use that no matter what, have your own plan / pipeline of requirements to implement however simultaneously release the basic version of app / portal / product and incorporate end users’ feedbacks as well in the subsequent releases. This way you get 2 benefits – you get actual feedback from end users and incorporating that ensures usability, and creation of loyal user base.

Sure this approach as well does not guarantee a success, but it is worth taking chances than to keep thinking/discussing and never start.


Journey of thousand miles starts with a single step..!!

Business Intelligence is not a solution to your business problems!!

Damn! I should have chosen some other heading for the post or posted it anonymously. :P

Some BI enthusiasts and some die hard BI fans must be wanting to kill me over this. However, please know, I am one too, and before you make up your mind, allow me to explain first.

Let’s take an example of an automobile company ARM Motors Ltd.

Assume that the profits of ARM are continuously dropping since last year. Now, the management team is worried and gathered to find out the issues. They have some data here and there, on the basis of which they are trying to make some sense, to find out probable causes, and remedies. It’s become a real mess though as they have been using different legacy systems, complete data is not available, whatever is available is not available at one single location. Basically there is no single version of truth.

Now, CIO/CTO proposes we have a BI system and BI would solve our problems.
They discuss, finalize, and implement a BI solution. And after some 6 months or a year, they observe the bleeding is stopped. Profits have started increasing a little.

You would say, I am contradicting with the heading, BI did solve the business problem.

WAIT. What happened in fact is – BI enabled the decision makers to identify the actual problems and to take the corrective action.

When the BI solution was implemented, the relevant stakeholders were made available with the data they needed, in the format required, and at the time it was necessary. Due to this, they could find out what was the root cause of drop in profits – let’s say it was drop in sales that too only in a couple of big territories out of 8-10 they operate in, and probably because of the campaigns the competitors were carrying out in the region.

Now that they had identified the problem, all they needed is to take a corrective action only in case of those 2 territories. (This corrective action is the solution to the problem!!)

In absence of BI, probably it wouldn't be possible to identify the root cause or even if it is found out, it would probably be difficult to identify the bleeding points and they might end up taking corrective actions in all territories which probably would cause another problems.

BI just enables to find out right information, to the right person, at the right time, at the right place to help and improve the informed decision making in order to solve the issues business might be facing.

So, the argument I am trying to make is Business Intelligence is not a solution to your business problems, it is a tool or a step to identify and solve those.


What would help if you are hungry – only making a pizza or making and eating a pizza??

OLTP & OLAP in layman terms

I am not going to get into the technical details of OLTP or OLAP but just try to explain the concept in the words layman would understand.

Now, continuing with the example of Mz. X buying grocery from Big Bazaar which was mentioned in Business Intelligence 101, when receipt was printed for Mz. X, well, that wasn’t the only the task happened there, if we were to list down the tasks –
  1. Invoice or receipt number was generated and stored in the database.
  2. The items which Mz. X purchased, were inserted in the database against the invoice number.
  3. If she had a loyalty card, points were automatically credited against the card number.
  4. Probably, the stock for those 15 items was automatically reduced from the inventory.
  5. If the stock became below the desired level, it was probably auto-ordered to the vendor or from the warehouse.
  6. If so, there was an order number that would be generated automatically.
  7. The ordered items would be tagged against the order number.
  8. Etc etc.

And this might not be the exhaustive list of tasks those were triggered automatically at the back end, there might be number of other tasks as well.

These tasks are called transactions and system that is used for processing these is called OLTP or OnLine Transaction Processing.

Let’s assume that only the first 4 tasks from the above listed are common for all the customers those buy from Big Bazaar. Hence, the number of transactions –  

Invoice or receipt number was generated and stored in the database
1
Total = 18 Transactions
The items which Mz. X purchased, were inserted in the database against the invoice number
1
If she had a loyalty card, points were automatically credited against the card number
1
(With our assumption of 10 items per customer, the number of transactions become 13)
Probably, the stock for those 15 items was automatically reduced from the inventory
15

Continuing with assumptions from Business Intelligence101, there would be 13 transactions for each customer and total would be 500*13 = 6500 transactions daily for one store.

Thus at regional level 6500*10 = 65000 daily.
At national level, assuming just 4 regions East, West, North, South, 65000*4 = 2,60,000 daily.
And monthly: 2,60,000*30 = 78,00,000 transactions..!!

Now, that’s quite a number for a transactional system for being able to carry out analysis on!!

That’s why there is OLAP or OnLine Analytical Processing.

The data from transactional system (OLTP) is loaded into the data warehouse through ETL process, OLAP or OnLine Analytical Processing is used to analyze the data stored in the data warehouse. The example of the analysis carried out would be the product wise sales analysis, product wise revenue analysis, brand wise sales analysis, brand wise revenue analysis so on and so forth.

This completes an attempt to explain OLTP vs OLAP in layman terms.

Open for discussion in comments section..!!

Business Intelligence 101

Today if you google for definition of business intelligence, you will find more than 2,50,00,000 results.

Wikipedia provides the simplest definition. Apt and straight. It says, “Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes.”

To further simplify it, let me explain it with an example.

Let’s say Mz. X visits Big Bazaar for grocery shopping. She buys 15 different products from there. Now, when she is at billing counter, all the items are listed in a receipt. She pays the bill, collects the receipt, takes goods that she has bought and leaves.

Let’s assume 500 customers visit Big Bazaar daily and average number of products that a customer buys is 10. Hence, it comes to around 500 receipts every day and 5000 line items daily. Hence, in a month 1,50,000 line items.

Now, if a manager at Big Bazaar has to analyze the product wise sales for a month so as to be able to forecast and decide what order to place for the next month, he has to process 1,50,000 line items at a store level. One might argue that this can be done in an excel sheet. But, what if this analysis is to be done at regional level, where say 10 Big Bazaar outlets are there in the region, the number of line items become 15,00,000. Whhhoooooaaaaaa!

This is where Business Intelligence comes into picture..!!

Going back to the definition to relate it to our example – raw data would be the 15,00,000 line items to analyze and meaningful and useful information would be the product wise sales analysis, product wise revenue analysis, brand wise sales analysis, brand wise revenue analysis so on and so forth.


What business intelligence would help you do here is to represent data in tabular format or graphical format or many other formats in order to analyze and make decisions based on fact rather than intuition. And that’s what one of the biggest utility of Business Intelligence is..!!