Thursday, July 12, 2012

Matching/Joining Data Approaches in IBM Datastage


Joining/Matching data between different data source is considered one of the most performance challenging task while implementing an ETL using IBM DataStage, The purpose of this post is to provide guidance for choosing the suitable joining/matching approach according to the source data size

Assume we have the following two data source tables that are related to each other




In order to consider the suitable joining matching approach we might have the following cases:

   1) “The data Size of child table is >= Parent table”  And “loading Parent table could fit the memory” 

Solution approach: Use lookup stage




   2) “The data Size of child table is >= Parent table”  And “loading Parent table could not fit the memory”

Solution approach: Use Join stage



   3) “The data Size of child table is <= Parent table”  And “Child table is relatively Large compared with parent table”  E.g. child table is 50 million records and parent table is 30 million records

Solution approach: Use Join stage



  4) “The data Size of child table is <= Parent table”  And “Child table is relatively small compared with parent table” E.g. child table is 1,000 records and parent table is 30 million records


Solution approach: Use database inner join

             SELECT *
             FROM CHILD_TABLE INNER JOIN PARENT_TABLE
             ON CHILD_TABLE.COLUMN_NAME=PARENT_TABLE.COLUMN_NAME


Enjoy :)


1 comment:

  1. Hi Kimo,

    thanks for your effort and appreciate it but the above is not Matching/Joining Data Approaches in IBM Data-stage because as you know it depend on lot of factor like is data sorted or not , No. of CPU , disk space size and allocated scratch space not only data volume on tow sources.

    so the above approach can be the case in our projects but it's not IBM Data-stage Matching/Joining Data Approach.

    Below is a good link for A discussion on join,lookup and merge stages in Data-stage for how and when to use every stage when Joining data:

    http://www.dstagebox.com/2011/05/a-discussion-on-joinlookup-and-merge-stages-in-datastage/

    thanks :)

    ReplyDelete