Joining/Matching data between different data source is
considered one of the most performance challenging task while implementing an ETL
using IBM DataStage, The purpose of this post is to provide guidance for
choosing the suitable joining/matching approach according to the source data
size
Assume we have the following two data source tables that are
related to each other
In order to consider the suitable joining matching approach we might have the following cases:
1) “The data Size of child table is >= Parent table” And “loading Parent table could fit the memory”
Solution approach: Use lookup stage
2) “The data Size of child table is >= Parent table” And “loading Parent table could not fit the memory”
Solution approach: Use Join stage
3) “The data Size of child table is <= Parent table” And “Child table is relatively Large compared with parent table” E.g. child table is 50 million records and parent table is 30 million records
Solution approach: Use Join stage
4) “The data Size of child table is <= Parent table” And “Child table is relatively small compared with parent table” E.g. child table is 1,000 records and parent table is 30 million records
Solution approach: Use database inner join
SELECT *
FROM CHILD_TABLE INNER JOIN PARENT_TABLE
ON CHILD_TABLE.COLUMN_NAME=PARENT_TABLE.COLUMN_NAME
ON CHILD_TABLE.COLUMN_NAME=PARENT_TABLE.COLUMN_NAME
Enjoy :)