Thu 19 Mar 2009
Transform Question in DataStage
Posted by Jack under DataStage
No Comments
Found this old question in the Data Direct Technologies Forum. I think it makes a good example exercise, since it can be solved multiple ways, but is short enough to address in a single (long) post.
I have a record in this format
H,0002
L,B254,4
L,B221,19
L,B064,4
H,0001
L,B208,1
L,B186,8
H,0004
L,B178,17
L,B132,19
L,B121,17
L,B025,12H,xxxx (xxxx represents the customer unique no)
L, yyyy.z (yyyy is the product part no, and z is the quantity. A customer could order more than one product, as a result so many Ls below H.I want to transform this, so that the target file has the customer number(No H), the quantity of each products ordered, with the part number.
In order to get to the output column, though, the part no, of each product is checked against a part master that contain only part numbers
The reject file, must have the same format as the input file.
I have formatted the quote to make it clearer what is being sought.
My Approach

Parallel DataStage Job showing one solution
SEQ_Input
This is our starting point. We read in the file as a 2 column input file, as we will parse out the additional product line data later in the process. The first column, Record_Type (REC_TYPE) will be used immediately in the attached transformer.
XFM_Cust_No
Using the REC_TYPE to identify when our customer number changes (REC_TYPE = ‘H’), we duplicate that customer number as a new field on the product/quantity records. The other task we do with this stage is to only allow the ‘L’ (Product Records) to continue on. This is set to be a Sequential operation, so that we can be certain that the product records are tied to the correct customer record.
We assume the first record is a customer record. If it is not, the customer number for the first Product Records is undefined. For a production system, I would define a default value that could be checked later.
CI_Prod_Qty
We expand the Record_Value (REC_VAL) field into a PROD_NO and a QTY field. This step could have been done with the previous transformer. Using this stage helps make it clear what is being done, and if the file format changes later, we can add additional fields easily.
SEQ_Master_Parts_List/LKP_Master_Parts
This 2-stage combination reads in the Master Parts List and checks our records against the list. Parts that match are allowed to proceed. Parts that do not are rejected and sent to be reformatted and saved off as a reject file.
SEQ_Output
Handles all formatting (delimiters, quotes, etc.) for the desired output format.
XFM_Reject
Converts the Reject stream into two output streams: A Customer and a Product/Qty stream. Each of these streams are formatted back to their original, source format. In addition, 2 fields are added: CUST_NO and SORT_ORDER.
FNL_Rej
This is a Sorted, Sequential Funnel. It combined the two streams into one, ordered by the CUST_NO and SORT_ORDER fields. Those two fields are not passed on to the next stage.
RD_Rej
Since the Customer record will be duplicated coming in to the funnel – once for each product record – we want to remove duplicates looking at the full output line. As a Sequential stage, this compares each line against the next. With the sort order from the Funnel intact, this will remove the duplicate customer records for us.
This could have an impact on the product records if there are multiple records with the same customer, product and quantity. If that is possible, it will be necessary to update this job to prevent that from happening. There are multiple ways to do that. Left out of this solution to help keep it simpler.
SEQ_Rej
Handles all formatting (delimiters, quotes, etc.) for the reject file, matching it to the input file format.
Parameterization
As is common, many parameters are defined to make it easier to handle items that regularly change. It is possible to over-parameterize a job. The items I chose to parameterize are all related to the input and output:
- Input Directory
- Input File Name
- Output Directory
- Output File Name
- Lookup Directory
- Lookup File Name
- Reject Directory
- Reject File Name
More items could be parameterized. However, the remaining items are unlikely to change. If they were added, they would begin to grow the parameter set and (possibly) environment variables. While not a large issue, it does take time to maintain and use them when those lists begin to get large.
Other Approaches
This solution is not the only one possible. In fact, there are more ‘elegant’ and ’simpler’ solutions possible. I chose this division of work because it makes sense to me, it is very easy to see what each stage does, and modifications will be easy to incorporate should additional requirements appear or if there is a need to incorporate data quality checks or additional logic.
With certain changes, it would be possible to solve this problem with 2 or 3 fewer stages. It might even be doable with 4 fewer stages. If anybody does solve it with 6 or fewer stages, I would love to see the solution.
If there are any questions about this solution, please, leave a comment and I will respond.
If you enjoyed this post, make sure you subscribe to my RSS feed!
