DataStage


Found this old question in the Data Direct Technologies Forum.  I think it makes a good example exercise, since it can be solved multiple ways, but is short enough to address in a single (long) post.

I have a record in this format
H,0002
L,B254,4
L,B221,19
L,B064,4
H,0001
L,B208,1
L,B186,8
H,0004
L,B178,17
L,B132,19
L,B121,17
L,B025,12

H,xxxx (xxxx represents the customer unique no)
L, yyyy.z (yyyy is the product part no, and z is the quantity. A customer could order more than one product, as a result so many Ls below H.

I want to transform this, so that the target file has the customer number(No H), the quantity of each products ordered, with the part number.

In order to get to the output column, though, the part no, of each product is checked against a part master that contain only part numbers

The reject file, must have the same format as the input file.

I have formatted the quote to make it clearer what is being sought.

My Approach

(more…)

Being a DataStage ETL developer, I have spent plenty of time over the past few years working with a tool that has proven to be both powerful and, at times, frustrating.  Powerful, in that it can handle a huge amount of data with a lot of flexibility, often with good performance, in a manner than can be maintained and expanded with a very small core of developers (even just one).  Frustration, since that power and flexibility is very sensitive to details of the implementation and error messages are often confusing or ambiguous.

When I learned DataStage, I thought the class was very easy – with the exercises very simple to get through.  The classroom exercises hid a lot of the flexibility and complexity of the tool.  Primarily, this was done by looking at each of the available stages in isolation.  Exercises were setup to build simple jobs that explored the basic functionality of the stage.

What the class did very little of was discuss maintenance of jobs, parameterization, partitioning, performance tuning and common sources of errors.  I’ve learned about those with experience.  Some of this experience was easily gained, others took a lot of time to work through and figure out and there are some things that I have not used or mastered yet.

The two best tricks that I know for developing with DataStage are as follows:
1.  Be very meticulous – especially when building many similar jobs.  Having a checklist of items for every job being built is useful.
2.  Build and test incrementally.  Adding 2 stages, compiling and testing before adding more makes development easier and troubleshooting faster.

If you have any tips for development, questions about DataStage, or comparisons with other ETL tools, I welcome your feedback.