Thu 26 Feb 2009
Learning DataStage
Posted by Jack under DataStage, Software
As a DataStage ETL developer, I have spent plenty of time over the past few years working with a tool that has proven to be both powerful and, at times, frustrating. Powerful, in that it can handle a huge amount of data with a lot of flexibility, often with good performance, in a manner that can be maintained and expanded by a very small core of developers (even just one). Frustrating, in that this power and flexibility are very sensitive to the details of the implementation, and the error messages are often confusing or ambiguous.
When I learned DataStage, I thought the class was very easy, with exercises that were simple to get through. The classroom exercises hid a lot of the flexibility and complexity of the tool, mainly because they looked at each of the available stages in isolation. Exercises were set up to build simple jobs that explored the basic functionality of a single stage.
What the class did very little of was discuss maintenance of jobs, parameterization, partitioning, performance tuning and common sources of errors. I’ve learned about those through experience. Some of that experience was easily gained, some took a lot of time to work through and figure out, and there are some things that I have not used or mastered yet.
The two best tricks that I know for developing with DataStage are as follows:
1. Be very meticulous – especially when building many similar jobs. Having a checklist of items for every job being built is useful.
2. Build and test incrementally. Adding two stages, then compiling and testing before adding more, makes development easier and troubleshooting faster.
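The checklist idea from the first tip can be kept outside the tool, in a small script. What follows is only a minimal sketch in Python, not anything DataStage provides: the checklist items, the `JobChecklist` class and the job name `load_customers` are all hypothetical and would be replaced with whatever a given project actually needs to verify for each job.

```python
from dataclasses import dataclass, field

# Hypothetical checklist items, chosen for illustration only;
# they are not taken from DataStage or from any standard.
CHECKLIST = (
    "parameters_defined",
    "partitioning_reviewed",
    "compiled",
    "tested_with_sample_data",
)


@dataclass
class JobChecklist:
    """Tracks which checklist items have been completed for one job."""

    job_name: str
    done: set = field(default_factory=set)

    def mark(self, item: str) -> None:
        # Reject typos so a misspelled item is never silently counted.
        if item not in CHECKLIST:
            raise ValueError(f"unknown checklist item: {item}")
        self.done.add(item)

    def missing(self) -> list:
        # Items are returned in checklist order.
        return [item for item in CHECKLIST if item not in self.done]

    def is_complete(self) -> bool:
        return not self.missing()


# Usage: record progress on one (hypothetical) job and list what is left.
job = JobChecklist("load_customers")
job.mark("parameters_defined")
job.mark("compiled")
print(job.job_name, "still needs:", job.missing())
```

Keeping one such record per job makes it easier to see at a glance which of many similar jobs skipped a step.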
If you have any tips for development, questions about DataStage, or comparisons with other ETL tools, I welcome your feedback.
