<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Exjackly &#187; DataStage</title>
	<atom:link href="http://www.exjackly.com/archives/category/software/datastage-software/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.exjackly.com</link>
	<description>Personal Thoughts</description>
	<lastBuildDate>Thu, 11 Mar 2010 15:00:33 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Transform Question in DataStage</title>
		<link>http://www.exjackly.com/archives/2009/transform-question-in-datastage/</link>
		<comments>http://www.exjackly.com/archives/2009/transform-question-in-datastage/#comments</comments>
		<pubDate>Thu, 19 Mar 2009 19:30:26 +0000</pubDate>
		<dc:creator>Jack</dc:creator>
				<category><![CDATA[DataStage]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[example]]></category>

		<guid isPermaLink="false">http://www.exjackly.com/?p=150</guid>
		<description><![CDATA[Found this old question in the Data Direct Technologies Forum.  I think it makes a good example exercise, since it can be solved multiple ways, but is short enough to address in a single (long) post.
I have a record in this format
H,0002
L,B254,4
L,B221,19
L,B064,4
H,0001
L,B208,1
L,B186,8
H,0004
L,B178,17
L,B132,19
L,B121,17
L,B025,12
H,xxxx (xxxx represents the customer unique no)
L, yyyy.z (yyyy is the product part no, [...]]]></description>
			<content:encoded><![CDATA[<p>Found this old <a title="Support Forum: Datastage" href="http://forums.datadirect.com/ddforums/message.jspa?messageID=3812" target="_blank">question</a> in the <a title="Data Direct Technologies" href="http://www.datadirect.com/index.ssp">Data Direct Technologies</a> Forum.  I think it makes a good example exercise, since it can be solved multiple ways, but is short enough to address in a single (long) post.</p>
<blockquote><p>I have a record in this format<br />
H,0002<br />
L,B254,4<br />
L,B221,19<br />
L,B064,4<br />
H,0001<br />
L,B208,1<br />
L,B186,8<br />
H,0004<br />
L,B178,17<br />
L,B132,19<br />
L,B121,17<br />
L,B025,12</p>
<p>H,xxxx (xxxx represents the customer unique no)<br />
L, yyyy.z (yyyy is the product part no, and z is the quantity. A customer could order more than one product, as a result so many Ls below H.</p>
<p>I want to transform this, so that the target file has the customer number(No H), the quantity of each products ordered, with the part number.</p>
<p>In order to get to the output column, though, the part no, of each product is checked against a part master that contain only part numbers</p>
<p>The reject file, must have the same format as the input file.</p>
</blockquote>
<p>I have formatted the quote to make it clearer what is being sought.</p>
<h2>My Approach</h2>
<h3><span id="more-150"></span></h3>
<div id="attachment_157" class="wp-caption alignnone" style="width: 310px"><img class="size-medium wp-image-157" title="Sample Solution" src="http://www.exjackly.com/wp-content/uploads/2009/03/problem1-300x225.jpg" alt="Parallel DataStage Job showing one solution" width="300" height="225" /><p class="wp-caption-text">Parallel DataStage Job showing one solution</p></div>
<h3>SEQ_Input</h3>
<p>This is our starting point.  We read the file in as a two-column input file; the additional product line data is parsed out later in the process.  The first column, Record_Type (REC_TYPE), is used immediately in the attached transformer.</p>
<h3>XFM_Cust_No</h3>
<p>REC_TYPE tells us when the customer number changes (REC_TYPE = &#8216;H&#8217;).  We copy that customer number into a new field on each product/quantity record that follows.  The other task for this stage is to allow only the &#8216;L&#8217; records (Product Records) to continue on.  The stage is set to run as a Sequential operation, so that we can be certain the product records are tied to the correct customer record.</p>
<p>We assume the first record is a customer record.  If it is not, the customer number for any Product Records ahead of the first customer record is undefined.  For a production system, I would define a default value that could be checked later.</p>
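<p>Outside of DataStage, the carry-forward logic of this stage can be sketched in a few lines of Python.  This is only an illustration of the idea, not the transformer derivation itself; the function names and the default value "UNKNOWN" are assumptions made for the example.</p>

```python
def read_two_columns(lines):
    """Split each line at the first comma only, giving (REC_TYPE, REC_VAL)."""
    return [tuple(line.strip().split(",", 1)) for line in lines if line.strip()]


def attach_customer_numbers(rows, default_cust_no="UNKNOWN"):
    """Copy the most recent 'H' customer number onto each 'L' row."""
    # Hypothetical default, used only if the file does not start with an 'H' row.
    current_cust_no = default_cust_no
    product_rows = []
    for rec_type, rec_val in rows:
        if rec_type == "H":
            # Customer row: remember the number, pass nothing downstream.
            current_cust_no = rec_val
        elif rec_type == "L":
            # Product row: pass it on with the customer number attached.
            product_rows.append((current_cust_no, rec_val))
    return product_rows


sample = ["H,0002", "L,B254,4", "L,B221,19", "H,0001", "L,B208,1"]
tagged = attach_customer_numbers(read_two_columns(sample))
```

<p>The loop depends on seeing the rows in file order, which is the same reason the stage has to run sequentially.</p>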
<h3>CI_Prod_Qty</h3>
<p>We expand the Record_Value (REC_VAL) field into a PROD_NO and a QTY field.  This step could have been done in the previous transformer.  Using a separate stage makes it clear what is being done, and if the file format changes later, we can easily add fields.</p>
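<p>As a rough Python equivalent of this split (a sketch only; the function name and the use of a dictionary keyed by CUST_NO, PROD_NO and QTY are assumptions for illustration):</p>

```python
def split_product_value(cust_no, rec_val):
    """Split a value such as 'B254,4' into PROD_NO and QTY."""
    # Assumes exactly two comma-separated parts; anything else raises ValueError.
    prod_no, qty = (part.strip() for part in rec_val.split(","))
    return {"CUST_NO": cust_no, "PROD_NO": prod_no, "QTY": int(qty)}


record = split_product_value("0002", "B254,4")
```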
<h3>SEQ_Master_Parts_List/LKP_Master_Parts</h3>
<p>This 2-stage combination reads in the Master Parts List and checks our records against the list.  Parts that match are allowed to proceed.  Parts that do not match are rejected, reformatted, and saved as a reject file.</p>
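<p>The lookup-with-reject behavior can be pictured as a simple membership test.  The sketch below is an illustration in Python, not how the Lookup stage is implemented; the function name and the contents of the master list are made up for the example.</p>

```python
def check_against_master(records, master_parts):
    """Return (matched, rejected) based on whether PROD_NO is in the master list."""
    master = set(master_parts)  # the master list contains part numbers only
    matched = [rec for rec in records if rec["PROD_NO"] in master]
    rejected = [rec for rec in records if rec["PROD_NO"] not in master]
    return matched, rejected


orders = [
    {"CUST_NO": "0002", "PROD_NO": "B254", "QTY": 4},
    {"CUST_NO": "0002", "PROD_NO": "B221", "QTY": 19},
    {"CUST_NO": "0001", "PROD_NO": "B208", "QTY": 1},
]
# Hypothetical master parts list for the example.
matched, rejected = check_against_master(orders, ["B254", "B208"])
```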
<h3>SEQ_Output</h3>
<p>Handles all formatting (delimiters, quotes, etc.) for the desired output format.</p>
<h3>XFM_Reject</h3>
<p>Converts the Reject stream into two output streams: a Customer stream and a Product/Qty stream.  Each stream is formatted back to its original source format.  In addition, 2 fields are added: CUST_NO and SORT_ORDER.</p>
<h3>FNL_Rej</h3>
<p>This is a Sorted, Sequential Funnel.  It combines the two streams into one, ordered by the CUST_NO and SORT_ORDER fields.  Those two fields are not passed on to the next stage.</p>
<h3>RD_Rej</h3>
<p>Since the Customer record will be duplicated coming into the funnel &#8211; once for each product record &#8211; we want to remove duplicates by comparing the full output line.  As a Sequential stage, this compares each line against the next.  With the sort order from the Funnel intact, this will remove the duplicate customer records for us.</p>
<p>This could also remove product records if there are multiple records with the same customer, product and quantity.  If that is possible, the job will need to be updated to prevent it.  There are multiple ways to do that; I left them out of this solution to keep it simple.</p>
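<p>Taken together, the three reject stages (XFM_Reject, FNL_Rej and RD_Rej) behave roughly like the Python sketch below.  It is an illustration under assumptions: the function name is invented, and SORT_ORDER is assumed to be 0 for customer lines and 1 for product lines.</p>

```python
from itertools import groupby


def build_reject_lines(rejected):
    """Rebuild input-format lines (H rows and L rows) from rejected records."""
    keyed = []
    for rec in rejected:
        # Two output rows per reject: one customer line, one product line.
        # Each carries (CUST_NO, SORT_ORDER) so the funnel can order them.
        keyed.append((rec["CUST_NO"], 0, "H," + rec["CUST_NO"]))
        keyed.append((rec["CUST_NO"], 1, "L,{},{}".format(rec["PROD_NO"], rec["QTY"])))
    # Sorted funnel: order by CUST_NO, then SORT_ORDER (the sort is stable).
    keyed.sort(key=lambda item: (item[0], item[1]))
    # The two sort fields are dropped before duplicate removal.
    lines = [line for _, _, line in keyed]
    # Remove duplicates by comparing each line with its neighbor.
    return [line for line, _ in groupby(lines)]


sample_rejects = [
    {"CUST_NO": "0002", "PROD_NO": "B221", "QTY": 19},
    {"CUST_NO": "0001", "PROD_NO": "B208", "QTY": 1},
    {"CUST_NO": "0002", "PROD_NO": "B064", "QTY": 4},
]
reject_lines = build_reject_lines(sample_rejects)
```

<p>Two things show up in the sketch: sorting by CUST_NO means customers come out in customer-number order rather than input order, and two identical product lines for the same customer would collapse into one, which is the caveat described above.</p>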
<h3>SEQ_Rej</h3>
<p>Handles all formatting (delimiters, quotes, etc.) for the reject file, matching it to the input file format.</p>
<h2>Parameterization</h2>
<p>As is common, a number of parameters are defined to make it easier to handle items that regularly change.  It is possible to over-parameterize a job, though.  The items I chose to parameterize are all related to the input and output:</p>
<ul>
<li>Input Directory</li>
<li>Input File Name</li>
<li>Output Directory</li>
<li>Output File Name</li>
<li>Lookup Directory</li>
<li>Lookup File Name</li>
<li>Reject Directory</li>
<li>Reject File Name</li>
</ul>
<p>More items could be parameterized.  However, the remaining items are unlikely to change.  If they were added, they would begin to grow the parameter set and (possibly) the environment variables.  While not a large issue, it does take time to maintain and use those lists once they get large.</p>
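<p>For comparison, the same eight parameters could be grouped like this in a Python script.  This is only a sketch of the idea of keeping directories and file names separate; the class name, field names and sample paths are hypothetical and are not DataStage syntax.</p>

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class JobParameters:
    """The eight input/output parameters listed above (names are illustrative)."""
    input_dir: str
    input_file: str
    output_dir: str
    output_file: str
    lookup_dir: str
    lookup_file: str
    reject_dir: str
    reject_file: str

    def path(self, kind):
        """Join the directory and file name for 'input', 'output', 'lookup' or 'reject'."""
        directory = getattr(self, kind + "_dir")
        file_name = getattr(self, kind + "_file")
        return str(PurePosixPath(directory) / file_name)


# Hypothetical values for the example.
params = JobParameters(
    "/data/in", "orders.txt",
    "/data/out", "orders_out.txt",
    "/data/ref", "master_parts.txt",
    "/data/rej", "orders_rej.txt",
)
reject_path = params.path("reject")
```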
<h2>Other Approaches</h2>
<p>This solution is not the only one possible.  In fact, there are more &#8216;elegant&#8217; and &#8216;simpler&#8217; solutions possible.  I chose this division of work because it makes sense to me, it is very easy to see what each stage does, and modifications will be easy to incorporate if additional requirements appear or if data quality checks or additional logic need to be added.</p>
<p>With certain changes, it would be possible to solve this problem with 2 or 3 fewer stages.  It might even be doable with 4 fewer stages.  If anybody does solve it with 6 or fewer stages, I would love to see the solution.</p>
<p>If there are any questions about this solution, please leave a comment and I will respond.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.exjackly.com/archives/2009/transform-question-in-datastage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Learning DataStage</title>
		<link>http://www.exjackly.com/archives/2009/learning-datastage/</link>
		<comments>http://www.exjackly.com/archives/2009/learning-datastage/#comments</comments>
		<pubDate>Thu, 26 Feb 2009 19:18:27 +0000</pubDate>
		<dc:creator>Jack</dc:creator>
				<category><![CDATA[DataStage]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.exjackly.com/?p=146</guid>
		<description><![CDATA[Being a DataStage ETL developer, I have spent plenty of time over the past few years working with a tool that has proven to be both powerful and, at times, frustrating.  Powerful, in that it can handle a huge amount of data with a lot of flexibility, often with good performance, in a manner that [...]]]></description>
			<content:encoded><![CDATA[<p>Being a DataStage ETL developer, I have spent plenty of time over the past few years working with a tool that has proven to be both powerful and, at times, frustrating.  Powerful, in that it can handle a huge amount of data with a lot of flexibility, often with good performance, in a manner that can be maintained and expanded with a very small core of developers (even just one).  Frustrating, since that power and flexibility are very sensitive to details of the implementation, and error messages are often confusing or ambiguous.</p>
<p>When I learned DataStage, I thought the class was very easy &#8211; with the exercises very simple to get through.  The classroom exercises hid a lot of the flexibility and complexity of the tool.  Primarily, this was done by looking at each of the available stages in isolation.  Exercises were set up to build simple jobs that explored the basic functionality of the stage.</p>
<p>What the class did very little of was discuss maintenance of jobs, parameterization, partitioning, performance tuning and common sources of errors.  I&#8217;ve learned about those with experience.  Some of this experience was easily gained; some took a lot of time to work through and figure out.  There are also some things that I have not used or mastered yet.</p>
<p>The two best tricks that I know for developing with DataStage are as follows:<br />
1.  Be very meticulous &#8211; especially when building many similar jobs.  Having a checklist of items for every job being built is useful.<br />
2.  Build and test incrementally.  Adding 2 stages, compiling and testing before adding more makes development easier and troubleshooting faster.</p>
<p>If you have any tips for development, questions about DataStage, or comparisons with other ETL tools, I welcome your feedback.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.exjackly.com/archives/2009/learning-datastage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>