Thursday, September 5, 2013

Tech Note: Extract Parameter Values From a Database And Use Them In Subsequent Job Activities In DataStage

A common requirement when building DataStage jobs is to read some configuration values and use them in subsequent jobs. One of the best ways to do this is to use a parameter set and have a job write the configuration values to a parameter value file. Below we will review how to build such a job and how to use the generated parameter value files.
Parameter sets in DataStage are used to store a collection of parameters in a named object. The values for these parameters can be stored in a value file on the file system. A step-by-step guide on how to work with parameter sets in DataStage can be found on IBM's site here. Because the values for a parameter set are picked up at run time, we can recreate these value files just before the start of the job where they are used. To do so we need to know the value file format:

Parameter1Name=Parameter1Value
Parameter2Name=Parameter2Value
...
ParameterNName=ParameterNValue

The above can easily be achieved within a single job with a Transformer stage and a Sequential File stage. If you prefer SQL, you can skip the Transformer stage and do the formatting right in the source database stage. Shown below are the DataStage job that produces the value file, the Transformer derivation used and the stage that outputs the file:
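In essence, the Transformer output derivation concatenates the parameter name, an equals sign and the parameter value into a single output column, which the Sequential File stage then writes line by line. As a minimal sketch (the link and column names below are placeholders, not necessarily the ones used in our job):

  Lnk_Config.PARAM_NAME : "=" : Lnk_Config.PARAM_VALUE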

Now comes the question - where to store the value file. The directory of each DataStage project has a subdirectory "ParameterSets" where each parameter set has its own subdirectory. There you will find all value files for a parameter set. You can provide this info to the job that generates the value file as shown below (please note the use of the @PATH system variable). "MyParamsSet" is the name of the parameter set you want to use and "valueFileDB" is the name of the value file where you will put the values loaded from the database. The more experienced DataStage developer of course knows that you can leave out the @PATH, since the default path is the project directory anyway.
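With the names used here, the generated file ends up at a location like the following (the absolute path is only an example, assuming the default project location of an installation under /opt/IBM/InformationServer, and "MyProject" is a placeholder project name):

  <project directory>/ParameterSets/MyParamsSet/valueFileDB
  e.g. /opt/IBM/InformationServer/Server/Projects/MyProject/ParameterSets/MyParamsSet/valueFileDB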


Now we are ready to use the parameter set, but we need to make sure that the job that uses it is started after the job that generates the value file - this is what the sequence looks like:
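As a plain-text sketch of the sequence (activity names are placeholders): a Job Activity runs the job that writes the value file, and only then a second Job Activity runs the consuming job with its parameter set pointed at that value file.

  Write_Value_File (Job Activity)  ->  Run_Main_Job (Job Activity, MyParamsSet = valueFileDB)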


Of course you could achieve the same with a user status as discussed in our previous post. But as you can see, with parameter set value files you leave parsing the parameter names and values to DataStage. No complex Value Expressions in your Job Activities, but simply the predefined value file. It's cleaner, faster to implement and easier to maintain!

Tech Note: Why Is Using The UserStatus In DataStage Jobs Not Always A Good Idea?

In project interviews or reviews we are regularly asked how a user status can be used in DataStage to build workflows and DataStage sequences. For example, a recent interview question was: "How can you set the job user status in a parallel job?". Of course there is a technical way to do this. However, in most cases you do not need to deal with the user status at all, and there are other, better ways to achieve the same result with less effort. So let us explain and recap:

Job Status

Any DataStage job* that has been compiled once (or has been imported with binaries)** has a job status which can be queried with the DataStage Director in the Status View, via the dsjob command line (dsjob -jobinfo <project> <job>) or via DataStage BASIC (e.g. DSGetJobInfo(<jobhandle>, DSJ.JOBSTATUS)). Unless you start the job the status is NOT RUNNING. When a job is started normally its status changes to RUNNING (the other option would be a validation run). After the execution the job status is either

  • RUN OK (job finished without warnings) resp. VAL OK (when started in validation mode), 
  • RUN WARN (finished with warnings) resp. VAL WARN, 
  • RUN FAILED (aborted with a fatal error) resp. VAL FAILED or 
  • CRASHED (aborted because of an undetermined cause, e.g. because the DataStage engine died).
A failed, stopped or crashed job needs to be reset before it can be rerun, thus we have another status RESET which indicates that the job finished a reset run. Only restartable sequences can be restarted without a reset; they are displayed in the Director as "Aborted/Restartable", which is not a separate status but rather the combination of the facts that the sequence is in status RUN FAILED and that it is restartable - if you query the job status with dsjob, it will also tell you in a separate field whether the job is restartable.
A running job can be stopped and yields a status of STOPPED for server jobs and usually RUN FAILED for parallel jobs (stopping a parallel job is like forcing the job to abort). There is also an interim job status, which is the job status before the after-job subroutine is executed and after the job stages and (for sequences / control jobs) all controlled jobs have finished processing, but I would not rely on that too much.
DataStage 9.1 adds workload management (WLM). Now a job can also be in the state QUEUED (4) before it starts to run. If it is queued too long and WLM_QUEUE_WAIT_TIMEOUT is reached, the job is STOPPED. As usual the DataStage documentation is rather meager about the new states - we are not sure if they are mentioned at all.
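For illustration, a minimal DataStage BASIC sketch of querying the job status from job control code (the job name and log caller name are placeholders; depending on where the code runs you may need to include JOBCONTROL.H to get the DSJ./DSJS. constants):

  hJob = DSAttachJob("MyJob", DSJ.ERRFATAL)   ;* attach to the compiled job
  Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)  ;* numeric code, compare with the DSJS.* constants
  If Status = DSJS.RUNOK Or Status = DSJS.RUNWARN Then
     Call DSLogInfo("Job finished without fatal errors", "JobControl")
  End
  ErrCode = DSDetachJob(hJob)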

User Status

Next to the job status every compiled job - even a parallel job - can have a user status, which is just a character string that is saved as part of the job runtime configuration (in the status file of the job in the DataStage engine repository) and which can also be queried with dsjob or in DataStage BASIC with DSGetJobInfo(<jobhandle>, DSJ.USERSTATUS). Although the user status string itself does not have a length limitation, I would use it only for up to about 3 kilobytes (since querying longer status messages is not really supported from dsjob; still, if you want you can put megabytes into the user status field, but don't try to open the job log afterwards in the Director or Designer...).
As the name says the status must be set by the user, i.e. by the developer, and requires a bit of programming: you must call the function DSSetUserStatus. You can call it from external job control code (if you use your own custom job control rather than job sequences) or you can wrap the function
  • into a DataStage before/after routine to call it as a before/after subroutine for the job or as a before/after subroutine of a DataStage BASIC Transformer stage, or
  • into a server transform function to call it within any expression field of the DataStage BASIC Transformer stage.
To set the user status within the job you need the BASIC Transformer stage, originally a server stage that can also be used in parallel jobs (for this purpose only!). The BASIC Transformer is not exposed in the palette for parallel jobs, but you can still find the stage in the repository.
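A minimal sketch of such a wrapper, written as a server transform function with a single argument Arg1 (only the routine body is shown; the routine name you choose, e.g. SetUserStatus, is a placeholder):

  * set the user status to the value passed in and return it unchanged
  Call DSSetUserStatus(Arg1)
  Ans = Arg1

Calling the function in any derivation of the BASIC Transformer, e.g. SetUserStatus(InLink.ErrorCount), then sets the user status as a side effect of processing a row (link and column names are again placeholders).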
The user status is retained until the job is recompiled, the job status is cleared (via the Director or the DataStage Engine shell) or a new user status is set.

What are the statuses good for?

The job status is of course used to monitor and operate jobs and to build workflows, DataStage sequences or batches. If you have a sequence of dependent jobs and one of the jobs aborts, usually you do not want to run the subsequent jobs, but only those jobs which are required to clean up the left-overs from the aborted job, so the failed job can be safely restarted. The clean-up I am talking about does not mean resetting the job, but e.g. removing partially written files or database changes which might impact the rerun. You can easily achieve this by setting the triggers in the Job Activity stages of a DataStage sequence according to the job status - one trigger starts the subsequent job activities in case the job finished OK (RUNOK/RUNWARN) and another trigger fires in case of unexpected job termination (usually Otherwise, or by explicitly checking for RUNFAILED).
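A custom trigger expression for the "finished OK" branch could look like the following, with Load_Job being a placeholder name for the job activity:

  Load_Job.$JobStatus = DSJS.RUNOK Or Load_Job.$JobStatus = DSJS.RUNWARN

The clean-up branch can then be driven by an Otherwise trigger or by a similar expression checking for DSJS.RUNFAILED.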
If all your jobs are restartable by themselves (which I highly recommend for the sake of simplicity), you can safely use the "Automatically handle activities that fail" option of DataStage sequences and leave the triggers for the subsequent jobs unconditional. With this option the DataStage sequence aborts itself if one job fails, unless you have defined an exception handler for the sequence. Usually you want to raise aborts to the top-level sequence rather than handle them in each low-level sequence, to be able to leverage the restart feature of sequences (see "Add checkpoints so sequence is restartable on failure"). When you restart a restartable sequence it will continue from the point of failure and also execute all activities which are checked as "Do not checkpoint run".

The user status does not replace the job status and cannot safely be used to implement proper failure or restart handling as we can with the job status: you can never be sure that the user status was correctly set before a job aborted or crashed, and you never know whether the job will finish as expected after you have set the user status. 
Still, the user status can be used to amend the job status so you can distinguish different states in case your job finished OK.
One example is a staging job that does some quality checks on the source data. The number of failed checks can be set as the user status, which in turn can be queried by a calling sequence, similar to querying the job status: the user status, like the job status, is exposed as a job activity variable to subsequent activities of the sequence. Thus you could e.g. send an email if the number of failed checks was greater than 0. Another common use case for the user status is to check if a database table is empty or to check in a database table whether some downstream processing can be skipped.
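For the email example, a custom trigger on the path to a Notification Activity could simply check the exposed activity variable (Check_Staging is a placeholder name for the job activity running the staging job):

  Check_Staging.$UserStatus > 0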
Frequently the user status is also misused to extract parameter values from a database in order to use them in subsequent job activities. For this it is much simpler to use a parameter set and let a job write the parameter value file, which can then be used by the subsequent job activities in a sequence.

And the short answer is...

...not to use the user status as a replacement for the job status. It can be used for additional branching logic in sequences, but not to hand over parameter values from a database table to subsequent job activities.

*Except for mainframe jobs, which we are not covering in this article at all.
**Uncompiled DataStage jobs do not have a status, even though they are displayed as "Not compiled" in the Director; this is not a status, but rather the Director's interpretation of the fact that there is no binary attached to the job. Trying to query the job status of an uncompiled job with dsjob fails with an error that you cannot attach to the job (DSAttachJob works only for compiled jobs).

Tech Note: How to Grid-Enable An Existing DataStage Installation

Recently we tried to follow the official IBM documentation about grid deployment: http://pic.dhe.ibm.com/infocenter/iisinfsv/v9r1/index.jsp?topic=%2Fcom.ibm.swg.im.iis.ds.parjob.dev.doc%2Ftopics%2Fgrid_intro.html. Unfortunately it turns out that it does not work.


Forget about this check box - it does not work! 
The correct way to do it is to use the Grid deployment toolkit provided by IBM. Here are the steps we followed to grid-enable an existing DataStage 9.1 installation:

1. Install IBM Tivoli Workload Scheduler LoadLeveler
This is one of the many supported resource managers but we selected it since it comes from IBM and we love IBM software :). The installation procedure can be found here: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.loadl.doc%2Fllbooks.html. In addition there is an IBM Redbook, "Workload Management with LoadLeveler", rather outdated but still a valuable source.
Please mind that Grid Toolkit 4.2.6+ is not compatible with LoadLeveler 4.1.1.0-0, so consider installing 4.1.1.14 or later. LoadLeveler 5.x would also work, but requires Red Hat / CentOS 6.

2. Install IBM Grid Toolkit  
  • Source the dsenv: . /opt/IBM/InformationServer/Server/DSEngine/dsenv
  • Copy the grid toolkit files under $DSHOME/../PXEngine
  • Execute install.sh and provide the required responses
    • Server type: Information Server Install
    • Resource Manager: LoadLeveler
    • Path to Resource Manager commands: /home/loadl/bin
    • Grid JOBDIR: /home/dsadm/grid_jobdir
    • Maximum number of nodes per job (global): 3
    • List of valid LoadLeveler Classes: small medium large No_Class
    • Enter a default Class for running jobs: medium
  • Source the dsenv once again
  • Restart DataStage
3. Configure ssh
  • Replace rsh with ssh in $APT_ORCHHOME/etc/remsh
  • Enable passwordless SSH login for the UNIX user used to run the jobs
Now you should be able to add the grid parameters to your jobs and successfully execute them in the grid environment. More details about the grid environment installation can be found in another IBM Redbook: "Deploying a Grid Solution with the IBM InfoSphere Information Server".
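The grid toolkit exposes its settings as environment variables that you add to the job as job parameters. Typically these are variables along the lines of APT_GRID_ENABLE, APT_GRID_COMPUTENODES, APT_GRID_PARTITIONS and APT_GRID_QUEUE - verify the exact names against the toolkit version you installed; the values below are only illustrative:

  APT_GRID_ENABLE=YES
  APT_GRID_COMPUTENODES=2
  APT_GRID_PARTITIONS=4
  APT_GRID_QUEUE=medium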

Processing XML Data with DataStage

SKILL SHARING WITH MILA MASLENKOVA

Today we had a terrific skill-sharing session with our BI and DI consultant Mila Maslenkova about the XML Pack vs. the new XML Stage in DataStage. Mila showed how to compose and read complex XML documents with the legacy XML Pack and with the new XML Stage, which was introduced with DataStage 8.5. We learned how the new XML Stage works and what it can do better than the legacy XML Pack, but we also learned about the limitations of the XML Stage. The new stage supports XML natively, i.e. there is no longer a need to "flatten" and recompose XML data to perform hierarchical transformations (i.e. XML to XML) like regrouping the XML document nodes. Unfortunately, value calculations on text or attribute nodes (e.g. stripping characters from a string, date field calculations, ...) are not directly supported in the assembly editor of the new XML Stage and still require flattening of the XML data into relational data so it can be processed in a normal Transformer stage. Also, the XML Stage has a significantly longer startup time compared to the stages of the XML Pack.


The short summary: once you understand all the operations of the new XML Stage, it simplifies XML processing with DataStage and makes it more consistent compared to the old XML Pack; on the other hand, the XML Pack still has its value, especially when it comes to quickly reading or writing small XML documents. So it is no surprise that the XML Pack is still part of the latest DataStage release (9.1).

Presentation Proposals for IBM Information On Demand 2013 submitted

We prepared two very interesting session proposals for this year's IOD in Las Vegas. The first presentation is "A Unicredit Case Study of Data Protection with DataStage and Protegrity" and reveals how we implemented data protection for Unicredit's Global CFO Data Warehouse. Unicredit and Protegrity will be co-speakers on this session! The second proposal is "A Deep Dive Into DataStage Performance Tuning" which shows how to get the ultimate performance out of large-scale DataStage deployments. Now we expect the IBM conference board to select our proposals - keep your fingers crossed! Read on for more details about the sessions.


A Unicredit Case Study of Data Protection with DataStage and Protegrity

Session abstract
Data privacy is not only a hot topic, but a critical requirement when implementing a multinational enterprise data warehouse (EDWH) in the financial services industry. Designing a waterproof, flexible and scalable data protection solution is a daunting task. We showcase how we managed this task for Unicredit's EDWH and deep dive into the technical solution which combines DataStage with the data protection software Protegrity and the Teradata DBMS. On our dive we will explain different approaches to data protection including tokenization. We will present lessons learned and best practices from the Unicredit implementation. Finally we will point out how the implementation could be adapted to other BigData environments.

Why you should attend this session
Attendees will learn why data protection is important and how it can be efficiently implemented using DataStage and Protegrity. They will gain an understanding of different data protection methods, what vaultless tokenization is about and why Unicredit decided to use tokenization rather than encryption or data masking. They will get a comprehensive insight into Unicredit's approach, architecture and best practices for protecting data in a multinational enterprise data warehouse.


A Deep Dive Into DataStage Performance Tuning

Session abstract
Since the advent of massively scalable ETL engines like the DataStage Parallel Engine it is a common misconception that poorly performing ETL processes are mainly a hardware sizing problem. In this session we explain why this is not the case and we present a collection of DataStage tips and techniques, design patterns and configuration options which proved successful in avoiding performance bottlenecks and dramatically speeding up the production batch runs at Unicredit's EDWH implementation. We show how to customize a DataStage Grid deployment to improve IO throughput and overall resource allocation. We present DataStage parallel job design patterns and the Balanced Optimization feature which help to reduce resource usage or wait cycles.

Why you should attend this session
Attendees will learn how to distinguish and resolve different classes of performance bottlenecks. They will get a better understanding of how to set up a performance-optimized DataStage infrastructure supporting multiple projects with different workloads and learn how Grid deployments can be enhanced to resolve IO and other resource bottlenecks. Developers will learn job design patterns to reduce resource usage and how to leverage Balanced Optimization to improve existing jobs.

IBM BigData Certification - Check!

After attending the IBM BigData Boot Camp in Prague from the 5th to the 8th of March 2013, we already have 3 certified IBM Big Data Consultants: Congratulations to Daniel Angelov, Radoslav Ivanov and Timm Rüger, who passed the IBM BigData Fundamentals Technical Professional v1 certification exam!

BigInsights is one part of IBM's answer to the BigData hype: IBM not only wrapped open source Apache Hadoop and related Apache projects into one easily installable "BigInsights" package, but also added missing enterprise features such as authentication as well as better latency, compression and performance. BigInsights comes in a freely available version and an enterprise version. The latter includes a nice Excel-like application called BigSheets, which allows even business users with limited technical know-how to run map-reduce jobs on large data sets and to immediately inspect the results. BigSheets worksheets can simply be chained together to implement a complete BigData workflow. BigInsights comes with powerful accelerators like Text Analytics, which allows extracting information from large text or web documents using IBM's AQL (Annotation Query Language).
While BigInsights is for data at rest, IBM Streams is for data in fast motion. The programming paradigm in Streams is similar to IBM DataStage's data flow design. The difference is that while DataStage might support up to millisecond latency, Streams supports microseconds! Streams can be used to build real-time data and complex event processing systems for supply chain management in the retail industry, for algorithmic trading in the investment banking industry or for network traffic analysis and fraud detection in IT departments, just to name a few possible scenarios. It has operators that support the efficient execution of window aggregation functions on a stream of data (e.g. moving averages).
Last but not least, IBM's Big Data offering also includes Data Explorer, which is an enterprise search engine and portal at the same time. It can be used to federate and index information from many different sources to make the information accessible and searchable over the web.
Check out and enjoy the following link collection for further information!

IBM Big Data

IBM BigInsights 

IBM Big Data Accelerators & Analytics 

IBM Streams (Ultra-low latency data stream processing)

IBM Data Explorer (Search Engine & Portal)