
Using Environment Variables for Performance Tuning of DataStage Jobs

This post explains how environment variables can be used and tuned to improve the performance of DataStage jobs.


Here are some example Parameter Sets full of environment variables to illustrate how this works.  The first two scenarios show how to create a Parameter Set for very high volume and very low volume jobs.  This lets you set up your project-wide variables.

For high volume data jobs, the first environment variables to look at are the following (a sample Parameter Set sketch follows the list):
  • $APT_CONFIG_FILE: lets you point the job at your biggest configuration file, the one with the most nodes, so the job runs with the highest degree of parallelism.
  • $APT_DUMP_SCORE: when switched on, it creates a job run report that shows the partitioning used, degree of parallelism, data buffering and inserted operators.  Useful for finding out what your high volume job is doing.
  • $APT_PM_PLAYER_TIMING: this reporting option lets you see what each operator in a job is doing, especially how much data it is handling and how much CPU it is consuming.  Good for spotting bottlenecks.
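As a minimal sketch, assuming you keep these values together in a Parameter Set named PX_HIGH_VOLUME (the set name, the configuration file path and the node count are hypothetical, not DataStage defaults), the high volume bundle might look like this when written down in Python:

```python
# Hedged sketch of a hypothetical PX_HIGH_VOLUME Parameter Set as a plain mapping.
# The set name, config file path and node count are illustrative assumptions.
PX_HIGH_VOLUME = {
    "APT_CONFIG_FILE": "/opt/IBM/InformationServer/Server/Configurations/8node.apt",  # biggest config file
    "APT_DUMP_SCORE": "True",        # dump the job score: partitioning, parallelism, inserted operators
    "APT_PM_PLAYER_TIMING": "True",  # per-operator row counts and CPU use, good for spotting bottlenecks
}
```

In DataStage itself you would create this as a Parameter Set in the Designer and add the environment variables to it; the mapping above only documents the intended values.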
One way to speed up very high volume jobs is to pre-sort the data with a Sort stage in an upstream job and make sure it is not re-sorted in the DataStage job you are tuning.  This is done by turning off automatic sort insertion in high volume jobs (a per-run override sketch follows the bullet):

  • APT_NO_SORT_INSERTION: stops the job from automatically adding a sort operation in front of stages that need sorted data, such as Remove Duplicates.  You can also add a Sort stage to the job and set it to "Previously Sorted" to avoid the inserted sort on a specific job path.
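If jobs are launched from the command line, one way to apply the override for a single run is the dsjob client. The sketch below assumes the variable has been exposed through the hypothetical PX_HIGH_VOLUME Parameter Set from earlier, and the project and job names are made up:

```python
import subprocess

# Hedged sketch: switch off automatic sort insertion for one run of a hypothetical job.
# Assumes dsjob is on the PATH and that APT_NO_SORT_INSERTION has been added to the
# hypothetical PX_HIGH_VOLUME Parameter Set; project and job names are illustrative.
subprocess.run(
    [
        "dsjob", "-run",
        "-param", "PX_HIGH_VOLUME.APT_NO_SORT_INSERTION=True",
        "MyProject", "LoadSalesFact",
    ],
    check=True,
)
```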

Buffering is another thing that can be tuned.  It controls how data is passed between stages; usually you just leave it alone, but on a very high volume job you might want custom settings (a quick sanity check on the sizing rules follows the list):
  • APT_BUFFER_MAXIMUM_MEMORY: Sets the default maximum memory buffer size, in bytes. The default value is 3145728 (3 MB).
  • APT_BUFFER_DISK_WRITE_INCREMENT: For systems where small to medium bursts of I/O are not desirable, the default 1 MB write-to-disk chunk size may be too small. APT_BUFFER_DISK_WRITE_INCREMENT controls this and can be set larger than 1048576 (1 MB). The setting may not exceed max_memory * 2/3.
  • APT_IO_MAXIMUM_OUTSTANDING: Sets the amount of memory, in bytes, allocated to a WebSphere DataStage job on every physical node for network communications. The default value is 2097152 (2 MB). When you are executing many partitions on a single physical node, this number may need to be increased.
  • APT_FILE_EXPORT_BUFFER_SIZE: if your high volume jobs are writing to sequential files you may be overloading your file system; increasing this value delivers data to the files in bigger chunks to combat long latency.
These are just some of the I/O and buffering settings.
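As a worked example of the sizing rules quoted above, the small sketch below checks a proposed APT_BUFFER_DISK_WRITE_INCREMENT against the max_memory * 2/3 limit; the candidate values are assumptions chosen only for illustration, not recommendations:

```python
# Hedged sketch: check proposed buffer settings against the rules described above.
# The candidate values are illustrative assumptions, not recommended settings.
MB = 1048576

apt_buffer_maximum_memory = 8 * MB        # proposed maximum memory buffer size
apt_buffer_disk_write_increment = 4 * MB  # proposed write-to-disk chunk (default is 1 MB)

# APT_BUFFER_DISK_WRITE_INCREMENT may not exceed max_memory * 2/3.
limit = apt_buffer_maximum_memory * 2 // 3
if apt_buffer_disk_write_increment > limit:
    raise ValueError(
        f"APT_BUFFER_DISK_WRITE_INCREMENT={apt_buffer_disk_write_increment} "
        f"exceeds max_memory * 2/3 = {limit}"
    )
print("Proposed buffer settings respect the 2/3 limit")
```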

Low Volume Jobs

By default a low volume job will tend to run slowly on a massively scalable DataStage server. Far fewer environment variables need to be set, as low volume jobs don't need any special configuration.  Just make sure the job is not trying to partition data, as that can be overkill when you don't have a lot of data to process.  Partitioning and repartitioning data on volumes of less than 1,000 rows only makes the job start and stop more slowly (a sketch of a low volume bundle follows the list):
  • APT_EXECUTION_MODE: By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode: ONE_PROCESS, MANY_PROCESS or NO_SERIALIZE.
  • $APT_CONFIG_FILE: lets you define a config file that runs these little jobs on just one node so they don't attempt any partitioning or repartitioning.
  • $APT_IO_MAXIMUM_OUTSTANDING: when a DataStage job starts on a node it is allocated some memory for network communications, especially the partitioning and repartitioning between nodes.  This defaults to 2 MB, but when you have a squadron of very small jobs that don't partition you can reduce this size to make the jobs start faster and free up RAM.
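A minimal sketch of what the low volume bundle might contain, together with a one-node configuration file it could point to. The set name PX_LOW_VOLUME, the host name, the directories and the reduced buffer size are all hypothetical:

```python
# Hedged sketch: a hypothetical single-node APT configuration file plus the matching
# low volume Parameter Set values. Host name and directories are illustrative only.
ONE_NODE_CONFIG = """\
{
  node "node1"
  {
    fastname "etlserver1"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}
"""

PX_LOW_VOLUME = {
    "APT_CONFIG_FILE": "/opt/IBM/InformationServer/Server/Configurations/1node.apt",
    "APT_EXECUTION_MODE": "ONE_PROCESS",     # run the whole job in a single sequential process
    "APT_IO_MAXIMUM_OUTSTANDING": "524288",  # shrink the 2 MB network buffer for tiny jobs (assumed value)
}

# Write the sample config somewhere harmless so it can be inspected.
with open("/tmp/1node.apt", "w") as f:
    f.write(ONE_NODE_CONFIG)
```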

Other Parameter Sets

You can set up all your default project Environment Variables to handle all data volumes in between.  You can still have a Parameter Set for medium volume jobs if you have specific config files you want to use.
You might also create a Parameter Set called PX_MANY_STAGES which is for any job that has dozens of stages in it regardless of data volumes.
  • APT_THIN_SCORE: Setting this variable decreases the memory usage of steps with 100 operator instances or more by a noticeable amount. To use this optimization, set APT_THIN_SCORE=1 in your environment. There are no performance benefits in setting this variable unless you are running out of real memory at some point in your flow or the additional memory is useful for sorting or buffering. This variable does not affect any specific operators which consume large amounts of memory, but improves general parallel job memory handling.
This can be combined with the large volume Parameter Set in a job so you have extra configuration for high volume jobs with many stages.
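A sketch of how the many-stages bundle could sit on top of the high volume one. PX_MANY_STAGES and the merge are illustrative; in DataStage you would simply attach both Parameter Sets to the job:

```python
# Hedged sketch: a hypothetical PX_MANY_STAGES bundle layered on top of the high volume
# values for a job that both moves a lot of data and contains dozens of stages.
PX_HIGH_VOLUME = {"APT_CONFIG_FILE": "/path/to/8node.apt", "APT_DUMP_SCORE": "True"}  # from the earlier sketch
PX_MANY_STAGES = {"APT_THIN_SCORE": "1"}  # trim memory for flows with 100+ operator instances

effective_settings = {**PX_HIGH_VOLUME, **PX_MANY_STAGES}  # later values win on any overlap
```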
You might also create a Parameter Set for a difficult type of source data file when default values don't work, e.g. PX_MFRAME_DATA (a sketch follows below):
  • APT_EBCDIC_VERSION: Certain operators, including the import and export operators, support the "ebcdic" property specifying that field data is represented in the EBCDIC character set. The APT_EBCDIC_VERSION variable indicates the specific EBCDIC character set to use.
  • APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL: When set, allows a zero length null_field value with fixed length fields. This should be used with care, as poorly formatted data will cause incorrect results. By default a zero length null_field value causes an error.
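A sketch of the mainframe-style source data bundle. The set name PX_MFRAME_DATA and the specific EBCDIC code set shown are assumptions for illustration; use whatever character set your source data actually requires:

```python
# Hedged sketch: a hypothetical PX_MFRAME_DATA Parameter Set for awkward EBCDIC source files.
PX_MFRAME_DATA = {
    "APT_EBCDIC_VERSION": "IBM037",                  # assumed code set; pick the EBCDIC version of your data
    "APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL": "1",  # tolerate zero length null_field with fixed length fields
}
```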
SAS is another operator with a lot of configurable environment variables, because when you read or write native SAS datasets or run a SAS transformation you hand some of the control over to SAS.  These environment variables configure that interaction (a sample bundle follows the list):
  • APT_HASH_TO_SASHASH: can output data hashed using sashash, the hash algorithm used by SAS.
  • APT_SAS_ACCEPT_ERROR: When a SAS procedure causes SAS to exit with an error, this variable prevents the SAS-interface operator from terminating. The default behavior is for WebSphere DataStage to terminate the operator with an error.
  • APT_NO_SAS_TRANSFORMS: WebSphere DataStage automatically performs certain types of SAS-specific component transformations, such as inserting an sasout operator and substituting sasRoundRobin for RoundRobin. Setting the APT_NO_SAS_TRANSFORMS variable prevents WebSphere DataStage from making these transformations.
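A sketch of a SAS interaction bundle; the set name PX_SAS_DATA is hypothetical, and whether you want each switch on depends on the job:

```python
# Hedged sketch: a hypothetical PX_SAS_DATA Parameter Set for jobs that read or write SAS data.
# Which switches you actually enable depends on the job; the values here are illustrative.
PX_SAS_DATA = {
    "APT_HASH_TO_SASHASH": "1",   # hash with the SAS sashash algorithm
    "APT_SAS_ACCEPT_ERROR": "1",  # keep the SAS-interface operator alive when a SAS procedure errors
    "APT_NO_SAS_TRANSFORMS": "1", # stop automatic SAS-specific component substitutions
}
```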
You can group all known debug parameters into a single debug Parameter Set to make it easier for support to find (a sample debug bundle follows the list):
  • APT_SAS_DEBUG: Set this to enable debugging in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which may then be copied into the WebSphere DataStage log. Don't bury this in your SAS Parameter Set, where the support team might not be able to find it or know it exists.
  • APT_SAS_DEBUG_IO: Set this to enable input/output debugging in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which may then be copied into the WebSphere DataStage log.
  • APT_SAS_SCHEMASOURCE_DUMP: When using SAS Schema Source, causes the command line to be written to the log when executing SAS. You use it to inspect the data contained in a -schemaSource. Set this if you are getting an error when specifying the SAS data set containing the schema source.
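A sketch of the single debug bundle; PX_DEBUG is a hypothetical name, and the idea is that support only ever has to look in one place:

```python
# Hedged sketch: a hypothetical PX_DEBUG Parameter Set that collects the known debug switches,
# left off by default and flipped on only for a support run.
PX_DEBUG = {
    "APT_SAS_DEBUG": "0",              # SAS process debug messages (surface in the SAS log)
    "APT_SAS_DEBUG_IO": "0",           # SAS input/output debug messages
    "APT_SAS_SCHEMASOURCE_DUMP": "0",  # log the SAS command line when using a schema source
}

def debug_overrides() -> dict:
    """Return the same switches turned on for a one-off support run."""
    return {name: "1" for name in PX_DEBUG}
```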
