We use the “Config-Manager” method to configure our IBM Z OMEGAMON environment. In our environment the GENERATE process produces more than 90 files and more than 600,000 lines of output.
For most GENERATE runs we have to stop the OMEGAMON monitors on the LPAR where the GENERATE process was started.
Today it takes ...
... 3 minutes or more to take the monitoring environment down.
... 2 minutes for the GENERATE process to run.
... ? minutes to get the monitoring environment up and usable again [this does not include the elapsed time of the GENERATE job or the time needed to check the GENERATE output (> 600,000 lines)]
If the GENERATE job ends with 'RC > 4', we and all other customers have to spend a lot of time analysing the GENERATE output (> 600,000 lines) to identify the cause of the problem.
During this long period our whole environment, including the key business applications on the affected LPAR, is in a blind (unmonitored) state, and we are not able to identify and solve possible performance problems.
Our historical sampling (CICS, IMS, DB2, MQ, JVM) and our reporting to the 'Elastic Stack' via IBM Z OMEGAMON DATA PROVIDER are also out of service.
After the problem has been fixed, the customer has to submit the job again and may get 'RC > 4' again. This is very time consuming and unpleasant in a high-availability mainframe world that runs 24/7.
In our initial case we had several occurrences where we got 'RC=8' after changing a configuration parameter (message in KCIPRINT: “KFU00016E Program has failed; PROGRAM=KPDUTIL RC=8”).
We backed out the change, and the GENERATE process ended with 'RC=8' again. After hours, many retries and tedious searching through each failed job with its > 600,000 lines, we found somewhere in the output that there was a “dataset in use” condition.
In another case we got 'KFU00007E Program has abnormally terminated; PROGRAM=IEWL ABEND=SD37' in the 600,000 lines and again spent a lot of time isolating and fixing the problem.
To eliminate this unnecessary and frustrating effort for all IBM Z OMEGAMON customers, in a mainframe world with a serious shortage of staff, we suggest that when a GENERATE process fails ...
... the utility messages (which are somewhere in the 600,000 lines) are additionally written to a separate KCIPRINT file (for example: timestamp, job step, program name, complete message).
This will significantly reduce the time needed for problem analysis, bring the monitoring environment back much faster and raise customer satisfaction.
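To illustrate the kind of condensed view we have in mind, here is a minimal sketch in Python of a filter that pulls error messages out of a large output file. It is not part of OMEGAMON and not how the product works. The names `extract_errors` and `format_summary` are hypothetical, and the pattern rests on an assumption: that the relevant messages carry an ID shaped like the two quoted above (letters, digits, then a severity letter E or S) and may contain a `PROGRAM=` field. Real utility messages may look different and the pattern would need adjusting.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumption: error messages carry an ID such as "KFU00016E"
# (3-4 letters, 4-5 digits, severity letter E or S).
MESSAGE_RE = re.compile(r"\b([A-Z]{3,4}\d{4,5}[ES])\b\s*(.*)")
# Assumption: the failing program, if reported, appears as PROGRAM=name.
PROGRAM_RE = re.compile(r"PROGRAM=([A-Z0-9$#@]+)")


class Finding(NamedTuple):
    line_no: int            # 1-based line number in the scanned output
    message_id: str         # for example "KFU00016E"
    program: Optional[str]  # value of PROGRAM=, or None if absent
    text: str               # message text following the ID


def extract_errors(lines: Iterable[str]) -> List[Finding]:
    """Return every line that contains an error-severity message ID."""
    findings = []
    for line_no, line in enumerate(lines, start=1):
        match = MESSAGE_RE.search(line)
        if match is None:
            continue
        program = PROGRAM_RE.search(line)
        findings.append(
            Finding(
                line_no,
                match.group(1),
                program.group(1) if program else None,
                match.group(2).strip(),
            )
        )
    return findings


def format_summary(findings: List[Finding]) -> str:
    """Render one line per finding, ready to write to a separate file."""
    return "\n".join(
        f"line {f.line_no}: {f.message_id} program={f.program or '?'} {f.text}"
        for f in findings
    )
```

Called on an open file, for example `format_summary(extract_errors(open(path)))` with `path` pointing at a downloaded copy of the job output, this reduces the output to the few lines that matter. The sketch cannot supply the timestamp or job step, because these are not contained in the message lines quoted here; this is one reason why the separate file should be written by the product itself.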
Hello, thank you for opening this product enhancement request. I am pleased to announce this enhancement has been delivered with APAR OA65222. Please reach out if you have any further questions. Regards IBM OMEGAMON Product Management
We are constantly trying to improve diagnostic capabilities in OMEGAMON. As part of this improvement, a significant documentation update for the Troubleshooting Guide was published in late November 2022. The new information gives a better explanation of how to investigate issues: https://www.ibm.com/docs/en/om-shared?topic=manager-troubleshooting
Hello,
sorry for the delay. I want to show you an output in which we searched for a long time to find the reason for the error message.
Your explanation is right. If you are familiar with the product you may get a fast result. In my case I have an experienced OMEGAMON user supporting me, and even together we often need a long time to find the cause of the problem.
We have had many situations where we got a message that some routine had ended with RC=8. Unfortunately the utility is called very often, and we needed a lot of time to find out where the problem happened and why it occurred. In most of these cases we think it would be useful to get the original message that caused the error. In many situations you are under stress and do not have much time. In these cases it would also be helpful to have the messages in one central place, so that you do not have to check several locations for unexpected messages.
Regards Franz-Georg