Issues with DLI Batch Backout and Checkpoints

Current Situation

The situation is: IMS 12.1, batch jobs running in DLI mode, dynamic backout is not enabled, no IRLM, no RSR tracking, DBRC is Force, databases are registered to DBRC, application program takes checkpoints.

Our normal procedure is for DLI applications to take one checkpoint at the beginning of the step, and then regular checkpoints thereafter. But since performing checkpoint restarts is not a consequence-free process, we do not perform a checkpoint restart for jobs unless they have run for a long time. (“Long” is defined as “long enough that performing the checkpoint restart is worth the trouble.”) Instead we restart the step at the beginning.

This means that if the step performed database updates, we must perform Batch Backout (using DFSBBO00) to the beginning of the step. For steps that are taking checkpoints, this means we BBO to the first checkpoint.

When a DLI step abnormally ends, IMS provides a helpful message to indicate if you need to do Batch Backout:

DFS036A BATCH BACKOUT NOT REQUIRED FOR jobname imsid
or
DFS036A BATCH BACKOUT IS REQUIRED FOR jobname imsid

Except that it doesn't.

If the message says “BATCH BACKOUT IS REQUIRED”, then this does mean what it says.

But if the message says “BATCH BACKOUT NOT REQUIRED” it actually means one of two things:

A. The step has not updated any databases.
B. The step has updated databases, but there were no updates since the last checkpoint.

This is because DBRC and IMS are making an assumption: they assume that if a DLI step abends, and dynamic backout is not enabled, the programmer is going to backout to the last checkpoint (if required) and then perform a true checkpoint restart.
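
The ambiguity can be written down as a small decision function. This is only a sketch of the behavior described above, not IMS code: the names `Situation`, `dfs036a_text`, and `backout_needed`, and the third enum member, are ours.

```python
from enum import Enum


class Situation(Enum):
    NO_UPDATES = "A"                    # step never updated a database
    NO_UPDATES_SINCE_CHKPT = "B"        # updates exist, none after the last checkpoint
    UPDATES_SINCE_CHKPT = "other"       # updates after the last checkpoint (our label)


def dfs036a_text(situation: Situation) -> str:
    """Message text the operator sees, as described in this write-up."""
    if situation is Situation.UPDATES_SINCE_CHKPT:
        return "BATCH BACKOUT IS REQUIRED"
    return "BATCH BACKOUT NOT REQUIRED"


def backout_needed(situation: Situation, restart_from_last_checkpoint: bool) -> bool:
    """What the operator actually has to do, given where the step will restart."""
    if situation is Situation.NO_UPDATES:
        return False
    if situation is Situation.NO_UPDATES_SINCE_CHKPT:
        # Only safe to skip BBO when doing a true checkpoint restart.
        return not restart_from_last_checkpoint
    return True
```

Situations A and B produce the same message text, yet they give different answers when the step is restarted from the beginning.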

So the first problem is, if you see the “BATCH BACKOUT NOT REQUIRED” message, how do you know which situation you are in? Are you in situation A, which means that no BBO is required under any circumstance? Or are you in situation B, which means that you really do need to do BBO if you intend to restart anywhere other than the last checkpoint?

One way to tell is to search the IEFRDER log files for the particular log record types that indicate database updates. This does work, except that a) it isn't so easy when the IEFRDER is on tape, b) it is inefficient for larger logs, and c) it doesn't actually tell you if you need to do BBO for a particular checkpoint.
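
A minimal sketch of such a search, in Python, under assumptions that are ours and should be checked against the log record layouts for your IMS release: each record is handed over as bytes including its 4-byte length prefix, the log record type code is the byte at offset 4, and type X'50' marks a database change. Reading the records off the data set (or tape) is left out.

```python
from collections import Counter
from typing import Iterable

# ASSUMPTIONS, not taken from the write-up above: verify before relying on them.
TYPE_OFFSET = 4          # position of the log record type code within a record
DB_CHANGE_TYPE = 0x50    # type code assumed to mark database change records


def count_by_type(records: Iterable[bytes]) -> Counter:
    """Count log records by their type code, skipping records that are too short."""
    counts: Counter = Counter()
    for record in records:
        if len(record) > TYPE_OFFSET:
            counts[record[TYPE_OFFSET]] += 1
    return counts


def step_updated_databases(records: Iterable[bytes]) -> bool:
    """True if at least one assumed database change record is present."""
    return count_by_type(records)[DB_CHANGE_TYPE] > 0
```

This only separates situation A from situation B. As noted in point c), it says nothing about whether BBO is needed for a particular checkpoint.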

Forcing Backout

It used to be that you could run BBO anyway. If you were in case A it just wouldn't back out anything; it would give an error return code that indicated that there was nothing found to back out. If you were in case B it would go ahead and back out to the requested checkpoint.

This hasn't been the case for some time. Now if you run BBO when “BATCH BACKOUT NOT REQUIRED”, it abends with U0041 and error message:

DFS041I DBRC SIGNON REQUEST, RC=28 IMTJ

What this message really means is that DBRC doesn't have a “subsystem entry for the log supplied to batch backout”. So, it means one of two things:

1. You have supplied BBO with the wrong log file. Trying to proceed with this file would be VERY BAD.
2. You have supplied BBO with the right log file, but DBRC doesn't want to proceed.

Now your problem is, how do you tell which of these situations you are in? Can you go ahead and back out to the previous checkpoint? Or would that cause catastrophic destruction of the database? The answer is: there is no way to know, aside from dumping DBRC and inspecting the PRILOG records.

The manual says that if you are in situation B (no updates since the last checkpoint) and you want to backout to a previous checkpoint, you should run BBO using DBRC=C and with the BYPASS LOGVER control card.

This does work, but realize that we are telling DBRC to skip all log verification. It won't even verify that the provided log file is recorded in a PRILOG record in DBRC (I think), nor that there hasn't been a reorg after it was created, nor that it is even for the right system, and so on. We might as well not have DBRC at all.

Idea priority

Medium

Guest | Aug 16, 2024

Reposting the original analysis; it was lost in the migration from the RFE to Ideas...
IMS: DLI Batch Backout Issues

Issues

We can identify a number of issues up to this point:

1. IMS assumes that DLI backouts will be to the last checkpoint.
2. The DFS036A message is not specific enough.
3. BBO doesn’t distinguish between incorrect logs and correct log but backout not needed.
4. There is no convenient way to BBO to the start of a step.
5. There is no way to backout to a previous checkpoint without bypassing verification.

1. IMS assumes that DLI backouts will be to the last checkpoint

This is the root cause of the problem.

This is a valid assumption for MPPs, BMPs, and DLI steps with dynamic backout (BKO=Y), but not for other DLI steps.

There are a number of reasons why a programmer may not want to backout to the last checkpoint. For example:
· It is higher risk. Coding programs to work properly when checkpoint restarted is non-trivial.
· It puts a number of restrictions on modification of input and output files.
· JCL changes are required on the restart.
· Automated restart systems (such as CA-11) do not support checkpoint restarts.
· You can’t delete records from input files (such as the record causing the abend) if the file was itself produced by a step that took checkpoints (because of short blocks).
· It is challenging to recover from space abends on output files, due to short blocks.

But the programmer may still want the step to take checkpoints, even though she may not want to perform a checkpoint restart. One reason is to be able to write the application in a way that supports running both as a BMP or in DLI mode. Or maybe sometimes the job runs long enough that the checkpoint restart would actually be used. Or they want to use the same code in production vs. test.

Realize that this assumption is deeper than just the format of the DFS036A message. What actually is happening at the DBRC level is that if there are no updates after the last checkpoint, DBRC considers the step as a normal termination and deauthorizes the subsystem.

This means that after the abend, other jobs are permitted to read all the updates made to the database to that point, even though the programmer may want to backout to a previous checkpoint. This is an integrity issue.

2. The DFS036A message is not specific enough.

As noted above, the DFS036A BATCH BACKOUT NOT REQUIRED message means either that no updates have occurred or that updates have occurred but BBO is not needed if you restart from the last checkpoint. And there is no reasonable way to tell which situation it is.

What is needed is different message text for “BBO is not required if starting from last checkpoint” and “BBO is required if starting before last checkpoint”.

3. BBO doesn’t distinguish between incorrect logs and correct log but backout not needed.

If we submit a BBO where the IMSLOGR log(s) do not correspond to a subsystem (SSYS) record in ABNORMAL TERM status, DBRC gives the RC=28 abend. The user can’t tell whether the problem is an incorrect log file, or whether it is the right log file but DBRC doesn’t think that batch backout is needed because of issue #1 above.

4. There is no convenient way to BBO to the start of a step.

If a job does not take any checkpoints then you can submit it without a CHKPT statement, and it will backout to the beginning of the step (the IMS started record). But if the job did take checkpoints, there is no way to do that. Leaving out the CHKPT statement backs out to the last checkpoint.

The solution is for the user to determine which was the first checkpoint and then manually enter it into the job.

This has risks:
a. We have had cases where the programmer enters an incorrect checkpoint id and then backs out to the wrong point, causing havoc.
b. It assumes that the program was coded to perform an initial checkpoint before any database updates. If the program does perform updates before a checkpoint, there is currently no way at all to backout to the beginning of the step. You have to resort to a forward recovery.
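
Risk a. can be reduced a little by checking the id before it goes into the BBO job. The sketch below is hypothetical throughout: it assumes you already have the step's checkpoint ids in the order they were taken (for example, collected from the job output), and the function names are ours. It does nothing for risk b.

```python
from typing import Sequence


def first_checkpoint(checkpoint_ids: Sequence[str]) -> str:
    """Return the id of the first checkpoint the step took."""
    if not checkpoint_ids:
        raise ValueError("step took no checkpoints")
    return checkpoint_ids[0]


def checkpoints_undone(requested: str, checkpoint_ids: Sequence[str]) -> int:
    """Validate a hand-entered id and report how many later checkpoints
    would be backed out past. Assumes ids are unique within the step."""
    if requested not in checkpoint_ids:
        raise ValueError(f"{requested!r} was not taken by this step")
    return len(checkpoint_ids) - 1 - checkpoint_ids.index(requested)
```

For a step restarted from the beginning, the requested id should equal `first_checkpoint(...)`; anything else deserves a second look before the job is submitted.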

5. There is no way to backout to a previous checkpoint without bypassing verification.

If we are in the situation where we want to backout to a previous checkpoint, the only way past the RC=28 abend is to BYPASS LOGVER, but this doesn’t just bypass the checkpoint edit, it completely bypasses the log verification.

This is bad because of issue #3:
A. BBO ends with RC=28.
B. User can’t tell if the log file is wrong, or if it is right but there were no updates after last checkpoint.
C. The solution to “no updates after last checkpoint” is to bypass verification.
D. But if the RC=28 was because the log file was wrong, bypassing verification is the last thing you would want to do. It permits the wrong log file to be used to backout.
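
The chain A to D can be modeled in a few lines. This is a sketch of the reasoning in this write-up only; the enum, the function names, and the outcome strings are ours, not IMS output.

```python
from enum import Enum


class Cause(Enum):
    WRONG_LOG = 1                        # log does not belong to the failed step
    RIGHT_LOG_NO_UPDATES_SINCE_CHKPT = 2  # correct log, situation B


def observed_without_bypass(cause: Cause) -> str:
    """What the user sees when running BBO normally: identical for both causes."""
    return "U0041, DFS041I DBRC SIGNON REQUEST, RC=28"


def result_with_bypass_logver(cause: Cause) -> str:
    """What happens once log verification is bypassed."""
    if cause is Cause.WRONG_LOG:
        return "backout driven by the wrong log"
    return "backout to the requested checkpoint"
```

Because `observed_without_bypass` returns the same thing for both causes, the user has to choose the workaround blind, and one of the two outcomes is the damaging one.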

Recommendations

1. The root cause is the assumption that DLI steps will be restarted from the last checkpoint, even when dynamic backout is off.

We suggest that there should be a way to tell IMS not to assume this. That is, we want it to assume that, for DLI steps, any database updates followed by abnormal termination should result in keeping the subsystem in abnormal termination status, regardless of how and when the checkpoints are taken.

Ideally it would be an IMS system-level default set in DBRC, with the ability to override the default via the PARMs on a DLI batch job.
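
As a sketch of what is being asked for, the function below models the subsystem status after an abend. The option name `assume_last_checkpoint_restart` is hypothetical (no such IMS or DBRC parameter exists today, as far as we know); its default reproduces the current behavior described under issue #1.

```python
def ssys_status_after_abend(
    updated_any: bool,
    updated_since_last_chkpt: bool,
    assume_last_checkpoint_restart: bool = True,  # hypothetical option
) -> str:
    """Status DBRC would leave the subsystem in after a DLI step abends."""
    if updated_since_last_chkpt:
        return "ABNORMAL TERM"
    if updated_any and not assume_last_checkpoint_restart:
        # Requested behavior: any update followed by an abend keeps the
        # subsystem in abnormal termination status.
        return "ABNORMAL TERM"
    return "deauthorized (treated as normal termination)"
```

With the option off, situation B keeps the subsystem record in abnormal termination status, which is what would let BBO accept the log without bypassing verification.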

2. The DFS036A message should distinguish between no database updates, and no updates after last checkpoint.

Note that if #1 is implemented, we would expect the message to indicate BBO is required when the database has been updated (but not since the last checkpoint), because the subsystem record would still be in DBRC.

3. We would like BBO to distinguish between incorrect log files and subsystems that don’t need to backout if restarting from last checkpoint.

Note that if #1 is implemented, we would expect BBO to accept the log file in the case where the database hasn’t been updated since the last checkpoint, because the subsystem record would still be in DBRC.

4. We would like a control card to be added to BBO to request backout to the beginning of the step, regardless of whether the step took checkpoints or not.

5. We suggest that BBO should have a way to request a backout to a previous checkpoint (or the beginning of the step) even though no updates have been made since the last checkpoint, without having to completely disable log verification.

Guest | Jan 28, 2019

Hi Rick,
Thank you for your interest in keeping IMS a vital and successful product. Software development has continuously evolved during IMS's lifetime, and so has IMS itself. We have kept pace with, adopted, and implemented many industry standard best practices within our organization, including Continuous Delivery, Design Thinking, and Agile.

When choosing new features to add from the list of requirements in our backlog, we assess which will bring the most value to as many clients as possible and prioritize those.

At this time, after reviewing this request for enhancement and assessing its potential value, we have decided to reject it. The reason we are rejecting RFE 56145 is we have not had many clients ask for this enhancement. There doesn’t seem to be much interest from other clients. We would rather invest in high priority items that bring the most value to many clients.

We appreciate your input to IMS, and we hope that you will continue to submit ideas for improvements as customer feedback is a key component to shaping the future direction of IMS.

Thank you.
Sincerely,
Deepak Kohli - deepakk@us.ibm.com

Guest | Jul 15, 2014

Attachment (Description): The attachment is the entire write up from my application DBA - which is the entirety of the text that is broken up above to fit in the various sections.

Batch.txt
