Tuesday, 22 April 2014

Error Recovery

Terminology

A system consists of a set of hardware and software components and is designed to provide a specified service.
Failure of a system occurs when the system does not perform its services in the manner specified.
An erroneous state of the system is a state which could lead to a system failure by a sequence of valid state transitions
A fault is an anomalous physical condition.
An error is a manifestation of a fault in a system, which can lead to system failure.

Recovery

Failure recovery is a process that restores an erroneous state to an error-free state. (after a failure, restoring system to its "normal" state)

Failure Classification

process failure
system failure
secondary storage failure
communication medium failure

What are some causes for each?

Tolerating Process Failures

signal process to recover internally
restart process from a prior state
abort process

What are some situations where each is appropriate?

Recovering from System Failures

amnesia -- restart in predefined state
partial amnesia -- reset part of the state to predefined
pause -- roll back to before failure
halting -- give up

Tolerating Secondary Storage Failures

archiving (periodic backup)
mirroring (continuous)
activity logging

Tolerating Communication Medium Failures

ack & resend
more complex fault-tolerant algorithms

Backward versus Forward Error Recovery

skip forward to a new correct state
- requires contextual knowledge of what "forward" is
go back to a previous correct state
- overhead: takes time to save and restore state
- fault may repeat (¥ cycling)
- recovery may be impossible

Backward Error Recovery

based on recovery points
two approaches:
1. operation-based recovery
2. state-based recovery

System Model

Stable Storage

does not lose information in the event of system failure
is used to keep logs & recovery points
algorithms in this chapter assume an underlying stable storage system already exists

Two Approaches to Fault Tolerance

operation based
- record a log (audit trail) of the operations performed
- restore previous state by reversing steps
state based
- record a snapshot of the state (checkpoint)
- restore state by reloading snapshot (rollback)

Practical systems employ a combination of the two approaches, e.g., logging with periodic full-DB snapshots for archive.

Fundamental Issues in Crash Recovery

disk writes are only atomic by sector
updates and commits require multiple writes
a crash may occur between writes
log contains record of updates, commits, and aborts
data is written to disk asynchronously
- DB is cached
- log is buffered

The textbook jumps right into the problem of supporting crash recovery, without first reviewing any basic transaction models. The following are two more basic models than those mentioned in the text.

Basic Deferred-Update Model

save a transaction's updates as it runs, in temporary storage
use the saved updates to update the database when the transaction commits
update: record a redo record (e.g. the new value of the item being updated) in an intention list
read: combine the intention list and the database to determine the desired value
commit: update the database by applying the intention list in order, starting with the first operation done by the transaction
abort: discard the transaction's intention list

Basic Update-In-Place Model

update the DB as transaction runs
undo the updates if the transaction aborts
update: record an undo record (e.g., the old value of the item being updated) to an undo log, and then update the database
read:
commit: discard the transaction's undo log
abort: use the undo records in the transaction's undo log to back out the transaction's updates, by backing out the operations in the reverse of the order in which they were originally done

What provides for disk crash recovery?

Extended Update-In-Place Model

update: modify the online DB and record both undo and redo records, including:
- name/location of object
- old state/value of object (for undo)
- new state/value of object (for redo)
in a safe order
read: read the current value and apply the transaction's undo log
commit: discard/invalidate the transaction's undo log
abort: use the undo records in the transaction's undo log to back out the transaction's updates

Where does the stable storage fit in?

Crash Recovery with Update-In-Place

We now have a way to reconstruct the DB system in event of a crash, starting from an archived snapshot and the subsequent log:

transactions not logged as committed are treated as aborted
back out active or aborted transactions, using undo records
do DB updates that may have been in cache, using redo records

If we are starting with a snapshot, why do we need to worry about active, uncommitted, and aborted transactions?

Problem: DB write before log write

There is a defect in the above scheme

if the cached new value of X is written to DB on disk
and then the system crashes, before the old value of X is written to log

How to solve?

Solution: Write-Ahead-Log

Before a block is written to DB disk, make sure the corresponding undo record is completely written to the log disk.
The log must be forced to disk as part of committing a transaction.

Crash Recovery with Write-Ahead-Log

redo phase: redo all the updates in the log, including undo operations of aborted transactions
undo phase: abort all transactions that have no commit or abort record in the log, using the usual undo records in the log

State Based Approach

based on checkpoints of entire state of process
recovery does rollback to checkpointed state
use of shadow pages can reduce size of checkpoints

Problems in Distributed/Concurrent Systems

communicating processes must coordinate checkpoints & rollbacks
lost messages
orphan messages
livelocks

Orphan Messages

Note domino effect if Z is rolled back

Lost Messages

What is the difference between a lost message and an orphan message?

Livelock

These all motivate need for coordinating checkpoints & recovery

Strongly Consistent Set of Checkpoints

There is no information flow between any processes in the set during the time interval spanned by the checkpoints, i.e., no messages in transit.

Consistent Set of Checkpoints

There may be information flow between the processes, but each message recorded as received should be recorded as sent. That is, there are no orphan messages.
What is the remaining problem here?

No comments:

Inspiring Quotes

An inspiring quote may be just what you need to turn your day around. Here are some of the most inspiring quotes ever spoken or written.

I hated every minute of training, but I said, “Don’t quit. Suffer now and live the rest of your life as a champion.”

–Muhammad Ali

“You can have anything you want if you are willing to give up the belief that you can’t have it.”
–Robert Anthony

“There is no man living that can not do more than he thinks he can.”

–Henry Ford

“The best way to predict the future is to create it.”

–Dr. Forrest C. Shaklee

“It’s not about time, it’s about choices. How are you spending your choices?”

–Beverly Adamo

“Success…seems to be connected with action. Successful people keep moving. They make mistakes, but they don’t quit.”
–Conrad Hilton

“Destiny is not a matter of chance; it’s a matter of choice.”

–Anonymous

“The future belongs to those who believe in the beauty of their dreams.”
–Eleanor Roosevelt

“The quality of a person’s life is in direct proportion to their commitment to excellence, regardless of their chosen field of endeavor.”
–Vince Lombardi

“It is never too late to be what you might have been.”
–George Eliot

“Do not let what you can not do; interfere with what you can do.”
–John Wooden

“One man with courage makes a majority.”
–Andrew Jackson

“Failure is the opportunity to begin again more intelligently.”
–Henry Ford

“Try not to become a man of success but rather try to become a man of value.”
–Albert Einstein

“The mind is its own place, and in itself can make a heaven of Hell, a hell of Heaven.”

–John Milton

"If u are student, working and preparing give a little extra effort after regular work. A small sacrifice of TV time, fun time, or facebook time can bring a lot of better things to life than you ever imagined."

-- Naam likhna jaroori nai samajhta.

Thank you for reading, be sure to pass this along!

Real Computer Science begins where we almost stop reading ...