CRASH(VIII)                  2/12/75                  CRASH(VIII)







NAME

     crash - what to do when the system crashes



DESCRIPTION

     This section gives at least a few clues about how to proceed

     if the system crashes.  It can't pretend to be complete.



     How to bring it back up.   If the reason for  the  crash  is

     not  evident  (see  below for guidance on `evident') you may

     want to try to dump the system if you feel up to  debugging.

     At  the  moment a dump can be taken only on magtape.  With a

     tape mounted and ready, stop the machine, load  address  44,

     and  start.   This should write a copy of all of core on the

     tape with an EOF mark.  Caution: Any error is taken to  mean

     the  end of core has been reached.  This means that you must

     be sure the ring is in, the tape is ready, and the  tape  is

     clean  and  new.   If the dump fails, you can try again, but

     some of the registers will be lost.  See below for  what  to

     do with the tape.



     In restarting after a crash,  always  bring  up  the  system

     single-user.    This   is   accomplished  by  following  the

     directions in boot procedures (VIII) as  modified  for  your

     particular  installation;  a single-user system is indicated

     by having a particular value in the switches (173030  unless

     you've  changed  init) as the system starts executing.  When

     it is running, perform a dcheck and  icheck  (VIII)  on  all

     file systems which could have been in use at the time of the

     crash.  If any serious file system problems are found,  they

     should  be repaired.  When you are satisfied with the health

     of your disks, check and set the  date  if  necessary,  then

     come  up  multi-user.   This  is most easily accomplished by

     changing the single-user value in the switches to  something

     else, then logging out by typing an EOT.



     To even boot UNIX at all, three files (and  the  directories

     leading  to them) must be intact.  First, the initialization

     program /etc/init must be present and executable.  If it  is

     not, the CPU will loop in user mode at location 6.  For init

     to work correctly, /dev/tty8 and /bin/sh  must  be  present.

     If  either  does not exist, the symptom is best described as

     thrashing.  Init will go into a  fork/exec  loop  trying  to

     create a Shell with proper standard input and output.



     If you cannot get the system to boot, a runnable system must

     be  obtained from a backup medium.  The root file system may

     then be doctored as  a  mounted  file  system  as  described

     below.  If there are any problems with the root file system,

     it is probably prudent to go to a  backup  system  to  avoid

     working on a mounted file system.



     Repairing disks.   The first rule to keep in mind is that an

     addled  disk  should  be  treated  gently;  it  shouldn't be

     mounted unless necessary, and if it is very valuable yet  in

     quite  bad  shape, perhaps it should be dumped before trying

     surgery on  it.   This  is  an  area  where  experience  and

     informed courage count for much.



     The problems reported by  icheck  typically  fall  into  two

     kinds.  There can be problems with the free list: duplicates

     in the free list, or free blocks also in files.   These  can

     be  cured  easily  with  an  icheck  -s.   If the same block

     appears in more than one file or  if  a  file  contains  bad

     blocks,  the  files  should  be  deleted,  and the free list

     reconstructed.  The best way to delete such a file is to use

     clri  (VIII),  then remove its directory entries.  If any of

     the affected files is really precious, you can try  to  copy

     it to another device first.



     Dcheck may report files which have  more  directory  entries

     than links.  Such situations are potentially dangerous; clri

     discusses a special case of the problem.  All the  directory

     entries  for  the  file  should be removed.  If on the other

     hand there are more links than directory entries,  there  is

     no danger of spreading infection, but merely some disk space

     that is lost for use.  It is sufficient to copy the file (if

     it has any entries and is useful) then use clri on its inode

     and remove any directory entries that do exist.



     Finally, there may be inodes reported by dcheck that have  0

     links  and  0  entries.  These occur on the root device when

     the system is stopped with pipes open,  and  on  other  file

     systems  when  the  system  stops  with files that have been

     deleted while still open.  A clri will free the  inode,  and

     an icheck -s will recover any missing blocks.



     Why did it crash?   UNIX types  a  message  on  the  console

     typewriter when it voluntarily crashes.  Here is the current

     list of such messages, with enough information to provide  a

     hope  at  least  of  the  remedy.   The message has the form

     `panic: ...', possibly  accompanied  by  other  information.

     Left  unstated in all cases is the possibility that hardware

     or software error produced the message  in  some  unexpected

     way.



     blkdev

          The getblk routine was called with a nonexistent  major

          device  as  argument.   Definitely hardware or software

          error.



     devtab

          Null device table entry for the major  device  used  as

          argument  to  getblk.   Definitely hardware or software

          error.



     iinit

          An I/O error reading the super-block for the root  file

          system during initialization.



     out of inodes

          A mounted file system has no more i-nodes when creating

          a  file.  Sorry, the device isn't available; the icheck

          should tell you.



     no fs

          A device has disappeared from the mounted-device table.

          Definitely hardware or software error.



     no imt

          Like `no fs', but produced elsewhere.



     no inodes

          The in-core inode table is full.  Try increasing NINODE

          in param.h.  Shouldn't be a panic, just a user error.



     no clock

          During   initialization,   neither   the    line    nor

          programmable clock was found to exist.



     swap error

          An unrecoverable  I/O  error  during  a  swap.   Really

          shouldn't be a panic, but it is hard to fix.



     unlink - iget

          The directory containing a file being deleted can't  be

          found.  Hardware or software.



     out of swap space

          A program needs to be swapped out, and there is no more

          swap  space.   It  has  to  be  increased.  This really

          shouldn't be a panic, but there is no easy fix.



     out of text

          A pure procedure program is  being  executed,  and  the

          table  for  such  things  is full.  This shouldn't be a

          panic.



     trap

          An unexpected trap  has  occurred  within  the  system.

          This is accompanied by three numbers: a `ka6', which is

          the contents of the segmentation register for the  area

          in  which  the  system's stack is kept; `aps', which is

          the location where  the  hardware  stored  the  program

          status  word  during  the trap; and a `trap type' which

          encodes which trap occurred.  The trap types are:



          0     bus error

          1     illegal instruction

          2     BPT/trace

          3     IOT

          4     power fail

          5     EMT

          6     recursive system call (TRAP instruction)

          7     11/70 cache parity, or programmed interrupt

          10    floating point trap

          11    segmentation violation



     In some of these cases it is possible for  octal  20  to  be

     added  into the trap type; this indicates that the processor

     was in user mode when the trap occurred.   If  you  wish  to

     examine the stack after such a trap, either dump the system,

     or use the console switches to examine  core;  the  required

     address mapping is described below.



     Interpreting dumps.   All file  system  problems  should  be

     taken  care of before attempting to look at dumps.  The dump

     should be read into the file /usr/sys/core; cp (I) will  do.

     At  this point, you should execute ps -alxk and who to print

     the process table and the users who were on at the  time  of

     the  crash.  You should dump ( od (I)) the first 30 bytes of

     /usr/sys/core.  Starting at location 4,  the  registers  R0,

     R1,  R2,  R3,  R4,  R5,  SP and KDSA6 (KISA6 for 11/40s) are

     stored.  If the dump had to be restarted,  R0  will  not  be

     correct.  Next, take the value of KA6 (location 22(8) in the

     dump) multiplied by 100(8) and dump 1000(8)  bytes  starting

     from  there.   This  is the per-process data associated with

     the process running at the time of the crash.   Relabel  the

     addresses  140000  to  141776.   R5  is C's frame or display

     pointer.  Stored at (R5) is  the  old  R5  pointing  to  the

     previous  stack  frame.   At  (R5)+2  is the saved PC of the

     calling procedure.   Trace  this  calling  chain  until  you

     obtain  an  R5 value of 141756, which is where the user's R5

     is stored.  If the chain is broken, you have to look  for  a

     plausible  R5,  PC  pair  and  continue from there.  Each PC

     should be looked up in the system's name list using  db  (I)

     and  its  `:'  command,  to get a reverse calling order.  In

     most cases this procedure will  give  an  idea  of  what  is

     wrong.   A  more  complete discussion of system debugging is

     impossible here.



SEE ALSO

     clri, icheck, dcheck, boot procedures (VIII)



BUGS