[linux-elitists] Cluster filesystems

Jason Spence jspence@lightconsulting.com
Thu Jan 1 20:50:16 PST 2004

I did my annual hardware inventory and found out that I'm pushing a
couple terabytes of space across all my boxes.  Some of the paths to
those disks aren't all that fast, but the space is there.

The last time I tried Intermezzo, GFS, Coda, OCFS, etc, I found them
to be lacking in stability and documentation, and recovery from the
simultaneous update problem[1] was really spotty and labor intensive.

A few questions:

 - Are there any cluster filesystems for our favorite OSes that don't
   suck?

 - Is there a UNIX style solution to the simultaneous update problem?
   All the current non-sucky solutions I've seen involve either
   preventing writes during disconnected operation or changing the
   file abstraction in some way, whether that means involving the
   filesystem in counting lines in the file or doing a VMS-style
   record scheme.  Both strike me as kind of counter to the
   traditional way of doing things in UNIX...

 - Which project is getting all the eyeballs these days?

 - People keep giving me funny looks when I talk about using multiple
   logins to a 1394 device as a shared storage mechanism.  Is there
   something inherently unreliable about firewire, or is this just
   marketing perceptions coloring the public view of the technology?

 - iSCSI for local transport?

 - Automatic distributed backup software?  It seems a bit silly to me
   that I can't buy anything less than a bazillion MIPS and a disk
   with a capacity only expressible in scientific notation (of which I
   use maybe 1% in between bouts of Actual Work), yet it can't
   automatically spend its spare time imaging hard drives from other
   boxen in my LDAP administrative domain.

 - Or even better yet, why hasn't anyone implemented process
   migration?  As far as I can tell (but haven't tried (yet)), you
   just need to do this for the process, its threads, and any
   dependent processes (like the X server if you're migrating
   something that's doodling on the X display):

    1) Have a common filesystem (including devnodes)

    2) Read /proc/pid/maps to get a memory map

    3) kill(pid, SIGSTOP)

    4) ptrace(PTRACE_PEEKDATA, pid, someaddr, 0); [2]
       for all the memory sections (and get the registers too)

    5) Figure out file descriptors, SysV IPC usage, sockets, etc and
       write them down somewhere

    6) Kill everything on the source host

    7) Move all the paperwork over to the target host over the network
       and start a dummy program in unused VM space that reallocates
       all the resources [3], writes the process sections into its VM
       space, spawns threads, copies all the registers over for each
       thread, [5] and then jumps to the PC value retrieved after the
       SIGSTOP for each thread. [6] [7]

    8) Pray.

  The idea being that in a well maintained network, controlled
  shutdowns do not have to impact the use of applications hosted on
  the machine being shut down (much).

  Oh nuts, you'd have to intercept all the hardware I/O on the source
  to reconstruct the state of hardware devices if you're migrating
  something that talks directly to hardware.  Oh wait, that would be
  bad because the device might not like being reinitialized.  Hmm.

 - Jason           Last known location:  3.2 miles northwest of Union City, CA

"I am ready to meet my Maker.  Whether my Maker is prepared for the
great ordeal of meeting me is another matter."
		-- Winston Churchill

[1] Think of a conflict in CVS, but with an actual file on a
    distributed filesystem.

[2] Which is really slow, but could be implemented using shared memory
    instead via a new syscall or something.

[3] As far as I can tell, files, locks, environment variables, mmaped
    stuff, SysV IPC stuff, fifos, UNIX domain sockets, keyboard LED
    states, terminal states, shared libraries, threads, signal masks,
    video card registers and memory (in the X server case), what the
    process had for lunch, etc can be handled using existing syscalls;
    only in-flight TCP or other stream oriented connections would need
    special kernel help and IP spoofing, no?  [4]

[4] Although I suppose you could tell the source machine to shoot the
    process in the head with prejudice, set up temporary firewall
    rules to prevent the RSTs from making it out, use more firewall
    rules, a raw socket, and a socket/connect to trick the target
    machine into thinking it just connected to the remote host, and
    then removing the firewall rules before starting up the migrated
    process if you were really really trying to avoid writing any
    custom kernel stuff... similar tricks for setting up the incoming
    connections too.

[5] Which I believe you have to have a helper process do via ptrace on
    the target, since implementing the equivalent of this in inline
    assembler on every platform would be a lot more work than writing
    a portable abstraction using ptrace:

    movl   regs+0, %eax
    movl   regs+4, %ebx
    movl   regs+8, %ecx
    fldl   regs+32
    fldl   regs+40
    movaps regs+96, %xmm0
    jmp    *regs+whatever

[6] And perhaps have a buddy process wipe the part of VM space
containing the helper program at this point, just so no one accuses us
of not being paranoid enough.

[7] Need to investigate how ptrace works on systems with first-class
threads instead of process-like threads, though, because I'm not
entirely sure ptracing one thread doesn't screw up the other threads'
context on some architectures that have special concurrency support
in the hardware...
