[linux-elitists] Cluster filesystems
Thu Jan 1 20:50:16 PST 2004
I did my annual hardware inventory and found out that I'm pushing a
couple terabytes of space across all my boxes. Some of the paths to
those disks aren't all that fast, but the space is there.
The last time I tried InterMezzo, GFS, Coda, OCFS, etc., I found them
lacking in stability and documentation, and recovery from the
simultaneous-update problem was really spotty and labor-intensive.
A few questions:
- Are there any cluster filesystems for our favorite OSes that don't
suck?
- Is there a UNIX style solution to the simultaneous update problem?
All the current non-sucky solutions I've seen involve either
preventing writes during disconnected operation or changing the
file abstraction in some way, whether by involving the
filesystem in counting lines in the file or doing a VMS-style
record scheme. Both strike me as kind of counter to the
traditional way of doing things in UNIX...
- Which project is getting all the eyeballs these days?
- People keep giving me funny looks when I talk about using multiple
logins to a 1394 device as a shared storage mechanism. Is there
something inherently unreliable about firewire, or is this just
marketing perceptions coloring the public view of the technology?
- iSCSI for local transport?
- Automatic distributed backup software? It seems a bit silly to me
that I can't buy anything less than a bazillion MIPS and a disk
with a capacity only expressible in scientific notation (of which I
use maybe 1% in between bouts of Actual Work), yet it can't
automatically spend its spare time imaging hard drives from other
boxen in my LDAP administrative domain.
- Or even better yet, why hasn't anyone implemented process
migration? As far as I can tell (but haven't tried (yet)), you
just need to do this for the process, its threads, and any
dependent processes (like the X server if you're migrating
something that's doodling on the X display):
1) Have a common filesystem (including devnodes)
2) Read /proc/pid/maps to get a memory map
3) ptrace(PT_READ_something, pid, someaddr, 0);
for all the memory sections (and get the registers too)
4) Figure out file descriptors, SysV IPC usage, sockets, etc. and
write them down somewhere
5) Kill everything on the source host
6) Move all the paperwork over to the target host over the network
and start a dummy program in unused VM space that reallocates
all the resources, writes the process sections into its VM
space, spawns threads, copies all the registers over for each
thread, and then jumps to the PC value retrieved after the
SIGSTOP for each thread.
The idea being that in a well maintained network, controlled
shutdowns do not have to impact the use of applications hosted on
the machine being shut down (much).
Oh nuts, you'd have to intercept all the hardware I/O on the source
to reconstruct the state of hardware devices if you're migrating
something that talks directly to hardware. Oh wait, that would be
bad because the device might not like being reinitialized. Hmm.
- Jason
Last known location: 3.2 miles northwest of Union City, CA
"I am ready to meet my Maker. Whether my Maker is prepared for the
great ordeal of meeting me is another matter."
-- Winston Churchill
 Think of a conflict in CVS, but with an actual file on a
 Which is really slow, but could be implemented using shared memory
instead via a new syscall or something.
 As far as I can tell, files, locks, environment variables, mmaped
stuff, SysV IPC stuff, fifos, UNIX domain sockets, keyboard LED
states, terminal states, shared libraries, threads, signal masks,
video card registers and memory (in the X server case), what the
process had for lunch, etc can be handled using existing syscalls;
only in-flight TCP or other stream oriented connections would need
special kernel help and IP spoofing, no? 
 Although I suppose you could tell the source machine to shoot the
process in the head with prejudice, set up temporary firewall
rules to prevent the RSTs from making it out, use more firewall
rules, a raw socket, and a socket/connect to trick the target
machine into thinking it just connected to the remote host, and
then remove the firewall rules before starting up the migrated
process if you were really really trying to avoid writing any
custom kernel stuff... similar tricks for setting up the incoming
 Which I believe you have to have a helper process do via ptrace on
the target, since implementing the equivalent of this in inline
assembler on every platform would be a lot more work than writing
a portable abstraction using ptrace:
movl regs+0, %eax
movl regs+4, %ebx
movl regs+8, %ecx
fldl regs+32
fldl regs+40
movaps regs+96, %xmm0
jmp *regs+whatever
 And perhaps have a buddy process wipe the part of VM space
containing the helper program at this point, just so no one accuses us
of not being paranoid enough.
 Need to investigate how ptrace works on systems with first-class
threads instead of process-like threads, though, because I'm not
entirely sure ptracing threads doesn't screw up other threads'
context on some architectures that have special concurrency support
in the hardware...