[linux-elitists] Unpacking MS Entourage database files (or proprietary data formats for fun and profit)

Karsten M. Self karsten@linuxmafia.com
Fri Oct 3 17:22:53 PDT 2008


Some time ago I found myself, for reasons known to some but
provocatively deferred for disclosure at a later date, with a strong
hankering to access the contents of a Microsoft Entourage database file.
Microsoft Entourage is, as you might suspect, a Microsoft product for OS
X, roughly analogous to Microsoft Outlook, serving as a personal
information manager, with email, contacts, calendar, to-do lists, and
related functions.  It maintains all related data in a single binary
blob database for which no specification or public documentation is
available.

This wasn't just any database, but one which had been recovered from a
failed hard drive and was apparently somewhat internally corrupted.
First attempts at recovery, through Entourage's own database recovery
procedure, not surprisingly, failed.

Googling for recovery tools showed little in the way of anything useful,
though several sources suggested using binary editors to view content.
Doing so showed that there were large identifiable text blocks within
the database, including content that was clearly RFC 2822 email content.

Running 'strings' against the database produced output in which the
email and other content was more clearly evident, along with a number of
frequently occurring four-byte codes, similar to the following:

     1     5886 DELE
     2     5381 MLRC
     3     3070 popM
     4     2560 MLID
     5     2529 mesg
     6     2020 MSrc
     7     1022 Name
     8      648 bDlt
     9      648 bStd
    10      648 StNm
    11      648 StRl
    12      648 StTo
    13      648 TZID
    14      600 ngMs
    15      594 CLRC
    16      512 DLFr
    17      512 DLNm
    18      512 DLRl
    19      512 DLTo
    20      511 CFtz

These seemed to generally indicate record types, with "DELE" apparently
denoting deleted records, "mesg" outbound mail, "MSrc" inbound mail,
"Attc" attachments, etc.
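
A tally of this kind can be produced with standard tools.  What follows
is a minimal sketch, not necessarily the command used for the listing
above.  It assumes the 'strings' output has already been saved to a
file, and it treats any line of exactly four ASCII letters as a
candidate code, so ordinary four-letter words will show up too.  The
function name count_tags is made up for illustration.

```shell
# count_tags FILE
# Tally candidate four-byte record codes in saved 'strings' output.
# Assumption: a code is a line consisting of exactly four ASCII letters.
# Prints rank, count and code, most frequent first.
count_tags() {
    grep -E '^[A-Za-z]{4}$' "$1" |
        sort | uniq -c | sort -k1,1nr -k2,2 |
        awk '{ printf "%6d %8d %s\n", NR, $1, $2 }'
}
```

Usage would be something like "count_tags entourage.strings | head -20",
where entourage.strings is a placeholder for wherever the 'strings'
output was saved.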

I found that a relatively simple awk script can split this output into
separate files, one per record.  It is posted below, hosted at the URL
that follows, and available under the GPL:

    http://linuxmafia.com/~karsten/Download/parse-entourage-database.awk

Instructions for use:

  - Run 'strings' over your Entourage database, redirecting output to
    a file.  If you're working with international / UTF-8 data you may
    want to play with the '-e' encoding option of 'strings'.
  - Run parse-entourage-database.awk over the output file above.  One
    output file is created per record, sequentially numbered and
    identified by record type (MSrc, mesg, Attc, etc.).  You may have to
    do some manual cleanup, particularly to remove extraneous linefeeds
    and some garbage data.
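
As a rough sketch of the two steps, plus a quick way to see what came
out: the file names Database and entourage.strings, the scratch
directory, and the helper tally_records are all placeholders chosen for
illustration.  The script writes its output files into the current
directory, so a scratch directory keeps them out of the way.

```shell
# The two steps above, with placeholder file names (not run here):
#   mkdir recovered && cd recovered
#   strings ../Database > entourage.strings
#   gawk -f ../parse-entourage-database.awk entourage.strings
#
# tally_records DIR
# Count the files the script wrote in DIR, grouped by the record-type
# prefix of names such as MSrc-12, mesg-7, Attc-40 or Junk-3.
tally_records() {
    for f in "$1"/*-[0-9]*; do
        [ -f "$f" ] || continue        # glob matched nothing
        name=${f##*/}                  # strip the directory part
        printf '%s\n' "${name%-*}"     # strip the "-N" suffix
    done | sort | uniq -c
}
```

Running "tally_records ." in the scratch directory then gives one line
per record type with the number of files of that type.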

This is not elegant code.  It's hackish, brutish, and short.  It also
works, and is better than anything else I've been able to turn up
online.  I consider it a useful starting point.  There are some pretty
plainly evident internal structures in the file which should be further
parseable, in particular date and time stamps associated with various
entries which would be nice to get.  That's for a continued effort.
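
In the meantime, the recovered MSrc files are RFC 2822 text, so their
own Date: and Subject: headers can at least be listed for triage.  This
is only a sketch and does not touch the internal timestamp structures
mentioned above.  The helper name list_headers is made up, and the
allowance for one stray leading character mirrors what the script's
patterns do.

```shell
# list_headers DIR
# For each MSrc-N file in DIR, print the file name and the first Date:
# and Subject: header values found in it, separated by tabs.
list_headers() {
    for f in "$1"/MSrc-*; do
        [ -f "$f" ] || continue
        awk -v name="${f##*/}" '
            # Allow one stray leading character before the header name.
            date == "" && /^.?Date: /    { date = $0; sub(/^.?Date: /, "", date) }
            subj == "" && /^.?Subject: / { subj = $0; sub(/^.?Subject: /, "", subj) }
            END { printf "%s\t%s\t%s\n", name, date, subj }
        ' "$f"
    done
}
```

Files with no recognisable headers still get a line, with empty fields,
which makes the damaged records easy to spot.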

In my case, I was able to recover several thousand inbound and outbound
messages, hundreds of email contacts, and hundreds of file attachments.
Enough to prove intent to commit immigration visa fraud and secure an
annulment of marriage.


Code follows:

------------------------------------------------------------------------
#!/usr/bin/gawk -f
# Karsten M. Self
# Tue Mar 18 21:28:18 PDT 2008
# 
#
#   Extract data from a Microsoft Entourage database
#   Copyright (C) 2008 Karsten M. Self
#
#   This program is free software; you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation; either version 2 of the License, or
#   (at your option) any later version.
#
#   This program is distributed in the hope that it will be useful,
#   but WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#   GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#   Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
#
# Basically:  we've got an Entourage database with messages, contacts,
# and a ton of other crap in there.
# 
# Messages seem to begin with either "MSrc" or "mesg".
# Former is a properly formatted RFC2822 mail message.
# Latter is sort of an abstract.
# Attachments may start with "Attc"
#
# Processing:
#  - Read through the file.  If:
#     - ^MSrc: 
#        - Close off any existing message.
#        - Increment file counter
#        - Open new file
#        - Keep writing until you find another marker.
#     - ^mesg$
#        - As above, but realize it's a fragment.
#     - ^Attc
#        - As above, but it's an attachment fragment, with MIME
#        boundaries.
# 
# Just to further muck stuff up, there's often a stray character at
# start of line.
#

# ----------------------------------------------------------------------
function closeout(is_a_message, outfile) {
    # Flush and close the current output file, if one is being written.
    if (is_a_message != 0) {
        fflush(outfile)
        close(outfile)
        printf("Closed outfile: %s\n", outfile) > "/dev/stdout"
    }
    return is_a_message
}


# ----------------------------------------------------------------------


BEGIN {
    filenum = 0
    outfile = "Junk-0"
    print "" > outfile    # so that later close doesn't barf.
    is_a_message = 0
}

# Ways to end a message.  What follows is garbage, so divert it to a
# Junk-N file instead of the message file:
/^.?(DELE|FLRC|MLID|MLRC|CTCL|TSCL|dbHd|Mdia|popM)$/ {
    filenum++
    closeout(is_a_message, outfile); 
    outfile = "Junk-" filenum
    is_a_message = 2
    printf( "Outfile: %s  ... ", outfile ) >"/dev/stdout"; fflush("/dev/stdout")
}

# RFC 2822 format message, we hope.
/^.?MSrc$/ {
    filenum++
    rc = closeout(is_a_message, outfile)
    outfile = "MSrc-" filenum
    is_a_message = 1
    printf( "Outfile: %s  ... ", outfile ) >"/dev/stdout"; fflush("/dev/stdout")
}

# Message fragment
/^.?mesg$/ {
    filenum++
    rc = closeout(is_a_message, outfile)
    outfile = "mesg-" filenum
    is_a_message = 1
    printf( "Outfile: %s  ... ", outfile ) >"/dev/stdout"; fflush("/dev/stdout")
}

# Attachments
/^.?Attc$/ {
    filenum++
    rc = closeout(is_a_message, outfile)
    outfile = "Attc-" filenum
    is_a_message = 1
    printf( "Outfile: %s  ... ", outfile ) >"/dev/stdout"; fflush("/dev/stdout")
}

# Calendar entries
/^.?bStd$/ {
    filenum++
    closeout(is_a_message, outfile)
    outfile = "Clndr-" filenum
    is_a_message = 1
    printf( "Outfile: %s  ... ", outfile ) >"/dev/stdout"; fflush("/dev/stdout")
}

# TODO entries
/^.?TSRC$/ {
    filenum++
    closeout(is_a_message, outfile)
    outfile = "ToDo-" filenum
    is_a_message = 1
    printf( "Outfile: %s  ... ", outfile ) >"/dev/stdout"; fflush("/dev/stdout")
}


# For anytime we're in a message...
{
    if( is_a_message != 0 ) print > outfile
}

END {
    closeout(is_a_message, outfile)
    printf( "\nEnd of the road\n" ) >"/dev/stdout"
    fflush("/dev/stdout")
}
------------------------------------------------------------------------
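
Before splitting a large file it can help to see where the script
above would start new records.  The sketch below is a dry run under
the same assumption the script makes, namely that a marker sits alone
on a line, optionally preceded by one stray character.  The helper
name classify_markers is made up, and only the record-start markers
are listed, not the junk ones.

```shell
# classify_markers FILE
# Print "LINE-NUMBER TYPE" for every line of saved 'strings' output
# that would start a new record, ignoring one stray leading character.
classify_markers() {
    awk '
        /^.?(MSrc|mesg|Attc|bStd|TSRC)$/ {
            tag = substr($0, length($0) - 3)   # last four characters
            print NR, tag
        }
    ' "$1"
}
```

A marker that appears inside a longer line, such as a subject line that
happens to contain "MSrc", is not reported, which matches how the
script anchors its patterns at both ends of the line.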


Peace.

-- 
Karsten M. Self <karsten@linuxmafia.com>        http://linuxmafia.com/~karsten
    Ceterum censeo, Caldera delenda est.