Please note that the following data is collected from various
sources and is not necessarily official or 100% correct. If you have any
information to add, send it along.
The official SOUP 1.2 specifications (soup12.zip) can be found here (external link).
From: cthuang@io.org (Chin Huang)
To: yarn-list@lists.Colorado.EDU
Subject: Re: format of yarn 0.90 newsbase
Date: Mon, 12 Feb 1996 01:09:07 -0500
Message-Id: <DmtHxc6r3DpO090yn@io.org>
News articles are stored in a file named news.dat, called the spool
file. The spool file is composed of variable-sized blocks. There are
two types of blocks. One block type stores an article. The other
block type marks unused storage.
Every block begins with this header
(long means an integer stored in 32-bit binary):

    long prev;  // offset of previous block in file
    long size;  // byte size of data area in block
    long used;  // bytes used in data area

If the block stores an article, then <used> contains the article size
and the article is stored following the <used> field.
If the block is unused, then <used> is 0 and these two fields immediately
follow the <used> field:

    long prevFree;  // offset of previous block in free block list
    long nextFree;  // offset of next block in free block list

The <prevFree> and <nextFree> fields are pointers which link the free
block into a doubly-linked list of free blocks.
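
As an illustration of the block layout described above, here is a minimal Python sketch that decodes one block from the contents of news.dat. The function name read_block and the returned dictionary are made up for this example. The post does not state the byte order of the 32-bit fields; little-endian is assumed here because Yarn runs on DOS and OS/2, so treat that as a guess.

```python
import struct

# Assumption: fields are little-endian signed 32-bit integers.
HEADER = struct.Struct("<3l")      # prev, size, used
FREE_LINKS = struct.Struct("<2l")  # prevFree, nextFree


def read_block(spool: bytes, offset: int) -> dict:
    """Decode the block that starts at `offset` in the spool file contents."""
    prev, size, used = HEADER.unpack_from(spool, offset)
    block = {"offset": offset, "prev": prev, "size": size, "used": used}
    data_start = offset + HEADER.size
    if used == 0:
        # Unused block: the two free-list links follow the <used> field.
        links = FREE_LINKS.unpack_from(spool, data_start)
        block["prevFree"], block["nextFree"] = links
    else:
        # Article block: the article follows the <used> field.
        block["article"] = spool[data_start:data_start + used]
    return block
```

An article block yields an "article" entry holding <used> bytes; an unused block yields "prevFree" and "nextFree" instead.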
A special free block at the beginning (offset 0) of the spool file
stores the head node in a doubly-linked list of free blocks:

    long prev;      // offset of last block in file
    long size;      // = 8
    long used;      // = 0
    long prevFree;  // offset of last block in free block list
    long nextFree;  // offset of first block in free block list

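
The head node makes it possible to enumerate the free blocks by following the nextFree links. The sketch below is only that, a sketch: free_block_offsets is a hypothetical helper, the fields are assumed to be little-endian, and the list is assumed to be circular (the last free block points back to offset 0). The post does not say how the list ends, so the loop also stops on any offset it has already visited.

```python
import struct


def free_block_offsets(spool: bytes) -> list:
    """Follow nextFree links starting from the head node at offset 0."""
    node = struct.Struct("<5l")  # prev, size, used, prevFree, nextFree
    offsets = []
    seen = {0}
    nxt = node.unpack_from(spool, 0)[4]
    while nxt not in seen:  # stops when the list returns to the head
        if not 0 <= nxt <= len(spool) - node.size:
            raise ValueError("free list link points outside the file")
        seen.add(nxt)
        _, _, used, _, following = node.unpack_from(spool, nxt)
        if used != 0:
            raise ValueError("free list points at a block that is in use")
        offsets.append(nxt)
        nxt = following
    return offsets
```

A spool file with no free blocks would, under these assumptions, have a head node whose nextFree is 0, and the function returns an empty list.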
When the expire program deletes an article, it marks the block used
by the article as free. It also merges the freed block with adjacent
free blocks. If the freed block is at the end of the spool file,
it truncates the spool file.
>If you're not too busy could you give me a quick list run-down on the
>X-status values and what they stand for? I'm suddenly curious.

A  Answered
D  Marked for deletion
N  New
O  Old and Unread
R  Read
U  Unread

Message-Id: <4/TL04uYOJoY089yn@stack.nl>
Date: Sat, 27 Sep 1997 19:18:48 +0200
From: galactus@stack.nl (Arnoud "Galactus" Engelfriet)
To: yarn-list@lists.colorado.edu
Subject: Re: File formats?

Perhaps you could also add documentation on the folder format?

The format for Yarn folders is quite similar to the SOUP "binary clean
mail" format, although with one small difference. In the SOUP format
mentioned above, each message is preceded by its length, as a four-byte
unsigned value, in big-endian order. This means that if the four bytes
you read are "B0 B1 B2 B3", then the length of the message is

    B0 * 256 * 256 * 256 + B1 * 256 * 256 + B2 * 256 + B3

Yarn uses the little-endian order, probably because that's what DOS and
OS/2 use. This way, Yarn can read the length with one read call. Similar
to the previous example, if you now read "B0 B1 B2 B3" from a Yarn mail
folder, the length is

    B3 * 256 * 256 * 256 + B2 * 256 * 256 + B1 * 256 + B0

Note: the messages themselves are plain text and line ends are
Unix-style LF characters (0x0A).
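
To make the length-prefix scheme concrete, here is a small Python sketch that walks a folder and yields each message. The name iter_messages and its byteorder parameter are invented for this example; "little" corresponds to the Yarn folder description and "big" to the SOUP binary mail format. It assumes messages are stored back to back with nothing in between, which the post implies but does not state outright.

```python
def iter_messages(data: bytes, byteorder: str = "little"):
    """Yield each message from a length-prefixed folder.

    byteorder="little" follows the Yarn folder description;
    byteorder="big" follows the SOUP binary mail description.
    """
    pos = 0
    while pos + 4 <= len(data):
        # Four-byte unsigned length in front of every message.
        length = int.from_bytes(data[pos:pos + 4], byteorder)
        pos += 4
        if pos + length > len(data):
            raise ValueError("folder is truncated inside a message")
        yield data[pos:pos + length]
        pos += length
```

For example, list(iter_messages(open("folder", "rb").read())) would return every message in a (hypothetically named) Yarn folder file as a bytes object.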
history.pag, as far as is known to me:
- divided into 2 KB blocks (0x0800 = 2048 bytes)
- the first two bytes of a block tell how much of the following data in
  that block is used
- the third byte is the length of the first Message-ID
- then comes the Message-ID itself
- the first byte after the Message-ID gives the length of the extra data
  that follows
- the extra data follows the Message-ID (always 12 bytes, it seems):
  - first 4 bytes: big-endian news.dat offset of the USED value of the
    block header
  - second 4 bytes: possibly the date imported (format uncertain,
    probably the number of minutes since some date)
  - last 4 bytes: a 'supersedes' date, or 0 if none
- the Message-IDs in each block appear to be sorted alphabetically
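
The notes above can be turned into a rough Python sketch for decoding one 2048-byte block. Several things here are guesses and not facts: the byte order of the two-byte count (exposed as the count_order parameter), and the idea that every later entry repeats the layout of the first one (length byte, Message-ID, extra-length byte, extra data). The function name parse_history_block is made up. The two date fields are returned as raw bytes because their format is not known.

```python
BLOCK_SIZE = 0x0800  # 2048 bytes per block, per the notes


def parse_history_block(block: bytes, count_order: str = "little") -> list:
    """Decode the entries of one history.pag block (layout partly guessed)."""
    used = int.from_bytes(block[0:2], count_order)  # byte order is a guess
    end = min(2 + used, len(block))
    pos = 2
    entries = []
    while pos < end:
        id_len = block[pos]
        if id_len == 0:
            break  # no further entries (defensive, not from the notes)
        msg_id = block[pos + 1:pos + 1 + id_len]
        pos += 1 + id_len
        extra_len = block[pos]  # 12 in every case seen so far
        extra = block[pos + 1:pos + 1 + extra_len]
        pos += 1 + extra_len
        entries.append({
            "message_id": msg_id.decode("latin-1"),
            # Big-endian offset of the <used> field in news.dat.
            "spool_offset": int.from_bytes(extra[0:4], "big"),
            "date_raw": extra[4:8],         # format uncertain
            "supersedes_raw": extra[8:12],  # all zero if none
        })
    return entries
```

Reading history.pag in BLOCK_SIZE chunks and passing each chunk to this function would give the Message-ID to spool offset mapping, if the guesses hold.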
Date: Sun, 22 Mar 1998 10:36:08 +1000
From: Ciaran Dunn

For those who are interested, here's what I know of the overview file
format.

The Overview Files

The overview file holds information for articles for display in the
art selection level. The structures are as follows.

Header of File (8 bytes)
  First four bytes is start Entry ID
  Second four bytes is end Entry ID
----------------------------------
Note: If the end entry is less than the start entry the file is empty.
----------------------------------
Each entry in the file then has the following format:
  First 4 bytes - Entry ID
  0x20 - Marks start of subject
  Subject
  0x0a - Marks end of subject
  0x20 - Marks start of mail address
  Mail address
  0x0a - Marks end of mail address
  4 bytes of C-style time_t timestamp (i.e. seconds since 01/01/1970)
  Message ID
  0x0a - Marks end of message ID
  Reference list (delimited by 0x20),
  i.e. a ref list with three references: Ref1 0x20 Ref2 0x20 Ref3
  0x0a - Marks end of ref list
  7 bytes:
    xx - ??
    xx - ??
    xx - ??
    xx - Length (MSB ???)
    xx - Length (LSB)
    xx - ??
    xx - ??
  These last two bytes seem to commonly be 00 01 or 00 00.

If anyone knows anything further about the last 7 bytes please get in
touch with me.
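
The entry layout above can be read with a short Python sketch like the following. It is an illustration only: parse_overview is a hypothetical name, the byte order of the 4-byte fields is not given in the post (so it is a parameter, with little-endian as an assumed default), and the two-byte length inside the 7-byte trailer is decoded MSB first only because the post tentatively labels it that way.

```python
def parse_overview(data: bytes, byteorder: str = "little") -> list:
    """Decode an overview file; byte order of 4-byte fields is assumed."""
    first_id = int.from_bytes(data[0:4], byteorder)
    last_id = int.from_bytes(data[4:8], byteorder)
    if last_id < first_id:
        return []  # empty file, per the note in the post
    pos = 8

    def text_field(has_start_marker: bool) -> str:
        """Read text up to the next 0x0a, optionally after a 0x20 marker."""
        nonlocal pos
        if has_start_marker:
            if data[pos] != 0x20:
                raise ValueError("expected 0x20 start marker")
            pos += 1
        end = data.index(b"\n", pos)
        text = data[pos:end].decode("latin-1")
        pos = end + 1
        return text

    entries = []
    while pos < len(data):
        entry_id = int.from_bytes(data[pos:pos + 4], byteorder)
        pos += 4
        subject = text_field(True)
        address = text_field(True)
        timestamp = int.from_bytes(data[pos:pos + 4], byteorder)
        pos += 4
        message_id = text_field(False)
        references = text_field(False).split()  # separated by 0x20
        trailer = data[pos:pos + 7]  # meaning mostly unknown
        pos += 7
        entries.append({
            "id": entry_id, "subject": subject, "address": address,
            "timestamp": timestamp, "message_id": message_id,
            "references": references, "trailer": trailer,
            # Bytes 4 and 5 of the trailer, MSB first (tentative).
            "length_guess": (trailer[3] << 8) | trailer[4],
        })
    return entries
```

The timestamp is read as a fixed 4 bytes so that a 0x0a byte inside it is not mistaken for a field terminator.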