Re: Stripping HTML and/or MIME-encoded messages?

From: Howard Schwartz (theo@ncal.verio.com)
Date: Mon, 20 Dec 1999 10:22:15 -0800 (PST)

Ramsay recently wrote:
Howdy fellow yarn-sters, I've half-heartedly been following the
thread and am pretty amazed at the lack of suggestions, and aware-
ness of the capabilities of yarn & souper. I would have to
disagree there is a need for Chin to release the source code just
for you to adequately filter out unwanted e-mail and newsgroup
posts.

He then suggested using a combination of kill, filter, and/or scoring
to delete unwanted files.

I do not think the problem is that simple, Ramsay (p.s., I know regular
expressions extremely well). It was not clear from the original
message that the poster wanted to delete every message in which any
part contained HTML markup. Some of us suggested ways to delete or
hide only the part of the message with HTML code in it. yarn does not
provide this function, apart from the ability to send MIME-encoded
messages to a program that handles them. The problem is that the
various combinations of plain and formatted text are not always sent
as MIME-encoded text with appropriate headers. One has to be a bit
creative to automate the recognition of certain kinds of formatting
in certain parts of messages.

Once we determined that the poster wanted to junk all messages that
had any HTML code, even though parts of such a message might be
plain text, the task was still not all that trivial. Again, not
all HTML text comes as part of a MIME message. For instance, some
mail clients send out every message as plain text and then again as
HTML text, with no MIME headers.
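
To make the distinction above concrete, here is a rough Python sketch
(not anything yarn or souper provides) that sorts a raw message into
three cases: HTML declared through MIME headers, HTML-looking text
with no such declaration, and plain text. The function name
`classify` and the short list of tag names in `HTML_HINT` are
assumptions made up for illustration; a real filter would need a
longer list and more careful handling of encoded bodies.

```python
import email
import re

# Hypothetical heuristic: a few tag names that rarely occur in plain-text mail.
HTML_HINT = re.compile(r"<(html|body|font|table|br|div|p)\b[^>]*>", re.IGNORECASE)


def classify(raw_message):
    """Return 'mime-html', 'bare-html', or 'plain' for a raw message string."""
    msg = email.message_from_string(raw_message)

    # Case 1: some part is explicitly declared as text/html in its headers.
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            return "mime-html"

    # Case 2: no part declares HTML, but a body still contains tag-like text.
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload()
        if isinstance(payload, str) and HTML_HINT.search(payload):
            return "bare-html"

    # Case 3: nothing that looks like HTML was found.
    return "plain"
```

The second case is the troublesome one: the headers say nothing, so
the only evidence is the body itself, and any body test is a guess.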

Also, a lot of HTML text lacks the supposedly standard tags that a
filter or scoring program can search for. On the other hand, if you
search for any possible HTML tag with a regex like:

    <[[:alpha:]]+>

you may delete messages that happen to have a pair of angle brackets
in them but no real HTML code. For instance, several authors like to
use short strings in angle brackets as quote characters when replying
to mail sent to Howard Schwartz.

I fear that, until the developers of client and server mail software
agree on a more consistent set of standards, we will all have trouble
filtering mail without human intervention so as to remove what we
don't want without also removing some of what we want.
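
To illustrate the false-positive problem with the catch-all tag regex
described above, here is a small Python sketch. Python's `re` module
does not support POSIX classes such as `[:alpha:]`, so `[A-Za-z]`
stands in for it. The quote prefix `<HS>` in the usage example, the
function names, and the `KNOWN_TAGS` list are all hypothetical,
chosen only for illustration.

```python
import re

# Catch-all pattern: anything that looks like <letters>.
NAIVE_TAG = re.compile(r"<[A-Za-z]+>")

# Hypothetical, deliberately short list of real HTML tag names.
KNOWN_TAGS = {"html", "head", "body", "p", "br", "b", "i",
              "font", "table", "tr", "td", "div", "a"}

# Captures the name of an opening or closing tag, with optional attributes.
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def naive_has_html(text):
    """True if any <letters> sequence occurs anywhere in the text."""
    return NAIVE_TAG.search(text) is not None


def stricter_has_html(text):
    """True only if a tag-like sequence uses a name from KNOWN_TAGS."""
    return any(m.group(1).lower() in KNOWN_TAGS
               for m in TAG_NAME.finditer(text))
```

Given a quoted line such as `<HS> I think the filter works fine.`,
`naive_has_html` reports HTML while `stricter_has_html` does not; both
flag `<font size=2>Hello</font><br>`. The stricter test still
misfires whenever someone's bracketed initials happen to spell a tag
name, which is exactly why some human judgment remains necessary.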