The PostConf GUI makes it easy to train SpamAssassin's Bayes DB with
front-ends to postcat, postsuper, and sa-learn. The trouble is that it is
just as easy to corrupt a Bayesian DB and actually cause the server to let
more spam through. How can you best take advantage of this powerful
anti-spam tool?
Here are a few hard-learned Bayes tips and tricks.
* Never feed an entire folder to sa-learn. Though this contradicts
common opinion, Bayes pattern recognition actually works much better in
practice with fewer messages of higher quality.
* Manually review the entire header AND body of every message submitted
to sa-learn. This is the ONLY way to avoid a poisoned Bayes DB.
Note that seemingly harmless messages can carry Bayes-poisoning text,
often hidden in HTML formatting and often many pages into the body. If a
spam email seems too long, truncate it and sa-learn only the first
80 to 100 lines.
* Don't feed attachments to a Bayes DB. Even with good decoders,
SpamAssassin is too easily tricked into learning Bayes poison carried in
base64, image, PDF, and other encoded content.
* If an email isn't obviously spam (or obviously non-spam), don't feed
it to sa-learn. Just as giving a puppy mixed messages doesn't help it
become a well-behaved dog, Bayesian pattern recognition doesn't work
well with ambiguous input.
* We don't recommend allowing end-users to train either their own or
a site-wide Bayes database. Even email administrators who know a lot
about filtering spam often have difficulty training Bayes DBs. End
users with no training, no time, and no discipline can only poison a
Bayes DB, and in so doing lower the effectiveness of other spam
filters.
* Disable AWL (the auto-whitelist, use_auto_whitelist 0) or set
auto_whitelist_factor to 0.1 or 0.2, and disable auto-learning entirely
(bayes_auto_learn 0), at least until these promising technologies become
more mature.
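The truncation and attachment tips above can be sketched in Python. This is only an illustration, not part of PostConf or SpamAssassin: the function name trim_for_training and the constant MAX_BODY_LINES are hypothetical, and the sketch assumes the message uses a charset Python can decode. It keeps the original headers, drops every non-text part and anything marked as an attachment, and keeps only the first 100 lines of body text. HTML parts are kept as text, so the result still needs the manual review described above before it goes to sa-learn.

```python
from email import message_from_string, policy

MAX_BODY_LINES = 100  # hypothetical default, taken from the "80 to 100 lines" tip


def trim_for_training(raw: str, max_lines: int = MAX_BODY_LINES) -> str:
    """Return the message with attachments dropped and the body truncated."""
    msg = message_from_string(raw, policy=policy.default)

    text_lines = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        # Skip images, PDFs, and other non-text parts, plus explicit attachments.
        if part.get_content_maintype() != "text" or part.is_attachment():
            continue
        text_lines.extend(part.get_content().splitlines())

    # Remove the old payload and Content-* headers, keep all other headers,
    # then store the truncated text as a single text/plain body.
    msg.clear_content()
    msg.set_content("\n".join(text_lines[:max_lines]))
    return msg.as_string()
```

The returned string can be saved to a file, read through by hand, and only then passed to sa-learn as spam or ham.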
Post-Queue Actions:
* Never bounce an email that was received and tagged as spam. Such
bounces are known as backscatter and are themselves spam, even if the
message body is truncated. Pattern recognition takes time and can rarely
be completed before a message in the process of delivery must be
accepted or rejected, so you must accept and DISCARD messages that pass
RBLs and simple content checks but subsequently fail Bayes and other
content analyses.
Administrative:
Since it is always possible to accidentally poison a Bayesian
database, systems administrators should keep daily backups of all
Bayes files maintained by SpamAssassin. If (or rather when)
message headers start to indicate poison (BAYES_40 and below
in messages that are clearly spam), roll back to an earlier DB.
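The poison check described above can be sketched in Python. This is a minimal illustration under stated assumptions: it assumes the usual SpamAssassin X-Spam-Status layout with a tests= list containing a BAYES_nn rule name, and the names bayes_bucket, suggests_poison, and POISON_THRESHOLD are hypothetical. Feed it only headers from messages you have already confirmed by hand to be spam.

```python
import re
from typing import Optional

# Matches rule names such as BAYES_00, BAYES_40, BAYES_99, or BAYES_999.
BAYES_RULE = re.compile(r"\bBAYES_(\d{2,3})\b")

POISON_THRESHOLD = 40  # "BAYES_40 and below" on clearly spam mail


def bayes_bucket(spam_status: str) -> Optional[int]:
    """Return the number from the BAYES_nn rule in the header, or None."""
    match = BAYES_RULE.search(spam_status)
    return int(match.group(1)) if match else None


def suggests_poison(spam_status_of_known_spam: str) -> bool:
    """True when a message known to be spam got a low Bayes bucket."""
    bucket = bayes_bucket(spam_status_of_known_spam)
    return bucket is not None and bucket <= POISON_THRESHOLD
```

For example, suggests_poison("No, score=1.2 required=5.0 tests=BAYES_20,HTML_MESSAGE autolearn=no") returns True. A single hit proves little; a run of hits on obvious spam is the signal to restore a backup. One way to take and restore those backups is sa-learn's own --backup and --restore options.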
References:
* How to beat an Adaptive Spam Filter
* Wikipedia on Bayesian Filtering