SCANMAIL(8)                                           SCANMAIL(8)

     NAME
          scanmail, testscan - spam filters

     SYNOPSIS
          upas/scanmail [ options ] [ qer-args ] root mail sender
          system rcpt-list

          upas/testscan [ -avd ] [ -p patfile ] [ filename ]

     DESCRIPTION
          Scanmail accepts a mail message supplied on standard input,
          applies a file of patterns to a portion of it, and
          dispatches the message based on the results.  It is a
          drop-in replacement for the generic queuing command qer(8)
          that is executed from the rc(1) script /mail/lib/qmail in
          the mail processing pipeline.  Each pattern has an
          associated action; in order of decreasing priority, the
          actions are:

          dump      the message is deleted and a log entry is written
                    to /sys/log/smtpd

          hold      the message is placed in a queue for human inspec-
                    tion

          log       a line containing the matching portion of the
                    message is written to a log (the line action
                    described under Pattern Syntax)

          If no pattern matches or only patterns with an action of log
          match, the message is accepted and scanmail queues the mes-
          sage for delivery.  Scanmail meshes with the blocking facil-
          ities of smtpd(6) to provide several layers of filtering on
          gateway systems.  In all cases the sender is notified that
          the message has been successfully delivered, so the sender
          remains unaware that the message may have been delayed or
          deleted.

          Scanmail accepts the arguments of qer(8) as well as the fol-
          lowing:

          -c        Save a copy of each message in a randomly-named
                    file in directory /mail/copy.
          -d        Write debugging information to standard error.
          -h        Queue held messages by sending domain name.  The
                    -q option must specify a root directory; messages
                    are queued in subdirectories of this directory.
                    If the -h option is not specified, messages are
                    accumulated in a subdirectory of /mail/queue.hold
                    named for the contents of /dev/user, usually none.
          -n        Messages are never held for inspection, but are
                    delivered.  Also known as vacation mode.
          -p filename
                    Read the patterns from filename rather than
                    /mail/lib/patterns.
          -q holdroot
                    Queue deliverable messages in subdirectories of
                    holdroot. This option is the same as the -q option
                    of qer(8) and must be present if the -h option is
                    given.
          -s        Save deleted messages.  Messages are stored, one
                    per randomly-named file, in subdirectories of
                    /mail/queue.dump named with the date.
          -t        Test mode.  The pattern matcher is applied but the
                    message is discarded and the result is not logged.
          -v        Print the highest priority match.  This is useful
                    with the -t option for testing the pattern matcher
                    without actually sending a message.

          Testscan is the command line version of scanmail. If
          filename is missing, it applies the pattern set to the mes-
          sage on standard input.  Unlike scanmail, which finds the
          highest priority match, testscan prints all matches in the
          portion of the message under test.  It is useful for testing
          a pattern set or implementing a personal filter using the
          pipeto file in a user's mail directory.  Testscan accepts
          the following options:

          -a   Print matches in the complete input message.

          -d   Enable debug mode.

          -v   Print the message after conversion to canonical form
               (see Canonicalization below).

          -p filename
               Read the patterns from filename rather than
               /mail/lib/patterns.

        Canonicalization
          Before pattern matching, both programs convert a portion of
          the message header and the beginning of the message to a
          canonical form.  The amounts of the header and message body
          processed are set by compile-time parameters in the source
          files.  The canonicalization process converts letters to
          lower-case and replaces consecutive spaces, tabs and newline
          characters with a single space.  HTML commands are deleted
          except for the parameters following A HREF, IMG SRC, and IMG
          BORDER directives.  Additionally, the following MIME escape
          sequences are replaced by their ASCII equivalents:

                     Escape Seq   ASCII
                     ----------   -----
                          =2e       .
                          =2f       /
                          =20    <space>
                          =3d       =

          and the sequence =<newline> is elided.  Scanmail assembles
          the sender, destination domain and recipient fields of the
          command line into a string that is subjected to the same
          canonical processing.  Following canonicalization, the com-
          mand line and the two long strings containing the header and
          the message body are passed to the matching engine for anal-
          ysis.
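
          The text transformations described above can be sketched in
          Python.  This is an illustration only, not the scanmail
          source: the function name canonicalize and the order of the
          steps are assumptions, and the deletion of HTML commands is
          left out.

```python
import re

# MIME escape sequences listed in the table above (lower-cased).
MIME_ESCAPES = {"=2e": ".", "=2f": "/", "=20": " ", "=3d": "="}


def canonicalize(text: str) -> str:
    """Hypothetical sketch of the canonical form described in this section."""
    # Convert letters to lower case.
    text = text.lower()
    # Elide the sequence =<newline>.
    text = text.replace("=\n", "")
    # Replace the listed escapes in a single pass, so that the "=" produced
    # by "=3d" is not decoded a second time.
    text = re.sub(r"=(?:2e|2f|20|3d)", lambda m: MIME_ESCAPES[m.group(0)], text)
    # Replace consecutive spaces, tabs and newlines with a single space.
    return re.sub(r"[ \t\n]+", " ", text)
```

          For instance, canonicalize("click=\nhere") joins the two
          halves of the word into "clickhere".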

        Pattern Syntax
          The matching engine compiles the pattern set and matches it
          to each canonicalized input string.  Patterns are specified
          one per line as follows:

               {*}action: pattern-spec {~~override...~~override}

          On all lines, a # introduces a comment; there is no way to
          escape this character.

          Lines beginning with * contain a pattern-spec that is a
          string; otherwise, the pattern-spec is a regular expression
          in the style of regexp(6). Regular expression matching is
          many times less efficient than string matching, so it is
          wiser to enumerate several similar strings than to combine
          them into a regular expression.  The action is a keyword
          terminated by a : and separated from the pattern by optional
          white-space.  It must be one of the following:

          dump      if the pattern matches, the message is deleted.
                    If the -s command line option is set, the message
                    is saved.

          hold      if the pattern matches, the message is queued in a
                    subdirectory of /mail/queue.hold for manual
                    inspection.  After inspection, the queue can be
                    swept manually using runq (see qer(8)) to deliver
                    messages that were inadvertently matched.

          header    this is the same as the hold action, except the
                    pattern is only applied to the message header.
                    This optimization is useful for patterns that
                    match header fields that are unlikely to be pre-
                    sent in the body of the message.

          line      the sender and a section of the message around the
                    match are written to the file /sys/log/lines.  The
                    message is always delivered.

          loff      patterns of this type are applied only to the
                    canonicalized command line.  When a match occurs,
                    all patterns with line actions are disabled.  This
                    is useful for limiting the size of the log file by
                    excluding repetitive messages, such as those from
                    mailing lists.
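
          As an illustration of the line format described so far, the
          following Python sketch separates the optional *, the action
          keyword and the remainder of a pattern line.  The function
          name split_action is hypothetical, and the handling of
          malformed lines (raising an error) is an assumption; quoting
          and overrides are not interpreted here.

```python
# Action keywords listed above.
ACTIONS = {"dump", "hold", "header", "line", "loff"}


def split_action(line):
    """Return (is_string, action, rest), or None for a blank or comment line."""
    # A '#' introduces a comment anywhere on the line; it cannot be escaped.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    # A leading '*' marks a string pattern; otherwise it is a regular expression.
    is_string = line.startswith("*")
    if is_string:
        line = line[1:]
    # The action keyword is terminated by ':'.
    action, sep, rest = line.partition(":")
    action = action.strip()
    if not sep or action not in ACTIONS:
        raise ValueError(f"unknown or missing action: {action!r}")
    # Optional white-space separates the action from the pattern.
    return is_string, action, rest.strip()
```

          With this sketch, split_action("*hold: sex.com # note")
          yields a string pattern with action hold and remainder
          sex.com.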

          Patterns are accumulated into pattern sets sharing the same
          action.  The matching engine applies the dump pattern set
          first, then the header and hold pattern sets, and finally
          the line pattern set.  Each pattern set is applied three
          times: to the canonicalized command line, to the message
          header, and finally to the message body.  The ordering of
          patterns in the pattern file is insignificant.
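
          The order of application can be sketched as follows.  This
          is a simplified Python model, not the scanmail source: the
          function name scan and the dictionary layout are
          assumptions, only string patterns are modelled, overrides
          are ignored, and it assumes that line patterns are not
          evaluated once a message has been dumped or held.

```python
def scan(patterns, cmdline, header, body):
    """Return (disposition, logged) for three canonicalized inputs.

    patterns maps an action keyword to a list of plain strings.
    """
    def hits(action, inputs):
        # Every pattern of this action that occurs in any of the inputs.
        return [p for text in inputs
                for p in patterns.get(action, []) if p in text]

    everything = (cmdline, header, body)
    # The dump pattern set is applied first.
    if hits("dump", everything):
        return "dump", []
    # Then header patterns (message header only) and hold patterns.
    if hits("header", (header,)) or hits("hold", everything):
        return "hold", []
    # Finally line patterns, unless a loff pattern matched the command line.
    logged = []
    if not hits("loff", (cmdline,)):
        logged = hits("line", everything)
    # A message with only line matches, or none at all, is delivered.
    return "deliver", logged
```

          The sketch returns as soon as the highest priority action is
          known, which mirrors the decreasing priority of dump, hold
          and line.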

          The pattern-spec is a string of characters terminated by a
          newline, # or override indicator, ~~.  Trailing white-space
          is deleted but patterns containing leading or trailing
          white-space can be enclosed in double-quote characters.  A
          pattern containing a double-quote must be enclosed in
          double-quote characters, and each embedded double-quote
          must be preceded by a backslash.  For
          example, the pattern

               "this is not \"spam\""

          matches the string this is not "spam".  The pattern-spec is
          followed by zero or more override strings.  When the spe-
          cific pattern matches, each override is applied and if one
          matches, it cancels the effect of the pattern.  Overrides
          must be strings; regular expressions are not supported.
          Each override is introduced by the string ~~ and continues
          until a subsequent ~~, # or newline, white-space included.
          A ~~ immediately followed by a newline indicates a line con-
          tinuation and further overrides continue on the following
          line.  Leading white-space on the continuation line is
          ignored.  For example,

                  *hold:   sex.com~~essex.com~~sussex.com~~sysex.com~~
                           lasex.com~~cse.psu.edu!owner-9fans

          matches all input containing the string sex.com except for
          messages that also contain any of the strings in the
          override list.
          Often it is desirable to override a pattern based on the
          name of the sender or recipient.  For this reason, each
          override pattern is applied to the header and the command
          line as well as the section of the canonicalized input con-
          taining the matching data.  Thus a pattern matching the com-
          mand line or the header searches both the command line and
          the header for overrides while a match in the body searches
          the body, header and command line for overrides.
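
          The quoting, continuation and override rules can be sketched
          in Python.  The function names below are hypothetical and
          the code is a simplified model, not the scanmail source: it
          does not strip comments, and it assumes that a quoted
          pattern contains no ~~ sequence.

```python
def join_continuations(lines):
    """Join each line ending in ~~ with the line that follows it."""
    joined, pending = [], ""
    for line in lines:
        line = line.rstrip("\n")
        if pending:
            # Leading white-space on a continuation line is ignored.
            line = pending + line.lstrip()
        if line.endswith("~~"):
            pending = line
        else:
            joined.append(line)
            pending = ""
    if pending:
        joined.append(pending)
    return joined


def split_overrides(rest):
    """Split 'pattern~~ov1~~ov2' into (pattern, [overrides])."""
    pattern, *overrides = rest.split("~~")
    pattern = pattern.strip()
    # A quoted pattern loses its quotes; \" stands for a double-quote.
    if len(pattern) >= 2 and pattern[0] == '"' and pattern[-1] == '"':
        pattern = pattern[1:-1].replace('\\"', '"')
    # Overrides keep their white-space; empty ones are dropped.
    return pattern, [o for o in overrides if o]


def overridden(overrides, where, cmdline, header, body):
    """where names the input that matched: 'cmdline', 'header' or 'body'."""
    # The command line and header are always searched; the body is
    # searched only when the pattern matched in the body.
    scope = [cmdline, header]
    if where == "body":
        scope.append(body)
    return any(o in text for o in overrides for text in scope)
```

          Applied to the two-line example above, join_continuations
          produces a single line whose override list has five entries,
          and a match on sex.com in the body is cancelled when, for
          example, the header contains essex.com.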

          The structure of the pattern file and the matching algorithm
          define the strategy for detecting and filtering unwanted
          messages.  Ideally, a hold pattern selects a message for
          inspection; if the message is judged undesirable, a
          specific dump pattern is added to delete further instances
          of it.  Additionally, it is often useful to block the
          sender by updating the smtpd control file.

          In this regime, patterns with a dump action generally match
          phrases that are likely to be unique.  Patterns that hold a
          message for inspection match phrases commonly found in unde-
          sirable material and occasionally in legitimate messages.
          Patterns that log matches are less specific yet.  In all
          cases the ability to override a pattern by matching another
          string allows repetitive messages that trigger the pattern,
          such as mailing lists, to pass the filter after the first
          one is processed manually.  The -s option allows deleted
          messages to be salvaged by either manual or semi-automatic
          review, supporting the specification of more aggressive pat-
          terns.  Finally, the utility of the pattern matcher is not
          confined to filtering spam; it is a generally useful admin-
          istrative tool for deleting inadvertently harmful messages,
          for example, mail loops, stuck senders or viruses.  It is
          also useful for collecting or counting messages matching
          certain criteria.

     FILES
          /mail/lib/patterns  default pattern file
          /sys/log/smtpd      log of deleted messages
          /mail/log/lines     file where log matches are logged
          /mail/queue/*       directories where legitimate messages
                              are queued for delivery
          /mail/queue.hold    directory where held messages are queued
                              for inspection
          /mail/queue.dump/*  directory where dumped messages are
                              stored when the -s command line option
                              is specified.
          /mail/copy/*        directory where copies of all incoming
                              messages are stored.

     SOURCE
          /sys/src/cmd/upas/scanmail

     SEE ALSO
          mail(1), qer(8), smtpd(6)

     BUGS
          Testscan does not report a match when the body of a message
          contains exactly one line.