RE: [hobbit] Paging -- HTML to Text
- To: <hobbit (at) hswn.dk>
 
- Subject: RE: [hobbit] Paging -- HTML to Text
 
- From: "Sean R. Clark" <sclark (at) nyroc.rr.com>
 
- Date: Tue, 23 Oct 2007 13:12:19 -0400
 
- References: <008f01c81592$7e10ae10$7d6a28a8 (at) hhsea.txnet.state.tx.us>
 
- Thread-index: AcgVknxttFMR1Yh9T2C+8e67lFCH0QABSrxg
 
I pipe my alerts to a Perl script.
 
Below is the portion that strips the HTML. The main message is in $body here:
 
 
# right below here i strip out html
 
$body =~ s{ <!          # comments begin with a `<!'
                        # followed by 0 or more comments;
 
    (.*?)               # this is actually to eat up comments in non 
                        # random places
 
     (                  # not supposed to have any white space here
 
                        # just a quick start; 
      --                # each comment starts with a `--'
        .*?             # and includes all text up to and including
      --                # the *next* occurrence of `--'
        \s*             # and may have trailing white space
                        #   (albeit not leading white space XXX)
     )+                 # repetire ad libitum  XXX should be * not +
    (.*?)               # trailing non comment text
   >                    # up to a `>'
}{
    if ($1 || $3) {     # this silliness for embedded comments in tags
        "<!$1 $3>";
    } 
}gesx;                 # mutate into nada, nothing, and niente
 
$body =~ s{ <                    # opening angle bracket
 
    (?:                 # Non-backreffing grouping paren
         [^>'"] *       # 0 or more things that are neither > nor ' nor "
            |           #    or else
         ".*?"          # a section between double quotes (stingy match)
            |           #    or else
         '.*?'          # a section between single quotes (stingy match)
    ) +                 # repetire ad libitum
                        #  hm.... are null tags <> legal? XXX
   >                    # closing angle bracket
}{}gsx;                 # mutate into nada, nothing, and niente
 
$body =~ s{ (
        &              # an entity starts with an ampersand
        ( 
            \x23\d+    # and is either a pound (#) and numbers
             |         #   or else
            \w+        # has alphanumunders up to a semi
        )         
        ;?             # a semi terminates AS DOES ANYTHING ELSE (XXX)
    )
} {
 
    $entity{$2}        # if it's a known entity use that
        ||             #   but otherwise
        $1             # leave what we'd found; NO WARNINGS (XXX)
 
}gex;                  # execute replacement -- that's code not a string
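 
The entity substitution above looks names up in %entity, which is not defined in this excerpt. Below is a minimal sketch of what such a table could look like, together with a cut-down version of the tag and entity substitutions so it runs on its own. The helper name strip_html, the short entity list, and the choice to map nbsp to a plain space are assumptions for illustration, not part of the original script, and the comment-handling substitution is left out for brevity.

```perl
use strict;
use warnings;

# Hypothetical entity table: the excerpt uses %entity without defining it.
# Only a handful of named entities are listed; extend as needed.
my %entity = (
    lt   => '<',
    gt   => '>',
    amp  => '&',
    quot => '"',
    nbsp => ' ',    # plain space instead of chr(160): an assumption for pager text
);

# Numeric entities such as &#38; are captured as "#38", so key them that way.
$entity{"#$_"} = chr($_) for 0 .. 255;

# strip_html is a hypothetical helper name, not from the original script.
sub strip_html {
    my ($body) = @_;

    # Remove tags, allowing '>' to appear inside quoted attribute values.
    $body =~ s{ < (?: [^>'"] | ".*?" | '.*?' )+ > }{}gsx;

    # Decode known entities and leave unknown ones exactly as found.
    $body =~ s{ ( & ( \x23\d+ | \w+ ) ;? ) }{
        exists $entity{$2} ? $entity{$2} : $1
    }gex;

    return $body;
}

print strip_html('<b>disk &gt; 90%</b> on <a href="x>y">web01</a> &amp; db01'), "\n";
# prints: disk > 90% on web01 & db01
```

Using exists rather than || keeps an entity whose replacement happens to be "0" from falling through to the raw text. To use this as a filter, the same sub could be applied to whatever the alert script receives as its message body.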
  _____  
From: James Wade [mailto:jkwade (at) futurefrontiers.com] 
Sent: Tuesday, October 23, 2007 12:34 PM
To: hobbit (at) hswn.dk
Subject: [hobbit] Paging -- HTML to Text
Can anyone point me to a good HTML to Text Converter?
 
I'm sending out pages and the incoming message is HTML,
but I want to send it out as text. I want to strip out all the
HTML embedded in it. I've been looking on the web, but I can't seem
to find anything that will let me pipe the message through a command.
 
Thanks.
James