Tokenizer.hs

This implements the tokenizer. It returns all tokens, without throwing any text away. That way, the unprocessed token stream can be used to re-create the input.

The processing uses the enumerator library to convert a stream of input text into a stream of Token instances.

The tokenize and tokenizeFile functions hide this machinery by accumulating the output tokens into a list and returning that.

module Text.Bakers12.Tokenizer
    ( Token(..)
    , TokenType(..)
    , tokenize
    , tokenizeFile
    , tokenizeFileStream
    , tokenizeStream
    , tokenizeE
    ) where

import           Control.Exception (SomeException)
import           Control.Monad (liftM)
import           Control.Monad.Trans (lift)
import qualified Data.Char as C
import qualified Data.Enumerator as E
import qualified Data.Enumerator.Binary as EB
import qualified Data.Enumerator.List as EL
import qualified Data.Enumerator.Text as ET
import qualified Data.Text as T
import qualified Data.Text.Lazy as LT

The Token and TokenType data types themselves live in Text.Bakers12.Tokenizer.Types.

import           Text.Bakers12.Tokenizer.Types

This reads text from an instance of Data.Text.Text and returns a list of Token instances.

Either SomeException is a standard way to express a computation that may return a value (in this case, a list of tokens) or may fail with an error.

tokenize :: FilePath -> T.Text -> Either SomeException [Token]
tokenize source input =
    E.runLists [[input]] (tokenizeStream source 0 E.=$ EL.consume)
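
For example, a caller might unpack the Either like this. This is a minimal sketch: the "<input>" source label is made up, and it assumes TokenType has a Show instance.

demoTokenize :: IO ()
demoTokenize =
    case tokenize "<input>" (T.pack "Hello, world!") of
        Left err     -> putStrLn ("tokenizer error: " ++ show err)
        Right tokens -> print (map tokenType tokens)  -- assumes Show TokenType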

This reads the input from a file and returns a list of Token instances.

Because this has to read from the file system, it has to execute in the context of the IO monad. Otherwise, this is the same as tokenize.

tokenizeFile :: FilePath -> IO (Either SomeException [Token])
tokenizeFile inputFile =
    E.run (tokenizeFileStream inputFile E.$$ EL.consume)
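
A quick usage sketch (countFileTokens is a made-up name; point it at any file you have):

countFileTokens :: FilePath -> IO ()
countFileTokens path = do
    result <- tokenizeFile path
    case result of
        Left err     -> putStrLn ("tokenizer error: " ++ show err)
        Right tokens -> putStrLn (show (length tokens) ++ " tokens")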

This creates an Enumerator that reads from a file and produces Token instances.

Enumerators produce data for an enumerator processing pipeline.

This assumes that the input files are encoded as UTF-8.

tokenizeFileStream :: FilePath -> E.Enumerator Token IO b
tokenizeFileStream inputFile =
    EB.enumFile inputFile E.$=
    ET.decode ET.utf8 E.$=
    tokenizeStream inputFile 0
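
Nothing forces you to collect the tokens into a list, though; any Iteratee can consume the stream. For instance, this sketch (sumTokenLengths is a made-up name) folds over the tokens to total their lengths without ever building a list:

sumTokenLengths :: FilePath -> IO (Either SomeException Integer)
sumTokenLengths path =
    E.run (tokenizeFileStream path E.$$ EL.fold step 0)
  where
    step total token = total + fromIntegral (tokenLength token)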

This is an Enumeratee that consumes a stream of FilePath values and produces a stream of Token instances.

Enumeratees are links in the pipeline between an Enumerator and an Iteratee. They take input coming in and filter or transform it and pass it on. Because they're in the middle, they have to pay more attention to the enumerator API. I'll take some time to explain how that works as we encounter it.
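
To get a feel for the shape before diving in: the simplest enumeratees can be built with EL.map, which transforms each element as it passes through. For example, this hypothetical one turns a token stream into a stream of lengths:

tokenLengths :: Monad m => E.Enumeratee Token Int m b
tokenLengths = EL.map tokenLength

It would slot into a pipeline as, say, tokenizeFileStream path E.$= tokenLengths. The enumeratee below does the same kind of job, but by hand, so it has to handle the step values itself.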

tokenizeE :: E.Enumeratee FilePath Token IO b

First, one of the step values that an enumeratee can receive is a continuation (Continue). This means that the downstream iteratee is ready to accept another chunk of data.

tokenizeE cont@(E.Continue _) = do

This returns the first item waiting on the input stream. Continue means that there is more data to come, but we may not have it right now (say, when reading from the network). So EL.head may return data, or it may not. In Haskell, that possibility is expressed with the Maybe type.

    maybeFP <- EL.head
    case maybeFP of

If there is data right now, it's a file path. We just pass it to tokenizeFileStream, which feeds that file's tokens into the continuation, and then loop.

        Just filePath -> tokenizeFileStream filePath cont E.>>== tokenizeE

If the stream of file paths is exhausted, just hand the step back unchanged.

        Nothing       -> return cont

Other step values (Yield, Error) mean other things — the downstream iteratee has finished, or the stream ended with an error — but we don't need to do anything with them, so we pass them through untouched.

tokenizeE step = return step

(After explaining this, there's an obvious way to cut down on the boilerplate for this, at least for this application. It won't happen today, however.)
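
Driving tokenizeE looks like this: enumerate the file paths, pipe them through the enumeratee, and collect the tokens. A sketch, with tokenizeFiles as a made-up name:

tokenizeFiles :: [FilePath] -> IO (Either SomeException [Token])
tokenizeFiles files =
    E.run (E.enumList 1 files E.$$ (tokenizeE E.=$ EL.consume))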

This is an Enumeratee that takes a stream of Text chunks and transforms it into a stream of Token instances.

The FilePath and Integer arguments are the tokenizer's current state: the source name and the offset into it. That state should probably be wrapped in a State monad stacked into the pipeline, but for the moment, threading it by hand is simpler.

tokenizeStream :: Monad m =>
                  FilePath -> Integer -> E.Enumeratee T.Text Token m b
tokenizeStream source offset cont@(E.Continue k) = do

This looks at the first character from the input ...

    maybeC <- ET.head
    case maybeC of
        Just c  -> do

... and dispatches to tokenize' based on the first character. It passes the token into the output stream and loops to take care of the next token.

            token <- tokenize' source offset c
            next  <- lift $ E.runIteratee $ k $ E.Chunks [token]
            tokenizeStream source (offset + fromIntegral (tokenLength token)) next
        Nothing -> return cont
tokenizeStream _ _ step = return step

This takes the first character of a token and, based on it, dispatches to the code that tokenizes the rest of the token.

tokenize' :: Monad m => FilePath -> Integer -> Char -> E.Iteratee T.Text m Token
tokenize' source offset c

Tokens that begin with an alphabetic, numeric, or separator character consume the following span of characters of the same class.

    | C.isAlpha c       = tokenFromTaken source offset AlphaToken c C.isAlpha
    | C.isNumber c      = tokenFromTaken source offset NumberToken c C.isNumber
    | isSeparator c     = tokenFromTaken source offset SeparatorToken c isSeparator

Punctuation, symbols, marks, and what-have-yous are single-character tokens.

    | C.isPunctuation c = return . makeToken source offset PunctuationToken $ T.singleton c
    | C.isSymbol c      = return . makeToken source offset SymbolToken $ T.singleton c
    | C.isMark c        = return . makeToken source offset MarkToken $ T.singleton c
    | otherwise         = return . makeToken source offset UnknownToken $ T.singleton c
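
To make the dispatch concrete, here is a small sketch of what these rules should produce for one input. The "<demo>" source label is made up, and the expected value is read off the rules above rather than tested:

demoTypes :: Either SomeException [TokenType]
demoTypes = fmap (map tokenType) (tokenize "<demo>" (T.pack "page 42!"))
-- expected: Right [AlphaToken, SeparatorToken, NumberToken, PunctuationToken]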

This is an augmented separator predicate that also tests for spaces.

isSeparator :: Char -> Bool
isSeparator c = C.isSpace c || C.isSeparator c
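
The augmentation matters: a tab, for instance, is white space but is not in a Unicode separator category, so with C.isSeparator alone it would fall through to UnknownToken. A small check (separatorCheck is a made-up name):

separatorCheck :: Bool
separatorCheck = and
    [ isSeparator ' '             -- a plain space is a separator
    , isSeparator '\t'            -- tab satisfies C.isSpace but not C.isSeparator
    , not (C.isSeparator '\t')    -- which is why the extra isSpace test is needed
    ]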

This runs takeWhile with the predicate, conses the initial element onto the front, and creates a Token of the given type. Hence, Token from takeWhile-en. Or something.

tokenFromTaken :: Monad m   
               => FilePath                      -- ^ tokenSource
               -> Integer                       -- ^ tokenOffset
               -> TokenType                     -- ^ tokenType
               -> Char                          -- ^ Initial character.
               -> (Char -> Bool)                -- ^ Predicate for taking the rest of the token.
               -> E.Iteratee T.Text m Token
tokenFromTaken source offset tType initial predicate =

The composed function here attaches the initial character, converts the token text from lazy to strict, and creates the Token from that.

    liftM (makeToken source offset tType . LT.toStrict . LT.cons initial)
          (ET.takeWhile predicate)
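
As a concrete trace, using the same runLists trick as tokenize (the "<demo>" label is made up): with initial character 'h' and "ello world" waiting on the stream, ET.takeWhile C.isAlpha consumes "ello", LT.cons puts the 'h' back on the front, and makeToken sees "hello".

demoTaken :: Either SomeException Token
demoTaken =
    E.runLists [[T.pack "ello world"]]
               (tokenFromTaken "<demo>" 0 AlphaToken 'h' C.isAlpha)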

This takes the minimum data necessary to create a token, normalizes the input text, and creates the Token.

makeToken :: FilePath -> Integer -> TokenType -> T.Text -> Token
makeToken source offset tType raw =
    Token normalized raw rawLength tType source offset
    where
        normalized = normalizeToken raw
        rawLength  = T.length raw
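
Reading the constructor application above positionally, for example (the "ex.txt" name is made up):

exampleToken :: Token
exampleToken = makeToken "ex.txt" 0 AlphaToken (T.pack "Hello")
-- builds: Token (T.pack "hello") (T.pack "Hello") 5 AlphaToken "ex.txt" 0
--         i.e.  normalized       raw              length type   source   offset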

This takes a raw token Text and returns a normalized version. Currently, this just lower-cases everything.

normalizeToken :: T.Text -> T.Text
normalizeToken = T.map C.toLower
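
For example (a trivial check; normalizeExample is a made-up name):

normalizeExample :: Bool
normalizeExample =
    normalizeToken (T.pack "Hello, World!") == T.pack "hello, world!"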