Filter and tokenization module

This module filters a document in its source format and converts it into a normalized, segmented and tokenized XML representation.

The XML representation can be used to regenerate the original source format, optionally with markup added in the native format.

Tokenization rules are described in Tokenizer.l
Segmentation rules are minimal and described in Segmentation.cc
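
As an illustration of what a minimal segmentation rule can look like, here is a small sketch. It is NOT the content of Segmentation.cc: the function name split_segments and the rule itself (break after '.', '!' or '?' when followed by whitespace or end of input) are assumptions made only for this example.

```cpp
#include <cassert>   // for assert() in callers and quick checks
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not taken from Segmentation.cc): splits text into
// segments after '.', '!' or '?' when the terminator is followed by
// whitespace or by the end of the input.
std::vector<std::string> split_segments(const std::string& text) {
    std::vector<std::string> segments;
    std::string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        current += c;
        const bool terminator = (c == '.' || c == '!' || c == '?');
        const bool at_end = (i + 1 == text.size());
        const bool before_space =
            !at_end && std::isspace(static_cast<unsigned char>(text[i + 1]));
        if (terminator && (at_end || before_space)) {
            segments.push_back(current);
            current.clear();
            // Skip the whitespace that separates two segments.
            while (i + 1 < text.size() &&
                   std::isspace(static_cast<unsigned char>(text[i + 1]))) {
                ++i;
            }
        }
    }
    // Keep trailing text that has no terminator as a final segment.
    if (!current.empty()) segments.push_back(current);
    return segments;
}
```

With this rule a period inside a number such as 3.14 does not end a segment, because it is not followed by whitespace. Abbreviations ("e.g. this") would still be split, which is the kind of case rule-based formats such as SRX are meant to handle.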

to test this module:
make tokenization
./tokenization file

dependency: libxml2 library

status:
HTML and txt filters are integrated

known issues:
no charset detection is integrated yet - both txt and HTML input must be UTF-8 encoded
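
Until charset detection exists, a caller may want to reject input that is not UTF-8 before handing it to the filters. The sketch below is one possible check and is not part of this module; the function name is_valid_utf8 is an assumption made for the example. It only validates, it does not detect or convert other charsets.

```cpp
#include <cassert>   // for assert() in callers and quick checks
#include <cstddef>
#include <string>

// Hypothetical helper (not part of this module): returns true when `bytes`
// is well-formed UTF-8. Rejects truncated sequences, overlong encodings,
// UTF-16 surrogates and code points above U+10FFFF.
bool is_valid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;     // continuation bytes expected after lead
        unsigned long cp = 0;      // decoded code point
        unsigned long min_cp = 0;  // smallest code point valid for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (extra > n - i - 1) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

A file encoded in Latin-1 that contains accented characters typically fails this check, so it can serve as an early warning before running ./tokenization on it.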

todo:
introduce charset detection and conversion
use SRX as segmentation mechanism
extend to other filters...
document and comment code

