CLIPS Text Corpus Normalization Toolkit Documentation

Release 2.6

Brigitte BIGI (FR,EN), Viet-Bac Le (VN), Sopheap Seng (KH)

May 20, 2009


Summary

1. Corpus transformations

2. Utilities

3. Example of use

1. Corpus Transformations

Input to text

DESCRIPTION

This is a set of programs which convert an input document to text in a specific format. The set includes: Html2Text.awk, ASCII2Text.awk, ESTER2Text.awk, LeMonde2Text.awk (1987-2002) and LeMondeXML2Text.awk (2003-).

USAGE

ASCII2Text.awk [options]
Where options include:

-o filename write the output to this file (default: stdout)
-s do NOT use the very simple algorithm to split sentences
-h print this help

Text2XML.awk

DESCRIPTION

This program transforms a specifically formatted input text into an XML document.

USAGE

Text2XML.awk [options]
Where options include:

-o filename write the output to this file (default: stdout)
-u convert characters to utf-8
-e enc specify output encoding (default is utf-8)
-l lang give a specific language: FR|CN (default: none)
-h print this help

SplitOrig2Wd.awk

DESCRIPTION

This program adds the <wd> tag to an XML document by splitting the original sentence at spaces.
For Chinese, the program systematically cuts after each Chinese character (using a Python program to do so).
The SplitOrig2WdKH.awk program does the same specifically for the Khmer language.

USAGE

SplitOrig2Wd.awk [options]
Where options include:

-e id give a specific id to elements (default: elt)
-l lang give a specific language (CN or default: none)
-n print <s> null </s> if the sentence is empty
-o filename write the output to this file (default: stdout)
-h print this help
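
The splitting rule described above can be sketched in a few lines of Python. This is only an illustration, not the toolkit's awk code: the function names, the character range used to detect Chinese characters, and the bare <wd> markup without attributes are all assumptions.

```python
from xml.sax.saxutils import escape

def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block (assumed range)."""
    return "\u4e00" <= ch <= "\u9fff"

def split_to_wd(sentence, lang=None):
    """Split at spaces; when lang == "CN", also cut after each Chinese character."""
    tokens = []
    for chunk in sentence.split():
        if lang != "CN":
            tokens.append(chunk)
            continue
        current = ""
        for ch in chunk:
            if is_cjk(ch):
                # Flush any pending non-Chinese run, then emit the character alone.
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    # Hypothetical output format: one <wd> element per token.
    return "".join("<wd>%s</wd>" % escape(t) for t in tokens)

print(split_to_wd("le chat dort"))
```

With lang="CN", a run of Latin letters inside a Chinese sentence stays together as one token, while every Chinese character becomes its own <wd> element.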

Replace.awk

DESCRIPTION

This program replaces each string that matches an entry in the 1st column of a file with the corresponding string in the 2nd column.

USAGE

Replace.awk [options]
Where options include:

-V filename a two-column file (to_be_replaced replacement)
-l lang give a specific language FR|CN|EN (default: none)
-o filename write the output to this file (default: stdout)
-h print this help
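
To illustrate the two-column replacement table, here is a minimal Python sketch. It assumes whitespace-separated columns and whole-token matching; the real Replace.awk may match differently, and the function names are hypothetical.

```python
def load_table(lines):
    """Parse 'to_be_replaced replacement' pairs, one per line; skip malformed lines."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            table[parts[0]] = parts[1]
    return table

def replace_tokens(tokens, table):
    """Replace every token found in the first column by its second-column string."""
    return [table.get(tok, tok) for tok in tokens]

table = load_table(["Mr. Monsieur", "km kilomètre"])
print(replace_tokens(["Mr.", "Dupont", "10", "km"], table))
```

Tokens that do not appear in the first column pass through unchanged.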

Tokenize.awk

DESCRIPTION

This program is a lexical tokenizer, used to tokenize by words.

USAGE

Tokenize.awk [options]
Where options include:

-l lang give a specific language: EN or any other value
-V filename read vocabulary from this file
-o filename write the output to this file (default: stdout)
-h print this help

Stick.awk

DESCRIPTION

Sticks (joins) consecutive words of a corpus together. This program can be used to introduce phrases.

USAGE

Stick.awk [options]
Where options include:

-V filename read vocabulary from this file
-a char char used to join strings in a stick (default is null)
-s number max stick size (default: 5)
-e id stick on these id elts (default: elt)
-t id do not modify the elts; create a new series of elts with this id (default: to)
-o filename write the output to this file (default: stdout)
-h print this help
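
One way to picture the sticking step is a greedy longest-match over a phrase vocabulary. The Python sketch below is an assumption about the behaviour, not a transcription of Stick.awk: the vocabulary is taken to be a set of space-separated phrases, and the function name and matching strategy are hypothetical. Only the join character and the maximum stick size (default 5) come from the option list above.

```python
def stick(tokens, phrases, join_char="", max_size=5):
    """Greedily join the longest run of tokens (at most max_size) that is a known phrase."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to runs of two tokens.
        for size in range(min(max_size, len(tokens) - i), 1, -1):
            if " ".join(tokens[i:i + size]) in phrases:
                out.append(join_char.join(tokens[i:i + size]))
                i += size
                break
        else:
            # No phrase starts here: keep the token as it is.
            out.append(tokens[i])
            i += 1
    return out

phrases = {"pomme de terre", "de terre"}
print(stick("une pomme de terre cuite".split(), phrases, join_char="_"))
```

Lowering max_size changes which phrases can be formed: with max_size=2, only "de terre" would be joined in the example.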

Lower.awk

DESCRIPTION

This program converts all words of a corpus to lower case.

USAGE

Lower.awk [options]
Where options include:

-e id make change only on elts with this id
-V filename file containing upper/lower case character pairs
-o filename write the output to this file (default: stdout)
-h print this help

Num2LetterXX.awk

DESCRIPTION

This is a set of programs which convert numbers into their textual form. The set includes: Num2LetterFR.awk (French), Num2LetterVN.awk (Vietnamese), Num2LetterKH (Khmer), Num2LetterCN.awk (Chinese), Num2LetterEN (English) and Num2LetterSP.awk (Spanish).

USAGE

Num2LetterXX.awk [options]
Where options include:

-noxml input is one word a line (instead of XML)
-o filename write the output to this file (default: stdout)
-h print this help
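
As an illustration of what "textual form" means, the following Python sketch spells out English integers from 0 to 999999. It is not the Num2LetterEN code; the function names, the supported range and the spelling conventions (hyphenated tens, no "and") are assumptions made for the example.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def below_thousand(n):
    """Spell out an integer in the range 0..999."""
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else ""))
    elif n > 0 or not words:
        # Covers 1..19, and a bare 0 when there was no hundreds part.
        words.append(ONES[n])
    return " ".join(words)

def number_to_words(n):
    """Spell out an integer in the range 0..999999."""
    if not 0 <= n < 1000000:
        raise ValueError("this sketch only covers 0..999999")
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        return below_thousand(rest)
    words = below_thousand(thousands) + " thousand"
    if rest:
        words += " " + below_thousand(rest)
    return words

print(number_to_words(1987))
```

Each language needs its own rules (for example French "quatre-vingt-dix"), which is presumably why the toolkit ships one program per language.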

Remove.awk

DESCRIPTION

Removes words from a corpus. The list of words to remove is placed in a sorted file. The list can be associated with one of the tags: word, lem or pos. This program can be used, for example, to remove punctuation, adjectives, verbs, etc.

USAGE

Remove.awk [options]
Where options include:

-V filename read vocabulary to remove from this file
-e remove elements
-w remove words (default)
-l remove lemmas
-p remove POS tags
-o filename write the output to this file (default: stdout)
-h print this help

PosTagger.awk

DESCRIPTION

ONLY FOR FRENCH OR ENGLISH.
Adds POS tags to words, using LIA_Tagg (an external GPL program). The tagger must be downloaded separately.

USAGE

PosTagger.awk [options]
Where options include:

-l add POS tags AND lemmas
-o filename write the output to this file (default: stdout)
-h print this help

Giza2XML.awk

DESCRIPTION

File format converter.
This program converts a bilingual aligned corpus to our XML format.


2. Utilities


XML2Sent.awk

DESCRIPTION

Selects sentences from an XML corpus for SLM (statistical language model) training. The output file contains one sentence per line, delimited by <s> and </s> tags.

USAGE

XML2Sent.awk [options]
Where options include:

-o filename write the output to this file (default: stdout)
-V filename read vocabulary from this file
-u value threshold for UNK (a value > 0 and < 1)
-M value minimum number of elements for a sentence
-l filename log file for rejected sentences
-n filename print <s> null </s> if the sentence is empty
-h print this help
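
The selection criteria suggested by the -u and -M options can be sketched in Python as below. The interpretation is an assumption: the UNK threshold is read as the maximum allowed fraction of out-of-vocabulary words in a sentence, and the function and parameter names are hypothetical.

```python
def select_sentences(sentences, vocab, max_unk_ratio=0.2, min_len=1):
    """Return (kept, rejected): kept lines are wrapped in <s> ... </s>."""
    kept, rejected = [], []
    for words in sentences:
        # Reject empty sentences and sentences shorter than the minimum length.
        if not words or len(words) < min_len:
            rejected.append(words)
            continue
        unknown = sum(1 for w in words if w not in vocab)
        if unknown / len(words) > max_unk_ratio:
            rejected.append(words)
        else:
            kept.append("<s> " + " ".join(words) + " </s>")
    return kept, rejected

vocab = {"le", "chat", "dort"}
kept, rejected = select_sentences([["le", "chat", "dort"], ["le", "xyzzy"]], vocab)
print(kept)
```

The rejected list plays the role of the log file given with -l: it records which sentences were filtered out.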

XML2WordCounts.awk

DESCRIPTION

XML2WordCounts estimates the word frequencies of an XML corpus. The output file is a two-column list with elements ranked by frequency.

USAGE

XML2WordCounts.awk [options]
Where options include:

-h print this help
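
Counting and ranking words is straightforward to sketch in Python. The column order (word, count) and the alphabetical tie-break are assumptions, and the input is taken to be already-extracted word lists rather than the toolkit's XML.

```python
from collections import Counter

def word_counts(sentences):
    """Count words over all sentences; rank by decreasing count, then alphabetically."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for word, count in word_counts([["le", "chat"], ["le", "chien"]]):
    print(word, count)
```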

Counts2Vocab.awk

DESCRIPTION

Counts2Vocab creates a vocabulary (list of words) from a count file.

USAGE

Counts2Vocab.awk [options]
Where options include:

-t filename read unigram counts from this file
-o filename write sorted vocabulary to this file
-K value desired vocabulary size
-c value desired minimum number of occurrences for a word
-h print this help
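
The two pruning options, vocabulary size (-K) and minimum count (-c), can be combined as in this Python sketch. How Counts2Vocab.awk actually breaks ties and orders its output is not stated here, so the alphabetical tie-break and the alphabetically sorted result are assumptions; the function name is hypothetical.

```python
def counts_to_vocab(counts, size=None, min_count=1):
    """Keep words seen at least min_count times, then the `size` most frequent ones."""
    kept = [(w, c) for w, c in counts.items() if c >= min_count]
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    if size is not None:
        kept = kept[:size]
    return sorted(w for w, _ in kept)

counts = {"le": 5, "chat": 2, "dort": 1, "xyzzy": 1}
print(counts_to_vocab(counts, size=2))
```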


Counts2Zipf.awk

DESCRIPTION

This program estimates the Zipf law values from a count file. The output is a four-column file with (word, count, index, zipf).

USAGE

cat file.counts | Counts2Zipf.awk [options ...] [-h]
Where options include:

-o filename write result to this file [default: stdout]
-s value the column number by which to sort the output (2, 3 or 4)
-h print this help
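
Zipf's law states that the frequency of a word is roughly inversely proportional to its rank. The Python sketch below builds a (word, count, index, zipf) table under one possible reading of the fourth column: the count predicted for each rank as top_count / rank. This definition is an assumption, since the exact formula used by Counts2Zipf.awk is not given here, and the function name is hypothetical.

```python
def zipf_table(counts):
    """Return rows (word, count, rank, predicted) with predicted = top_count / rank."""
    ranked = sorted(counts.items(), key=lambda wc: (-wc[1], wc[0]))
    if not ranked:
        return []
    top_count = ranked[0][1]
    return [(word, count, rank, top_count / rank)
            for rank, (word, count) in enumerate(ranked, start=1)]

for row in zipf_table({"le": 6, "chat": 3, "dort": 2}):
    print(*row)
```

Comparing the second and fourth columns shows how closely a corpus follows the law: in the toy example they coincide exactly.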

Html2XML-FR.csh

USAGE

Html2XML-FR.csh

[HTML|URL] input file or web URL
[w] input is a URL, not a file
[g] graphical mode
-h print help

WordsTopN.csh

USAGE

WordsTopN.csh

N value (number of words)

Textometrie.csh

USAGE

Textometrie.csh - ONLY FOR FRENCH. Requires zenity and gnuplot to be installed.


3. Example of use

Here is a screenshot of a French HTML file downloaded from a French news web site.


To normalize this file, try the following:

cd examples
./sample-fr.csh francais.html

which converts a file downloaded from a French news web site.
The output XML file is francais.xml (see the next screenshot).


Now, try the following:

cd ../scripts/
./Textometrie.csh

and select the francais.xml file in the file selection dialog box.

These two windows will appear: