CLIPS Text Corpus Normalization Toolkit Documentation

Release 2.6

Brigitte BIGI (FR,EN), Viet-Bac Le (VN), Sopheap Seng (KH)

May 20, 2009

Summary

1. Corpus Transformations

Input to text

DESCRIPTION

This is a set of programs that convert an input document into a specifically formatted text. It includes Html2Text.awk, ASCII2Text.awk, ESTER2Text.awk, LeMonde2Text.awk (1987-2002) and LeMondeXML2Text.awk (2003-).

USAGE

ASCII2Text.awk [options]
Where options include:

-o filename write the output to this file (default: stdout)

-s do NOT use the very simple sentence-splitting algorithm

-h print this help

Text2XML.awk

DESCRIPTION

This program transforms a specifically formatted input text into an XML document.

USAGE

Text2XML.awk [options]
Where options include:

-o filename write the output to this file (default: stdout)

-u convert characters to utf-8

-e enc specify output encoding (default is utf-8)

-l lang give a specific language: FR|CN (default: none)

-h print this help

SplitOrig2Wd.awk

DESCRIPTION

This program adds the <wd> tag to an XML document by cutting the original sentence at spaces.
For Chinese, it systematically cuts after each Chinese character (using a Python program to do so).
The SplitOrig2WdKH.awk program does the same specifically for the Khmer language.

USAGE

SplitOrig2Wd.awk [options]
Where options include:

-e id give a specific id to elements (default: elt)

-l lang give a specific language (CN or default: none)

-n print <s> null </s> if the sentence is empty

-o filename write the output to this file (default: stdout)

-h print this help
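
The splitting rule described above can be sketched in Python as follows. This is only an illustration, not the toolkit's awk code: the function names `is_cjk` and `split_to_wd`, the `<wd>...</wd>` serialization, and the use of the basic CJK Unified Ideographs range to recognize Chinese characters are all assumptions.

```python
def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block (assumed test)."""
    return "\u4e00" <= ch <= "\u9fff"


def split_to_wd(sentence, lang=None):
    """Return the sentence as a list of <wd> elements (hypothetical serialization)."""
    tokens = []
    for chunk in sentence.split():      # cut at whitespace
        if lang != "CN":
            tokens.append(chunk)
            continue
        buf = ""                        # run of non-Chinese characters
        for ch in chunk:
            if is_cjk(ch):
                if buf:
                    tokens.append(buf)
                    buf = ""
                tokens.append(ch)       # each Chinese character stands alone
            else:
                buf += ch
        if buf:
            tokens.append(buf)
    # No XML escaping is done here; a real converter would need it.
    return ["<wd>%s</wd>" % t for t in tokens]
```

With `lang="CN"`, runs of non-Chinese characters (digits, Latin words) are kept together, which is a design choice of this sketch rather than documented behavior.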

Replace.awk

DESCRIPTION

This program replaces each string that matches an entry in the first column of a file with the corresponding string in the second column.

USAGE

Replace.awk [options]
Where options include:

-V filename a two-column file (to_be_replaced replacement)

-l lang give a specific language FR|CN|EN (default: none)

-o filename write the output to this file (default: stdout)

-h print this help
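
A minimal Python sketch of the two-column replacement idea is given here for illustration. It assumes whitespace-separated columns and whole-token matching; neither detail is confirmed by this documentation, and the names `load_replacements` and `replace_tokens` are hypothetical.

```python
def load_replacements(lines):
    """Parse 'to_be_replaced replacement' lines into a dict (assumed file layout)."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:            # silently skip malformed or empty lines
            table[fields[0]] = fields[1]
    return table


def replace_tokens(tokens, table):
    """Replace every token found in the first column by its second-column string."""
    return [table.get(tok, tok) for tok in tokens]


table = load_replacements(["km kilomètres", "% pourcent"])
print(replace_tokens(["10", "km"], table))
```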

Tokenize.awk

DESCRIPTION

This program is a lexical tokenizer, used to tokenize by words.

USAGE

Tokenize.awk [options]
Where options include:

-l lang language: EN, or any other value

-V filename read vocabulary from this file

-o filename write the output to this file (default: stdout)

-h print this help

Stick.awk

DESCRIPTION

Sticks together words of a corpus. This program can be used to introduce phrases.

USAGE

Stick.awk [options]
Where options include:

-V filename read vocabulary from this file

-a char char used to join strings in a stick (default is null)

-s number max stick size (default: 5)

-e id stick the elements with this id (default: elt)

-t id do not modify the elements; create a new series of elements with this id instead (default: to)

-o filename write the output to this file (default: stdout)

-h print this help
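
One plausible way to stick words into phrases is a greedy, left-to-right longest match against the vocabulary, sketched below in Python. The matching strategy, the storage of phrases as space-separated words, and the function name `stick` are assumptions made for illustration. Only the default maximum size (5) and the empty default join character come from the option list above.

```python
def stick(tokens, vocab, join_char="", max_size=5):
    """Join runs of tokens that form a vocabulary phrase (greedy longest match)."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two words.
        for n in range(min(max_size, len(tokens) - i), 1, -1):
            if " ".join(tokens[i:i + n]) in vocab:
                out.append(join_char.join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])       # no phrase starts here: keep the word
            i += 1
    return out


vocab = {"new york", "new york city"}
print(stick("i love new york city".split(), vocab, join_char="_"))
```

Lowering `max_size` to 2 in this example would stick only "new york" and leave "city" as a separate word.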

Lower.awk

DESCRIPTION

This program converts all words of a corpus to lower case.

USAGE

Lower.awk [options]
Where options include:

-e id make changes only on elements with this id

-V filename file containing upper/lower character pairs

-o filename write the output to this file (default: stdout)

-h print this help
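
Lower-casing driven by a table of character pairs can be sketched as below. The file layout (one "upper lower" pair per line, separated by whitespace) and the names `load_pairs` and `lower_word` are assumptions; the toolkit's actual file format is not described here.

```python
def load_pairs(lines):
    """Read 'UPPER lower' character pairs into a dict (assumed file layout)."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            mapping[fields[0]] = fields[1]
    return mapping


def lower_word(word, mapping):
    """Map each character through the table; unlisted characters are unchanged."""
    return "".join(mapping.get(ch, ch) for ch in word)


pairs = load_pairs(["É é", "P p", "A a", "I i", "S s"])
print(lower_word("ÉPAIS", pairs))
```

An explicit table makes the conversion independent of the locale, which matters for languages whose case rules the runtime may not know.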

Num2LetterXX.awk

DESCRIPTION

This is a set of programs that convert numbers into their textual form. It includes Num2LetterFR.awk (French), Num2LetterVN.awk (Vietnamese), Num2LetterKH (Khmer), Num2LetterCN.awk (Chinese), Num2LetterEN (English) and Num2LetterSP.awk (Spanish).

USAGE

Num2LetterXX.awk [options]
Where options include:

-noxml input is one word per line (instead of XML)

-o filename write the output to this file (default: stdout)

-h print this help

Remove.awk

DESCRIPTION

Removes words from a corpus. The list of words to remove is placed in a sorted file. It can be associated with one of the tags word, lem or pos. This program can be used, for example, to remove punctuation, adjectives, verbs...

USAGE

Remove.awk [options]
Where options include:

-V filename read vocabulary to remove from this file

-e remove elements

-w remove words (default)

-l remove lemmas

-p remove POS tags

-o filename write the output to this file (default: stdout)

-h print this help

PosTagger.awk

DESCRIPTION

For French or English only.
Adds POS tags to words, using LIA_Tagg (an external GPL program), which must be downloaded separately.

USAGE

PosTagger.awk [options]
Where options include:

-l add POS tags and lemmas

-o filename write the output to this file (default: stdout)

-h print this help

Giza2XML.awk

DESCRIPTION

File format converter.
This program converts a bilingual aligned corpus to our XML format.

2. Utilities

XML2Sent.awk

DESCRIPTION

Selects sentences from an XML corpus for SLM (statistical language model) training. The output file contains one sentence per line, delimited by <s> and </s> tags.

USAGE

XML2Sent.awk [options]
Where options include:

-o filename write the output to this file (default: stdout)

-V filename read vocabulary from this file

-u value threshold for UNK (unknown words), with 0 < value < 1

-M value minimum number of elements for a sentence

-l filename log file for rejected sentences

-n filename print <s> null </s> if the sentence is empty

-h print this help
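
The selection criteria suggested by the -u and -M options can be sketched as follows. This assumes that the UNK threshold is compared with the proportion of out-of-vocabulary words in the sentence and that a sentence is rejected when the proportion exceeds it; both points, as well as the names `keep_sentence` and `format_sentence`, are assumptions rather than documented behavior.

```python
def keep_sentence(words, vocab, unk_threshold, min_len):
    """Accept a sentence if it is long enough and has few enough unknown words."""
    if not words or len(words) < min_len:
        return False
    unknown = sum(1 for w in words if w not in vocab)
    return unknown / len(words) <= unk_threshold


def format_sentence(words):
    """One sentence per line, delimited by <s> and </s>."""
    return "<s> " + " ".join(words) + " </s>"


vocab = {"the", "cat", "sleeps"}
for sent in (["the", "cat", "sleeps"], ["the", "dog", "barks"]):
    if keep_sentence(sent, vocab, unk_threshold=0.2, min_len=2):
        print(format_sentence(sent))
```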

XML2WordCounts.awk

DESCRIPTION

XML2WordCounts estimates the word frequencies of an XML corpus. The output is a two-column list with elements ranked by frequency.

USAGE

XML2WordCounts.awk [options]
Where options include:

-h print this help
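
The counting step can be illustrated with a short Python sketch. XML parsing is left out: the sentences are assumed to be already available as lists of words. The function name `word_counts` and the alphabetical tie-breaking are choices made for this example only.

```python
from collections import Counter


def word_counts(sentences):
    """Return (word, count) pairs ranked by decreasing count."""
    counts = Counter()
    for words in sentences:
        counts.update(words)
    # Ties are broken alphabetically so that the output is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


for word, count in word_counts([["a", "b", "a"], ["b", "a", "c"]]):
    print(word, count)
```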

Counts2Vocab.awk

DESCRIPTION

Counts2Vocab creates a vocabulary (list of words) from a count file.

USAGE

Counts2Vocab.awk [options]
Where options include:

-t filename read unigram counts from this file

-o filename write sorted vocabulary to this file

-K value desired vocabulary size

-c value desired minimum number of occurrences for a word

-h print this help
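
A vocabulary can be derived from (word, count) pairs along the lines of the -K and -c options, as sketched below. How the real program combines the two options and how it sorts the result are not stated here; this sketch applies the minimum count first, then keeps the most frequent words, and returns them in alphabetical order. The name `counts_to_vocab` is hypothetical.

```python
def counts_to_vocab(counts, size=None, min_count=1):
    """Select words by minimum count and optional size; return them sorted."""
    kept = [(w, c) for w, c in counts if c >= min_count]
    kept.sort(key=lambda wc: (-wc[1], wc[0]))   # most frequent first
    if size is not None:
        kept = kept[:size]
    return sorted(w for w, _ in kept)


counts = [("a", 3), ("b", 2), ("c", 1), ("d", 1)]
print(counts_to_vocab(counts, size=2))
print(counts_to_vocab(counts, min_count=2))
```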

Counts2Zipf.awk

DESCRIPTION

This program estimates Zipf's law values from a count file. The output is a four-column file with (word, count, index, zipf).

USAGE

cat file.counts | Counts2Zipf.awk [options ...] [-h]
Where options include:

-o filename write result to this file [default: stdout]

-s value the column number to sort the output (2, 3 or 4)

-h print this help
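
Zipf's law states that the frequency of a word is roughly inversely proportional to its rank. The sketch below builds (word, count, index, zipf) rows under the assumption that the zipf column is the count predicted for each rank, taken as the count of the top-ranked word divided by the rank. The actual definition used by Counts2Zipf.awk is not given in this documentation, and the name `counts_to_zipf` is hypothetical.

```python
def counts_to_zipf(counts):
    """Return (word, count, rank, predicted_count) rows, most frequent first."""
    ranked = sorted(counts, key=lambda wc: (-wc[1], wc[0]))
    if not ranked:
        return []
    top = ranked[0][1]                  # count of the rank-1 word
    return [(w, c, rank, top / rank)
            for rank, (w, c) in enumerate(ranked, start=1)]


for row in counts_to_zipf([("b", 50), ("a", 100), ("c", 30)]):
    print(*row)
```

Comparing the second and fourth columns shows how closely the observed counts follow the ideal 1/rank curve.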

Html2XML-FR.csh

USAGE

Html2XML-FR.csh

[HTML|URL] input file or web URL

[w] input is a URL, not a file

[g] graphical mode

-h print help

Textometrie.csh

USAGE

Textometrie.csh

For French only. Requires zenity and gnuplot to be installed.

3. Example of use

This example converts an HTML file downloaded from a French news web site.
The output XML file is francais.xml.

