1. Corpus transformations
2. Utility programs
This is a set of programs that convert an input document to a specific text format. It includes: Html2Text.awk, ASCII2Text.awk, ESTER2Text.awk, LeMonde2Text.awk (1987-2002) and LeMondeXML2Text.awk (2003-).
ASCII2Text.awk [options]
Where options include:
-o filename write the output to this file (default: stdout)
-s do NOT use a very simplistic algorithm to split sentences
-h print this help
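The "very simplistic" sentence-splitting algorithm is not documented. As an illustration only (this is a hypothetical re-creation, not the actual ASCII2Text.awk logic), a minimal splitter could cut after sentence-final punctuation followed by whitespace:

```python
import re

def split_sentences(text):
    # Hypothetical sketch of a very simplistic splitter: cut after
    # '.', '!' or '?' when followed by whitespace. The real
    # ASCII2Text.awk algorithm may differ.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]
```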
This program transforms a specific input text into an XML document.
Text2XML.awk [options]
Where options include:
-o filename write the output to this file (default: stdout)
-u convert characters to UTF-8
-e enc specify output encoding (default: utf-8)
-l lang give a specific language: FR|CN (default: none)
-h print this help
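The exact XML schema produced by Text2XML.awk is not specified here. A minimal sketch of the idea, assuming one <s> element per sentence inside a <corpus> root (an assumption, not the documented output format):

```python
def text_to_xml(sentences):
    # Assumed layout: a <corpus> root with one <s> element per
    # sentence; the real Text2XML.awk output format may differ.
    lines = ['<corpus>']
    for sent in sentences:
        lines.append('<s> %s </s>' % sent)
    lines.append('</corpus>')
    return '\n'.join(lines)
```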
This program adds the <wd> tag to an XML document by cutting the original sentence at spaces. For Chinese, it systematically cuts after each Chinese character (using a Python program to do so). The SplitOrig2WdKH.awk program does the same specifically for the Khmer language.
SentOrig2Wd.awk [options]
Where options include:
-e id give a specific id to elements (default: elt)
-l lang give a specific language (CN or default: none)
-n print <s> null </s> if the sentence is empty
-o filename write the output to this file (default: stdout)
-h print this help
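The cutting rule above can be sketched as follows, assuming space-separated tokens by default and one <wd> per character for Chinese (this is an illustration of the rule, not the actual awk/python code):

```python
def sent_to_wd(sentence, lang=None):
    # Space cut by default; per-character cut for Chinese (lang == "CN").
    if lang == 'CN':
        tokens = list(sentence.replace(' ', ''))
    else:
        tokens = sentence.split()
    if not tokens:
        return '<s> null </s>'  # behaviour described for the -n option
    return ' '.join('<wd> %s </wd>' % t for t in tokens)
```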
This program replaces strings that match the first column of a file with the corresponding string from the second column.
Replace.awk [options]
Where options include:
-V filename a two-column file (to_be_replaced replacement)
-l lang give a specific language: FR|CN|EN (default: none)
-o filename write the output to this file (default: stdout)
-h print this help
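The two-column replacement mechanism can be sketched like this (an illustration of the -V file semantics, not the Replace.awk source):

```python
def load_table(lines):
    # Parse a two-column replacement file: "to_be_replaced replacement".
    table = {}
    for line in lines:
        cols = line.split()
        if len(cols) >= 2:
            table[cols[0]] = cols[1]
    return table

def replace_tokens(tokens, table):
    # Replace every token found in the first column by its counterpart.
    return [table.get(t, t) for t in tokens]
```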
This program is a lexical tokenizer, used to split text into word tokens.
Tokenize.awk [options]
Where options include:
-l lang EN or anything else!
-V filename read vocabulary from this file
-o filename write the output to this file (default: stdout)
-h print this help
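As a rough illustration of word-level tokenization (the real Tokenize.awk rules, and how it uses its -V vocabulary, are not documented here), a naive tokenizer could split off punctuation as separate tokens:

```python
import re

def tokenize(text):
    # Naive sketch: words are runs of word characters; every other
    # non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)
```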
Sticks words of a corpus together. This program can be used to introduce phrases.
Stick.awk [options]
Where options include:
-V filename read vocabulary from this file
-a char character used to join strings in a stick (default: null)
-s number max stick size (default: 5)
-e id stick on the elements with this id (default: elt)
-t id don't modify the elements; create a new series of elements with this id (default: to)
-o filename write the output to this file (default: stdout)
-h print this help
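One plausible reading of the sticking process (a greedy longest-match sketch under assumptions; the actual Stick.awk matching strategy is not documented) is:

```python
def stick(tokens, phrases, join_char='', max_size=5):
    # Greedy sketch: at each position, try the longest phrase (up to
    # max_size words, the -s option) found in the phrase vocabulary,
    # and join its words with join_char (the -a option; default null).
    out, i = [], 0
    while i < len(tokens):
        for size in range(min(max_size, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in phrases:
                out.append(join_char.join(candidate))
                i += size
                break
        else:
            out.append(tokens[i])
            i += 1
    return out
```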
This program converts all words of a corpus to lower case.
Lower.awk
-e id make changes only on elements with this id
-V filename file containing upper/lower character pairs
-o filename write the output to this file (default: stdout)
-h print this help
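Driving the case folding with an explicit upper/lower table (the -V file) makes it easy to handle characters a locale-unaware lowercasing would miss, such as accented capitals. A minimal sketch of that idea:

```python
def lower_with_table(word, pairs):
    # Map each character through the upper->lower table; characters
    # not listed in the table are left unchanged.
    return ''.join(pairs.get(c, c) for c in word)
```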
This is a set of programs that convert numbers into their textual form. It includes: Num2LetterFR.awk (French), Num2LetterVN.awk (Vietnamese), Num2LetterKH.awk (Khmer), Num2LetterCN.awk (Chinese), Num2LetterEN.awk (English) and Num2LetterSP.awk (Spanish).
Num2LetterXX.awk [options]
Where options include:
-noxml input is one word per line (instead of XML)
-o filename write the output to this file (default: stdout)
-h print this help
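To illustrate what the Num2LetterXX.awk family does, here is a toy English converter restricted to 0-99 (the real programs handle full numbers and several languages; this sketch is not their implementation):

```python
UNITS = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
         'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen',
         'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen',
         'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty',
        'seventy', 'eighty', 'ninety']

def num_to_letter_en(n):
    # Toy converter for 0-99 only.
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] + ('-' + UNITS[unit] if unit else '')
```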
Removes words from a corpus. The list of words to remove is placed in a sorted file. The removal can be associated with the tag word, lem or pos. This program can be used, for example, to remove punctuation, adjectives, verbs...
Remove.awk [options]
Where options include:
-V filename read the vocabulary to remove from this file
-e remove elements
-w remove words (default)
-l remove lemmas
-p remove POS tags
-o filename write the output to this file (default: stdout)
-h print this help
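The core of the removal is a simple set-membership filter. A sketch, where the key being tested stands for the word, lemma or POS tag depending on the -w/-l/-p option:

```python
def remove_tokens(tokens, to_remove):
    # Drop every token whose key appears in the -V vocabulary.
    return [t for t in tokens if t not in to_remove]
```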
ONLY FOR FRENCH OR ENGLISH.
Adds the POS tag to words, using LIA_Tagg (a GPL external program).
The tagger can be downloaded HERE.
-l add POS tags AND lemmas
-o filename write the output to this file (default: stdout)
-h print this help
File format converter.
This program converts a bilingual aligned corpus to our XML format.
Selects sentences from an XML corpus for SLM training. The output file contains one sentence per line, with <s> and </s> tags.
-o filename write the output to this file (default: stdout)
-V filename read vocabulary from this file
-u value threshold of UNK (value > 0 and < 1)
-M value minimum number of elements for a sentence
-l filename log file for rejected sentences
-n filename print <s> null </s> if the sentence is empty
-h print this help
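The selection rule described by the -u and -M options can be sketched as follows (an illustration of the rule, assuming the UNK threshold is an upper bound on the proportion of out-of-vocabulary tokens):

```python
def select_sentence(tokens, vocab, unk_threshold=0.1, min_len=1):
    # Reject a sentence that is shorter than min_len (-M), or whose
    # proportion of out-of-vocabulary (UNK) tokens exceeds the -u
    # threshold.
    if len(tokens) < min_len:
        return False
    unk = sum(1 for t in tokens if t not in vocab)
    return unk / len(tokens) <= unk_threshold
```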
XML2WordCounts estimates word frequencies of an XML corpus. The output file is a two-column list with elements ranked by frequency.
XML2WordCounts.awk [options]
Where options include:
-h print this help
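The frequency ranking itself amounts to counting and sorting. A minimal sketch of the described two-column output (word, count), ignoring the XML parsing:

```python
from collections import Counter

def word_counts(sentences):
    # Count word frequencies and return (word, count) pairs ranked
    # by decreasing frequency.
    counts = Counter(w for s in sentences for w in s.split())
    return counts.most_common()
```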
Counts2Vocab creates a vocabulary (list of words) from a count file.
Counts2Vocab.awk [options]
Where options include:
-t filename read unigram counts from this file
-o filename write sorted vocabulary to this file
-K value desired vocabulary size
-c value desired minimum number of occurrences for a word
-h print this help
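The interaction of the -K and -c cutoffs can be sketched like this (an illustration of the described behaviour, assuming the min-count filter is applied before the size truncation):

```python
def counts_to_vocab(counts, max_size=None, min_count=1):
    # Keep words with at least min_count occurrences (-c), then
    # truncate to the max_size most frequent (-K).
    ranked = sorted(counts, key=lambda wc: -wc[1])
    kept = [w for w, c in ranked if c >= min_count]
    return kept[:max_size] if max_size else kept
```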
This program estimates Zipf's law values from a count file. The output is a four-column file with (word, count, index, zipf).
cat file.counts | Counts2Zipf.awk [options ...] [-h]
Where options include:
-o filename write result to this file (default: stdout)
-s value the column number to sort the output by (2, 3 or 4)
-h print this help
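The exact formula used for the zipf column is not documented here. One common estimate, used purely as an illustration, is the product count × rank, which is constant under an ideal Zipf distribution:

```python
def counts_to_zipf(counts):
    # Assumed estimate: zipf = count * rank. The actual formula used
    # by Counts2Zipf.awk may differ.
    rows = []
    ranked = sorted(counts, key=lambda wc: -wc[1])
    for rank, (word, count) in enumerate(ranked, 1):
        rows.append((word, count, rank, count * rank))
    return rows
```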
Html2XML-FR.csh
[HTML|URL] input file or web URL
[w] the input is a URL, not a file
[g] graphical mode
-h print help
WordsTopN.csh
N value (number of words)
Textometrie.csh - ONLY FOR FRENCH. Needs zenity and gnuplot to be installed.
Here is a screenshot of the French html file downloaded from a French news web site.
To normalize this file, try the following:
cd examples
./sample-fr.csh francais.html
which converts a file downloaded from a French news web site.
The output XML file is francais.xml (see the next screenshot).
Now, try the following:
cd ../scripts/
./Textometrie.csh
and give the francais.xml file to the file selection dialog box.
These two windows will appear: