TranslatorBank - Handbook

TranslatorBank is a free, user-friendly program for collecting specialized texts from the Web (building a corpus), extracting monolingual terminology and collocations, and performing linguistic analyses during the translation process. It is designed for use in a professional translation environment or in translation classes to analyse reference texts.

If you use this software in your papers, please cite: "Revisiting corpus creation and analysis tools for translation tasks". In: Gallego-Hernández, Daniel and Rodríguez-Inés, Patricia (eds.), Special Issue: Corpus Use and Learning to Translate, almost 20 years on. Cadernos de Tradução, 36(1), 62-87 (2016).

System Requirements

Operating System: Windows.

Overview

For translators and interpreters, analysing concordances allows you to:

  • See how words and phrases are used in real contexts (corporate language, technical field, etc.)
  • Analyse the language style and terminology of a particular industrial sector or company
  • Find the relevant terminology and phraseology

Key features:

  • Easy collection of reference texts (monolingual) from the Internet to create your own specialized corpus (from HTML webpages or PDF files)
  • Generation of concordances to see how words work in real contexts (concordances can be ordered alphabetically)
  • Computation of collocations for the analysed word
  • Computation of word frequency lists
  • Extraction of specialized terminology (this works only for languages for which the TreeTagger is available, see TreeTagger)
  • From the search results (concordance) you can open the original file, be it a webpage or a PDF file.

How it works

Collecting texts

If you don't have a set of texts, you can create your own corpus from the Web using the CorpusCreator utility. The user manual for CorpusCreator can be found here. You can start CorpusCreator from the menu Corpus | Create corpus from the Web (CorpusCreator). Texts collected with CorpusCreator are downloaded and converted into XML files which can be imported directly into a corpus database by TranslatorBank.

Creating a corpus database

When you have a set of suitable documents for your project, you can import them into a corpus database. All texts must be saved in the same folder, which will become your project folder. You can import:

  • Texts in simple .txt format (for example the texts you collected manually). If your texts are in a different format (PDF, PowerPoint, Word, etc.), you first have to convert them to .txt. For your convenience, TranslatorBank provides a PDF to TEXT converter which converts all PDFs saved in a folder; you can open it from the tool CorpusCreator (a scripted alternative is sketched after this list)
  • Texts in XML format, i.e. the texts automatically downloaded with the tool CorpusCreator
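
If you prefer to script the PDF-to-text conversion yourself instead of using the built-in converter, a minimal sketch in Python could look like the following. It is not part of TranslatorBank, it assumes the third-party library pdfminer.six is installed, and the folder path is only an example.

    # Minimal sketch: convert every PDF in a folder to a .txt file next to it.
    # Assumes the third-party library pdfminer.six (pip install pdfminer.six).
    from pathlib import Path
    from pdfminer.high_level import extract_text

    folder = Path(r"C:\MyCorpora\my_project")      # example folder, adjust to yours
    for pdf in folder.glob("*.pdf"):
        text = extract_text(str(pdf))              # plain text of the whole PDF
        pdf.with_suffix(".txt").write_text(text, encoding="utf-8")

The resulting .txt files can then be imported together with your other texts as described below.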

Once you have a folder with .txt and/or .xml files, you can import them into a corpus database:

  • Select Corpus | Create corpus database from folder
  • Select the language of your texts
  • The software will now import all your texts into a corpus database.

Opening a corpus

Click on Corpus | Open corpus database. Note: when you start TranslatorBank, the last corpus you used is opened automatically. The name of the open corpus is shown at the top of the window.

Searching for words

TranslatorBank uses a query system similar to the Google Search Engine. To start querying your corpus, type a word, part of a word or a phrase in the Search mask and press Enter. The concordances for your string will be displayed. By default the Proximity search is activated: this means that if you enter 2 words, the tool will search for sentences containing the two words within a distance of 5 words. If you want to search for exact words and phrases, enclose them in quotation marks (""). To see the whole text context of a particular result, double-click on the corresponding concordance ID (row number on the left); there the search word is highlighted. You can access the original file with the links provided.
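
For example (the search terms below are only placeholders):

    pressure valve        finds sentences containing both words within the proximity distance (5 words by default)
    "pressure valve"      finds only the exact phrase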

Ordering results

To order the results alphabetically, open the Options panel by clicking the green +, then use the drop-down lists to select Left or Right and the distance (in words) of the word by which you want the results to be ordered.

Search options

  • Results (KWIC) length in characters: set how many characters are displayed in each result line
  • Proximity search in all orders: when searching for two words, results containing the two words are shown irrespective of their order
  • Proximity search distance (words): set the maximum distance (in words) between the two words for them to be considered a match
  • Context length (characters): set how many characters are used to display the context of your result in the lower panel. To show the context of a result, click on the number corresponding to the result.
  • Max. number of results: if your text corpus is very large, you can limit the maximum number of results shown. This makes the system respond more quickly
  • Shuffle results: show results in random order. This is useful if you want to see results in a mixed order rather than in the order in which they appear in your documents

View

  • In menu View you can change the font size of the search results (concordances).

Tools

  • Compute Word frequency (tokens): it generates a list of words ordered by their frequency. Stopwords can be excluded. Filters can be applied. If this command was already run on the same corpus, the computation is skipped and the results are shown. (The sketch after this list illustrates the general idea behind the frequency and collocation tools.)
  • Extract terminology: the tool extracts a list of specialized words following the grammar rules defined in the parameter files saved in the folder (Installation directory/files/TermExtractor/Rules). Note: this function is language-dependent, which means it needs language-dependent resources (see TreeTagger). At the moment only the resources for a few languages are installed (you can add them manually). If this command was already run on the same corpus, the computation is skipped and the results are shown.
  • Find collocates for selected word: it generates a list of words that appear often near the selected word.
  • Compute basic corpus statistics: it gives you some information about your corpus, such as language, number of texts and so on.
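
The following sketch illustrates the general idea behind the word frequency and collocation tools on a folder of .txt files. It is not TranslatorBank's actual implementation; the stopword list, the window size and the folder path are only examples.

    # Illustrative sketch only, not TranslatorBank's internal code.
    # Shows the idea behind "Compute Word frequency" and "Find collocates".
    from collections import Counter
    from pathlib import Path
    import re

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}   # tiny example list
    WINDOW = 5                                                # example window size

    def tokens(folder):
        # Yield lower-cased word tokens from every .txt file in the folder.
        for txt in Path(folder).glob("*.txt"):
            for word in re.findall(r"\w+", txt.read_text(encoding="utf-8").lower()):
                yield word

    def word_frequency(folder):
        # Frequency list with stopwords excluded.
        return Counter(w for w in tokens(folder) if w not in STOPWORDS)

    def collocates(folder, node):
        # Count words occurring within WINDOW words of the node word.
        words = list(tokens(folder))
        near = Counter()
        for i, w in enumerate(words):
            if w == node:
                neighbours = words[max(0, i - WINDOW):i] + words[i + 1:i + 1 + WINDOW]
                near.update(n for n in neighbours if n not in STOPWORDS)
        return near

    # Example usage (hypothetical folder and word):
    # print(word_frequency(r"C:\MyCorpora\my_project").most_common(20))
    # print(collocates(r"C:\MyCorpora\my_project", "valve").most_common(20))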

Installing language files for Terminology Extraction

We are still working on the rules and heuristics for the individual languages, so please consider the results as "work in progress".

Even if the idea of TranslatorBank is to offer you an out-of-the-box solution, for the extraction module - as it involves so many different languages - you need to help the software to speak your language. In fact, for extracting the terminology from your corpus TranslatorBank uses four resources:

  1. a third-party software called TreeTagger (already installed)
  2. the language-dependent parameter files used by the TreeTagger (only a few are installed; you can install the others from the homepage of the TreeTagger)
  3. a set of language-dependent morphological rules (to be created for your corpus language)
  4. stopwords and common vocabulary files

We cannot directly provide the language-dependent resources for all languages. This is why you may need to add some files (for English everything is already installed, although grammar rules, stopwords and common vocabulary can be changed, i.e. improved, by the user). If a file for your language is missing from the installation, TranslatorBank will pop up a message. In this case you need to:

  • Open the TreeTagger homepage here and download the parameter files for the languages you need (scroll down until you find "Parameter files")
  • Unpack them (you can use the free archiver 7-Zip)
  • Rename each file with the language name in English, in lower-case letters, for example: german, italian, english, etc.
  • Put the files in the installation folder "C:\Programs\TranslatorBank\files\TreeTagger\lib\"
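
After these steps, the lib folder contains one parameter file per language, for example:

    C:\Programs\TranslatorBank\files\TreeTagger\lib\english
    C:\Programs\TranslatorBank\files\TreeTagger\lib\german
    C:\Programs\TranslatorBank\files\TreeTagger\lib\italian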

You also need the rules files. Rules are provided for English, Italian and German (but you can change, i.e. improve, them). Morphological rules describe what terms look like. Unfortunately I do not speak that many languages, so I have no clue what a term is in French, Spanish or Russian. You will have to write these rules yourself (or ask me for support and we can do it together). If you have successfully written your rules, it would be nice if you could send them to me so that I can make them available with the next release.

For English, for example, we can say that a term can be a NOUN. In the rules file for English, we have to write "<1gram>;NN;", where "1gram" means that the term is made of only one word and "NN" is the part of speech used in the tagset documentation (associated with the parameter file of the TreeTagger) to describe nouns. If we want to extract terms made of "adjective + noun", we add a new line to the rules file for English: <2gram>;ADJ,NN; (please note the use of commas and semicolons).
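
Based on these two rules, a minimal English rules file would therefore contain the following lines (the exact part-of-speech tags to use are defined in the tagset documentation of your TreeTagger parameter file):

    <1gram>;NN;
    <2gram>;ADJ,NN;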

Rules files are saved in a folder called Rules. You can open it via the menu "Tool|Open directory with language". See the format of the English rules file to understand how a rules file is made.

If you have problems adding your language, write me an e-mail and I'll help you create the rules file. I'll need some help from you in order to build it!

Copyright

TranslatorBank and all its modules are copyright of Claudio Fantinuoli, University of Mainz. The third-party modules are owned by their respective owners. For more information contact the author of this software.