CorpusCreator - Handbook

About

CorpusCreator is a free, user-friendly tool for creating small corpora using the Web as a source of texts. It was designed to build specialized comparable corpora for translation tasks. The simple interface looks like a search engine. It lets you perform a Web search using keywords that represent your domain of interest and download the links found (PDF or HTML). Each downloaded text is converted into plain text or XML.

Finding texts for a corpus

Please make sure you have a valid Bing API key. Replace the key currently in use with your own as soon as possible. You can get one for free here (1 month). Request a key for version 7 (V7) of the API. Enter the key under Options|Bing|Insert your API key.

The main window:

Use this like a normal search engine. First, create an empty folder where you want the tool to save all the texts. In the search field, enter a few keywords that are typical of the texts you want to collect (to build a corpus on photovoltaics, for example, enter: "solar energy" "photovoltaic" "panel"). You don't need commas. Use "" for multi-word terms.
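The quoting rule above can be illustrated with a short Python sketch. This is not part of CorpusCreator; the function name build_query is hypothetical and only shows how a list of terms could be turned into a search string in which multi-word terms are quoted and no commas are used.

```python
def build_query(terms):
    """Join search terms with spaces, quoting multi-word terms.

    Hypothetical helper for illustration; CorpusCreator itself only
    expects you to type the resulting string into its search field.
    """
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip empty entries
        # A term containing a space is wrapped in double quotes
        parts.append(f'"{term}"' if " " in term else term)
    return " ".join(parts)


query = build_query(["solar energy", "photovoltaic", "panel"])
print(query)
```

Single-word terms are left unquoted here; quoting them as well, as in the example above, does no harm.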

Options

  • decide if you want to build a corpus with PDF or HTML files or both.
  • set the number of results CorpusCreator is going to find (default: 10).
  • you can set the language you are interested in.
  • in Domain you can restrict the search to a specific domain, for example .de for sites under the German top-level domain or usa.gov for texts coming only from the official Web portal of the U.S. government.

Press Enter or the magnifier to start the search. Results are shown:

You can now deselect any sources you consider irrelevant for your corpus, and decide whether to proceed with downloading the selected links or to repeat your search using new key terms. In this way you can collect more texts or extend the topic of your corpus. To repeat the search, just enter new terms and press Enter or the magnifier: the new results are added to your list of links (duplicates are removed automatically). When you are done, start downloading all links by selecting the download icon. A new window will be shown. Select the folder where you want to save all downloaded texts.
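How new results are added to the list while duplicates are dropped can be sketched as follows. This is an assumption about the general idea, not CorpusCreator's actual code; merge_links and the example URLs are hypothetical.

```python
def merge_links(existing, new):
    """Append new links to the existing list, dropping duplicates.

    The order of first appearance is preserved, so links from the
    first search stay at the top of the list.
    """
    seen = set()
    merged = []
    for url in list(existing) + list(new):
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged


first_search = ["http://example.org/a.pdf", "http://example.org/b.html"]
second_search = ["http://example.org/b.html", "http://example.org/c.pdf"]
print(merge_links(first_search, second_search))
```

The sketch compares URLs as plain strings; whether the tool also treats variants of the same address (for example with and without a trailing slash) as duplicates is not stated in this handbook.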

Just select an empty folder you have previously created. Click on OK to start downloading your texts. Depending on how many texts you are going to download, this process may take some time. The texts will be downloaded and converted into XML or plain text.

You can now import all texts into a corpus database using the import option in the main window of TranslatorBank; see the handbook.

Options

  1. Try to keep only running text from HTML: the tool tries to keep only the running text, discarding buttons, etc.
  2. Clean texts when downloading: the tool removes double spaces and tabs and tries to make the text more uniform.
  3. Output as XML: this is the default option. All downloaded texts are formatted as XML. This is the format used to create databases in TranslatorBank.
  4. Set static XML values: if needed, you can add some "static" values to each downloaded text that is converted to XML. This could be useful if you are going to use the texts with other software.
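To give a rough idea of what options 2 to 4 amount to, here is a minimal Python sketch using only the standard library. The element names doc and text, the static value domain, and both function names are assumptions made for illustration; the actual XML layout produced by CorpusCreator is not described in this handbook.

```python
import re
import xml.etree.ElementTree as ET


def clean_text(text):
    """Replace tabs, collapse repeated spaces, and drop empty lines."""
    text = text.replace("\t", " ")
    text = re.sub(r" {2,}", " ", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def to_xml(text, static_values):
    """Wrap a text in XML and add the same static values to every document.

    The element names used here are hypothetical, not the tool's schema.
    """
    doc = ET.Element("doc")
    for name, value in static_values.items():
        ET.SubElement(doc, name).text = value
    ET.SubElement(doc, "text").text = text
    return ET.tostring(doc, encoding="unicode")


raw = "Solar  panels\tconvert light.\n\n  They need no fuel.  "
print(to_xml(clean_text(raw), {"domain": "photovoltaics"}))
```

Special characters such as & or < are escaped by ElementTree, so the output stays well-formed XML.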

Examples for corpus creation

The following examples demonstrate how you can build a specialized corpus.

  1. Creating a German corpus of texts from the Bundesbank, the central bank of Germany: In Google, append "site:bundesbank.de filetype:PDF" to your search. In Advanced Search, set the language to "German".
  2. Creating an English corpus on photovoltaics: Use Google to search for "photovoltaic solar cell filetype:PDF". In Advanced Search, set the language to "English".
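The search strings in the two examples follow a simple pattern: keywords first, then the site: and filetype: operators. A small Python sketch of that pattern is shown below; the function name with_operators is hypothetical and not part of any tool mentioned here.

```python
def with_operators(query, site=None, filetype=None):
    """Append the site: and filetype: search operators to a query string."""
    parts = [query.strip()]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    # Drop empty parts so an empty query yields only the operators
    return " ".join(part for part in parts if part)


# Example 1: the operators to append to any Bundesbank search
print(with_operators("", site="bundesbank.de", filetype="PDF"))
# Example 2: the full photovoltaics search string
print(with_operators("photovoltaic solar cell", filetype="PDF"))
```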