CorpusCreator is a free, user-friendly tool for creating small corpora using the Web as a source of texts. It was designed to build specialized comparable corpora for translation tasks. The simple interface looks like a search engine. It lets you run a Web search with a few keywords that represent your domain of interest and download the links found (PDF or HTML). Each downloaded text is converted into plain text or XML.
Please make sure you have a valid Bing API key. Replace the preconfigured key with your own key as soon as possible. You can get one for free here (valid for 1 month). Request a key for version 7 (V7) of the API, then enter it under Options|Bing|Insert your API key.
The main window:
Use it like a normal search engine. First, create an empty folder where you want the tool to save all the texts. In the search field, enter a few keywords that are typical of the texts you want to collect (for a corpus on photovoltaics, for example: "solar energy" "photovoltaic" "panel"). You don't need to separate the terms with commas. Enclose multi-word terms in double quotes ("").
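The query rules above (no commas, double quotes around multi-word terms) can be sketched in a few lines of Python. This is only an illustration of the input format; `build_query` is a hypothetical helper and not part of CorpusCreator.

```python
def build_query(terms):
    """Join search terms into one query string.

    Hypothetical helper, not part of CorpusCreator: it only illustrates
    the input format described above.
    """
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # ignore empty entries
        already_quoted = term.startswith('"') and term.endswith('"')
        # A term made of several words must be enclosed in double quotes.
        if len(term.split()) > 1 and not already_quoted:
            term = f'"{term}"'
        parts.append(term)
    # Terms are separated by spaces; commas are not needed.
    return " ".join(parts)


print(build_query(["solar energy", "photovoltaic", "panel"]))
# prints: "solar energy" photovoltaic panel
```

Quoting only the multi-word terms keeps single keywords as they are, while "solar energy" is passed to the search as one phrase.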
Options
Press Enter or click the magnifier icon to start the search. The results are then displayed:
You can now deselect any sources that you think are not relevant for your corpus, and decide whether to proceed with downloading the selected links or to repeat the search with new key terms. Repeating the search lets you collect more texts or extend the topic of your corpus. To do so, just enter new terms and press Enter or click the magnifier icon: the new results are added to your list of links (duplicates are removed automatically). When you are done, start downloading all links by clicking the download icon. A new window opens. Select the folder where you want to save all downloaded texts.
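How the tool detects duplicates is not documented here. As a minimal sketch, assuming two links count as duplicates when their URLs are identical, adding new results to an existing list could work as follows. `merge_links` and the example.org URLs are hypothetical.

```python
def merge_links(existing, new_results):
    """Append new links to the existing list, skipping duplicates.

    Assumption: two links are duplicates when their URLs are identical.
    The order of the existing list is preserved.
    """
    seen = set(existing)
    merged = list(existing)
    for url in new_results:
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged


first_search = ["https://example.org/solar", "https://example.org/panel"]
second_search = ["https://example.org/panel", "https://example.org/inverter"]
print(merge_links(first_search, second_search))
# prints: ['https://example.org/solar', 'https://example.org/panel', 'https://example.org/inverter']
```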
Select the empty folder you created earlier, then click OK to start downloading your texts. Depending on how many texts you are downloading, this may take some time. The texts are downloaded and converted into XML or plain text.
You can now import all the texts into a corpus database using the import option in the main window of TranslatorBank (see the handbook).
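To give an idea of what converting an HTML page to plain text involves, here is a minimal sketch based on Python's standard `html.parser` module. It is not CorpusCreator's own converter, whose internals are not described here; `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page (illustrative sketch only)."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a script or style element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = "<html><body><h1>Solar</h1><p>Panels convert light.</p></body></html>"
print(html_to_text(page))
```

The sketch keeps one text chunk per line and drops script and style content. Real pages usually need further cleaning, such as removing navigation menus.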
The following examples demonstrate how you can build a specialized corpus.