CorpusStudio Tutorial

Erwin Komen

Radboud University

E.Komen@Let.ru.nl

1.            The problem

The problem we are going to investigate today is the following:

 

find it-clefts in English corpora.

 

For the purpose of this tutorial we will assume that it-clefts contain the following elements in the indicated order:

·        An NP consisting of the pronoun “it”

·        A finite form of “to be”

·        A second NP

·        A CP

We will be using Corpus Studio to investigate this problem. Please first install the program on the computer you are working on (use the Quick Start Guide).

2.            Setting the base directory

After the program is installed correctly, choose Tools/Settings to set the base directory for work with CorpusStudio. Use something like u:\CorpusStudio or u:\data\CorpusStudio.

3.            Creating the project

Create a new corpus research project using File/New (or: Ctrl+N). For our purposes give it an appropriate name, e.g. Tutorial.

The locations of input and output files now need to be set.

·        Select the output directory button

o   Create Tutorial as a subdirectory of CorpusStudio.

o   Set the output directory to CorpusStudio\Tutorial

·        Select the Query directory button

o   Create Queries (or: qq) as a subdirectory of CorpusStudio.

o   Set the query directory to CorpusStudio\Queries

·        Select the input directory button

o   Find the Corpora\PPCEME subdirectory on the W drive at:

W:\Letteren\Talen en Culturen\Engelse taal en letterkunde\Corpora\PPCEME

·        Select all PSD files in this directory by pressing “Select all files in this directory”.

·        SAVE the project! Use Ctrl+S or File/Save.
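The directory layout that results from these steps can be sketched in Python. This is only an illustration of the layout: CorpusStudio itself sets the directories through its buttons, and the helper names make_layout and list_psd_files, the temporary base directory and the sample file names are all assumptions made for this sketch.

```python
import tempfile
from pathlib import Path

def make_layout(base):
    """Create the Tutorial (output) and Queries directories under `base`."""
    base = Path(base)
    output_dir = base / "Tutorial"
    query_dir = base / "Queries"
    output_dir.mkdir(parents=True, exist_ok=True)
    query_dir.mkdir(parents=True, exist_ok=True)
    return output_dir, query_dir

def list_psd_files(input_dir):
    """Return all .psd files directly inside `input_dir`, sorted by name."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".psd"
    )

# Demo in a throwaway directory so that no real drive is touched.
with tempfile.TemporaryDirectory() as tmp:
    output_dir, query_dir = make_layout(Path(tmp) / "CorpusStudio")
    input_dir = Path(tmp) / "PPCEME"  # stands in for the W-drive folder
    input_dir.mkdir()
    for name in ("sample-a.psd", "sample-b.psd", "readme.txt"):  # made-up names
        (input_dir / name).write_text("")
    print(output_dir.name, query_dir.name)
    print([p.name for p in list_psd_files(input_dir)])
```

Only the two .psd files are listed; the text file is skipped, just as the “Select all files in this directory” button is meant to pick up the PSD files only.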

There are still some more general things to be set, so choose the General tab page.

·        Fill in the Author

·        Describe the goal of this tutorial

·        Add comments about the implementation

·        Save the project again.

4.            Add a definitions file

For this tutorial we are going to add an existing definitions file. Once a definitions file is stored in our own CorpusStudio\Queries directory, future projects can make use of it by choosing Definition/Add.

For this tutorial, however, we are going to copy it from its location on the W drive using the command Definition/Import.

The definitions file is located at:

 

W:\Letteren\Talen en Culturen\Engelse taal en letterkunde\Erwin\CorpusStudio\OE+MEU.def

5.            Make the first query

We are now ready to make a first query. This first query should select all IPs with the following characteristics:

·        They contain an NP

·        They contain a finite form of “to be”

·        They contain a CP

Make a new query using Query/New. Give it the following characteristics:

Now fill in the new query so that it looks like the following example:

node: IP*

add_to_ignore: \**

remove_nodes: f

print_indices: t

define: OE+MEU.def

 

query: ( (IP* iDoms anynp)     AND

         (IP* iDoms finite_BE) AND

         (IP* iDoms CP*)

        )

For an explanation of keywords like iDoms (immediately dominates), see Help/Query languages/Corpus search, which directs you to the CorpusSearch website. Leafing through the tutorial there, you will find the explanations of these keywords.
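To make the logic of this first query concrete, here is a minimal Python sketch of what “immediately dominates” amounts to: a node matches if one of its direct children carries a matching label. The tree is a made-up example, and the label lists ANY_NP and FINITE_BE are assumed stand-ins for the definitions anynp and finite_BE; the actual contents of OE+MEU.def are not reproduced here. The add_to_ignore setting is not modelled.

```python
from fnmatch import fnmatchcase

# A tree node is (label, [children]); a word is a plain string.
ANY_NP = ["NP*"]            # assumed stand-in for "anynp"
FINITE_BE = ["BEP", "BED"]  # assumed stand-in for "finite_BE"

def label_matches(label, patterns):
    """True if `label` matches any wildcard pattern such as 'IP*'."""
    return any(fnmatchcase(label, p) for p in patterns)

def idoms(node, patterns):
    """True if `node` has a direct child whose label matches a pattern."""
    _, children = node
    return any(
        not isinstance(child, str) and label_matches(child[0], patterns)
        for child in children
    )

def first_query(node):
    """IP* that immediately dominates an NP, a finite BE and a CP."""
    label, _ = node
    return (
        label_matches(label, ["IP*"])
        and idoms(node, ANY_NP)
        and idoms(node, FINITE_BE)
        and idoms(node, ["CP*"])
    )

# Made-up clause: "it was John that ..."
clause = ("IP-MAT", [
    ("NP-SBJ", [("PRO", ["it"])]),
    ("BED", ["was"]),
    ("NP-OB1", [("NPR", ["John"])]),
    ("CP-CLF", [("C", ["that"])]),
])
no_cp = ("IP-MAT", [("NP-SBJ", [("PRO", ["it"])]), ("BED", ["was"])])

print(first_query(clause))  # True
print(first_query(no_cp))   # False
```

Note that the three conditions are independent of each other: nothing is said yet about the order of the NP, the verb and the CP.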

6.            Make the second query

The first query only returns IPs that contain the basic ingredients for an it-cleft. We now want to take the output of this first query and make a finer selection to get the actual it-clefts. Make a new query called ipIT-cleft, which should look as follows:

node: IP*

add_to_ignore: \**

remove_nodes: f

print_indices: t

define: OE+MEU.def

 

query: ( ( (IP* iDoms [1]anynp) AND

           ([1]anynp iDoms pronoun) AND

           (pronoun iDoms pronoun_it) ) AND

         (IP* iDoms finite_BE) AND

         (IP* iDoms [2]anynp) AND

         (IP* iDoms CP*) AND

         ([1]anynp iPrecedes finite_BE) AND

         (finite_BE iPrecedes [2]anynp) AND

         ([2]anynp iPrecedes CP*)

       )

 

The first part of this query aims to find the it pronoun. The treebank coding of an “it” pronoun can be understood from the following example:

             (49 NP-SBJ-2 (50 PRO hit))

The pronoun lexeme hit is inside a node of type PRO, and this node is inside one of type NP.
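The nesting in this example can be made explicit with a small Python sketch that reads the bracketed notation into a tree and then checks the chain NP, PRO, “it”. The functions parse and is_it_np are hypothetical helpers written for this illustration, and the set IT_FORMS is an assumed stand-in for the definition pronoun_it.

```python
import re

def parse(text):
    """Parse one bracketed node such as '(49 NP-SBJ-2 (50 PRO hit))'."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        # With print_indices switched on, a numeric index precedes the label.
        if tokens[pos].isdigit():
            pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()

IT_FORMS = {"it", "hit"}  # assumed stand-in for "pronoun_it"

def is_it_np(np):
    """True if an NP node directly contains a PRO node holding an 'it' form."""
    label, children = np
    if not label.startswith("NP"):
        return False
    for child in children:
        if isinstance(child, tuple) and child[0] == "PRO":
            if any(isinstance(w, str) and w.lower() in IT_FORMS for w in child[1]):
                return True
    return False

tree = parse("(49 NP-SBJ-2 (50 PRO hit))")
print(tree)            # ('NP-SBJ-2', [('PRO', ['hit'])])
print(is_it_np(tree))  # True
```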

The IP containing the NP with it should, again, contain the finite form of be, a second NP (indicated by the index [2]), and a CP.

The order of these basic elements is defined using the keyword iPrecedes, “immediately precedes”.
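The effect of the ordering conditions can be illustrated with a short Python sketch. It is a simplification: the daughters of an IP are given as a flat list of (label, words) pairs, “immediately precedes” is reduced to “is the next sister”, and the label lists are assumed stand-ins for the definitions in OE+MEU.def. The example clauses are made up.

```python
from fnmatch import fnmatchcase

FINITE_BE = ["BEP", "BED"]  # assumed stand-in for "finite_BE"
IT_FORMS = {"it", "hit"}    # assumed stand-in for "pronoun_it"

def matches(label, patterns):
    """True if `label` matches any wildcard pattern such as 'NP*'."""
    return any(fnmatchcase(label, p) for p in patterns)

def find_cleft_order(daughters):
    """Index where it-NP, BE, NP, CP occur as adjacent sisters, else -1."""
    for i in range(len(daughters) - 3):
        (l1, w1), (l2, _), (l3, _), (l4, _) = daughters[i:i + 4]
        if (matches(l1, ["NP*"]) and w1.lower() in IT_FORMS
                and matches(l2, FINITE_BE)
                and matches(l3, ["NP*"])
                and matches(l4, ["CP*"])):
            return i
    return -1

cleft = [("NP-SBJ", "It"), ("BED", "was"),
         ("NP-OB1", "the king"), ("CP-CLF", "that spoke")]
interrupted = [("NP-SBJ", "It"), ("BED", "was"), ("ADVP", "then"),
               ("NP-OB1", "the king"), ("CP-CLF", "that spoke")]

print(find_cleft_order(cleft))        # 0
print(find_cleft_order(interrupted))  # -1
```

The second clause shows how strict “immediately” is: a single intervening constituent is enough to make the query miss the clause.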

Be sure to check that every opening bracket in the query has a matching closing bracket!
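Checking brackets by eye is error-prone in a query of this size, so the check can also be done mechanically. The following Python sketch is an independent helper, not a CorpusStudio feature; the function name check_brackets is an assumption of this example.

```python
def check_brackets(query):
    """Return None if brackets are balanced, else a message locating the problem."""
    pairs = {")": "(", "]": "["}
    stack = []  # opening brackets seen so far, with their positions
    for pos, ch in enumerate(query):
        if ch in "([":
            stack.append((ch, pos))
        elif ch in pairs:
            if not stack or stack[-1][0] != pairs[ch]:
                return f"unexpected '{ch}' at position {pos}"
            stack.pop()
    if stack:
        ch, pos = stack[-1]
        return f"unclosed '{ch}' at position {pos}"
    return None

good = "( (IP* iDoms [1]anynp) AND (IP* iDoms CP*) )"
bad = "( (IP* iDoms [1]anynp AND (IP* iDoms CP*) )"

print(check_brackets(good))  # None
print(check_brackets(bad))
```

For the second string the function reports an unclosed round bracket, because one closing bracket after [1]anynp is missing.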

7.            Making the construction

The queries need to be executed in a particular order, which is defined in the Constructor Editor tab.

Choose Constructor/Add, and then set the following parameters:

It is good practice to fill in fields like “Goal” and “Comments”.

Add the second query to the constructor by choosing Constructor/Add again, and then set the following parameters:

Make sure your corpus research project is saved again; we don’t want to lose our precious efforts!
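The idea behind the constructor, that each query works on the output of the previous one, can be sketched in Python as a chain of filters. This is a conceptual illustration only and says nothing about how CorpusStudio is implemented; the clause data, the step name “first query” and the two test functions are assumptions made for this sketch.

```python
def run_in_order(steps, clauses):
    """Apply each (name, test) step to the survivors of the previous step."""
    counts = []
    for name, test in steps:
        clauses = [c for c in clauses if test(c)]
        counts.append((name, len(clauses)))
    return clauses, counts

def has_ingredients(clause):
    """Rough version of the first query: an NP, a finite BE and a CP are present."""
    labels = clause["labels"]
    return (any(l.startswith("NP") for l in labels)
            and any(l in ("BEP", "BED") for l in labels)
            and any(l.startswith("CP") for l in labels))

def is_it_cleft(clause):
    """Rough version of ipIT-cleft: 'it' NP, BE, NP, CP in exactly this order."""
    shape = [l[:2] for l in clause["labels"]]
    return clause["subject"] == "it" and shape == ["NP", "BE", "NP", "CP"]

# Toy clauses, reduced to their daughters' labels and the subject word.
clauses = [
    {"labels": ["NP-SBJ", "BED", "NP-OB1", "CP-CLF"], "subject": "it"},
    {"labels": ["NP-SBJ", "BEP", "CP-THT"], "subject": "he"},
    {"labels": ["NP-SBJ", "VBD", "NP-OB1"], "subject": "she"},
]

steps = [("first query", has_ingredients), ("ipIT-cleft", is_it_cleft)]
result, counts = run_in_order(steps, clauses)
print(counts)  # [('first query', 2), ('ipIT-cleft', 1)]
```

Because the second step only sees what the first step let through, swapping the two steps would change the intermediate counts, which is why the order has to be fixed in the constructor.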

8.            Executing the queries

You have created queries in the query editor, the definition file they make use of has been loaded in this project, and the queries have been put in a particular order. You can now execute the queries by pressing F10 (or selecting Tools/Execute constructor).

·        The “Output Monitor” tab page is opened, and the status bar shows that the first query is being executed.

o   This may take quite some time, depending on how many source files you have defined!

o   Execution may be interrupted using F11.

·        After the status bar has shown that the second query is being executed, it will report that the execution of the queries has finished.

o   If no errors occurred, the textbox “Command line” should show “Finished successfully”.

9.            Checking the results

There are now several ways to look at the output that has been produced.

Finding words or phrases in the definition files, the query files and the output files can be done using Ctrl+F (Edit/Find). Note that the find function works forward or backward, taking the current selection as a starting point.

10.       Affiliation

Erwin R. Komen

Centre for Language Studies/ETC

Radboud University of Nijmegen

Box 9103, 6500 HD Nijmegen

E.Komen@Let.ru.nl