CorpusStudio Tutorial

Erwin Komen

Radboud University

1. The problem

The problem we are going to investigate today is the following:

find it-clefts in English corpora.

For the purpose of this tutorial we will assume that it-clefts contain the following elements in the indicated order:

The word it (with variant spellings “it”, “hit”)
A finite form of the verb “to be”
A noun phrase
A relative clause, formed by a CP

We will be using CorpusStudio to investigate this problem. Please first install the program on the computer you are working on (see the Quick Start Guide).

2. Setting the base directory

After the program is installed correctly, choose Tools/Settings to set the base directory for work with CorpusStudio. Use something like u:\CorpusStudio or u:\data\CorpusStudio.

3. Creating the project

Create a new corpus research project using File/New (or: Ctrl+N). For our purposes give it an appropriate name, e.g. Tutorial.

The locations of input and output files now need to be set.

· Select the output directory button

o Create Tutorial as a subdirectory of CorpusStudio.

o Set the output directory to CorpusStudio\Tutorial

· Select the Query directory button

o Create Queries (or: qq) as a subdirectory of CorpusStudio.

o Set the query directory to CorpusStudio\Queries

· Select the input directory button

o Find the Corpora\PPCEME subdirectory on the W drive at:

W:\Letteren\Talen en Culturen\Engelse taal en letterkunde\Corpora\PPCEME

· Select all PSD files in this directory by pressing “Select all files in this directory”.

· SAVE the project! Use Ctrl+S or File/Save.

A few general settings remain, so choose the General tab page.

· Fill in the Author

· Describe the goal of this tutorial

· Add comments about the implementation

· Save the project again.

4. Add a definitions file

For this tutorial we are going to add an existing definitions file. A definitions file should be stored in our own CorpusStudio\Queries directory, so that future projects can make use of it by choosing Definition/Add.

For this tutorial, however, we are going to copy it from its location on the W drive using the command Definition/Import.

The location of the definitions file is the following:

W:\Letteren\Talen en Culturen\Engelse taal en letterkunde\Erwin\CorpusStudio\OE+MEU.def

5. Make the first query

We are now ready to make a first query. This first query should select all IPs with the following characteristics:

· They contain an NP

· They contain a finite form of “to be”

· They contain a CP

Make a new query using Query/New. Give it the following characteristics:

The name: ipNP+BE+CP

This means that we have an IP containing an NP, BE and a CP.

The node: IP*

This means that we will be looking in all IP types.

Add to ignore: \**

This means that we will be ignoring nodes containing an *. Such nodes do not contain surface forms, but underlying forms.

Definitions: select OE+MEU
Remove nodes: uncheck this option

We want to be able to do things with nodes that are below the first layer.

Print indices: check this option

Indices are the numbers for each node. We would like to see them in the output.

Now fill in the new query so that it looks like the following example:

node: IP*

add_to_ignore: \**

remove_nodes: f

print_indices: t

define: OE+MEU.def

query: ( (IP* iDoms anynp) AND

(IP* iDoms finite_BE) AND

(IP* iDoms CP*)

)

For an explanation of keywords like iDoms (immediately dominates) see Help/Query languages/Corpus search. You will be directed to the CorpusSearch website. By leafing through the tutorial, you will find the explanations of these keywords.
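To make the idea of "immediately dominates" concrete, here is a minimal Python sketch, not part of CorpusStudio or CorpusSearch. The tree, the node labels and the patterns NP*, BE[PD]* and CP* are assumptions chosen for illustration; the real definitions of anynp and finite_BE are in OE+MEU.def, and Python's fnmatch wildcards only stand in for the query language's own pattern matching.

```python
from fnmatch import fnmatchcase

# A hypothetical IP, written as (label, [daughters]); leaves are plain words.
ip = ("IP-MAT", [
    ("NP-SBJ", [("PRO", ["it"])]),
    ("BEP", ["is"]),
    ("NP-OB1", [("NPR", ["John"])]),
    ("CP-CLF", [("C", ["that"])]),
])

def idoms(node, pattern):
    """True if an immediate daughter of node has a label matching pattern."""
    _, daughters = node
    return any(isinstance(d, tuple) and fnmatchcase(d[0], pattern)
               for d in daughters)

# The three conditions of the first query, with assumed label patterns.
print(all(idoms(ip, p) for p in ("NP*", "BE[PD]*", "CP*")))  # True
# PRO is a granddaughter of the IP, so it is not immediately dominated.
print(idoms(ip, "PRO"))  # False
```

The second call shows why the next query needs an extra step to reach the pronoun: only direct daughters count for iDoms.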

6. Make the second query

The first query only gives IPs which contain the basic ingredients for an it-cleft. We now want to take the output of this first query and make a finer selection to get the actual it-clefts. Make a new query called ipIT-cleft, which should look as follows:

node: IP*

add_to_ignore: \**

remove_nodes: f

print_indices: t

define: OE+MEU.def

query: ( ( (IP* iDoms [1]anynp) AND

([1]anynp iDoms pronoun) AND

(pronoun iDoms pronoun_it) ) AND

(IP* iDoms finite_BE) AND

(IP* iDoms [2]anynp) AND

(IP* iDoms CP*) AND

([1]anynp iPrecedes finite_BE) AND

(finite_BE iPrecedes [2]anynp) AND

([2]anynp iPrecedes CP*)

)

The first part of this query aims to find the it pronoun. The treebank coding of an “it” pronoun can be understood from the following example:

(49 NP-SBJ-2 (50 PRO hit))

The pronoun lexeme hit is inside a node of type PRO, and this node is inside one of type NP.
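The nesting in this example can be made visible with a small Python sketch that reads the bracketed notation into a nested structure. This is only an illustration under the assumption that every node has the form (index label content ...); it is not how CorpusStudio itself reads PSD files, and the function names are hypothetical.

```python
def tokenize(text):
    """Split a bracketed tree into '(', ')' and plain tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens, pos=0):
    """Parse one node starting at tokens[pos]; return (node, next position)."""
    assert tokens[pos] == "("
    index = int(tokens[pos + 1])   # the node number, e.g. 49
    label = tokens[pos + 2]        # the node type, e.g. NP-SBJ-2
    pos += 3
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)   # a nested node
            children.append(child)
        else:
            children.append(tokens[pos])      # a word (lexeme)
            pos += 1
    return {"index": index, "label": label, "children": children}, pos + 1

node, _ = parse(tokenize("(49 NP-SBJ-2 (50 PRO hit))"))
pro = node["children"][0]
print(node["label"], ">", pro["label"], ">", pro["children"][0])
# NP-SBJ-2 > PRO > hit
```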

The IP containing the NP with it should, again, contain the finite form of be, a second NP (indicated by the index [2]), and a CP.

The order of these basic elements is defined using the keyword iPrecedes, “immediately precedes”.
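The ordering requirement can be pictured with the following Python sketch. It uses a simplified reading in which the four elements must be adjacent daughters of the same IP; the definition of iPrecedes in CorpusSearch is more general, so this is an illustration rather than a reimplementation. The daughter labels and the patterns are assumptions, not taken from OE+MEU.def.

```python
from fnmatch import fnmatchcase

# Assumed label patterns for: it-NP, finite BE, second NP, CP.
ORDER = ["NP*", "BE[PD]*", "NP*", "CP*"]

def matches_in_order(labels, patterns):
    """True if some run of adjacent labels matches the patterns one by one."""
    n = len(patterns)
    for start in range(len(labels) - n + 1):
        window = labels[start:start + n]
        if all(fnmatchcase(lab, pat) for lab, pat in zip(window, patterns)):
            return True
    return False

# Hypothetical daughters of two IPs, in surface order.
cleft_like = ["NP-SBJ", "BEP", "NP-OB1", "CP-CLF"]
interrupted = ["NP-SBJ", "BEP", "ADVP", "NP-OB1", "CP-CLF"]

print(matches_in_order(cleft_like, ORDER))   # True
print(matches_in_order(interrupted, ORDER))  # False
```

The second IP contains all the ingredients, but the adverbial phrase between BE and the second NP breaks the "immediately precedes" chain.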

Be sure to check that the brackets in the queries are balanced!
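For long queries it can help to check the brackets mechanically. The sketch below is a hypothetical helper in Python, not a CorpusStudio feature. It counts only round brackets, so the square brackets of indices such as [1] are left alone.

```python
def brackets_balanced(query):
    """True if every '(' in the query has a matching ')' in the right order."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ')' without an earlier '('
                return False
    return depth == 0          # nothing left open at the end

good = "( (IP* iDoms [1]anynp) AND (IP* iDoms finite_BE) AND (IP* iDoms CP*) )"
bad = "( (IP* iDoms [1]anynp) AND (IP* iDoms finite_BE) AND (IP* iDoms CP*)"

print(brackets_balanced(good))  # True
print(brackets_balanced(bad))   # False
```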

7. Making the construction

The queries need to be executed in a particular order, and this is going to be defined in the Constructor Editor tab.

Choose Constructor/Add, and then set the following parameters:

The query should be ipNP+BE+CP
The Input should be Source.
The output should be left empty.

This is an intermediate output, and we won’t be using it. CorpusStudio will choose its own name for the output, and delete it afterwards.

The complement file should be unchecked.

It is good practice to fill in “Goal” and “Comments”.

Add the second query in the constructor using Constructor/Add, and then set the following parameters:

The query should be ipIT-cleft
The Input should be 1/out.
The output should be: Tutorial-ItCleft
The “complement file”-box should be checked.

For this tutorial we will also look at the complement file.

Again, fill in “Goal” and “Comments” as you see fit!

Make sure your corpus research project is saved again; we don’t want to lose our precious efforts!

8. Executing the queries

You have created queries in the query editor, the definition file they make use of has been loaded in this project, and the queries have been put in a particular order. You can now execute the queries by pressing F10 (or selecting Tools/Execute constructor).

· The “Output Monitor” tab page is opened, and the status bar shows that the first query is being executed.

o This may take quite some time, depending on how many source files you have defined!

o Execution may be interrupted using F11.

· After the status bar has shown that the second query is being executed, it will report that execution of the queries has finished.

o If no errors occurred, the “Command line” text box should show “Finished successfully”.

9. Checking the results

There are now several ways to look at the output that has been produced.

Go to the “Constructor Editor” and select the last line.

Press the button “Open output file”. This will take you to the tab page “Output files”, and open the output file produced for this last query.

Alternatively, open the tab page “Tree”.

Press F8 (View/Update tree).
Press Shift+F8 (View/Expand tree).
Double clicking nodes in this tree will lead you either

(1) to the query file definition in the query editor, or
(2) to the output file in the “Output Files” tab page.

Try double clicking on the 1/out output that has been produced after the first step defined in the Constructor Editor.

This should yield a message from CorpusStudio informing you that this output file does not exist. It has not been kept, since it is only a temporary output file. If you want CorpusStudio to keep the temporary files for this project, you should set that option in the “General” tab page.

Finding words or phrases in the definition files, the query files and the output files can be done using Ctrl+F (Edit/Find). Note that the find function works forward or backward, taking the current selection as a starting point.

10. Affiliation

Erwin R. Komen

Centre for Language Studies/ETC

Radboud University of Nijmegen

Box 9103, 6500 HD Nijmegen

E.Komen@Let.ru.nl