***********************************************************************
README document for NSPGate v 0.03 wrapper for Ngram Statistics Package
***********************************************************************

** Making sure that GATE sees NSPGate **
----------------------------------------

After running the INSTALL script as mentioned in the INSTALL document,

1. Open GATE
2. Click on the menu item: File -> Manage CREOLE plugins
3. The dialog box that comes up shows the list of plugins that GATE can see.
4. The list should have the following entry:
    - NSP_Wrapper
5. Each entry has 2 check boxes in front of it. Selecting the first check box
   causes GATE to load that plugin for the current session. Selecting the
   second check box causes GATE to load the plugin for all subsequent sessions.
   Select both check boxes for the "NSP_Wrapper" component and click on "OK".
6. The GATE "Messages" tab in the main window should show a message saying
   "CREOLE plugin loaded: NSP_Wrapper".

Once these steps complete successfully, GATE can load and use the new
NSP_Wrapper plugin.

** Using NSPGate **
-------------------

To create an application pipeline that uses NSPWrapper, execute the following
steps (after ensuring NSPGate is recognized by GATE as above):

1. Open GATE
2. Right click on the "Processing Resources" node in the left side Tree View
   and select the menu "New --> NSP Wrapper"
3. A dialog box pops up. Type a name for the resource in the text box, or
   leave it blank, and click "OK".
4. A new instance of the NSPWrapper is now created.
5. Similarly create a "New --> ANNIE English Tokenizer" resource.
6. Now right click on the "Applications" node and select "New --> Pipeline".
7. Again a dialog pops up, asking for the pipeline name. This can be left
   blank, or a suitable name for the pipeline can be specified. Click "OK".
8. Double click on the newly created pipeline node in the tree view. On the
   right hand side you will see two lists. One on the left contains the list
   of all available loaded resources that can be used in the pipeline. The
   one on the right contains those that are present in the pipeline. Since this
   is a newly created pipeline, the right hand side should be empty.
9. First add the tokenizer resource from the left list to the right list
   using the right arrow button in between the two lists. Then add the
   NSPWrapper resource to the right side list similarly.
10. Load a document in the GATE environment (in the "Language Resources" node).
11. Click on ANNIE tokenizer resource in the right list and assign the new
    document as a parameter to the tokenizer.
12. Similarly assign the document as a parameter to the NSPWrapper.
13. Set the parameters of the NSP Wrapper according to your specifications. The
    parameter names are exactly the same as command line options of count.pl
    and statistic.pl programs from the Ngram Statistics Package.

The NSP_Wrapper component works with two types of GATE language resources -
GATE document or GATE corpus (a collection of GATE documents). Default is
document. This can be changed from the combo box in the argument "Name"
column.

For now we will just use the document parameter.

The other parameters for NSPWrapper are simply place holders for Ngram
Statistics Package parameters. A brief description of each is given below;
refer to the NSP documentation for details.

"stop"

Description: Full path to the file that is to be used as a stop list
containing function words (such as "the", "of", "and", etc.). The file
specifies the stop words in Perl regular expression format.
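
As an illustration of what such a stop list does, the Python sketch below
reads stop words written one per line as slash-delimited regular expressions
and tests single tokens against them. The one-regex-per-line layout, the
sample patterns and the helper names (load_stop_patterns, is_stop_word) are
assumptions made for this example, and Python's "re" module only approximates
Perl regular expressions. NSP applies the stop list inside count.pl; this is
not that implementation.

```python
import re


def load_stop_patterns(lines):
    """Compile stop-list lines of the assumed form /regex/."""
    patterns = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and anything not delimited by slashes.
        if len(line) > 2 and line.startswith("/") and line.endswith("/"):
            patterns.append(re.compile(line[1:-1]))
    return patterns


def is_stop_word(token, patterns):
    """Return True if any stop pattern matches the token."""
    return any(p.search(token) for p in patterns)


# Hypothetical stop list contents, as they might appear in the file.
sample_stop_list = ["/^the$/", "/^of$/", "/^and$/"]
patterns = load_stop_patterns(sample_stop_list)

tokens = ["the", "history", "of", "theory"]
kept = [t for t in tokens if not is_stop_word(t, patterns)]
print(kept)
```

Anchoring each pattern with ^ and $ keeps "theory" from being treated as the
stop word "the".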


"remove"

Description: The frequency cut-off value for ngrams. Ngrams that occur
fewer times than this value are not considered significant. This corresponds
to the "--remove" option of the count.pl program in NSP: the removed ngrams
are ignored entirely and do not contribute to the frequency counts.
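
A minimal Python sketch of such a cut-off follows, assuming that ngrams
occurring fewer times than the cut-off are dropped before the total is
computed. The function name apply_remove_cutoff and the sample bigrams are
hypothetical; count.pl itself also maintains marginal frequencies, which this
sketch does not model.

```python
from collections import Counter


def apply_remove_cutoff(ngram_counts, cutoff):
    """Drop ngrams seen fewer than `cutoff` times and recompute the total."""
    kept = Counter({ng: c for ng, c in ngram_counts.items() if c >= cutoff})
    return kept, sum(kept.values())


# Hypothetical bigram occurrences from a small text.
bigrams = [("new", "york"), ("new", "york"), ("new", "york"),
           ("york", "city"), ("york", "city"), ("city", "hall")]
counts = Counter(bigrams)

kept, total = apply_remove_cutoff(counts, 2)
print(kept)
print(total)
```

With a cut-off of 2, the bigram that occurs once is dropped and the total
falls from 6 to 5, so the removed bigram no longer influences the counts.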

"ngram" (default 1)

Description: The size (n) of the ngrams to be annotated. Unigrams are
annotated as 1gram, bigrams as 2gram, trigrams as 3gram and so on.
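
The naming scheme can be sketched in Python as follows. The function
ngrams_with_type is a hypothetical helper for this example; it only shows how
the value of "ngram" relates to the annotation type name, not how NSPWrapper
creates annotations internally.

```python
def ngrams_with_type(tokens, n):
    """Return (annotation_type, ngram) pairs for every run of n tokens."""
    annotation_type = f"{n}gram"
    return [(annotation_type, tuple(tokens[i:i + n]))
            for i in range(len(tokens) - n + 1)]


tokens = ["ngram", "statistics", "package"]
for annotation_type, ngram in ngrams_with_type(tokens, 2):
    print(annotation_type, ngram)
```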


"statModule"

Description: The statistic module to be used for significance testing
of ngrams. It should be set in accordance with the "ngram" value above. If 2
is the ngram value, then only those statistic modules that support bigrams
can be used here. Similarly for trigrams.


"score"

Description: The statistical score cutoff to be used by the above
statistic module. It represents the confidence level desired in the
significance of the ngram.


"token"

Description: Full path to the file containing token definitions to be
used by count.pl.


"nontoken"

Description: Full path to the file containing non-token definitions, i.e.
patterns that count.pl should ignore in the analysis.
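
To show how token and non-token definitions could interact, here is a hedged
Python sketch. It assumes both files hold one slash-delimited regular
expression per line, that non-token matches are removed from the text first,
and that token patterns are then tried in the order given. The helper names
and sample patterns are hypothetical, and Python's "re" module stands in for
Perl regular expressions; see the NSP documentation for the exact rules.

```python
import re


def strip_slashes(line):
    """Return the regex inside /.../, or None if the line is not delimited."""
    line = line.strip()
    if len(line) > 2 and line.startswith("/") and line.endswith("/"):
        return line[1:-1]
    return None


def tokenize(text, token_lines, nontoken_lines=()):
    """Remove non-token matches, then collect token matches left to right."""
    for line in nontoken_lines:
        regex = strip_slashes(line)
        if regex:
            text = re.sub(regex, "", text)
    token_regexes = [r for r in map(strip_slashes, token_lines) if r]
    if not token_regexes:
        return []
    # Alternatives are tried in order at each position in the text.
    combined = re.compile("|".join(f"(?:{r})" for r in token_regexes))
    return [m.group(0) for m in combined.finditer(text)]


# Hypothetical file contents: words and punctuation are tokens, markup is not.
token_lines = [r"/\w+/", r"/[.,;:?!]/"]
nontoken_lines = [r"/<[^>]*>/"]

result = tokenize("<p>Hello, world!</p>", token_lines, nontoken_lines)
print(result)
```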

14. After setting all the appropriate parameter values, run the new
    pipeline and it should produce the required type of annotations in the
    default annotation set of the document. NSPWrapper *always* produces
    annotations in the default annotation set.
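
Because the parameter names match the command line options of count.pl, a set
of parameter values can be read as a command line. The Python sketch below
builds such a command line as a list of arguments without running it. The
function build_count_command, the file names and the parameter values are
hypothetical, and the argument order (options, then destination, then source)
follows the count.pl usage as documented in NSP. How NSPWrapper itself invokes
NSP is not described here, so treat this only as a reading aid.

```python
def build_count_command(params, destination, source):
    """Build an argument list for count.pl from wrapper-style parameters."""
    argv = ["count.pl"]
    # Only parameters that have been set are passed on as options.
    for name in ("ngram", "stop", "remove", "token", "nontoken"):
        value = params.get(name)
        if value is not None:
            argv += [f"--{name}", str(value)]
    argv += [destination, source]
    return argv


# Hypothetical parameter values, as they might be set in the GATE dialog.
params = {"ngram": 2, "remove": 2, "stop": "/path/to/stoplist.txt"}
command = build_count_command(params, "out.cnt", "input.txt")
print(" ".join(command))
```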


** Questions? **
----------------

Contact Mahesh Joshi (joshi031@d.umn.edu) or Ted Pedersen (tpederse@d.umn.edu)


** Copyright Notice **
----------------------

Copyright (C) 2005-06, 

Mahesh Joshi
University of Minnesota, Duluth

Ted Pedersen
University of Minnesota, Duluth

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.