***********************************************************************
                 README document for NSPGate v 0.03
                wrapper for Ngram Statistics Package
***********************************************************************

** Making sure that GATE sees NSPGate **
----------------------------------------

After running the INSTALL script as mentioned in the INSTALL document,

1. Open GATE

2. Click on the menu item: File -> Manage CREOLE plugins

3. The dialog box that comes up shows the list of plugins that GATE
   can see.

4. The list should have the following entry:
   - NSP_Wrapper

5. Each entry has 2 check-boxes in front of it. Selecting the first
   check box causes GATE to load that plugin for the current session.
   Selecting the second check box causes GATE to load the plugin for
   all subsequent sessions. Select both check boxes for the
   "NSP_Wrapper" component and click on "OK".

6. The GATE "Messages" tab in the main window should show a message
   saying "CREOLE plugin loaded: NSP_Wrapper".

Successful execution of these steps confirms that GATE can communicate
with the new NSP_Wrapper plugin.

** Using NSPGate **
-------------------

To create an application pipeline that uses NSPWrapper, execute the
following steps (after ensuring NSPGate is recognized by GATE as
above):

1. Open GATE

2. Right click on the "Processing Resources" node in the left side
   Tree View and select the menu "New --> NSP Wrapper"

3. A dialog box pops up. Type a name for the resource in the text box,
   or leave it blank, and click "OK".

4. A new instance of the NSPWrapper is now created.

5. Similarly create a "New --> ANNIE English Tokenizer" resource.

6. Now right click on the "Applications" node and select
   "New --> Pipeline".

7. Again a dialog pops up, asking for the pipeline name. This can be
   left blank, or a suitable name for the pipeline can be specified.
   Click "OK".

8. Double click on the newly created pipeline node in the tree view.
   On the right hand side you will see two lists.
   The one on the left contains the list of all available loaded
   resources that can be used in the pipeline. The one on the right
   contains those that are present in the pipeline. Since this is a
   newly created pipeline, the right hand list should be empty.

9. First add the tokenizer resource from the left list to the right
   list using the right arrow button in between the two lists. Then
   add the NSPWrapper resource to the right side list similarly.

10. Load a document in the GATE environment (in the "Language
    Resources" node).

11. Click on the ANNIE tokenizer resource in the right list and assign
    the new document as a parameter to the tokenizer.

12. Similarly assign the document as a parameter to the NSPWrapper.

13. Set the parameters of the NSP Wrapper according to your
    specifications. The parameter names are exactly the same as the
    command line options of the count.pl and statistic.pl programs
    from the Ngram Statistics Package.

    The NSP_Wrapper component works with two types of GATE language
    resources: a GATE document or a GATE corpus (a collection of GATE
    documents). The default is document. This can be changed from the
    combo box in the argument "Name" column. For now we will just use
    the document parameter.

    The other parameters for NSPWrapper are simply place holders for
    Ngram Statistics Package parameters. A brief description of each
    is given below; refer to the NSP documentation for details.

    "stop"
    Description: Full path to the file that is to be used as a stop
    list containing function words (such as "the", "of", "and", etc.).
    This file specifies the stop words in Perl regular expression
    format.

    "remove"
    Description: The frequency cut-off value for ngrams, below which
    they are not considered significant. This corresponds to the
    "--remove" option of the count.pl program in NSP: ngrams that
    occur fewer times than this value are ignored entirely and do not
    contribute to the frequency counts.

    "ngram" (default 1)
    Description: The ngrams to be annotated. Unigrams are annotated as
    1gram, bigrams as 2gram, trigrams as 3gram and so on.

    "statModule"
    Description: The statistic module to be used for significance
    testing of ngrams. It should be set in accordance with the "ngram"
    value above. If 2 is the ngram value, then only those statistical
    modules that support bigrams can be used here. Similarly for
    trigrams.

    "score"
    Description: The statistical score cutoff to be used by the above
    statistic module. Represents the confidence level desired in the
    significance of the ngram.

    "token"
    Description: Full path to the file containing token definitions to
    be used by count.pl.

    "nontoken"
    Description: Full path to the file containing non-token
    definitions, which count.pl uses to decide what to ignore in the
    analysis.

14. After setting all the appropriate parameter values, run the new
    pipeline. It should produce the required type of annotations in
    the default annotation set of the document. NSPWrapper *always*
    produces annotations in the default annotation set.

** Questions? **
----------------

Contact Mahesh Joshi (joshi031@d.umn.edu) or
Ted Pedersen (tpederse@d.umn.edu)

** Copyright Notice **
----------------------

Copyright (C) 2005-06,
Mahesh Joshi, University of Minnesota, Duluth
Ted Pedersen, University of Minnesota, Duluth

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.