Package weka.core.tokenizers
Class NGramTokenizer
- java.lang.Object
  - weka.core.tokenizers.Tokenizer
    - weka.core.tokenizers.CharacterDelimitedTokenizer
      - weka.core.tokenizers.NGramTokenizer
- All Implemented Interfaces:
  java.io.Serializable, java.util.Enumeration, OptionHandler, RevisionHandler
public class NGramTokenizer extends CharacterDelimitedTokenizer
Splits a string into n-grams, with configurable minimum and maximum n-gram sizes. Valid options are:
-delimiters <value> The delimiters to use (default ' \r\n\t.,;:'"()?!').
-max <int> The max size of the Ngram (default = 3).
-min <int> The min size of the Ngram (default = 1).
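To make the effect of -min and -max concrete, here is a minimal, self-contained sketch that does not use Weka itself. The class name NGramSketch and its ngrams method are hypothetical. Splitting with java.util.StringTokenizer, clamping the max size to the number of words, and emitting the largest n-grams first are assumptions made for illustration, not a statement of how NGramTokenizer is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class NGramSketch {

    /** Default delimiter set quoted in the option description. */
    public static final String DEFAULT_DELIMITERS = " \r\n\t.,;:'\"()?!";

    /**
     * Splits text on the delimiter characters, then joins runs of
     * adjacent words into n-grams for every n from max down to min.
     */
    public static List<String> ngrams(String text, String delimiters, int min, int max) {
        List<String> words = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }

        List<String> result = new ArrayList<>();
        // Assumption: a max larger than the number of words is clamped.
        int largest = Math.min(max, words.size());
        for (int n = largest; n >= Math.max(min, 1); n--) {
            for (int start = 0; start + n <= words.size(); start++) {
                result.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Uses the documented defaults: min = 1, max = 3.
        for (String gram : ngrams("the quick brown fox", DEFAULT_DELIMITERS, 1, 3)) {
            System.out.println(gram);
        }
    }
}
```

With four input words and the default sizes, this sketch yields two 3-grams, three 2-grams and four 1-grams.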
- Version:
- $Revision: 1.4 $
- Author:
- Sebastian Germesin (sebastian.germesin@dfki.de), FracPete (fracpete at waikato dot ac dot nz)
- See Also:
- Serialized Form
Constructor Summary
NGramTokenizer()
Method Summary
int getNGramMaxSize()
Gets the max N of the NGram.
int getNGramMinSize()
Gets the min N of the NGram.
java.lang.String[] getOptions()
Gets the current option settings for the OptionHandler.
java.lang.String getRevision()
Returns the revision string.
java.lang.String globalInfo()
Returns a string describing the tokenizer.
boolean hasMoreElements()
Returns true if there are more elements available.
java.util.Enumeration listOptions()
Returns an enumeration of all the available options.
static void main(java.lang.String[] args)
Runs the tokenizer with the given options and strings to tokenize.
java.lang.Object nextElement()
Returns N-grams and also (N-1)-grams and so on, down to 1-grams.
java.lang.String NGramMaxSizeTipText()
Returns the tip text for this property.
java.lang.String NGramMinSizeTipText()
Returns the tip text for this property.
void setNGramMaxSize(int value)
Sets the max size of the Ngram.
void setNGramMinSize(int value)
Sets the min size of the Ngram.
void setOptions(java.lang.String[] options)
Parses a given list of options.
void tokenize(java.lang.String s)
Sets the string to tokenize.
Methods inherited from class weka.core.tokenizers.CharacterDelimitedTokenizer
delimitersTipText, getDelimiters, setDelimiters
-
Methods inherited from class weka.core.tokenizers.Tokenizer
runTokenizer, tokenize
Method Detail
-
globalInfo
public java.lang.String globalInfo()
Returns a string describing the tokenizer.
- Specified by:
globalInfo in class Tokenizer
- Returns:
- a description suitable for displaying in the Explorer/Experimenter GUI
-
listOptions
public java.util.Enumeration listOptions()
Returns an enumeration of all the available options.
- Specified by:
listOptions in interface OptionHandler
- Overrides:
listOptions in class CharacterDelimitedTokenizer
- Returns:
- an enumeration of all available options.
-
getOptions
public java.lang.String[] getOptions()
Gets the current option settings for the OptionHandler.
- Specified by:
getOptions in interface OptionHandler
- Overrides:
getOptions in class CharacterDelimitedTokenizer
- Returns:
- the list of current option settings as an array of strings
-
setOptions
public void setOptions(java.lang.String[] options) throws java.lang.Exception
Parses a given list of options. Valid options are:
-delimiters <value> The delimiters to use (default ' \r\n\t.,;:'"()?!').
-max <int> The max size of the Ngram (default = 3).
-min <int> The min size of the Ngram (default = 1).
- Specified by:
setOptions in interface OptionHandler
- Overrides:
setOptions in class CharacterDelimitedTokenizer
- Parameters:
options - the list of options as an array of strings
- Throws:
java.lang.Exception - if an option is not supported
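The option format can be illustrated with a small stand-alone sketch that parses the same three flags from a string array. OptionSketch and its fields are hypothetical names, and the sketch does not call Weka. Resetting unspecified options to the documented defaults and rejecting unknown flags with an exception are assumptions chosen to match the description here, not confirmed details of the Weka implementation.

```java
class OptionSketch {
    /** Default delimiter set quoted in the option description. */
    static final String DEFAULT_DELIMITERS = " \r\n\t.,;:'\"()?!";

    String delimiters = DEFAULT_DELIMITERS;
    int max = 3;
    int min = 1;

    /** Reads "-flag value" pairs; any other flag is rejected. */
    void setOptions(String[] options) throws Exception {
        // Assumption: options that are not given fall back to their defaults.
        delimiters = DEFAULT_DELIMITERS;
        max = 3;
        min = 1;
        for (int i = 0; i < options.length; i += 2) {
            String flag = options[i];
            if (i + 1 >= options.length) {
                throw new Exception("Missing value for option: " + flag);
            }
            String value = options[i + 1];
            switch (flag) {
                case "-delimiters": delimiters = value; break;
                case "-max": max = Integer.parseInt(value); break;
                case "-min": min = Integer.parseInt(value); break;
                default: throw new Exception("Unsupported option: " + flag);
            }
        }
    }

    /** Returns the current settings in the same "-flag value" form. */
    String[] getOptions() {
        return new String[] {
            "-delimiters", delimiters,
            "-max", Integer.toString(max),
            "-min", Integer.toString(min)
        };
    }

    public static void main(String[] args) throws Exception {
        OptionSketch sketch = new OptionSketch();
        sketch.setOptions(new String[] {"-max", "2", "-min", "2"});
        System.out.println("max=" + sketch.max + " min=" + sketch.min);
    }
}
```

The array returned by getOptions in this sketch can be passed straight back to setOptions, which is the round trip the option-handling methods are meant to support.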
-
getNGramMaxSize
public int getNGramMaxSize()
Gets the max N of the NGram.
- Returns:
- the size (N) of the NGram.
-
setNGramMaxSize
public void setNGramMaxSize(int value)
Sets the max size of the Ngram.
- Parameters:
value - the size of the NGram.
-
NGramMaxSizeTipText
public java.lang.String NGramMaxSizeTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
setNGramMinSize
public void setNGramMinSize(int value)
Sets the min size of the Ngram.
- Parameters:
value - the size of the NGram.
-
getNGramMinSize
public int getNGramMinSize()
Gets the min N of the NGram.
- Returns:
- the size (N) of the NGram.
-
NGramMinSizeTipText
public java.lang.String NGramMinSizeTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
hasMoreElements
public boolean hasMoreElements()
Returns true if there are more elements available.
- Specified by:
hasMoreElements in interface java.util.Enumeration
- Specified by:
hasMoreElements in class Tokenizer
- Returns:
- true if there are more elements available
-
nextElement
public java.lang.Object nextElement()
Returns N-grams and also (N-1)-grams and so on, down to 1-grams.
- Specified by:
nextElement in interface java.util.Enumeration
- Specified by:
nextElement in class Tokenizer
- Returns:
- the next element
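The hasMoreElements/nextElement pair follows the usual java.util.Enumeration protocol. The sketch below shows one way such an enumeration could step through n-gram sizes lazily. It is written without Weka; the class NGramEnumerationSketch, its fields, and the order in which grams are produced (all grams of the largest size first, then the next smaller size, stopping at the min size) are assumptions for illustration only.

```java
import java.util.Enumeration;
import java.util.NoSuchElementException;

class NGramEnumerationSketch implements Enumeration<String> {
    private final String[] words;
    private final int min;
    private int n;         // size of the n-grams currently being produced
    private int position;  // index of the first word of the next n-gram

    NGramEnumerationSketch(String[] words, int min, int max) {
        this.words = words;
        this.min = Math.max(min, 1);
        // Assumption: a max larger than the number of words is clamped.
        this.n = Math.min(max, words.length);
    }

    @Override
    public boolean hasMoreElements() {
        // Sizes count down from max; below min there is nothing left.
        return n >= min;
    }

    @Override
    public String nextElement() {
        if (!hasMoreElements()) {
            throw new NoSuchElementException();
        }
        StringBuilder gram = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                gram.append(' ');
            }
            gram.append(words[position + i]);
        }
        position++;
        // Once the window would run past the last word, restart with n - 1.
        if (position + n > words.length) {
            position = 0;
            n--;
        }
        return gram.toString();
    }

    public static void main(String[] args) {
        Enumeration<String> grams =
            new NGramEnumerationSketch(new String[] {"a", "b", "c"}, 1, 2);
        while (grams.hasMoreElements()) {
            System.out.println(grams.nextElement());
        }
    }
}
```

Callers drive the enumeration with a plain while loop, as in the main method of the sketch; no list of all n-grams is built up front.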
-
tokenize
public void tokenize(java.lang.String s)
Sets the string to tokenize. Tokenization happens immediately.
-
getRevision
public java.lang.String getRevision()
Returns the revision string.
- Returns:
- the revision
-
main
public static void main(java.lang.String[] args)
Runs the tokenizer with the given options and strings to tokenize. The tokens are printed to stdout.
- Parameters:
args - the command-line options and strings to tokenize