Package nltk_lite :: Package contrib :: Module concord :: Class Aggregator

Class Aggregator

object --+
         |
        Aggregator

Class for aggregating and summarising corpus concordance data.

This class allows one or more sets of concordance data to be summarised and displayed. This is useful for corpus linguistic tasks such as counting the occurrences of a particular word and its different POS tags in a given corpus, or comparing these frequencies across different corpora. It creates a FreqDist for each set of concordance data, counting how often each unique entry appears in it.

An example of how to use this class to show the frequency of the five most common digrams of the form "must/md X/Y" in the Brown Corpus sections a and g:

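   # Imports assume the nltk_lite package layout named in this page's module path.
   from nltk_lite.corpora import brown
   from nltk_lite.contrib.concord import IndexConcordance, Aggregator

   # Build one concordance per Brown section; keep "must/md" plus one token of right context.
   concA = IndexConcordance(list(brown.tagged('a')))
   rawA = concA.raw(middleRegexp="^must/md$", leftContextLength=0, rightContextLength=1)
   concG = IndexConcordance(list(brown.tagged('g')))
   rawG = concG.raw(middleRegexp="^must/md$", leftContextLength=0, rightContextLength=1)

   # Register both result sets, then print raw counts for the five most frequent entries of each.
   agg = Aggregator()
   agg.add(rawA, "Brown Corpus A")
   agg.add(rawG, "Brown Corpus G")
   agg.formatted(showFirstX=5, normalise=False, countOther=False, showTotal=False)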

   Output:

   Brown Corpus A
   ------------------------------
    must/md be/be          17
    must/md have/hv        5
    must/md not/*          3
    must/md play/vb        2
    must/md ''/''          1

   Brown Corpus G
   ------------------------------
    must/md be/be          38
    must/md have/hv        21
    must/md ,/,            6
    must/md not/*          5
    must/md always/rb      3

Instance Methods
 
__init__(self, inputList=None)
Constructor.
 
add(self, raw, name)
Adds the given set of raw concordance output to the aggregator.
 
remove(self, name)
Removes all sets of raw concordance output with the given name.
 
formatted(self, useWord=True, usePOS=True, normalise=True, threshold=-1, showFirstX=-1, decimalPlaces=4, countOther=True, showTotal=True)
Displays formatted concordance summary information.
list, number
raw(self, useWord=True, usePOS=True)
Generates raw summary information.
 
format(self, output, maxKeyLength=20, threshold=-1, showFirstX=-1, decimalPlaces=4, normalise=True, countOther=True, showTotal=True)
Displays concordance summary information.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Class Variables
  _OTHER_TEXT = '<OTHER>'
  _TOTAL_TEXT = '<TOTAL>'
Properties

Inherited from object: __class__

Method Details

__init__(self, inputList=None)
(Constructor)

Constructor.

Parameters:
  • inputList (list) - List of (raw concordance data, name) tuples to be entered into the aggregator. Defaults to None.
Overrides: object.__init__

add(self, raw, name)

Adds the given set of raw concordance output to the aggregator.

Parameters:
  • raw (list) - Raw concordance data (produced by IndexConcordance.raw()). Expects a list of ([left context], target word, [right context], target word sentence number) tuples.
  • name (string) - Name to associate with the set of data.
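
The tuple layout expected by add() can be pictured with a small stand-in data set. This is only a sketch: it assumes each token is a (word, tag) pair, which this page does not confirm, and `ConcordanceEntry` is a hypothetical helper name rather than part of the module.

```python
from typing import List, NamedTuple, Tuple

Token = Tuple[str, str]  # assumed (word, POS tag) representation


class ConcordanceEntry(NamedTuple):
    """Hypothetical mirror of one raw concordance tuple."""
    left: List[Token]        # left context tokens
    target: Token            # the matched target word
    right: List[Token]       # right context tokens
    sentence_number: int     # sentence containing the target word


# Two hand-written entries shaped like a search for "must/md" with
# leftContextLength=0 and rightContextLength=1 (sentence numbers are made up).
sample_raw = [
    ConcordanceEntry([], ("must", "md"), [("be", "be")], 12),
    ConcordanceEntry([], ("must", "md"), [("have", "hv")], 40),
]

# Each entry unpacks like the plain 4-tuple described in the parameter list.
for left, target, right, sentence_number in sample_raw:
    print(sentence_number, left, target, right)
```

A list shaped like `sample_raw`, together with a name, is the kind of pair that add() takes.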

remove(self, name)

Removes all sets of raw concordance output with the given name.

Parameters:
  • name (string) - Name of data set to remove.

formatted(self, useWord=True, usePOS=True, normalise=True, threshold=-1, showFirstX=-1, decimalPlaces=4, countOther=True, showTotal=True)

Displays formatted concordance summary information.

This is a convenience method that combines the options of raw() and format(). Unless you need raw output, this is probably the most useful method.

Parameters:
  • useWord (boolean) - Include the words in the count. Defaults to True.
  • usePOS (boolean) - Include the POS tags in the count. Defaults to True.
  • normalise (boolean) - If true, normalises the frequencies for each set of concordance output by dividing each key's frequency by the total number of samples in that concordance's FreqDist. Allows easier comparison of results between data sets. Care must be taken when combining this option with the threshold option: normalised frequencies never exceed 1, so any threshold of 1 or more will prevent any output being displayed. Defaults to True.
  • threshold (number) - Frequency display threshold. Results below this frequency will not be displayed. If less than 0, everything will be displayed. Defaults to -1.
  • showFirstX (number) - Only show this many results, starting with the most frequent. If less than 0, everything will be displayed. Defaults to -1.
  • decimalPlaces (integer) - Number of decimal places of accuracy to display. Used when displaying non-integers with the normalise option. Defaults to 4.
  • countOther (boolean) - If true, any samples not shown (because their frequency is below the given threshold, or because they fall after the number of results specified by the showFirstX argument) will be combined into one sample. This sample's frequency is the sum of all unshown samples' frequencies. Defaults to True.
  • showTotal (boolean) - If true, prints the sum of all frequencies (of the entire FreqDist, not just of the samples displayed). Defaults to True.
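
The way these display options interact can be sketched with the standard library. This is not the module's implementation: `select_rows` is a hypothetical helper, a `collections.Counter` stands in for a FreqDist, and the sketch assumes the threshold is compared against the normalised value when normalise is true (as the warning above implies) and that the `<OTHER>` and `<TOTAL>` labels come from the class variables listed earlier.

```python
from collections import Counter

OTHER_TEXT = "<OTHER>"  # mirrors the _OTHER_TEXT class variable
TOTAL_TEXT = "<TOTAL>"  # mirrors the _TOTAL_TEXT class variable


def select_rows(counts, normalise=True, threshold=-1, show_first_x=-1,
                count_other=True, show_total=True):
    """Hypothetical helper: return the (key, value) rows that would be shown."""
    total = sum(counts.values())

    def scaled(freq):
        # Relative frequency when normalising, raw count otherwise.
        return freq / total if normalise and total else freq

    rows, hidden = [], 0
    for rank, (key, freq) in enumerate(counts.most_common()):
        below_threshold = threshold >= 0 and scaled(freq) < threshold
        past_limit = show_first_x >= 0 and rank >= show_first_x
        if below_threshold or past_limit:
            hidden += freq  # accumulate raw counts of unshown samples
        else:
            rows.append((key, scaled(freq)))
    if count_other and hidden:
        rows.append((OTHER_TEXT, scaled(hidden)))
    if show_total:
        # Total of the whole distribution, not only of the rows shown.
        rows.append((TOTAL_TEXT, scaled(total)))
    return rows


toy = Counter({"a b": 6, "a c": 3, "a d": 1})  # made-up counts
print(select_rows(toy, show_first_x=2))
print(select_rows(toy, threshold=1))  # every normalised value is below 1
```

In the second call every sample is hidden, which illustrates why a threshold of 1 or more is a poor match for normalised output.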

raw(self, useWord=True, usePOS=True)

Generates raw summary information.

Creates a FreqDist for each set of concordance output and uses it to count the frequency of each line in it. The concordance output is flattened from lists of tokens to strings, as lists cannot be hashed. The list of FreqDists is returned, as well as the length of the longest string (used for formatted display).

Parameters:
  • useWord (boolean) - Include the words in the count. Defaults to True.
  • usePOS (boolean) - Include the POS tags in the count. Defaults to True.
Returns: list, number
A list of (FreqDist, name) pairs, and the length of the longest key in all the FreqDists.
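
The flatten-and-count step described above might look roughly like the following. It is a sketch under stated assumptions, not the module's code: `summarise` is a hypothetical name, tokens are assumed to be (word, tag) pairs joined as "word/tag", and `collections.Counter` stands in for FreqDist.

```python
from collections import Counter


def summarise(datasets, use_word=True, use_pos=True):
    """Count flattened concordance lines for each (entries, name) pair."""

    def flatten(token):
        # Keep the word, the tag, or both, depending on the flags.
        word, tag = token
        parts = []
        if use_word:
            parts.append(word)
        if use_pos:
            parts.append(tag)
        return "/".join(parts)

    output, max_key_length = [], 0
    for entries, name in datasets:
        counts = Counter()
        for left, target, right, _sentence_number in entries:
            # Lists are unhashable, so join the tokens into one string key.
            key = " ".join(flatten(t) for t in [*left, target, *right])
            counts[key] += 1
            max_key_length = max(max_key_length, len(key))
        output.append((counts, name))
    return output, max_key_length


entries = [  # made-up entries in the raw concordance layout
    ([], ("must", "md"), [("be", "be")], 1),
    ([], ("must", "md"), [("be", "be")], 7),
    ([], ("must", "md"), [("not", "*")], 9),
]
output, longest = summarise([(entries, "toy sample")])
print(output[0][0].most_common(), longest)
```

Returning the longest key length alongside the counts lets a later display step pad every key to the same column width.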

format(self, output, maxKeyLength=20, threshold=-1, showFirstX=-1, decimalPlaces=4, normalise=True, countOther=True, showTotal=True)

Displays concordance summary information.

Formats and displays information produced by raw().

Parameters:
  • output (list) - List of (FreqDist, name) pairs (as produced by raw()).
  • maxKeyLength (number) - Length of longest key. Defaults to 20.
  • normalise (boolean) - If true, normalises the frequencies for each set of concordance output by dividing each key's frequency by the total number of samples in that concordance's FreqDist. Allows easier comparison of results between data sets. Care must be taken when combining this option with the threshold option: normalised frequencies never exceed 1, so any threshold of 1 or more will prevent any output being displayed. Defaults to True.
  • threshold (number) - Frequency display threshold. Results below this frequency will not be displayed. If less than 0, everything will be displayed. Defaults to -1.
  • showFirstX (number) - Only show this many results, starting with the most frequent. If less than 0, everything will be displayed. Defaults to -1.
  • decimalPlaces (integer) - Number of decimal places of accuracy to display. Used when displaying non-integers with the normalise option. Defaults to 4.
  • countOther (boolean) - If true, any samples not shown (because their frequency is below the given threshold, or because they fall after the number of results specified by the showFirstX argument) will be combined into one sample. This sample's frequency is the sum of all unshown samples' frequencies. Defaults to True.
  • showTotal (boolean) - If true, prints the sum of all frequencies (of the entire FreqDist, not just of the samples displayed). Defaults to True.
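
How maxKeyLength and decimalPlaces could shape the printed table is sketched below. The layout is modelled on the sample output in the class description (a name, a dashed rule, then padded keys and values); `render` and its `rule_width` parameter are hypothetical, and the real method may space its columns differently.

```python
def render(name, rows, max_key_length=20, decimal_places=4, rule_width=30):
    """Hypothetical helper: lay out (key, value) rows as a text table."""
    lines = [name, "-" * rule_width]
    for key, value in rows:
        if isinstance(value, float):
            # Normalised frequencies are rounded to the requested precision.
            shown = f"{value:.{decimal_places}f}"
        else:
            # Raw counts are printed unchanged.
            shown = str(value)
        # Pad every key to the same width so the values line up in a column.
        lines.append(f" {key.ljust(max_key_length)}  {shown}")
    return "\n".join(lines)


# Made-up rows: one normalised value and one integer total.
print(render("toy sample", [("a b", 0.6), ("<TOTAL>", 10)],
             max_key_length=8, decimal_places=2))
```

Padding to the longest key, rather than to a fixed width, keeps the value column aligned without truncating long entries.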