Class CharsToNameCanonicalizer


  • public final class CharsToNameCanonicalizer
    extends java.lang.Object
    This class is a kind of specialized type-safe Map, from char array to String value. Specialization means that in addition to type-safety and specific access patterns (key char array, Value optionally interned String; values added on access if necessary), and that instances are meant to be used concurrently, but by using well-defined mechanisms to obtain such concurrently usable instances. Main use for the class is to store symbol table information for things like compilers and parsers; especially when number of symbols (keywords) is limited.

    For optimal performance, usage pattern should be one where matches should be very common (especially after "warm-up"), and as with most hash-based maps/sets, that hash codes are uniformly distributed. Also, collisions are slightly more expensive than with HashMap or HashSet, since hash codes are not used in resolving collisions; that is, equals() comparison is done with all symbols in same bucket index.
    Finally, rehashing is also more expensive, as hash codes are not stored; rehashing requires all entries' hash codes to be recalculated. Reason for not storing hash codes is reduced memory usage, hoping for better memory locality.

    Usual usage pattern is to create a single "master" instance, and either use that instance in sequential fashion, or to create derived "child" instances, which after use, are asked to return possible symbol additions to master instance. In either case benefit is that symbol table gets initialized so that further uses are more efficient, as eventually all symbols needed will already be in symbol table. At that point no more Symbol String allocations are needed, nor changes to symbol table itself.

    Note that while individual SymbolTable instances are NOT thread-safe (much like generic collection classes), concurrently used "child" instances can be freely used without synchronization. However, using master table concurrently with child instances can only be done if access to master instance is read-only (i.e. no modifications done).

    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      (package private) static class  CharsToNameCanonicalizer.Bucket
      This class is a symbol table entry.
      private static class  CharsToNameCanonicalizer.TableInfo
      Immutable value class used for sharing information as efficiently as possible, by only require synchronization of reference manipulation but not access to contents.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private CharsToNameCanonicalizer.Bucket[] _buckets
      Overflow buckets; if primary doesn't match, lookup is done from here.
      private boolean _canonicalize
      Whether any canonicalization should be attempted (whether using intern or not.
      private int _flags  
      private boolean _hashShared
      Flag that indicates whether underlying data structures for the main hash area are shared or not.
      private int _indexMask
      Mask used to get index from hash values; equal to _buckets.length - 1, when _buckets.length is a power of two.
      private int _longestCollisionList
      We need to keep track of the longest collision list; this is needed both to indicate problems with attacks and to allow flushing for other cases.
      private java.util.BitSet _overflows
      Lazily constructed structure that is used to keep track of collision buckets that have overflowed once: this is used to detect likely attempts at denial-of-service attacks that uses hash collisions.
      private CharsToNameCanonicalizer _parent
      Sharing of learnt symbols is done by optional linking of symbol table instances with their parents.
      private int _seed
      Seed value we use as the base to make hash codes non-static between different runs, but still stable for lifetime of a single symbol table instance.
      private int _size
      Current size (number of entries); needed to know if and when rehash.
      private int _sizeThreshold
      Limit that indicates maximum size this instance can hold before it needs to be expanded and rehashed.
      private java.lang.String[] _symbols
      Primary matching symbols; it's expected most match occur from here.
      private java.util.concurrent.atomic.AtomicReference<CharsToNameCanonicalizer.TableInfo> _tableInfo
      Member that is only used by the root table instance: root passes immutable state info child instances, and children may return new state if they add entries to the table.
      private static int DEFAULT_T_SIZE
      Default initial table size.
      static int HASH_MULT  
      (package private) static int MAX_COLL_CHAIN_LENGTH
      Also: to thwart attacks based on hash collisions (which may or may not be cheap to calculate), we will need to detect "too long" collision chains.
      (package private) static int MAX_ENTRIES_FOR_REUSE
      Let's only share reasonably sized symbol tables.
      private static int MAX_T_SIZE
      Let's not expand symbol tables past some maximum size; this should protected against OOMEs caused by large documents with unique (~= random) names.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      private java.lang.String _addSymbol​(char[] buffer, int start, int len, int h, int index)  
      private java.lang.String _findSymbol2​(char[] buffer, int start, int len, CharsToNameCanonicalizer.Bucket b)  
      private void _handleSpillOverflow​(int bucketIndex, CharsToNameCanonicalizer.Bucket newBucket, int mainIndex)
      Method called when an overflow bucket has hit the maximum expected length: this may be a case of DoS attack.
      int _hashToIndex​(int rawHash)
      Helper method that takes in a "raw" hash value, shuffles it as necessary, and truncates to be used as the index.
      private static int _thresholdSize​(int hashAreaSize)  
      int bucketCount()
      Method for checking number of primary hash buckets this symbol table uses.
      int calcHash​(char[] buffer, int start, int len)
      Implementation of a hashing method for variable length Strings.
      int calcHash​(java.lang.String key)  
      int collisionCount()
      Method mostly needed by unit tests; calculates number of entries that are in collision list.
      private void copyArrays()
      Method called when copy-on-write is needed; generally when first change is made to a derived symbol table.
      static CharsToNameCanonicalizer createRoot()
      Method called to create root canonicalizer for a JsonFactory instance.
      protected static CharsToNameCanonicalizer createRoot​(int seed)  
      java.lang.String findSymbol​(char[] buffer, int start, int len, int h)  
      int hashSeed()  
      CharsToNameCanonicalizer makeChild​(int flags)
      "Factory" method; will create a new child instance of this symbol table.
      int maxCollisionLength()
      Method mostly needed by unit tests; calculates length of the longest collision chain.
      boolean maybeDirty()  
      private void mergeChild​(CharsToNameCanonicalizer.TableInfo childState)
      Method that allows contents of child table to potentially be "merged in" with contents of this symbol table.
      private void rehash()
      Method called when size (number of entries) of symbol table grows so big that load factor is exceeded.
      void release()
      Method called by the using code to indicate it is done with this instance.
      protected void reportTooManyCollisions​(int maxLen)  
      int size()  
      protected void verifyInternalConsistency()
      Diagnostics method that will verify that internal data structures are consistent; not meant as user-facing method but only for test suites and possible troubleshooting.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • DEFAULT_T_SIZE

        private static final int DEFAULT_T_SIZE
        Default initial table size. Shouldn't be miniscule (as there's cost to both array realloc and rehashing), but let's keep it reasonably small. For systems that properly reuse factories it doesn't matter either way; but when recreating factories often, initial overhead may dominate.
        See Also:
        Constant Field Values
      • MAX_T_SIZE

        private static final int MAX_T_SIZE
        Let's not expand symbol tables past some maximum size; this should protected against OOMEs caused by large documents with unique (~= random) names.
        See Also:
        Constant Field Values
      • MAX_ENTRIES_FOR_REUSE

        static final int MAX_ENTRIES_FOR_REUSE
        Let's only share reasonably sized symbol tables. Max size set to 3/4 of 16k; this corresponds to 64k main hash index. This should allow for enough distinct names for almost any case.
        See Also:
        Constant Field Values
      • MAX_COLL_CHAIN_LENGTH

        static final int MAX_COLL_CHAIN_LENGTH
        Also: to thwart attacks based on hash collisions (which may or may not be cheap to calculate), we will need to detect "too long" collision chains. Let's start with static value of 100 entries for the longest legal chain.

        Note: longest chain we have been able to produce without malicious intent has been 38 (with "com.fasterxml.jackson.core.main.TestWithTonsaSymbols"); our setting should be reasonable here.

        Since:
        2.1
        See Also:
        Constant Field Values
      • _parent

        private final CharsToNameCanonicalizer _parent
        Sharing of learnt symbols is done by optional linking of symbol table instances with their parents. When parent linkage is defined, and child instance is released (call to release), parent's shared tables may be updated from the child instance.
      • _tableInfo

        private final java.util.concurrent.atomic.AtomicReference<CharsToNameCanonicalizer.TableInfo> _tableInfo
        Member that is only used by the root table instance: root passes immutable state info child instances, and children may return new state if they add entries to the table. Child tables do NOT use the reference.
      • _seed

        private final int _seed
        Seed value we use as the base to make hash codes non-static between different runs, but still stable for lifetime of a single symbol table instance. This is done for security reasons, to avoid potential DoS attack via hash collisions.
        Since:
        2.1
      • _flags

        private final int _flags
      • _canonicalize

        private boolean _canonicalize
        Whether any canonicalization should be attempted (whether using intern or not.

        NOTE: non-final since we may need to disable this with overflow.

      • _symbols

        private java.lang.String[] _symbols
        Primary matching symbols; it's expected most match occur from here.
      • _buckets

        private CharsToNameCanonicalizer.Bucket[] _buckets
        Overflow buckets; if primary doesn't match, lookup is done from here.

        Note: Number of buckets is half of number of symbol entries, on assumption there's less need for buckets.

      • _size

        private int _size
        Current size (number of entries); needed to know if and when rehash.
      • _sizeThreshold

        private int _sizeThreshold
        Limit that indicates maximum size this instance can hold before it needs to be expanded and rehashed. Calculated using fill factor passed in to constructor.
      • _indexMask

        private int _indexMask
        Mask used to get index from hash values; equal to _buckets.length - 1, when _buckets.length is a power of two.
      • _longestCollisionList

        private int _longestCollisionList
        We need to keep track of the longest collision list; this is needed both to indicate problems with attacks and to allow flushing for other cases.
        Since:
        2.1
      • _hashShared

        private boolean _hashShared
        Flag that indicates whether underlying data structures for the main hash area are shared or not. If they are, then they need to be handled in copy-on-write way, i.e. if they need to be modified, a copy needs to be made first; at this point it will not be shared any more, and can be modified.

        This flag needs to be checked both when adding new main entries, and when adding new collision list queues (i.e. creating a new collision list head entry)

      • _overflows

        private java.util.BitSet _overflows
        Lazily constructed structure that is used to keep track of collision buckets that have overflowed once: this is used to detect likely attempts at denial-of-service attacks that uses hash collisions.
        Since:
        2.4
    • Constructor Detail

      • CharsToNameCanonicalizer

        private CharsToNameCanonicalizer​(int seed)
        Main method for constructing a root symbol table instance.
    • Method Detail

      • _thresholdSize

        private static int _thresholdSize​(int hashAreaSize)
      • createRoot

        public static CharsToNameCanonicalizer createRoot()
        Method called to create root canonicalizer for a JsonFactory instance. Root instance is never used directly; its main use is for storing and sharing underlying symbol arrays as needed.
      • makeChild

        public CharsToNameCanonicalizer makeChild​(int flags)
        "Factory" method; will create a new child instance of this symbol table. It will be a copy-on-write instance, ie. it will only use read-only copy of parent's data, but when changes are needed, a copy will be created.

        Note: while this method is synchronized, it is generally not safe to both use makeChild/mergeChild, AND to use instance actively. Instead, a separate 'root' instance should be used on which only makeChild/mergeChild are called, but instance itself is not used as a symbol table.

      • release

        public void release()
        Method called by the using code to indicate it is done with this instance. This lets instance merge accumulated changes into parent (if need be), safely and efficiently, and without calling code having to know about parent information.
      • mergeChild

        private void mergeChild​(CharsToNameCanonicalizer.TableInfo childState)
        Method that allows contents of child table to potentially be "merged in" with contents of this symbol table.

        Note that caller has to make sure symbol table passed in is really a child or sibling of this symbol table.

      • size

        public int size()
      • bucketCount

        public int bucketCount()
        Method for checking number of primary hash buckets this symbol table uses.
        Since:
        2.1
      • maybeDirty

        public boolean maybeDirty()
      • hashSeed

        public int hashSeed()
      • collisionCount

        public int collisionCount()
        Method mostly needed by unit tests; calculates number of entries that are in collision list. Value can be at most (size() - 1), but should usually be much lower, ideally 0.
        Since:
        2.1
      • maxCollisionLength

        public int maxCollisionLength()
        Method mostly needed by unit tests; calculates length of the longest collision chain. This should typically be a low number, but may be up to size() - 1 in the pathological case
        Since:
        2.1
      • findSymbol

        public java.lang.String findSymbol​(char[] buffer,
                                           int start,
                                           int len,
                                           int h)
      • _addSymbol

        private java.lang.String _addSymbol​(char[] buffer,
                                            int start,
                                            int len,
                                            int h,
                                            int index)
      • _handleSpillOverflow

        private void _handleSpillOverflow​(int bucketIndex,
                                          CharsToNameCanonicalizer.Bucket newBucket,
                                          int mainIndex)
        Method called when an overflow bucket has hit the maximum expected length: this may be a case of DoS attack. Deal with it based on settings by either clearing up bucket (to avoid indefinite expansion) or throwing exception. Currently the first overflow for any single bucket DOES NOT throw an exception, only second time (per symbol table instance)
      • _hashToIndex

        public int _hashToIndex​(int rawHash)
        Helper method that takes in a "raw" hash value, shuffles it as necessary, and truncates to be used as the index.
      • calcHash

        public int calcHash​(char[] buffer,
                            int start,
                            int len)
        Implementation of a hashing method for variable length Strings. Most of the time intention is that this calculation is done by caller during parsing, not here; however, sometimes it needs to be done for parsed "String" too.
        Parameters:
        len - Length of String; has to be at least 1 (caller guarantees this pre-condition)
      • calcHash

        public int calcHash​(java.lang.String key)
      • copyArrays

        private void copyArrays()
        Method called when copy-on-write is needed; generally when first change is made to a derived symbol table.
      • rehash

        private void rehash()
        Method called when size (number of entries) of symbol table grows so big that load factor is exceeded. Since size has to remain power of two, arrays will then always be doubled. Main work is really redistributing old entries into new String/Bucket entries.
      • reportTooManyCollisions

        protected void reportTooManyCollisions​(int maxLen)
        Since:
        2.1
      • verifyInternalConsistency

        protected void verifyInternalConsistency()
        Diagnostics method that will verify that internal data structures are consistent; not meant as user-facing method but only for test suites and possible troubleshooting.
        Since:
        2.10