org.cyberneko.html.filters
Class Purifier

java.lang.Object
  extended by org.cyberneko.html.filters.DefaultFilter
      extended by org.cyberneko.html.filters.Purifier
All Implemented Interfaces:
org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentFilter, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, HTMLComponent

public class Purifier
extends DefaultFilter

This filter purifies the HTML input to ensure XML well-formedness. The purification process includes:

Illegal characters in XML names are converted to the character sequence "_u####_", where "####" is the hexadecimal value of the Unicode character. Illegal characters appearing in document content are converted to the character sequence "\\u####".

In comments, the character '-' is replaced by the character sequence "- " to prevent "--" from ever appearing in the comment content. For CDATA sections, the character ']' is replaced by the character sequence "] " to prevent "]]" from appearing.

The URI used for synthesized namespace bindings is "http://cyberneko.org/html/ns/synthesized/number" where number is generated to ensure uniqueness.
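
The name, comment, and CDATA rules described above can be illustrated with a small standalone sketch. It is not the Purifier source: the class and method names (PurifierRulesSketch, escapeName, escapeComment, escapeCData) are hypothetical, the name-character test is a rough approximation of the XML name rules based on java.lang.Character, and lowercase hexadecimal digits are an assumption. The document-content rule is left out.

```java
public class PurifierRulesSketch {

    // Rough stand-ins for the XML name rules (assumption, not the real tables).
    static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_';
    }

    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
    }

    /** Replaces each character that is illegal at its position with "_u####_". */
    public static String escapeName(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean legal = (i == 0) ? isNameStart(c) : isNameChar(c);
            if (legal) {
                out.append(c);
            } else {
                // Four hexadecimal digits, zero-padded.
                out.append(String.format("_u%04x_", (int) c));
            }
        }
        return out.toString();
    }

    /** Replaces '-' with "- " so that "--" cannot appear in a comment. */
    public static String escapeComment(String text) {
        return text.replace("-", "- ");
    }

    /** Replaces ']' with "] " so that "]]" cannot appear in a CDATA section. */
    public static String escapeCData(String text) {
        return text.replace("]", "] ");
    }

    public static void main(String[] args) {
        System.out.println(escapeName("my$tag"));    // my_u0024_tag
        System.out.println(escapeName("1st"));       // _u0031_st
        System.out.println(escapeComment("a--b"));   // a- - b
        System.out.println(escapeCData("x]]>y"));    // x] ] >y
    }
}
```

Note that a digit is legal inside a name but not as its first character, which is why "1st" is escaped only at position 0.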

Version:
$Id: Purifier.java,v 1.5 2005/02/14 03:56:54 andyc Exp $
Author:
Andy Clark

Field Summary
protected static java.lang.String AUGMENTATIONS
          Include infoset augmentations.
protected  boolean fAugmentations
          Augmentations.
protected  boolean fInCDATASection
          True if inside a CDATA section.
protected  org.apache.xerces.xni.NamespaceContext fNamespaceContext
          Namespace information.
protected  boolean fNamespaces
          Namespaces.
protected  java.lang.String fPublicId
          Public identifier of doctype declaration.
protected  boolean fSeenDoctype
          True if the doctype declaration was seen.
protected  boolean fSeenRootElement
          True if root element was seen.
protected  int fSynthesizedNamespaceCount
          Synthesized namespace binding count.
protected  java.lang.String fSystemId
          System identifier of doctype declaration.
protected static java.lang.String NAMESPACES
          Namespaces.
protected static HTMLEventInfo SYNTHESIZED_ITEM
          Synthesized event info item.
static java.lang.String SYNTHESIZED_NAMESPACE_PREFX
          Synthesized namespace binding prefix.
 
Fields inherited from class org.cyberneko.html.filters.DefaultFilter
fDocumentHandler, fDocumentSource
 
Constructor Summary
Purifier()
           
 
Method Summary
 void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs)
          Characters.
 void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs)
          Comment.
 void doctypeDecl(java.lang.String root, java.lang.String pubid, java.lang.String sysid, org.apache.xerces.xni.Augmentations augs)
          Doctype declaration.
 void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs)
          Empty element.
 void endCDATA(org.apache.xerces.xni.Augmentations augs)
          End CDATA section.
 void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs)
          End element.
protected  void handleStartDocument()
          Handle start document.
protected  void handleStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs)
          Handle start element.
 void processingInstruction(java.lang.String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs)
          Processing instruction.
protected  java.lang.String purifyName(java.lang.String name, boolean localpart)
          Purify name.
protected  org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname)
          Purify qualified name.
protected  org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text)
          Purify content.
 void reset(org.apache.xerces.xni.parser.XMLComponentManager manager)
          Resets the component.
 void startCDATA(org.apache.xerces.xni.Augmentations augs)
          Start CDATA section.
 void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.Augmentations augs)
          Start document.
 void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs)
          Start document.
 void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs)
          Start element.
protected  void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs, java.lang.String ns)
          Synthesize namespace binding.
protected  org.apache.xerces.xni.Augmentations synthesizedAugs()
          Returns an augmentations object with a synthesized item added.
protected static java.lang.String toHexString(int c, int padlen)
          Returns a padded hexadecimal string for the given value.
 void xmlDecl(java.lang.String version, java.lang.String encoding, java.lang.String standalone, org.apache.xerces.xni.Augmentations augs)
          XML declaration.
 
Methods inherited from class org.cyberneko.html.filters.DefaultFilter
endDocument, endGeneralEntity, endPrefixMapping, getDocumentHandler, getDocumentSource, getFeatureDefault, getPropertyDefault, getRecognizedFeatures, getRecognizedProperties, ignorableWhitespace, merge, setDocumentHandler, setDocumentSource, setFeature, setProperty, startGeneralEntity, startPrefixMapping, textDecl
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SYNTHESIZED_NAMESPACE_PREFX

public static final java.lang.String SYNTHESIZED_NAMESPACE_PREFX
Synthesized namespace binding prefix.

See Also:
Constant Field Values

NAMESPACES

protected static final java.lang.String NAMESPACES
Namespaces.

See Also:
Constant Field Values

AUGMENTATIONS

protected static final java.lang.String AUGMENTATIONS
Include infoset augmentations.

See Also:
Constant Field Values

SYNTHESIZED_ITEM

protected static final HTMLEventInfo SYNTHESIZED_ITEM
Synthesized event info item.


fNamespaces

protected boolean fNamespaces
Namespaces.


fAugmentations

protected boolean fAugmentations
Augmentations.


fSeenDoctype

protected boolean fSeenDoctype
True if the doctype declaration was seen.


fSeenRootElement

protected boolean fSeenRootElement
True if root element was seen.


fInCDATASection

protected boolean fInCDATASection
True if inside a CDATA section.


fPublicId

protected java.lang.String fPublicId
Public identifier of doctype declaration.


fSystemId

protected java.lang.String fSystemId
System identifier of doctype declaration.


fNamespaceContext

protected org.apache.xerces.xni.NamespaceContext fNamespaceContext
Namespace information.


fSynthesizedNamespaceCount

protected int fSynthesizedNamespaceCount
Synthesized namespace binding count.

Constructor Detail

Purifier

public Purifier()
Method Detail

reset

public void reset(org.apache.xerces.xni.parser.XMLComponentManager manager)
           throws org.apache.xerces.xni.parser.XMLConfigurationException
Description copied from class: DefaultFilter
Resets the component. The component can query the component manager about any features and properties that affect the operation of the component.

Specified by:
reset in interface org.apache.xerces.xni.parser.XMLComponent
Overrides:
reset in class DefaultFilter
Parameters:
manager - The component manager.
Throws:
org.apache.xerces.xni.parser.XMLConfigurationException

startDocument

public void startDocument(org.apache.xerces.xni.XMLLocator locator,
                          java.lang.String encoding,
                          org.apache.xerces.xni.Augmentations augs)
                   throws org.apache.xerces.xni.XNIException
Start document.

Overrides:
startDocument in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

startDocument

public void startDocument(org.apache.xerces.xni.XMLLocator locator,
                          java.lang.String encoding,
                          org.apache.xerces.xni.NamespaceContext nscontext,
                          org.apache.xerces.xni.Augmentations augs)
                   throws org.apache.xerces.xni.XNIException
Start document.

Specified by:
startDocument in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
startDocument in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

xmlDecl

public void xmlDecl(java.lang.String version,
                    java.lang.String encoding,
                    java.lang.String standalone,
                    org.apache.xerces.xni.Augmentations augs)
             throws org.apache.xerces.xni.XNIException
XML declaration.

Specified by:
xmlDecl in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
xmlDecl in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

comment

public void comment(org.apache.xerces.xni.XMLString text,
                    org.apache.xerces.xni.Augmentations augs)
             throws org.apache.xerces.xni.XNIException
Comment.

Specified by:
comment in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
comment in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

processingInstruction

public void processingInstruction(java.lang.String target,
                                  org.apache.xerces.xni.XMLString data,
                                  org.apache.xerces.xni.Augmentations augs)
                           throws org.apache.xerces.xni.XNIException
Processing instruction.

Specified by:
processingInstruction in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
processingInstruction in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

doctypeDecl

public void doctypeDecl(java.lang.String root,
                        java.lang.String pubid,
                        java.lang.String sysid,
                        org.apache.xerces.xni.Augmentations augs)
                 throws org.apache.xerces.xni.XNIException
Doctype declaration.

Specified by:
doctypeDecl in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
doctypeDecl in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

startElement

public void startElement(org.apache.xerces.xni.QName element,
                         org.apache.xerces.xni.XMLAttributes attrs,
                         org.apache.xerces.xni.Augmentations augs)
                  throws org.apache.xerces.xni.XNIException
Start element.

Specified by:
startElement in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
startElement in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

emptyElement

public void emptyElement(org.apache.xerces.xni.QName element,
                         org.apache.xerces.xni.XMLAttributes attrs,
                         org.apache.xerces.xni.Augmentations augs)
                  throws org.apache.xerces.xni.XNIException
Empty element.

Specified by:
emptyElement in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
emptyElement in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

startCDATA

public void startCDATA(org.apache.xerces.xni.Augmentations augs)
                throws org.apache.xerces.xni.XNIException
Start CDATA section.

Specified by:
startCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
startCDATA in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

endCDATA

public void endCDATA(org.apache.xerces.xni.Augmentations augs)
              throws org.apache.xerces.xni.XNIException
End CDATA section.

Specified by:
endCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
endCDATA in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

characters

public void characters(org.apache.xerces.xni.XMLString text,
                       org.apache.xerces.xni.Augmentations augs)
                throws org.apache.xerces.xni.XNIException
Characters.

Specified by:
characters in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
characters in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

endElement

public void endElement(org.apache.xerces.xni.QName element,
                       org.apache.xerces.xni.Augmentations augs)
                throws org.apache.xerces.xni.XNIException
End element.

Specified by:
endElement in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
endElement in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

handleStartDocument

protected void handleStartDocument()
Handle start document.


handleStartElement

protected void handleStartElement(org.apache.xerces.xni.QName element,
                                  org.apache.xerces.xni.XMLAttributes attrs)
Handle start element.


synthesizeBinding

protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs,
                                 java.lang.String ns)
Synthesize namespace binding.
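
As an illustration only, the following sketch shows one way a counter such as fSynthesizedNamespaceCount could yield unique URIs of the form "http://cyberneko.org/html/ns/synthesized/number". The class SynthesizedBindingSketch, its bind method, the starting value of 0, and the reuse of an existing URI for a prefix that is already bound are all assumptions; the real method works on XMLAttributes and may behave differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SynthesizedBindingSketch {

    // Base of the synthesized URI as given in the class description.
    static final String URI_BASE = "http://cyberneko.org/html/ns/synthesized/";

    // Hypothetical counter; incremented once per new binding.
    private int count = 0;

    // Prefix -> synthesized URI, in insertion order.
    private final Map<String, String> bindings = new LinkedHashMap<>();

    /** Returns the URI bound to the prefix, synthesizing a new one if needed. */
    public String bind(String prefix) {
        return bindings.computeIfAbsent(prefix, p -> URI_BASE + count++);
    }

    public static void main(String[] args) {
        SynthesizedBindingSketch s = new SynthesizedBindingSketch();
        System.out.println(s.bind("foo")); // http://cyberneko.org/html/ns/synthesized/0
        System.out.println(s.bind("bar")); // http://cyberneko.org/html/ns/synthesized/1
        System.out.println(s.bind("foo")); // http://cyberneko.org/html/ns/synthesized/0
    }
}
```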


synthesizedAugs

protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
Returns an augmentations object with a synthesized item added.


purifyQName

protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname)
Purify qualified name.


purifyName

protected java.lang.String purifyName(java.lang.String name,
                                      boolean localpart)
Purify name.


purifyText

protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text)
Purify content.


toHexString

protected static java.lang.String toHexString(int c,
                                              int padlen)
Returns a padded hexadecimal string for the given value.
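
A minimal sketch of what such a helper might look like, assuming that padding means leading zeros and that values already wider than padlen are returned unchanged. The class name HexPadSketch and the method name toPaddedHex are hypothetical, and the lowercase digits come from Integer.toHexString rather than from anything stated here.

```java
public class HexPadSketch {

    /** Returns the hexadecimal form of c, left-padded with '0' to padlen digits. */
    public static String toPaddedHex(int c, int padlen) {
        String hex = Integer.toHexString(c);
        StringBuilder out = new StringBuilder();
        // Add one zero for each missing digit; no-op if hex is long enough.
        for (int i = hex.length(); i < padlen; i++) {
            out.append('0');
        }
        return out.append(hex).toString();
    }

    public static void main(String[] args) {
        System.out.println(toPaddedHex('$', 4));     // 0024
        System.out.println(toPaddedHex(0x1F600, 4)); // 1f600
    }
}
```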



(C) Copyright 2002-2008, Andy Clark. All rights reserved.