org.cyberneko.html.filters
Class ElementRemover

java.lang.Object
  extended by org.cyberneko.html.filters.DefaultFilter
      extended by org.cyberneko.html.filters.ElementRemover
All Implemented Interfaces:
org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentFilter, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, HTMLComponent

public class ElementRemover
extends DefaultFilter

This class is a document filter capable of removing specified elements from the processing stream. There are two options for processing document elements:

The first option allows the application to specify which elements appearing in the event stream should be accepted and, therefore, passed on to the next stage in the pipeline. All elements not in the list of acceptable elements have their start and end tags stripped from the event stream unless those elements appear in the list of elements to be removed.

The second option allows the application to specify which elements should be completely removed from the event stream. When an element appears that is to be removed, the element's start and end tag as well as all of that element's content is removed from the event stream.
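
The two options can be illustrated with a small, library-free model. This is a sketch, not the NekoHTML implementation: the class name ElementFilterModel, its methods, and the string-based event encoding are all hypothetical, and checking the accepted list before the removed list is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of the accept/strip/remove behavior; not the NekoHTML source.
class ElementFilterModel {
    private final Set<String> accepted = new HashSet<>();
    private final Set<String> removed = new HashSet<>();

    void accept(String name) { accepted.add(name.toLowerCase(Locale.ROOT)); }
    void remove(String name) { removed.add(name.toLowerCase(Locale.ROOT)); }

    // Events: "<name>" is a start tag, "</name>" an end tag, anything else is text.
    List<String> filter(List<String> events) {
        List<String> out = new ArrayList<>();
        int depth = 0;          // current element depth
        int removalDepth = -1;  // depth of the element being removed, or -1 if none
        for (String event : events) {
            boolean tag = event.startsWith("<") && event.endsWith(">");
            boolean endTag = tag && event.startsWith("</");
            if (tag && !endTag) {
                String name = event.substring(1, event.length() - 1).toLowerCase(Locale.ROOT);
                depth++;
                if (removalDepth != -1) {
                    continue;                  // inside removed content: drop
                }
                if (accepted.contains(name)) {
                    out.add(event);            // accepted: pass the tag on
                } else if (removed.contains(name)) {
                    removalDepth = depth;      // removed: drop tag and content
                }                              // otherwise: strip the tag only
            } else if (endTag) {
                String name = event.substring(2, event.length() - 1).toLowerCase(Locale.ROOT);
                if (removalDepth != -1) {
                    if (depth == removalDepth) {
                        removalDepth = -1;     // the removed element is now closed
                    }
                } else if (accepted.contains(name)) {
                    out.add(event);
                }
                depth--;
            } else if (removalDepth == -1) {
                out.add(event);                // text outside removed content is kept
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ElementFilterModel model = new ElementFilterModel();
        model.accept("b");
        model.remove("script");
        List<String> in = List.of("<html>", "<b>", "bold", "</b>",
                "<script>", "alert(1)", "</script>", "<p>", "para", "</p>", "</html>");
        System.out.println(model.filter(in));
    }
}
```

With b accepted and script removed, main prints [<b>, bold, </b>, para]: the html and p tags are stripped but their text survives, while the script element disappears together with its content.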

A common use of this filter is to allow only rich-text and linking elements, along with character content, to pass through; all other elements are stripped. The following code shows how to configure the filter for this task:

  ElementRemover remover = new ElementRemover();
  remover.acceptElement("b", null);
  remover.acceptElement("i", null);
  remover.acceptElement("u", null);
  remover.acceptElement("a", new String[] { "href" });
 

However, this configuration still allows the text content of other elements to pass through, which may not be desirable. To further "clean" the input, use the removeElement option. The following code completely removes any <SCRIPT> elements, both tags and content, from the stream.

  remover.removeElement("script");
 

Note: All text and accepted element children of a stripped element are retained. To completely remove an element's content, use the removeElement method.

Note: Care should be taken when using this filter because the output may not be a well-balanced tree. Specifically, if the application removes the <HTML> element (with or without retaining its children), the resulting document event stream will no longer be well-formed.
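
The warning above can be checked mechanically. The sketch below is a hypothetical helper, not part of NekoHTML: the class SingleRootCheck and the string-based event encoding ("<name>", "</name>", or text) are assumptions made for illustration. It reports whether an event stream nests properly under exactly one root element.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical helper; the event encoding is an assumption of this sketch.
class SingleRootCheck {
    // True if tags nest properly and all non-blank content sits inside one root element.
    static boolean hasSingleRoot(List<String> events) {
        Deque<String> open = new ArrayDeque<>();
        int roots = 0;
        for (String event : events) {
            boolean tag = event.startsWith("<") && event.endsWith(">");
            if (tag && event.startsWith("</")) {
                String name = event.substring(2, event.length() - 1);
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;              // unmatched or mismatched end tag
                }
            } else if (tag) {
                if (open.isEmpty()) {
                    roots++;                   // a new top-level element
                }
                open.push(event.substring(1, event.length() - 1));
            } else if (open.isEmpty() && !event.trim().isEmpty()) {
                return false;                  // text outside any element
            }
        }
        return open.isEmpty() && roots == 1;
    }

    public static void main(String[] args) {
        // Root kept: one balanced tree.
        System.out.println(hasSingleRoot(
                List.of("<html>", "<b>", "bold", "</b>", "para", "</html>")));
        // Root stripped: "para" is left outside any element.
        System.out.println(hasSingleRoot(List.of("<b>", "bold", "</b>", "para")));
    }
}
```

An application that needs a well-formed result could accept the root element explicitly, or wrap the filtered output in a container of its own before handing it to an XML consumer.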

Version:
$Id: ElementRemover.java,v 1.5 2005/02/14 03:56:54 andyc Exp $
Author:
Andy Clark

Field Summary
protected  java.util.Hashtable fAcceptedElements
          Accepted elements.
protected  int fElementDepth
          The element depth.
protected  int fRemovalElementDepth
          The element depth at element removal.
protected  java.util.Hashtable fRemovedElements
          Removed elements.
protected static java.lang.Object NULL
          A "null" object.
 
Fields inherited from class org.cyberneko.html.filters.DefaultFilter
fDocumentHandler, fDocumentSource
 
Constructor Summary
ElementRemover()
           
 
Method Summary
 void acceptElement(java.lang.String element, java.lang.String[] attributes)
          Specifies that the given element should be accepted and, optionally, which attributes of that element should be kept.
 void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs)
          Characters.
 void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs)
          Comment.
protected  boolean elementAccepted(java.lang.String element)
          Returns true if the specified element is accepted.
protected  boolean elementRemoved(java.lang.String element)
          Returns true if the specified element should be removed.
 void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attributes, org.apache.xerces.xni.Augmentations augs)
          Empty element.
 void endCDATA(org.apache.xerces.xni.Augmentations augs)
          End CDATA section.
 void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs)
          End element.
 void endGeneralEntity(java.lang.String name, org.apache.xerces.xni.Augmentations augs)
          End general entity.
 void endPrefixMapping(java.lang.String prefix, org.apache.xerces.xni.Augmentations augs)
          End prefix mapping.
protected  boolean handleOpenTag(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attributes)
          Handles an open tag.
 void ignorableWhitespace(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs)
          Ignorable whitespace.
 void processingInstruction(java.lang.String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs)
          Processing instruction.
 void removeElement(java.lang.String element)
          Specifies that the given element should be completely removed.
 void startCDATA(org.apache.xerces.xni.Augmentations augs)
          Start CDATA section.
 void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.Augmentations augs)
          Start document.
 void startDocument(org.apache.xerces.xni.XMLLocator locator, java.lang.String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs)
          Start document.
 void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attributes, org.apache.xerces.xni.Augmentations augs)
          Start element.
 void startGeneralEntity(java.lang.String name, org.apache.xerces.xni.XMLResourceIdentifier id, java.lang.String encoding, org.apache.xerces.xni.Augmentations augs)
          Start general entity.
 void startPrefixMapping(java.lang.String prefix, java.lang.String uri, org.apache.xerces.xni.Augmentations augs)
          Start prefix mapping.
 void textDecl(java.lang.String version, java.lang.String encoding, org.apache.xerces.xni.Augmentations augs)
          Text declaration.
 
Methods inherited from class org.cyberneko.html.filters.DefaultFilter
doctypeDecl, endDocument, getDocumentHandler, getDocumentSource, getFeatureDefault, getPropertyDefault, getRecognizedFeatures, getRecognizedProperties, merge, reset, setDocumentHandler, setDocumentSource, setFeature, setProperty, xmlDecl
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NULL

protected static final java.lang.Object NULL
A "null" object.


fAcceptedElements

protected java.util.Hashtable fAcceptedElements
Accepted elements.


fRemovedElements

protected java.util.Hashtable fRemovedElements
Removed elements.


fElementDepth

protected int fElementDepth
The element depth.


fRemovalElementDepth

protected int fRemovalElementDepth
The element depth at element removal.
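
The two depth fields suggest how a removed element's matching end tag can be found when elements of the same name are nested. The sketch below shows that idea in isolation; it is an assumption about the purpose of the fields, not the class's actual code, and RemovalDepthModel, dropElement, and the string event encoding are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of depth-based removal; not the NekoHTML source.
class RemovalDepthModel {
    // Drops every element called `name` together with its content.
    static List<String> dropElement(List<String> events, String name) {
        List<String> out = new ArrayList<>();
        int depth = 0;          // plays the role of an element depth counter
        int removalDepth = -1;  // depth at which removal started, or -1 if inactive
        for (String event : events) {
            boolean tag = event.startsWith("<") && event.endsWith(">");
            if (tag && event.startsWith("</")) {
                if (removalDepth != -1) {
                    if (depth == removalDepth) {
                        removalDepth = -1;     // only the matching end tag stops removal
                    }
                } else {
                    out.add(event);
                }
                depth--;
            } else if (tag) {
                depth++;
                if (removalDepth == -1) {
                    if (event.equals("<" + name + ">")) {
                        removalDepth = depth;  // remember where removal began
                    } else {
                        out.add(event);
                    }
                }
            } else if (removalDepth == -1) {
                out.add(event);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> in = List.of("<p>", "<div>", "a", "<div>", "b", "</div>",
                "c", "</div>", "d", "</p>");
        System.out.println(dropElement(in, "div"));
    }
}
```

A plain boolean flag cleared at the first closing div would let "c" leak through; comparing depths keeps removal active until the outer div closes, so main prints [<p>, d, </p>].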

Constructor Detail

ElementRemover

public ElementRemover()
Method Detail

acceptElement

public void acceptElement(java.lang.String element,
                          java.lang.String[] attributes)
Specifies that the given element should be accepted and, optionally, which attributes of that element should be kept.

Parameters:
element - The element to accept.
attributes - The list of attributes to be kept, or null if no attributes should be kept for this element.
See Also:
removeElement(java.lang.String)
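
The effect of the attributes parameter can be modelled without the library. The sketch below is hypothetical: AttributeFilterModel and keptAttributes are illustrative names, attributes are represented as a plain map, and case-insensitive name matching is an assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical model of per-element attribute filtering; not the NekoHTML source.
class AttributeFilterModel {
    // element name -> attribute names to keep (empty set when none are kept)
    private final Map<String, Set<String>> accepted = new HashMap<>();

    void acceptElement(String element, String[] attributes) {
        Set<String> keep = new HashSet<>();
        if (attributes != null) {              // null means "keep no attributes"
            for (String attribute : attributes) {
                keep.add(attribute.toLowerCase(Locale.ROOT));
            }
        }
        accepted.put(element.toLowerCase(Locale.ROOT), keep);
    }

    // Returns the attributes that survive, or null if the element is not accepted.
    Map<String, String> keptAttributes(String element, Map<String, String> attributes) {
        Set<String> keep = accepted.get(element.toLowerCase(Locale.ROOT));
        if (keep == null) {
            return null;
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            if (keep.contains(entry.getKey().toLowerCase(Locale.ROOT))) {
                out.put(entry.getKey(), entry.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        AttributeFilterModel model = new AttributeFilterModel();
        model.acceptElement("b", null);
        model.acceptElement("a", new String[] { "href" });

        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("href", "/home");
        attrs.put("onclick", "track()");
        System.out.println(model.keptAttributes("a", attrs));
    }
}
```

Here main prints {href=/home}: the onclick attribute is dropped because only href was listed for the a element, and an element accepted with null keeps no attributes at all.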

removeElement

public void removeElement(java.lang.String element)
Specifies that the given element should be completely removed. If an element on the remove list is encountered during processing, the element's start and end tags, as well as all of the content contained within the element, will be removed from the processing stream.

Parameters:
element - The element to completely remove.

startDocument

public void startDocument(org.apache.xerces.xni.XMLLocator locator,
                          java.lang.String encoding,
                          org.apache.xerces.xni.NamespaceContext nscontext,
                          org.apache.xerces.xni.Augmentations augs)
                   throws org.apache.xerces.xni.XNIException
Start document.

Specified by:
startDocument in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
startDocument in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

startDocument

public void startDocument(org.apache.xerces.xni.XMLLocator locator,
                          java.lang.String encoding,
                          org.apache.xerces.xni.Augmentations augs)
                   throws org.apache.xerces.xni.XNIException
Start document.

Overrides:
startDocument in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

startPrefixMapping

public void startPrefixMapping(java.lang.String prefix,
                               java.lang.String uri,
                               org.apache.xerces.xni.Augmentations augs)
                        throws org.apache.xerces.xni.XNIException
Start prefix mapping.

Overrides:
startPrefixMapping in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

startElement

public void startElement(org.apache.xerces.xni.QName element,
                         org.apache.xerces.xni.XMLAttributes attributes,
                         org.apache.xerces.xni.Augmentations augs)
                  throws org.apache.xerces.xni.XNIException
Start element.

Specified by:
startElement in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
startElement in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

emptyElement

public void emptyElement(org.apache.xerces.xni.QName element,
                         org.apache.xerces.xni.XMLAttributes attributes,
                         org.apache.xerces.xni.Augmentations augs)
                  throws org.apache.xerces.xni.XNIException
Empty element.

Specified by:
emptyElement in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
emptyElement in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

comment

public void comment(org.apache.xerces.xni.XMLString text,
                    org.apache.xerces.xni.Augmentations augs)
             throws org.apache.xerces.xni.XNIException
Comment.

Specified by:
comment in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
comment in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

processingInstruction

public void processingInstruction(java.lang.String target,
                                  org.apache.xerces.xni.XMLString data,
                                  org.apache.xerces.xni.Augmentations augs)
                           throws org.apache.xerces.xni.XNIException
Processing instruction.

Specified by:
processingInstruction in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
processingInstruction in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

characters

public void characters(org.apache.xerces.xni.XMLString text,
                       org.apache.xerces.xni.Augmentations augs)
                throws org.apache.xerces.xni.XNIException
Characters.

Specified by:
characters in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
characters in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

ignorableWhitespace

public void ignorableWhitespace(org.apache.xerces.xni.XMLString text,
                                org.apache.xerces.xni.Augmentations augs)
                         throws org.apache.xerces.xni.XNIException
Ignorable whitespace.

Specified by:
ignorableWhitespace in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
ignorableWhitespace in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

startGeneralEntity

public void startGeneralEntity(java.lang.String name,
                               org.apache.xerces.xni.XMLResourceIdentifier id,
                               java.lang.String encoding,
                               org.apache.xerces.xni.Augmentations augs)
                        throws org.apache.xerces.xni.XNIException
Start general entity.

Specified by:
startGeneralEntity in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
startGeneralEntity in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

textDecl

public void textDecl(java.lang.String version,
                     java.lang.String encoding,
                     org.apache.xerces.xni.Augmentations augs)
              throws org.apache.xerces.xni.XNIException
Text declaration.

Specified by:
textDecl in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
textDecl in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

endGeneralEntity

public void endGeneralEntity(java.lang.String name,
                             org.apache.xerces.xni.Augmentations augs)
                      throws org.apache.xerces.xni.XNIException
End general entity.

Specified by:
endGeneralEntity in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
endGeneralEntity in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

startCDATA

public void startCDATA(org.apache.xerces.xni.Augmentations augs)
                throws org.apache.xerces.xni.XNIException
Start CDATA section.

Specified by:
startCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
startCDATA in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

endCDATA

public void endCDATA(org.apache.xerces.xni.Augmentations augs)
              throws org.apache.xerces.xni.XNIException
End CDATA section.

Specified by:
endCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
endCDATA in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

endElement

public void endElement(org.apache.xerces.xni.QName element,
                       org.apache.xerces.xni.Augmentations augs)
                throws org.apache.xerces.xni.XNIException
End element.

Specified by:
endElement in interface org.apache.xerces.xni.XMLDocumentHandler
Overrides:
endElement in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

endPrefixMapping

public void endPrefixMapping(java.lang.String prefix,
                             org.apache.xerces.xni.Augmentations augs)
                      throws org.apache.xerces.xni.XNIException
End prefix mapping.

Overrides:
endPrefixMapping in class DefaultFilter
Throws:
org.apache.xerces.xni.XNIException

elementAccepted

protected boolean elementAccepted(java.lang.String element)
Returns true if the specified element is accepted.


elementRemoved

protected boolean elementRemoved(java.lang.String element)
Returns true if the specified element should be removed.


handleOpenTag

protected boolean handleOpenTag(org.apache.xerces.xni.QName element,
                                org.apache.xerces.xni.XMLAttributes attributes)
Handles an open tag.



(C) Copyright 2002-2008, Andy Clark. All rights reserved.