org.cyberneko.html
Class HTMLScanner

java.lang.Object
  extended by org.cyberneko.html.HTMLScanner
All Implemented Interfaces:
org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentScanner, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLLocator, HTMLComponent

public class HTMLScanner
extends java.lang.Object
implements org.apache.xerces.xni.parser.XMLDocumentScanner, org.apache.xerces.xni.XMLLocator, HTMLComponent

A simple HTML scanner. This scanner makes no attempt to balance tags or fix other problems in the source document; it just scans what it can and generates XNI document "events", ignoring errors of all kinds.

This component recognizes the following features:

This component recognizes the following properties:

Version:
$Id: HTMLScanner.java,v 1.19 2005/06/14 05:52:37 andyc Exp $
Author:
Andy Clark
See Also:
HTMLElements, HTMLEntities

Nested Class Summary
 class HTMLScanner.ContentScanner
          The primary HTML document scanner.
static class HTMLScanner.CurrentEntity
          Current entity.
protected static class HTMLScanner.LocationItem
          Location infoset item.
static class HTMLScanner.PlaybackInputStream
          A playback input stream.
static interface HTMLScanner.Scanner
          Basic scanner interface.
 class HTMLScanner.SpecialScanner
          Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
 
Field Summary
protected static java.lang.String AUGMENTATIONS
          Include infoset augmentations.
static java.lang.String CDATA_SECTIONS
          Scan CDATA sections.
protected static java.lang.reflect.Method CHARSET_forName
          Charset#forName method, if available.
protected static boolean DEBUG_CALLBACKS
          Set to true to debug callbacks.
protected static java.lang.reflect.Method DECODER_averageCharsPerByte
          CharsetDecoder#averageCharsPerByte method, if available.
protected static int DEFAULT_BUFFER_SIZE
          Default buffer size.
protected static java.lang.String DEFAULT_ENCODING
          Default encoding.
protected static java.lang.String DOCTYPE_PUBID
          Doctype declaration public identifier.
protected static java.lang.String DOCTYPE_SYSID
          Doctype declaration system identifier.
protected static java.lang.String ERROR_REPORTER
          Error reporter.
protected  boolean fAugmentations
          Augmentations.
protected  int fBeginColumnNumber
          Beginning column number.
protected  int fBeginLineNumber
          Beginning line number.
protected  HTMLScanner.PlaybackInputStream fByteStream
          The playback byte stream.
protected  boolean fCDATASections
          CDATA sections.
protected  HTMLScanner.Scanner fContentScanner
          Content scanner.
protected  HTMLScanner.CurrentEntity fCurrentEntity
          Current entity.
protected  java.util.Stack fCurrentEntityStack
          The current entity stack.
protected  java.lang.String fDefaultIANAEncoding
          Default encoding.
protected  java.lang.String fDoctypePubid
          Doctype declaration public identifier.
protected  java.lang.String fDoctypeSysid
          Doctype declaration system identifier.
protected  org.apache.xerces.xni.XMLDocumentHandler fDocumentHandler
          The document handler.
protected  int fElementCount
          Element count.
protected  int fElementDepth
          Element depth.
protected  int fEndColumnNumber
          Ending column number.
protected  int fEndLineNumber
          Ending line number.
protected  HTMLErrorReporter fErrorReporter
          Error reporter.
protected  boolean fFixWindowsCharRefs
          Fix Microsoft Windows® character entity references.
protected  java.lang.String fIANAEncoding
          Auto-detected IANA encoding.
protected  boolean fIgnoreSpecifiedCharset
          Ignore specified character set.
protected  boolean fInsertDoctype
          Insert document type declaration.
protected  boolean fIso8859Encoding
          True if the encoding matches "ISO-8859-*".
static java.lang.String FIX_MSWINDOWS_REFS
          Fix Microsoft Windows® character entity references.
protected  java.lang.String fJavaEncoding
          Auto-detected Java encoding.
protected  short fNamesAttrs
          Modify HTML attribute names.
protected  short fNamesElems
          Modify HTML element names.
protected  boolean fNormalizeAttributes
          Normalize attribute values.
protected  boolean fNotifyCharRefs
          Notify character entity references.
protected  boolean fNotifyHtmlBuiltinRefs
          Notify HTML built-in general entity references.
protected  boolean fNotifyXmlBuiltinRefs
          Notify XML built-in general entity references.
protected  boolean fOverrideDoctype
          Override doctype declaration public and system identifiers.
protected  boolean fReportErrors
          Report errors.
protected  HTMLScanner.Scanner fScanner
          The current scanner.
protected  short fScannerState
          The current scanner state.
protected  boolean fScriptStripCDATADelims
          Strip CDATA delimiters from SCRIPT tags.
protected  boolean fScriptStripCommentDelims
          Strip comment delimiters from SCRIPT tags.
protected  HTMLScanner.SpecialScanner fSpecialScanner
          Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references.
protected  org.apache.xerces.xni.XMLString fString
          String.
protected  org.apache.xerces.util.XMLStringBuffer fStringBuffer
          String buffer.
protected  boolean fStyleStripCDATADelims
          Strip CDATA delimiters from STYLE tags.
protected  boolean fStyleStripCommentDelims
          Strip comment delimiters from STYLE tags.
static java.lang.String HTML_4_01_FRAMESET_PUBID
          HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").
static java.lang.String HTML_4_01_FRAMESET_SYSID
          HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").
static java.lang.String HTML_4_01_STRICT_PUBID
          HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").
static java.lang.String HTML_4_01_STRICT_SYSID
          HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").
static java.lang.String HTML_4_01_TRANSITIONAL_PUBID
          HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").
static java.lang.String HTML_4_01_TRANSITIONAL_SYSID
          HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").
static java.lang.String IGNORE_SPECIFIED_CHARSET
          Ignore specified charset found in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag.
static java.lang.String INSERT_DOCTYPE
          Insert document type declaration.
protected static java.lang.String NAMES_ATTRS
          Modify HTML attribute names: { "upper", "lower", "default" }.
protected static java.lang.String NAMES_ELEMS
          Modify HTML element names: { "upper", "lower", "default" }.
protected static short NAMES_LOWERCASE
          Lowercase HTML names.
protected static short NAMES_NO_CHANGE
          Don't modify HTML names.
protected static short NAMES_UPPERCASE
          Uppercase HTML names.
protected static java.lang.String NORMALIZE_ATTRIBUTES
          Normalize attribute values.
static java.lang.String NOTIFY_CHAR_REFS
          Notify character entity references (e.g. &#32;, &#x20;, etc).
static java.lang.String NOTIFY_HTML_BUILTIN_REFS
          Notify handler of built-in entity references (e.g. &nbsp;, &copy;, etc).
static java.lang.String NOTIFY_XML_BUILTIN_REFS
          Notify handler of built-in entity references (e.g. &amp;, &lt;, etc).
static java.lang.String OVERRIDE_DOCTYPE
          Override doctype declaration public and system identifiers.
protected static java.lang.String REPORT_ERRORS
          Report errors.
static java.lang.String SCRIPT_STRIP_CDATA_DELIMS
          Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.
static java.lang.String SCRIPT_STRIP_COMMENT_DELIMS
          Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.
protected static short STATE_CONTENT
          State: content.
protected static short STATE_END_DOCUMENT
          State: end document.
protected static short STATE_MARKUP_BRACKET
          State: markup bracket.
protected static short STATE_START_DOCUMENT
          State: start document.
static java.lang.String STYLE_STRIP_CDATA_DELIMS
          Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.
static java.lang.String STYLE_STRIP_COMMENT_DELIMS
          Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.
protected static HTMLEventInfo SYNTHESIZED_ITEM
          Synthesized event info item.
 
Constructor Summary
HTMLScanner()
           
 
Method Summary
protected static boolean builtinXmlRef(java.lang.String name)
          Returns true if the name is a built-in XML general entity reference.
 void cleanup(boolean closeall)
          Cleans up used resources.
static java.lang.String expandSystemId(java.lang.String systemId, java.lang.String baseSystemId)
          Expands a system id and returns the system id as a URI, if it can be expanded.
protected static java.lang.String fixURI(java.lang.String str)
          Fixes a platform dependent filename to standard URI form.
protected  int fixWindowsCharacter(int origChar)
          Fixes Microsoft Windows® specific characters.
 java.lang.String getBaseSystemId()
          Returns the base system identifier.
 int getCharacterOffset()
          Returns the character offset.
 int getColumnNumber()
          Returns the current column number.
 org.apache.xerces.xni.XMLDocumentHandler getDocumentHandler()
          Returns the document handler.
 java.lang.String getEncoding()
          Returns the encoding.
 java.lang.String getExpandedSystemId()
          Returns the expanded system identifier.
 java.lang.Boolean getFeatureDefault(java.lang.String featureId)
          Returns the default state for a feature.
 int getLineNumber()
          Returns the current line number.
 java.lang.String getLiteralSystemId()
          Returns the literal system identifier.
protected static short getNamesValue(java.lang.String value)
          Converts HTML names string value to constant value.
 java.lang.Object getPropertyDefault(java.lang.String propertyId)
          Returns the default state for a property.
 java.lang.String getPublicId()
          Returns the public identifier.
 java.lang.String[] getRecognizedFeatures()
          Returns recognized features.
 java.lang.String[] getRecognizedProperties()
          Returns recognized properties.
protected static java.lang.String getValue(org.apache.xerces.xni.XMLAttributes attrs, java.lang.String aname)
          Returns the value of the specified attribute, ignoring case.
 java.lang.String getXMLVersion()
          Returns the XML version.
protected  int load(int offset)
          Loads a new chunk of data into the buffer and returns the number of characters loaded or -1 if no additional characters were loaded.
protected  org.apache.xerces.xni.Augmentations locationAugs()
          Returns an augmentations object with a location item added.
protected static java.lang.String modifyName(java.lang.String name, short mode)
          Modifies the given name based on the specified mode.
 void pushInputSource(org.apache.xerces.xni.parser.XMLInputSource inputSource)
          Pushes an input source onto the current entity stack.
protected  int read()
          Reads a single character.
 void reset(org.apache.xerces.xni.parser.XMLComponentManager manager)
          Resets the component.
protected  org.apache.xerces.xni.XMLResourceIdentifier resourceId()
          Returns an empty resource identifier.
protected  void scanDoctype()
          Scans a DOCTYPE line.
 boolean scanDocument(boolean complete)
          Scans the document.
protected  int scanEntityRef(org.apache.xerces.util.XMLStringBuffer str, boolean content)
          Scans an entity reference.
protected  java.lang.String scanLiteral()
          Scans a quoted literal.
protected  java.lang.String scanName()
          Scans a name.
 void setDocumentHandler(org.apache.xerces.xni.XMLDocumentHandler handler)
          Sets the document handler.
 void setFeature(java.lang.String featureId, boolean state)
          Sets a feature.
 void setInputSource(org.apache.xerces.xni.parser.XMLInputSource source)
          Sets the input source.
 void setProperty(java.lang.String propertyId, java.lang.Object value)
          Sets a property.
protected  void setScanner(HTMLScanner.Scanner scanner)
          Sets the scanner.
protected  void setScannerState(short state)
          Sets the scanner state.
protected  boolean skip(java.lang.String s, boolean caseSensitive)
          Returns true if the specified text is present and is skipped.
protected  boolean skipMarkup(boolean balance)
          Skips markup.
protected  int skipNewlines()
          Skips newlines and returns the number of newlines skipped.
protected  int skipNewlines(int maxlines)
          Skips newlines and returns the number of newlines skipped.
protected  boolean skipSpaces()
          Skips whitespace.
protected  org.apache.xerces.xni.Augmentations synthesizedAugs()
          Returns an augmentations object with a synthesized item added.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

HTML_4_01_STRICT_PUBID

public static final java.lang.String HTML_4_01_STRICT_PUBID
HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN").

See Also:
Constant Field Values

HTML_4_01_STRICT_SYSID

public static final java.lang.String HTML_4_01_STRICT_SYSID
HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd").

See Also:
Constant Field Values

HTML_4_01_TRANSITIONAL_PUBID

public static final java.lang.String HTML_4_01_TRANSITIONAL_PUBID
HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN").

See Also:
Constant Field Values

HTML_4_01_TRANSITIONAL_SYSID

public static final java.lang.String HTML_4_01_TRANSITIONAL_SYSID
HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd").

See Also:
Constant Field Values

HTML_4_01_FRAMESET_PUBID

public static final java.lang.String HTML_4_01_FRAMESET_PUBID
HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN").

See Also:
Constant Field Values

HTML_4_01_FRAMESET_SYSID

public static final java.lang.String HTML_4_01_FRAMESET_SYSID
HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd").

See Also:
Constant Field Values

AUGMENTATIONS

protected static final java.lang.String AUGMENTATIONS
Include infoset augmentations.

See Also:
Constant Field Values

REPORT_ERRORS

protected static final java.lang.String REPORT_ERRORS
Report errors.

See Also:
Constant Field Values

NOTIFY_CHAR_REFS

public static final java.lang.String NOTIFY_CHAR_REFS
Notify character entity references (e.g. &#32;, &#x20;, etc).

See Also:
Constant Field Values

NOTIFY_XML_BUILTIN_REFS

public static final java.lang.String NOTIFY_XML_BUILTIN_REFS
Notify handler of built-in entity references (e.g. &amp;, &lt;, etc).

Note: This only applies to the five pre-defined XML general entities: "amp", "lt", "gt", "quot", and "apos". This is done for compatibility with the Xerces feature.

To be notified of the built-in entity references in HTML, set the http://cyberneko.org/html/features/scanner/notify-builtin-refs feature to true.

See Also:
Constant Field Values

NOTIFY_HTML_BUILTIN_REFS

public static final java.lang.String NOTIFY_HTML_BUILTIN_REFS
Notify handler of built-in entity references (e.g. &nbsp;, &copy;, etc).

Note: This includes the five pre-defined XML general entities.

See Also:
Constant Field Values

FIX_MSWINDOWS_REFS

public static final java.lang.String FIX_MSWINDOWS_REFS
Fix Microsoft Windows® character entity references.

See Also:
Constant Field Values

SCRIPT_STRIP_COMMENT_DELIMS

public static final java.lang.String SCRIPT_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from SCRIPT tag contents.

See Also:
Constant Field Values

SCRIPT_STRIP_CDATA_DELIMS

public static final java.lang.String SCRIPT_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from SCRIPT tag contents.

See Also:
Constant Field Values

STYLE_STRIP_COMMENT_DELIMS

public static final java.lang.String STYLE_STRIP_COMMENT_DELIMS
Strip HTML comment delimiters ("<!--" and "-->") from STYLE tag contents.

See Also:
Constant Field Values

STYLE_STRIP_CDATA_DELIMS

public static final java.lang.String STYLE_STRIP_CDATA_DELIMS
Strip XHTML CDATA delimiters ("<![CDATA[" and "]]>") from STYLE tag contents.

See Also:
Constant Field Values
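
The four strip features above (comment and CDATA delimiters, for SCRIPT and STYLE contents) can be pictured with a minimal sketch. It assumes that one leading and one trailing delimiter are removed and that whitespace around them is tolerated; this page does not spell out the exact rules, and DelimStripper and strip are hypothetical names, not part of NekoHTML.

```java
final class DelimStripper {
    /**
     * Removes one leading open delimiter and one trailing close delimiter,
     * if present. Whitespace before the open delimiter and after the close
     * delimiter is allowed (an assumption of this sketch).
     */
    static String strip(String content, String open, String close) {
        int start = 0;
        int end = content.length();

        // Look for the opening delimiter after any leading whitespace.
        int i = 0;
        while (i < end && Character.isWhitespace(content.charAt(i))) {
            i++;
        }
        if (content.startsWith(open, i)) {
            start = i + open.length();
        }

        // Look for the closing delimiter before any trailing whitespace.
        int j = end;
        while (j > start && Character.isWhitespace(content.charAt(j - 1))) {
            j--;
        }
        int closeAt = j - close.length();
        if (closeAt >= start && content.startsWith(close, closeAt)) {
            end = closeAt;
        }
        return content.substring(start, end);
    }
}
```

For instance, DelimStripper.strip(text, "<![CDATA[", "]]>") would model the CDATA variants, and passing "<!--" and "-->" would model the comment variants.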

IGNORE_SPECIFIED_CHARSET

public static final java.lang.String IGNORE_SPECIFIED_CHARSET
Ignore specified charset found in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag.

See Also:
Constant Field Values

CDATA_SECTIONS

public static final java.lang.String CDATA_SECTIONS
Scan CDATA sections.

See Also:
Constant Field Values

OVERRIDE_DOCTYPE

public static final java.lang.String OVERRIDE_DOCTYPE
Override doctype declaration public and system identifiers.

See Also:
Constant Field Values

INSERT_DOCTYPE

public static final java.lang.String INSERT_DOCTYPE
Insert document type declaration.

See Also:
Constant Field Values

NORMALIZE_ATTRIBUTES

protected static final java.lang.String NORMALIZE_ATTRIBUTES
Normalize attribute values.

See Also:
Constant Field Values

NAMES_ELEMS

protected static final java.lang.String NAMES_ELEMS
Modify HTML element names: { "upper", "lower", "default" }.

See Also:
Constant Field Values

NAMES_ATTRS

protected static final java.lang.String NAMES_ATTRS
Modify HTML attribute names: { "upper", "lower", "default" }.

See Also:
Constant Field Values

DEFAULT_ENCODING

protected static final java.lang.String DEFAULT_ENCODING
Default encoding.

See Also:
Constant Field Values

ERROR_REPORTER

protected static final java.lang.String ERROR_REPORTER
Error reporter.

See Also:
Constant Field Values

DOCTYPE_PUBID

protected static final java.lang.String DOCTYPE_PUBID
Doctype declaration public identifier.

See Also:
Constant Field Values

DOCTYPE_SYSID

protected static final java.lang.String DOCTYPE_SYSID
Doctype declaration system identifier.

See Also:
Constant Field Values

STATE_CONTENT

protected static final short STATE_CONTENT
State: content.

See Also:
Constant Field Values

STATE_MARKUP_BRACKET

protected static final short STATE_MARKUP_BRACKET
State: markup bracket.

See Also:
Constant Field Values

STATE_START_DOCUMENT

protected static final short STATE_START_DOCUMENT
State: start document.

See Also:
Constant Field Values

STATE_END_DOCUMENT

protected static final short STATE_END_DOCUMENT
State: end document.

See Also:
Constant Field Values

NAMES_NO_CHANGE

protected static final short NAMES_NO_CHANGE
Don't modify HTML names.

See Also:
Constant Field Values

NAMES_UPPERCASE

protected static final short NAMES_UPPERCASE
Uppercase HTML names.

See Also:
Constant Field Values

NAMES_LOWERCASE

protected static final short NAMES_LOWERCASE
Lowercase HTML names.

See Also:
Constant Field Values

DEFAULT_BUFFER_SIZE

protected static final int DEFAULT_BUFFER_SIZE
Default buffer size.

See Also:
Constant Field Values

DEBUG_CALLBACKS

protected static final boolean DEBUG_CALLBACKS
Set to true to debug callbacks.

See Also:
Constant Field Values

CHARSET_forName

protected static java.lang.reflect.Method CHARSET_forName
Charset#forName method, if available.


DECODER_averageCharsPerByte

protected static java.lang.reflect.Method DECODER_averageCharsPerByte
CharsetDecoder#averageCharsPerByte method, if available.


SYNTHESIZED_ITEM

protected static final HTMLEventInfo SYNTHESIZED_ITEM
Synthesized event info item.


fAugmentations

protected boolean fAugmentations
Augmentations.


fReportErrors

protected boolean fReportErrors
Report errors.


fNotifyCharRefs

protected boolean fNotifyCharRefs
Notify character entity references.


fNotifyXmlBuiltinRefs

protected boolean fNotifyXmlBuiltinRefs
Notify XML built-in general entity references.


fNotifyHtmlBuiltinRefs

protected boolean fNotifyHtmlBuiltinRefs
Notify HTML built-in general entity references.


fFixWindowsCharRefs

protected boolean fFixWindowsCharRefs
Fix Microsoft Windows® character entity references.


fScriptStripCDATADelims

protected boolean fScriptStripCDATADelims
Strip CDATA delimiters from SCRIPT tags.


fScriptStripCommentDelims

protected boolean fScriptStripCommentDelims
Strip comment delimiters from SCRIPT tags.


fStyleStripCDATADelims

protected boolean fStyleStripCDATADelims
Strip CDATA delimiters from STYLE tags.


fStyleStripCommentDelims

protected boolean fStyleStripCommentDelims
Strip comment delimiters from STYLE tags.


fIgnoreSpecifiedCharset

protected boolean fIgnoreSpecifiedCharset
Ignore specified character set.


fCDATASections

protected boolean fCDATASections
CDATA sections.


fOverrideDoctype

protected boolean fOverrideDoctype
Override doctype declaration public and system identifiers.


fInsertDoctype

protected boolean fInsertDoctype
Insert document type declaration.


fNormalizeAttributes

protected boolean fNormalizeAttributes
Normalize attribute values.


fNamesElems

protected short fNamesElems
Modify HTML element names.


fNamesAttrs

protected short fNamesAttrs
Modify HTML attribute names.


fDefaultIANAEncoding

protected java.lang.String fDefaultIANAEncoding
Default encoding.


fErrorReporter

protected HTMLErrorReporter fErrorReporter
Error reporter.


fDoctypePubid

protected java.lang.String fDoctypePubid
Doctype declaration public identifier.


fDoctypeSysid

protected java.lang.String fDoctypeSysid
Doctype declaration system identifier.


fBeginLineNumber

protected int fBeginLineNumber
Beginning line number.


fBeginColumnNumber

protected int fBeginColumnNumber
Beginning column number.


fEndLineNumber

protected int fEndLineNumber
Ending line number.


fEndColumnNumber

protected int fEndColumnNumber
Ending column number.


fByteStream

protected HTMLScanner.PlaybackInputStream fByteStream
The playback byte stream.


fCurrentEntity

protected HTMLScanner.CurrentEntity fCurrentEntity
Current entity.


fCurrentEntityStack

protected final java.util.Stack fCurrentEntityStack
The current entity stack.


fScanner

protected HTMLScanner.Scanner fScanner
The current scanner.


fScannerState

protected short fScannerState
The current scanner state.


fDocumentHandler

protected org.apache.xerces.xni.XMLDocumentHandler fDocumentHandler
The document handler.


fIANAEncoding

protected java.lang.String fIANAEncoding
Auto-detected IANA encoding.


fJavaEncoding

protected java.lang.String fJavaEncoding
Auto-detected Java encoding.


fIso8859Encoding

protected boolean fIso8859Encoding
True if the encoding matches "ISO-8859-*".


fElementCount

protected int fElementCount
Element count.


fElementDepth

protected int fElementDepth
Element depth.


fContentScanner

protected HTMLScanner.Scanner fContentScanner
Content scanner.


fSpecialScanner

protected HTMLScanner.SpecialScanner fSpecialScanner
Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. For example: <SCRIPT> and <COMMENT>.


fString

protected final org.apache.xerces.xni.XMLString fString
String.


fStringBuffer

protected final org.apache.xerces.util.XMLStringBuffer fStringBuffer
String buffer.

Constructor Detail

HTMLScanner

public HTMLScanner()
Method Detail

pushInputSource

public void pushInputSource(org.apache.xerces.xni.parser.XMLInputSource inputSource)
Pushes an input source onto the current entity stack. This enables the scanner to transparently scan new content (e.g. the output written by an embedded script). At the end of the current entity, the scanner returns to where it left off at the time this entity source was pushed.

Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.

Parameters:
inputSource - The new input source to start scanning.
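
The push-and-resume behaviour described above can be sketched with a stack of readers. This is an illustration only, not the scanner's implementation: EntityStackSketch and its methods are hypothetical names, and plain java.io.Reader objects stand in for XMLInputSource.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;

final class EntityStackSketch {
    // Suspended sources, most recently suspended on top.
    private final Deque<Reader> stack = new ArrayDeque<>();
    private Reader current;

    EntityStackSketch(Reader original) {
        this.current = original;
    }

    /** Suspends the current source and starts reading from the new one. */
    void push(Reader source) {
        stack.push(current);
        current = source;
    }

    /** Reads one character, resuming a suspended source when one ends. */
    int read() throws IOException {
        while (true) {
            int c = current.read();
            if (c != -1) {
                return c;
            }
            if (stack.isEmpty()) {
                return -1; // the original source is exhausted
            }
            current.close();       // the pushed source is finished
            current = stack.pop(); // continue where the outer one left off
        }
    }
}
```

Reading "a" from an original source "ab", pushing "XY", and then reading to the end yields X, Y, b in that order.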

cleanup

public void cleanup(boolean closeall)
Cleans up used resources. For example, if scanning is terminated early, then this method ensures all remaining open streams are closed.

Parameters:
closeall - If true, close all streams, including the original. The flag exists for cases where the application has opened the original document stream and should remain responsible for closing it; pass false in that case.

getEncoding

public java.lang.String getEncoding()
Returns the encoding.

Specified by:
getEncoding in interface org.apache.xerces.xni.XMLLocator

getPublicId

public java.lang.String getPublicId()
Returns the public identifier.

Specified by:
getPublicId in interface org.apache.xerces.xni.XMLLocator

getBaseSystemId

public java.lang.String getBaseSystemId()
Returns the base system identifier.

Specified by:
getBaseSystemId in interface org.apache.xerces.xni.XMLLocator

getLiteralSystemId

public java.lang.String getLiteralSystemId()
Returns the literal system identifier.

Specified by:
getLiteralSystemId in interface org.apache.xerces.xni.XMLLocator

getExpandedSystemId

public java.lang.String getExpandedSystemId()
Returns the expanded system identifier.

Specified by:
getExpandedSystemId in interface org.apache.xerces.xni.XMLLocator

getLineNumber

public int getLineNumber()
Returns the current line number.

Specified by:
getLineNumber in interface org.apache.xerces.xni.XMLLocator

getColumnNumber

public int getColumnNumber()
Returns the current column number.

Specified by:
getColumnNumber in interface org.apache.xerces.xni.XMLLocator

getXMLVersion

public java.lang.String getXMLVersion()
Returns the XML version.

Specified by:
getXMLVersion in interface org.apache.xerces.xni.XMLLocator

getCharacterOffset

public int getCharacterOffset()
Returns the character offset.

Specified by:
getCharacterOffset in interface org.apache.xerces.xni.XMLLocator

getFeatureDefault

public java.lang.Boolean getFeatureDefault(java.lang.String featureId)
Returns the default state for a feature.

Specified by:
getFeatureDefault in interface org.apache.xerces.xni.parser.XMLComponent
Specified by:
getFeatureDefault in interface HTMLComponent

getPropertyDefault

public java.lang.Object getPropertyDefault(java.lang.String propertyId)
Returns the default state for a property.

Specified by:
getPropertyDefault in interface org.apache.xerces.xni.parser.XMLComponent
Specified by:
getPropertyDefault in interface HTMLComponent

getRecognizedFeatures

public java.lang.String[] getRecognizedFeatures()
Returns recognized features.

Specified by:
getRecognizedFeatures in interface org.apache.xerces.xni.parser.XMLComponent

getRecognizedProperties

public java.lang.String[] getRecognizedProperties()
Returns recognized properties.

Specified by:
getRecognizedProperties in interface org.apache.xerces.xni.parser.XMLComponent

reset

public void reset(org.apache.xerces.xni.parser.XMLComponentManager manager)
           throws org.apache.xerces.xni.parser.XMLConfigurationException
Resets the component.

Specified by:
reset in interface org.apache.xerces.xni.parser.XMLComponent
Throws:
org.apache.xerces.xni.parser.XMLConfigurationException

setFeature

public void setFeature(java.lang.String featureId,
                       boolean state)
                throws org.apache.xerces.xni.parser.XMLConfigurationException
Sets a feature.

Specified by:
setFeature in interface org.apache.xerces.xni.parser.XMLComponent
Throws:
org.apache.xerces.xni.parser.XMLConfigurationException

setProperty

public void setProperty(java.lang.String propertyId,
                        java.lang.Object value)
                 throws org.apache.xerces.xni.parser.XMLConfigurationException
Sets a property.

Specified by:
setProperty in interface org.apache.xerces.xni.parser.XMLComponent
Throws:
org.apache.xerces.xni.parser.XMLConfigurationException

setInputSource

public void setInputSource(org.apache.xerces.xni.parser.XMLInputSource source)
                    throws java.io.IOException
Sets the input source.

Specified by:
setInputSource in interface org.apache.xerces.xni.parser.XMLDocumentScanner
Throws:
java.io.IOException

scanDocument

public boolean scanDocument(boolean complete)
                     throws org.apache.xerces.xni.XNIException,
                            java.io.IOException
Scans the document.

Specified by:
scanDocument in interface org.apache.xerces.xni.parser.XMLDocumentScanner
Throws:
org.apache.xerces.xni.XNIException
java.io.IOException

setDocumentHandler

public void setDocumentHandler(org.apache.xerces.xni.XMLDocumentHandler handler)
Sets the document handler.

Specified by:
setDocumentHandler in interface org.apache.xerces.xni.parser.XMLDocumentSource

getDocumentHandler

public org.apache.xerces.xni.XMLDocumentHandler getDocumentHandler()
Returns the document handler.

Specified by:
getDocumentHandler in interface org.apache.xerces.xni.parser.XMLDocumentSource

getValue

protected static java.lang.String getValue(org.apache.xerces.xni.XMLAttributes attrs,
                                           java.lang.String aname)
Returns the value of the specified attribute, ignoring case.


expandSystemId

public static java.lang.String expandSystemId(java.lang.String systemId,
                                              java.lang.String baseSystemId)
Expands a system id and returns the system id as a URI, if it can be expanded. A return value of null means that the identifier is already expanded. An exception thrown indicates a failure to expand the id.

Parameters:
systemId - The systemId to be expanded.
baseSystemId - The base system identifier against which a relative systemId is resolved.
Returns:
Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
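
The contract above (null when the identifier is already expanded, otherwise a resolved URI string) can be approximated with java.net.URI. This is a sketch under assumptions: SystemIdSketch is a hypothetical name, "already expanded" is taken to mean "has a URI scheme", and a missing base falls back to the current working directory. The real method may differ on all three points.

```java
import java.net.URI;
import java.nio.file.Paths;

final class SystemIdSketch {
    /**
     * Returns the expanded form of systemId, or null if it is already
     * absolute. URI.create throws IllegalArgumentException for input that
     * is not a valid URI, which plays the role of "failure to expand".
     */
    static String expand(String systemId, String baseSystemId) {
        URI id = URI.create(systemId);
        if (id.isAbsolute()) {
            return null; // has a scheme, so nothing to expand
        }
        URI base = (baseSystemId == null || baseSystemId.isEmpty())
                ? Paths.get("").toAbsolutePath().toUri() // assumed fallback
                : URI.create(baseSystemId);
        return base.resolve(id).toString();
    }
}
```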

fixURI

protected static java.lang.String fixURI(java.lang.String str)
Fixes a platform dependent filename to standard URI form.

Parameters:
str - The string to fix.
Returns:
Returns the fixed URI string.

modifyName

protected static final java.lang.String modifyName(java.lang.String name,
                                                   short mode)
Modifies the given name based on the specified mode.


getNamesValue

protected static final short getNamesValue(java.lang.String value)
Converts HTML names string value to constant value.

See Also:
NAMES_NO_CHANGE, NAMES_LOWERCASE, NAMES_UPPERCASE
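
As a rough illustration of how getNamesValue and modifyName fit together, the sketch below maps the documented strings "upper", "lower", and "default" to mode constants and applies them. The numeric constant values and the treatment of unrecognized strings as "no change" are assumptions; this page does not show them. NamesSketch is a hypothetical class.

```java
import java.util.Locale;

final class NamesSketch {
    // Hypothetical values; the real constants' values are not shown here.
    static final short NAMES_NO_CHANGE = 0;
    static final short NAMES_UPPERCASE = 1;
    static final short NAMES_LOWERCASE = 2;

    /** Converts "upper" / "lower" / anything else to a mode constant. */
    static short getNamesValue(String value) {
        if ("lower".equals(value)) {
            return NAMES_LOWERCASE;
        }
        if ("upper".equals(value)) {
            return NAMES_UPPERCASE;
        }
        return NAMES_NO_CHANGE; // "default" and unknown values (assumed)
    }

    /** Returns the name with its case changed according to the mode. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case NAMES_UPPERCASE:
                return name.toUpperCase(Locale.ROOT);
            case NAMES_LOWERCASE:
                return name.toLowerCase(Locale.ROOT);
            default:
                return name;
        }
    }
}
```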

fixWindowsCharacter

protected int fixWindowsCharacter(int origChar)
Fixes Microsoft Windows® specific characters.

Details about this common problem can be found at http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
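
The problem in question is that some documents use numeric references in the range 0x80 to 0x9F with their Windows-1252 meaning, where Unicode defines control characters. A partial sketch of such a fix follows; the table lists only a few well-known Windows-1252 code points, and WindowsCharSketch is a hypothetical name. The scanner's own table may cover more characters.

```java
final class WindowsCharSketch {
    /** Maps selected Windows-1252 code points to their Unicode equivalents. */
    static int fix(int origChar) {
        switch (origChar) {
            case 0x80: return 0x20AC; // euro sign
            case 0x82: return 0x201A; // single low-9 quotation mark
            case 0x85: return 0x2026; // horizontal ellipsis
            case 0x91: return 0x2018; // left single quotation mark
            case 0x92: return 0x2019; // right single quotation mark
            case 0x93: return 0x201C; // left double quotation mark
            case 0x94: return 0x201D; // right double quotation mark
            case 0x95: return 0x2022; // bullet
            case 0x96: return 0x2013; // en dash
            case 0x97: return 0x2014; // em dash
            case 0x99: return 0x2122; // trade mark sign
            default:   return origChar; // everything else is left unchanged
        }
    }
}
```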


read

protected int read()
            throws java.io.IOException
Reads a single character.

Throws:
java.io.IOException

load

protected int load(int offset)
            throws java.io.IOException
Loads a new chunk of data into the buffer and returns the number of characters loaded or -1 if no additional characters were loaded.

Parameters:
offset - The offset at which new characters should be loaded.
Throws:
java.io.IOException
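
The return convention of load (a character count, or -1 when nothing more could be loaded) mirrors java.io.Reader.read. A minimal sketch, assuming a fixed-size character buffer filled from a Reader; BufferSketch and its fields are hypothetical and do not reflect the scanner's actual buffer management.

```java
import java.io.IOException;
import java.io.Reader;

final class BufferSketch {
    final char[] buffer;
    int length; // number of valid characters currently in the buffer
    private final Reader reader;

    BufferSketch(Reader reader, int size) {
        this.reader = reader;
        this.buffer = new char[size];
    }

    /**
     * Fills the buffer starting at offset and returns the number of
     * characters read, or -1 if the reader is exhausted. Characters
     * before offset are preserved.
     */
    int load(int offset) throws IOException {
        int count = reader.read(buffer, offset, buffer.length - offset);
        length = (count > 0) ? offset + count : offset;
        return count;
    }
}
```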

setScanner

protected void setScanner(HTMLScanner.Scanner scanner)
Sets the scanner.


setScannerState

protected void setScannerState(short state)
Sets the scanner state.


scanDoctype

protected void scanDoctype()
                    throws java.io.IOException
Scans a DOCTYPE line.

Throws:
java.io.IOException

scanLiteral

protected java.lang.String scanLiteral()
                                throws java.io.IOException
Scans a quoted literal.

Throws:
java.io.IOException

scanName

protected java.lang.String scanName()
                             throws java.io.IOException
Scans a name.

Throws:
java.io.IOException

scanEntityRef

protected int scanEntityRef(org.apache.xerces.util.XMLStringBuffer str,
                            boolean content)
                     throws java.io.IOException
Scans an entity reference.

Throws:
java.io.IOException

skip

protected boolean skip(java.lang.String s,
                       boolean caseSensitive)
                throws java.io.IOException
Returns true if the specified text is present and is skipped.

Throws:
java.io.IOException
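
A match-and-consume helper of this kind can be sketched over an in-memory string. The version below assumes that a failed match leaves the position untouched; SkipSketch is a hypothetical class, and the real method works on the scanner's buffer and may need to load more data.

```java
final class SkipSketch {
    private final String input;
    int offset; // current scan position

    SkipSketch(String input) {
        this.input = input;
    }

    /** Returns true and advances past s if s occurs at the current offset. */
    boolean skip(String s, boolean caseSensitive) {
        // String.regionMatches takes an ignoreCase flag, the inverse of
        // the caseSensitive parameter used here.
        if (input.regionMatches(!caseSensitive, offset, s, 0, s.length())) {
            offset += s.length();
            return true;
        }
        return false;
    }
}
```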

skipMarkup

protected boolean skipMarkup(boolean balance)
                      throws java.io.IOException
Skips markup.

Throws:
java.io.IOException

skipSpaces

protected boolean skipSpaces()
                      throws java.io.IOException
Skips whitespace.

Throws:
java.io.IOException

skipNewlines

protected int skipNewlines()
                    throws java.io.IOException
Skips newlines and returns the number of newlines skipped.

Throws:
java.io.IOException

skipNewlines

protected int skipNewlines(int maxlines)
                    throws java.io.IOException
Skips up to maxlines newlines and returns the number of newlines skipped.

Throws:
java.io.IOException
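
The two skipNewlines overloads can be illustrated as follows. The sketch assumes that "\r\n" counts as a single newline and that maxlines caps how many newlines are consumed; neither detail is stated on this page. NewlineSketch is a hypothetical class operating on an in-memory string.

```java
final class NewlineSketch {
    private final String input;
    int offset; // current scan position

    NewlineSketch(String input) {
        this.input = input;
    }

    /** Skips all consecutive newlines at the current position. */
    int skipNewlines() {
        return skipNewlines(Integer.MAX_VALUE);
    }

    /** Skips at most maxlines newlines and returns how many were skipped. */
    int skipNewlines(int maxlines) {
        int newlines = 0;
        while (newlines < maxlines && offset < input.length()) {
            char c = input.charAt(offset);
            if (c == '\r') {
                offset++;
                // Treat "\r\n" as one newline (assumed).
                if (offset < input.length() && input.charAt(offset) == '\n') {
                    offset++;
                }
                newlines++;
            } else if (c == '\n') {
                offset++;
                newlines++;
            } else {
                break; // not a newline character, stop skipping
            }
        }
        return newlines;
    }
}
```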

locationAugs

protected final org.apache.xerces.xni.Augmentations locationAugs()
Returns an augmentations object with a location item added.


synthesizedAugs

protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
Returns an augmentations object with a synthesized item added.


resourceId

protected final org.apache.xerces.xni.XMLResourceIdentifier resourceId()
Returns an empty resource identifier.


builtinXmlRef

protected static boolean builtinXmlRef(java.lang.String name)
Returns true if the name is a built-in XML general entity reference.
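
Going by the note on NOTIFY_XML_BUILTIN_REFS, the built-in XML general entities are "amp", "lt", "gt", "quot", and "apos". A check along those lines might look like the sketch below; XmlRefSketch is a hypothetical class, and the comparison is assumed to be case-sensitive.

```java
final class XmlRefSketch {
    /** Returns true for the five pre-defined XML general entity names. */
    static boolean builtinXmlRef(String name) {
        return "amp".equals(name)
            || "lt".equals(name)
            || "gt".equals(name)
            || "quot".equals(name)
            || "apos".equals(name);
    }
}
```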



(C) Copyright 2002-2008, Andy Clark. All rights reserved.