Read XML skipping DTD

Albert Y. C. Lai, trebla@vex.net

When you use HXT readDocument, it fetches whatever DTD file is specified in the document (even when you turn off validation, since the DTD may define entities that the document references). Sometimes you don't want the DTD; sometimes fetching it is harmful, e.g., needlessly fetching W3C DTDs is not nice.

So I wrote these alternatives, derived from the source code of readDocument and parseXmlDocument. They skip fetching DTDs, and therefore also skip namespace processing and validation. Out of laziness, they also lack tracing. Most options are ignored; I think only these are honoured: a_proxy, a_use_curl, a_options_curl, a_encoding. Very few canonicalizations are done: standard XML entity references, character references, CDATA, and string merging. (You can skip those too by calling readXmlDocPristine instead.) Processing instructions, DTD declarations, comments, and whitespace are preserved; you can further change some of them with tools from Text.XML.HXT.Arrow.Edit.

readXmlDoc :: Attributes -> String -> IOStateArrow s b XmlTree
readXmlDoc options uri = readXmlDocPristine options uri >>> simplifyXmlStrings

readXmlDocPristine :: Attributes -> String -> IOStateArrow s b XmlTree
readXmlDocPristine options uri =
    getDocumentContents options uri
    >>>
    ( ( replaceChildren ( ( getAttrValue a_source
                            &&&
                            xshow getChildren
                          )
                          >>>
                          parseXmlDoc
                          >>>
                          filterErrorMsg
                        )
        >>>
        setDocumentStatusFromSystemState "parse XML document"
      )
      `when` documentStatusOk )

simplifyXmlStrings :: (ArrowList a) => a XmlTree XmlTree
simplifyXmlStrings = substXmlEntityRefs >>> canonicalizeContents
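
To make the few canonicalizations listed earlier concrete (standard entity references, character references, CDATA, string merging), here is a toy model in plain Haskell. It does not use HXT at all: the Node type and the functions substEntity, toText, mergeText, and simplify are hypothetical names made up for this illustration. HXT's real substXmlEntityRefs and canonicalizeContents work on XmlTree and may differ in detail.

```haskell
import Data.Char (chr)

-- Toy stand-ins for parsed content nodes (hypothetical; not HXT's XmlTree).
data Node
    = Text String        -- plain character data
    | EntityRef String   -- e.g. &amp; is EntityRef "amp"
    | CharRef Int        -- e.g. &#38; is CharRef 38
    | CData String       -- contents of a CDATA section
    | Comment String     -- left untouched by every step below
    deriving (Eq, Show)

-- Resolve the five predefined XML entities; leave any other reference alone.
substEntity :: Node -> Node
substEntity (EntityRef name) =
    case lookup name predefined of
        Just c  -> Text [c]
        Nothing -> EntityRef name
  where
    predefined =
        [("lt", '<'), ("gt", '>'), ("amp", '&'), ("apos", '\''), ("quot", '"')]
substEntity n = n

-- Turn character references and CDATA sections into plain text.
toText :: Node -> Node
toText (CharRef code) = Text [chr code]
toText (CData s)      = Text s
toText n              = n

-- Merge runs of adjacent text nodes into a single text node.
mergeText :: [Node] -> [Node]
mergeText (Text a : Text b : rest) = mergeText (Text (a ++ b) : rest)
mergeText (n : rest)               = n : mergeText rest
mergeText []                       = []

-- All steps together: substitute, convert to text, then merge.
simplify :: [Node] -> [Node]
simplify = mergeText . map (toText . substEntity)

main :: IO ()
main = print (simplify
    [ Text "a ", EntityRef "lt", Text " b ", CharRef 38
    , CData " c", Comment "kept" ])
-- prints [Text "a < b & c",Comment "kept"]
```

In this toy, an entity reference that is not one of the five predefined ones stays as it is, mirroring the fact that no DTD is read, so entities declared there cannot be resolved.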
