When you use HXT's readDocument, it fetches whatever DTD file is specified in the document (even when you turn off validation, since the DTD may define entity references). Sometimes you don't want the DTD; sometimes fetching it is actually harmful, e.g., needlessly fetching W3C DTDs is not nice.
So I wrote these alternatives, derived from the source code of readDocument and parseXmlDocument. They skip fetching DTDs, and therefore also skip namespace handling and validation. Out of laziness on my part, they also lack tracing. Most options are ignored. I think only these are honoured: a_proxy, a_use_curl, a_options_curl, a_encoding. Very few canonicalizations are done:
standard XML entity references, character references, CDATA, and string merging. (You can skip even these by calling readXmlDocPristine instead.) Processing instructions, DTD declarations, comments, and whitespace are preserved; you can change some of them afterwards with the tools in Text.XML.HXT.Arrow.Edit.
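    -- Read a document without fetching its DTD, then apply the light
    -- canonicalizations listed above.
    readXmlDoc :: Attributes -> String -> IOStateArrow s b XmlTree
    readXmlDoc options uri
        = readXmlDocPristine options uri
          >>> simplifyXmlStrings

    -- Read and parse only; entity references, character references and
    -- CDATA sections are left as they appear in the source.
    readXmlDocPristine :: Attributes -> String -> IOStateArrow s b XmlTree
    readXmlDocPristine options uri
        = getDocumentContents options uri
          >>>
          ( ( replaceChildren
              ( ( getAttrValue a_source &&& xshow getChildren )
                >>> parseXmlDoc
                >>> filterErrorMsg
              )
              >>> setDocumentStatusFromSystemState "parse XML document"
            )
            `when` documentStatusOk
          )

    -- Substitute the standard XML entity references, then canonicalize
    -- the contents.
    simplifyXmlStrings :: (ArrowList a) => a XmlTree XmlTree
    simplifyXmlStrings = substXmlEntityRefs >>> canonicalizeContents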
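A minimal usage sketch, under a few assumptions that are mine rather than the original code's: the definitions above live in a hypothetical module named ReadXmlDoc, the installed HXT is one of the older releases with the Attributes-based options API (where Text.XML.HXT.Arrow is the umbrella import), and example.xml is a placeholder file name. It reads the file without touching its DTD and writes the result to standard output.

```haskell
module Main where

-- Umbrella module of the older, Attributes-based HXT API (assumed version).
import Text.XML.HXT.Arrow
-- Hypothetical module holding readXmlDoc and friends from above.
import ReadXmlDoc (readXmlDoc)

main :: IO ()
main = do
  -- a_encoding is one of the few options readXmlDoc honours.
  -- "example.xml" is a placeholder input; "-" sends the output to stdout.
  _ <- runX ( readXmlDoc [(a_encoding, utf8)] "example.xml"
              >>> writeDocument [(a_indent, v_1)] "-" )
  return ()
```

Swapping readXmlDoc for readXmlDocPristine in the same pipeline should leave entity references, character references, and CDATA sections unsubstituted.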