When you use HXT's readDocument, it fetches whatever DTD file is
specified in the document (even when you turn off validation, since
the DTD may define entity references). Sometimes you don't want the
DTD; sometimes fetching it is actively harmful, e.g., needlessly
fetching W3C DTDs is not nice.

So I wrote these alternatives, derived from the source code of
readDocument and parseXmlDocument. They skip fetching DTDs, and
therefore also skip namespace processing and validation. Out of
laziness, I also left out tracing. Most options are ignored; I think
only these are honoured: a_proxy, a_use_curl, a_options_curl,
a_encoding.

Very few canonicalizations are done: standard XML entity references,
character references, CDATA, and string merging. (You can skip even
these by calling readXmlDocPristine instead.) Processing instructions,
DTD declarations, comments, and whitespace are preserved; you can
further change some of them with tools from Text.XML.HXT.Arrow.Edit.

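-- Read and parse a document without fetching its DTD, then apply the
-- few canonicalizations listed above.
readXmlDoc :: Attributes -> String -> IOStateArrow s b XmlTree
readXmlDoc options uri = readXmlDocPristine options uri >>> simplifyXmlStrings

-- Read and parse only; no canonicalization at all.
readXmlDocPristine :: Attributes -> String -> IOStateArrow s b XmlTree
readXmlDocPristine options uri =
    getDocumentContents options uri
    >>>
    ( ( replaceChildren ( ( getAttrValue a_source
                            &&&
                            xshow getChildren
                          )
                          >>>
                          parseXmlDoc
                          >>>
                          filterErrorMsg
                        )
        >>>
        setDocumentStatusFromSystemState "parse XML document"
      )
      `when` documentStatusOk )

-- Substitute the standard XML entity references, then resolve character
-- references and CDATA and merge adjacent strings.
simplifyXmlStrings :: (ArrowList a) => a XmlTree XmlTree
simplifyXmlStrings = substXmlEntityRefs >>> canonicalizeContents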
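
For illustration, here is a minimal sketch of how readXmlDoc might be called. It rests on several assumptions: an HXT version with the Attributes-based option lists used above, Text.XML.HXT.Arrow as the umbrella import, a hypothetical input file example.xml, and the three definitions above pasted into the same module (together with whatever extra imports they need in your HXT version). removeAllComment is one of the tools from Text.XML.HXT.Arrow.Edit mentioned earlier.

```haskell
module Main where

-- Assumption: an older, Attributes-based HXT release in which this
-- umbrella module provides runX, a_encoding, removeAllComment and
-- writeDocumentToString.
import Text.XML.HXT.Arrow

-- Assumption: readXmlDoc, readXmlDocPristine and simplifyXmlStrings
-- from the listing above are defined in this module as well.

main :: IO ()
main = do
    -- "example.xml" is a hypothetical file name; any URI that
    -- getDocumentContents accepts should work in its place.
    out <- runX ( readXmlDoc [(a_encoding, "UTF-8")] "example.xml"
                  >>>
                  removeAllComment           -- comments are otherwise preserved
                  >>>
                  writeDocumentToString []   -- serialize the tree back to text
                )
    mapM_ putStr out
```

Swapping readXmlDoc for readXmlDocPristine in the sketch leaves entity references, character references, and CDATA sections untouched in the resulting tree.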