Thursday, July 23, 2009

Removing single-line and multiline comments from XML files

Though most XML parsers can ignore XML comments (<!-- -->), comments make life tough when an XML file is processed through bash shell scripts.

I came across such a scenario recently, where I had to remove all the comments from an XML file before processing it with grep, sed, awk and other shell utilities.

Sed proved to be a handy tool for removing all the single-line and multiline comments from XML files.

Sample XML file (assuming the filename is sample.xml):
<?xml version="1.0" encoding="ISO-8859-1"?>
<!--
If the message tag does not contain a definition of a property,
the default value will be used.
-->
<message>
<value>reference</value>
</message>

<!-- some comment -->
<!-- another comment -->

<!--
This is another multiline comment.
line
-->
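
The command below removes all the comments from sample.xml. The first sed deletes every line that holds a complete comment, and the second deletes every block of lines from an opening <!-- to the next closing -->. Both work on whole lines, so any markup sharing a line with a comment is deleted too.

$ cat sample.xml | sed '/<!--.*-->/d' | sed '/<!--/,/-->/d'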

Result:
<?xml version="1.0" encoding="ISO-8859-1"?>
<message>
<value>reference</value>
</message>
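
Since the sed command above deletes whole lines, a file that has a comment and an element on the same line would lose the element as well. Below is a minimal sketch of one way around that using awk. This is not part of the original command; the function name strip_xml_comments is made up for illustration. It cuts out only the comment text, keeps whatever else is on the line, and drops lines that end up blank. It does not understand CDATA sections, so a <!-- inside CDATA would still be treated as a comment.

```shell
#!/bin/sh
# strip_xml_comments: hypothetical helper, not from the original post.
# Removes <!-- ... --> comments, including ones that share a line with markup.
# Reads the files given as arguments, or stdin when there are none.
strip_xml_comments() {
    awk '
    {
        line = $0   # part of the current line still to be scanned
        out = ""    # part of the current line that is kept
        while (line != "") {
            if (in_comment) {
                # inside a comment: look for the closing marker
                end = index(line, "-->")
                if (end == 0) { line = ""; break }
                line = substr(line, end + 3)
                in_comment = 0
            } else {
                # outside a comment: keep text up to the next opening marker
                start = index(line, "<!--")
                if (start == 0) { out = out line; line = ""; break }
                out = out substr(line, 1, start - 1)
                line = substr(line, start + 4)
                in_comment = 1
            }
        }
        # print only lines that still hold something besides spaces and tabs
        if (out ~ /[^ \t]/) print out
    }' "$@"
}
```

Usage, assuming the function is defined in the current shell:

$ strip_xml_comments sample.xml
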
Cheers,
make world open.