Thursday, July 23, 2009

Removing singleline and multiline comments from XML files.

Though most XML parsers can ignore XML comments (<!-- -->), processing an XML file with bash shell scripts is harder, because line-oriented tools treat comments as ordinary text.

I came across such a scenario recently, where I had to remove all the comments from an XML file before processing it with grep, sed, awk and other bash shell utilities.

Sed proved to be a handy tool for removing all the single-line and multiline comments from the XML file.

Sample XML file (assuming the filename is sample.xml):
<?xml version="1.0" encoding="ISO-8859-1"?>
<!--
If the message tag does not contain a definition of a property,
the default value will be used.
-->
<message>
<value>reference</value>
</message>

<!-- some comment --
>
<!-- another comment -->

<!--
This is another multiline comment.
line
-->
The command below removes all the comments in the sample.xml file:

$ cat sample.xml | sed '/<!--.*-->/d' | sed '/<!--/,/-->/d'

Result:
<?xml version="1.0" encoding="ISO-8859-1"?>
<message>
<value>reference</value>
</message>
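
For anyone who wants to try this end to end, here is a minimal sketch that recreates a trimmed-down version of the sample and runs the same two sed expressions on it. The function name strip_xml_comments and the temporary file are assumptions made for illustration; only the two sed expressions come from the command above.

```shell
#!/bin/sh
# strip_xml_comments is a hypothetical helper name, not part of the original command.
# It reads the files given as arguments, or standard input when none are given.
strip_xml_comments() {
    # Stage 1: delete every line that holds a complete <!-- ... --> comment.
    # Stage 2: delete every run of lines from one containing <!-- up to the
    #          next one containing -->.
    sed '/<!--.*-->/d' "$@" | sed '/<!--/,/-->/d'
}

# Recreate a small version of the sample in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="ISO-8859-1"?>
<!--
multiline comment
-->
<message>
<value>reference</value>
</message>
<!-- another comment -->
EOF

strip_xml_comments "$sample"
rm -f "$sample"
```

Both stages work on whole lines, so anything else sitting on a line that contains a comment is removed along with it.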
Cheers,
make world open.

4 comments:

mats said...

when I try this I get the error:

sed: -e expression #1, char 1: unknown command: `<'

Daredevil said...

cat sample| sed '/<!--.*-->/d'| sed '/<!--/,/-->/d'

This one works. I think he forgot to add another / after the second sed.

danijel said...

This doesn't work on this sample

<!-- comment --> NOT COMMENT <!-- comment -->

It removes the "NOT COMMENT" even though it shouldn't. This is because you assume there is only one comment per line.

mperez said...

This regex is not complete. For example, it fails if I include a string that contains a comment:

""