XML can now contain translatable HTML code

With Wordbee you can translate XML files that contain good old HTML code.

A simple example

Let us look at the two ways of embedding HTML inside an XML document:

<?xml version="1.0" encoding="UTF-8" ?>
        <block title="Introduction"><![CDATA[
           <p>Welcome to our book</p>
           <p><a href="#" title="click me">Continue reading &amp; browsing</a></p>
        ]]></block>

The <block> node contains HTML code. Wrapping it in CDATA markers (<![CDATA[ ... ]]>) means that the HTML can be inserted directly, as is. Let us call this the "CDATA method".

Alternatively, your XML files may contain encoded HTML. This is the "Encoded html method":

<?xml version="1.0" encoding="UTF-8" ?>
        <block title="Introduction">
           &lt;p&gt;Welcome to our book&lt;/p&gt;
           &lt;p&gt;&lt;a href="#" title="click me"&gt;Continue reading &amp;amp; browsing&lt;/a&gt;&lt;/p&gt;
        </block>

Here all reserved XML characters <, > and & are replaced by &lt;, &gt; and &amp; respectively. That looks less readable but is actually the more common and proper way of doing things... After all, XML was invented by IT people ;-)
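
If you want to convince yourself that both methods carry exactly the same HTML, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It only illustrates how a generic XML parser reads the two variants; it is not how Wordbee works internally, and the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# CDATA method: the HTML is inserted as is.
cdata_xml = (
    '<block title="Introduction"><![CDATA['
    '<p>Welcome to our book</p>'
    '<p><a href="#" title="click me">Continue reading &amp; browsing</a></p>'
    ']]></block>'
)

# Encoded html method: <, > and & are replaced by &lt;, &gt; and &amp;.
encoded_xml = (
    '<block title="Introduction">'
    '&lt;p&gt;Welcome to our book&lt;/p&gt;'
    '&lt;p&gt;&lt;a href="#" title="click me"&gt;'
    'Continue reading &amp;amp; browsing&lt;/a&gt;&lt;/p&gt;'
    '</block>'
)

# In both cases the parser hands back the HTML as the text of <block>.
html_from_cdata = ET.fromstring(cdata_xml).text
html_from_encoded = ET.fromstring(encoded_xml).text

print(html_from_cdata == html_from_encoded)
print(html_from_cdata)
```

Note how &amp;amp; in the encoded variant decodes to &amp;, which is the very same HTML entity that the CDATA variant contains literally.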

By the way, a real XML file would likely contain more than one <block> node. But I wanted to keep it simple here.

Configuring Wordbee - Set up once, use forever

All that is left to do is to tell Wordbee which content requires translation. Click the "Settings" button in the top navigation menu, then "XML" in the list of document formats. Finally, click the link to create a new configuration.

Tick the "HTML contents" option. Below this option you can choose a configuration for extracting HTML text. This lets you, for example, customize which HTML nodes are translatable. Leave the default if your HTML is mostly standard.
Now scroll down on this page to specify the translatable XML nodes.

The rules are so-called XPath expressions:
  • //block tells the system that all <block> nodes need to be translated. Since the content is HTML, I ticked the corresponding option to the right. Last but not least, tick "Encoded" if the content follows the Encoded html method. Untick the option for the CDATA method.
  • //block/@title will extract the block titles <block title="xyz"> for translation. This one is just plain text. Important: Attribute rules like this should be put before the rule for extracting the node contents.
Save the configuration and ... Congratulations! ... You have just set up the system to translate all your "books" in XML format for the next couple of years.
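
To get a feeling for what the two rules select, here is a small Python sketch. It is an illustration only and makes a few assumptions: the <book> root element and the second block are invented for this example, and since ElementTree supports just a subset of XPath (it cannot select attributes such as @title), the titles are read with .get() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file with two <block> nodes (invented for this sketch).
book_xml = (
    '<book>'
    '<block title="Introduction"><![CDATA[<p>Welcome to our book</p>]]></block>'
    '<block title="Chapter 1"><![CDATA[<p>Once upon a time</p>]]></block>'
    '</book>'
)

root = ET.fromstring(book_xml)

# Equivalent of the rule //block: every <block> node in the document.
blocks = root.findall(".//block")

# Equivalent of the rule //block/@title: the title attribute of each block.
titles = [block.get("title") for block in blocks]

# The node contents, here the HTML that goes to the HTML extraction step.
contents = [block.text for block in blocks]

for title, html in zip(titles, contents):
    print(title, "->", html)
```

A full XPath engine would accept //block/@title directly; the sketch only mimics the result of the rule.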

Translating a file

Upload your XML files to a Wordbee project and mark them for online translation. You will be asked which XML configuration to use.
It is very important to select the right configuration! Otherwise Wordbee does not know which content to translate.
Our sample file then shows up in the translation editor.
You can see that everything is extracted: the chapter title, the first HTML paragraph, the hyperlink title and the actual hyperlink text. Our HTML example is so simple that there are not even any inline tags in the editor :-)

Advanced topics for experts

  • The XML and the embedded HTML must use the same character encoding (utf-8, windows-1252...). You cannot mix different encodings, such as a utf-8 XML file embedding big5 HTML.
  • The embedded HTML can be fragments, as in the example above, or a complete HTML page with headers, body element, javascript...
  • If your HTML is XHTML compliant you can also insert the code without using CDATA and without encoding it. In such a case you would untick the "Encoded" option in the configuration. This also requires declaring all HTML entity references in the XML document.
  • In the XML configuration page you can select an HTML configuration. Create your own if you need to do things such as excluding certain HTML texts from translation, choosing whether to extract Javascript texts, and much more.
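
The point about entity references in inline XHTML can be tried out with a generic XML parser. The following Python sketch is only an illustration under assumptions: the sample content with &nbsp; is invented, and the entity is declared in an internal DOCTYPE subset, which is one common way to declare it.

```python
import xml.etree.ElementTree as ET

# Inline XHTML: no CDATA, no encoding. &nbsp; is declared in the DOCTYPE.
declared_xml = (
    '<!DOCTYPE block [<!ENTITY nbsp "&#160;">]>'
    '<block title="Introduction">'
    '<p>Welcome to&nbsp;our book</p>'
    '</block>'
)

block = ET.fromstring(declared_xml)
# The HTML is now part of the XML tree: <p> is a real child node of <block>.
paragraph_text = block.find("p").text

# The same content without the declaration is not well-formed XML.
undeclared_xml = '<block><p>Welcome to&nbsp;our book</p></block>'
try:
    ET.fromstring(undeclared_xml)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

print(repr(paragraph_text))
print(parsed_without_declaration)
```

Only the five predefined XML entities (&lt;, &gt;, &amp;, &quot; and &apos;) work without a declaration; anything else from HTML, such as &nbsp;, has to be declared first.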

Do not hesitate to contact us if you need advice with your XMLs!

Your Wordbee Team!