XSS and HTML formatted text

No pasaran!

We have been working on Starty.co Answers for quite a while, and we cannot stop adding new features and honing existing ones.
In the end we decided to release the first version of Answers without such a convenient thing as text markup, like BBCode. However, we did some preparatory work in this area and decided that in the near future it will include the TinyMCE WYSIWYG editor and the Highlight.js module to highlight code snippets.

Even though we decided not to include full markup in version one, it was necessary to implement the basics: replace line breaks with <br>, and parse http(s):// links in the text and replace them with <a href=""></a>.

After adding the newline-to-<br> translation, I started looking at how smart people implement link parsing and replacement with <a href=""></a> in the text. In the end I settled on this approach:

// Wrap every http(s):// URL found in $string in an <a> tag
$string = preg_replace('@(https?://([-\w\.]+)+(/([\w/_\.]*(\?\S+)?(#\S+)?)?)?)@',
    '<a href="$1">$1</a>',
    $string);
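
One caveat with the pattern above: \S+ also matches quote characters, so running it over raw user input can let a crafted URL break out of the href attribute. Below is a minimal sketch of one way around that, assuming the text is escaped with htmlspecialchars() before the links are inserted. The helper name format_plain_text() is hypothetical and is not code from Answers.

```php
<?php
// Hypothetical helper (an assumption, not from the original post):
// escape first, then linkify, then convert line breaks.
function format_plain_text(string $text): string
{
    // 1. Neutralise any HTML the user typed, including both quote styles.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // 2. Turn http(s):// URLs into links. Because the text is already
    //    escaped, a match cannot contain a raw quote or angle bracket.
    $safe = preg_replace(
        '@(https?://([-\w\.]+)+(/([\w/_\.]*(\?\S+)?(#\S+)?)?)?)@',
        '<a href="$1">$1</a>',
        $safe
    );

    // 3. Insert <br> before each line break.
    return nl2br($safe, false);
}

echo format_plain_text("Hi <b>\nhttp://example.com/a?x=1"), PHP_EOL;
```

The order matters in this sketch: escaping after linkifying would also escape the generated <a> tags.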

But first, as usual, I checked whether there was anything suitable on php.net. I came across this comment, which describes simple preg_replace() rules for processing BBCode. I thought of adding support for at least a few basic BBCode tags, but we were planning to add TinyMCE with HTML markup! If I implemented simple BBCode in the first version, we would have to keep supporting it in the next one, because someone would use BBCode and it would suddenly stop working. Or, during an upgrade from version one, we would have to find and convert the BBCode in the database. This is not good, I thought, and decided that I could just as easily implement several popular HTML tags.

At this point I started reading about how to store and display text with HTML markup. Storing it is actually no problem; everything is clear. The database does not care what is inside the text.
The problem, of course, is with the output. On the one hand, we need to keep the HTML tags the user entered; on the other hand, we cannot let XSS slip in, or even let stray tags break the entire page layout.

You may ask: why don't the developers of TinyMCE, an advanced WYSIWYG editor, take care of eliminating XSS from the HTML it generates? After all, if we disable HTML source editing in TinyMCE, the user will not be able to push XSS through it.
Unfortunately, we must remember that we cannot trust any data coming from the client. A disabled HTML mode can easily be re-enabled on the client side, since it is just JavaScript. And in any case, nothing stops an attacker from POSTing data directly: the server cannot tell whether the data in the request was prepared by TinyMCE or by something else.
Therefore, although the TinyMCE developers did some work to prevent attacks, complete protection from XSS is not the job of an editor running on the client side.

It would seem that PHP has tools for XSS prevention, such as strip_tags() and htmlentities(). However, they do not help with user-supplied text that has to keep its HTML markup.
Suppose we want to support even the simplest tag, <b></b>, and process the user's text user_msg this way:

$user_msg_safe = strip_tags($user_msg, '<b>');

Nothing bad can happen, right? Wrong. There is a great discussion of this on Stack Overflow. The following can occur:

<b style="width: expression(alert(document.location));"> XSS </b>

Yes, in modern browsers this attack does nothing: CSS expression() was only ever supported by old versions of Internet Explorer. But this is not a reason to relax. And this is just a <b> tag! We would like to support the usual ones – <b>, <i>, <s>, <pre>, <img>, <a>!
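
To make the strip_tags() problem concrete, here is a small sketch. The payload uses an inline event handler rather than expression(); that variation is my own illustration and is not taken from the discussion mentioned above. The PHP manual notes that strip_tags() does not touch attributes on the tags you allow.

```php
<?php
// Illustrative payload (an assumption, not from the original post).
$user_msg = '<b onmouseover="alert(1)">XSS</b><script>alert(2)</script>';

// Allow only <b>; every other tag is stripped.
$user_msg_safe = strip_tags($user_msg, '<b>');

// The <script> tags are removed, but the onmouseover attribute
// on the allowed <b> tag is left exactly as the user wrote it.
echo $user_msg_safe, PHP_EOL;
```

So the allow-list decides which tags survive, not which attributes, and that is where the trouble starts.
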
Take <img>, for example: with its “src” attribute, simply terrible things can happen:

<img src="javascript:alert('XSS');" alt="" />

And even replacing or cutting the substring “javascript:” out of user_msg does not help, because

<IMG SRC="jav&#x0A;ascript:alert('XSS');">

has the same bad effect.
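
A minimal sketch of why such a substring filter falls short. The naive_filter() helper is hypothetical, and the third payload (a nested "javascript:") is an extra illustration of mine, not one of the examples from the post.

```php
<?php
// Hypothetical naive filter: delete the literal substring "javascript:"
// wherever it appears, ignoring case.
function naive_filter(string $html): string
{
    return str_ireplace('javascript:', '', $html);
}

$plain   = '<img src="javascript:alert(\'XSS\');">';
$encoded = '<IMG SRC="jav&#x0A;ascript:alert(\'XSS\');">';
$nested  = '<img src="javajavascript:script:alert(\'XSS\');">';

// The plain payload loses its scheme.
echo naive_filter($plain), PHP_EOL;

// The entity-encoded payload never contains the literal substring,
// so the filter returns it unchanged.
echo naive_filter($encoded), PHP_EOL;

// str_ireplace() makes a single pass, so removing the inner match
// glues "java" and "script:" back together.
echo naive_filter($nested), PHP_EOL;
```

Any blacklist of this kind has to anticipate every encoding the browser is willing to decode, which is a losing game.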

There’s a tonne of these examples. Yes, most of these holes are closed in modern browsers, but their sheer diversity shows that processing user-supplied text with HTML markup is not a trivial task.

What to do?
One option is to use another markup language, such as BBCode, Markdown or wiki markup. With them it is much easier to produce safe HTML. However, they are still vulnerable to XSS.
The second option is to keep HTML markup but pass user_msg through a so-called “HTML sanitizer”. In particular, many folks on Stack Overflow recommend HTML Purifier.
It seems that the first pilot version of Answers will not support a markup language, but we will return to this question very soon, because we understand how important this feature is.
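
As an illustration of why an alternative markup is not automatically safe: a link tag such as [url=...] still carries a user-controlled URL, so its scheme has to be checked. The sketch below is a hypothetical mini renderer that supports only [b] and [url=...]. It is not a complete BBCode implementation and not code from Answers.

```php
<?php
// Hypothetical mini BBCode renderer (an assumption for illustration).
function render_bbcode(string $text): string
{
    // Escape everything first so raw HTML in the input is inert.
    $html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // [b]...[/b] becomes <b>...</b>; no attributes are ever emitted.
    $html = preg_replace('@\[b\](.*?)\[/b\]@s', '<b>$1</b>', $html);

    // [url=TARGET]label[/url]: accept only http:// and https:// targets.
    return preg_replace_callback(
        '@\[url=([^\]\s]+)\](.*?)\[/url\]@s',
        function (array $m): string {
            if (!preg_match('@^https?://@i', $m[1])) {
                return $m[2]; // unknown scheme: keep the label, drop the link
            }
            return '<a href="' . $m[1] . '">' . $m[2] . '</a>';
        },
        $html
    );
}

echo render_bbcode('[b]bold[/b] [url=http://example.com/]ok[/url]'), PHP_EOL;
echo render_bbcode('[url=javascript:alert(1)]click[/url]'), PHP_EOL;
```

Without the scheme check, the second call would produce a working javascript: link even though no HTML was typed by the user.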


- vulnerable tags and examples of XSS attacks
- an interesting academic look at XSS (in Russian); the end of the article suggests HTTP headers that help to avoid XSS
- related questions (first, second, third) on StackOverflow
- HTML sanitizer HTML Purifier
- HTML sanitizer PHP Input Filter
- more on sanitizers on StackOverflow
- XSS Cheat Sheet (used as a reference inside this post)
- vulnerable tags and examples of XSS attacks (used as a reference inside this post)


P.S. While I was preparing this post, which contains lots of HTML tags, for publication in WordPress, I wore myself out trying to make it display correctly. We should probably think seriously about using an alternative markup designed specifically for text entered by the users themselves.


One comment

  1. Angel says:

    I think that’s what Markdown was designed for, to allow users to still add styling to their text but fully prevent any XSS input by only allowing full text.