Update 11-Jul-2010: As a result of this ticket, jQuery contentType documentation is now explicit about UTF-8 being used to send data to the server in the Request header.
Dealing with character sets in browser-based applications can be a real pain. You have to ensure all levels of infrastructure are set to use the same character set. For browser-oriented applications that means the browser/HTML, the JavaScript code, and whatever server-side infrastructure you have (code and database, for example) all need to be configured correctly. If you are writing a new application, the choice is usually easy: use UTF-8. In fact this shouldn’t even be a choice for new apps; just use UTF-8. However, there are cases where UTF-8 is not an option. Perhaps you’re working on legacy code, or are writing a component that will be used by others, and so you need to accommodate a variety of character sets. The latter is the case with BlogIt. Recently I added the ability to perform administrative functions over Ajax, and so began a long journey of pain, and a single major lesson…
When performing an Ajax operation there are two things to consider: the data being sent from the browser to the server (the Request), and the data being sent from the server back to the browser (the Response). With character sets in mind, you need to ensure you’re configuring both sides appropriately. Good luck with that, because for the Request side it’s not possible.
Here’s an example setup using jQuery, with the Ajax setup snippet below. Notice how we try to force the use of ISO-8859-1:
$.ajax({
    contentType: "application/x-www-form-urlencoded; charset=ISO-8859-1",
    type: 'POST',
    dataType: 'json',
    url: $("#ajax_form1").attr('action'),
    data: $(this).serialize(),
    success: function(data) {
        $('#result1').html(data.out);
    }
});
And here’s the Ajax Request Headers pulled from Firebug in Firefox:
POST /jquery/ajax_charset_test.php HTTP/1.1
Host: solidgone.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 (.NET CLR 3.5.30729)
Accept: application/json, text/javascript, */*
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: http://solidgone.com/jquery/ajax_charset_test.html
Content-Length: 28
Pragma: no-cache
Cache-Control: no-cache
Notice the Content-Type charset is UTF-8, despite the jQuery contentType parameter being set to ISO-8859-1, and despite the HTML meta-tag:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
In principle you’ll either be using the XMLHttpRequest object directly with JavaScript, or indirectly through some library, like jQuery. Either way you’ll see options that appear to let you specify which character set should be used. What is not clear is that you cannot actually choose the character set for the Request portion of the transaction. Despite what the documentation and the configuration options might lead you to believe, the Request character set is fixed to UTF-8. Normal forms are sent using the encoding of the parent page, but Ajax-submitted forms will always be sent as UTF-8. This is a problem if you are not using UTF-8 throughout the rest of your application.
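To make the difference concrete, here is a minimal sketch comparing the bytes an Ajax submission produces with what an ISO-8859-1 page would send from a normal form. It assumes the Ajax data is escaped with the standard encodeURIComponent function (which, as far as I can tell, is what jQuery's serialize() relies on). The two function names are hypothetical, and encodeAsLatin1 is a simplified stand-in for the browser's own form encoding, not a faithful copy of it.

```javascript
// Percent-encode a value the way an Ajax form submission does.
// encodeURIComponent always emits the UTF-8 bytes of each character.
function encodeAsUtf8(value) {
  return encodeURIComponent(value);
}

// Hypothetical helper: roughly what a normal form submit from an
// ISO-8859-1 page sends, for characters in the range 0x00-0xFF.
function encodeAsLatin1(value) {
  return Array.from(value, function (ch) {
    var code = ch.codePointAt(0);
    if (code > 0xFF) {
      throw new RangeError("Not representable in ISO-8859-1: " + ch);
    }
    // Leave unreserved ASCII characters alone.
    if (/[A-Za-z0-9\-_.~]/.test(ch)) {
      return ch;
    }
    // Escape everything else as a single byte.
    return "%" + (code < 16 ? "0" : "") + code.toString(16).toUpperCase();
  }).join("");
}

console.log(encodeAsUtf8("caf\u00e9"));   // caf%C3%A9 (two bytes for the e-acute)
console.log(encodeAsLatin1("caf\u00e9")); // caf%E9 (one byte for the e-acute)
```

A server expecting ISO-8859-1 that receives the first form will read the two bytes C3 A9 as two separate characters, which is the classic "Ã©" garbage.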
When researching this I was able to find only a single authoritative source which describes the forced use of UTF-8 on the Request header, but this is itself not the final specification. Other sources that you might expect to mention this behavior had nothing to say on the matter, including the W3C XMLHttpRequest Working Draft, despite some anecdotal references. I did stumble across numerous ideas attempting to force an alternate charset to be used on an Ajax Request, none of which actually work, since they only influence how the Response is handled:
- overrideMimeType:
xhr.overrideMimeType("text/html; charset=ISO-8859-1");
- setRequestHeader:
xhr.setRequestHeader("Accept-Charset", "ISO-8859-1");
In practice this means that data sent from the browser to the server over Ajax will be UTF-8, and if your server-side application is not natively using UTF-8 you will need to transcode the incoming data from UTF-8 to the character set you are using on the server. Unfortunately some languages are more successful at doing this than others. PHP falls into the ‘others’ category, so good luck!
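The transcoding step itself looks something like the sketch below. This is only an illustration of the idea, written in JavaScript for Node.js to stay in the same language as the client code; it is not the PHP code a BlogIt install would need. The function name utf8FormValueToLatin1 is hypothetical, and the sketch assumes the target character set is ISO-8859-1 and that rejecting unrepresentable characters is acceptable.

```javascript
// Assumption: runs under Node.js, which provides the global Buffer class.
// Takes one UTF-8 percent-encoded form value as sent by an Ajax request
// and returns the equivalent ISO-8859-1 bytes.
function utf8FormValueToLatin1(encoded) {
  // Form encoding uses "+" for spaces; decodeURIComponent assumes UTF-8.
  var text = decodeURIComponent(encoded.replace(/\+/g, " "));

  // ISO-8859-1 only covers code points 0x00-0xFF, so fail loudly
  // rather than silently corrupting anything outside that range.
  for (var ch of text) {
    if (ch.codePointAt(0) > 0xFF) {
      throw new RangeError("Not representable in ISO-8859-1: " + ch);
    }
  }

  // Each remaining character maps to exactly one byte.
  return Buffer.from(text, "latin1");
}

var bytes = utf8FormValueToLatin1("caf%C3%A9+au+lait");
console.log(bytes.toString("hex")); // 636166e9206175206c616974
```

Whether to reject, replace, or escape characters that do not fit the server's character set (a euro sign, for instance) is a design decision; throwing is just the simplest choice for a sketch.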