
Performance Implications of "charset".

A while back, I wrote a post documenting the progressive response handling behavior of different browsers. I was specifically interested in how many bytes must be received or how much time must pass before a browser begins to parse content. My friend and colleague, Bryan McQuade (a co-author of Google PageSpeed), recently pointed out that character encoding detection is the source of buffering delay in response handling. Specifically, in the absence of a "charset" param in the Content-Type header, a browser may buffer a large quantity of content while looking for a <meta> tag declaring the content encoding.

I set up a new round of tests to identify the impact that content encoding declarations have on progressive response handling in different browsers. As in previous tests, the server returns several chunks, in 100ms intervals, each containing script that indicates how much of the response has been received.
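
For readers who want to reproduce this kind of test, here is a minimal sketch of such a server in Python. It is not the code used for the numbers below: the chunk count, the port, and the `window.bytesSeen` variable are assumptions made up for illustration. Each chunk carries a script that records the running total of body bytes, and the Content-Type header is the part you would vary between configurations.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

CHUNK_DELAY_S = 0.1  # 100ms between chunks, as described above
NUM_CHUNKS = 10      # assumption: the post does not say how many chunks were sent

# Fixed-width number so every snippet has the same length.
# "window.bytesSeen" is a hypothetical name, not taken from the original test.
SNIPPET = b"<script>window.bytesSeen = %6d;</script>\n"


def encode_chunk(payload: bytes) -> bytes:
    """Frame payload as one HTTP/1.1 chunk: hex length, CRLF, data, CRLF."""
    return b"%x\r\n%s\r\n" % (len(payload), payload)


def script_chunks(count: int):
    """Yield script snippets that each record the body bytes sent so far.

    The total includes the snippet itself but not headers or chunk framing.
    """
    size = len(SNIPPET % 0)
    sent = 0
    for _ in range(count):
        sent += size
        yield SNIPPET % sent


class ChunkedHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # chunked encoding requires HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        # Vary or omit this header to test the different configurations.
        self.send_header("Content-Type", "text/html")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        for body in script_chunks(NUM_CHUNKS):
            self.wfile.write(encode_chunk(body))
            self.wfile.flush()
            time.sleep(CHUNK_DELAY_S)
        self.wfile.write(b"0\r\n\r\n")  # zero-length chunk ends the response


def serve(port: int = 8000) -> None:
    """Run the test server until interrupted."""
    HTTPServer(("localhost", port), ChunkedHandler).serve_forever()
```

Calling `serve()` and loading http://localhost:8000/ in a browser lets you watch `window.bytesSeen` to see which script ran first, which approximates how much the browser buffered before it started parsing.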

                                             Bytes Buffered
Configuration                                Firefox 3.5   Chrome 3.0   IE 8

Transfer-Encoding: chunked                   1134          1056         300

Content-Type: text/html                      204           1056         341
Transfer-Encoding: chunked

Content-Type: text/html                      166           204          218
Transfer-Encoding: chunked
...
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Content-Type: text/html; charset=utf-8       204           280          300
Transfer-Encoding: chunked

Note that the test doesn't account for content in the document <head>, so the numbers for the <meta> configuration are artificially low by about 70 bytes.

It's clear that for Chrome and Firefox, indicating the charset has a measurable impact on performance. Now, is it more desirable to declare the charset in the Content-Type header or a <meta> tag? Darin Fisher suggests that placing it in the response header is more performant: The charset impacts how the response is parsed, so when the browser encounters the <meta> tag, it must reprocess everything it's handled so far. If you're forced to use the <meta>, place it as near as possible to the top of the document to reduce the amount of throwaway work done by the browser.
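
As an illustration of the header approach, here is a minimal sketch of a bare WSGI app in Python that declares the charset in the Content-Type response header. The page content and the `app` and `PAGE` names are placeholders; any server or framework has its own equivalent way of setting this header.

```python
# Placeholder page; the non-ASCII character shows why the encoding matters.
PAGE = (
    "<!DOCTYPE html>\n"
    "<html><head><title>Test</title></head>"
    "<body>caf\u00e9</body></html>\n"
)


def app(environ, start_response):
    """WSGI app that declares the charset in the response header."""
    body = PAGE.encode("utf-8")  # encode with the same charset we declare
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(body))),  # length in bytes, not characters
        ],
    )
    return [body]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("localhost", 8000, app).serve_forever()` will serve it. The one thing to get right is that the declared charset matches the encoding actually used for the body.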

Comments

Not many people are aware of the impact (lack of) charset has on their pages. Great job evangelizing and quantifying this performance improvement!

I had no idea that the browser waited until it encountered the meta tag (if a Content-Type header was absent) before it starts parsing! Very interesting!

Kyle - Thanks for running these tests. "Free" performance gains are always appreciated. (Moved my Content-Type meta tag based on the data provided).

As an aside, I've been using the iso-8859-1 charset since forever (2004). What are the ramifications of switching to UTF-8? (e.g., would it affect any/all entity values used on various pages?)

Kyle,

Great post. I've been doing some work on [META] tags and charsets as well.

http://zoompf.com/blog/2009/12/browser-performance-issues-with-charsets/

Great follow-up post. Thanks :)

stk: You would only want to consider changing your character encoding if you need to support a wider array of characters. I'm really not an expert in encodings.

Billy: Interesting observation in your post. I'm surprised to see the script evaluated twice.

Kyle,

I too was surprised that the script is executed twice. I believe this behavior is wrong. If a browser does not receive charset information in an HTTP header, I believe it should not attempt to render the document until it has first looked for a META tag with charset information. The problem is that browsers start rendering *while* simultaneously searching for the charset info, and then have to redo everything when they find it. At worst you are talking about a single pass through the character array of the HTML. To handle chunked encoding or other cases where the browser wants to render before receiving the whole HTML document, the browser could check only inside the HEAD.

In terms of switching to utf-8 from ISO-8859-1:

Kyle mentioned "You would only want to consider changing your character encoding if you need to support a wider array of characters."

This is not actually the case. Right now we have a huge mess with any sort of electronic documents because of lack of consistent character sets. The only way out of this mess is when we get all documents moved to unicode. (For instance, if you are in a word processing program using MS Windows (other platforms too?) and use "Smart quotes" (the quotes are curled rather than just straight) and you cut and paste that into an html document, the quotes disappear). The problem is conflicting character sets.

By default all modern browsers (since version 4, actually) implicitly understand utf-8 and in a way are kind of in a pseudo-utf-8 mode. For instance, if you have form elements, all data sent from them is in utf-8 unless you specifically set the enctype attribute on that individual form element. So yes, even if your page is set to ISO-8859-1, your forms are still submitting utf-8 unless you also set the enctype attribute of your form element to ISO-8859-1.

There's not enough room here to go into all the details - but there is a huge mess in transferring files across the world (and even just among "English-speaking" computers). We all need to move to Unicode to get out of this mess.

Please use utf8!!!

Interesting... as always.

After reading the info I am forced to write down and bookmark your URL, so that I can visit it again and again. And Steve is also present here. I know them.

But I could not understand the table data.

I was looking for the performance implications of UTF-8 versus iso-8859-1. WordPress has a default setting of utf-8, whereas most sites do not need it.
So perhaps you can write a post on this subject as well. And please inform me if possible. Thank you!
