Why is this HTML5 document invalid?

I’m confused by an error message I get when I try to validate a simple HTML document that has no meta encoding declaration, like this:

<!DOCTYPE html>
<html>
<head>
<title>Test</title>
</head>
<body>Test</body>
</html>

The W3C validator (http://validator.w3.org) reluctantly accepts the document as valid, with just a few warnings, when it is pasted into the direct input form. When the document is uploaded or loaded by URI, however, validation fails with this error message:

The character encoding was not declared. Proceeding using windows-1252.

There are two things I don’t understand about this error:

  • Why is a missing character encoding considered an error, when fallback rules exist?
  • Why is the validator assuming windows-1252 instead of UTF-8, like any browser would?

Can someone explain these two points please? I’m pretty new to this stuff, so please bear with me.

4 Answers

Well, it depends on which input mode you are using.

  • If you are using the File Upload option, it depends on which encoding the HTML file was saved with.
  • If you are using the Direct Input option, it depends on the browser.

If you don’t want the validator to guess, and you want it to use UTF-8, add the following line

<meta charset="UTF-8">

inside the head element.

It is the “Direct Input” mode of the validator that defaults to UTF-8. User-agents (browsers) will default to other encodings based on a number of things:

From Wikipedia:

If a user agent reads a document with no character encoding information, it can fall back to using some other information. For example, it can rely on the user’s settings, either browser-wide or specific for a given document, or it can pick a default encoding based on the user’s language. For Western European languages, it is typical and fairly safe to assume Windows-1252, which is similar to ISO-8859-1 but has printable characters in place of some control codes.
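To make the difference between those two encodings concrete, here is a minimal Python sketch (not part of the quoted text; the sample bytes and the `code_points` helper are my own choices for illustration). It decodes the same bytes both ways, and then shows what happens when UTF-8 bytes are read under a windows-1252 guess:

```python
def code_points(text):
    """List the Unicode code points of a string in U+XXXX form."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Bytes in the 0x80-0x9F range are where the two encodings disagree.
raw = bytes([0x80, 0x93, 0x94])

as_cp1252 = raw.decode("windows-1252")  # euro sign and curly quotes
as_latin1 = raw.decode("iso-8859-1")    # C1 control characters

print(code_points(as_cp1252))  # ['U+20AC', 'U+201C', 'U+201D']
print(code_points(as_latin1))  # ['U+0080', 'U+0093', 'U+0094']

# A wrong guess garbles text: UTF-8 bytes read as windows-1252.
utf8_bytes = "café".encode("utf-8")
misread = utf8_bytes.decode("windows-1252")
print(code_points(misread))  # the two-byte "é" turns into two characters
```

The last case is the usual symptom of an undeclared encoding: the page is saved as UTF-8, the reader falls back to windows-1252, and every non-ASCII character turns into two or three wrong ones.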

W3C validator said:

The validator checked your document with an experimental feature: HTML5 Conformance Checker. This feature has been made available for your convenience, but be aware that it may be unreliable, or not perfectly up to date with the latest development of some cutting-edge technologies.

So take some results with a pinch of salt.

Also, there is no useful ‘fallback’; the validator just needs to pick something so it can try to validate for you. The W3C can’t decide which encoding you want or need to use. You must declare it yourself, based on the characters you need to serve on your web pages, and then ask the validator to check your document against that.

What editor/WYSIWYG are you using to make web pages? Can we have the URL you are trying to validate?

When you use Validate by URI, the server is supposed to announce the character encoding in HTTP headers, more exactly in the charset parameter of the Content-Type header value. In this case, this apparently does not happen. You can check the situation e.g. using Rex Swain’s HTTP Viewer.
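If you would rather check this from a script, here is a small sketch of how the charset parameter can be pulled out of a Content-Type value using Python’s standard library. The function name `charset_from_content_type` is just something I made up for the example, and the header values are samples, not what your server actually sends:

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

For a live URL, the headers object returned by `urllib.request.urlopen(url)` offers the same `get_content_charset()` method, so a result of None there would match the situation described above.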

According to clause 4.2.5.5 Specifying the document’s character encoding in HTML5 CR, “If an HTML document does not start with a BOM, and its encoding is not explicitly given by Content-Type metadata, and the document is not an iframe srcdoc document, then the character encoding used must be an ASCII-compatible character encoding, and the encoding must be specified using a meta element with a charset attribute or a meta element with an http-equiv attribute in the Encoding declaration state.” This is a bit complicated, but the bottom line is: there are several ways to declare the encoding, but if none of them is used, the document is non-conforming.

Why the specification requires this is somewhat speculative, but the general idea is that such rules promote reliability and robustness. When the rule is not obeyed, different browsers may use different defaults or guesswork.

The validator assumes windows-1252 because that’s what the HTML5 rules lead to. The processing rules are in 8.2.2.1 Determining the character encoding. They are fairly complicated, but they largely reflect what modern browsers do (and aim to make that behavior a standard). The rules there are meant to deal with non-conforming documents, too, but this does not make those documents conforming; error processing rules are not really “fallbacks” and should not be relied on, especially since old browsers do not always play by the rules.
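As a rough illustration only, the overall shape of those rules can be sketched in Python as below. This is a heavy simplification and rests on several assumptions of mine: the names `sniff_encoding`, `META_CHARSET`, `BOMS` and `FALLBACK` are invented, the regular expression is a crude stand-in for the spec’s meta prescan, the BOM is checked before the HTTP header (the relative order of those two has differed between spec revisions), and the default is hard-coded to windows-1252 instead of depending on locale.

```python
import codecs
import re
from email.message import Message

FALLBACK = "windows-1252"  # the default the validator reported in the question

# Crude stand-in for the spec's meta prescan; not the real algorithm.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE
)

BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]

def sniff_encoding(raw, content_type=None):
    """Return (encoding, source) for raw document bytes."""
    # 1. A byte order mark at the very start of the document.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, "bom"
    # 2. The charset parameter of the HTTP Content-Type header, if any.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset, "http"
    # 3. A meta declaration within the first 1024 bytes.
    match = META_CHARSET.search(raw[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    # 4. Nothing declared: the document is non-conforming, and a default is used.
    return FALLBACK, "default"

doc = b"<!DOCTYPE html>\n<html>\n<head>\n<title>Test</title>\n</head>\n<body>Test</body>\n</html>\n"
print(sniff_encoding(doc))                              # ('windows-1252', 'default')
print(sniff_encoding(doc, "text/html; charset=utf-8"))  # ('utf-8', 'http')
```

Run against the document from the question, the sketch lands on the default branch, which corresponds to the “not declared, proceeding using windows-1252” message. Adding the meta element, a BOM, or a charset in the HTTP header moves it to one of the earlier branches.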

The error rules get somewhat loose when it comes to a situation where everything else fails and an “implementation-defined or user-specified default character encoding” is to be used. There are just “suggestions” on what browsers might do (again, reflecting what modern browsers generally do), and this may involve using the “user’s locale”, an obscure concept. The validator uses windows-1252 then, perhaps because that’s the default for English and the validator “speaks” English, or maybe just because it’s the guess that is expected to be correct more often than any other single alternative.
