Wednesday, March 7, 2012

different results using ISO-8859-1 vs utf-8 with sp_xml_preparedDocument

We have a stored procedure which is passed an xml string (text datatype) and that string is passed on to sp_xml_prepared_Document (btw its a Sql 2000 server). We've had no problems until yesterday when a really long xml string (>18,000) was passed in and all of sudden the proc stopped working with an error message of:

"Msg 6603, Level 16, State 1, Procedure sp_xml_preparedocument, Line 9

XML parsing error: An invalid character was found in text content."

So after doing a lot of googling i found a post stating the following:

"I found that I have to place an xml declaration of

<?xml version="1.0" encoding="ISO-8859-1"?>in my XML string so the sp_xml_prepared_Document stored procedure will treatthe data as UTF-8 and not the Database's code page."

The encoding value I had always used was utf-8. Sure enough when I changed the encoding value to ISO-8859-1, the proc worked as expected. So I would like a better explanation as to what is going on here.

Why do shorter strings work fine when the encoding value is "utf-8" and longer ones only work when ISO-8859-1 is used?

Should I always use ISO-8859-1 as my encoding value?

Thanks for your help!

Thanks to Michael Rys and his blog for helping explain this. The article below was lifted from his site and gives a good explanation of whats happening. I think we will try changing the datatype to NText and see if that alleviates any future headaches.

Recently, I received several customer reports, that sp_xml_preparedocument started rasing the following errors after upgrading their XML-based application from SQL Server 2000 SP3a (or earlier) to either SP4 or SQL Server 2005:

Msg 6602, Level 16, State 2, Procedure sp_xml_preparedocument, Line 1
The error description is `An invalid character was found in text content.`.
Msg 6607, Level 16, State 3, Procedure sp_xml_removedocument, Line 1
sp_xml_removedocument: The value supplied for parameter number 1 is invalid.

This error is now raised because of a stricter error-discovery during parsing. When we moved from MSXML 2.6 to MSXMLSQL for SQL Server 2000 SP4 and SQL Server 2005, we fixed a couple of bugs in the parser that could lead to data corruption (invalid data being parsed). As a consequence, you are now receiving this error code instead of having invalid characters accepted.

The consequence is that one has to be more explicit with setting the encoding. This will also help mitigate against involuntary data corruption (see below for an example). There are two ways to fix an application (besides moving to an XML datatype and the nodes() method):

1. Make sure that the XML document when passed in a TEXT or (VAR)CHAR argument is compatible with the default code page of the database (or the string type) by either

1. setting the encoding property in the XML declaration (e.g., when the code page is ISO-Latin1), or

2. by making sure that the code page of the database (or string type) can preserve all UTF-8 code points that will ever be passed in through the string.

2. Change the type of the argument to NTEXT (or N(VAR)CHAR) and pass in the XML in UTF-16 encoding (an XML declaration is optional).

Here are the technical details:

sp_xml_preparedocument takes either a single-byte character string (TEXT, (VAR)CHAR) or a two-byte character string (NTEXT, N(VAR)CHAR). In the first case, the string is associated with a code page (normally the databases default code page). In the second case the string is assumed to be either UCS-2 or UTF-16 encoded. Sp_xml_preparedocument (unlike the newer XML datatype parser) will only pick up the strings code page in the second case to detect the encoding of the XML document, but not in the first case. Instead it will look at the string and follow the XML 1.0 spec detection rules. That means that for a single-byte character string unless there is an XML declaration saying otherwise, the data will be parsed as UTF-8. This works fine as long as your instance documents happen to only use characters that share the same code points on the UTF-8 and the database code page (e.g., the ASCII range in ISO-Latin1 and UTF-8), but it will lead to problems, if you use characters that are mapped to different code pages.

For example, the character (Unicode U+00AE, represented as 0xAE00 in SQL Server) will be represented in an ISO-LATIN1 code page (SQL_Latin1_General_CP1_CS_AS) as 0xAE. However if you do not specify an encoding on the XML document that contains the character and pass it as a single-byte character string, the XML parser will interpret the code point 0xAE not as the character but as an invalid starting character of a multi-byte UTF-8 encoding: UTF-8 characters bit sequences start either with 0 (1-byte encoded characters), 110 (two byte encoded characters), 1110 (three byte encoded characters), or 11110 (4-byte encoded surrogate pairs), while 0xAE is 10101110.

In MSXML 2.6 (and thus SQL Server 2000 SP3 and earlier), the XML parser would preserve such invalid characters and thus corrupt the data, while in MSXML 3.0/MSXMLSQL (and thus SQL Server 2000 SP4 and SQL Server 2005), the parser will reject them.

Note that you will still have to watch out for data corruption if the ISO-Latin characters make up a valid UTF-8 encoded character (solutions are as outlined above). For example,

declare @.h int
exec sp_xml_preparedocument @.h output, ``
select * from openxml(@.h, `/root`) with (attr nvarchar(200) `@.attr`)
exec sp_xml_removedocument @.h

will parse in SQL Server 2005 but return the character ? (U+05D1) since the two input characters (0xD791 in an ISO-Latin1 encoding) form the bit stream: 11010111 10010001 which is translated into 00000101 11010001 according to the UTF-8 encoding rules which is a valid UTF-8 2-byte encoded character representing U+05D1.

I hope you agree, that detecting such issues is better, even if we break "backwards-bug-compatibility".

No comments:

Post a Comment