Convert to UTF-8
blackcatnc
YaBB Newcomer
Convert to UTF-8
Sep 13th, 2010 at 11:08pm
I have had a YaBB forum for several years in ISO-8859-1 and am looking to move to UTF-8.

Surprisingly, it seems information on UTF-8 with YaBB is a bit sparse. From reading the forums I've managed to figure out:

1. YaBB seems to support UTF-8.
2. UTF-8 is enabled by editing $yycharset in Admin.lng and Main.lng.

That seems to work. However, what do I do about the database message history in ISO-8859-1 format? Can I just download and batch convert all files to UTF-8 and reupload?

How about members and board files?
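
A batch conversion of the kind asked about here could be sketched roughly as follows. This is only a sketch under assumptions: `convert_file` and `convert_tree` are made-up helper names, the `*.txt` pattern is a guess at which files hold text, and it assumes every file really is ISO-8859-1 and has not been converted already. It would be safest to run something like this on a backup copy with the forum offline.

```python
from pathlib import Path


def convert_file(path: Path, src: str = "iso-8859-1", dst: str = "utf-8") -> None:
    """Re-encode one text file in place (hypothetical helper, not part of YaBB)."""
    raw = path.read_bytes()
    text = raw.decode(src)              # ISO-8859-1 maps every byte, so this never fails
    path.write_bytes(text.encode(dst))  # bytes in, bytes out: line endings are untouched


def convert_tree(root: Path, pattern: str = "*.txt") -> int:
    """Convert every matching file under root; returns how many were rewritten."""
    count = 0
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            convert_file(path)
            count += 1
    return count
```

Running `convert_tree` twice on the same directory would double-encode the data, so the sketch is meant for a single pass over untouched files.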
  
Captain John
Ex Member


Re: Convert to UTF-8
Reply #1 - Sep 14th, 2010 at 2:57am
No ... UTF-8 should be able to handle the ISO encoding without problems, but acceptance of UTF-8 characters was only recently enabled in the newest version: in display names and usernames, and I believe in passwords now.
  
blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #2 - Sep 14th, 2010 at 12:32pm
No, it can't. Look at this example.

Say you have the word "pokémon" in your database. It's ISO-8859-1.

You cannot switch to UTF-8 and have that display correctly. Go ahead and try it in your browser. This forum is currently serving ISO-8859-1.

A conversion needs to be done with the existing data to convert it to UTF-8. Then you can use UTF-8 from that point forward. That's typically how I've seen it work with every other database and charset issue I've seen.

You have text data stored in one character set; you can't switch character sets without converting the data. In fact, it technically corrupts your database if you do, because you then start inputting UTF-8 where you had ISO-8859-1. Now you've got two different character encodings in your database, and it's probably impossible to fix after that.
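
The mixed-encoding problem described above can be illustrated with a small Python sketch. The variable names are made up, and the "file" is just two posts joined by a newline, not YaBB's real message format.

```python
old_post = "pokémon".encode("iso-8859-1")   # stored before the switch: b"pok\xe9mon"
new_post = "pokémon".encode("utf-8")        # stored after the switch: b"pok\xc3\xa9mon"
mixed = old_post + b"\n" + new_post         # one file, two encodings

# Read everything as UTF-8: the old post's lone 0xE9 byte is invalid
# and is shown as the replacement character U+FFFD.
as_utf8 = mixed.decode("utf-8", errors="replace")

# Read everything as ISO-8859-1: the new post's two bytes become two characters.
as_latin1 = mixed.decode("iso-8859-1")

print(as_utf8.split("\n"))
print(as_latin1.split("\n"))
```

Neither reading gets both posts right, which is why a whole-file conversion is no longer safe once the two encodings are mixed.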

From what you've told me, it seems like YaBB isn't ready for UTF-8 yet. In this day and age with a global community, UTF-8 should probably be the standard.
  
JonB
YaBB Administrator
YaBB Next Team
Operations Team
Beta Testers
Support Team
Re: Convert to UTF-8
Reply #3 - Sep 14th, 2010 at 4:17pm
Here's a thought - and a question

You say you have had a YaBB forum for some years, and yet you say 'database'. YaBB is a text-based, flat-file system. This has a fair number of implications, particularly since everything is a string, full stop.

Quote:
Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).


Read this:
http://www.joelonsoftware.com/articles/Unicode.html

To be sure I'm 'righto', and not 'wrongo' (and thus also the good Captain) I'm going to consult a guru -

Good Luck with your forum.

« Last Edit: Sep 14th, 2010 at 4:29pm by JonB »  

I find your lack of faith disturbing.
blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #4 - Sep 14th, 2010 at 5:16pm
UTF-8 is only backwards compatible with ASCII. The example word I just gave you in my post above uses a character value above 127. It's a false assumption to think all your ISO-8859-1 data is ASCII. That's not true if you use any special characters or mix in words from a language other than English for any reason. Half the usable characters will NOT be compatible:

http://en.wikipedia.org/wiki/ISO-8859-1#ISO-8859-1
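
The "half" figure can be checked with a short Python sketch that walks all 256 byte values and compares each ISO-8859-1 byte with the UTF-8 encoding of the same character. The variable names are only for illustration, and the upper half also includes the rarely used control codes 0x80 to 0x9F.

```python
same, different = [], []
for value in range(256):
    byte = bytes([value])
    char = byte.decode("iso-8859-1")   # every byte is a valid ISO-8859-1 character
    if char.encode("utf-8") == byte:
        same.append(value)             # identical bytes in both encodings
    else:
        different.append(value)        # UTF-8 needs a different (two-byte) sequence

print(len(same), len(different))               # 128 128
print(hex(different[0]), hex(different[-1]))   # 0x80 0xff
```

Only the ASCII range 0x00 to 0x7F is byte-for-byte identical; everything from 0x80 up has to be re-encoded.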

Terminology aside (I consider it a flat file database), that's why I was asking about the rest of the data. The problem here is the text data strings encoded in ISO-8859-1. For them to work properly in UTF-8, those strings must be converted. It looks like the flat file format delimiting characters are not using any non-ASCII characters, so I suspected the whole thing might work if you converted all files to UTF-8. However, I only know about the few files I looked at and do not have overall YaBB development experience.

Your data will show up wrong if you just start serving UTF-8 and don't convert the text data to match. As I said, the same behavior can be duplicated by forcing your browser to UTF-8 encoding right now. You'll see the word will not show up correctly. Then new data will be input as UTF-8 and you will then never be able to correct it as you have multiple character encodings for your strings.
« Last Edit: Sep 14th, 2010 at 5:18pm by blackcatnc »  
JonB
YaBB Administrator
YaBB Next Team
Operations Team
Beta Testers
Support Team
Re: Convert to UTF-8
Reply #5 - Sep 14th, 2010 at 6:19pm
Umm - not quite correct -

That's not true if you use any special characters or mix in words from a language other than English for any reason.

should be "mix in words from a NON-LATINATE character set". ISO-8859-1 is NOT English - it's Latin-1, which supports most Western European languages.  (I guess you didn't read that Wikipedia page you quoted me very well.)

and basically - "whatever"

I said I would ask the core person for this question, and I did.

Of course, you could always just set up another YaBB board, move your stuff in, and see what happens.

I know you want to argue, I don't.

See ya


 
  

blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #6 - Sep 14th, 2010 at 6:50pm
JonB wrote on Sep 14th, 2010 at 6:19pm:
Umm - not quite correct -

That's not true if you use any special characters or mix in words from a language other than English for any reason.

should be "mix in words from a NON-LATINATE character set". ISO-8859-1 is NOT English - it's Latin-1, which supports most Western European languages.  (I guess you didn't read that Wikipedia page you quoted me very well.)


You're just not understanding. Here are the facts:

é = byte 0xE9 in ISO-8859-1
é = bytes 0xC3 0xA9 in UTF-8

You need to convert or it won't display correctly.
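
The byte values above can be reproduced in Python. This is a minimal sketch with illustrative variable names; it shows the encodings only and says nothing about YaBB's own code.

```python
char = "\u00e9"                       # é, code point U+00E9

latin1_bytes = char.encode("iso-8859-1")
utf8_bytes = char.encode("utf-8")
print(latin1_bytes.hex(), utf8_bytes.hex())   # e9 c3a9

# The stored ISO-8859-1 byte is not valid UTF-8 on its own.
try:
    latin1_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False
print(valid_as_utf8)                  # False

# Converting means decoding with the old charset, then encoding with the new one.
converted = latin1_bytes.decode("iso-8859-1").encode("utf-8")
print(converted == utf8_bytes)        # True
```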

It's not an argument. It's a simple fact.

I've already tested it. YaBB fails exactly in the manner expected.
« Last Edit: Sep 14th, 2010 at 7:03pm by blackcatnc »  
Captain John
Ex Member


Re: Convert to UTF-8
Reply #7 - Sep 14th, 2010 at 8:24pm
Quote:
Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with older character sets. They are known as precomposed characters. Precomposed characters are available in UCS for backwards compatibility with older encodings that have no combining characters, such as ISO 8859.


Quote:
Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings with these encodings can contain as parts of many wide characters bytes like “\0” or “/” which have a special meaning in filenames and other C library function parameters. In addition, the majority of UNIX tools expects ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, etc.

The UTF-8 encoding defined in ISO 10646-1:2000 Annex D and also described in RFC 3629 as well as section 3.9 of the Unicode 4.0 standard does not have these problems. It is clearly the way to go for using Unicode under Unix-style operating systems.


Quote:
Server setup

How to make the server send out appropriate 'charset' information depends on the server. You will need the appropriate administrative rights to be able to change server settings.

Apache. This can be done via the AddCharset (Apache 1.3.10 and later) or AddType directives, for directories or individual resources (files). With AddDefaultCharset (Apache 1.3.12 and later), it is possible to set the default 'charset' for a whole server. For more information, see the article on Setting 'charset' information in .htaccess.

Jigsaw. Use an indexer in JigAdmin to associate extensions with charsets, or set the charset directly on a resource .

IIS 5 and 6. In Internet Services Manager, right-click "Default Web Site" (or the site you want to configure) and go to "Properties" => "HTTP Headers" => "File Types..." => "New Type...". Put in the extension you want to map, separately for each extension; IIS users will probably want to map .htm, .html,... Then, for Content type, add "text/html;charset=iso-8859-1" (without the quotes; substitute your desired charset for iso-8859-1; do not leave any spaces anywhere because IIS ignores all text after spaces). For IIS 4, you may have to use "HTTP Headers" => "Creating a Custom HTTP Header" if the above does not work.
Scripting the header

The appropriate header can also be set in server side scripting languages. For example:

Perl. Output the correct header before any part of the actual page. After the last header, use a double linebreak, e.g.:
print "Content-Type: text/html; charset=utf-8\n\n";

Python. Use the same solution as for Perl (except that you don't need a semicolon at the end).

PHP. Use the header() function before generating any content, e.g.:
header('Content-type: text/html; charset=utf-8');

Java Servlets. Use the setContentType method on the ServletResponse before obtaining any object (Stream or Writer) used for output, e.g.:
resource.setContentType ("text/html;charset=utf-8");
If you use a Writer, the Servlet automatically takes care of the conversion from Java Strings to the encoding selected.

JSP. Use the page directive e.g.:
<%@ page contentType="text/html; charset=UTF-8" %>
Output from out.println() or the expression elements (<%= object%>) is automatically converted to the encoding selected. Also, the page itself is interpreted as being in this encoding.

ASP and ASP.Net. content type and charset are set independently, and are methods on the response object. To set the charset, use e.g.:
<%Response.charset="utf-8"%>
In ASP.Net, setting Response.ContentEncoding will take care both of the charset parameter in the HTTP Content-Type as well as of the actual encoding of the document sent out (which of course have to be the same). The default can be set in the globalization element in Web.config (or Machine.config, which is originally set to UTF-8).


blackcatnc wrote on Sep 14th, 2010 at 6:50pm:
You need to convert or it won't display correctly.

  Incorrect ... output is the only thing that "needs" to be converted (translated)
     ISO8859-1 is a smaller (restricted) character set compared to UTF-8 (which includes ALL characters available in ISO8859-1), in other words ... UTF-8 is "backwards" compatible with ISO8859-1
« Last Edit: Sep 14th, 2010 at 8:30pm by »  
Carsten
Ex Member


Re: Convert to UTF-8
Reply #8 - Sep 14th, 2010 at 9:31pm
I have to go with blackcatnc.

For languages other than English, a lot of ISO-8859-1 characters take 16 bits (two bytes) in UTF-8 and will need to be converted to display correctly.

I am now working on a 'converting on the fly' mod...
« Last Edit: Sep 14th, 2010 at 9:35pm by »  
JonB
YaBB Administrator
YaBB Next Team
Operations Team
Beta Testers
Support Team
Re: Convert to UTF-8
Reply #9 - Sep 14th, 2010 at 11:05pm
Thanks very much Carsten!

blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #10 - Sep 15th, 2010 at 12:05am
Quote:
Quote:
Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with older character sets. They are known as precomposed characters. Precomposed characters are available in UCS for backwards compatibility with older encodings that have no combining characters, such as ISO 8859.


Yes, and the data YaBB has stored is in neither combining nor precomposed code points. Code point U+00E9 still needs two bytes in UTF-8 to be represented. Those are different bytes from the ones UTF-16 or another Unicode encoding uses, which can simply store 0x00 0xE9. Neither YaBB nor the server will convert the stored byte into either form.
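
For reference, the different representations of é mentioned here can be printed with Python's standard `unicodedata` module. This is only an illustration of the encodings, with made-up variable names.

```python
import unicodedata

precomposed = "\u00e9"                                   # é as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)   # "e" + U+0301 combining acute

print([hex(ord(c)) for c in decomposed])        # ['0x65', '0x301']
print(precomposed.encode("utf-16-be").hex())    # 00e9
print(precomposed.encode("utf-8").hex())        # c3a9
print(decomposed.encode("utf-8").hex())         # 65cc81
print(precomposed.encode("iso-8859-1").hex())   # e9
```

None of the Unicode byte sequences is the single byte 0xE9 that an ISO-8859-1 file contains, whichever form of the character is used.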

Just try it. YaBB fails to display for the reasons mentioned.

Quote:
Server setup


Yes, I know how to and have the server outputting Content-Type: text/html; charset=utf-8. It still fails.

Quote:
  Incorrect ... output is the only thing that "needs" converted (translated)
     ISO8859-1 is a smaller (restricted character set) compared to UTF-8 (which includes ALL characters available in ISO8859-1), in other words ... UTF-8 is "backwards" compatible with ISO8859-1


Mmm hmm.. Now explain how a character that is 0xE9 in ISO8859-1 is going to get translated into its proper UTF-8 byte sequence with YaBB. Enlighten me.

Changing your server character set output doesn't do it. Changing YaBB to UTF-8 doesn't do it. The fact of the matter is your magic backwards compatibility isn't there. There's a reason every other piece of software on earth that used an older encoding and moved to UTF-8 needs to convert or translate the old data. I've done it before. I've coded UTF-8 converting utilities. I've run a LAMP server. Please explain how the translation process will occur and why it doesn't currently.


Carsten:

The problem with on-the-fly conversion will be new data that's stored in UTF-8 format. You'll end up double converting. Unless you're going to continue to store in ISO8859-1 and just output in UTF-8. But why would you want to do that?
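
The double-conversion concern can be sketched in Python, along with one possible guard. `to_utf8_once` is a hypothetical helper, not anything from YaBB or from the mod being discussed; it assumes that bytes which fail to decode as UTF-8 are ISO-8859-1.

```python
def to_utf8_once(raw: bytes) -> bytes:
    """Return raw as UTF-8, converting from ISO-8859-1 only when needed.

    Hypothetical helper: bytes that already decode as UTF-8 are left alone,
    everything else is assumed to be ISO-8859-1.
    """
    try:
        raw.decode("utf-8")
        return raw                      # already valid UTF-8 (includes pure ASCII)
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1").encode("utf-8")


old = "pokémon".encode("iso-8859-1")    # stored before the switch
new = "pokémon".encode("utf-8")         # stored after the switch

# Blind conversion of already-converted data double-encodes it.
double = new.decode("iso-8859-1").encode("utf-8")
print(double.decode("utf-8"))           # pokÃ©mon

print(to_utf8_once(old) == new)         # True
print(to_utf8_once(new) == new)         # True
```

The guard is a heuristic: ISO-8859-1 text that happens to form a valid UTF-8 sequence (for example the literal characters "Ã©") would be left unconverted, so a one-time conversion of all old files is still the cleaner route.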
  
Carsten
Ex Member


Re: Convert to UTF-8
Reply #11 - Sep 15th, 2010 at 12:24am
Nah blackcatnc - looks like JonB was right - you are here to argue ;)

Please give me just a little credit - I wasn't born yesterday...
  
blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #12 - Sep 15th, 2010 at 12:49am
I'm here to discuss how YaBB supports UTF-8 and how the character translation will occur with a specific example. No one has done that yet despite claiming it can do it.

Is discussing anything specific in detail automatically arguing that doesn't deserve an answer? I'm also interested in the details behind what you're doing. You gave no details, so I made some assumptions, fishing for more.

I would like the same credit. The same attitude was given when I fixed the YaBB 1 to 2 member conversion code a few years ago. People here are quick to dismiss any problem, don't want to discuss details, and are quick to point fingers. That's probably not the best way to represent YaBB. I think I'm beginning to understand the basis for YaBB's current standing.

You just can't discuss anything around here. Instead, I get blanket statements of "you are here to argue" and support team members making unnecessary comments such as "I guess you didn't read that Wikipedia page you quoted me very well." This is why I don't contribute anymore.
  
Captain John
Ex Member


Re: Convert to UTF-8
Reply #13 - Sep 15th, 2010 at 2:54am
blackcatnc wrote on Sep 15th, 2010 at 12:49am:
how YaBB supports UTF-8 and how the character translation will occur with a specific example

Specific? Check out the International category. ISO8859-1 supports the EU languages, but look at the others (characters correct): Russian, Chinese (both Traditional and Simplified) AND Japanese ... a character set that requires not only a unique character but an extended code to display.

And YaBB does that! Very, very well!
  
blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #14 - Sep 15th, 2010 at 1:15pm
I meant the specific character example I presented with 'é' and the byte-code representation. Explain how old post data with that character gets displayed moving from ISO8859-1 to UTF-8 character set in YaBB. A byte change must take place. I don't think you can because it doesn't work and is not happening as you can see from the link below.

I don't know where the communication breakdown is here. Nobody wants to talk about the specific example I've presented and why it will not display properly in old posts in YaBB if you do nothing more than set YaBB $yycharset to UTF-8 and make your server serve UTF-8. That alone isn't enough and fails, yet people here claim it doesn't.

Proof since my word meant nothing:
http://transcorp.parodius.com/forum/YaBB.pl?num=1130404517/16#16 (I have since converted all files on this forum to UTF-8 correcting the issue per my explanations in previous posts.)

YaBB is set to UTF-8, server is sending UTF-8, old post data with character 'é' now FAILS.

Same post on forum with ISO8859-1.
http://transcorp.parodius.com/cgi-bin/yabb2/YaBB.pl?num=1130404517/16#16


What do you have to say about that? It clearly doesn't work as you say.
« Last Edit: Sep 16th, 2010 at 12:35am by blackcatnc »  