UTF-8 support
Michael Prager
Boardmod Team
Development Team
*****
Offline



Posts: 976
Location: Germany
UTF-8 support
Jan 29th, 2011 at 5:06pm
I've familiarized myself a bit with the internals of Unicode and the UTF-8 encoding. Good references on the subject I've found:

http://www.unicode.org/versions/Unicode5.2.0/
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://perldoc.perl.org/perluniintro.html
http://perldoc.perl.org/utf8.html
http://perldoc.perl.org/perlunitut.html
http://perldoc.perl.org/perlunicode.html

UTF-8 looks like the best choice to me, as it is fully supported by modern browsers as well as by Perl (as of 5.8) and MySQL (5.0). So we can encode everything in it and no longer worry about encoding. Since it is backward compatible with ASCII, which covers nearly all English text, it won't have much impact on disk usage for English-based forums.
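
To make the disk-usage point concrete, here is a minimal sketch. It is written in Python purely for illustration (YaBB itself is Perl), and the sample strings and the helper name byte_sizes are made up for this example. It compares the byte count of the same text in a legacy 8-bit encoding and in UTF-8.

```python
# Compare how many bytes the same text needs in a legacy 8-bit
# encoding versus UTF-8. The sample strings are illustrative only.
samples = [
    ("Hello forum", "latin-1"),  # plain ASCII English text
    ("Grüße", "latin-1"),        # German with two non-ASCII letters
    ("Привет", "koi8_u"),        # Russian
]


def byte_sizes(text, legacy_encoding):
    """Return (legacy_size, utf8_size) in bytes for the given text."""
    return len(text.encode(legacy_encoding)), len(text.encode("utf-8"))


for text, enc in samples:
    legacy, utf8 = byte_sizes(text, enc)
    print(f"{text!r}: {enc}={legacy} bytes, utf-8={utf8} bytes")
```

ASCII-only text takes the same space either way. Each non-ASCII Latin or Cyrillic letter grows from one byte to two.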

I've played a little with the current code and was surprised how easy it was to get working results: replace $yycharset with "UTF-8" in Languages/English/Main.pl and create a database with utf8 as the default encoding. That's it.

But this also reveals a serious problem with the current multi-language support in YaBB:
the encoding of data written to and read from the database (whether MySQL or flatfile) always depends on the current user's selected language! So if I have two languages installed for my forum (for example English and Russian) and one user posts something in Russian, every other user with the English language setting will see only corrupted data! That's because the data was saved to the database in KOI8-U while it was loaded in Latin1. So any YaBB forum out there using multiple languages will probably have mixed or broken encodings.
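
The mismatch described above can be reproduced in a few lines. This is a sketch in Python for illustration only (not YaBB code); the sample word and variable names are made up.

```python
# A Russian post is written to storage as KOI8-U bytes ...
original = "Привет"
stored = original.encode("koi8_u")

# ... and later read back by a user whose language pack assumes Latin1.
# Latin1 maps every byte to some character, so this never fails;
# it silently produces the wrong text.
seen_by_english_user = stored.decode("latin-1")

# Reading with the same encoding that was used for writing restores the text.
seen_by_russian_user = stored.decode("koi8_u")

print(seen_by_english_user)
print(seen_by_russian_user)
```

The bytes on disk are identical in both cases. Only the reader's assumption about them differs, which is why a single storage encoding removes the problem.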
  

Nail here for a new monitor! --> [x]
 
Captain John
Ex Member


Re: UTF-8 support
Reply #1 - Jan 29th, 2011 at 5:13pm
Michael Prager wrote on Jan 29th, 2011 at 5:06pm:
That's because the data was saved to the database in KOI8-U while it was loaded in Latin1.

Isn't this the cure? Writing English to UTF: http://www.yabbforum.com/community/YaBB.pl?num=1284419291/15#15
  
 
Michael Prager
Re: UTF-8 support
Reply #2 - Jan 29th, 2011 at 5:36pm
Yes, we have to make sure data is saved only in one single format (UTF-8). I'm not sure if Carsten's code will solve the problem, as there is no way to tell in what encoding things were stored in the database before the patch.
« Last Edit: Jan 29th, 2011 at 5:37pm by Michael Prager »  

 
Corey Chapman
YaBB Administrator
*****
Offline



Posts: 10,024
Location: Rock Hill, South Carolina

Re: UTF-8 support
Reply #3 - Jan 29th, 2011 at 6:28pm
Perhaps the converter we'll need for moving the data into the database can solve this issue. However, I have experienced what Michael is talking about. All foreign languages except a couple look like gibberish symbols (not characters of that language) to me while I'm using the English language pack. If I change to that language's pack, I can see that language fine and the rest are gibberish. This is not new to YaBB; it has been an ongoing issue with all forums, which is why they have all debated encoding over the past couple of years.
« Last Edit: Jan 29th, 2011 at 6:29pm by Corey Chapman »  

 
Michael Prager
Re: UTF-8 support
Reply #4 - Jan 30th, 2011 at 8:03pm
I've completed my work on the UTF-8 implementation. Because I can't log in to SourceForge's SVN right now (their password reset function is broken for me), I have uploaded my changes to Git. As soon as I get access again, I'll commit them to /trunk/.

Download:
YaBB 3 SVN rev 323 with full UTF-8 support

Changelog

YaBB now works entirely with Unicode. That means it outputs HTML in UTF-8, reads user data in UTF-8, and stores its data in UTF-8 (flatfile and MySQL).

The following should be noted:
  • No other encoding is supported anymore, but there is no need for others anyway.
  • YaBB now requires the "Encode" module to be installed.
  • YaBB now requires Perl 5.8.1 or above, because Perl has fully supported UTF-8 only since that version.
  • All Language and Help files have to be UTF-8 encoded from now on.
  • This does not yet contain functionality to convert existing board data from other encodings to UTF-8. This is no problem for boards whose stored data is plain ASCII (e.g. boards that use English only), because ASCII is a subset of UTF-8. But you have to be careful with boards that have data stored in other encodings, including Latin1/ISO-8859-1 text with accented characters: those will break without proper conversion.
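
Whether a given piece of stored data survives the switch unchanged can be checked mechanically. The following is a rough sketch in Python for illustration (YaBB is Perl); the function name classify and its labels are hypothetical.

```python
def classify(raw: bytes) -> str:
    """Classify stored bytes by how they relate to UTF-8.

    "ascii"            -> identical in ASCII, Latin1 and UTF-8; nothing to do
    "utf-8"            -> already valid UTF-8 with non-ASCII characters
    "needs-conversion" -> not valid UTF-8; the source encoding must be known
    """
    if raw.isascii():
        return "ascii"
    try:
        raw.decode("utf-8")  # strict decoding: raises on invalid byte sequences
    except UnicodeDecodeError:
        return "needs-conversion"
    return "utf-8"


print(classify(b"Hello forum"))
print(classify("Grüße".encode("utf-8")))
print(classify("Grüße".encode("latin-1")))
print(classify("Привет".encode("cp1251")))
```

One caveat: a few legacy byte sequences happen to be valid UTF-8 by accident, so a "utf-8" result is a strong hint rather than proof.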

  

 
Jet Li
Legacy Dev Team
Development Team
****
Offline



Posts: 6,588
Location: Hong Kong
Re: UTF-8 support
Reply #5 - Jan 30th, 2011 at 8:18pm
Thanks Michael, but I get this error when I upload it to my Dev Board.

System Information

An Error Has Occurred! utf8 "\xA0" does not map to Unicode at ./Sources/Subs.pl line 2243.  

  

PM me for YaBB Installation Service
 
Michael Prager
Re: UTF-8 support
Reply #6 - Jan 30th, 2011 at 8:37pm
Probably some non-Latin1 encoded data is causing the issue there. I guess we need to catch those errors instead of just dying. It should work with a fresh install though.
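
Catching the error instead of dying could look roughly like this. It is a sketch in Python for illustration only (YaBB is Perl); the function name decode_or_report and the warning format are hypothetical. The sample reuses the byte \xA0 from the error message, which is a non-breaking space in Latin1 but is not valid on its own in UTF-8.

```python
def decode_or_report(raw: bytes, source: str):
    """Decode UTF-8 strictly; on failure, fall back and return a warning.

    Returns (text, warning). warning is None when the data was valid.
    Invalid bytes are replaced with U+FFFD so the page can still render.
    """
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as err:
        text = raw.decode("utf-8", errors="replace")
        warning = f"{source}: invalid UTF-8 at byte {err.start}"
        return text, warning


text, warning = decode_or_report(b"a\xa0b", "forum.totals")
print(repr(text), warning)
```

The caller can then log the warning or show it to the admin instead of aborting the whole request.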

Btw: I resolved my SourceForge account problem; SVN is updated.
  

 
Michael Prager
Re: UTF-8 support
Reply #7 - Jan 30th, 2011 at 9:34pm
I'm curious though... I tried to stuff all kinds of invalidly encoded garbage into my test board, but it never breaks; no error at all. What does the file that causes the error look like?
  

 
Jet Li
Re: UTF-8 support
Reply #8 - Jan 30th, 2011 at 9:44pm
Maybe it happens at the Board Index. If I visit a topic via the User Profile, it works. I only get the error when I visit the Message Index or Board Index, or run Rebuild Message Index.
  

 
Michael Prager
Re: UTF-8 support
Reply #9 - Jan 30th, 2011 at 10:53pm
Ah OK, writing 0xfefeffff to forum.totals does produce the error. Investigating...
  

 
Jet Li
Re: UTF-8 support
Reply #10 - Jan 31st, 2011 at 5:00pm
OK, I will wait. The Boy has the same issue on his test forum. See the post on the Dev Board.
  

 
Michael Prager
Re: UTF-8 support
Reply #11 - Jan 31st, 2011 at 6:54pm
OK, I've thought about it. The way it currently works is very strict: it raises an error whenever incorrectly encoded data is encountered. But that's actually not a bad thing, because the user notices immediately if something got corrupted.

The question is how to convert existing data. On-the-fly conversion is probably not an option, because the user has to specify what the encoding of the old data is. It may even differ between the boards of a forum (like the Russian board on this forum). So we probably need a separate converter that offers a nice interface to specify the encoding. Where do we place such a converter? Should we put it in Setup.pl just like the Y1 converter? That way we wouldn't have to bother about wrong formats within the YaBB main code - handle all the conversion separately.
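
The core of such a separate converter is small. Here is a hedged sketch in Python for illustration (the real converter would be Perl and would live wherever the team decides); the function names, board names and the default encoding are all assumptions made for this example.

```python
def convert_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode legacy bytes as UTF-8, given the encoding the admin selected."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_boards(board_data: dict, board_encodings: dict, default: str) -> dict:
    """Convert each board's raw data using that board's own source encoding.

    board_encodings holds per-board overrides; boards without an entry
    fall back to the forum-wide default chosen by the admin.
    """
    return {
        board: convert_to_utf8(raw, board_encodings.get(board, default))
        for board, raw in board_data.items()
    }


# Hypothetical forum: a Latin1 general board and a Windows-1251 Russian board.
boards = {
    "general": "Grüße".encode("latin-1"),
    "russian": "Привет".encode("cp1251"),
}
converted = convert_boards(boards, {"russian": "cp1251"}, default="latin-1")
print({name: data.decode("utf-8") for name, data in converted.items()})
```

Taking the encoding as an explicit per-board parameter matches the point above: the stored bytes alone do not say which encoding wrote them, so the admin has to supply it.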
  

 
Jet Li
Re: UTF-8 support
Reply #12 - Jan 31st, 2011 at 7:04pm
Michael Prager wrote on Jan 31st, 2011 at 6:54pm:
Should we put it in Setup.pl just like the Y1 converter? That way we wouldn't have to bother about wrong formats within the YaBB main code - handle all the conversion separately.

That would be nice. On the converter page, the user can choose between two conversions.

YaBB 1.x to YaBB 3.x
YaBB 2.x to YaBB 3.x

or

YaBB 1.x to YaBB 2.x
YaBB 2.x to YaBB 3.x
« Last Edit: Jan 31st, 2011 at 7:06pm by Jet Li »  

 
Captain John
Re: UTF-8 support
Reply #13 - Feb 1st, 2011 at 1:37am
Info: of the 8 downloadable languages available for YaBB (other than English), the charsets are:
1 is Windows-1251 (Russian)
1 is Windows-1256 (Arabic)
5 are ISO-8859-1 (Danish, Spanish, Finnish, Deutsche & Deutsche_Du)
1 is ISO-8859-2 (Polish)

« Last Edit: Feb 1st, 2011 at 6:15pm by »  
 
Michael Prager
Re: UTF-8 support
Reply #14 - Feb 1st, 2011 at 3:12am
OK great, I'll put those into the encoding selection list. It might help the user to select a language instead of an encoding, though. Or at least we could show a table of which language file used which encoding.
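
A language-to-encoding table built from Captain John's list could be as simple as the following. This is a sketch in Python for illustration only; the dictionary and function names are hypothetical, and the codec names are Python's spellings of the charsets listed above.

```python
import codecs

# Language pack name -> legacy charset, taken from the list in reply #13.
LANGUAGE_CHARSETS = {
    "Russian": "cp1251",        # Windows-1251
    "Arabic": "cp1256",         # Windows-1256
    "Danish": "iso-8859-1",
    "Spanish": "iso-8859-1",
    "Finnish": "iso-8859-1",
    "Deutsche": "iso-8859-1",
    "Deutsche_Du": "iso-8859-1",
    "Polish": "iso-8859-2",
}


def encoding_for_language(language: str) -> str:
    """Return the legacy charset a language pack used; KeyError if unknown."""
    return LANGUAGE_CHARSETS[language]


# Sanity check: every listed charset is a codec this Python knows about.
for name, charset in LANGUAGE_CHARSETS.items():
    codecs.lookup(charset)
    print(f"{name}: {charset}")
```

With such a table the converter can offer a language drop-down and look up the encoding itself, while still allowing a manual override for boards that mixed languages.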
  
