Convert to UTF-8
blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #20 - Sep 16th, 2010 at 12:46am
If YaBB3 uses a MySQL database, you would also need a single encoding, unless YaBB adds some magic code such as Carsten's to account for the mix and handle it all for you. Something to keep in mind. :)

I have converted all files to UTF-8 and everything is working properly. To anyone else in this situation: UTFCast is a good free utility for batch file conversion, and it can leave out the UTF-8 BOM. Some converters insert the BOM automatically, which will cause YaBB to malfunction.
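For anyone who would rather script the conversion than use a GUI tool, here is a minimal Perl sketch of the same idea. It assumes every file is pure ISO-8859-1 (not already a mix of encodings), and the helper names `latin1_bytes_to_utf8`, `strip_utf8_bom` and `convert_file` are made up for illustration; they are not part of YaBB or UTFCast.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: turn ISO-8859-1 bytes into UTF-8 bytes.
# Encode does not add a BOM when encoding to UTF-8.
sub latin1_bytes_to_utf8 {
    my ($bytes) = @_;
    my $text = decode('iso-8859-1', $bytes);    # bytes -> characters
    return encode('UTF-8', $text);              # characters -> UTF-8 bytes
}

# Hypothetical helper: remove a leading UTF-8 BOM (EF BB BF)
# that some other converter may have written.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# Hypothetical helper: convert one file in place, working on raw bytes.
# Assumption: the whole file is ISO-8859-1.
sub convert_file {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    open my $out, '>:raw', $path or die "Cannot write $path: $!";
    print {$out} latin1_bytes_to_utf8($bytes);
    close $out or die "Cannot close $path: $!";
}
```

Something like `convert_file($_) for @files;` would then run it over a list of data files. Back everything up first, and do not run it twice on the same file, since a second pass would convert the already converted bytes again.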

I'm surprised this is a rare request here. I have seen it often in similar software that started with one encoding and moved to UTF-8 as standard over the years. WordPress and SMF come to mind off the top of my head.

Hopefully this topic will be useful to others in the future. I'm glad we got it all sorted out. :)
  
 
Carsten
Ex Member


Re: Convert to UTF-8
Reply #19 - Sep 15th, 2010 at 11:34pm
Yep - you can convert the whole lot to valid UTF-8 now, or you can use 'on the fly' and postpone conversion until you decide (if ever) to change to other forum software.
« Last Edit: Sep 15th, 2010 at 11:35pm by »  
 
blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #18 - Sep 15th, 2010 at 8:42pm
Thank you, Carsten. That sums up everything. We are now on the same page. :) I agree with the consensus. I understand your on-the-fly approach. It certainly solves the issue.

My alternate approach was to convert the member, board, and message data files to UTF-8 by mass converting everything. I just didn't know whether this would be a problem for YaBB, since the flat-file format delimiters and all would be included in the conversion. That was my original question in post 1.

Anyone else with this same issue, please note: if you might want to switch to other forum software in the future, you should probably choose the file conversion option. With on the fly, your data files end up with mixed encodings, so you would have to convert them to valid UTF-8 first before you could use any available converters.
  
 
Carsten
Ex Member


Re: Convert to UTF-8
Reply #17 - Sep 15th, 2010 at 8:04pm
Lots of arguments, quotes and fancy words. I'm a simple guy - so let me try and boil this topic down to something even I can understand.

blackcatnc has an existing forum using iso-8859-1 encoding.

Now he wants to change to UTF-8 encoding.

Question: Will YaBB display all the old iso-8859-1 encoded data correctly?

Answer: No - a lot of characters (from non-English languages) will need conversion to display correctly - period.

Solution: You can do one of two things: convert the entire base of data (displayed names, signatures, board descriptions...), or convert 'on the fly'. I've provided a (first try) method to do the latter. It tests whether the string is valid UTF-8: if it is, nothing happens; otherwise the string is converted.
« Last Edit: Sep 15th, 2010 at 8:07pm by »  
 
JonB
YaBB Administrator
Re: Convert to UTF-8
Reply #16 - Sep 15th, 2010 at 6:56pm
blackcatnc -

First, this isn't about your word, but about how things work. In the 'olive branch' PM I sent you, I explained that YaBB (from my research) used transliteration (read that as on-the-fly transcoding) and didn't work through wholesale conversion, so Carsten's reply (and proposed method) should not have been a surprise. You may also extend that logic to 'YaBB is really an adaptive transcoder', in that it works with what it's got. I'm not going to dig out LaTeX here and make the cute little dot triangle; I'll just say "Therefore, normally, a conversion is not anticipated". And your desire to convert character sets makes the case that we may need a mechanism that works correctly with the character set adaptations already built into the core code. I think your request might be the first. If you look down in the Language Specific Support Boards, you will find working Chinese and Russian forums. Frankly, I'm unsure how often conversions would be an issue, but I think that is worth discussing too.

Finally - I can say that one of the other posters on this topic had already broached the 'unthinkable': do we need/want to consider this complete change in method? So, I guess we (YaBB as a team project) are neither too long in the tooth nor too cantankerous to have our working assumptions challenged.

How's that for a quote from a 'support person'?

;)





« Last Edit: Sep 15th, 2010 at 6:58pm by JonB »  

I find your lack of faith disturbing.
 
Carsten
Ex Member


Re: Convert to UTF-8
Reply #15 - Sep 15th, 2010 at 1:19pm
@blackcatnc - to get details on what I am thinking of, you'll have to be a little patient. Remember, I'm only Danish and need a little time for coding and testing ;)

Try this for a start:

In Subs.pl find
Code
	$_[0] =~ s/\[ch(\d{3,})\]/ $1>127 ? "\&#$1;" : '' /egis; 


and add after
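Code
	if ($yycharset =~ m/utf-8/i) {
		use Encode qw(decode_utf8 from_to);
		# Check a copy: with a CHECK argument, decode_utf8 may modify its input.
		my $copy = $_[0];
		# Valid UTF-8 is left alone; anything else is assumed to be ISO-8859-1.
		unless (eval { decode_utf8($copy, Encode::FB_CROAK); 1 }) {
			from_to($_[0], "iso-8859-1", "utf8");
		}
	}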



Remember - this is very early and temporary code.
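To try the idea outside YaBB, the check can be sketched as a standalone function. This is only an illustration: `to_utf8_on_the_fly` is a made-up name, and it assumes any string that is not valid UTF-8 is ISO-8859-1.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 from_to);

# Hypothetical helper: return UTF-8 bytes for a string that is either
# already valid UTF-8 or (by assumption) ISO-8859-1.
sub to_utf8_on_the_fly {
    my ($bytes) = @_;
    # Probe a copy, because decode_utf8 may modify its argument
    # when a CHECK value such as FB_CROAK is passed.
    my $probe   = $bytes;
    my $is_utf8 = eval { decode_utf8($probe, Encode::FB_CROAK); 1 };
    # Only convert when the validity check failed.
    from_to($bytes, 'iso-8859-1', 'utf8') unless $is_utf8;
    return $bytes;
}

# 'café' stored the old way (E9) and the new way (C3 A9) both come out as C3 A9.
print unpack('H*', to_utf8_on_the_fly("caf\xE9")), "\n";
print unpack('H*', to_utf8_on_the_fly("caf\xC3\xA9")), "\n";
```

One limit of this heuristic: a few ISO-8859-1 byte pairs happen to be valid UTF-8 too (for example C3 A9, which is 'Ã©' in ISO-8859-1), so such old text would be left unconverted. In ordinary forum posts that should be rare, but it is a guess, not a guarantee.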
  
 
blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #14 - Sep 15th, 2010 at 1:15pm
I meant the specific character example I presented with 'é' and its byte representation. Explain how old post data with that character gets displayed when moving from the ISO-8859-1 to the UTF-8 character set in YaBB. A byte change must take place. I don't think you can, because it doesn't work and is not happening, as you can see from the link below.

I don't know where the communication breakdown is here. Nobody wants to talk about the specific example I've presented and why it will not display properly in old posts in YaBB if you do nothing more than set YaBB's $yycharset to UTF-8 and make your server serve UTF-8. That alone isn't enough and fails, yet people here claim it doesn't.

Proof since my word meant nothing:
http://transcorp.parodius.com/forum/YaBB.pl?num=1130404517/16#16 (I have since converted all files on this forum to UTF-8 correcting the issue per my explanations in previous posts.)

YaBB is set to UTF-8, server is sending UTF-8, old post data with character  'é' now FAILS.

Same post on forum with ISO8859-1.
http://transcorp.parodius.com/cgi-bin/yabb2/YaBB.pl?num=1130404517/16#16


What do you have to say about that? It clearly doesn't work as you say.
« Last Edit: Sep 16th, 2010 at 12:35am by blackcatnc »  
 
Captain John
Ex Member


Re: Convert to UTF-8
Reply #13 - Sep 15th, 2010 at 2:54am
blackcatnc wrote on Sep 15th, 2010 at 12:49am:
how YaBB supports UTF-8 and how the character translation will occur with a specific example

Specific? Check out the International Category. ISO8859-1 supports EU languages, but look at the others (characters correct): Russian, Chinese (both Traditional & Simplified) AND Japanese... a character set that requires not only a unique character but an extended code to display.

And YaBB does that! Very, very well!
  
 
blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #12 - Sep 15th, 2010 at 12:49am
I'm here to discuss how YaBB supports UTF-8 and how the character translation will occur with a specific example. No one has done that yet despite claiming it can do it.

Is discussing anything specific in detail automatically 'arguing' that doesn't deserve an answer? I'm interested in the details behind what you're doing too. You gave no details, so I made some assumptions, fishing for more.

I would like the same credit. I got the same attitude when I fixed the YaBB 1 to 2 member conversion code a few years ago. People here are quick to dismiss any problem, don't want to discuss details, and are quick to point fingers. That's probably not the best way to represent YaBB. I think I'm beginning to understand the basis for YaBB's current standing.

You just can't discuss anything around here. Instead, I get blanket statements of "you are here to argue" and support team members making unnecessary comments such as "I guess you didn't read that Wikipedia page you quoted me very well." This is why I don't contribute anymore.
  
 
Carsten
Ex Member


Re: Convert to UTF-8
Reply #11 - Sep 15th, 2010 at 12:24am
Nah blackcatnc - looks like JonB was right - you are here to argue ;)

Please give me just a little credit - I wasn't born yesterday...
  
 
blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #10 - Sep 15th, 2010 at 12:05am
Quote:
Quote:
Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with older character sets. They are known as precomposed characters. Precomposed characters are available in UCS for backwards compatibility with older encodings that have no combining characters, such as ISO 8859.


Yes, and the data YaBB has stored is in neither combining nor precomposed code points. Code point U+00E9 still needs two bytes in UTF-8 (0xC3 0xA9), and those are not the bytes that UTF-16 or another Unicode encoding uses for it (0x00 0xE9). Neither YaBB nor the server will convert the stored byte into either form.

Just try it. YaBB fails to display for the reasons mentioned.

Quote:
Server setup


Yes, I know how, and I have the server outputting Content-Type: text/html; charset=utf-8. It still fails.

Quote:
  Incorrect ... output is the only thing that "needs" converted (translated)
     ISO8859-1 is a smaller (restricted character set) compared to UTF-8 (which includes ALL characters available in ISO8859-1), in other words ... UTF-8 is "backwards" compatible with ISO8859-1


Mmm hmm... Now explain how a character that is 0xE9 in ISO8859-1 is going to get translated into its proper UTF-8 encoding with YaBB. Enlighten me.

Changing your server character set output doesn't do it. Changing YaBB to UTF-8 doesn't do it. The fact of the matter is your magic backwards compatibility isn't there. There's a reason every other piece of software on earth that used an older encoding and moved to UTF-8 needs to convert or translate the old data. I've done it before. I've coded UTF-8 converting utilities. I've run a LAMP server. Please explain how the translation process will occur and why it doesn't currently.


Carsten:

The problem with on the fly will be new data that's stored in UTF-8 format: you'll end up double converting, unless you're going to continue to store in ISO8859-1 and just output in UTF-8. But why would you want to do that?
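To make the double-conversion worry concrete, here is a small Perl sketch of what happens if every string is converted blindly, with no check for whether it is already UTF-8. The helper name `blind_latin1_to_utf8` is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: always convert, assuming the input is ISO-8859-1.
sub blind_latin1_to_utf8 {
    my ($bytes) = @_;
    from_to($bytes, 'iso-8859-1', 'utf8');
    return $bytes;
}

my $old_post = "\xE9";        # 'é' stored as ISO-8859-1
my $new_post = "\xC3\xA9";    # 'é' stored as UTF-8

my $ok      = blind_latin1_to_utf8($old_post);    # correct UTF-8 for 'é'
my $doubled = blind_latin1_to_utf8($new_post);    # each byte re-encoded: shows as 'Ã©'

print uc(unpack('H*', $ok)),      "\n";    # C3A9
print uc(unpack('H*', $doubled)), "\n";    # C383C2A9
```

So an unconditional on-the-fly conversion only works if the stored data stays ISO-8859-1 forever; once UTF-8 posts are mixed in, some test for already valid UTF-8 is needed before converting.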
  
 
JonB
YaBB Administrator
Re: Convert to UTF-8
Reply #9 - Sep 14th, 2010 at 11:05pm
Thanks very much Carsten!

Cool
  

 
Carsten
Ex Member


Re: Convert to UTF-8
Reply #8 - Sep 14th, 2010 at 9:31pm
I have to go with blackcatnc.

For languages other than English, a lot of ISO-8859-1 characters take 16 bits (two bytes) in UTF-8 and will need to be converted to display correctly.

I am now working on a 'converting on the fly' mod...
« Last Edit: Sep 14th, 2010 at 9:35pm by »  
 
Captain John
Ex Member


Re: Convert to UTF-8
Reply #7 - Sep 14th, 2010 at 8:24pm
Quote:
Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with older character sets. They are known as precomposed characters. Precomposed characters are available in UCS for backwards compatibility with older encodings that have no combining characters, such as ISO 8859.


Quote:
Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings with these encodings can contain as parts of many wide characters bytes like “\0” or “/” which have a special meaning in filenames and other C library function parameters. In addition, the majority of UNIX tools expects ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, etc.

The UTF-8 encoding defined in ISO 10646-1:2000 Annex D and also described in RFC 3629 as well as section 3.9 of the Unicode 4.0 standard does not have these problems. It is clearly the way to go for using Unicode under Unix-style operating systems.


Quote:
Server setup

How to make the server send out appropriate 'charset' information depends on the server. You will need the appropriate administrative rights to be able to change server settings.

Apache. This can be done via the AddCharset (Apache 1.3.10 and later) or AddType directives, for directories or individual resources (files). With AddDefaultCharset (Apache 1.3.12 and later), it is possible to set the default 'charset' for a whole server. For more information, see the article on Setting 'charset' information in .htaccess.

Jigsaw. Use an indexer in JigAdmin to associate extensions with charsets, or set the charset directly on a resource .

IIS 5 and 6. In Internet Services Manager, right-click "Default Web Site" (or the site you want to configure) and go to "Properties" => "HTTP Headers" => "File Types..." => "New Type...". Put in the extension you want to map, separately for each extension; IIS users will probably want to map .htm, .html,... Then, for Content type, add "text/html;charset=iso-8859-1" (without the quotes; substitute your desired charset for iso-8859-1; do not leave any spaces anywhere because IIS ignores all text after spaces). For IIS 4, you may have to use "HTTP Headers" => "Creating a Custom HTTP Header" if the above does not work.

Scripting the header

The appropriate header can also be set in server side scripting languages. For example:

Perl. Output the correct header before any part of the actual page. After the last header, use a double linebreak, e.g.:
print "Content-Type: text/html; charset=utf-8\n\n";

Python. Use the same solution as for Perl (except that you don't need a semicolon at the end).

PHP. Use the header() function before generating any content, e.g.:
header('Content-type: text/html; charset=utf-8');

Java Servlets. Use the setContentType method on the ServletResponse before obtaining any object (Stream or Writer) used for output, e.g.:
resource.setContentType ("text/html;charset=utf-8");
If you use a Writer, the Servlet automatically takes care of the conversion from Java Strings to the encoding selected.

JSP. Use the page directive e.g.:
<%@ page contentType="text/html; charset=UTF-8" %>
Output from out.println() or the expression elements (<%= object%>) is automatically converted to the encoding selected. Also, the page itself is interpreted as being in this encoding.

ASP and ASP.Net. content type and charset are set independently, and are methods on the response object. To set the charset, use e.g.:
<%Response.charset="utf-8"%>
In ASP.Net, setting Response.ContentEncoding will take care both of the charset parameter in the HTTP Content-Type as well as of the actual encoding of the document sent out (which of course have to be the same). The default can be set in the globalization element in Web.config (or Machine.config, which is originally set to UTF-8).


blackcatnc wrote on Sep 14th, 2010 at 6:50pm:
You need to convert or it won't display correctly.

  Incorrect ... output is the only thing that "needs" converted (translated)
     ISO8859-1 is a smaller (restricted character set) compared to UTF-8 (which includes ALL characters available in ISO8859-1), in other words ... UTF-8 is "backwards" compatible with ISO8859-1
« Last Edit: Sep 14th, 2010 at 8:30pm by »  
 
blackcatnc
YaBB Newcomer
Re: Convert to UTF-8
Reply #6 - Sep 14th, 2010 at 6:50pm
JonB wrote on Sep 14th, 2010 at 6:19pm:
Umm - not quite correct -

That's not true if you use any special characters or mix in words from a language other than English for any reason.

should be "mix in words from a NON-LATINATE character set". ISO-8859-1 is NOT English - it's Latin-1, which supports most Western European languages.  (I guess you didn't read that Wikipedia page you quoted me very well.)


You're just not understanding. Here are the facts:

é = byte 0xE9 in ISO-8859-1
é = bytes 0xC3 0xA9 in UTF-8

You need to convert or it won't display correctly.

It's not an argument. It's a simple fact.

I've already tested it. YaBB fails exactly in the manner expected.
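Anyone who wants to verify those byte values can do so with a few lines of Perl and the core Encode module. This is just a quick check, not YaBB code:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $e_acute = "\x{E9}";                          # the character U+00E9 ('é')
my $latin1  = encode('iso-8859-1', $e_acute);    # one byte
my $utf8    = encode('UTF-8', $e_acute);         # two bytes

printf "ISO-8859-1: %s\n", uc(unpack('H*', $latin1));    # E9
printf "UTF-8:      %s\n", uc(unpack('H*', $utf8));      # C3A9
```

A browser told to expect UTF-8 that receives the lone E9 byte has no valid sequence to decode, which is why the old posts break.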
« Last Edit: Sep 14th, 2010 at 7:03pm by blackcatnc »  
 