utf-8 support
MishDan
Location: Kyiv, Ukraine
utf-8 support
Jun 14th, 2007 at 7:37am
Not sure if this is the appropriate place for this...

Maybe add a trigger (switched by $yycharset) in Subs.pl/CountChars to support utf-8 strings. Actually, all that is needed is to add the number of bytes in the range [\x80-\xBF] to $convertcut (if I understand that code right).

Without this, YaBB cuts strings (the topic subject in the last post column) in the middle of a multibyte char, and then <?> appears.
[img]http://isbear.ho.com.ua/files/sh1.png[/img]
[i]<?>[/i]

The multibyte string length problem also affects other places, but here it causes <?>'s on the main page.
[img]http://isbear.ho.com.ua/files/sh2.png[/img]
[i]Multibyte strings have much smaller limits[/i]
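
To make the problem concrete, here is a minimal sketch in Perl, assuming the subject is held as raw UTF-8 bytes. The helper names cut_utf8_bytes and count_continuation_bytes are hypothetical and not part of YaBB; they only illustrate why a byte-based cut can split a character and why counting bytes in [\x80-\xBF] gives the difference between byte length and character count.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of YaBB): truncate a UTF-8 byte string to at
# most $max characters without splitting a multibyte sequence.
sub cut_utf8_bytes {
    my ($bytes, $max) = @_;
    # One "character" is either an ASCII byte or a lead byte followed by
    # one to three continuation bytes in the range \x80-\xBF.
    my ($head) = $bytes =~ /^((?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]{1,3}){0,$max})/;
    return $head;
}

# Hypothetical helper: continuation bytes are the extra bytes that make the
# byte length larger than the character count.
sub count_continuation_bytes {
    my ($bytes) = @_;
    my $count = () = $bytes =~ /[\x80-\xBF]/g;
    return $count;
}

my $subject = "\xD0\xA2\xD0\xB5\xD0\xBC\xD0\xB0";   # four Cyrillic letters as raw UTF-8 bytes

my $naive = substr($subject, 0, 3);        # byte cut: ends inside the second character
my $safe  = cut_utf8_bytes($subject, 3);   # character cut: three whole characters

printf "bytes=%d continuation=%d\n", length($subject), count_continuation_bytes($subject);
printf "naive=%d bytes, safe=%d bytes\n", length($naive), length($safe);
```

The naive cut leaves a lone lead byte at the end of the string, which is what a browser shows as <?>.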
« Last Edit: Jun 14th, 2007 at 8:10am by MishDan »  

P.S. Sorry, my English is not so good, as I wish
jasinner
Re: utf-8 support
Reply #1 - Jun 16th, 2007 at 2:44pm
Can you avoid the problem by using SHIFT-JIS?
It doesn't have these issues.
« Last Edit: Jun 16th, 2007 at 2:44pm by jasinner »  

Have fun, keep a diary in english&&English日記
MishDan
Re: utf-8 support
Reply #2 - Jun 16th, 2007 at 4:24pm
:) I can avoid the problem by using koi8-u, windows-1251 or cp866.
utf-8 is just more future-oriented and flexible... I suggest this for the final 2.x release.
  

jasinner
Re: utf-8 support
Reply #3 - Jun 17th, 2007 at 4:07am
Russian characters can also be written with UTF-8? I ran into the same problem when using Japanese.

I also noticed some similar issues over in the Chinese support thread, so fixing this bug could help millions of people use YaBB.
« Last Edit: Jun 17th, 2007 at 4:10am by jasinner »  

MishDan
Re: utf-8 support
Reply #4 - Jun 17th, 2007 at 10:57am
Almost any character in the world can be written in utf-8 (and more: you can even write in runic and dead scripts :)). The only issue is whether the end user's client has the font glyphs to display them.
  

MishDan
Re: utf-8 support
Reply #5 - Jun 22nd, 2007 at 8:30pm
I am posting this patch here; maybe it will be helpful for someone. As noted, though, it is a hack (maybe an ugly one) and I am a Perl newbie.

yabb-2.2-utf_8_cut-1.patch: [code]
Date: Fri Jun 22 23:13:06 EEST 2007
From: Myhailo Danylenko <isbear@ukrpost.net>
Description: Patch fixes improper truncation of multibyte strings
in Subs.pl/CountChars.
Note: This is a hack.
Note: I'm currently very new to Perl.

--- nowww/Sources/Subs.pl.orig  2007-06-17 13:07:00.000000000 +0300
+++ nowww/Sources/Subs.pl       2007-06-22 23:11:13.000000000 +0300
@@ -900,11 +900,11 @@
       $convertstr =~ s/(\<.+?\>)/ $1 /ig;
       my @cwords = split(/\s/, $convertstr);
       $convertstr = "";
-       my $curword;
+       my ($curword, $tmpword, $tmplen);
       foreach $curword (@cwords) {
               $curword =~ s/&#32;/ /g;
-               $convertstr .= "$curword ";
               if ($curword =~ m~\[ch\d{3,}?\]~ || $curword =~ m~\<.+?\>~) {
+                       $convertstr .= "$curword ";
                       $convertcut += length($curword);
                       if (length($convertstr) > $convertcut) {
                               $clipdiff = length($curword) - (length($convertstr) - $convertcut);
@@ -913,10 +913,19 @@
                               last;
                       }
               }
-               if (length($convertstr) > $convertcut) {
+               $tmpword =  $curword;
+               $tmpword =~ s/[^\x80-\xBF]+//g;
+               $tmplen  =  $convertcut - length ($convertstr);
+               $convertstr .= "$curword ";
+               if (length($curword) + 1  > $tmplen + length($tmpword)) {
                       $cliped = 1;
+                       $tmpword =  $curword;
+                       $tmpword =~ s/^(([\xC0-\xFF][\x80-\xBF]{1,3}|[\x00-\x7F]){$tmplen}).*$/$1/;
+                       $tmpword =~ s/[^\x80-\xBF]+//g;
+                       $convertcut += length ($tmpword);
                       last;
               }
+               $convertcut += length ($tmpword);
       }
       $convertstr =~ s/ (\<.+?\>) /$1/ig;
       $convertstr =~ s/ (\[ch\d{3,}?\])/$1/ig;
[/code]
[edit]Attached it as a file.[/edit]
[edit]Note: this patch is designed [b]only[/b] for the utf-8 encoding. If you also use some 8-bit encoding, you'll need to modify it (add a trigger).[/edit]
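
A rough sketch of what such a trigger could look like, assuming $yycharset holds a charset name such as "UTF-8" or "windows-1251" (the variable name comes from this thread; its exact values are an assumption). The helper extra_cut_bytes is hypothetical and only shows the idea of skipping the byte correction for 8-bit encodings.

```perl
use strict;
use warnings;

# Assumption: $yycharset holds the board's charset name. The values used
# here are illustrative, not taken from YaBB's settings.
our $yycharset = 'UTF-8';

# Hypothetical helper: how many bytes to add to the cut limit for $word so
# that the limit keeps counting characters rather than bytes.
sub extra_cut_bytes {
    my ($word) = @_;
    # 8-bit encodings use one byte per character, so nothing is added.
    return 0 unless $yycharset =~ /^utf-?8$/i;
    # In utf-8, every continuation byte is one byte beyond the character count.
    my $count = () = $word =~ /[\x80-\xBF]/g;
    return $count;
}

my $utf8_word = "\xD1\x97\xD0\xB6\xD0\xB0\xD0\xBA";   # a four-letter Ukrainian word as UTF-8 bytes
print extra_cut_bytes($utf8_word), "\n";

{
    # The same word in windows-1251 is four single bytes; no correction needed.
    local $yycharset = 'windows-1251';
    print extra_cut_bytes("\xBF\xE6\xE0\xEA"), "\n";
}
```

Without such a check, bytes like \xBF in an 8-bit encoding would be miscounted as continuation bytes.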
« Last Edit: Jun 22nd, 2007 at 10:37pm by MishDan »  

yabb-2_2-utf_8-1_patch.zip (Attachment deleted)

MishDan
Re: utf-8 support
Reply #6 - Jun 28th, 2007 at 11:43am
New patch (not incremental; it includes the previous one).
It also fixes Subs.pl/WrapChars.
[color=#990000]All previous notes still apply.[/color]
[tt]From: Myhailo Danylenko <isbear@ukrpost.net>
Date: Thu Jun 28 17:25:47 EEST 2007
Description: Patch fixes Subs.pl/{Count,Wrap}Chars to work with utf-8.
Note: This patch will work *only* with utf-8.
Note: It seems posts use another routine to wrap lines and count chars, so wait until I find it.
Note: I'm a Perl newbie and the code quality is not good. It is just a hack to get utf-8 working. I will be very glad for any suggestions.

--- Subs.pl.orig      2007-06-28 17:20:05.000000000 +0300
+++ nowww/Sources/Subs.pl      2007-06-28 17:06:40.000000000 +0300
@@ -899,11 +899,11 @@
     $convertstr =~ s/(\<.+?\>)/ $1 /ig;
     my @cwords = split(/\s/, $convertstr);
     $convertstr = "";
-      my $curword;
+      my ($curword, $tmpcutword, $addlen);
     foreach $curword (@cwords) {
           $curword =~ s/&#32;/ /g;
-            $convertstr .= "$curword ";
           if ($curword =~ m~\[ch\d{3,}?\]~ || $curword =~ m~\<.+?\>~) {
+                  $convertstr .= "$curword ";
                 $convertcut += length($curword);
                 if (length($convertstr) > $convertcut) {
                       $clipdiff = length($curword) - (length($convertstr) - $convertcut);
@@ -911,11 +911,22 @@
                       $cliped = 1;
                       last;
                 }
+                  next;
           }
-            if (length($convertstr) > $convertcut) {
+            $tmpcutword =  $curword;
+            $tmpcutword =~ s/[^\x80-\xBF]+//g;
+            $addlen  =  $convertcut - length ($convertstr);
+            $convertstr .= "$curword ";
+#            if (length($curword) + 1  > length($tmpcutword) + $addlen) {
+            if (length($convertstr) > $convertcut + length($tmpcutword)) {
                 $cliped = 1;
+                  $tmpcutword =  $curword;
+                  $tmpcutword =~ s/^(([\xC0-\xFF][\x80-\xBF]{1,3}|[\x00-\x7F]){$addlen}).*$/$1/;
+                  $tmpcutword =~ s/[^\x80-\xBF]+//g;
+                  $convertcut += length ($tmpcutword);
                 last;
           }
+            $convertcut += length ($tmpcutword);
     }
     $convertstr =~ s/ (\<.+?\>) /$1/ig;
     $convertstr =~ s/ (\[ch\d{3,}?\])/$1/ig;
@@ -926,23 +937,28 @@

sub WrapChars {
     $wrapstr =~ s/(\&\#\d{3,}?\;)/ $1/ig;
-      $wrapstr =~ s~(\S{$wrapcut})~$1 ~gi;
+      # In hope, that all spacechars are lower than 0x21 (\r,\n,\t,\ )
+      # Is other control chars allowed in strings?..
+      # Can we count them as spacechars?
+      $wrapstr =~ s~(([\x21-\x7F]|[\xC0-\xFF][\x80-\xBF]{1,3}){$wrapcut})~$1 ~gi;
     $tmpwrapcut = $wrapcut;
     my @wwords = split(/\s/, $wrapstr);
     $tmpwrapstr = "";
     $wrapstr    = "";
-      my $curword;
+      my ($curword, $tmpwrapword);
     foreach $curword (@wwords) {
-
           if ($curword =~ m~\&\#\d{3,}?\;~) {
                 $tmpwrapcut += length($curword);
           }
-            if ((length($tmpwrapstr) + length($curword)) > $tmpwrapcut) {
+            $tmpwrapword =  $curword;
+            $tmpwrapword =~ s/[^\x80-\xBF]+//g;
+            if (length($curword) + length($tmpwrapstr) + 1 > length($tmpwrapword) + $tmpwrapcut) {
                 $wrapstr .= qq~$tmpwrapstr<br />~;
                 $tmpwrapstr = "$curword ";
-                  $tmpwrapcut = $wrapcut;
+                  $tmpwrapcut = $wrapcut + length($tmpwrapword);
           } else {
                 $tmpwrapstr .= "$curword ";
+                  $tmpwrapcut += length($tmpwrapword);
           }
     }
     $wrapstr .= qq~$tmpwrapstr~;
[/tt]
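
For anyone who wants to try the WrapChars substitution outside YaBB, here is a small standalone sketch. The function name break_long_runs is hypothetical; the pattern is the one from the patch, written with a non-capturing group. It assumes the input is raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the patched substitution: insert a
# space after every $wrapcut characters of an unbroken run, where a
# character is a non-space ASCII byte or a complete multibyte sequence.
sub break_long_runs {
    my ($bytes, $wrapcut) = @_;
    $bytes =~ s/((?:[\x21-\x7F]|[\xC0-\xFF][\x80-\xBF]{1,3}){$wrapcut})/$1 /g;
    return $bytes;
}

my $run = "\xD0\xB0" x 5;   # five two-byte Cyrillic letters, 10 bytes, no spaces
my $cut = 3;

# Original byte-based form: counts bytes, so with an odd limit it puts a
# space between the two bytes of a character.
(my $naive = $run) =~ s/(\S{$cut})/$1 /g;

# Character-aware form: spaces only land between whole characters.
my $safe = break_long_runs($run, $cut);

printf "naive=%d bytes, safe=%d bytes\n", length($naive), length($safe);
```

The character-aware form breaks after three letters and leaves the remaining two together, while the byte-based form breaks every three bytes.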
« Last Edit: Jun 28th, 2007 at 11:44am by MishDan »  

yabb-2_2-utf_8-2_patch.zip (Attachment deleted)

MishDan
Re: utf-8 support
Reply #7 - Jun 28th, 2007 at 1:01pm
Hah, found it.
Third patch.
+ Subs.pl/wrap{,2}.
Although this patch does not fix the improper post length, it at least fixes the [i]<?>[/i]'s.
[tt]From: Myhailo Danylenko <isbear@ukrpost.net>
Date: Thu Jun 28 18:54:00 EEST 2007
Description: Patch fixes Subs.pl/{{Count,Wrap}Chars,wrap{,2}} to work with utf-8.
Note: This patch will work *only* with utf-8.
Note: Multibyte posts will still have much stricter length limits, but at least they will wrap properly.
Note: I'm a Perl newbie and the code quality is not good. It is just a hack to get utf-8 working. I will be very glad for any suggestions.

--- Subs.pl.orig      2007-06-28 17:20:05.000000000 +0300
+++ nowww/Sources/Subs.pl      2007-06-28 18:29:12.000000000 +0300
@@ -899,11 +899,11 @@
     $convertstr =~ s/(\<.+?\>)/ $1 /ig;
     my @cwords = split(/\s/, $convertstr);
     $convertstr = "";
-      my $curword;
+      my ($curword, $tmpcutword, $addlen);
     foreach $curword (@cwords) {
           $curword =~ s/&#32;/ /g;
-            $convertstr .= "$curword ";
           if ($curword =~ m~\[ch\d{3,}?\]~ || $curword =~ m~\<.+?\>~) {
+                  $convertstr .= "$curword ";
                 $convertcut += length($curword);
                 if (length($convertstr) > $convertcut) {
                       $clipdiff = length($curword) - (length($convertstr) - $convertcut);
@@ -911,11 +911,22 @@
                       $cliped = 1;
                       last;
                 }
+                  next;
           }
-            if (length($convertstr) > $convertcut) {
+            $tmpcutword =  $curword;
+            $tmpcutword =~ s/[^\x80-\xBF]+//g;
+            $addlen  =  $convertcut - length ($convertstr);
+            $convertstr .= "$curword ";
+#            if (length($curword) + 1  > length($tmpcutword) + $addlen) {
+            if (length($convertstr) > $convertcut + length($tmpcutword)) {
                 $cliped = 1;
+                  $tmpcutword =  $curword;
+                  $tmpcutword =~ s/^(([\xC0-\xFF][\x80-\xBF]{1,3}|[\x00-\x7F]){$addlen}).*$/$1/;
+                  $tmpcutword =~ s/[^\x80-\xBF]+//g;
+                  $convertcut += length ($tmpcutword);
                 last;
           }
+            $convertcut += length ($tmpcutword);
     }
     $convertstr =~ s/ (\<.+?\>) /$1/ig;
     $convertstr =~ s/ (\[ch\d{3,}?\])/$1/ig;
@@ -926,23 +937,28 @@

sub WrapChars {
     $wrapstr =~ s/(\&\#\d{3,}?\;)/ $1/ig;
-      $wrapstr =~ s~(\S{$wrapcut})~$1 ~gi;
+      # In hope, that all spacechars are lower than 0x21 (\r,\n,\t,\ )
+      # Is other control chars allowed in strings?..
+      # Can we count them as spacechars?
+      $wrapstr =~ s~(([\x21-\x7F]|[\xC0-\xFF][\x80-\xBF]{1,3}){$wrapcut})~$1 ~gi;
     $tmpwrapcut = $wrapcut;
     my @wwords = split(/\s/, $wrapstr);
     $tmpwrapstr = "";
     $wrapstr    = "";
-      my $curword;
+      my ($curword, $tmpwrapword);
     foreach $curword (@wwords) {
-
           if ($curword =~ m~\&\#\d{3,}?\;~) {
                 $tmpwrapcut += length($curword);
           }
-            if ((length($tmpwrapstr) + length($curword)) > $tmpwrapcut) {
+            $tmpwrapword =  $curword;
+            $tmpwrapword =~ s/[^\x80-\xBF]+//g;
+            if (length($curword) + length($tmpwrapstr) + 1 > length($tmpwrapword) + $tmpwrapcut) {
                 $wrapstr .= qq~$tmpwrapstr<br />~;
                 $tmpwrapstr = "$curword ";
-                  $tmpwrapcut = $wrapcut;
+                  $tmpwrapcut = $wrapcut + length($tmpwrapword);
           } else {
                 $tmpwrapstr .= "$curword ";
+                  $tmpwrapcut += length($tmpwrapword);
           }
     }
     $wrapstr .= qq~$tmpwrapstr~;
@@ -1027,13 +1043,13 @@
     my @words = split(/\s/, $message);
     $message = "";
     foreach $cur (@words) {
-            if ($cur !~ m~www\.(\S+?)\.~ && $cur !~ m~[ht|f]tp://~ && $cur !~ m~\[\S*\]~ && $cur !~ m~\[\S*\s?\S*?\]~ && $cur !~ m~\[\/\S*\]~) { $cur =~ s~(\S{$linewrap})~$1\n~gi; }
+            if ($cur !~ m~www\.(\S+?)\.~ && $cur !~ m~[ht|f]tp://~ && $cur !~ m~\[\S*\]~ && $cur !~ m~\[\S*\s?\S*?\]~ && $cur !~ m~\[\/\S*\]~) { $cur =~ s~(([\x21-\x7F]|[\xC0-\xFF][\x80-\xBF]{1,3}){$linewrap})~$1\n~gi; }
           if ($cur !~ m~\[table(\S*)\](\S*)\[\/table\]~ && $cur !~ m~\[url(\S*)\](\S*)\[\/url\]~ && $cur !~ m~\[flash(\S*)\](\S*)\[\/flash\]~ && $cur !~ m~\[img(\S*)\](\S*)\[\/img\]~) {
                 $cur =~ s~(\[\S*?\])~ $1 ~g;
                 @splitword = split(/\s/, $cur);
                 $cur = "";
                 foreach $splitcur (@splitword) {
-                        if ($splitcur !~ m~www\.(\S+?)\.~ && $splitcur !~ m~[ht|f]tp://~ && $splitcur !~ m~\[\S*\]~) { $splitcur =~ s~(\S{$linewrap})~$1
~gi; }
+                        if ($splitcur !~ m~www\.(\S+?)\.~ && $splitcur !~ m~[ht|f]tp://~ && $splitcur !~ m~\[\S*\]~) { $splitcur =~ s~(([\x21-\x7F]|[\xC0-\xFF][\x80-\xBF]{1,3}){$linewrap})~$1
~gi; }
                       $cur .= $splitcur;
                 }
           }
@@ -1049,8 +1065,8 @@
}

sub wrap2 {
-      $message =~ s~<a href=("?)(\S*)("?)(\starget="_blank")?>(\S{$linewrap})(\S*)</a>~<a href=$1$2$3$4>$5\n$6</a>~gi;
-      $message =~ s~<a href=("?)(\S*)("?) target=\"_blank\" onclick=\"(.+?)\">(\S{$linewrap})(\S*)</a>~<a href=$1$2$3 target=\"_blank\" onclick=\"$4\">$5\n$6</a>~g;
+      $message =~ s~<a href=("?)(\S*)("?)(\starget="_blank")?>(([\x21-\x7F]|[\xC0-\xFF][\x80-\xBF]{1,3}){$linewrap})(([\x21-\x7F]|[\xC0-\xFF][\x80-\xBF]{1,3})*)</a>~<a href=$1$2$3$4>$5\n$7</a>~gi;
+      $message =~ s~<a href=("?)(\S*)("?) target=\"_blank\" onclick=\"(.+?)\">(([\x21-\x7F]|[\xC0-\xFF][\x80-\xBF]{1,3}){$linewrap})(([\x21-\x7F]|[\xC0-\xFF][\x80-\xBF]{1,3})*)</a>~<a href=$1$2$3 target=\"_blank\" onclick=\"$4\">$5\n$7</a>~g;
}

sub RemoveThreadFiles {
[/tt]
  

yabb-2_2-utf_8-3_patch.zip (Attachment deleted)

MF-B
Location: Moscow, Russia

YaBB 2.5
Re: utf-8 support
Reply #8 - Jun 28th, 2007 at 6:29pm
@MishDan

Nice... :D :)

But YaBB mods use a different format for mods/patches.

Sample:

<search for>
		&ToChars($member{'name'});
</search for>

<add after>
		&ToHTML($member{'location'});
		&ToHTML($member{'bday'});
		&ToHTML($member{'sesquest'});
</add after>

<search for>
		&ToHTML($FORM{'hideemail'});
</search for>

<replace>
		&ToHTML($member{'hideemail'});
</replace>

<search for>
		# Time to print the changes to the username.vars file
		${$uid.$user}{'usertext'}	= "$member{'usertext'}";
</search for>

<add before>
		&ToHTML($member{'userpic'});
		&ToHTML($member{'usertimeoffset'});
		&ToHTML($member{'usertimeselect'});
		&ToHTML($member{'usertemplate'});
		&ToHTML($member{'userlanguage'});
		&ToHTML($member{'timeformat'});
</add before>
 



;)
« Last Edit: Jun 28th, 2007 at 6:29pm by MF-B »  

Stand!
Corey Chapman
Location: Rock Hill, South Carolina

Re: utf-8 support
Reply #9 - Jun 28th, 2007 at 6:33pm
Is this something that can be considered a bug and should be put into effect in the next release?
  

MishDan
Re: utf-8 support
Reply #10 - Jun 28th, 2007 at 7:02pm
To MF-B:
Eh, well, and which tools do I use to generate it? :)

To Corey Chapman:
That would be great, but IMHO it would also be a great deal of work.
These patches are hacks "to only get it to work without <?>'s". It would be much better to do this with Perl's own facilities, e.g. 'use utf8' and decoded character strings. Then all those strange \xXX regexes would disappear and length($something_utf_8) would return the proper length.
But currently I have neither the Perl knowledge nor an overview of the entire YaBB code to go down that path.
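
As a rough illustration of that path, here is a minimal sketch using the core Encode module. It is not YaBB code; the variable names are made up for the example. Note that 'use utf8' by itself only tells Perl that string literals in the source file are UTF-8, so data read from files or forms still has to be decoded before length() counts characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xD0\xA2\xD0\xB5\xD0\xBC\xD0\xB0";   # four Cyrillic letters as raw UTF-8 bytes

# Decode once at the input boundary; after that, Perl works in characters.
my $chars = decode('UTF-8', $bytes);

print length($bytes), "\n";   # counts bytes
print length($chars), "\n";   # counts characters

# substr on the decoded string cannot split a character.
my $cut_chars = substr($chars, 0, 3);

# Encode again at the output boundary.
my $cut_bytes = encode('UTF-8', $cut_chars);
print length($cut_bytes), "\n";
```

With this approach the cut and wrap limits can stay plain character counts, at the cost of decoding and encoding everywhere strings enter and leave the program.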
  

MF-B
Re: utf-8 support
Reply #11 - Jun 28th, 2007 at 7:17pm
@MishDan

BoardMod; it includes ModEditor ;)
  

MishDan
Re: utf-8 support
Reply #12 - Jun 28th, 2007 at 7:24pm
Thanks, I'll try it and convert the existing patches...
  

MF-B
Re: utf-8 support
Reply #13 - Jun 28th, 2007 at 7:25pm
If there is a problem, I can help you... ;)
  

MishDan
Re: utf-8 support
Reply #14 - Jun 29th, 2007 at 2:01am
:D
The machine I am working on today has no X installed, so I wrote a Perl script that converts diff output into .mod format...

The result is attached. It is not very optimized, but that is the price of keeping the converter simple.

P.S. It is not zipped!
I have no zip installed either :D
Just run $ mv yabb-2.2-utf_8-3.mod{.zip,}

P.P.S. Heh, and how do I create a new file in the mod format?
"edit file" and then "add after"?

[edit]P.P.P.S. Oops... Updated the attachment: fixed an error in the converter.[/edit]
« Last Edit: Jun 29th, 2007 at 3:21am by MishDan »  

yabb-2_2-utf_8-3_mod_001.zip (Attachment deleted)
