From WikiChip
Editing mirc/unicode

=== mIRC and Unicode ===
 
  
Before it adopted Unicode (in mIRC 7.x), mIRC supported Windows code pages, which handle the various character sets used in different parts of the world. A code page encodes each character in 8 bits (1 byte), which can represent 256 values. All code pages use the first 128 values for ASCII, and each code page then adds the characters its language requires, e.g. é for French. Even before Unicode, languages such as Japanese and Chinese needed more than 256 values and were already using special encodings that take more than one byte per character.
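As a rough illustration of how code pages behave, the following Python sketch (Python is used only for demonstration; it is not mIRC scripting) decodes the same byte under three Windows code pages. The choice of code pages 1252, 1253 and 1251, and of Shift-JIS as the multi-byte example, is an assumption made for this sketch.

```python
# The same byte value maps to different characters under different code pages.
BYTE = bytes([0xE9])

western = BYTE.decode("cp1252")    # Western European code page
greek = BYTE.decode("cp1253")      # Greek code page
cyrillic = BYTE.decode("cp1251")   # Cyrillic code page

# The first 128 values are ASCII in all three code pages.
ascii_same = {bytes([0x41]).decode(cp) for cp in ("cp1252", "cp1253", "cp1251")}

# Japanese needs more than 256 values, so an encoding such as Shift-JIS
# uses more than one byte for a single character (here HIRAGANA LETTER A).
hiragana_a = "\u3042".encode("shift_jis")

# Print code points rather than glyphs so the output is console-safe.
print([hex(ord(c)) for c in (western, greek, cyrillic)], ascii_same, len(hiragana_a))
```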
  
 
Unicode provides a single character set that supports all languages, with room for 1,114,112 distinct code points. Since IRC users are from all over the world, often using their own language and character set, a single common character set is a major advantage.
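The figure of 1,114,112 follows from Unicode's 17 planes of 65,536 code points each. A quick check in Python (used here purely for illustration):

```python
import sys

PLANES = 17                        # planes 0 through 16
CODE_POINTS_PER_PLANE = 0x10000    # 65,536 code points in each plane

total_code_points = PLANES * CODE_POINTS_PER_PLANE
highest_code_point = total_code_points - 1

# sys.maxunicode is the highest code point Python itself supports.
print(total_code_points, hex(highest_code_point), sys.maxunicode == highest_code_point)
```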
 
 
For $regsubex, this is because mIRC uses the 8-bit PCRE API: when replacing, it has to convert your 16-bit unit ($chr(55357) in the first iteration) to UTF-8.
 
  
'''Note:''' mIRC does in fact convert each of the two surrogates to a UTF-8 byte sequence, because the general UTF-8 algorithm can still be applied to a lone surrogate, and mIRC applies it here. This is not really correct, however: surrogates are not characters, just code points used in pairs to form other characters, so it is improper to decode those bytes back to the surrogate. When mIRC decodes the UTF-8 that $regsubex generated, it recognizes the illegal sequence and simply returns the character corresponding to each byte, so you get 3 characters per surrogate.
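The byte-level effect can be reproduced outside mIRC. The Python sketch below is only an illustration, not mIRC's implementation: the example character U+1F600 is an assumption (only its first UTF-16 unit, 55357, appears in the text above), and Latin-1 is an assumed stand-in for "the character corresponding to each byte".

```python
# Assumption: U+1F600 is the example character; its first UTF-16 unit is 55357.
char = "\U0001F600"

# Split the character into its two 16-bit UTF-16 units (a surrogate pair).
utf16 = char.encode("utf-16-be")
high = int.from_bytes(utf16[0:2], "big")
low = int.from_bytes(utf16[2:4], "big")

# The general UTF-8 algorithm can be applied to a lone surrogate, but the
# result is not valid UTF-8. Python only permits it with "surrogatepass".
high_bytes = chr(high).encode("utf-8", "surrogatepass")

# A strict UTF-8 decoder rejects those bytes as an illegal sequence.
try:
    high_bytes.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# Assumption: Latin-1 models falling back to one character per byte.
fallback = high_bytes.decode("latin-1")

print(high, low, high_bytes.hex(), strict_decode_ok, len(fallback))
```

Each surrogate becomes three bytes, and a byte-by-byte fallback therefore yields three characters per surrogate.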
  
 
=== $utfencode / $utfdecode ===
 
 
'''Note''': GDI charsets 1 and 255 are system dependent and are therefore expected to return different results across different machines. Values not on the table are treated as a reference to DEFAULT_CHARSET, equivalent to using C = 1.
 
  
For example, suppose you have text that uses the ISO-8859-7 (GREEK) encoding for Greek letters and you want it in UTF-8. You need to encode it to UTF-8, interpreting the bytes according to the GREEK code page, and then decode the result from UTF-8: $utfdecode($utfencode(text,161))
  
 
If you want to send text in the GREEK encoding over IRC, note that mIRC encodes the bytes internally, so you must first encode the text to UTF-8 and then decode it, interpreting the bytes according to the GREEK code page: /raw -n privmsg #chan $utfdecode($utfencode(text),161)
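The two directions can be pictured at the byte level with the Python sketch below. It is not how mIRC implements $utfencode or $utfdecode; it only mirrors the conversions described above. The codec name "iso8859_7" is Python's name for ISO-8859-7, and the three sample letters are arbitrary.

```python
# Assumption: three sample bytes, which are alpha, beta, gamma in ISO-8859-7.
greek_bytes = bytes([0xE1, 0xE2, 0xE3])

# Receiving: interpret the bytes with the Greek code page to get Unicode text.
text = greek_bytes.decode("iso8859_7")

# The same text in UTF-8 takes two bytes per Greek letter.
utf8_bytes = text.encode("utf-8")

# Sending: convert the Unicode text back to single-byte Greek code page values.
round_trip = text.encode("iso8859_7")

print([hex(ord(c)) for c in text], utf8_bytes.hex(), round_trip == greek_bytes)
```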
 
 
{{mIRC|/sockwrite|/sockwrite -u}} can be used to the same effect: it won't encode characters in the range 0-255 to UTF-8.
 
  
=== Normalization ===
It is beyond the scope of this wiki page to explain Unicode normalization in detail, but you should note, e.g. when comparing Unicode strings, that some accented characters can be encoded either as a single integrated character or, equally validly, as an unaccented character followed by a combining accent.
  
 
For example, "Ô" can be sent from another IRC client either as $chr(212) or decomposed into a capital O $chr(79) followed by a combining circumflex $chr(770).
 
 
Normalization is a means of ensuring that all such characters are encoded consistently, either as the single integrated character or using modifiers, so that strings which might contain a mixture of these techniques can be compared.
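Since mIRC scripting has no normalization function, the idea is sketched here in Python using its standard unicodedata module. The helper name equal_after_normalization is hypothetical, chosen for this illustration only.

```python
import unicodedata

composed = chr(212)               # "Ô" as a single integrated character
decomposed = chr(79) + chr(770)   # "O" followed by a combining circumflex

# Without normalization the two strings differ, although they look the same.
raw_equal = composed == decomposed

# NFC composes and NFD decomposes; either form makes the strings comparable.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed


def equal_after_normalization(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(raw_equal, nfc_equal, nfd_equal, equal_after_normalization(composed, decomposed))
```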
 
  
mIRC does not support normalization of Unicode strings, either explicitly or implicitly when comparing strings.
  
Experimentation suggests that mIRC does not normally recognize combining characters and will not display the combining character at all, which can lead to confusion in communication. So an "Ô" sent in decomposed form, as a capital O $chr(79) followed by a combining circumflex $chr(770), will be displayed as "O".
  
To complicate things still further, some Unicode characters look the same as, or very similar to, completely different characters. Some of these are always considered unequal in strict Unicode, while others can be converted during normalization. mIRC treats such characters as different under all circumstances.
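The distinction can be seen with Python's unicodedata module, used here only as an illustration. The three sample characters are assumptions picked for this sketch and do not come from the page itself.

```python
import unicodedata

kelvin_sign = "\u212a"    # KELVIN SIGN, looks like the Latin letter K
fi_ligature = "\ufb01"    # LATIN SMALL LIGATURE FI
cyrillic_a = "\u0430"     # CYRILLIC SMALL LETTER A, looks like Latin "a"

# Canonical normalization (NFC) converts the Kelvin sign to a plain "K".
kelvin_nfc = unicodedata.normalize("NFC", kelvin_sign)

# The ligature is kept by NFC and only changes under compatibility
# normalization (NFKC).
fi_nfc = unicodedata.normalize("NFC", fi_ligature)
fi_nfkc = unicodedata.normalize("NFKC", fi_ligature)

# The Cyrillic letter never becomes Latin "a", whatever the normalization form.
cyrillic_nfkc = unicodedata.normalize("NFKC", cyrillic_a)

print(kelvin_nfc == "K", fi_nfc == fi_ligature, fi_nfkc == "fi", cyrillic_nfkc == "a")
```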
  
 
=== Case insensitive comparisons ===
 
