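function validateEncoding($string, $string_encoding)
{
    // Round-trip through a pivot encoding; a valid string comes back unchanged.
    // UTF-8 pivots through UTF-32, UTF-32 through UTF-8, anything else through itself.
    $pivot = $string_encoding == 'UTF-8' ? 'UTF-32' : ($string_encoding == 'UTF-32' ? 'UTF-8' : $string_encoding);
    return $string === mb_convert_encoding(mb_convert_encoding($string, $pivot, $string_encoding), $string_encoding, $pivot);
}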
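The round-trip check above can also be compared against the built-in mb_check_encoding(), which reports whether a byte sequence is valid in a given encoding. A minimal sketch, assuming the mbstring extension is loaded; the two sample strings are illustrative and not taken from the original note:

```php
<?php
// Hypothetical sample inputs: one valid UTF-8 string and one that is not.
$valid   = "Gr\xC3\xBC\xC3\x9Fe"; // "Grüße" encoded as UTF-8
$invalid = "Gr\xFC\xDFe";         // the same text in ISO-8859-1, not valid UTF-8

// mb_check_encoding() returns true only if the bytes are valid in the named encoding.
var_dump(mb_check_encoding($valid, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8')); // bool(false)
```

The built-in is usually the simpler choice when only a yes/no answer is needed.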
$encoding = 'UTF-32';
$known_encodings = mb_list_encodings();
// Look up aliases only for encodings that mbstring reports as supported.
if (in_array($encoding, $known_encodings))
    $aliases = mb_encoding_aliases($encoding);
else
    $aliases = "Unknown ($encoding) encoding.\n";
echo "<pre>"; print_r($aliases); echo "</pre>";
$multiByte = array();
$multiByte['mb_get_info'] = mb_get_info();
$multiByte['mb_detect_order'] = mb_detect_order();
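// mb_encoding_aliases() requires an encoding name; using the internal encoding here is an assumption.
$multiByte['mb_encoding_aliases'] = mb_encoding_aliases(mb_internal_encoding());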
Summaries of supported encodings
Encoding | Character Set | Notes |
UCS-4 / UCS4 / ISO-10646-UCS-4 | ISO 10646 | Universal Character Set with 31-bit code space, standardized as UCS-4 by ISO/IEC 10646. |
ISO-10646-UCS-4BE | UCS-4 | Big-endian form. (The meaning of the BE suffix is inferred from the naming convention used on the rest of the page.) |
ISO-10646-UCS-4LE | UCS-4 | Little-endian form. (The meaning of the LE suffix is inferred from the naming convention used on the rest of the page.) |
ISO-10646-UCS-2 | UCS-2 | Universal Character Set with 16-bit code space, standardized as UCS-2 by ISO/IEC 10646. |
ISO-10646-UCS-2BE | UCS-2 | Big-endian form. (The meaning of the BE suffix is inferred from the naming convention used on the rest of the page.) |
ISO-10646-UCS-2LE | UCS-2 | Little-endian form. (The meaning of the LE suffix is inferred from the naming convention used on the rest of the page.) |
UTF-32 | Unicode | Unicode Transformation Format with 32-bit code units, covering the Unicode code space. This encoding scheme is not identical to UCS-4 because the Unicode code space is limited to 21 bits. |
UTF-32BE | Unicode | String assumed to be in big endian form. |
UTF-32LE | Unicode | String assumed to be in little endian form. |
UTF-16 | Unicode | Unicode Transformation Format with 16-bit code units. Note that UTF-16 is no longer the same specification as UCS-2: the surrogate mechanism was introduced in Unicode 2.0, and UTF-16 now covers a 21-bit code space. |
UTF-16BE | Unicode | String assumed to be in big endian form. |
UTF-16LE | Unicode | String assumed to be in little endian form. |
UTF-8 | Unicode / UCS | Unicode Transformation Format of 8-bit unit width. |
UTF-7 | Unicode | A mail-safe transformation format of Unicode, specified in RFC 2152. |
US-ASCII | ASCII / ISO 646 | Common 7-bit encoding. AKA iso-ir-6 / ANSI_X3.4-1986 / ISO_646.irv:1991 / ASCII / ISO646-US / us / IBM367 / CP367 / csASCII |
7bit | ? | Binary? |
8bit | ? | Binary? |
BASE64 | ? | eMail / Web |
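Windows-1251 / CP1251 | Cyrillic | Windows code page for Cyrillic scripts. |
Windows-1252 / CP1252 | Latin (Western European) | Windows code page for Western European languages. |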
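The byte-order and surrogate notes in the table above can be checked by dumping the bytes that mb_convert_encoding() produces. This is a small sketch, assuming PHP 7 or later (for the "\u{...}" escapes) and a loaded mbstring extension; the two sample characters are chosen only for illustration:

```php
<?php
// U+00E9 lies in the Basic Multilingual Plane; U+1F600 lies outside it.
$bmp    = "\u{00E9}";
$astral = "\u{1F600}";

// bin2hex() shows the raw bytes produced for each target encoding.
echo bin2hex(mb_convert_encoding($bmp, 'UTF-16BE', 'UTF-8')), "\n";    // 00e9
echo bin2hex(mb_convert_encoding($bmp, 'UTF-16LE', 'UTF-8')), "\n";    // e900
echo bin2hex(mb_convert_encoding($astral, 'UTF-16BE', 'UTF-8')), "\n"; // d83dde00 (surrogate pair)
echo bin2hex(mb_convert_encoding($astral, 'UTF-32BE', 'UTF-8')), "\n"; // 0001f600
echo bin2hex(mb_convert_encoding($astral, 'UTF-32LE', 'UTF-8')), "\n"; // 00f60100
```

The surrogate pair in the third line is the reason UTF-16 and UCS-2 are no longer interchangeable.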
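// Transliterates UTF-8 encoded German umlauts byte by byte:
// chr(195) is the lead byte 0xC3 (dropped), the other keys are the trailing bytes.
function cv_input($str)
{
    $tr = array(
        chr(132) => 'Ae',
        chr(150) => 'Oe',
        chr(156) => 'Ue',
        chr(159) => 'sz', // ß
        chr(164) => 'ae',
        chr(182) => 'oe',
        chr(188) => 'ue',
        chr(195) => ''
    );
    return strtr($str, $tr);
}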
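function cp1251_to_utf8($s)
{
    // Compare case-insensitively: mb_detect_encoding() may report the name as "Windows-1251".
    if (strcasecmp((string) mb_detect_encoding($s, 'UTF-8,CP1251'), 'Windows-1251') == 0) {
        $t = ''; // output buffer
        $length = strlen($s);
        $c209 = chr(209); $c208 = chr(208); $c129 = chr(129);
        for ($i = 0; $i < $length; $i++) {
            $c = ord($s[$i]);
            if ($c >= 192 and $c <= 239) $t .= $c208.chr($c - 48);  // А..п -> D0 90..BF
            elseif ($c > 239) $t .= $c209.chr($c - 112);            // р..я -> D1 80..8F
            elseif ($c == 184) $t .= $c209.chr(145);                // ё (U+0451) -> D1 91
            elseif ($c == 168) $t .= $c208.$c129;                   // Ё (U+0401) -> D0 81
            else $t .= $s[$i];                                      // all other bytes are copied unchanged
        }
        return $t;
    } else {
        return $s;
    }
}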
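function utf8_to_cp1251($s)
{
    if (mb_detect_encoding($s, 'UTF-8,CP1251') == 'UTF-8') {
        $out = '';
        $byte2 = false; // true while the second byte of a two-byte sequence is expected
        $c1 = 0;        // lead byte of the current two-byte sequence
        $length = strlen($s);
        for ($c = 0; $c < $length; $c++) {
            $i = ord($s[$c]);
            if ($i <= 127) $out .= $s[$c]; // ASCII passes through
            if ($byte2) {
                // Rebuild the code point from the lead byte $c1 and the continuation byte $i.
                $new_c2 = ($c1 & 3) * 64 + ($i & 63);
                $new_c1 = ($c1 >> 2) & 7;
                $new_i = $new_c1 * 256 + $new_c2;
                if ($new_i == 1025) $out_i = 168;     // Ё (U+0401)
                elseif ($new_i == 1105) $out_i = 184; // ё (U+0451)
                else $out_i = $new_i - 848;           // U+0410..U+044F -> 192..255
                $out .= chr($out_i);
                $byte2 = false;
            }
            if (($i >> 5) == 6) { $c1 = $i; $byte2 = true; } // 110xxxxx starts a two-byte sequence
        }
        return $out;
    } else {
        return $s;
    }
}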
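For comparison, the same conversions can be done with mb_convert_encoding() directly, since Windows-1251 appears in the list of supported encodings. A minimal sketch, assuming PHP 7 or later and a loaded mbstring extension; the sample word is illustrative only:

```php
<?php
// "Привет" as UTF-8, written with "\u{...}" escapes so the file itself stays ASCII-only.
$utf8 = "\u{041F}\u{0440}\u{0438}\u{0432}\u{0435}\u{0442}";

// UTF-8 -> Windows-1251: one byte per Cyrillic letter.
$cp1251 = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');
echo bin2hex($cp1251), "\n"; // cff0e8e2e5f2

// Windows-1251 -> UTF-8: the round trip restores the original string.
$back = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');
var_dump($back === $utf8); // bool(true)
```

Passing the source encoding explicitly avoids relying on mb_detect_encoding(), which can only guess between UTF-8 and a single-byte code page.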