PHP Content Encoding

Multibyte String

Multibyte String Functions
mb_check_encoding

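function validateEncoding($string, $string_encoding)
{
	// Round-trip the string through a second encoding; invalid byte sequences do not survive unchanged.
	// UTF-8 goes via UTF-32, UTF-32 goes via UTF-8, and any other encoding is re-encoded to itself.
	$via = $string_encoding == 'UTF-8' ? 'UTF-32' : ($string_encoding == 'UTF-32' ? 'UTF-8' : $string_encoding);
	return $string === mb_convert_encoding(mb_convert_encoding($string, $via, $string_encoding), $string_encoding, $via);
}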
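
For comparison, a minimal sketch of the same check using the built-in mb_check_encoding() named in the heading above. It assumes the mbstring extension is loaded and PHP 7 or later (for the "\u{...}" escape); the two sample strings are made up for illustration.

```php
<?php
// Sample data (illustrative only): the word "Grüße" in two encodings.
$valid   = "Gr\u{00FC}\u{00DF}e"; // UTF-8 bytes
$invalid = "Gr\xFC\xDFe";         // Windows-1252 bytes, not valid UTF-8

var_dump(mb_check_encoding($valid, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8')); // bool(false)
```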

mb_convert_case
mb_convert_encoding
mb_convert_variables
mb_decode_numericentity
mb_detect_encoding
mb_detect_order
mb_encoding_aliases

$encoding        = 'UTF-32';
$known_encodings = mb_list_encodings();

if(in_array($encoding, $known_encodings))
{
    $aliases = mb_encoding_aliases($encoding);
}else
{
    $aliases = "Unknown ($encoding) encoding.\n";
}

echo "<pre>";
print_r($aliases);
echo "</pre>";

mb_get_info
mb_internal_encoding
mb_list_encodings
mb_output_handler
mb_preferred_mime_name

$multiByte = array();
$multiByte['mb_get_info'] = mb_get_info();
$multiByte['mb_detect_order'] = mb_detect_order();
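// mb_encoding_aliases() requires an encoding name; the current internal encoding is used here.
$multiByte['mb_encoding_aliases'] = mb_encoding_aliases(mb_internal_encoding());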
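
mb_preferred_mime_name() from the list above has no example yet, so here is a small sketch. It assumes the mbstring extension is loaded; the three encoding names are arbitrary picks for illustration.

```php
<?php
// Print the preferred MIME charset name for a few encodings (names chosen for illustration).
foreach (array('UTF-8', 'ISO-8859-1', 'Windows-1252') as $name) {
    $mime = mb_preferred_mime_name($name);
    // The function returns false when no preferred MIME name is registered.
    echo $name, ' => ', ($mime === false ? '(none)' : $mime), "\n";
}
```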

Supported Encodings

Summaries of supported encodings

Encoding	Character Set	Notes
UCS-4 / UCS4 / ISO-10646-UCS-4	ISO 10646	Universal Character Set with 31-bit code space, standardized as UCS-4 by ISO/IEC 10646.
ISO-10646-UCS-4BE	UCS-4	Big-endian form. The BE suffix is an assumption here, following the naming convention used for the other encodings on this page.
ISO-10646-UCS-4LE	UCS-4	Little-endian form. The LE suffix is an assumption here, following the naming convention used for the other encodings on this page.
ISO-10646-UCS-2	UCS-2	Universal Character Set with 16-bit code space, standardized as UCS-2 by ISO/IEC 10646.
ISO-10646-UCS-2BE	UCS-2	Big-endian form. The BE suffix is an assumption here, following the naming convention used for the other encodings on this page.
ISO-10646-UCS-2LE	UCS-2	Little-endian form. The LE suffix is an assumption here, following the naming convention used for the other encodings on this page.
UTF-32	Unicode	Unicode Transformation Format with 32-bit code units, whose code space follows the Unicode standard. It is not identical to UCS-4, because the Unicode code space is limited to 21 bits.
UTF-32BE	Unicode	String assumed to be in big endian form.
UTF-32LE	Unicode	String assumed to be in little endian form.
UTF-16	Unicode	Unicode Transformation Format with 16-bit code units. UTF-16 is no longer the same specification as UCS-2: the surrogate mechanism was introduced in Unicode 2.0, so UTF-16 now covers a 21-bit code space.
UTF-16BE	Unicode	String assumed to be in big endian form.
UTF-16LE	Unicode	String assumed to be in little endian form.
UTF-8	Unicode / UCS	Unicode Transformation Format of 8-bit unit width.
UTF-7	Unicode	A mail-safe transformation format of Unicode, specified in RFC 2152.
US-ASCII	ASCII / ISO 646	Common 7-bit encoding. AKA iso-ir-6 / ANSI_X3.4-1986 / ISO_646.irv:1991 / ASCII / ISO646-US / us / IBM367 / CP367 / csASCII
7bit	?	Binary?
8bit	?	Binary?
BASE64	?	eMail / Web
HTML-ENTITIES	?	Web
Windows-1251 / CP1251	Cyrillic	Windows
Windows-1252 / CP1252	Latin (Western European)	Windows
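
To make the unit widths in the table concrete, the following sketch converts one character into several of the listed encodings and prints the resulting bytes. It assumes the mbstring extension and PHP 7 or later; encodedHex() is a hypothetical helper introduced only for this example.

```php
<?php
// Hypothetical helper: hex dump of a UTF-8 string after conversion to $target.
function encodedHex($utf8, $target)
{
    return bin2hex(mb_convert_encoding($utf8, $target, 'UTF-8'));
}

$char = "\u{00E4}"; // "ä" as UTF-8

foreach (array('UTF-8', 'UTF-16BE', 'UTF-32BE', 'Windows-1252') as $target) {
    $hex = encodedHex($char, $target);
    printf("%-12s %d byte(s): %s\n", $target, strlen($hex) / 2, $hex);
}
// UTF-8        2 byte(s): c3a4
// UTF-16BE     2 byte(s): 00e4
// UTF-32BE     4 byte(s): 000000e4
// Windows-1252 1 byte(s): e4
```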

German UTF-8 to windows-1252

http://us.php.net/manual/en/ref.iconv.php#63221

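function cv_input($str)
{
	// Transliterates German umlauts and ß from their two-byte UTF-8 sequences to plain ASCII.
	// Matching the full sequence (lead byte 0xC3 plus continuation byte) leaves other characters untouched.
	$tr = array(
		"\xC3\x84" => 'Ae', // Ä
		"\xC3\x96" => 'Oe', // Ö
		"\xC3\x9C" => 'Ue', // Ü
		"\xC3\x9F" => 'sz', // ß
		"\xC3\xA4" => 'ae', // ä
		"\xC3\xB6" => 'oe', // ö
		"\xC3\xBC" => 'ue'  // ü
	);

	return strtr($str, $tr);
}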
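
The function above produces an ASCII transliteration rather than Windows-1252 bytes. If an actual conversion to Windows-1252 is wanted, one option is mb_convert_encoding(), sketched below. It assumes the mbstring extension and PHP 7 or later; utf8_to_windows1252() is a hypothetical wrapper name, not part of the linked manual comment.

```php
<?php
// Hypothetical wrapper: convert a UTF-8 string to Windows-1252 bytes.
function utf8_to_windows1252($str)
{
    return mb_convert_encoding($str, 'Windows-1252', 'UTF-8');
}

$utf8 = "Gr\u{00FC}\u{00DF}e"; // "Grüße" as UTF-8
echo bin2hex(utf8_to_windows1252($utf8)), "\n"; // 4772fcdf65
```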

CP1251/Windows-1251 to UTF-8


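function cp1251_to_utf8($s)
{
	// mb_detect_encoding() reports the canonical name, so compare without regard to case.
	if (strcasecmp((string) mb_detect_encoding($s, 'UTF-8,CP1251'), 'Windows-1251') === 0)
	{
		$t = '';
		$length = strlen($s);
		$c209 = chr(209); $c208 = chr(208); $c129 = chr(129);
		for($i = 0; $i < $length; $i++)
		{
			$c = ord($s[$i]);
			if($c >= 192 and $c <= 239)      // А..п -> 0xD0 0x90..0xBF
				$t .= $c208.chr($c - 48);
			elseif($c > 239)                 // р..я -> 0xD1 0x80..0x8F
				$t .= $c209.chr($c - 112);
			elseif($c == 184)                // ё (U+0451) -> 0xD1 0x91
				$t .= $c209.chr(145);
			elseif($c == 168)                // Ё (U+0401) -> 0xD0 0x81
				$t .= $c208.$c129;
			else                             // ASCII and all other bytes are copied unchanged
				$t .= $s[$i];
		}
		return $t;
	}else
	{
		return $s;
	}
}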

function utf8_to_cp1251($s)
{
	if ((mb_detect_encoding($s, 'UTF-8,CP1251')) == 'UTF-8')
	{
		$out = '';
		$c1 = 0;
		$byte2 = false;
		$length = strlen($s);
		for($c = 0; $c < $length; $c++)
		{
			$i = ord($s[$c]);
			if($i <= 127)
				$out .= $s[$c];
			if($byte2)
			{
				// Rebuild the code point from the lead byte ($c1) and this continuation byte.
				$new_c2 = ($c1&3) * 64+($i&63);
				$new_c1 = ($c1>>2) & 7;
				$new_i = $new_c1 * 256 + $new_c2;
				if($new_i == 1025)       // Ё (U+0401)
				{
					$out_i = 168;
				}else
				{
					if($new_i == 1105)   // ё (U+0451)
						$out_i = 184;
					else                 // А..я: U+0410 maps to byte 192
						$out_i = $new_i - 848;
				}
				$out .= chr($out_i);
				$byte2 = false;
			}
			// A byte of the form 110xxxxx starts a two-byte UTF-8 sequence.
			if(($i>>5) == 6)
			{
				$c1 = $i;
				$byte2 = true;
			}
		}
		return $out;
	}else
	{
		return $s;
	}
}
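
As a cross-check for the hand-written functions above, the same conversions can be sketched with mb_convert_encoding(), which also covers the Windows-1251 characters outside the basic Cyrillic range. This assumes the mbstring extension and PHP 7 or later; the sample word is made up for illustration.

```php
<?php
// "Привет" as Windows-1251 bytes (sample data for illustration).
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Windows-1251 -> UTF-8
$utf8 = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');

// UTF-8 -> Windows-1251
$back = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');

var_dump($utf8 === "\u{041F}\u{0440}\u{0438}\u{0432}\u{0435}\u{0442}"); // bool(true)
var_dump($back === $cp1251);                                            // bool(true)
```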
