Entries tagged Unicode

Conversion between UTF-16 and UTF-8 in C++

Posted on Jun 5, 2014

#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

// Windows stores Unicode text as UTF-16 in the 2-byte wchar_t type.
// Note: std::wstring_convert and <codecvt> are deprecated since C++17.
int main()
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> utfconv;

    // string conversion
    std::wstring wide = L"Hello, World! 안녕하세요?";  // wide string, UTF-16 encoded
    std::string narrow = utfconv.to_bytes(wide);     // UTF-16 to UTF-8
    wide = utfconv.from_bytes(narrow);               // UTF-8 back to UTF-16

    // conversion during file I/O
    // The locale takes ownership of the facet, so the bare new does not leak.
    std::wofstream fout;                             // wide output stream
    fout.imbue(std::locale(fout.getloc(), new std::codecvt_utf8_utf16<wchar_t>)); // imbue before opening
    fout.open("test.txt", std::ios::out);
    fout << wide << std::endl;                       // written to the file as UTF-8
    fout << utfconv.from_bytes(narrow) << std::endl; // same output as the line above
    fout.close();

    std::wifstream fin;
    fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
    fin.open("test.txt", std::ios::in);
    std::wstring hello, world, anyoung, tline;
    fin >> hello >> world >> anyoung;      // UTF-8 file contents are read into UTF-16 strings
    std::getline(fin, tline);              // consume the rest of the first line
    std::getline(fin, tline);              // read the second line
    fin.close();

    return 0;
}
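Because std::wstring_convert is deprecated since C++17, it can help to see what the conversion actually does. The following is only a minimal sketch of a hand-written UTF-16 to UTF-8 encoder; the function name utf16_to_utf8 is hypothetical (it is not a standard library function), and the choice to replace unpaired surrogates with U+FFFD is an assumption of this sketch, not something the code above does.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of the standard library.
// Converts a UTF-16 string to UTF-8. Unpaired surrogates are
// replaced with U+FFFD (an assumption made for this sketch).
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine it with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high surrogate
        }

        // Encode the code point as 1 to 4 UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The sketch takes std::u16string rather than std::wstring so that it behaves the same on platforms where wchar_t is wider than 16 bits. On Windows, a std::wstring can be copied into a std::u16string element by element before calling it.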

Printing Unicode strings in Windows console

Posted on May 12, 2014

Win32 console applications do not display Unicode strings properly by default. A simple fix is to call _setmode() to switch stdout to UTF-16 text mode. _setmode() and the _O_U16TEXT flag are declared in the following two headers.

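#include <io.h>      // _setmode
#include <fcntl.h>   // _O_U16TEXT
#include <cstdio>    // _fileno, wprintf
#include <iostream>  // std::wcout
#include <tchar.h>   // _tmain, TCHAR

int _tmain(int argc, TCHAR* argv[], TCHAR* envp[])
{
	// Switch stdout to UTF-16 text mode; use only wide output afterwards.
	_setmode(_fileno(stdout), _O_U16TEXT);

	std::wcout << L"안녕하세요?" << std::endl;
	// or
	// wprintf(L"%ls\n", L"안녕하세요?");

	return 0;
}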

The Absolute Minimum Every Software Developer Must Know About Unicode and Character Sets

Posted on May 11, 2014

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

For more information, see http://www.joelonsoftware.com/articles/Unicode.html
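To make the distinction between a code point and its encoding concrete, here is a small sketch in C++. The helper names utf16_units and utf8_bytes and the two constants are hypothetical, introduced only for this illustration. It shows that the same code point, U+1F600, is one unit in UTF-32 but two 16-bit units (a surrogate pair) in UTF-16, which is why "Unicode is a 16-bit code" is a myth.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only.

// Number of 16-bit code units UTF-16 needs for one code point.
constexpr std::size_t utf16_units(char32_t cp)
{
    return cp < 0x10000 ? 1 : 2;
}

// Number of bytes UTF-8 needs for one code point.
constexpr std::size_t utf8_bytes(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// The same code point, U+1F600, stored in two different encodings.
const std::u32string smile_utf32 = U"\U0001F600";  // one 32-bit unit
const std::u16string smile_utf16 = u"\U0001F600";  // surrogate pair D83D DE00
```

Here smile_utf32.size() is 1 while smile_utf16.size() is 2, even though both strings hold a single character. For the Hangul syllable U+C548 (안), utf16_units gives 1 and utf8_bytes gives 3.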

MFC support for MBCS deprecated in Visual Studio 2013

Posted on Nov 18, 2013

It is time to move on to Unicode. The multi-byte character set (MBCS) in MFC will not be supported in future versions of Visual Studio. You can still compile MBCS projects with VS2013, but you need to install the MBCS libraries via a separate download, which is available here. You will also see a deprecation warning whenever an application is built using MBCS. This warning can be suppressed by adding the NO_WARN_MBCS_MFC_DEPRECATION preprocessor definition.
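One way to add the definition is in source, as sketched below. This assumes the project has a precompiled header named stdafx.h (the file name is an assumption; use whichever header pulls in MFC first). The definition must come before the first MFC header is included. Alternatively, it can be added to the project's preprocessor definitions in the compiler settings.

```cpp
// stdafx.h (assumed name of the project's precompiled header)

// Suppress the MBCS deprecation warning.
// This must appear before any MFC header is included.
#define NO_WARN_MBCS_MFC_DEPRECATION

// ... the project's existing MFC includes follow here ...
```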

For more information, see http://blogs.msdn.com/b/vcblog/archive/2013/07/08/mfc-support-for-mbcs-deprecated-in-visual-studio-2013.aspx.