Chapter 36. Non-ASCII Character Set Support

If long filename support is enabled (by setting CYGCFG_FS_FAT_LONG_FILE_NAMES) then all strings passed to and from the filesystem may be encoded using UTF-8. This allows files to be named using characters beyond the basic ASCII set.

If long filename support is disabled, file names are limited to the standard 8.3 format. Names that contain non-ASCII characters are nevertheless stored in an 8-bit clean form, so any multi-byte encoding passes through unchanged.

Filesystems created by devices that do not support long filenames may contain 8.3 names encoded in character sets that are neither ASCII nor Unicode, typically one of the Microsoft code pages. When long filename support is enabled, these names must be translated into Unicode so that they can pass through the rest of the filesystem and compare correctly during file searches. Since the filesystem has no built-in internationalization support beyond Unicode, the application or middleware layers are responsible for supplying the translation to and from Unicode. The FILEIO package defines callbacks that may be used to do this:

typedef int cyg_fs_mbcs_to_utf16le( CYG_ADDRWORD data,
                                    const cyg_uint8 *mbcs,
                                    int size,
                                    cyg_uint16 *utf16le);

typedef int cyg_fs_utf16le_to_mbcs( CYG_ADDRWORD data,
                                    const cyg_uint16 *utf16le,
                                    int size,
                                    cyg_uint8 *mbcs);

struct cyg_fs_mbcs_translate
{
    cyg_fs_mbcs_to_utf16le      *mbcs_to_utf16le;
    cyg_fs_utf16le_to_mbcs      *utf16le_to_mbcs;
    CYG_ADDRWORD                data;
};

These callback functions may be registered after a filesystem has been mounted by using cyg_fs_setinfo() as follows:

struct cyg_fs_mbcs_translate translate;

…

translate.mbcs_to_utf16le = my_mbcs_to_utf16le;
translate.utf16le_to_mbcs = my_utf16le_to_mbcs;
translate.data = (CYG_ADDRWORD)my_data;
err = cyg_fs_setinfo("/disk0", FS_INFO_MBCS_TRANSLATE, &translate, sizeof(translate));

Following this, whenever the filesystem encounters a short file name that contains non-ASCII characters, it calls the registered mbcs_to_utf16le() function to translate it. In the call, the data argument is a copy of the data field of the cyg_fs_mbcs_translate structure. The mbcs argument points to the sequence of size bytes to be translated. The function should store the translation in utf16le and return the number of 16-bit values stored.
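As an illustration of the mbcs_to_utf16le() contract, the following is a minimal sketch that assumes the short names are encoded in ISO-8859-1, where each byte value equals its Unicode code point. Names written under a Microsoft code page would normally need a lookup table for bytes 0x80 to 0xFF instead. The typedefs are stand-ins so that the sketch compiles outside an eCos build, and the sketch assumes the caller provides room for at least size 16-bit values in utf16le, which the text above does not state.

```c
#include <stdint.h>

/* Stand-ins for the eCos types so this sketch compiles on its own;
   in an eCos build these come from the eCos headers instead. */
typedef uint8_t   cyg_uint8;
typedef uint16_t  cyg_uint16;
typedef uintptr_t CYG_ADDRWORD;

/* Hypothetical callback: treats every byte of the short name as an
   ISO-8859-1 character, so the code point is simply the byte value.
   Assumes utf16le has room for at least 'size' 16-bit values. */
int my_mbcs_to_utf16le(CYG_ADDRWORD data, const cyg_uint8 *mbcs,
                       int size, cyg_uint16 *utf16le)
{
    int i;

    (void)data;   /* no per-mount state is needed for this mapping */

    for (i = 0; i < size; i++) {
        /* Write each unit low byte first, so the in-memory layout is
           little-endian whatever the byte order of the host CPU. */
        unsigned char *out = (unsigned char *)&utf16le[i];
        out[0] = mbcs[i];
        out[1] = 0;
    }

    return size;  /* number of 16-bit values stored */
}
```

A function of this shape is what would be assigned to the mbcs_to_utf16le field before calling cyg_fs_setinfo().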

When the filesystem needs to encode a string into the multibyte character set, it calls the utf16le_to_mbcs() function. In the call, the data argument is a copy of the data field of the cyg_fs_mbcs_translate structure. The utf16le argument points to the sequence of size 16-bit values to be translated. The function should store the translation in mbcs and return the number of bytes stored.

It is important to note that translation is to and from UTF-16LE. All 16-bit values are stored in little-endian byte order, and Unicode code points outside the Basic Multilingual Plane are encoded as surrogate pairs. This is the format mandated by Microsoft for long file names in the FAT filesystem. See IETF RFC 2781 for details of the encoding.
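The encoding rules can be sketched as follows. The helper utf16le_encode() is hypothetical and is not part of the FILEIO package; it only illustrates how a translation function might turn one Unicode code point into UTF-16LE bytes, using the surrogate pair construction described in RFC 2781.

```c
#include <stdint.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-16LE.
   Writes 2 or 4 bytes to 'out' and returns the number of 16-bit
   units written (1 or 2), or 0 if the code point cannot be encoded. */
int utf16le_encode(uint32_t cp, unsigned char *out)
{
    uint16_t hi, lo;

    /* Surrogate values and anything above U+10FFFF are not valid. */
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;

    if (cp < 0x10000) {
        /* Basic Multilingual Plane: one unit, low byte first. */
        out[0] = (unsigned char)(cp & 0xFF);
        out[1] = (unsigned char)(cp >> 8);
        return 1;
    }

    /* Supplementary planes: split the 20-bit offset into a high
       surrogate (top 10 bits) and a low surrogate (bottom 10 bits). */
    cp -= 0x10000;
    hi = (uint16_t)(0xD800 | (cp >> 10));
    lo = (uint16_t)(0xDC00 | (cp & 0x3FF));

    out[0] = (unsigned char)(hi & 0xFF);
    out[1] = (unsigned char)(hi >> 8);
    out[2] = (unsigned char)(lo & 0xFF);
    out[3] = (unsigned char)(lo >> 8);
    return 2;
}
```

For example, U+00E9 becomes the two bytes E9 00, while U+1F600 becomes the surrogate pair D83D DE00, stored as the four bytes 3D D8 00 DE.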

In the current implementation, utf16le_to_mbcs() is never called. If long filename support is disabled, the filesystem stores multibyte characters exactly as they are supplied. If long filename support is enabled, new files are created with long names whenever any non-ASCII characters are present, and renamed files are converted to the long name form automatically. The utf16le_to_mbcs() callback exists in case future enhancements require it. For now, applications should install a function that simply returns zero.
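A placeholder of the kind described above might look like the following sketch. The name my_utf16le_to_mbcs matches the registration example earlier in this chapter, and the typedefs are stand-ins so that the fragment compiles outside an eCos build.

```c
#include <stdint.h>

/* Stand-ins for the eCos types so this sketch compiles on its own;
   in an eCos build these come from the eCos headers instead. */
typedef uint8_t   cyg_uint8;
typedef uint16_t  cyg_uint16;
typedef uintptr_t CYG_ADDRWORD;

/* Placeholder callback: performs no translation and reports that
   zero bytes were stored in mbcs. */
int my_utf16le_to_mbcs(CYG_ADDRWORD data, const cyg_uint16 *utf16le,
                       int size, cyg_uint8 *mbcs)
{
    (void)data;
    (void)utf16le;
    (void)size;
    (void)mbcs;
    return 0;
}
```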