HDF5 on Windows: UTF-8 filename support
In a previous post, I detailed my efforts to get HDF5 compiled on 64-bit Windows using the MinGW-w64 compiler, which is (still) unsupported by the HDF5 developers. This time I am reporting on yet another problem I encountered with HDF5 on Windows, one that is being blatantly ignored by the developers: UTF-8 filename support.
This was brought to my attention by a Japanese user of my software package XMI-MSIM, who ran into trouble running the program when logged in with a username consisting of Japanese characters. The problem could be traced back to an HDF5 file that has to be opened, which is located in a subdirectory of the user’s home directory. After investigating the HDF5 code, it became quite clear that this issue is caused by the internal use of the _open function, which is known not to support the Unicode UTF-8 character set needed to represent the Japanese characters. The HDF5 website confirmed this issue with the following statement:
So basically I was stuck here, since there is no general way to convert a UTF-8 filename to the narrow-character encoding (the active Windows code page) that _open expects. Converting the filename to its short form (8-character filename with 3-character extension) using GetShortPathName seemed like a very ugly hack, so I decided not to pursue it. Instead I opted to hack the HDF5 code myself and replace all instances of _open with its wide-character counterpart, _wopen, which could be fed the UTF-8 filenames after converting them with MultiByteToWideChar.
Fortunately, it turned out that I was not the first to run into this situation, and I found a better solution on the HDF5 forums. Essentially, it consists of redefining the HDopen macro from _open to a function of our own that converts the UTF-8 filename to the corresponding wide-character representation and feeds it to _wopen. Since we are compiling with MinGW-w64, some other modifications were necessary (the original solution relied on CMake): I modified src/Makefile.am so that H5FDwindows.c and H5FDwindows.h are compiled as well, with the latter added to the public headers. After running autoreconf -i -f (I had to lower the required autoconf version in configure.ac for this to work, but this did not cause any problems) and running the configure script, I modified the src/H5pubconf.h file according to my earlier post, but had to add one more line:
// fixes first problem
#ifndef H5_HAVE_WIN32_API
  #ifdef WIN32 /* defined for all Windows systems */
    #define H5_HAVE_WIN32_API 1
  #endif
#endif
#ifndef H5_HAVE_MINGW
  #ifdef __MINGW32__ /* defined for all MinGW compilers */
    #define H5_HAVE_MINGW 1
    // additional line necessary for UTF-8 build to succeed
    #define H5_HAVE_WINDOWS 1
  #endif
#endif
// fixes second problem
#define H5_BUILT_AS_DYNAMIC_LIB 1
Keep in mind that this file is regenerated every time configure runs, so these modifications have to be reapplied afterwards!
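To make the HDopen redirection described earlier more concrete, here is a minimal sketch of what such a replacement function could look like. It is not the code from the forum post: the name utf8_open is made up for illustration, and the three-argument form of HDopen (path, flags, mode) is an assumption about how the macro is used inside HDF5. The non-Windows branch is there only so the sketch compiles everywhere.

```c
#include <fcntl.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#include <io.h>
#endif

/* Hypothetical wrapper (not an HDF5 function): open a file whose name is
   encoded in UTF-8. */
int utf8_open(const char *utf8_path, int flags, int mode)
{
#ifdef _WIN32
    /* First call: ask how many wide characters are needed. Passing -1 as
       the input length makes the count include the terminating NUL. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8_path, -1, NULL, 0);
    if (n <= 0)
        return -1; /* invalid UTF-8 or other conversion failure */

    wchar_t *wide_path = malloc((size_t)n * sizeof *wide_path);
    if (wide_path == NULL)
        return -1;

    int fd = -1;
    /* Second call: perform the actual UTF-8 to UTF-16 conversion. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8_path, -1, wide_path, n) > 0)
        fd = _wopen(wide_path, flags, mode);

    free(wide_path);
    return fd;
#else
    /* Outside Windows the bytes of the name are passed through unchanged. */
    return open(utf8_path, flags, (mode_t)mode);
#endif
}

/* Assumption: HDopen takes (path, flags, mode), mirroring _open. */
#undef HDopen
#define HDopen(S, F, M) utf8_open((S), (F), (M))
```

With a definition along these lines in place, every HDopen call in the library goes through the conversion, so callers can keep passing UTF-8 strings. Other file functions that take a filename would need the same treatment, which this sketch does not cover.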
This solution is not pretty, and it will have to be repeated every time you compile a new version of HDF5. Ideally, the HDF5 developers would implement a solution based on this hack, which would bring it in line with most other open-source projects that assume UTF-8 filenames on all platforms. I doubt this will happen anytime soon though, since the original solution posted on the HDF5 forums has not been commented on by any of the developers…