[LINUX] [VBA] Function to judge whether the character string of UTF-8 file system is 255 or less and function to count UTF-8 character string in bytes

From Ansi to UTF-8

Anguish begins with the counting method at ANSI A function ansiLenB Excel that can be easily created before starting Access Match UTF-16LE surrogate pair About excavation surrogate pair Get UTF-16 UTF-8 SJis character code for all characters using Excel for Microsoft 365 VBA Win32 Api and Spill And now

[Problems with file name length and character code] Programmer president's blog (https://blogs.itmedia.co.jp/komata/2012/11/post-85cf.html)

First, there is a character code problem. On Linux and MacOS, the filename is typically UTF-8. Windows seems to be UNICODE in NTFS, but it seems to be Shift_JIS (CP932) in the application (because I'm not so interested ...). Therefore, when zipping, if you zip it as UTF-8, the characters will be garbled. It can't be helped, so if you forcibly create a file with the CP932 file name on Linux or MacOS and zip it, it will work. However, if you create a CP932 file name on MacOS, it will often be heavier for each OS, so Linux is better. ..

Well, if you think that you have cleared the problem of character code, this time the problem of file name length will come up. On Windows, it seems to be up to 256 characters including the path, so long file names will result in an error. There is no choice but to truncate long filenames, ASCII characters are easy to truncate, but multibyte characters are cumbersome to truncate. If you cut it at an appropriate place, the characters will be garbled. It can't be helped, so in C language etc., it feels like converting to wide characters, truncating, and then back to multibyte.

In the past, there were many CP932 for Windows and EUC for UNIX, and in many cases it was possible to assume that Japanese was 2 bytes, but with the spread of the Internet, characters from all over the world have come to coexist. As a result, multi-byte is no longer about 2 bytes, and it has become difficult to manage by myself. It's better to use a library for multi-byte processing obediently ... </ b>

President ...

(Omitted) Now you can finally get a file name that you can unzip on Windows. "Expected" is because the path length is up to 256 characters, so it may be exceeded depending on the directory to be expanded.

By the way, in UNIX-based OS such as Linux, the maximum file name length is 255 bytes. The maximum path length is 1023 bytes (a limit on passing system calls), which is considerably easier to handle than Windows. </ B>

So in the case of a Linux server The file name is The character code is UTF-8, 255 single-byte characters It turns out that. This half-width 255 characters means 85 full-width characters. This is because if one half-width character is 1 byte, it is 3 bytes. In addition, the surrogate pair is 4 bytes. It was revealed this Get UTF-16 UTF-8 SJis character code for all characters using Excel for Microsoft 365 VBA Win32 Api and Spill It will be a sheet created from SheetMake2 here. Within the range of judgment accuracy of this Win32 API, the character string is converted to UTF-8 binary. This is used to count in bytes. This time as well, we will use the Win32 API. The code below is written in addition to the previous.

#If VBA7 Then
Declare PtrSafe Function MessageBox Lib "user32" Alias "MessageBoxW" _
    (ByVal hwnd As LongPtr, ByVal lpText As LongPtr, _
     ByVal lpCaption As LongPtr, ByVal wType As Long) As Long
Declare PtrSafe Function GetFocus Lib "user32" () As LongPtr
Declare PtrSafe Function GetWindowsDirectory Lib "Kernel32" Alias "GetWindowsDirectoryA" (ByVal lpBuffer As String, ByVal nSize As Long) As Long
 Declare PtrSafe Function GetSystemDirectory Lib "Kernel32" Alias "GetSystemDirectoryA" (ByVal lpBuffer As String, ByVal nSize As Long) As Long
 Declare PtrSafe Function GetTempPath Lib "Kernel32" Alias "GetTempPathA" (ByVal nBufferLength As Long, ByVal lpBuffer As String) As Long
'Private Declare PtrSafe Function SetCurrentDirectory Lib "kernel32" (ByVal lpPathName As Long) As Long
' Set rrent Directory
Private Declare PtrSafe Function SetCurrentDirectory Lib "Kernel32" Alias _
                            "SetCurrentDirectoryA" (ByVal CurrentDir As String) As Long
'Private Declare PtrSafe Function SetCurrentDirectory Lib "kernel32.dll" Alias "SetCurrentDirectoryW" (ByVal lpPathName As Long) As Long
'Private Declare PtrSafe Function SetCurrentDirectory Lib "kernel32.dll" Alias "SetCurrentDirectoryA" (ByVal lpPathName As Long) As Long

#Else
'Added the following Declare statement to the standard module
Declare Function MessageBox Lib "user32" Alias "MessageBoxW" _
    (ByVal hwnd As Long, ByVal lpText As Long, _
     ByVal lpCaption As Long, ByVal wType As Long) As Long
Declare Function GetFocus Lib "user32" () As Long
Private Declare Function GetWindowsDirectory Lib "kernel32" _
 Alias "GetWindowsDirectoryA" (ByVal lpBuffer As String, _
 ByVal nSize As Long) As Long
#End If
Private Declare PtrSafe Function MultiByteToWideChar Lib "kernel32.dll" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As LongPtr, _
    ByVal cchMultiByte As Long, _
    ByVal lpWideCharStr As LongPtr, _
    ByVal cchWideChar As Long) As Long

Private Declare PtrSafe Function WideCharToMultiByte Lib "kernel32.dll" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpWideCharStr As LongPtr, _
    ByVal cchWideChar As Long, _
    ByVal lpMultiByteStr As LongPtr, _
    ByVal cchMultiByte As Long, _
    ByVal lpDefaultChar As LongPtr, _
    ByVal lpUsedDefaultChar As Long) As Long
Private Const CP_UTF8 = 65001

Private Const ERROR_INSUFFICIENT_BUFFER = 122&

'UFT-16LE Sarrogate Pair Code
Sub TestSarrogatePair()
' For Excel
'Assuming that the ActiveCell contains one surrogate pair character
'For testing, seeking its superior and inferior surrogate
'Display with MsgBoxEx
'For this reason, use the Windows API
' UTF-16 LE ,Bb for BE(0) bb(1) bb(2) bb(3)become
Dim b As String
Dim bb() As Byte
Dim Text As String
b = ActiveCell.Value
bb = b
Debug.Print "AscW Result(Dec -> Hex)" & vbTab & Hex(AscW(b))
Debug.Print "&H" & Hex(bb(1)) & Hex(bb(0))
Debug.Print "&H" & Hex(bb(3)) & Hex(bb(2))
Text = ChrW("&H" & Hex(bb(1)) & Hex(bb(0))) & ChrW("&H" & Hex(bb(3)) & Hex(bb(2)))
Debug.Print "ChrW(" & "&H" & Hex(bb(1)) & Hex(bb(0)) & ") & ChrW(&H" & Hex(bb(3)) & Hex(bb(2)) & ")"
MsgBoxEx Text
End Sub
'UTF-16 LE
Function UTF16Sarrogate(s As String)
Dim bb() As Byte
If IsUpperSarrogateCharacter(s) And IsLowerSarrogateCharcter(s) Then
bb = s
UTF16Sarrogate = "0x" & CStr(Hex(bb(1)) & Hex(bb(0))) & " " & "0x" & CStr(Hex(bb(3)) & Hex(bb(2)))
Exit Function
Else
UTF16Sarrogate = ""
End If
End Function
Function IsUpperSarrogateCharacter(s As String)
If Hex(AscW(s)) >= Hex(&HD800) And Hex(AscW(s)) <= Hex(&HDBFF) Then
IsUpperSarrogateCharacter = True: Exit Function
Else
IsUpperSarrogateCharacter = False: Exit Function
End If
End Function
Function IsLowerSarrogateCharcter(s As String)
Dim bb() As Byte
If IsUpperSarrogateCharacter(s) = True Then
bb = s
'Debug.Print Hex("&H" & bb(3) & bb(2))
If Hex("&H" & bb(3) & bb(2)) <= Hex(&HDC00) And Hex("&H" & bb(3) & bb(2)) <= Hex(&HDFFF) Then
IsLowerSarrogateCharcter = True: Exit Function
Else
IsLowerSarrogateCharcter = False: Exit Function
End If
End If
End Function
Function VBAAscW(s As String)
If AscW(s) < 0 Then
VBAAscW = AscW(s) + 65536: Exit Function
Else
VBAAscW = AscW(s): Exit Function
End If
End Function
' Shift-Jis
Function VBA_ASC(s As String)
If isSJIS(s) Then
VBA_ASC = Hex(Asc(s)): Exit Function
Else
VBA_ASC = "": Exit Function
End If
End Function
Function isSJIS(ByVal argStr As String) As Boolean
'https://excel-ubara.com/excelvba4/EXCEL_VBA_403.html
    Dim sQuestion As String
    sQuestion = Chr(63) '?:In case there is a misunderstanding in character literals
    Dim i As Long
    For i = 1 To Len(argStr)
        If Mid(argStr, i, 1) <> sQuestion And _
           Asc(Mid(argStr, i, 1)) = Asc(sQuestion) Then
            isSJIS = False
            Exit Function
        End If
    Next
    isSJIS = True
End Function
Function LenByteUTF8(ByRef s As String) As Long
    If Len(s) = 0 Then
        LenByteUTF8 = 0
        Exit Function
    End If
    Dim nBufferSize As Long
    Dim Ret() As Byte
    nBufferSize = WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), 0, 0, 0, 0)
    ReDim Ret(0 To nBufferSize - 1)
    WideCharToMultiByte CP_UTF8, 0, StrPtr(s), Len(s), VarPtr(Ret(0)), nBufferSize, 0, 0
    LenByteUTF8 = UBound(Ret) + 1
End Function
Function IsLengthle255byteOnUTF8(s As String) As Boolean
'It is a test version and is being verified.
' UTF-Convert to 8 and assign to Byte array to count the number
'False if greater than 255
'When using this function
' https://qiita.com/Q11Q/items/a2d61545d9e5c4e15f22
'Pay attention to the VBE settings
    If Len(s) = 0 Then
        IsLengthle255byteOnUTF8 = False
        Exit Function
    End If
    Dim nBufferSize As Long
    Dim Ret() As Byte
    Dim cnt As Long
    nBufferSize = WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), 0, 0, 0, 0)
    ReDim Ret(0 To nBufferSize - 1)
    WideCharToMultiByte CP_UTF8, 0, StrPtr(s), Len(s), VarPtr(Ret(0)), nBufferSize, 0, 0
    cnt = UBound(Ret) + 1
    If cnt > 255 Then IsLengthle255byteOnUTF8 = False: Exit Function Else IsLengthle255byteOnUTF8 = True: Exit Function
End Function

Add to the previous module. What I added this time LenByteUTF8

Function LengthUTF8(s As String)
'It is a test version and is being verified.
' UTF-Convert to 8 and assign to Byte array to count the number
'False if greater than 255
'When using this function
' https://qiita.com/Q11Q/items/a2d61545d9e5c4e15f22
'Pay attention to the VBE settings
    If Len(s) = 0 Then
        LengthUTF8 = 0
        Exit Function
    End If
    Dim nBufferSize As Long
    Dim Ret() As Byte
    Dim cnt As Long
    nBufferSize = WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), 0, 0, 0, 0)
    ReDim Ret(0 To nBufferSize - 1)
    WideCharToMultiByte CP_UTF8, 0, StrPtr(s), Len(s), VarPtr(Ret(0)), nBufferSize, 0, 0
    cnt = UBound(Ret) + 1
    LengthUTF8 = cnt
End Function

And IsLengthle 255byteOnUTF8 Is.

Function IsLengthle255byteOnUTF8(s As String) As Boolean
'It is a test version and is being verified.
' UTF-Convert to 8 and assign to Byte array to count the number
'False if greater than 255
'When using this function
' https://qiita.com/Q11Q/items/a2d61545d9e5c4e15f22
'Pay attention to the VBE settings
    If Len(s) = 0 Then
        IsLengthle255byteOnUTF8 = False
        Exit Function
    End If
    Dim nBufferSize As Long
    Dim Ret() As Byte
    Dim cnt As Long
    nBufferSize = WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), 0, 0, 0, 0)
    ReDim Ret(0 To nBufferSize - 1)
    WideCharToMultiByte CP_UTF8, 0, StrPtr(s), Len(s), VarPtr(Ret(0)), nBufferSize, 0, 0
    cnt = UBound(Ret) + 1
    If cnt > 255 Then IsLengthle255byteOnUTF8 = False: Exit Function Else IsLengthle255byteOnUTF8 = True: Exit Function
End Function

From the previous toUTF8, you can see how to count in byte units.

ReDim Ret(0 To nBufferSize - 1) Byte units are counted here. It may be okay to put it out as it is, but I don't know the behavior of Win32API, so I will definitely assign it to the array. Then add 1 to the number of elements acquired by Ubound. It may be possible to set this to 1 when it is Redim instead of 0, but since this array enters Win32Api, I decided that it should not be changed.

This is not the length of a directory or file on a Windows server

As stated in the president's blog, it is Linux, not Windows, that is said to be 255 characters and 255 bytes in UTF-8, and the file length of the Windows server cannot be measured with this.

Surrogate pair is 4 bytes

When counting on this basis, the surrogate pair consumes 4 bytes and 4 half-width characters. Even if it looks like full-width characters, 42 characters are considered to be the limit. If you search, you will find various theories such as having 5 bytes or 6 bytes, but being hit. According to IBM's materials, UTF-8 has a maximum of 4 bytes. https://www.ibm.com/support/knowledgecenter/ja/SSEPGG_11.5.0/com.ibm.db2.luw.admin.nls.doc/doc/c0004816.html Therefore, the maximum is 4 bytes. And a 4-byte example is a surrogate pair.

Bytes and number of characters

This function converts the character string to UTF-8 bytes. So it seems that you can replace it with a code regardless of whether it is full-width or half-width and make it that long. In other words, the capacity is 255 bytes and the number of characters is 255, so they are used properly in the text.

Win32 API and strengths

It's certainly a long code, but you don't have to reference ADODB. Also, since toUTF8 is converted to a UTF-8 byte array, it would be useless without spills. However, it is only valid in Excel's Worksheet, so it is difficult to use only in Excel. Since this function changes the number of characters to the number of bytes and judges 255 characters, it can be used not only in Excel but also in Access, Word, Publisehr Outlook.

Recommended Posts