Validate UTF8 strings?

Posted: Thu Sep 14, 2023 5:59 am
by wan-may
I read, via the FFI, some nasty untrusted binary sludge sent from who-knows-where.

Sometimes this sludge contains a possibly strange UTF-8 string I might want to display - maybe I want an Elvish localisation when 12.0 adds all that custom ligature support.

The utf8 library is only concerned with the byte-level encoding, not with whether the decoded code points are acceptable, so it doesn't keep text:add from choking on things:

Code:

love.load = function()
  local utf8 = assert( require 'utf8' )
  local s = utf8.char( 62835, 55592 ) --cognitohazardous ZWJ sequence (55592 is U+D928, a lone surrogate)
  assert( utf8.len( s ) ) --Passes, although len should return fail (nil) on 'any invalid byte sequence'
  love.graphics.newText( love.graphics.getFont() ):set( s ) --Throws 'invalid code point' error when decoding anyway
end
I guess in the worst case I can get away with pcalling text:add or something. But:

is there a right way to do this? Is there a function that will decide if my string is acceptable utf8, before I actually pass it to a text object?

Re: Validate UTF8 strings?

Posted: Thu Sep 14, 2023 1:42 pm
by pgimeno
I don't think there's one, and unfortunately Lua doesn't have regular expressions, which would have been a solution.

The best I've found is this:

Code:

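local utf8 = require("utf8")

-- Raises an error if s is not acceptable UTF-8; returns nothing otherwise.
local function validate(s)
  -- utf8.codes itself raises an error on a malformed byte sequence
  for p, c in utf8.codes(s) do
    -- reject surrogates and the noncharacters U+FFFE and U+FFFF
    if (c >= 0xD800 and c <= 0xDFFF) or c == 0xFFFE or c == 0xFFFF then
      error("invalid UTF-8 codepoint")
    end
  end
end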
utf8.codes already catches overlong sequences and codes > U+10FFFF, so that's covered.
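
If a yes/no answer is more convenient than an error, the same loop can probably be wrapped in pcall. This is only a sketch: the name isValidUtf8 is made up for illustration, and it assumes the utf8 module behaves like the Lua 5.3 one (raising an error from utf8.codes on malformed input).

```lua
local utf8 = require("utf8")

-- Hypothetical helper (not part of LÖVE or the utf8 library).
-- Returns true if s decodes cleanly and contains no surrogates and no
-- U+FFFE / U+FFFF; otherwise returns false plus an error message.
local function isValidUtf8(s)
  local ok, err = pcall(function()
    -- utf8.codes raises an error on malformed byte sequences
    for _, c in utf8.codes(s) do
      if (c >= 0xD800 and c <= 0xDFFF) or c == 0xFFFE or c == 0xFFFF then
        -- level 0 keeps the message free of file/line position info
        error(string.format("disallowed code point U+%04X", c), 0)
      end
    end
  end)
  if ok then
    return true
  end
  return false, err
end
```

It could then be called as `if isValidUtf8(s) then text:set(s) end`, so the text object never sees a string that fails the check.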

Re: Validate UTF8 strings?

Posted: Thu Sep 14, 2023 8:50 pm
by wan-may
I see, thank you!