UTF-8 encoding error when erasing accent characters

Post by Amatereasu »

Hello :)
I'm working on a tool for encoding XML animation data, and part of that is typing in the name of an animation or similar. I'm using love.textinput(text) to detect key input and put it into a variable, state.commandprev, and I have a backspace function in love.keypressed(key) that sets it to string.sub(state.commandprev, 1, -2).
It's fine when typing the character, but when it's removed, it errors at a seemingly random point in the program:

Error

main.lua:1162: UTF-8 decoding error: Invalid UTF-8


Traceback

[love "callbacks.lua"]:228: in function 'handler'
[C]: in function 'getWidth'
main.lua:1162: in function 'draw'
[love "callbacks.lua"]:168: in function <[love "callbacks.lua"]:144>
[C]: in function 'xpcall'

I'm not sure how to deal with this. I've looked on the wiki and haven't found anything about it.
I have three possible solutions, but I'm not sure how to implement them effectively:
  • prevent them from being typed in the first place (because the generation algorithm I wrote does not handle them, and neither does XML)
  • handle the crash to prevent lost progress
  • convert the characters to their standard counterparts
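
For the first option, one possible approach is to strip non-ASCII bytes from the typed text before storing it. This is only a sketch: asciiOnly is a hypothetical helper name, not part of LÖVE. It relies on the fact that every byte of a multi-byte UTF-8 character is >= 0x80, so dropping those bytes drops accented characters whole.

```lua
-- Hypothetical helper (not part of LÖVE): keep only ASCII bytes.
-- All bytes of a multi-byte UTF-8 character are in the range 128-255.
local function asciiOnly(text)
  -- Parentheses discard gsub's second return value (the match count).
  return (text:gsub("[\128-\255]", ""))
end

-- Assumed usage inside LÖVE's text callback:
-- function love.textinput(text)
--   state.commandprev = state.commandprev .. asciiOnly(text)
-- end

print(asciiOnly("caf\195\169"))  -- "\195\169" is "é" in UTF-8
```

Since the string then never contains multi-byte characters, the original string.sub(state.commandprev, 1, -2) backspace would stay valid.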

Re: UTF-8 encoding error when erasing accent characters

Post by pgimeno »

Each UTF-8 character above the ASCII range consists of a byte in the range C0-FF followed by one or more bytes in the range 80-BF (the exact range is a bit more restricted but that's irrelevant here). If you don't remove all the bytes from the character, you get an error like the one you're getting.
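
To make the byte layout concrete, here is a small sketch in plain Lua. The sample string is an assumption chosen for illustration; it shows that string.sub(s, 1, -2) removes one byte, not one character, and leaves a lone lead byte behind.

```lua
-- "café" written with decimal escapes: "é" is the two bytes 0xC3 0xA9 (195, 169).
local s = "caf\195\169"
print(#s)                    --> 5 (length in bytes, not characters)

-- This is what the original backspace code does: drop exactly one byte.
local broken = s:sub(1, -2)
print(#broken)               --> 4
print(broken:byte(-1))       --> 195 (a lead byte with no continuation byte)
```

A string ending in a lead byte like this is not valid UTF-8, which is what getWidth reports later in love.draw.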

Since you're deleting from the end, you can use a logic like this:

Code:

function removeLast(str)
  if str == "" then -- nothing to delete?
    return ""
  end
  local lastIdx = -1
  local chr = str:byte(lastIdx)
  if chr >= 0x80 then
    -- last byte is part of a multi-byte character: walk back to its lead byte
    repeat
      lastIdx = lastIdx - 1
      chr = str:byte(lastIdx)
    until chr == nil or chr >= 0xC0 -- nil: ran off the start of a malformed string
  end
  return str:sub(1, lastIdx - 1)
end

The idea here is that if the last byte is >= 0x80, it scans backwards until it finds one that is >= 0xC0, which can only be the start of a UTF-8 character. Whatever the outcome (whether the last byte was < 0x80 or the scan found one >= 0xC0), it returns everything before that byte.

Now you can replace string.sub(state.commandprev, 1, -2) with removeLast(state.commandprev), which will do the right thing with special characters.
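
The same backwards scan can also be expressed as a single Lua pattern. This is a sketch of an alternative, not code from the thread: removeLastChar is a hypothetical name, and the pattern matches one byte outside 0x80-0xBF followed by any number of continuation bytes at the end of the string.

```lua
-- Hypothetical helper (not from the thread): remove the last UTF-8 character
-- by matching its lead (or ASCII) byte plus any trailing 0x80-0xBF bytes.
local function removeLastChar(str)
  -- Parentheses discard gsub's second return value (the match count).
  return (str:gsub("[^\128-\191][\128-\191]*$", ""))
end

print(removeLastChar("caf\195\169"))  --> caf ("\195\169" is "é")
print(removeLastChar("abc"))          --> ab
```

If the string holds nothing but continuation bytes, the pattern does not match and the string comes back unchanged.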

Edit: By the way, XML handles these characters just fine, if you set the character set to UTF-8:

Code:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

Notice the encoding="UTF-8" part.

Re: UTF-8 encoding error when erasing accent characters

Post by Amatereasu »

pgimeno wrote: Sun Oct 29, 2023 9:13 am ...
Thank you so much @~@ I would never have figured this out.

Re: UTF-8 encoding error when erasing accent characters

Post by zorg »

Amatereasu wrote: Sun Oct 29, 2023 9:48 am
pgimeno wrote: Sun Oct 29, 2023 9:13 am ...
...
Alternatively, there is a perfectly fine UTF-8 library included with LÖVE you can use, especially its utf8.offset function together with string.sub (the module has no sub function of its own).
The wiki even gives you this exact thing as an example: https://love2d.org/wiki/utf8
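
A minimal sketch of that approach, assuming the utf8 module behaves like Lua 5.3's utf8 library (which is what LÖVE bundles, as far as I know). The backspace function name is hypothetical; utf8.offset(text, -1) gives the byte position where the last character starts.

```lua
-- Lua 5.3+ exposes utf8 as a global; under LÖVE (LuaJIT) it must be required.
local utf8 = utf8 or require("utf8")

-- Hypothetical helper: remove the last UTF-8 character from text.
local function backspace(text)
  -- Byte position where the last character starts, or nil for an empty string.
  local byteoffset = utf8.offset(text, -1)
  if byteoffset then
    -- string.sub works on bytes, so cut just before that position.
    return text:sub(1, byteoffset - 1)
  end
  return text
end

print(backspace("caf\195\169"))  --> caf ("\195\169" is "é")
```

This only helps if every place that shortens the string goes through it; any remaining string.sub(..., 1, -2) call can still leave a partial character behind.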

Re: UTF-8 encoding error when erasing accent characters

Post by Amatereasu »

zorg wrote: Sun Oct 29, 2023 12:19 pm ...
I did actually try this, with no luck: same error.

Re: UTF-8 encoding error when erasing accent characters

Post by slime »

What's your code that uses it? Or maybe you have other code modifying the string's bytes that's not using it but should? Proper use of the utf-8 module should work without issues.