Hello,
I started a pure Lua module to support operations on UTF-8 data.
See lua-utf8
The first goals have been reached:
- get the correct length
- get a substring
- automatic conversion
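To illustrate the length goal only: a minimal sketch, not the code from lua-utf8. The helper name utf8_len is hypothetical, and the input is assumed to be valid UTF-8. It counts code points by skipping continuation bytes:

```lua
-- Count UTF-8 code points by counting every byte that is NOT a
-- continuation byte (continuation bytes look like 10xxxxxx, i.e. 128..191).
-- Assumption: s is valid UTF-8. utf8_len is a hypothetical helper name.
local function utf8_len(s)
  local n = 0
  for i = 1, #s do
    local b = s:byte(i)
    if b < 128 or b >= 192 then
      n = n + 1
    end
  end
  return n
end

-- "àbcdéêèf" written with decimal byte escapes, so the encoding of the
-- source file does not matter.
local data = "\195\160bcd\195\169\195\170\195\168f"
print(#data)          -- 12 (bytes)
print(utf8_len(data)) -- 8 (code points)
```

This counts code points, which matches the number of characters on screen only when every character is a single code point.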
Regards,
TsT
utf8 support in pure lua
My current projects: dragoon-framework (includes lua-newmodule, lua-provide, lovemodular, classcommons2, and more ...)
Re: utf8 support in pure lua
Hello TsT,
Pleased to see someone else interested in Unicode. I have had a look at your online repo. However, are you aware that the following example you give will not work in general?
Code: Select all
-- Sample of use
local data = "àbcdéêèf"
local u = require("utf8")
local udata = u(data)
print(type(data), data)   -- the original
print(type(udata), udata) -- automatic conversion to string
print(#data)  -- is not the number of characters printed on screen
print(#udata) -- is the number of characters printed on screen
print(udata:sub(4,5)) -- sub() can be used as on a string
Code: Select all
# coding:utf8
s = u"\u0041\u0302\u0020\u0041\u032D"
print(s) # "Â A̭" (3 chars!)
print(repr(s)) # u'A\u0302 A\u032d'
print (len(s)) # 5
A single character is represented by a sequence of codes (one or more; there is in fact no formal limit). Each code is one number in UTF-32 and 1 to 4 bytes in UTF-8 (or up to 6 in the original definition), as you know. Thus, decoding UTF-8 gives you an array of codes, not an array of characters in the everyday or programming sense of "character". As a consequence, #udata on my example will give 5, not 3.
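The difference between codes and characters can be sketched in Lua. This is only an illustration under stated assumptions: utf8_codes and rough_char_count are hypothetical helpers, the input is assumed to be valid UTF-8, and "character" is crudely approximated by skipping code points in the Combining Diacritical Marks block (U+0300 to U+036F), which is far from full Unicode segmentation.

```lua
-- Decode a valid UTF-8 string into an array of code points.
-- Assumption: no validation is done; malformed input gives wrong results.
local function utf8_codes(s)
  local codes, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    local len, cp
    if b < 128 then len, cp = 1, b
    elseif b < 224 then len, cp = 2, b - 192
    elseif b < 240 then len, cp = 3, b - 224
    else len, cp = 4, b - 240 end
    -- each continuation byte contributes its low 6 bits
    for j = i + 1, i + len - 1 do
      cp = cp * 64 + (s:byte(j) - 128)
    end
    codes[#codes + 1] = cp
    i = i + len
  end
  return codes
end

-- Very rough "character" count: ignore marks in U+0300..U+036F (768..879).
local function rough_char_count(codes)
  local n = 0
  for _, cp in ipairs(codes) do
    if cp < 768 or cp > 879 then n = n + 1 end
  end
  return n
end

-- U+0041 U+0302 U+0020 U+0041 U+032D, as in the Python example above
local s = "A\204\130 A\204\173"
local codes = utf8_codes(s)
print(#s)                      -- 7 (bytes)
print(#codes)                  -- 5 (codes)
print(rough_char_count(codes)) -- 3 (characters, roughly)
```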
Anyway, it's still very, very nice to have utf-8 <--> unicode encoding and decoding routines, and I may reuse them if you don't mind.
Regards,
Denis
... la vita e estrany ...
Re: utf8 support in pure lua
Hello spir,
Thanks for your feedback.
I also appreciate meeting someone who cares about Unicode!
Unfortunately, my current utf8.lua takes a simple approach.
I tried to support more advanced features such as lower/upper case conversion for Unicode, but in the end I decided it was too complicated...
I think anyone who wants true, full support for UTF-8 (or Unicode) should use a more complete solution, such as ICU.
You were looking for a way to create a Unicode sequence from numeric codes. You can use string.char (see the session below).
If you understand how to handle composite Unicode characters, I will be happy to include changes to support them.
Code: Select all
> a = "Â A̭"
> print(a:byte(1,-1))
195 130 32 65 204 173
> for i,v in ipairs({a:byte(1,-1)}) do print(i,v, ("0x%x"):format(v)) end
1 195 0xc3
2 130 0x82
3 32 0x20
4 65 0x41
5 204 0xcc
6 173 0xad
> b=string.char(195, 130, 32, 65, 204, 173)
> b=string.char(0xc3, 0x82, 0x20, 0x41, 0xcc, 0xad)
> print(b)
Â A̭
Regards,
EDIT: I discovered the ValidateUnicodeString page.
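The string.char session above starts from bytes that are already known. Going from a numeric code point to its UTF-8 bytes could be sketched as follows. The helper name utf8_char is hypothetical, and no range or surrogate checks are done:

```lua
-- Encode one code point as a UTF-8 byte string using string.char.
-- Assumption: cp is an integer in 0..0x10FFFF; nothing is validated.
local function utf8_char(cp)
  local floor = math.floor
  if cp < 128 then
    return string.char(cp)
  elseif cp < 2048 then
    return string.char(192 + floor(cp / 64), 128 + cp % 64)
  elseif cp < 65536 then
    return string.char(224 + floor(cp / 4096),
                       128 + floor(cp / 64) % 64,
                       128 + cp % 64)
  else
    return string.char(240 + floor(cp / 262144),
                       128 + floor(cp / 4096) % 64,
                       128 + floor(cp / 64) % 64,
                       128 + cp % 64)
  end
end

-- U+00C2, U+0020, U+0041, U+032D: the same string as in the session above
local b = utf8_char(194) .. utf8_char(32) .. utf8_char(65) .. utf8_char(813)
print(b:byte(1, -1)) -- 195 130 32 65 204 173
```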
TsT wrote: Unfortunately my current utf8.lua is a simple approach. I tried to support more advanced stuff like lower/upper cases on Unicode, finally I thought it's too complicated... I think if someone want a true and full support of UTF-8 (or Unicode) he must use a better solution
Well, in fact, as long as people understand the (theoretical) issue with composite characters, any support for Unicode code points can be pretty useful. The point is that most text will be made of precomposed characters anyway. So if one knows the software that produced it, or is ready to take the risk... In any case, it is good to be able to point to or select parts of a byte string while knowing we are at the borders of valid code points.
About full Unicode support: if you mean building a representation which really is a sequence of characters, it is doable, but costly. (You essentially need to produce a normalised decomposed form.) If you mean Unicode support in the sense of providing tools like universal casing or locale-aware sorting, or giving information about characters (is it a scripting char? a base or combining one? is it written right-to-left?), then it is another story ==> ICU, as you say.
TsT wrote: You searched a way to create an Unicode sequence by numerical code. You may use string.char
Yes, thank you!
TsT wrote: If you understand how manage the composite Unicode Characters I will be happy to include changes to support them.
About composite Unicode characters: no, at least not now; I don't have time for that. (But I have a lib for that in D; I also had a prototype in Lua, but cannot find it anymore.) However, it is probably not worth the pain and the cost (in time and memory).
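For the more modest goal of selecting parts of a byte string at the borders of valid code points, a sketch could look like this. utf8_offset and utf8_sub are hypothetical helpers, only positive indices are handled, and the input is assumed to be valid UTF-8. It can still separate a base character from a combining code that follows it.

```lua
-- Byte position where the n-th code point (1-based) starts, #s + 1 when n
-- is one past the last code point, or nil when n is further out of range.
-- Assumption: s is valid UTF-8.
local function utf8_offset(s, n)
  local count = 0
  for i = 1, #s do
    local b = s:byte(i)
    if b < 128 or b >= 192 then -- not a continuation byte
      count = count + 1
      if count == n then return i end
    end
  end
  if count + 1 == n then return #s + 1 end
  return nil
end

-- Substring from code point i to code point j (inclusive, positive only).
local function utf8_sub(s, i, j)
  local first = utf8_offset(s, i)
  if not first then return "" end
  local stop = utf8_offset(s, j + 1) or (#s + 1)
  return s:sub(first, stop - 1)
end

-- "àbcdéêèf" written with decimal byte escapes
local data = "\195\160bcd\195\169\195\170\195\168f"
print(data:sub(1, 1) == "\195")                 -- true: plain sub() cuts "à" in half
print(utf8_sub(data, 1, 1) == "\195\160")       -- true: the whole "à"
print(utf8_sub(data, 4, 5) == "d\195\169")      -- true: "dé"
```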
Denis