utf8 support in pure lua

TsT · Post by **TsT** » Fri Nov 09, 2012 12:09 am

Hello,

I started a pure Lua module to support operations on UTF-8 data.

See lua-utf8

The first goals have been reached:
- get the correct length
- get a substring
- automatic conversion to string

Regards,
TsT

spir · Post by **spir** » Sun Nov 11, 2012 4:24 pm

TsT wrote: Hello,

I started a pure Lua module to support operations on UTF-8 data.

See lua-utf8

The first goals have been reached:
- get the correct length
- get a substring
- automatic conversion to string

Regards,
TsT

Hello TsT,

Pleased to see someone else interested in Unicode. I have had a look at your online repo. However, are you aware that the following example you give will not work in general?

```lua
-- Sample of use
local data = "àbcdéêèf"

local u = require("utf8")

local udata = u(data)

print(type(data), data)   -- the original
print(type(udata), udata) -- automatic conversion to string

print(#data)  -- is not the number of characters printed on screen
print(#udata) -- is the number of characters printed on screen

print(udata:sub(4,5)) -- sub() can be used as on a string
```

I will not give you a Lua example because you cannot even type Unicode strings in Lua, but here is the best I can show, in Python:

```python
# coding:utf8
s = u"\u0041\u0302\u0020\u0041\u032D"
print(s)          # "Â A̭"   (3 chars!)
print(repr(s))    # u'A\u0302 A\u032d'
print(len(s))     # 5
```

The point is that what Unicode folks call "abstract characters", the things represented by "Unicode code points", are not what you, I, or anyone else would call "characters", but just whatever they chose to list in their set. In particular, composite characters like Â are basically represented by 2 codes: one for the base 'A' and one for the combining '^'. Which is a very good thing, imo: simple, informative, efficient. But there are also "precomposed characters", with single codes representing whole composite characters. These are the ones most (if not all) Unicode-aware editors and other text-producing software use, so everyone (even programmers working on Unicode) ends up thinking that "abstract characters" are just characters and that codes just represent characters. But this is not true.

A single character is represented by a sequence of codes (1 or more; there is in fact no formal limit). And each code is 1 number in UTF-32 and 1 to 4 bytes in UTF-8 (up to 6 in the original definition), as you know. Thus, decoding UTF-8 gives you an array of codes, not an array of character representations in the everyday or programming sense of "character". As a consequence, #udata on my example will give 5, not 3.
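For reference, the 5-versus-3 difference can be sketched in plain Lua by writing the example string with decimal byte escapes. This is only an illustration under stated assumptions: `decode`, `is_combining` and `count_characters` are hypothetical names, not part of lua-utf8, the input is assumed to be well-formed UTF-8, and only the U+0300..U+036F block is treated as combining, which is far from the full Unicode rules.

```lua
-- Decode a UTF-8 byte string into an array of code points.
-- Assumes well-formed input of 1- to 4-byte sequences; no validation is done.
local function decode(s)
  local cps, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    local cp, n
    if b < 0x80 then cp, n = b, 0             -- 0xxxxxxx: ASCII
    elseif b < 0xE0 then cp, n = b - 0xC0, 1  -- 110xxxxx: 2-byte sequence
    elseif b < 0xF0 then cp, n = b - 0xE0, 2  -- 1110xxxx: 3-byte sequence
    else cp, n = b - 0xF0, 3                  -- 11110xxx: 4-byte sequence
    end
    for j = 1, n do
      cp = cp * 64 + (s:byte(i + j) - 0x80)   -- append 6 bits per continuation byte
    end
    cps[#cps + 1] = cp
    i = i + n + 1
  end
  return cps
end

-- Simplification (assumption): only U+0300..U+036F (Combining Diacritical
-- Marks) counts as "combining" here; Unicode defines many more.
local function is_combining(cp)
  return cp >= 0x0300 and cp <= 0x036F
end

-- Count the code points that are not combining marks, i.e. the bases.
local function count_characters(cps)
  local n = 0
  for _, cp in ipairs(cps) do
    if not is_combining(cp) then n = n + 1 end
  end
  return n
end

-- 'A' U+0302, space, 'A' U+032D, written with decimal byte escapes
local s = "A\204\130 A\204\173"
local cps = decode(s)
print(#s, #cps, count_characters(cps))  -- bytes, code points, "characters"
```

The three numbers printed are the byte count, the code point count, and the count of base characters. A length operator that counts code points reports the second one, while a reader sees the third.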

Anyway, it's still very, very nice to have utf-8 <--> unicode encoding and decoding routines, and I may reuse them if you don't mind.

Regards,
Denis

TsT · Post by **TsT** » Tue Nov 13, 2012 1:11 pm

Hello spir,

Thanks for your feedback.
I also appreciate meeting someone who cares about Unicode!

Unfortunately, my current utf8.lua takes a simple approach.

I tried to support more advanced things like lower/upper case conversion for Unicode, but in the end I decided it was too complicated...
I think that anyone who wants true and full support for UTF-8 (or Unicode) should use a better solution, such as:

* ICU4lua
* slnunicode
* craigbarnes/lua-unicode

You were looking for a way to create a Unicode sequence from numerical codes.
You can use string.char with the byte values:

```lua
> a = "Â A̭"

> print(a:byte(1,-1))
195	130	32	65	204	173

> for i,v in ipairs({a:byte(1,-1)}) do print(i,v, ("0x%x"):format(v)) end
1	195	0xc3
2	130	0x82
3	32	0x20
4	65	0x41
5	204	0xcc
6	173	0xad

> b=string.char(195, 130, 32, 65, 204, 173)
> b=string.char(0xc3, 0x82, 0x20, 0x41, 0xcc, 0xad)
> print(b)
Â A̭
```

If you understand how to handle composite Unicode characters, I will be happy to include changes to support them.
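A note on the `string.char` approach: it takes byte values, so the UTF-8 encoding has to be worked out by hand first. The byte dump above also starts with 195 130, which is the precomposed U+00C2 rather than 'A' followed by U+0302. A minimal sketch of building the string from code points instead could look like this. `utf8_char` and `from_codepoints` are hypothetical helpers, not part of utf8.lua, and the sketch does not reject surrogates or out-of-range values.

```lua
-- Encode one Unicode code point (0..0x10FFFF) as a UTF-8 byte string.
-- Hypothetical helper; surrogates (U+D800..U+DFFF) are not rejected.
local function utf8_char(cp)
  local floor = math.floor
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + floor(cp / 64), 0x80 + cp % 64)
  elseif cp < 0x10000 then
    return string.char(0xE0 + floor(cp / 4096),
                       0x80 + floor(cp / 64) % 64,
                       0x80 + cp % 64)
  else
    return string.char(0xF0 + floor(cp / 262144),
                       0x80 + floor(cp / 4096) % 64,
                       0x80 + floor(cp / 64) % 64,
                       0x80 + cp % 64)
  end
end

-- Build a UTF-8 string from a list of code points.
local function from_codepoints(...)
  local out = {}
  for i = 1, select("#", ...) do
    out[i] = utf8_char((select(i, ...)))
  end
  return table.concat(out)
end

-- The decomposed sequence: 'A' U+0302, space, 'A' U+032D
local b = from_codepoints(0x41, 0x0302, 0x20, 0x41, 0x032D)
print(b:byte(1, -1))
```

Arithmetic with `math.floor` and `%` is used instead of bitwise operators so that the sketch does not depend on a particular Lua version.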

Regards,

EDIT: I discovered the ValidateUnicodeString page.

spir · Post by **spir** » Wed Nov 14, 2012 7:06 pm

TsT wrote: Hello spir,
Unfortunately, my current utf8.lua takes a simple approach.
I tried to support more advanced things like lower/upper case conversion for Unicode, but in the end I decided it was too complicated...
I think that anyone who wants true and full support for UTF-8 (or Unicode) should use a better solution, such as:

* ICU4lua
* slnunicode
* craigbarnes/lua-unicode

Well, in fact, as long as people understand the (theoretical) issue with composite characters, any support for unicode code points can be pretty useful. The point is most text will be made of precomposed characters anyway. So if one knows the software that produced it, or is ready to take the risk... It's good in any case to be able to point to or select parts of a byte string while knowing we are at borders of valid code points.
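Staying on code point borders is cheap in UTF-8, because continuation bytes always fall in the range 0x80..0xBF. Here is a rough sketch of a substring function that only cuts on such borders. The names `is_boundary`, `offset` and `usub` are made up for the illustration and are not the lua-utf8 API, and the input is assumed to be well-formed UTF-8.

```lua
-- True if byte offset i (1-based) in s is the first byte of a code point.
-- In UTF-8, continuation bytes always lie in the range 0x80..0xBF.
local function is_boundary(s, i)
  local b = s:byte(i)
  return b ~= nil and (b < 0x80 or b >= 0xC0)
end

-- Return the byte offset where the n-th code point starts, or nil.
local function offset(s, n)
  local count = 0
  for i = 1, #s do
    if is_boundary(s, i) then
      count = count + 1
      if count == n then return i end
    end
  end
  return nil
end

-- Substring covering code points i..j (inclusive), cut on borders only.
local function usub(s, i, j)
  local first = offset(s, i)
  if not first then return "" end
  local after = offset(s, j + 1)
  return s:sub(first, after and after - 1 or #s)
end

-- "àbcdéêèf" with the accented letters written as decimal byte escapes
local data = "\195\160bcd\195\169\195\170\195\168f"
print(usub(data, 4, 5))  -- code points 4 and 5: 'd' and 'é'
```

This works on code points only. A combining mark that follows position `j` would still be cut off from its base, which is the theoretical issue discussed above.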

About full Unicode support: if you mean building a representation which really is a sequence of characters, it is doable, but costly. (You essentially need to produce a normalised decomposed form.) If you mean Unicode support in the sense of providing tools like universal casing or locale-aware sorting, or giving information about characters (is it a scripting char? a base or a combining one? is it written right-to-left?), then that is another story ==> ICU, as you say.
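To give an idea of what "a sequence of characters" could mean in practice, here is a toy sketch that decomposes code points and then groups each base with its combining marks. Everything in it is an assumption made for illustration: the `DECOMP` table is a three-entry hand-written excerpt (the real data lives in the Unicode Character Database), only U+0300..U+036F is treated as combining, and no canonical reordering of marks is done.

```lua
-- Tiny hand-written excerpt of canonical decompositions
-- (precomposed -> base + combining mark). Illustration only.
local DECOMP = {
  [0x00C2] = {0x0041, 0x0302},  -- Â -> A + combining circumflex
  [0x00E0] = {0x0061, 0x0300},  -- à -> a + combining grave
  [0x00E9] = {0x0065, 0x0301},  -- é -> e + combining acute
}

-- Replace every precomposed code point found in the table by its parts.
-- Simplification: one level only, no canonical reordering of marks.
local function decompose(cps)
  local out = {}
  for _, cp in ipairs(cps) do
    local parts = DECOMP[cp]
    if parts then
      for _, p in ipairs(parts) do out[#out + 1] = p end
    else
      out[#out + 1] = cp
    end
  end
  return out
end

-- Group a decomposed sequence into characters: a base plus its marks.
local function characters(cps)
  local chars = {}
  for _, cp in ipairs(cps) do
    local combining = cp >= 0x0300 and cp <= 0x036F  -- simplification
    if combining and #chars > 0 then
      local last = chars[#chars]
      last[#last + 1] = cp
    else
      chars[#chars + 1] = {cp}
    end
  end
  return chars
end

local precomposed = {0x00C2, 0x0020, 0x0041, 0x032D}          -- starts with precomposed Â
local decomposed  = {0x0041, 0x0302, 0x0020, 0x0041, 0x032D}  -- starts with A + U+0302
print(#characters(decompose(precomposed)), #characters(decompose(decomposed)))
```

Both inputs end up as the same three characters, each stored as a small array of codes. The cost mentioned above shows here too: one table per character instead of one number per code.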

TsT wrote: You were looking for a way to create a Unicode sequence from numerical codes.
You can use string.char with the byte values:
```lua
> a = "Â A̭"

> print(a:byte(1,-1))
195	130	32	65	204	173

> for i,v in ipairs({a:byte(1,-1)}) do print(i,v, ("0x%x"):format(v)) end
1	195	0xc3
2	130	0x82
3	32	0x20
4	65	0x41
5	204	0xcc
6	173	0xad

> b=string.char(195, 130, 32, 65, 204, 173)
> b=string.char(0xc3, 0x82, 0x20, 0x41, 0xcc, 0xad)
> print(b)
Â A̭
```
If you understand how to handle composite Unicode characters, I will be happy to include changes to support them.

Yes, thank you!
About composite Unicode Characters: no, at least not now, I don't have time for that. (But I have a lib for that in D; I also had a prototype in Lua, but cannot find it anymore.) However, it is probably not worth the pain and the cost (in time and memory).

Denis
