Equivalent to Gamemaker's sign() function?

raidho36 · Post by **raidho36** » Mon Feb 26, 2018 1:00 am

I've came up with a function that works reliably and contains no branches. It's a variant of "divide by itself" and it run about 15% slower, but if your data set contains 10% zeroes, already it gains an edge over the basic variant, on account of not having any branches in it - conditional variant's performance deteriorates as branch prediction fails grow, while this one always performs the same.

Code: Select all

function sign ( x )
x = x / ( math.abs ( x ) + 1 )
return math.floor ( x ) + math.ceil ( x )
end

With all that in mind, the "multiply by infinity" is still by a wide margin the fastest variant, but only if your data set contains no zeroes.

pgimeno · Post by **pgimeno** » Mon Feb 26, 2018 9:51 am

Have you benchmarked it against the double multiply that coffeecat proposed?

Code: Select all

local function sign_clamp(x)
  return math.max(math.min(x * 1e200 * 1e200, 1), -1)
end

This one has no problem with zeros.

Edit:

Code: Select all

$ luajit
LuaJIT 2.0.4 -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org/
JIT: ON CMOV SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
> function sign(x)
>>   x = x / (math.abs(x) + 1)
>>   return math.floor(x) + math.ceil(x)
>> end
> print(sign(1e20))
2
>

raidho36 · Post by **raidho36** » Mon Feb 26, 2018 10:43 am

~~Don't know why you'd want to multiply it two times - multiplying it once by 1e400 does the same job. And I'm pretty sure LuaJIT folds that operation anyway.~~ LuaJIT apparently doesn't support 80 bit values, so if you write 1e400 it turns into infinity. But you can get 1e400 inside the FPU if you multiply 1e200 by itself.

As I said, the above code works slower than normal division variant. It however is reliable in terms of producing correct output on various hardware. 1e400 is outside of double precision range so it's represented as infinity in a 64 bit floating point format, but X86's FPU is 80 bit and it has 15 bits of exponent (64 bit double has 11 bits) so you can get away with a little more range for an operation where one of the operands is an 80 bit compile-time value. If you attempt storing the 80 bit 1e400 value, which would be in a 64 bit double, you get a normal infinity and with it, NaNs. If you attempt using 1e5000 (greatly outside of even 80 bit range), you get NaNs. If you attempt this technique on devices with 64 bit FPU, you get NaNs too.

You can however simply test if 80 bits is available and set your sign function as as one of the variants. Obviously the extended internal range trick works on certain hardware, and it greatly boosts performance when it's available.

EDIT:
Oh crud. I worked it out for small values but I forgot that radix works both ways so big values are screwed up. Nevermind then.

EDIT2:
I did however came up with a hybrid variant that's even a notch faster than multiply variant.

Code: Select all

return math.min ( 1, math.max ( -1, math.ceil ( x ) + math.floor ( x ) ) )

pgimeno · Post by **pgimeno** » Mon Feb 26, 2018 11:19 am

raidho36 wrote: ↑Mon Feb 26, 2018 10:43 am Don't know why you'd want to multiply it two times - multiplying it once by 1e400 does the same job. And I'm pretty sure LuaJIT folds that operation anyway.

Floating point multiplication is well known for not being associative. Compilers don't fold these things unless you pass them special flags (e.g. -fassociative-math in gcc). I'm pretty sure LuaJIT doesn't do that, and the results and compiled code agree.

Why do you want to multiply twice? Because when it's zero to start with, you're not multiplying with infinity in any of the steps, and even when it's in the denormal range, it gets promoted to a number guaranteed to be greater than 1 or less than -1. That's why.

raidho36 wrote: ↑Mon Feb 26, 2018 10:43 am As I said, the above code works slower than normal division variant. It however is reliable in terms of producing correct output on various hardware.
It fails to give the correct output in my hardware for certain inputs. See my edit above.

(scratched per your edit)

Edit2: As for your second edit:

raidho36 wrote: ↑Mon Feb 26, 2018 10:43 am LuaJIT apparently doesn't support 80 bit values, so if you write 1e400 it turns into infinity. But you can get 1e400 inside the FPU if you multiply 1e200 by itself.

For SSE2 that won't happen, but even if it's not available, it doesn't matter if you get 1e400 inside the FPU. LuaJIT will still do the math in the order specified, meaning the first product will always give zero when the input is zero. If you get infinity (or 1e400) in the second product, it will be correctly clamped. Therefore the function performs as specified.

raidho36 wrote: ↑Mon Feb 26, 2018 10:43 am
Code: Select all
return math.min ( 1, math.max ( -1, math.ceil ( x ) + math.floor ( x ) ) )

Now you're on to something. Works fine for any ranges I could test. But in my testing, the performance is poorer than the multiply version:

Code: Select all

            Test 1    Test 2
Branch    : 0.039085  0.035571
AbsDiv    : 0.020082  0.019601
Mul Clamp : 0.005007  0.004514
Floor+Ceil: 0.011755  0.011638

Code: Select all

local t = {}
local iter = 5000000

math.randomseed(1)
for i = 1, iter do
  t[i] = math.random() * 200 - 100
end

local function branch(x)
  return x < 0 and -1 or x > 0 and 1 or x
end

local function absdiv(x)
  x = x / (math.abs (x) + 1)
  return math.floor(x) + math.ceil(x)
end


local function clamp(x)
  return math.max(math.min(x * 1e200 * 1e200, 1), -1)
end

local function clamp2(x)
  return math.min(1, math.max (-1, math.ceil(x) + math.floor(x)))
end

local function benchmark(fn)
  math.randomseed(1)
  local sum = 0
  local start = os.clock()
  for i = 1, iter do
    sum = sum + fn(t[i])
  end
  local finish = os.clock()
  print(finish - start, sum)
end

local function benchmark_branch()
  math.randomseed(1)
  local sum = 0
  local start = os.clock()
  for i = 1, iter do
    sum = sum + branch(t[i])
  end
  local finish = os.clock()
  print(finish - start, sum)
end

local function benchmark_absdiv()
  math.randomseed(1)
  local sum = 0
  local start = os.clock()
  for i = 1, iter do
    sum = sum + absdiv(t[i])
  end
  local finish = os.clock()
  print(finish - start, sum)
end

local function benchmark_clamp()
  math.randomseed(1)
  local sum = 0
  local start = os.clock()
  for i = 1, iter do
    sum = sum + clamp(t[i])
  end
  local finish = os.clock()
  print(finish - start, sum)
end

local function benchmark_clamp2()
  math.randomseed(1)
  local sum = 0
  local start = os.clock()
  for i = 1, iter do
    sum = sum + clamp2(t[i])
  end
  local finish = os.clock()
  print(finish - start, sum)
end


-- First "priming" pass
benchmark(branch)
benchmark(absdiv)
benchmark(clamp)
benchmark(clamp2)
benchmark_branch()
benchmark_absdiv()
benchmark_clamp()
benchmark_clamp2()
print("----- Results:")
-- Actual pass
benchmark(branch)
benchmark(absdiv)
benchmark(clamp)
benchmark(clamp2)
benchmark_branch()
benchmark_absdiv()
benchmark_clamp()
benchmark_clamp2()

Edit3:
For the sake of completeness, I tested in Lua too:

Code: Select all

Branch    : 0.268995  0.267536
AbsDiv    : 1.001139  1.001986
Mul Clamp : 0.657281  0.655257
Floor+Ceil: 1.123145  1.117501

In this case, the branch version outperforms the rest, with the double-mul version being in second place.

raidho36 · Post by **raidho36** » Mon Feb 26, 2018 1:02 pm

To me, it generated code like this:

Code: Select all

->LOOP:
0bcef6a0  cmp dword [rcx+rbp*8+0x4], 0xfffeffff
0bcef6a8  jnb 0x0bce0018	->2
0bcef6ae  movsd xmm6, [rcx+rbp*8]
0bcef6b3  mulsd xmm6, xmm2
0bcef6b7  mulsd xmm6, xmm2
0bcef6bb  minsd xmm6, xmm1
0bcef6bf  maxsd xmm6, xmm0
0bcef6c3  addsd xmm7, xmm6
0bcef6c7  add ebp, +0x01
0bcef6ca  cmp ebp, eax
0bcef6cc  jle 0x0bcef6a0	->LOOP
0bcef6ce  jmp 0x0bce001c	->3

Code: Select all

->LOOP:
0bcef4e0  cmp dword [rcx+rbp*8+0x4], 0xfffeffff
0bcef4e8  jnb 0x0bce0018	->2
0bcef4ee  movsd xmm6, [rcx+rbp*8]
0bcef4f3  roundsd xmm5, xmm6, 0x0a
0bcef4f9  roundsd xmm6, xmm6, 0x09
0bcef4ff  addsd xmm6, xmm5
0bcef503  maxsd xmm6, xmm1
0bcef507  minsd xmm6, xmm0
0bcef50b  addsd xmm7, xmm6
0bcef50f  add ebp, +0x01
0bcef512  cmp ebp, eax
0bcef514  jle 0x0bcef4e0	->LOOP

It would make perfect sense that almost same exact code but involving less floating point math operations would run faster. No clue why it didn't work for you. Somehow your entire timing table exhibits pretty wild and counter-intuitive dispersion - your branching variant even runs faster than division variant. Maybe that has to do with you calling the function by reference so often? In my benchmark, the tight testing loop calls it directly.

Code: Select all

	min	max	avg	mean	check
branch	0.7235	0.7268	0.7242	0.7239	2720000
div	0.2133	0.2134	0.2133	0.2133	2720000
fast	0.0841	0.0862	0.0849	0.0846	2720000
sign	0.0822	0.0849	0.0835	0.0829	2720000
dummy	0.0805	0.0807	0.0806	0.0806	1251059.0378535

pgimeno · Post by **pgimeno** » Mon Feb 26, 2018 3:19 pm

The branch version is the slowest in all LuaJIT tests, not sure why you say otherwise. Only the pure Lua version is faster, which I attribute to function call overhead.

I don't know why our performances are different. The compiled code I observe is the same as yours. Here are the results I get with your program:

Code: Select all

$ love10 .
testing branch .
testing div .
testing fast .
testing sign .
testing dummy .
testing branch ..........
testing div ..........
testing fast ..........
testing sign ..........
testing dummy ..........
	min	max	avg	mean	check
branch	0.4934	0.4947	0.4942	0.4943	2720000
div	0.2432	0.2432	0.2432	0.2432	2720000
fast	0.0557	0.0558	0.0557	0.0557	2720000
sign	0.146	0.1461	0.146	0.146	2720000
dummy	0.0548	0.0548	0.0548	0.0548	1251059.0378535

One possible reason is a difference in the optimization of rounding in each processor.

Code: Select all

$ cat /proc/cpuinfo | grep model\ name | uniq
model name	: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz

raidho36 · Post by **raidho36** » Tue Feb 27, 2018 1:54 am

Well, provided your version of LuaJIT generated the same loop code, my only guess is that rounding operations on your CPU are as expensive as division operations - for whatever reason, they really should be single cycle.

Just goes to show that different hardware and software environment exhibits different performance patterns.

coffeecat · Post by **coffeecat** » Tue Feb 27, 2018 10:07 am

raidho36, thanks for sharing your new ideas.

Please note that we can't get a double multiply function that works with both double precision and 80-bit extended precision. The minimum positive representable number in 80-bit extended precision is approximately 3.65×10^−4951, and the maximum finite representable number in double precision is approximately 1.80×10^308.

However the official Lua 5.1 manual specifies:

Number represents real (double-precision floating-point) numbers.

"Double-precision floating-point" should usually refer to the double-precision floating-point format as specified in IEEE 754, but I'm not sure whether it's the case with Lua.

Then again, I think it's a fun puzzle to solve, but writing such performance-sensitive code in C should be a saner approach in real projects.

raidho36 · Post by **raidho36** » Tue Feb 27, 2018 12:10 pm

If you intend on delivering a multi-platform app but don't have resources to maintain its own binaries for every platform, then it doing Lua-side optimization is a perfectly valid approach. Not to mention, LuaJIT is a fully fledged compiler, and often enough produces assembly rivaling that of classic compilers.

coffeecat · Post by **coffeecat** » Wed Feb 28, 2018 5:32 am

Generally speaking, doing optimization in Lua could be harder than in C. It's harder to control or predict the performance of a piece of Lua code, since the gap between Lua and hardware instructions is larger. The performance may depend on the version of LuaJIT, some random state in the runtime environment, and others. JIT compilation contributes to this uncertainty by only kicking in under some opaque conditions.

Equivalent to Gamemaker's sign() function?

Re: Equivalent to Gamemaker's sign() function?

Re: Equivalent to Gamemaker's sign() function?

Re: Equivalent to Gamemaker's sign() function?

Re: Equivalent to Gamemaker's sign() function?

Re: Equivalent to Gamemaker's sign() function?

Re: Equivalent to Gamemaker's sign() function?

Re: Equivalent to Gamemaker's sign() function?

Re: Equivalent to Gamemaker's sign() function?

Re: Equivalent to Gamemaker's sign() function?

Re: Equivalent to Gamemaker's sign() function?

Who is online