Reading binary files quickly.


Re: Reading binary files quickly.

Post by zorg »

ingsoc451 wrote: Sun Mar 03, 2019 5:31 pm Prior to running your game, you can export the bin files to a format that is easier/faster for your game to parse (like a (compressed) Lua table). This is a one-time processing step. But if you are not able to do that for some reason, then a compressed Lua table would be of no use.
Then you can write a C/C++ module to do the parsing of the legacy format.
To my understanding, the reason they were using binary files was: "It's an old game they're trying to port", so they have no other option.
Then, instead of needing to touch C/C++ at all if they don't want to, they can just use what has already been suggested: either love.data.unpack or moonblob.

Re: Reading binary files quickly.

Post by gradualgames »

Thanks for the responses. I've been looking at the documentation for unpack and moonblob, but it is not clear to me whether I could use these without modification. For instance, this file format contains strings that have a uShort header (2 bytes) followed by the characters of the string. I couldn't infer from the unpack or moonblob documentation whether the length header for a string was a ushort or a normal 32-bit integer.

"cn: a fixed-sized string with n bytes"

What is that c? A 2-byte integer? A 4-byte integer? The documentation is unclear. People do say rtfm a lot, but people don't say wtfm enough, I think.

So far I have been writing my own parser for the data, which just advances an offset through a string containing the file's data, roughly like the sketch below. I may wind up sticking with this approach for maximum control and minimum dependencies.
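
Roughly like this (a minimal sketch; the helper names are illustrative, not my actual code):

Code: Select all

-- Read a little-endian u16 and return the value plus the advanced offset
local function readU16(data, offset)
	local lo, hi = data:byte(offset, offset + 1)
	return lo + hi * 256, offset + 2
end

-- Read a u16-length-prefixed string, advancing the offset past it
local function readString(data, offset)
	local len
	len, offset = readU16(data, offset)
	return data:sub(offset, offset + len - 1), offset + len
end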

Not sure yet how to parse floating point numbers, though; that will be interesting.

Re: Reading binary files quickly.

Post by grump »

gradualgames wrote: Mon Mar 04, 2019 8:17 pm For instance, this file format contains strings that have a uShort header (2 bytes) followed by the characters of the string. I couldn't infer from the unpack or moonblob documentation whether the length header for a string was a ushort or a normal 32-bit integer.

Code: Select all

local reader = BlobReader(data)
local len = reader:u16()
local string = reader:raw(len)
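
The n in the cn specifier, by the way, is a byte count written literally in the format string. love.data.unpack follows Lua 5.3's string.pack format, where sn is the variant that reads an n-byte length prefix from the data, so the unpack equivalent of the above would be:

Code: Select all

-- '<s2': little-endian, string preceded by a 2-byte unsigned length
local str, nextPos = love.data.unpack('<s2', data)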

Re: Reading binary files quickly.

Post by gradualgames »

grump wrote: Mon Mar 04, 2019 8:50 pm
local reader = BlobReader(data)
local len = reader:u16()
local string = reader:raw(len)
Good idea :) Thank you. I was actually just about to dig into the moonblob code to learn how to parse floating point numbers, because I think I'm almost through all the types I need to parse in my own mini binary parser.

Re: Reading binary files quickly.

Post by grump »

gradualgames wrote: Mon Mar 04, 2019 8:56 pm I was actually just about to dig into the moonblob code to learn how to parse floating point numbers, because I think I'm almost through all the types I need to parse in my own mini binary parser.
BlobReader:f32 reads 32-bit floating point numbers. BlobReader:f64 reads 64-bit floating point numbers.
Here is the documentation for BlobReader.

moonblob makes heavy use of ffi cdata types, so there is no actual "parsing" involved. It reads 4 or 8 bytes and interprets those bits as a floating point number by using "type punning".

love.data.unpack provides identifiers for floating point numbers, but you can't be sure their size matches the actual size of your data. It depends on the platform.
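
The idea behind the punning, as a hand-rolled sketch (illustrative only, not moonblob's actual code):

Code: Select all

local ffi = require('ffi')

-- Reinterpret 4 raw bytes of a Lua string as an IEEE-754 float
local buf = ffi.new('float[1]')
local function readF32(data, offset)
	-- offset is 1-based like Lua strings; the pointer is 0-based
	ffi.copy(buf, ffi.cast('const char *', data) + offset - 1, 4)
	return buf[0]
end

print(readF32('\0\0\128\63', 1)) -- 1.0 on a little-endian machine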

Re: Reading binary files quickly.

Post by gradualgames »

grump wrote: Mon Mar 04, 2019 9:02 pm BlobReader:f32 reads 32-bit floating point numbers. BlobReader:f64 reads 64-bit floating point numbers.
Thanks; I'll give moonblob a try rather than continuing to roll my own. :awesome:

Re: Reading binary files quickly.

Post by pgimeno »

Code: Select all

print(love.data.unpack('<s2s2', '\005\000ABCDE\002\000FG')) -- prints "ABCDE    FG     12"
print(love.data.unpack('<c5c2', 'ABCDEFG')) -- prints "ABCDE    FG      8"
print(love.data.unpack('<i2i4s2I2', '\254\255\001\000\000\000\018\000STRING OF 18 BYTES\254\255')) --prints:
-- -2	1	STRING OF 18 BYTES	65534	29

-- Example: Decode a TGA header - http://www.paulbourke.net/dataformats/tga/
local idlen, cmaptype, imgtype, cmapstart, cmaplength, cmapbits, xorigin, yorigin, xsize, ysize, pixelbits, flags =
  love.data.unpack('<BBBI2I2BI2I2I2I2BB', TGA_file)

Re: Reading binary files quickly.

Post by grump »

The thing with unpack is: if your data is more complex and you can't use static format strings anymore, it starts getting ugly pretty fast.

A typical case: parse a header with offsets and size information, seek to the offset of each chunk you're interested in, then read chunks of varying sizes and formats. You have to do lots of string slicing and formatting/concatenation for that, and it results in slow code that's hard to read.

Re: Reading binary files quickly.

Post by pgimeno »

grump wrote: Wed Mar 06, 2019 5:09 am You have to do lots of string slicing and formatting/concatenation for that, and it results in slow code that's hard to read.
unpack accepts an offset parameter. Wouldn't that obviate the need of slicing?
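
For instance (a minimal sketch; the chunk table layout is made up purely for illustration, and data is assumed to hold the file contents):

Code: Select all

-- Hypothetical header: u32 chunk count, then { u32 offset, u32 size } per chunk
local count, pos = love.data.unpack('<I4', data)
local chunks = {}
for i = 1, count do
	local offset, size
	offset, size, pos = love.data.unpack('<I4I4', data, pos)
	chunks[i] = { offset = offset, size = size } -- each chunk can later be read via unpack(fmt, data, offset + 1)
end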

Where would you need the slicing/formatting/concatenation with unpack that you don't need with other methods? Could you give an example?

Re: Reading binary files quickly.

Post by grump »

pgimeno wrote: Wed Mar 06, 2019 12:51 pm unpack accepts an offset parameter. Wouldn't that obviate the need of slicing?
Ah, you're right, my bad. No slicing required then.
pgimeno wrote: Wed Mar 06, 2019 12:51 pm Where would you need the slicing/formatting/concatenation with unpack that you don't need with other methods? Could you give an example?
After thinking about it for a bit, concatenation may not always be required, but...
Consider this simple structure:

Code: Select all

{
	uint16_t len
	uint16_t data[len]
}
in a file with 1,000,000 records of this type.

With love.data.unpack:

Code: Select all

local data = "\x10\x0000112233445566778899aabbccddeeff"
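-- "\x10\x00" is a little-endian u16 length (16), followed by 16 u16 values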
local result = {}
for i = 1, 1e6 do
	local len = love.data.unpack('<H', data)
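	-- one unpack call reads all len values, via a generated format string ('H' repeated len times)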
	result = { love.data.unpack('<' .. ('H'):rep(len), data, 3) }
	result[#result] = nil -- unpack returns an additional value
end
Runtime: 0.65s

But we can get rid of concatenation and also the table:

Code: Select all

local data = "\x10\x0000112233445566778899aabbccddeeff"
local result = {}
for i = 1, 1e6 do
	local len = love.data.unpack('<H', data)
	for i = 1, len do
		result[i] = love.data.unpack('<H', data, i * 2 + 1)
	end
end
Runtime: 1.39s

The concatenation is gone, but now it takes more than twice as long to complete. With more complex data structures and more calls to unpack, this overhead can quickly become considerable.

moonblob:

Code: Select all

local data = "\x10\x0000112233445566778899aabbccddeeff"
local r = BlobReader(data, '<')
local result = {}
for i = 1, 1e6 do
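	-- rewind to the start, read the u16 count, then bulk-read that many u16 values into result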
	r:rewind():array('u16', r:u16(), result)
end
Runtime: 0.13s

moonblob is 5x-10x faster here, and the code is a lot more readable imho: more succinct, no fiddling with strings, and no strange-looking format identifiers that you have to look up to understand their meaning (except the eyesore that is the endianness specifier).

You'll probably come up with an ingenious solution using unpack that proves me utterly wrong :) I can't believe there's no way to tell unpack to parse n values at once.