Buy Percona ServicesBuy Now!

The Anatomy of LuaJIT Tables and What’s Special About Them

 | April 29, 2020 |  Posted In: Advanced Level, Tools

I don’t know about you, but I really like to get inside all sorts of systems. In this article, I’m going to tell you about the internals of Lua tables and special considerations for their use. Lua is my primary professional programming language, and if one wants to write good code, one needs at least to peek behind the curtain. If you are curious, follow me.

Lua has several implementations and several versions. In this article, I’m going to discuss mostly LuaJIT 2.1.0, which is used in Tarantool. Our version is a bit patched as compared to the authentic LuaJIT, but the differences don’t impact tables.

There is another good presentation about tables in PUC-Rio implementation, if you’re interested. You can also find this on slideshare.net.

Overview

Tables in Lua are the only composite data type designed to suit any purpose. A table can be used both as a data array (if the keys are integer-valued) and as a key-value repository. Values of any type can be used as keys (except for nil). Scientifically speaking, a table is an associative array.

Anatomy of tables

In LuaJIT sources, a table is represented by the following structure (I’ve omitted a part of fields which are not relevant here for simplicity):

A table has two parts: array and hashmap. Both are represented by continuous storage areas. I was about to describe the logic LuaJIT is guided by when new elements are inserted, but this a rather sophisticated algorithm. Moreover, as a Lua developer, I have no need to speculate about what a table looks like on the inside. There’s something that is more important: two tables may contain the same keys and values, but differ in terms of internal representation. This internal representation will affect behaviors of some functions, which I’m going to discuss here.

For demonstration, I took LuaJIT source code and patched the tostring() method a bit so that it would print out the size and all the contents of C structure to stdout using printf().

What the patch is about

The most complete version of code from the publication, including all the experiments, can be found on GitHub. Conceptually, this is what the code looks like:

Let me show how this works:

We’ve created an empty table, and LuaJIT has allocated storage space for 0 array elements and 1 element in hashmap, to which is placed nil key and nil value (i.e. nothing). Now let’s try to populate this table:

String keys are added to hashmap as expected. Here the first reason becomes visible as to why the internal representation of tables may differ: hash collisions. To resolve hash collisions, LuaJIT uses a hybrid of open addressing and separate chaining methods (thanks to Igor Munkin for clarification).

Depending on the order in which elements are added to the table, their order during integration will differ. For demonstration, I’ve set up one more auxiliary function traverse(), which runs for cycle through the table and prints out its contents.

Implementation of traverse

A complete version of the code can still be found on GitHub. For brevity, I’m going to demonstrate only its operating principle here.

One more interesting fact about the internals of tables: the deletion of a value does not lead to deletion of a key. Sounds like Captain Obvious, but it’s true.

This is done for a reason, to make it possible to delete values and not disturb iteration order. What is strongly discouraged is to add new keys to the table within the cycle as reallocation or re-hashing may occur, and then iteration will bust.

An additional note from Igor Munkin: the reason is lookup in general, not iteration. Collisions are resolved by the chain method. Generally, when the main node is deleted for the searched key, the link to the colliding element needs to be reassigned to the predecessor (if present) of this main node. Task O(n) to search for it needs to be performed when deleting. The cost of a search does not worsen when a dead node is present on the way to collision resolution. See also this reference.

Arrays

OK, iteration for string keys is clear, no more questions. Now let’s talk about the second half of the table — the array.

Lua is often made fun of because indexation in this language starts with one. Interestingly, LuaJIT allocates space to store the null element anyway. This makes it easier for it to avoid multiple addition / deduction of one.

If it appears to you that arrays don’t have any surprises, then I’m here to disappoint you (or to make you happy) — they do:

I’ve changed syntax a bit, so one element gets to the array, and another gets to the hashmap. So consider yourself warned: it’s no use taking guesses about internal representation as your guess can turn out to be wrong at any time.

FAQ: WHAT?

Answer: the LuaJIT interpreter, just like regular Lua, generates a different bytecode in these two cases.

In the first case, space is allocated to place two array elements, which is logical.

In the second case, one element in the array and one in the hashmap.

But still, my primary concern as a developer is the correctness of my code. If LuaJIT is comfortable with presenting tables differently, it has the right to do it this way. For me, the main thing is that it is done seamlessly. So I suggest that we go over functions and spell out the expectations.

Iterator pairs()

Iterator pairs() has already been mentioned here earlier. It does not guarantee the iteration order. Even in an array. From the inside of LuaJIT, pairs() runs in sequence, first over the array, then over the hashmap. So if a numerical key gets to hashmap some way, then iteration order will be disturbed:

FAQ: WHAT?

Answer: Function table.new(narr, nrec) pre-allocates the required storage space — this is the binding of the standard function lua_createtable(L, a, h) from Lua C API. If the table size is known beforehand (e.g. in the case of copying), this can be used to save on reallocations when the table is subsequently populated.

This trick works exclusively due to pre-allocation. Another element being added to the table and the following re-hashing (which is a rather expensive and complex operation) will totally remove the veil of mystery from the table:

By no means do I want to sound paranoid, and I consider this case highly implausible. In 99.9% of instances, “arrays” in Lua will really be represented by arrays, and the iteration order will be sequential.

Table length table.getn()

Much more often, errors occur because of array length being calculated incorrectly. The main thing one needs to know about it is this definition from Lua specification:

That is to say, not every table has the property of length. Surprise — undefined behavior can occur in Lua. Tables with “holes” simply don’t have such a property. In the LuaJIT implementation, function lj_tab_len is commented as follows:

LuaJIT searches for a “boundary” in the table. If there are missing values in the array, and thus several boundaries exist, then, depending on the circumstances, LuaJIT will be able to find any of those — and will be right:

FAQ: WHAT?

Answer: These two tables differ in terms of internal representation.

LuaJIT searches for a boundary with a binary search, and it first checks the last element in the array. If it exists, the search proceeds to the hashmap, otherwise it continues in the array.

And this definitely does not sound paranoid as there have been such errors in the past. Table length is implicitly used in other functions, so undefined behavior is something you don’t want to play with. To be fair, it should be noted that in arrays without holes, the behavior is strictly determined and does not depend on internal representation.

Table.sort() sorting

Table sorting always works within the range from 1 to #t, and there is no way to influence this. So it is helpful to check input values in those functions which deal with a table as an array:

Lua allows length to exist though, even when string keys are present, but as far as we are talking about typical contracts of functions, hybrid tables rarely work.

Pack/unpack

Where do those arrays with holes really come from? A rare weird case, as it might seem. But nope, there are dangers lurking every step of the way:

The multiple periods here are not an ellipsis but rather a designation of a variable number of function arguments in Lua.

Then someone calls this function with arguments vararg(nil, “err”), and here you are: your function handles a holed array. If one calls unpack(t) after this, then the tail may fall off (or it may not, depends on luck, because this is UB).

The Lua specification reads:

To avoid the inadvertent loss of tail, instead of implicit unpack(t, 1, #t) always write explicitly unpack(t, 1, n). But where to take this n from? The answer depends on what kind of table you’ve got. In the case of varargs, you can use table.pack() method, which returns a hybrid:

In LuaJIT (and in Tarantool), the table.pack function is inaccessible by default, but it can be enabled with flag of -DLUAJIT_ENABLE_LUA52COMPAT compiler. But it is even easier to implement this functionality manually:

This is where black magic comes into play — select(‘#’, …) — but a detailed description of its mechanism simply would not fit into this article. I’d just give you a tip: this has to do with Lua stack — the mechanism which provides an interface between Lua and C (Lua C API). Meanwhile, we’ve got to move on.

Iterator ipairs()

This one is a good operator, predictable enough. ipairs should be thought of as a shortened version of such while cycle as:

No internal representation differences can affect it, and “holes” don’t lead to undefined behavior — iteration simply stops.

Pitfalls of FFI

An attentive reader could have noticed that I have already used twice a strange expression type(x) ~= ‘nil’. Why not just x == nil? This is because LuaJIT, unlike PUC-Rio Lua, has a magic type cdata:

This is a notorious pitfall in Tarantool named box.NULL. Note that condition if NULL is interpreted as true (unlike if nil), despite the fact that NULL == nil.

Another pitfall in LuaJIT is the use of FFI types as a table key. This issue is not so frequently discussed in Lua manuals because there are no FFI types in the vanilla version of Lua. But LuaJIT can surprise you and backfire. Here’s what’s written in the manual:

Simply put, for cdata-keys, hash is counted from the pointer (i.e. void*), so they show unpredictable behavior and should be avoided in practice at all cost. A joke from Tarantool life:

I hope no one in one’s right mind would write things like this in code, but cdata is not infrequent, and this has occurred in the past. As an example, consider uuid and clock.time64() from Tarantool. Or big values stored in spaces in unsigned format.

As long as we are on this subject, note that it’s possible to obtain unexpected results without using too, though I’ve never seen this in practice:

Summary

  • Table length operator #t is only defined for arrays without holes. All the rest is Undefined Behavior.
  • Function ipairs() is iterated as while type(t[i]) ~= ‘nil’ — not for all keys, but, on the plus side, in a predictable way, and the order is guaranteed.
  • Function pairs() is iterated for all keys, but the iteration order is affected by the internal representation of table.
  • Functions unpack, table.sort, table.insert, and table.remove, once called with default arguments, hold undefined behavior in them due to implicit #t.
  • The use of “strange” (ffi) values in keys affords you plenty of ways to shoot yourself in the foot. They should be avoided.


The content in this blog is provided in good faith by members of the open source community. Percona has not edited or tested the technical content. Views expressed are the authors’ own. When using the advice from this or any other online resource test ideas before applying them to your production systems, and always secure a working back up.

Yaroslav Dynnikov

Tarantool Solution Engineer and Software Developer. I have used Lua since 2007. For the last three years this has been my primary programming language professionally. I used it for video streaming business logic and now within Tarantool to develop an in-memory database and a Lua application server.

Leave a Reply