not logged in | [Login]
The Allocation Sinking and Store Sinking Optimization was added to LuaJIT 2.0 in July 2012. This page documents the rationale for this optimization, the algorithm used and its implementation. The resulting performance is discussed and compared to other languages, too.
The research and the development of this optimization has been sponsored by an anonymous corporate sponsor in 2012.
Although this optimization is based on other well-known techniques, the combination may be considered innovative. I (Mike Pall) hereby donate the related intellectual property to the public domain. If you find it useful, please use it and share it!
Avoiding temporary allocations is an important optimization for high-level languages. Allocations are expensive and cause additional overhead since they need to be garbage collected later. The cost can be reduced with considerable effort (cf. the New Garbage Collector), but an eliminated allocation is always the cheapest option.
LuaJIT already eliminates many temporary allocations with multiple techniques: e.g. floating-point numbers aren't boxed and the JIT compiler eliminates allocations for most immutable objects. However, many allocations remain, especially in OO-heavy workloads.
The traditional techniques to avoid temporary allocations are escape analysis and scalar replacement of aggregates (SRA). Escape analysis verifies that an object doesn't escape through any code path. But code generated for dynamic languages has many fallback paths, e.g. to handle dynamic typing. Sadly, this means that almost any object inevitably escapes through at least one of these uncommon code paths. This makes the traditional techniques ineffective for dynamic languages.
For explanatory purposes, the following synthetic example has an explicit code path where the allocation escapes to. Of course, the same underlying logic applies for implicit code paths caused by the dynamic language semantics.
local x, z = 0, nil
for i=1,100 do
local t = {i}
if i == 90 then
z = t
end
x = x + t[1]
end
print(x, z[1])
Although the uncommon code path for i == 90
is only taken once,
traditional escape analysis would fail, because t
does escape to
z
at one point in the control flow.
[Granted, no sane programmer would write such code. But it may arise indirectly as a consequence of higher language abstractions, e.g. when operating on aggregates. See below for some practical examples.]
Let's take a look at the IR and the snapshots generated by LuaJIT with
the sinking optimization disabled (-O-sink
). Only the looping part of
the trace is shown:
.... SNAP #4 [ ---- 0009 ---- 0010 ---- ---- 0010 ]
0012 ------ LOOP ------------
0013 > tab TNEW #3 #0
0014 p32 FLOAD 0013 tab.array
0015 p32 AREF 0014 +1
0016 num CONV 0010 num.int
0017 num ASTORE 0015 0016
.... SNAP #5 [ ---- 0009 ---- 0010 ---- ---- 0010 0013 ]
0018 > int NE 0010 +90
0019 + num ADD 0016 0009
0020 + int ADD 0010 +1
.... SNAP #6 [ ---- 0019 ---- ]
0021 > int LE 0020 +100
0022 int PHI 0010 0020
0023 num PHI 0009 0019
As we can see, the table allocation TNEW
in instruction 0013 escapes
to the snapshot for side exit #5. Much more interesting is to see that
the load t[1]
has been eliminated. The stored value i
(the left PHI
operand at 0010, not shown) has been forwarded through the
integer-to-number conversion (0016) and then added to x
in instruction
0019. x
itself escapes through side exit #6, but the allocation is
dead before the next loop iteration.
[Note: the code path starting at side exit #5 is never compiled, since the side exit is not taken often enough. What happens when a side trace is compiled is shown below.]
The solution to the problem discussed above starts with an observation: there are no loads left for the temporary object! Thanks to advanced alias analysis, store-to-load-forwarding in LuaJIT is generally quite effective at removing all loads from temporary objects: they are simply replaced with the corresponding stored values.
The only references to a temporary allocation are then the stores
themselves (via the AREF
and the FLOAD
in the above example).
However, these stores are pointless if the allocation is not used later
on in the fast code path. The allocation is thus only created for
consistency with the other code paths, but performs no real function for
the fast path.
The basic idea is now to move the temporary allocation where it's really needed. This is 'code motion' in compiler terminology. Since we want to move it to a side path of the code, a more precise term is 'sinking'.
Here's the above example again, but with store-to-load-forwarding and sinking applied by hand on the right side:
local x, z = 0, nil | local x, z = 0, nil
for i=1,100 do | for i=1,100 do
local t = {i} | ---.
if i == 90 then | if i == 90 then |
| local t = {i} <--´
z = t | z = t |
end | end |
x = x + t[1] | x = x + i <--´
end | end
print(x, z[1]) | print(x, z[1])
You can easily verify that both produce the exact same output. But the IR for the hand-optimized Lua code is very different:
.... SNAP #4 [ ---- 0005 ---- 0006 ---- ---- 0006 ]
0008 ------ LOOP ------------
.... SNAP #5 [ ---- 0005 ---- 0006 ---- ---- 0006 ]
0009 > int NE 0006 +90
0010 num CONV 0006 num.int
0011 + num ADD 0010 0005
0012 + int ADD 0006 +1
.... SNAP #6 [ ---- 0011 ---- ]
0013 > int LE 0012 +100
0014 int PHI 0006 0012
0015 num PHI 0005 0011
The allocation and the related store is gone from the fast path of the loop, which means it'll run much faster!
The main innovation of the approach described above is to combine store-to-load-forwarding with store sinking and allocation sinking. This is highly effective in avoiding temporary allocations in the fast paths, even under the presence of many uncommon paths where the temporary object may escape to.
The combination of the two optimizations has the same effect as scalar replacement of aggregates (SRA), but it's applicable in more contexts. This approach is most effective for dynamic languages, but may be successfully applied elsewhere, when the classic techniques fail.
The goal is to automatically perform the above optimization, whenever feasible and profitable.
Note this optimization completely eliminates the allocation from the fast path. It does not turn a heap allocation into a stack allocation. The stored values are still live, usually in registers and sometimes in spill slots. However, the layout of spill slots does not need to match the layout of an allocated object (nor is it suitable to be passed to a C function expecting such an object).
The above example is greatly simplified, of course. In practice, an allocation may escape to more than one snapshot or to more than one stack slot in the same snapshot. There may be stores both before and after a snapshot, which may even overwrite each other's values. And, most importantly, temporary allocations may be held in loop-carried variables, which further complicates the analysis.
Sinking an allocation that escapes to a snapshot means extra work for the data-driven exit handler: the allocation needs to be unsunk, i.e. created on-the-fly and the related stores need to be performed, too.
Similarly, a side trace may need to be compiled corresponding to a snapshot with a sunk allocation: the allocation and the related stores need to be replayed in the side trace. For a truly general solution, a sunk allocation that's replayed in a side trace may be sunk again, of course.
The same approach needs to work for all types of allocations, not just for Lua tables. Temporary allocations of mutable and immutable FFI objects are common, too.
The exact algorithm used for this optimization is explained in the implementation section below. For better understanding, this section shows the results of the optimization for a range of examples.
Here's the motivating example again plus the IR and the machine code for the loop part with the sinking optimization turned on:
local x, z = 0, nil
for i=1,100 do
local t = {i}
if i == 90 then
z = t
end
x = x + t[1]
end
print(x, z[1])
.... SNAP #4 [ ---- 0009 ---- 0010 ---- ---- 0010 ]
0012 ------------ LOOP ------------
0013 {sink} tab TNEW #3 #0
0014 p32 FLOAD 0013 tab.array
0015 p32 AREF 0014 +1
0016 xmm6 num CONV 0010 num.int
0017 {0013} num ASTORE 0015 0016
.... SNAP #5 [ ---- 0009 ---- 0010 ---- ---- 0010 0013 ]
0018 > int NE 0010 +90
0019 xmm7 + num ADD 0016 0009
0020 rbp + int ADD 0010 +1
.... SNAP #6 [ ---- 0019 ---- ]
0021 > int LE 0020 +100
0022 rbp int PHI 0010 0020
0023 xmm7 num PHI 0009 0019
->LOOP:
394cffd0 xorps xmm6, xmm6
394cffd3 cvtsi2sd xmm6, ebp
394cffd7 cmp ebp, +0x5a
394cffda jz 0x394c0024 ->5
394cffe0 addsd xmm7, xmm6
394cffe4 add ebp, +0x01
394cffe7 cmp ebp, +0x64
394cffea jle 0x394cffd0 ->LOOP
394cffec jmp 0x394c0028 ->6
As you can see, the allocation and the store have been sunk. No register
or spill slot is allocated to them: it shows {sink}
or the reference
of the allocation corresponding to the store instead. The machine code
for the loop contains only the conversion, the i == 90
check, the
summation of x
and the loop increment plus boundary check.
[Note: the xorps
avoids a partial-register-write stall for the SSE2
conversion instruction.]
Here's an example that shows how a sunk allocation is sunk again in a side trace:
local z = nil
for i=1,200 do
local t = {i}
if i > 100 then
if i == 190 then z = t end
end
end
print(z[1])
Here's the IR for the first trace (the }
shows sunk instructions):
.... SNAP #0 [ ---- ]
0001 int SLOAD #2 CI
0002 } tab TNEW #3 #0
0003 p32 FLOAD 0002 tab.array
0004 p32 AREF 0003 +1
0005 num CONV 0001 num.int
0006 } num ASTORE 0004 0005
.... SNAP #1 [ ---- ---- 0001 ---- ---- 0001 0002 ]
0007 > int LE 0001 +100
0008 + int ADD 0001 +1
.... SNAP #2 [ ---- ---- ]
0009 > int LE 0008 +200
.... SNAP #3 [ ---- ---- 0008 ---- ---- 0008 ]
0010 ------ LOOP ------------
0011 } tab TNEW #3 #0
0012 p32 FLOAD 0011 tab.array
0013 p32 AREF 0012 +1
0014 num CONV 0008 num.int
0015 } num ASTORE 0013 0014
.... SNAP #4 [ ---- ---- 0008 ---- ---- 0008 0011 ]
0016 > int LE 0008 +100
0017 + int ADD 0008 +1
.... SNAP #5 [ ---- ---- ]
0018 > int LE 0017 +200
0019 int PHI 0008 0017
A side trace is spawned, the sunk allocation and the sunk store is replayed before the first snapshot. Then it's sunk again:
0001 int SLOAD #2 PI
0002 } tab TNEW #3 #0
0003 p32 FLOAD 0002 tab.array
0004 p32 AREF 0003 +1
0005 num CONV 0001 num.int
0006 } num ASTORE 0004 0005
.... SNAP #0 [ ---- ---- 0001 ---- ---- 0001 0002 ]
0007 > nil GCSTEP
.... SNAP #1 [ ---- ---- 0001 ---- ---- ---- 0002 ]
0008 > int NE 0001 +190
0009 int ADD 0001 +1
.... SNAP #2 [ ---- ---- ]
0010 > int LE 0009 +200
0011 num CONV 0009 num.int
.... SNAP #3 [ ---- ---- 0011 ---- ---- 0011 ]
[An explicit GCSTEP
instruction is needed after the first snapshot to
prevent implicit GC steps before the first snapshot. It's really a no-op
here, since no allocations are performed.]
Here's a practical example of a higher language abstraction: a (simplified) 'point' class. It needs to create temporary objects as part of arithmetic operations on point objects.
First, the version for plain Lua tables:
local point
point = {
new = function(self, x, y)
return setmetatable({x=x, y=y}, self)
end,
__add = function(a, b)
return point:new(a.x + b.x, a.y + b.y)
end,
}
point.__index = point
local a, b = point:new(1.5, 2.5), point:new(3.25, 4.75)
for i=1,100000000 do a = (a + b) + b end
print(a.x, a.y)
It creates two temporary objects per iteration, but LuaJIT is able to optimize them away. Here's the machine code for the loop part:
->LOOP:
394cffe0 addsd xmm6, xmm1
394cffe4 addsd xmm7, xmm0
394cffe8 addsd xmm6, xmm1
394cffec addsd xmm7, xmm0
394cfff0 add ebp, +0x01
394cfff3 cmp ebp, 0x05f5e100
394cfff9 jle 0x394cffe0 ->LOOP
394cfffb jmp 0x394c0028 ->6
[Note: floating-point arithmetic is not associative. The compiler is not
allowed to transform (a+b)+b
into a+(b+b)
and then hoist out the
computation of b+b
. It really needs two FP additions per component for
a total of four SSE2 addsd
instructions per loop iteration.]
local ffi = require("ffi")
local point
point = ffi.metatype("struct { double x, y; }", {
__add = function(a, b)
return point(a.x + b.x, a.y + b.y)
end
})
local a, b = point(1.5, 2.5), point(3.25, 4.75)
for i=1,100000000 do a = (a + b) + b end
print(a.x, a.y)
The machine code inside the loop is identical to the above example. A similar example with immutable 'complex' cdata objects generates the same code, too.
For comparison, here's the same point class implemented in C++:
#include <iostream>
class Point {
public:
Point(double x, double y) : x(x), y(y) { }
Point operator+(const Point &b) const {
return Point(x+b.x, y+b.y);
}
double x, y;
};
int main()
{
int i;
Point a(1.5, 2.5), b(3.25, 4.75);
for (i = 0; i < 100000000; i++)
a = (a + b) + b;
std::cout << a.x << ' ' << a.y << '\n';
return 0;
}
Note that this code is passing aggregate values around and not actually allocating objects. However, the C++ compiler still has to perform SRA for optimal results. Not surprisingly, the machine code for the loop is basically the same as the code LuaJIT generates.
[We'll ignore auto-vectorization/auto-simdization for the moment.]
And here's the same point class implemented in Java:
public class Point {
public double x, y;
public Point(double x0, double y0) { x = x0; y = y0; }
public Point add(Point b) { return new Point(x + b.x, y + b.y); }
public static void main(String args[])
{
int i;
Point a = new Point(1.5, 2.5);
Point b = new Point(2.25, 4.75);
for (i = 0; i < 100000000; i++)
a = a.add(b).add(b);
System.out.println(a.x + " " + a.y);
}
}
JVM/Hotspot 1.7 is unable to eliminate the allocations. Adding the
option -XX:+DoEscapeAnalysis
doesn't change anything. Moving the loop
to a separate method or using an outer loop doesn't help either.
Here's the runtime for the point class in seconds (YMMV). Lower is better:
Time | Point object | VM/Compiler |
---|---|---|
140.0 | Lua table | Lua 5.1.5 |
26.9 | Lua table | LuaJIT 2.0 git HEAD -O-sink |
10.9 | FFI struct | LuaJIT 2.0 git HEAD -O-sink |
0.2 | Lua table | LuaJIT 2.0 git HEAD -O+sink |
0.2 | FFI struct | LuaJIT 2.0 git HEAD -O+sink |
0.2 | C++ class | GCC 4.4.3 -O2 (or -O3) |
1.2 | Java class | JVM/Hotspot 1.7.0_05 |
LuaJIT is around 700 times faster than plain Lua and it's the same speed as C++ -- for this example.
JVM 1.7 is unable to eliminate the allocations, but has a fast allocator and garbage collector. Still, the JVM is around 6 times slower than LuaJIT or C++ for this example.
Note: this cannot be extrapolated to other code, of course. And it does not lead to a generalizable statement about the relative performance of Lua, LuaJIT, Java or C++.
[Yes, a runtime of 0.2 seconds is too low for precise results. But for this simple example, scaling the loop iterations up yields consistent results. The code generated by LuaJIT and C++ really has the same performance.]
Different parts of the JIT-compiler are involved in handling allocation sinking and store sinking. The following subsections describe the required additions and changes to the implementation.
The SINK optimization pass is only invoked, if:
-O3
, which is the default.TNEW
, TDUP
, CNEW
or CNEWI
instruction.This pass traverses the IR of the current trace in two phases, very much like a classic mark & sweep garbage collection algorithm:
The mark phase is a single-pass backward propagation algorithm. Initially, the marks of all IR instructions are clear. Then the roots are marked (most of this happens on-the-fly) and the marks are propagated backwards. The following roots are considered:
PHI
instructions that do not
reference an allocation instruction or that reference two different
opcodes.CALLL
instructions.CALLS
or CALLXS
instructions.CNEWI
.A store instruction is only eligible, if all of these conditions are true:
Finally, the PHI references are iteratively remarked. If the left side and the right side have different marks or a different number of PHI values stored, then both sides are marked. This is repeated until the marks converge.
After this pass is complete, the allocations that are unmarked are considered safe to be sunk.
The sweep phase is a single-pass through the IR instructions:
PHI
instruction is tagged as sunk if it refers to unmarked
allocations (it's enough to check one side).NEWREF
is tagged as sunk if the corresponding
allocation is unmarked.All marks are cleared, too. Tagging works as follows:
NEWREF
and stores have the register field set to
RID_SINK (treated like no register has been allocated to it).NEWREF
have a zero spill slot field.The assembler backend translates the IR to machine code. It needs to know about sunk instructions to avoid generating machine code for them.
The first encountered guard that needs to add an exit for a snapshot calls a special routine that prepares the current snapshot. Every instruction producing an escaping value must have a register or spill slot allocated to it if it doesn't have one already.
This ensures that all values escaping through the snapshot are live at all exits. Processing is backwards, so any value live at the first encountered guard (highest IR instruction) is live at all other guards, up to the snapshot.
The snapshot preparation checks for sunk allocations and allocates the stored values instead. This involves searching for all stores related to an allocation. Only the store instructions between the allocation reference and the snapshot reference need to be considered.
Once this is done, the register for an allocation is changed from RID_SINK to RID_SUNK. This prevents extra invocations in case the allocation escapes to more than one stack slot or through another snapshot further up.
Additional logic tries to allocate a value for the input of an integer-to-number conversion instead of the conversion itself. This effectively sinks conversions and removes them from the fast path.
A sunk allocation instruction has its register set to RID_SINK or RID_SUNK and a zero (unused) spill slot. The result is thus considered unused by on-the-fly dead-code elimination (DCE). No code is emitted for a sunk allocation.
Unused, sunk conversions are eliminated by DCE and no code is emitted for them.
Store instructions and the NEWREF
instruction are not eligible for
DCE, so they need an explicit check for RID_SINK in their backend
handler to not emit any code.
Sunk PHI
instructions need the same check to avoid allocating a
register or spill slot for the referenced allocations.
Snapshots hold the execution state from the view of the bytecode interpreter at selected points in a trace. Sunk allocations and sunk conversions are not actually executed on-trace and thus do not generate a value. However, the stored values (or input values for conversions) are computed and can be used to restore or replay a snapshot.
A taken side exit that has no attached side trace (yet) needs to restore the Lua stack to a sane state. The snapshot corresponding to the exit has a list of Lua stack slots to be restored and their IR references. These can be used together with the register and spill slot assignments in the IR to restore the Lua stack from the exit state of the trace, which holds the current register and spill slot values.
Sunk allocations (that escape through a snapshot) have their register set to RID_SUNK. The allocation needs to be 'unsunk': the object needs to be allocated and the related stores need to be performed, as if the object had been allocated on-trace.
The snapshot allocation in the assembler backend makes sure that all sunk stores have live values at the exit. The actual values are restored via their IR reference and the exit state, too. Keys of sunk stores are always constant, so they can be reconstructed solely from the IR.
Care needs to be taken to de-duplicate allocation references. A sunk allocation may escape to more than one stack slot and all of them must refer to the same object.
The IR for a side trace needs to start with the instructions that link
to the parent values included in the snapshot. The SLOAD
instruction
with the flag IRSLOAD_PARENT provides parent links for values that
correspond to a Lua stack slot.
Sunk allocations, sunk stores and sunk conversions need to be replayed
instead. Sunk stores and conversions may reference values from the
parent that do not have a corresponding Lua stack slot. The PVAL
instruction provides the necessary parent links.
All parent links must be at the start of the trace, since their
registers and stack slots must be coalesced with the parent trace
atomically. Also, a PVAL
must not be used if there's a parent SLOAD
for the same reference. This causes some ordering constraints, which
mandates a multi-pass algorithm:
SLOAD
instructions for non-sunk slots from the
snapshot.PVAL
instructions for all dependent values from sunk
allocations or conversions, except if there's a parent SLOAD
instruction for the same reference.[Note: this sounds more expensive, than it is. This algorithm iterates a limited number of times -- usually less than five. Also, passes two and three are only performed when needed.]