This section is not intended to be a general guide to the writing of code generators, but it seems worthwhile to highlight some of the optimisations that appear particularly relevant to the ARM and to this standard.
In order to make effective use of the APCS, compilers must compile code a procedure at a time. Line at a time compilation is insufficient.
In the case of leaf functions, much of the standard entry sequence can be omitted. In very small functions, such as those that frequently occur implementing data abstractions, the function-call overhead can be tiny. Consider:
typedef struct {...; int a; ...} foo; int foo_get_a(foo* f) {return(f-a);}
The function foo_get_a can compile to just:
LDR a1, [a1, #aOffset] MOV pc, lr ; MOVS in 26-bit modes
In functions with a conditional as the top level statement, in which one or other arm of the conditional is leaf (calls no functions), the formation of a stack frame can be delayed. For example, the C function:
int get(Stream *s { if (s->cnt > 0) { --s; return *(s-p++); } else { ... } }
… could be compiled (non-reentrantly) into:
get MOV a3, a1 ; if (s->cnt > 0) LDR a2, [a3, #cntOffset] CMPS a2, #0 ; try the fast case,frameless and heavily conditionalized SUBGT a2, a2, #1 STRGT a2, [a3, #cntOffset] LDRGT a2, [a3, #pOffset] LDRBGT a1, [a2], #1 STRGT a2, [a3, #pOffset] MOVGT pc, lr ; else, form a stack frame and handle the rest as normal code. MOV ip, sp STMDB sp!, {v1-v3, fp, ip, lr, pc} CMP sp, sl BLLT |__rt_stkovf_split_small| ... LDMEA fp, {v1-v3, fp, sp, pc}
This is only worthwhile if the test can be compiled using any spare of a1-a4 and ip, as scratch registers. This technique can significantly accelerate certain speed-critical functions, such as read and write character.
Finally, it is often worth applying the tail call optimisation, especially to procedures which need to save no registers. For example:
extern void *malloc(size_t n) { return primitive_alloc(NOTGCABLEBIT, BYTESTOWORDS(n)); }
…is compiled (non-reentrantly) by the C compiler into:
malloc ADD a1, a1, #3 ; 1S MOV a2, a1, LSR #2 ; 1S - BITESTOWORDS(n) MOV a1, #1073741824 ; 1S - NOTGCABLEBIT B primitive_alloc ; 1N+2S = 4S
In this case, the optimisation avoids saving and restoring the call-frame registers and saves 5 instructions (and many cycles-17 S cycles on an uncached ARM with N=2S).