This recipe shows you how to construct a sequence of ARM instructions to multiply by a constant.
For some applications multiply is used extensively, so it is important to make the application run as fast as possible. For instance, most DSP (Digital Signal Processing) applications perform a lot of multiplication.
In many cases where a multiply is used, one of the values is a constant (eg. weeks*7). A naive programmer would assume that the only way to calculate this would be to use the MUL instruction - but there is an alternative…
This recipe demonstrates how to improve the speed of multiply-by-constant by using a sequence of arithmetic instructions instead of the general-purpose multiplier.
Throughout this recipe, registers are referred to using register names (eg. Rd, Rm, Rs), but you should use only register names which have been declared using the RN directive (eg. a1, r4 etc.) in assembler source code. This recipe does not refer to any example programs; it should be viewed as an explanation of the multiply-by-constant technique.
MUL has the following syntax:
MUL Rd, Rm, Rs
The timing of this instruction depends on the value in Rs. The ARM6 datasheet specifies that for Rs between 2^(2m-3) and 2^(2m-1)-1 inclusive takes 1S + mI cycles. For more details on the multiplier, see ARM6 multiplier performance issues. There is, of course, no guarantee that MUL will not be implemented differently (possibly faster) in the future…
When multiplying by a constant value, it is possible to replace the general multiply with a fixed sequence of adds and subtracts which have the same effect. For instance, multiply by 5 could be achieved using a single instruction:
ADD Rd, Rm, Rm, LSL #2 ; Rd = Rm + (Rm * 4) = Rm * 5
This ADD version is obviously better than the MUL version below:
MOV Rs, #5 MUL Rd, Rm, Rs
The 'cost' of the general multiply includes the instructions needed to load the constant into a register (up to 4 may be needed, or an LDR from a literal pool) as well as the multiply itself.
The difficulty in using a sequence of arithmetic instructions is that the constant must be decomposed into a set of operations which can be done by one instruction each. Consider multiply by 105:
This could be achieved by decomposing thus:
105 == 128 - 13 == 128 - 16 + 3 == 128 - 16 + 2 + 1 ADD Rd, Rm, Rm, LSL #1 ; Rd = Rm*3 SUB Rd, Rd, Rm, LSL #4 ; Rd = Rm*3 - Rm*16 ADD Rd, Rd, Rm, LSL #7 ; Rd = Rm*3 - Rm*16 + Rm*128
Or, decomposing differently:
105 == 15 * 7 == (16 - 1) * (8 - 1) RSB Rt, Rm, Rm, LSL #4 ; Rt = Rm*15 (tmp reg) RSB Rd, Rt, Rt, LSL #3 ; Rd = Rt*7 = Rm*105
The second method is the optimal solution (fairly easy to find for small values such as 105). However, the problem of finding the optimum becomes much more difficult for larger constant values. A program can be written to search exhaustively for the optimum, but it may take a long time to execute. There are no known algorithms which solve this problem quickly.
Temporary registers can be used to store intermediate results to help achieve the shortest sequence. For a large constant, more than one temporary may be needed, otherwise the sequence will be longer.
The C-compiler restricts the amount of searching it performs in order to minimise the impact on compilation time. The current version of armcc has a cut-off so that it uses a normal MUL if the number of instructions used in the multiply-by-constant sequence exceeds some number N. This is to avoid the sequence becoming too long.
When writing a speed-critical ARM assembler program, it is a good idea to code it in C first (to check the algorithm) before converting it to hand tuned assembler. It is helpful to see the ARM code which the compiler generates as a starting point for your work.
Invoking armcc with the -S flag will generate an assembly file instead of an object file. For example, consider the following simple C code:
int mulby105( int num ) { return num * 105; }
Compile this using:
armcc -li -S mulby105.c
Now, examine the file mulby105.s which has been created:
; generated by Norcroft ARM C vsn 4.41 (Advanced RISC Machines) AREA |C$$code|, CODE, READONLY |x$codeseg| EXPORT mulby105 mulby105 RSB a1,a1,a1,LSL #4 RSB a1,a1,a1,LSL #3 MOV pc,lr AREA |C$$data|,DATA |x$dataseg| END
Notice that the compiler has found the short multiply-by-constant sequence.
To evaluate the speed gains from using multiply-by-constant, consider multiplying by 11585 (which is 8192*sqr2) as an example:
A normal multiply consists of:
MOV Rs, #0x2D << 8 ; load constant ORR Rs, Rs, #0x41 ; load constant, now Rs = 11585 MUL Rd, Rm, Rs ; do the multiply
The load-a-constant stage may take up to four 4 instructions (in this case 2) or an LDR,= and the multiply takes 1 instruction fetch plus a number of internal cycles to calculate the multiply (on ARM6 based processors) .
The optimal multiply-by-constant sequence consists of:
ADD Rd, Rm, Rm, LSL #1 ; Rd = Rm*3 RSB Rd, Rd, Rd, LSL #4 ; Rd = Rd*15 = Rm*45 ADD Rd, Rm, Rd, LSL #8 ; Rd = Rm + Rd*256 = Rm*11521 ADD Rd, Rd, Rm, LSL #6 ; Rd = Rd + Rm*64 = Rm*11585
This is just 4 data processing instructions.
----------------------------------------------------- Method |Cycles ----------------------------------------------------- Normal multiply |3 instructions + MUL internal |cycles ----------------------------------------------------- Multiply-by-constant |4 instructions -----------------------------------------------------
Considering the ARM6 family, the 2-bit Booth's Multiplier used by MUL takes a number of I-cycles depending on the value in Rs (in this case m=8, as Rs lies between 8192 and 32767).
Hence multiply-by-constant looks to be the winner in this case.
An instruction fetch is an external memory S-cycle on the ARM60, or a cache F-cycle (if there is a cache hit) on cached processors like the ARM610.
With slow memory systems and non-cached processors, I-cycles can be much faster than other cycles because they are internal to the ARM core. This means that the general multiply can sometimes be the fastest option (for large constants where an efficient solution cannot be found)- it should also use less memory. If the load-a-constant stage could be moved outside a loop, the balance would tip further in favour of the general multiply as there is only the MUL to execute.
---------------------------------------------------------- Method |Cycles on ARM60|Cycles on ARM610 ---------------------------------------------------------- Normal multiply |3S + 8I |11F ---------------------------------------------------------- Multiply-by-const|4S |4F ant | | ----------------------------------------------------------