Update 2017-05-17. I no longer work for the company where this question originated, and do not have access to Delphi XEx. While I was there, the problem was solved by migrating to mixed FPC+GCC (Pascal+C), with NEON intrinsics for some routines where it made a difference. (FPC+GCC is highly recommended also because it enables using standard tools, particularly Valgrind.) If someone can demonstrate, with credible examples, how they are actually able to produce optimized ARM code from Delphi XEx, I'm happy to accept the answer.
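To give an idea of what those intrinsic-based routines can look like (this is an illustrative sketch only; the actual routines from that project are not shown here, and the function name is made up), a small hot loop written with NEON intrinsics under GCC/Clang might be:

#include <arm_neon.h>
#include <stdint.h>

/* Illustrative only: adds two int32 arrays four lanes at a time.
   Assumes n is a multiple of 4 and that the target is built with
   NEON enabled (e.g. -mfpu=neon on ARMv7). */
void AddInt32Arrays(const int32_t *a, const int32_t *b, int32_t *dst, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);        /* load 4 lanes from a */
        int32x4_t vb = vld1q_s32(b + i);        /* load 4 lanes from b */
        vst1q_s32(dst + i, vaddq_s32(va, vb));  /* store a + b */
    }
}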


Embarcadero's Delphi compilers use an LLVM backend to produce native ARM code for Android devices. I have large amounts of Pascal code that I need to compile into Android applications and I would like to know how to make Delphi generate more efficient code. Right now, I'm not even talking about advanced features like automatic SIMD optimizations, just about producing reasonable code. Surely there must be a way to pass parameters to the LLVM side, or somehow affect the result? Usually, any compiler will have many options to affect code compilation and optimization, but Delphi's ARM targets seem to be just "optimization on/off" and that's it.

LLVM should be able to produce reasonably tight and sensible code, but it seems Delphi is using it in a strange way. Delphi is extremely fond of the stack, and it generally only uses the processor's registers r0-r3 as temporary variables. Perhaps craziest of all, it appears to load normal 32-bit integers as four 1-byte load operations. How can I make Delphi generate better ARM code, without the byte-by-byte handling it does for Android?

At first I thought the byte-by-byte loading was for swapping byte order from big-endian, but that is not the case: it really is just loading a 32-bit value with four single-byte loads. *It may be doing this to load the full 32 bits without performing an unaligned, word-sized memory load. (Whether it SHOULD avoid that is another matter; if it should, that would hint the whole thing is a compiler bug.)*
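To make the overhead concrete, here is a sketch in C (my own illustration, not code from the original project) of what the generated ARM code is effectively doing, compared to a plain word-sized read:

#include <stdint.h>

/* What the Delphi-generated code effectively does: rebuild the 32-bit
   value from four separate byte loads (little-endian order), merged
   with shifts and ORs. */
int32_t ReadIntegerByteWise(const uint8_t *p)
{
    return (int32_t)((uint32_t)p[0]
                   | (uint32_t)p[1] << 8
                   | (uint32_t)p[2] << 16
                   | (uint32_t)p[3] << 24);
}

/* What one would expect for a pointer the compiler may assume to be
   aligned: a single ldr instruction. */
int32_t ReadIntegerDirect(const int32_t *p)
{
    return *p;
}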

Let's look at this simple function:

function ReadInteger(APInteger : PInteger) : Integer;
begin
  Result := APInteger^;
end;

Even with optimization turned on, Delphi XE7 with update pack 1, as well as XE6, generates the following ARM assembly code for that function:

Disassembly of section .text._ZN16Uarmcodetestform11ReadIntegerEPi:

00000000 <_ZN16Uarmcodetestform11ReadIntegerEPi>:
   0:   b580        push    {r7, lr}
   2:   466f        mov r7, sp
   4:   b083        sub sp, #12
   6:   9002        str r0, [sp, #8]
   8:   78c1        ldrb    r1, [r0, #3]
   a:   7882        ldrb    r2, [r0, #2]
   c:   ea42 2101   orr.w   r1, r2, r1, lsl #8
  10:   7842        ldrb    r2, [r0, #1]
  12:   7803        ldrb    r3, [r0, #0]
  14:   ea43 2202   orr.w   r2, r3, r2, lsl #8
  18:   ea42 4101   orr.w   r1, r2, r1, lsl #16
  1c:   9101        str r1, [sp, #4]
  1e:   9000        str r0, [sp, #0]
  20:   4608        mov r0, r1
  22:   b003        add sp, #12
  24:   bd80        pop {r7, pc}

Just count the number of instructions and memory accesses Delphi needs for this: constructing a 32-bit integer from four single-byte loads... If I change the function slightly and use a var parameter instead of a pointer, the result is slightly less convoluted:

Disassembly of section .text._ZN16Uarmcodetestform14ReadIntegerVarERi:

00000000 <_ZN16Uarmcodetestform14ReadIntegerVarERi>:
   0:   b580        push    {r7, lr}
   2:   466f        mov r7, sp
   4:   b083        sub sp, #12
   6:   9002        str r0, [sp, #8]
   8:   6801        ldr r1, [r0, #0]
   a:   9101        str r1, [sp, #4]
   c:   9000        str r0, [sp, #0]
   e:   4608        mov r0, r1
  10:   b003        add sp, #12
  12:   bd80        pop {r7, pc}

I won't include the disassembly here, but for iOS Delphi produces identical code for the pointer and var parameter versions, and they are close to, but not exactly the same as, the Android var parameter version. Edit: to clarify, the byte-by-byte loading happens only on Android, and only on Android do the pointer and var parameter versions differ from each other. On iOS both versions generate exactly the same code.

For comparison, here is what FPC 2.7.1 (SVN trunk version from March 2014) makes of the function at optimization level -O2. The pointer and var parameter versions are exactly the same.

Disassembly of section .text.n_p$armcodetest_$$_readinteger$pinteger$$longint:

00000000 <P$ARMCODETEST_$$_READINTEGER$PINTEGER$$LONGINT>:

   0:   6800        ldr r0, [r0, #0]
   2:   46f7        mov pc, lr

I also tested an equivalent C function with the C compiler that ships with the Android NDK.

int ReadInteger(int *APInteger)
{
    return *APInteger;
}

That compiles to essentially the same thing FPC produced:

Disassembly of section .text._Z11ReadIntegerPi:

00000000 <_Z11ReadIntegerPi>:
   0:   6800        ldr r0, [r0, #0]
   2:   4770        bx  lr

We are looking into this. In short, it comes down to the potential misalignment (to the 32-bit boundary) of the Integer referenced by the pointer. It will take a bit more time to get all the answers... and a plan for addressing this. Marco Cantù, moderator on Delphi Developers

Also see Why are Delphi's zlib and zip libraries so slow under 64 bit?, since the Win64 libraries were shipped built without optimization.


In the QP report RSP-9922, Bad ARM code produced by the compiler, $O directive ignored?, Marco added the following explanation:

There are multiple issues here: As indicated, optimization settings apply only to entire unit files and not to individual functions. Simply put, turning optimization on and off in the same file will have no effect. Furthermore, simply having "Debug information" enabled turns off optimization. Thus, when one is debugging, explicitly turning on optimizations will have no effect. Consequently, the CPU view in the IDE will not be able to display a disassembled view of optimized code. Third, loading non-aligned 64bit data is not safe and does result in errors, hence the separate 4 one byte operations that are needed in given scenarios.
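For comparison, here is a hedged sketch (my own example, not from the report) of how potentially misaligned data is usually expressed in C: reading through memcpy lets the compiler choose between byte loads and a single unaligned word load, depending on the target's alignment rules (for example GCC's -munaligned-access / -mno-unaligned-access options on ARM):

#include <stdint.h>
#include <string.h>

/* Hypothetical example: read a 32-bit value from a pointer that may not be
   4-byte aligned. memcpy expresses the intent portably; the compiler lowers
   it either to one word load (if unaligned access is permitted on the target)
   or to byte loads, which is the trade-off described above. */
int32_t ReadIntegerMaybeUnaligned(const void *p)
{
    int32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}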