Update 2017-05-17. I no longer work for the company where this question originated, and do not have access to Delphi XEx. While I was there, the problem was solved by migrating to mixed FPC+GCC (Pascal+C), with NEON intrinsics for some routines where it made a difference. (FPC+GCC is highly recommended also because it enables using standard tools, particularly Valgrind.) If someone can demonstrate, with credible examples, how they are actually able to produce optimized ARM code from Delphi XEx, I'm happy to accept the answer.
Embarcadero's Delphi compilers use an LLVM backend to produce native ARM code for Android devices. I have large amounts of Pascal code that I need to compile into Android applications and I would like to know how to make Delphi generate more efficient code. Right now, I'm not even talking about advanced features like automatic SIMD optimizations, just about producing reasonable code. Surely there must be a way to pass parameters to the LLVM side, or somehow affect the result? Usually, any compiler will have many options to affect code compilation and optimization, but Delphi's ARM targets seem to be just "optimization on/off" and that's it.
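LLVM should be able to produce reasonably tight and sensible code, but Delphi seems to be using its facilities in a bizarre manner. Delphi leans very heavily on the stack, and it generally uses only the processor's registers r0-r3 as temporaries. Perhaps craziest of all, it seems to load normal 32-bit integers with four 1-byte load operations. How can I make Delphi produce better ARM code, without the byte-by-byte hassle it produces for Android?
At first I thought the byte-by-byte loading was for swapping byte order from big-endian, but that is not the case; it really is just loading a 32-bit number with 4 single-byte loads.* It might be doing this to load the full 32 bits without performing an unaligned word-sized memory load. (Whether it SHOULD avoid that is another matter, which would hint that the whole thing is a compiler bug.)*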
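To make the unaligned-load guess concrete, here is a minimal C sketch of what such a byte-wise load computes. It mirrors the `ldrb`/`orr` sequence in the Android disassembly below; the helper names `load_u32_bytewise` and `load_u32_word` are hypothetical and exist only for illustration. It says nothing about why Delphi chooses this sequence.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build a 32-bit value from four single-byte loads,
   combining the bytes in little-endian order (byte 0 is least significant).
   This works for any address, aligned or not. */
uint32_t load_u32_bytewise(const unsigned char *p)
{
    uint32_t lo = (uint32_t)p[0] | ((uint32_t)p[1] << 8);  /* bytes 0 and 1 */
    uint32_t hi = (uint32_t)p[2] | ((uint32_t)p[3] << 8);  /* bytes 2 and 3 */
    return lo | (hi << 16);
}

/* Hypothetical helper: the same read expressed as one word-sized copy.
   memcpy keeps this well defined in C even for unaligned addresses;
   the result uses the host's native byte order. */
uint32_t load_u32_word(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On a little-endian target such as ARM Android, both helpers return the same value for the same address, so the byte-wise version only makes sense if the compiler believes the address may not be word-aligned.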
Let's look at this simple function:
function ReadInteger(APInteger : PInteger) : Integer;
begin
  Result := APInteger^;
end;
Even with optimizations switched on, Delphi XE7 with Update Pack 1, as well as XE6, produces the following ARM assembly code for that function:
Disassembly of section .text._ZN16Uarmcodetestform11ReadIntegerEPi:
00000000 <_ZN16Uarmcodetestform11ReadIntegerEPi>:
0: b580 push {r7, lr}
2: 466f mov r7, sp
4: b083 sub sp, #12
6: 9002 str r0, [sp, #8]
8: 78c1 ldrb r1, [r0, #3]
a: 7882 ldrb r2, [r0, #2]
c: ea42 2101 orr.w r1, r2, r1, lsl #8
10: 7842 ldrb r2, [r0, #1]
12: 7803 ldrb r3, [r0, #0]
14: ea43 2202 orr.w r2, r3, r2, lsl #8
18: ea42 4101 orr.w r1, r2, r1, lsl #16
1c: 9101 str r1, [sp, #4]
1e: 9000 str r0, [sp, #0]
20: 4608 mov r0, r1
22: b003 add sp, #12
24: bd80 pop {r7, pc}
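Just count the number of instructions and memory accesses Delphi needs for that. And constructing a 32-bit integer from 4 single-byte loads... If I change the function slightly and use a var parameter instead of a pointer, the result is slightly less convoluted: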
Disassembly of section .text._ZN16Uarmcodetestform14ReadIntegerVarERi:
00000000 <_ZN16Uarmcodetestform14ReadIntegerVarERi>:
0: b580 push {r7, lr}
2: 466f mov r7, sp
4: b083 sub sp, #12
6: 9002 str r0, [sp, #8]
8: 6801 ldr r1, [r0, #0]
a: 9101 str r1, [sp, #4]
c: 9000 str r0, [sp, #0]
e: 4608 mov r0, r1
10: b003 add sp, #12
12: bd80 pop {r7, pc}
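I won't include the disassembly here, but for iOS, Delphi produces identical code for the pointer and var parameter versions, and they are almost but not exactly the same as the Android var parameter version. Edit: to clarify, the byte-by-byte loading happens only on Android. And only on Android do the pointer and var parameter versions differ from each other. On iOS both versions generate exactly the same code.
For comparison, here is what FPC 2.7.1 (SVN trunk version from March 2014) makes of the function at optimization level -O2. The pointer and var parameter versions are exactly the same.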
Disassembly of section .text.n_p$armcodetest_$$_readinteger$pinteger$$longint:
00000000 <P$ARMCODETEST_$$_READINTEGER$PINTEGER$$LONGINT>:
0: 6800 ldr r0, [r0, #0]
2: 46f7 mov pc, lr
I also tested an equivalent C function with the C compiler that comes with the Android NDK.
int ReadInteger(int *APInteger)
{
    return *APInteger;
}
And this compiles into essentially the same thing FPC produced:
Disassembly of section .text._Z11ReadIntegerPi:
00000000 <_Z11ReadIntegerPi>:
0: 6800 ldr r0, [r0, #0]
2: 4770 bx lr
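For anyone who wants to experiment with the alignment hypothesis on the C side, the following is a rough sketch, assuming GCC or Clang, since it relies on their `aligned` type attribute. The names `unaligned_i32`, `read_aligned` and `read_unaligned` are made up for this example. It only shows how an alignment assumption can be expressed in C; it is not a claim about what Delphi passes to LLVM.

```c
#include <stdint.h>

/* GCC/Clang extension (an assumption of this sketch): a typedef whose
   alignment is lowered to 1, so reads through it must not assume a
   4-byte-aligned address. */
typedef int32_t unaligned_i32 __attribute__((aligned(1)));

/* Natural alignment: the compiler may assume p is 4-byte aligned and
   can use a single word load. */
int32_t read_aligned(const int32_t *p)
{
    return *p;
}

/* Alignment 1: the compiler has to pick a load sequence that is valid
   for any address. */
int32_t read_unaligned(const unaligned_i32 *p)
{
    return *p;
}
```

Comparing the disassembly of the two functions shows what the toolchain does when it cannot assume alignment. Where unaligned word loads are not permitted, compilers typically expand the second function into single-byte loads much like the Delphi listing above; where they are permitted, both functions usually compile to the same single `ldr`.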