什么是尾部调用优化?

很简单，什么是尾部调用优化?

更具体地说，有哪些小代码段可以应用，哪些地方不可以，并解释原因?

当前回答

GCC C最小可运行示例与x86拆装分析

让我们看看GCC如何通过查看生成的程序集自动执行尾部调用优化。

这将作为一个非常具体的例子，说明其他答案(如https://stackoverflow.com/a/9814654/895245)中提到的优化可以将递归函数调用转换为循环。

这反过来又节省了内存并提高了性能，因为内存访问通常是现在使程序变慢的主要原因。

作为输入，我们给GCC一个基于阶乘的非优化朴素堆栈:

tail_call.c

#include <stdio.h>
#include <stdlib.h>

unsigned factorial(unsigned n) {
    if (n == 1) {
        return 1;
    }
    return n * factorial(n - 1);
}

int main(int argc, char **argv) {
    int input;
    if (argc > 1) {
        input = strtoul(argv[1], NULL, 0);
    } else {
        input = 5;
    }
    printf("%u\n", factorial(input));
    return EXIT_SUCCESS;
}

GitHub上游。

编译和分解:

gcc -O1 -foptimize-sibling-calls -ggdb3 -std=c99 -Wall -Wextra -Wpedantic \
  -o tail_call.out tail_call.c
objdump -d tail_call.out

其中-foptimize-sibling-calls是根据man gcc泛化尾部调用的名称:

   -foptimize-sibling-calls
       Optimize sibling and tail recursive calls.

       Enabled at levels -O2, -O3, -Os.

如上所述:我如何检查gcc是否执行尾递归优化?

我选择-O1是因为:

优化不是用-O0完成的。我怀疑这是因为缺少必要的中间转换。 -O3产生了非常高效的代码，虽然它也是尾部调用优化。

使用-fno-optimize-sibling-calls进行反汇编:

0000000000001145 <factorial>:
    1145:       89 f8                   mov    %edi,%eax
    1147:       83 ff 01                cmp    $0x1,%edi
    114a:       74 10                   je     115c <factorial+0x17>
    114c:       53                      push   %rbx
    114d:       89 fb                   mov    %edi,%ebx
    114f:       8d 7f ff                lea    -0x1(%rdi),%edi
    1152:       e8 ee ff ff ff          callq  1145 <factorial>
    1157:       0f af c3                imul   %ebx,%eax
    115a:       5b                      pop    %rbx
    115b:       c3                      retq
    115c:       c3                      retq

-foptimize-sibling-calls:

0000000000001145 <factorial>:
    1145:       b8 01 00 00 00          mov    $0x1,%eax
    114a:       83 ff 01                cmp    $0x1,%edi
    114d:       74 0e                   je     115d <factorial+0x18>
    114f:       8d 57 ff                lea    -0x1(%rdi),%edx
    1152:       0f af c7                imul   %edi,%eax
    1155:       89 d7                   mov    %edx,%edi
    1157:       83 fa 01                cmp    $0x1,%edx
    115a:       75 f3                   jne    114f <factorial+0xa>
    115c:       c3                      retq
    115d:       89 f8                   mov    %edi,%eax
    115f:       c3                      retq

两者之间的关键区别在于:

the -fno-optimize-sibling-calls uses callq, which is the typical non-optimized function call. This instruction pushes the return address to the stack, therefore increasing it. Furthermore, this version also does push %rbx, which pushes %rbx to the stack. GCC does this because it stores edi, which is the first function argument (n) into ebx, then calls factorial. GCC needs to do this because it is preparing for another call to factorial, which will use the new edi == n-1. It chooses ebx because this register is callee-saved: What registers are preserved through a linux x86-64 function call so the subcall to factorial won't change it and lose n. the -foptimize-sibling-calls does not use any instructions that push to the stack: it only does goto jumps within factorial with the instructions je and jne. Therefore, this version is equivalent to a while loop, without any function calls. Stack usage is constant.

在Ubuntu 18.10, GCC 8.2中测试。

2019-03-18 21:39:18

其他回答

首先请注意，并非所有语言都支持它。

TCO适用于递归的特殊情况。它的要点是，如果你在函数中做的最后一件事是调用自己(例如，它从“尾部”位置调用自己)，这可以被编译器优化为像迭代一样而不是标准的递归。

You see, normally during recursion, the runtime needs to keep track of all the recursive calls, so that when one returns it can resume at the previous call and so on. (Try manually writing out the result of a recursive call to get a visual idea of how this works.) Keeping track of all the calls takes up space, which gets significant when the function calls itself a lot. But with TCO, it can just say "go back to the beginning, only this time change the parameter values to these new ones." It can do that because nothing after the recursive call refers to those values.

2008-11-22 07:09:41

我所发现的关于尾部调用、递归尾部调用和尾部调用优化的最好的高级描述可能是博客文章

“这到底是什么鬼:一个跟踪电话”

作者:Dan Sugalski。关于尾部调用优化，他写道:

Consider, for a moment, this simple function: sub foo (int a) { a += 15; return bar(a); } So, what can you, or rather your language compiler, do? Well, what it can do is turn code of the form return somefunc(); into the low-level sequence pop stack frame; goto somefunc();. In our example, that means before we call bar, foo cleans itself up and then, rather than calling bar as a subroutine, we do a low-level goto operation to the start of bar. Foo's already cleaned itself out of the stack, so when bar starts it looks like whoever called foo has really called bar, and when bar returns its value, it returns it directly to whoever called foo, rather than returning it to foo which would then return it to its caller.

关于尾递归:

尾递归发生在函数作为其最后一个操作返回时调用自身的结果。尾递归更容易处理因为与其跳到某个随机的开头函数，你只需要回到开始你自己，这是件非常简单的事情。

所以:

Sub foo (int a, int b) { 如果(b == 1) { 返回一个; }其他{ 返回foo(a*a + a, b - 1); ｝

悄然变成:

Sub foo (int a, int b) { 标签: 如果(b == 1) { 返回一个; }其他{ A = A * A + A; B = B - 1; goto标签; ｝

我喜欢这个描述的地方在于，对于那些有命令式语言(C、c++、Java)背景的人来说，它是如此的简洁和容易掌握。

2012-08-26 01:27:05

TCO (Tail Call Optimization) is the process by which a smart compiler can make a call to a function and take no additional stack space. The only situation in which this happens is if the last instruction executed in a function f is a call to a function g (Note: g can be f). The key here is that f no longer needs stack space - it simply calls g and then returns whatever g would return. In this case the optimization can be made that g just runs and returns whatever value it would have to the thing that called f.

这种优化可以使递归调用占用恒定的堆栈空间，而不是爆炸。

示例:这个阶乘函数不是TCOptimizable:

from dis import dis

def fact(n):
    if n == 0:
        return 1
    return n * fact(n-1)


dis(fact)
  2           0 LOAD_FAST                0 (n)
              2 LOAD_CONST               1 (0)
              4 COMPARE_OP               2 (==)
              6 POP_JUMP_IF_FALSE       12

  3           8 LOAD_CONST               2 (1)
             10 RETURN_VALUE

  4     >>   12 LOAD_FAST                0 (n)
             14 LOAD_GLOBAL              0 (fact)
             16 LOAD_FAST                0 (n)
             18 LOAD_CONST               2 (1)
             20 BINARY_SUBTRACT
             22 CALL_FUNCTION            1
             24 BINARY_MULTIPLY
             26 RETURN_VALUE

这个函数除了在它的return语句中调用另一个函数之外还做其他事情。

下面这个函数是TCOptimizable:

def fact_h(n, acc):
    if n == 0:
        return acc
    return fact_h(n-1, acc*n)

def fact(n):
    return fact_h(n, 1)


dis(fact)
  2           0 LOAD_GLOBAL              0 (fact_h)
              2 LOAD_FAST                0 (n)
              4 LOAD_CONST               1 (1)
              6 CALL_FUNCTION            2
              8 RETURN_VALUE

这是因为在这些函数中最后发生的事情是调用另一个函数。

2008-11-22 07:12:47

GCC C最小可运行示例与x86拆装分析

让我们看看GCC如何通过查看生成的程序集自动执行尾部调用优化。

这将作为一个非常具体的例子，说明其他答案(如https://stackoverflow.com/a/9814654/895245)中提到的优化可以将递归函数调用转换为循环。

这反过来又节省了内存并提高了性能，因为内存访问通常是现在使程序变慢的主要原因。

作为输入，我们给GCC一个基于阶乘的非优化朴素堆栈:

tail_call.c

#include <stdio.h>
#include <stdlib.h>

unsigned factorial(unsigned n) {
    if (n == 1) {
        return 1;
    }
    return n * factorial(n - 1);
}

int main(int argc, char **argv) {
    int input;
    if (argc > 1) {
        input = strtoul(argv[1], NULL, 0);
    } else {
        input = 5;
    }
    printf("%u\n", factorial(input));
    return EXIT_SUCCESS;
}

GitHub上游。

编译和分解:

gcc -O1 -foptimize-sibling-calls -ggdb3 -std=c99 -Wall -Wextra -Wpedantic \
  -o tail_call.out tail_call.c
objdump -d tail_call.out

其中-foptimize-sibling-calls是根据man gcc泛化尾部调用的名称:

   -foptimize-sibling-calls
       Optimize sibling and tail recursive calls.

       Enabled at levels -O2, -O3, -Os.

如上所述:我如何检查gcc是否执行尾递归优化?

我选择-O1是因为:

优化不是用-O0完成的。我怀疑这是因为缺少必要的中间转换。 -O3产生了非常高效的代码，虽然它也是尾部调用优化。

使用-fno-optimize-sibling-calls进行反汇编:

0000000000001145 <factorial>:
    1145:       89 f8                   mov    %edi,%eax
    1147:       83 ff 01                cmp    $0x1,%edi
    114a:       74 10                   je     115c <factorial+0x17>
    114c:       53                      push   %rbx
    114d:       89 fb                   mov    %edi,%ebx
    114f:       8d 7f ff                lea    -0x1(%rdi),%edi
    1152:       e8 ee ff ff ff          callq  1145 <factorial>
    1157:       0f af c3                imul   %ebx,%eax
    115a:       5b                      pop    %rbx
    115b:       c3                      retq
    115c:       c3                      retq

-foptimize-sibling-calls:

0000000000001145 <factorial>:
    1145:       b8 01 00 00 00          mov    $0x1,%eax
    114a:       83 ff 01                cmp    $0x1,%edi
    114d:       74 0e                   je     115d <factorial+0x18>
    114f:       8d 57 ff                lea    -0x1(%rdi),%edx
    1152:       0f af c7                imul   %edi,%eax
    1155:       89 d7                   mov    %edx,%edi
    1157:       83 fa 01                cmp    $0x1,%edx
    115a:       75 f3                   jne    114f <factorial+0xa>
    115c:       c3                      retq
    115d:       89 f8                   mov    %edi,%eax
    115f:       c3                      retq

两者之间的关键区别在于:

在Ubuntu 18.10, GCC 8.2中测试。

2019-03-18 21:39:18

在函数式语言中，尾部调用优化就好像函数调用可以返回部分求值的表达式作为结果，然后由调用者求值。

f x = g x

f6变成了g6。因此，如果实现可以返回g6作为结果，然后调用该表达式，它将保存一个堆栈帧。

Also

f x = if c x then g x else h x.

还原到f6到g6或h6。所以如果实现计算c6，发现它是真的，那么它可以减少，

if true then g x else h x ---> g x

f x ---> h x

一个简单的非尾部调用优化解释器可能是这样的，

class simple_expresion
{
    ...
public:
    virtual ximple_value *DoEvaluate() const = 0;
};

class simple_value
{
    ...
};

class simple_function : public simple_expresion
{
    ...
private:
    simple_expresion *m_Function;
    simple_expresion *m_Parameter;

public:
    virtual simple_value *DoEvaluate() const
    {
        vector<simple_expresion *> parameterList;
        parameterList->push_back(m_Parameter);
        return m_Function->Call(parameterList);
    }
};

class simple_if : public simple_function
{
private:
    simple_expresion *m_Condition;
    simple_expresion *m_Positive;
    simple_expresion *m_Negative;

public:
    simple_value *DoEvaluate() const
    {
        if (m_Condition.DoEvaluate()->IsTrue())
        {
            return m_Positive.DoEvaluate();
        }
        else
        {
            return m_Negative.DoEvaluate();
        }
    }
}

尾部调用优化解释器可能是这样的，

class tco_expresion
{
    ...
public:
    virtual tco_expresion *DoEvaluate() const = 0;
    virtual bool IsValue()
    {
        return false;
    }
};

class tco_value
{
    ...
public:
    virtual bool IsValue()
    {
        return true;
    }
};

class tco_function : public tco_expresion
{
    ...
private:
    tco_expresion *m_Function;
    tco_expresion *m_Parameter;

public:
    virtual tco_expression *DoEvaluate() const
    {
        vector< tco_expression *> parameterList;
        tco_expression *function = const_cast<SNI_Function *>(this);
        while (!function->IsValue())
        {
            function = function->DoCall(parameterList);
        }
        return function;
    }

    tco_expresion *DoCall(vector<tco_expresion *> &p_ParameterList)
    {
        p_ParameterList.push_back(m_Parameter);
        return m_Function;
    }
};

class tco_if : public tco_function
{
private:
    tco_expresion *m_Condition;
    tco_expresion *m_Positive;
    tco_expresion *m_Negative;

    tco_expresion *DoEvaluate() const
    {
        if (m_Condition.DoEvaluate()->IsTrue())
        {
            return m_Positive;
        }
        else
        {
            return m_Negative;
        }
    }
}

2020-03-21 05:46:54

什么是尾部调用优化?

推荐文章

最新文章

标签