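I'm currently working on a very performance-critical program, and one path I decided to explore that may help reduce resource consumption is increasing my worker threads' stack size so that I can move most of the data (float[]s) that I'll be accessing onto the stack (using stackalloc).

I've read that the default stack size for a thread is 1 MB, so in order to move all my float[]s I would have to expand the stack by approximately 50 times (to about 50 MB).

I understand this is generally considered "unsafe" and isn't recommended, but after benchmarking my current code against this method, I discovered a 530% increase in processing speed! So I cannot simply pass by this option without further investigation, which leads me to my question: what are the dangers associated with increasing the stack to such a large size (what could go wrong), and what precautions should I take to minimise such dangers?

My test code: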

public static unsafe void TestMethod1()
{
    float* samples = stackalloc float[12500000];

    for (var ii = 0; ii < 12500000; ii++)
    {
        samples[ii] = 32768;
    }
}

public static void TestMethod2()
{
    var samples = new float[12500000];

    for (var i = 0; i < 12500000; i++)
    {
        samples[i] = 32768;
    }
}

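Current answer

One thing that can go wrong is that you might not get the permission to do so. Unless it is running in full-trust mode, the Framework will simply ignore the request for a larger stack size (see MSDN on the Thread constructor (ParameterizedThreadStart, Int32)).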
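For reference, requesting a larger stack for a worker thread might look like the sketch below. The names BigStackDemo, RunWithStack and FillOnStack are hypothetical, and the sketch assumes C# 7.2 or later so that stackalloc can be assigned to a Span<float> without the unsafe switch (the question itself uses a raw float*). The runtime may clamp or ignore the requested size, so treat the second argument as a request, not a guarantee.

```csharp
using System;
using System.Threading;

public static class BigStackDemo
{
    // Runs 'work' on a dedicated thread that requests 'stackBytes' of stack.
    // The request may be ignored or clamped by the runtime (for example in partial trust).
    public static void RunWithStack(Action work, int stackBytes)
    {
        Exception failure = null;
        var thread = new Thread(() =>
        {
            // Capture exceptions so they surface on the calling thread.
            try { work(); }
            catch (Exception ex) { failure = ex; }
        }, stackBytes);
        thread.Start();
        thread.Join();
        if (failure != null)
            throw new InvalidOperationException("Worker failed", failure);
    }

    // Fills a stack-allocated buffer and returns its last element.
    // 'count' floats must fit into the stack of the calling thread.
    public static float FillOnStack(int count)
    {
        Span<float> samples = stackalloc float[count];
        for (int i = 0; i < count; i++)
        {
            samples[i] = 32768f;
        }
        return samples[count - 1];
    }
}
```

A call such as BigStackDemo.RunWithStack(() => BigStackDemo.FillOnStack(1000000), 16 * 1024 * 1024) places about 4 MB of floats on a thread that asked for a 16 MB stack.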

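Instead of increasing the system stack size to such a huge number, I would suggest rewriting your code so that it uses iteration and a manual stack implementation on the heap.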
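As an illustration of that suggestion, here is a minimal sketch of replacing recursion with an explicit, heap-allocated Stack<T>. The Node type and the IterativeWalk.Sum method are hypothetical and only stand in for whatever recursive structure the real code walks.

```csharp
using System.Collections.Generic;

// Hypothetical tree node, used only for this illustration.
public sealed class Node
{
    public float Value;
    public List<Node> Children = new List<Node>();
}

public static class IterativeWalk
{
    // Sums every node value depth-first without recursion.
    // The pending work lives in a Stack<Node> on the heap,
    // so depth is limited by available memory, not by the thread stack.
    public static double Sum(Node root)
    {
        double total = 0;
        var pending = new Stack<Node>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            Node current = pending.Pop();
            total += current.Value;
            foreach (Node child in current.Children)
            {
                pending.Push(child);
            }
        }
        return total;
    }
}
```

Because the bookkeeping grows on the heap, a very deep structure no longer needs an enlarged thread stack.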

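Other answers

EDIT: (a small change in the code and in the measurement produces a big change in the result)

First, I ran the optimized code in the debugger (F5), but that was wrong. It should be run without the debugger (Ctrl+F5). Second, the code may be thoroughly optimized, so we must complicate it so that the optimizer does not interfere with our measurement. I made all methods return the last item in the array, and the array is populated differently. Also, there is an extra zero in the OP's TestMethod2 that always makes it ten times slower.

I tried some other methods in addition to the two that you provided. Method 3 has the same code as your method 2, but the function is declared unsafe. Method 4 uses pointer access to a regularly created array. Method 5 uses pointer access to unmanaged memory, as described by Marc Gravell. All five methods run in very similar times. M5 is the fastest (with M1 a close second). The difference between the fastest and the slowest is about 5%, which is not something I care about.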

    public static unsafe float TestMethod3()
    {
        float[] samples = new float[5000000];

        for (var ii = 0; ii < 5000000; ii++)
        {
            samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
        }

        return samples[5000000 - 1];
    }

    public static unsafe float TestMethod4()
    {
        float[] prev = new float[5000000];
        fixed (float* samples = &prev[0])
        {
            for (var ii = 0; ii < 5000000; ii++)
            {
                samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
            }

            return samples[5000000 - 1];
        }
    }

    public static unsafe float TestMethod5()
    {
        var ptr = Marshal.AllocHGlobal(5000000 * sizeof(float));
        try
        {
            float* samples = (float*)ptr;

            for (var ii = 0; ii < 5000000; ii++)
            {
                samples[ii] = 32768 + (ii != 0 ? samples[ii - 1] : 0);
            }

            return samples[5000000 - 1];
        }
        finally
        {
            Marshal.FreeHGlobal(ptr);
        }
    }
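The measurement approach described above (run an optimized build without the debugger, and return a value so the work cannot be optimized away) could be wrapped in a small harness like the following sketch. Bench, AverageMs and FillManaged are hypothetical names, and FillManaged is just a safe-code variant of the fill loop used in the methods above; this is not the harness the answer's author used.

```csharp
using System;
using System.Diagnostics;

public static class Bench
{
    // Times 'body' over 'iterations' runs after one warm-up call and
    // returns the average milliseconds per run. Each result is added
    // to 'sink' so the JIT cannot discard the measured work.
    public static double AverageMs(Func<float> body, int iterations, out float sink)
    {
        sink = body(); // warm-up call: forces JIT compilation before timing
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sink += body();
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / iterations;
    }

    // Safe-code version of the fill loop, returning the last element.
    public static float FillManaged()
    {
        var samples = new float[5000000];
        for (int i = 0; i < samples.Length; i++)
        {
            samples[i] = 32768 + (i != 0 ? samples[i - 1] : 0);
        }
        return samples[samples.Length - 1];
    }
}
```

Calling Bench.AverageMs(Bench.FillManaged, 10, out float sink) in a Release build started with Ctrl+F5 gives a figure that can be compared across the variants.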

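Since the performance difference is so large, the problem is hardly related to the allocation. It is most likely caused by the array access.

I disassembled the loop body of the functions: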

TestMethod1:

IL_0011:  ldloc.0 
IL_0012:  ldloc.1 
IL_0013:  ldc.i4.4 
IL_0014:  mul 
IL_0015:  add 
IL_0016:  ldc.r4 32768.
IL_001b:  stind.r4 // <----------- This one
IL_001c:  ldloc.1 
IL_001d:  ldc.i4.1 
IL_001e:  add 
IL_001f:  stloc.1 
IL_0020:  ldloc.1 
IL_0021:  ldc.i4 12500000
IL_0026:  blt IL_0011

TestMethod2:

IL_0012:  ldloc.0 
IL_0013:  ldloc.1 
IL_0014:  ldc.r4 32768.
IL_0019:  stelem.r4 // <----------- This one
IL_001a:  ldloc.1 
IL_001b:  ldc.i4.1 
IL_001c:  add 
IL_001d:  stloc.1 
IL_001e:  ldloc.1 
IL_001f:  ldc.i4 12500000
IL_0024:  blt IL_0012

We can check the usage of the instructions and, more importantly, the exceptions they throw according to the ECMA specification:

stind.r4: Store value of type float32 into memory at address

Exceptions it throws:

System.NullReferenceException

And

stelem.r4: Replace array element at index with the float32 value on the stack.

Exceptions it throws:

System.NullReferenceException
System.IndexOutOfRangeException
System.ArrayTypeMismatchException

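As you can see, stelem does more work because of the array range checking and type checking. Since the loop body does very little (it only assigns a value), the overhead of the checks dominates the computation time. That is why the performance differs by 530%.

This also answers your question: the danger is the absence of array range and type checking. This is unsafe (as mentioned in the function declaration ;D).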

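"I discovered a 530% increase in processing speed!"

That is by far the biggest danger, I'd say. There is something seriously wrong with your benchmark; code that behaves this unpredictably usually has a nasty bug hidden somewhere.

It is very, very difficult to consume a lot of stack space in a .NET program, other than by excessive recursion. The size of the stack frame of a managed method is fixed: it is simply the sum of the method's arguments and its local variables, minus the ones that can be stored in CPU registers, which you can ignore since there are so few of them.

Increasing the stack size doesn't accomplish anything; you'll just reserve a bunch of address space that will never be used. And of course there is no mechanism that could explain a performance increase from not using memory.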

This is unlike a native program, particularly one written in C, which can also reserve space for arrays on the stack frame. That is the basic malware attack vector behind stack buffer overflows. It is possible in C# as well, but you have to use the stackalloc keyword. If you are doing that, then the obvious danger is having to write unsafe code that is subject to such attacks, as well as random stack frame corruption, which produces bugs that are very hard to diagnose. There is a counter-measure against this in later jitters, I think starting at .NET 4.0, where the jitter generates code to put a "cookie" on the stack frame and checks whether it is still intact when the method returns. If it is not, the result is an instant crash to the desktop without any way to intercept or report the mishap. That's ... dangerous to the user's mental state.

The main thread of your program, the one started by the operating system, will have a 1 MB stack by default, 4 MB when you compile your program targeting x64. Increasing that requires running Editbin.exe with the /STACK option in a post-build event. You can typically ask for up to 500 MB before your program has trouble getting started when running in 32-bit mode. Threads can have a larger stack too, and much more easily of course; there the danger zone typically hovers around 90 MB for a 32-bit program. The failure is triggered when your program has been running for a long time and the address space has become fragmented from previous allocations. Total address space usage must already be high, over a gigabyte, to get this failure mode.

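Triple-check your code; something is very wrong. You can't get a 5x speedup from a bigger stack unless you explicitly write your code to take advantage of it, and that always requires unsafe code. Using pointers in C# always has a knack for creating faster code, since it isn't subject to the array bounds checks.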

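I would have a reservation there: I simply don't know how to predict it. Permissions, GC (which needs to scan the stack), and so on could all be affected. I would be very tempted to use unmanaged memory instead: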

var ptr = Marshal.AllocHGlobal(sizeBytes);
try
{
    float* x = (float*)ptr;
    DoWork(x);
}
finally
{
    Marshal.FreeHGlobal(ptr);
}

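High-performance arrays might be accessible in the same way as a normal C# array, but that could be the beginning of trouble. Consider the following code: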

float[] someArray = new float[100];
someArray[200] = 10.0f; // throws IndexOutOfRangeException

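You expect an out-of-bounds exception here, which makes complete sense because you are trying to access element 200 while the largest allowed index is 99. If you go the stackalloc route, there is no object wrapped around your array to do the bounds check, and the following will not raise any exception: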

float* pFloat = stackalloc float[100];
pFloat[200] = 10.0f; // no bounds check: writes outside the allocation

Above, you allocate enough memory to hold 100 floats, and then you write the float value 10 to the sizeof(float)-byte location that starts 200 * sizeof(float) bytes past the start of that memory. Unsurprisingly, this location is outside the memory allocated for the floats, and nobody knows what might be stored at that address. If you are lucky, you hit some currently unused memory, but it is just as likely that you overwrite a location used for storing other variables. To summarize: unpredictable runtime behaviour.
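One way to keep the stack allocation while getting the bounds check back is sketched below. It is an assumption on my part that C# 7.2 or later with Span<T> is available, which none of the answers here rely on; BoundsDemo and WriteIsRejected are hypothetical names. Wrapping the stackalloc result in a Span<float> means an out-of-range index throws instead of silently corrupting the stack.

```csharp
using System;

public static class BoundsDemo
{
    // Returns true if writing at 'index' into a 100-element stack buffer
    // was rejected by the bounds check of the Span<float> indexer.
    public static bool WriteIsRejected(int index)
    {
        // Stack-allocated, but accessed through a bounds-checked Span.
        Span<float> samples = stackalloc float[100];
        try
        {
            samples[index] = 10.0f;
            return false; // index was inside the buffer
        }
        catch (IndexOutOfRangeException)
        {
            return true; // index was outside the buffer
        }
    }
}
```

The check costs the same kind of work the raw pointer version avoids, so this trades some of the speed back for safety.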