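It is well known that calloc differs from malloc in that it initializes the memory it allocates. With calloc, the memory is set to zero. With malloc, the memory is not cleared.

So in everyday work I think of calloc as malloc+memset. Just for fun, I wrote the following code as a benchmark.

The result is confusing.

Code 1: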

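#include <stdio.h>
#include <stdlib.h>
#define BLOCK_SIZE (1024*1024*256)  /* 256 MiB per block */
int main(void)
{
    int i = 0;
    char *buf[10];
    /* allocate ten zero-initialized 256 MiB blocks and never touch them */
    while (i < 10)
    {
        buf[i] = (char*)calloc(1, BLOCK_SIZE);
        i++;
    }
    return 0;
}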

Output of Code 1:

time ./a.out  
**real 0m0.287s**  
user 0m0.095s  
sys 0m0.192s  

Code 2:

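#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BLOCK_SIZE (1024*1024*256)  /* 256 MiB per block */
int main(void)
{
    int i = 0;
    char *buf[10];
    /* allocate ten 256 MiB blocks and zero each one by hand */
    while (i < 10)
    {
        buf[i] = (char*)malloc(BLOCK_SIZE);
        memset(buf[i], '\0', BLOCK_SIZE);
        i++;
    }
    return 0;
}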

Output of Code 2:

time ./a.out   
**real 0m2.693s**  
user 0m0.973s  
sys 0m1.721s  

Replacing memset with bzero(buf[i],BLOCK_SIZE) in Code 2 produces the same result.

My question is: why is malloc+memset so much slower than calloc? How does calloc manage that?


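Current answer

The short version: always use calloc() instead of malloc()+memset(). In most cases they will be the same. In some cases calloc() will do less work because it can skip memset() entirely. In other cases calloc() can even cheat and not allocate any memory at all! malloc()+memset(), however, will always do the full amount of work.

Understanding this requires a short tour of the memory system.

Quick tour of memory

There are four main parts here: your program, the standard library, the kernel, and the page tables. You already know your program, so...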

Memory allocators like malloc() and calloc() are mostly there to take small allocations (anything from 1 byte to 100s of KB) and group them into larger pools of memory. For example, if you allocate 16 bytes, malloc() will first try to get 16 bytes out of one of its pools, and then ask for more memory from the kernel when the pool runs dry. However, since the program you're asking about allocates a large amount of memory at once, malloc() and calloc() will just ask for that memory directly from the kernel. The threshold for this behavior depends on your system, but I've seen 1 MiB used as the threshold.
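To make the pooling idea concrete, here is a toy sketch, not how any real malloc() is written. The names toy_alloc and toy_in_pool, the 64 KiB pool, and the 16 KiB threshold are all made up for illustration, and malloc() merely stands in for "ask the kernel directly".

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical numbers for illustration only; real allocators differ. */
#define TOY_POOL_SIZE       (64 * 1024)
#define TOY_LARGE_THRESHOLD (16 * 1024)

static _Alignas(16) unsigned char toy_pool[TOY_POOL_SIZE];
static size_t toy_pool_used = 0;

/* Returns nonzero if p points into the toy pool. */
int toy_in_pool(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)toy_pool;
    return a >= lo && a < lo + TOY_POOL_SIZE;
}

/* Small requests are carved out of the pool; large ones bypass it. */
void *toy_alloc(size_t size)
{
    if (size >= TOY_LARGE_THRESHOLD)
        return malloc(size);   /* stand-in for a direct request to the kernel */

    /* Round up so every returned pointer stays 16-byte aligned. */
    size_t rounded = (size + 15) & ~(size_t)15;
    if (rounded > TOY_POOL_SIZE - toy_pool_used)
        return NULL;           /* a real allocator would fetch a new pool here */

    void *p = toy_pool + toy_pool_used;
    toy_pool_used += rounded;
    return p;
}
```

Two calls to toy_alloc(16) return neighbouring addresses inside the pool, while toy_alloc(1024 * 1024) takes the large-allocation path and returns memory from outside it.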

The kernel is responsible for allocating actual RAM to each process and making sure that processes don't interfere with the memory of other processes. This is called memory protection, it has been dirt common since the 1990s, and it's the reason why one program can crash without bringing down the whole system. So when a program needs more memory, it can't just take the memory, but instead it asks for the memory from the kernel using a system call like mmap() or sbrk(). The kernel will give RAM to each process by modifying the page table.
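For reference, asking the kernel for memory directly might look like the sketch below. It assumes a Linux or BSD-like system where MAP_ANONYMOUS is available; kernel_alloc and kernel_free are hypothetical wrapper names, not part of any library.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict modes */
#include <stddef.h>
#include <sys/mman.h>

/* Ask the kernel for len bytes of fresh anonymous memory; NULL on failure. */
void *kernel_alloc(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hand the memory back to the kernel; 0 on success, -1 on failure. */
int kernel_free(void *p, size_t len)
{
    return munmap(p, len);
}
```

Note that munmap() needs the length again, which is one reason allocators keep bookkeeping about every block they hand out.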

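The page table maps memory addresses to actual physical RAM. Your process's addresses (0x00000000 to 0xFFFFFFFF on a 32-bit system) are not real memory; they are addresses in virtual memory. The processor divides these addresses into 4 KiB pages, and each page can be assigned to a different piece of physical RAM by modifying the page table. Only the kernel is permitted to modify the page table.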
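The page arithmetic is simple enough to write down. The helper names below are hypothetical, and the 4 KiB page size is the one assumed in the text; a real program would typically query it with sysconf(_SC_PAGESIZE).

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u   /* 4 KiB, the page size the text assumes */

/* Which virtual page an address falls in. */
uintptr_t page_number(uintptr_t addr) { return addr / PAGE_SIZE_BYTES; }

/* Where inside that page the address lands. */
uintptr_t page_offset(uintptr_t addr) { return addr % PAGE_SIZE_BYTES; }

/* How many pages are needed to hold `bytes` bytes, rounded up. */
size_t pages_needed(size_t bytes)
{
    return (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}
```

With these definitions, address 0x12345 sits at offset 0x345 of page 0x12, and one 256 MiB block from the question spans 65,536 pages.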

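How it doesn't work

Here is how allocating 256 MiB does not work:

1. Your process calls calloc() and asks for 256 MiB.
2. The standard library calls mmap() and asks for 256 MiB.
3. The kernel finds 256 MiB of unused RAM and gives it to your process by modifying the page table.
4. The standard library zeroes the RAM with memset() and returns from calloc().
5. Your process eventually exits, and the kernel reclaims the RAM so it can be used by another process.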
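Written as code, this naive model would look roughly like the sketch below. naive_calloc is a hypothetical name and this is not real libc code; it assumes a system with mmap() and MAP_ANONYMOUS.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict modes */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical model of the naive sequence: get memory, then zero all of it. */
void *naive_calloc(size_t count, size_t size)
{
    if (size != 0 && count > SIZE_MAX / size)
        return NULL;                 /* count * size would overflow */
    size_t len = count * size;
    if (len == 0)
        return NULL;                 /* mmap() rejects zero-length mappings */

    /* The library asks the kernel for the memory. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* The library zeroes every byte, which writes to every single page. */
    memset(p, 0, len);
    return p;
}
```

The memset() call is the expensive part: its cost grows with the size of the block, whether or not the caller ever looks at the memory.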

How it actually works

The process above would work, but it just doesn't happen that way. There are three major differences.

1. When your process gets new memory from the kernel, that memory was probably used by some other process previously. This is a security risk. What if that memory has passwords, encryption keys, or secret salsa recipes? To keep sensitive data from leaking, the kernel always scrubs memory before giving it to a process. We might as well scrub the memory by zeroing it, and if new memory is zeroed we might as well make it a guarantee, so mmap() guarantees that the new memory it returns is always zeroed.

2. There are a lot of programs out there that allocate memory but don't use the memory right away. Sometimes memory is allocated but never used. The kernel knows this and is lazy. When you allocate new memory, the kernel doesn't touch the page table at all and doesn't give any RAM to your process. Instead, it finds some address space in your process, makes a note of what is supposed to go there, and makes a promise that it will put RAM there if your program ever actually uses it. When your program tries to read or write from those addresses, the processor triggers a page fault and the kernel steps in to assign RAM to those addresses and resumes your program. If you never use the memory, the page fault never happens and your program never actually gets the RAM.

3. Some processes allocate memory and then read from it without modifying it. This means that a lot of pages in memory across different processes may be filled with pristine zeroes returned from mmap(). Since these pages are all the same, the kernel makes all these virtual addresses point to a single shared 4 KiB page of memory filled with zeroes. If you try to write to that memory, the processor triggers another page fault and the kernel steps in to give you a fresh page of zeroes that isn't shared with any other programs.
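One way to watch the lazy behavior is to ask the kernel which pages of a mapping currently have RAM behind them. The sketch below assumes Linux, where mincore() has the signature used here; resident_pages is a hypothetical helper name, and the numbers it reports depend on the kernel and its configuration (transparent huge pages, for example).

```c
#define _DEFAULT_SOURCE   /* exposes mincore() on glibc in strict modes */
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count how many pages of [addr, addr + len) currently have RAM behind them.
 * addr must be page-aligned (mmap() results are). Returns -1 on error. */
long resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len == 0)
        return -1;

    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;

    long count = -1;
    if (mincore(addr, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1;   /* low bit set means the page is resident */
    }
    free(vec);
    return count;
}
```

Calling resident_pages() on a fresh anonymous mapping, writing a single byte into it, and calling it again should show the count grow on a typical Linux system, while the untouched pages still read back as zero.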

The final process looks more like this:

1. Your process calls calloc() and asks for 256 MiB.
2. The standard library calls mmap() and asks for 256 MiB.
3. The kernel finds 256 MiB of unused address space, makes a note about what that address space is now used for, and returns.
4. The standard library knows that the result of mmap() is always filled with zeroes (or will be once it actually gets some RAM), so it doesn't touch the memory, so there is no page fault, and the RAM is never given to your process.
5. Your process eventually exits, and the kernel doesn't need to reclaim the RAM because it was never allocated in the first place.
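In the same spirit, the lazy path can be sketched as follows. lazy_calloc is a hypothetical name for illustration, not how any particular libc is written; it relies on the guarantee that fresh anonymous mmap() memory reads as zero.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict modes */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical model of the lazy sequence: reserve address space and return. */
void *lazy_calloc(size_t count, size_t size)
{
    if (size != 0 && count > SIZE_MAX / size)
        return NULL;                 /* count * size would overflow */
    size_t len = count * size;
    if (len == 0)
        return NULL;                 /* mmap() rejects zero-length mappings */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* No memset(): the kernel promises zeroes, so no page is touched here. */
    return p;
}
```

The cost of this function does not depend on how many bytes were requested, since no page of the result is read or written before it returns.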

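If you use memset() to zero the pages, memset() will trigger the page faults, cause the RAM to get allocated, and then zero it even though it is already filled with zeroes. This is an enormous amount of extra work, and it explains why calloc() is faster than malloc() and memset(). If you end up using the memory anyway, calloc() is still faster than malloc() and memset(), but the difference is not quite so ridiculous.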
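If you want to see the page faults rather than infer them from timings, one option is to read the process's minor-fault counter around the allocation. This is a sketch assuming a POSIX system that fills in ru_minflt; minor_faults and faults_for_allocation are hypothetical names, and the counts depend on the allocator and kernel.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Minor page faults taken by this process so far, or -1 on error. */
long minor_faults(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* calloc() len bytes, optionally memset() them again, and report how many
 * minor faults happened in between. Returns -1 on failure. */
long faults_for_allocation(size_t len, int also_memset)
{
    /* Calling through a volatile pointer keeps the compiler from deleting
     * a memset() that it can prove is redundant after calloc(). */
    void *(*volatile call_memset)(void *, int, size_t) = memset;

    if (len == 0)
        return -1;
    long before = minor_faults();
    char *p = calloc(1, len);
    if (p == NULL || before < 0) {
        free(p);
        return -1;
    }
    if (also_memset)
        call_memset(p, 0, len);

    /* Read one byte so the block is genuinely used at least once. */
    volatile char sink = ((volatile char *)p)[len - 1];
    (void)sink;

    long after = minor_faults();
    free(p);
    return after < 0 ? -1 : after - before;
}
```

Comparing faults_for_allocation(len, 0) with faults_for_allocation(len, 1) for a large len is a quick way to check, on your own system, whether the extra memset() is what pulls the RAM in.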


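This doesn't always work

Not all systems have paged virtual memory, so not all systems can use these optimizations. This applies to very old processors like the 80286 as well as embedded processors that are just too small for a sophisticated memory management unit.

This also won't always work with smaller allocations. With smaller allocations, calloc() gets memory from a shared pool instead of going directly to the kernel. In general, the shared pool might have junk data stored in it from old memory that was used and then released with free(), so calloc() could take that memory and call memset() to clear it out. Common implementations will track which parts of the shared pool are pristine and still filled with zeroes, but not all implementations do this.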
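A small experiment shows why calloc() cannot skip the clearing step for recycled pool memory. The function name below is hypothetical; whether the second allocation actually reuses the freed block depends on the allocator, but the C standard requires the calloc() result to be zeroed either way.

```c
#include <stddef.h>
#include <stdlib.h>

/* Dirty a small block, free it, then calloc() a block of the same size.
 * Returns 1 if the calloc()ed block is all zeroes, 0 if not,
 * and -1 if an allocation fails. */
int recycled_calloc_is_zeroed(size_t size)
{
    unsigned char *junk = malloc(size);
    if (junk == NULL)
        return -1;

    /* Volatile writes so the compiler cannot drop stores to memory
     * that is about to be freed. */
    volatile unsigned char *v = junk;
    for (size_t i = 0; i < size; i++)
        v[i] = 0xAA;
    free(junk);

    unsigned char *clean = calloc(1, size);   /* may reuse the dirty block */
    if (clean == NULL)
        return -1;

    int ok = 1;
    for (size_t i = 0; i < size; i++)
        if (clean[i] != 0)
            ok = 0;
    free(clean);
    return ok;
}
```

If the allocator does hand back the block that was just filled with 0xAA, the only way this function can return 1 is for calloc() to have cleared it, which is exactly the memset() cost the large-allocation path avoids.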

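Dispelling some wrong answers

Depending on the operating system, the kernel may or may not zero memory in its free time, in case you need some zeroed memory later. Linux does not zero memory ahead of time, and Dragonfly BSD recently removed this feature from its kernel as well. Some other kernels do zero memory ahead of time, however. Zeroing pages during idle time isn't enough to explain the large performance difference anyway.

The calloc() function is not using some special memory-aligned version of memset(), and that wouldn't make it much faster anyway. Most memset() implementations for modern processors look kind of like this: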

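#include <stddef.h>
#include <stdint.h>

void *sketch_memset(void *s, int c, size_t len)
{
    unsigned char *dest = s;
    // one byte at a time, until dest is 16-byte aligned...
    while (len > 0 && ((uintptr_t)dest & 15)) {
        *dest++ = (unsigned char)c;
        len -= 1;
    }
    // now write big chunks at a time (processor-specific)...
    // block size might not be 16, this is only a sketch
    while (len >= 16) {
        // some optimized vector code goes here (glibc uses SSE2
        // when available); a plain loop stands in for it
        for (int i = 0; i < 16; i++)
            dest[i] = (unsigned char)c;
        dest += 16;
        len -= 16;
    }
    // the end is not aligned, so one byte at a time
    while (len > 0) {
        *dest++ = (unsigned char)c;
        len -= 1;
    }
    return s;
}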

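So you can see that memset() is very fast, and you're not really going to get anything better for large blocks of memory.

The fact that memset() is zeroing memory that is already zeroed does mean that the memory gets zeroed twice, but that only explains a 2x performance difference. The performance difference here is much larger (on my system I measured more than three orders of magnitude between malloc()+memset() and calloc()).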

Party trick

Instead of looping 10 times, write a program that allocates memory until malloc() or calloc() returns NULL.

What happens if you add memset()?
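A possible starting point for the experiment is sketched below. The function name and the max_blocks safety cap are additions for illustration, not part of the original suggestion; the blocks are leaked on purpose, so run it in a throwaway process.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Keeps the last pointer observable so the compiler cannot remove the calls. */
static void *volatile last_block;

/* Allocate blocks of block_size bytes until calloc() returns NULL or
 * max_blocks is reached. Returns how many allocations succeeded.
 * The blocks are intentionally never freed. */
size_t allocate_until_null(size_t block_size, size_t max_blocks, int touch)
{
    size_t n = 0;
    while (n < max_blocks) {
        char *p = calloc(1, block_size);
        if (p == NULL)
            break;
        if (touch) {
            /* Any write touches the pages; a nonzero byte keeps the compiler
             * from treating the call as redundant after calloc(). */
            memset(p, 1, block_size);
        }
        last_block = p;
        n++;
    }
    return n;
}
```

Try it first with touch set to 0 and a generous cap, then with touch set to 1 and a small cap, and compare how far each run gets. Be careful with the second variant: it commits real memory, so an uncapped run can push the machine into swapping.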

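Other answers

Because on many systems, in spare processing time, the OS goes around setting free memory to zero on its own and marking it safe for calloc(), so when you call calloc(), it may already have free, zeroed memory to give you.

On some platforms, in some modes, malloc initializes the memory to some typically non-zero value before returning it, so the second version could well initialize the memory twice.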
