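We do this sort of thing all the time for Accelerate.framework, a heavily vectorized OS X / iOS library, where we have to pay attention to alignment all the time. There are quite a few options, one or two of which I didn't see mentioned above.
The fastest method for a small array like this is to just stick it on the stack. With GCC / clang: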
#include <stdint.h>
void my_func( void )
{
    /* 16-byte-aligned scratch array; released automatically on return */
    uint8_t array[1024] __attribute__ ((aligned(16)));
    ...
}
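No free() required. This is typically two instructions: subtract 1024 from the stack pointer, then AND the stack pointer with -alignment. Presumably the requester needed the data on the heap because the array's lifespan exceeded that of the stack frame, or recursion is at work, or stack space is at a serious premium.
On OS X / iOS, all calls to malloc/calloc/etc. are always 16-byte aligned. If you need 32-byte alignment, for AVX for example, then you can use posix_memalign: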
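For compilers without the GCC / clang attribute syntax, the same stack allocation can be written with C11's alignas. This is only a sketch: is_aligned and stack_buffer_is_aligned are hypothetical names introduced here for illustration, not part of any library.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Returns nonzero if p is a multiple of alignment (which must be a power of two). */
int is_aligned(const void *p, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical demo: declares a 16-byte-aligned stack array the C11 way
   and reports whether the compiler honored the request. */
int stack_buffer_is_aligned(void)
{
    alignas(16) uint8_t array[1024];
    array[0] = 0;                 /* touch the array so it is really used */
    return is_aligned(array, 16);
}
```

As with the attribute form, nothing needs to be freed; the array disappears when the function returns.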
void *buf = NULL;
int err = posix_memalign( &buf, 32 /*alignment*/, 1024 /*size*/);
if( err )
    RunInCirclesWaivingArmsWildly();
...
free(buf);
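Some folks have mentioned the C++ interface that works similarly.
It should not be forgotten that pages are aligned to large powers of two, so page-aligned buffers are also 16-byte aligned. Thus, mmap() and valloc() and other similar interfaces are also options. mmap() has the advantage that the buffer can be allocated preinitialized with something non-zero in it, if you want. Since these interfaces hand out memory in page-sized units, you will not get the minimum-size allocation from them, and the buffer will likely be subject to a VM fault the first time you touch it.
Cheesy: turn on guard malloc or similar. Buffers whose size is a multiple of 16 bytes, such as this one, will end up 16-byte aligned, because VM is used to catch overruns and its boundaries are at page boundaries.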
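A slightly fuller version of the posix_memalign call might look like the sketch below. The helper name alloc_aligned is hypothetical, and the feature-test macro is only an assumption about what a strict -std= compiler mode needs in order to expose the declaration.

```c
#define _POSIX_C_SOURCE 200112L   /* assumption: exposes posix_memalign in strict modes */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns `size` bytes aligned to `alignment`,
   or NULL on failure. Release the result with plain free(). */
void *alloc_aligned(size_t alignment, size_t size)
{
    void *buf = NULL;

    /* posix_memalign requires a power of two that is a multiple of sizeof(void *). */
    assert(alignment >= sizeof(void *) && (alignment & (alignment - 1)) == 0);

    /* On failure it returns an error number rather than setting errno. */
    if (posix_memalign(&buf, alignment, size) != 0)
        return NULL;
    return buf;
}
```

A caller wanting an AVX-friendly buffer would ask for alloc_aligned(32, 1024) and hand the result to free() when done.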
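To illustrate the page-alignment point, here is a minimal sketch that maps anonymous memory and checks where it landed. The helper names are hypothetical, and the _DEFAULT_SOURCE define is an assumption about what glibc needs to expose MAP_ANON under strict compiler modes. Note that an anonymous mapping is zero-filled; getting non-zero initial contents would mean mapping a file instead, which is not shown here.

```c
#define _DEFAULT_SOURCE   /* assumption: exposes MAP_ANON on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: maps at least `size` bytes of zero-filled memory.
   The result starts on a page boundary. Returns NULL on failure.
   Release it with munmap(ptr, size). */
void *alloc_page_aligned(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Page size as reported by the OS. */
size_t page_size(void)
{
    long n = sysconf(_SC_PAGESIZE);
    assert(n > 0);
    return (size_t)n;
}
```

Because the kernel rounds the mapping up to whole pages, a 1024-byte request still consumes at least one full page, which is the overhead mentioned above.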
Some Accelerate.framework functions take in a user-supplied temp buffer to use as scratch space. Here we have to assume that the buffer passed to us is wildly misaligned and the user is actively trying to make our life hard out of spite. (Our test cases stick a guard page right before and after the temp buffer to underline the spite.) Here, we return the minimum size we need to guarantee a 16-byte aligned segment somewhere in it, and then manually align the buffer afterward. This size is desired_size + alignment - 1. So, in this case that is 1024 + 16 - 1 = 1039 bytes. Then align like so:
#include <stdint.h>
void My_func( uint8_t *tempBuf, ... )
{
    /* round tempBuf up to the next multiple of alignment (a power of two) */
    uint8_t *alignedBuf = (uint8_t*)
        (((uintptr_t) tempBuf + ((uintptr_t)alignment-1))
            & -((uintptr_t) alignment));
    ...
}
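Adding alignment-1 moves the pointer to or past the first aligned address, and then ANDing with -alignment (e.g. 0xfff...ff0 for alignment=16) brings it back down to that aligned address.
As described by other posts, on other operating systems without 16-byte alignment guarantees, you can call malloc with the larger size, set aside the pointer for free() later, then align as described immediately above and use the aligned pointer, much as described for our temp buffer case.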
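A worked example of the rounding arithmetic, with alignment = 16: the address 0x1003 plus 15 is 0x1012, and masking off the low four bits gives 0x1010. An address that is already aligned, such as 0x1010, becomes 0x101f after the add and masks straight back to 0x1010. The sketch below packages the same expression into two hypothetical helpers (align_up_addr and align_up are names chosen here for illustration).

```c
#include <assert.h>
#include <stdint.h>

/* Rounds addr up to the next multiple of alignment (a power of two).
   Unsigned negation makes -alignment the mask 0xfff...ff0 for alignment = 16. */
uintptr_t align_up_addr(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + (alignment - 1)) & -alignment;
}

/* Pointer flavor of the same computation. The caller must have at least
   alignment - 1 spare bytes after p, e.g. 1039 bytes for a 1024-byte payload. */
uint8_t *align_up(uint8_t *p, uintptr_t alignment)
{
    return (uint8_t *)align_up_addr((uintptr_t)p, alignment);
}
```

The result is never more than alignment - 1 bytes past the input, which is why desired_size + alignment - 1 bytes is always enough.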
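One common way to "set aside the pointer for free()" is to store it in the bytes just below the aligned block. The sketch below assumes that layout; malloc_aligned and free_aligned are hypothetical names, not functions from any standard library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: over-allocates with malloc, aligns inside the block,
   and stashes the original pointer just below the aligned address. */
void *malloc_aligned(size_t alignment, size_t size)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* Worst-case padding plus room for the stashed pointer. */
    size_t extra = (alignment - 1) + sizeof(void *);
    if (size > SIZE_MAX - extra)
        return NULL;

    uint8_t *raw = malloc(size + extra);
    if (raw == NULL)
        return NULL;

    /* Skip past the stash slot, then round up as in the temp buffer case. */
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uint8_t *aligned = (uint8_t *)((start + (alignment - 1)) & -(uintptr_t)alignment);

    /* memcpy avoids a misaligned pointer store when alignment < sizeof(void *). */
    memcpy(aligned - sizeof(void *), &raw, sizeof raw);
    return aligned;
}

/* Recovers the stashed pointer and frees the original allocation. */
void free_aligned(void *aligned)
{
    if (aligned == NULL)
        return;
    void *raw;
    memcpy(&raw, (uint8_t *)aligned - sizeof(void *), sizeof raw);
    free(raw);
}
```

Pointers from malloc_aligned must go back through free_aligned; passing them to free() directly would hand the allocator an address it never returned.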
As for aligned_memset, this is rather silly. You only have to loop in up to 15 bytes to reach an aligned address, and then proceed with aligned stores after that, with some possible cleanup code at the end. You can even do the cleanup bits in vector code, either as unaligned stores that overlap the aligned region (provided the length is at least the length of a vector) or using something like maskmovdqu. Someone is just being lazy. However, it is probably a reasonable interview question if the interviewer wants to know whether you are comfortable with stdint.h, bitwise operators and memory fundamentals, so the contrived example can be forgiven.
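The head / aligned body / cleanup structure can be sketched in portable C as follows. This is an illustration of the loop shape only, not a tuned implementation: sketch_memset is a hypothetical name, the 16-byte block copy stands in for a real aligned vector store, and the cleanup here is a plain byte loop rather than the overlapping or masked vector stores mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16u   /* assumed vector width: 16 bytes, as for SSE */

/* Hypothetical memset that reaches an aligned address first. */
void *sketch_memset(void *dst, int value, size_t n)
{
    uint8_t *p = dst;
    uint8_t byte = (uint8_t)value;
    uint8_t pattern[VEC_BYTES];

    for (size_t i = 0; i < VEC_BYTES; i++)
        pattern[i] = byte;

    /* Head: at most 15 single-byte stores to reach a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)p & (VEC_BYTES - 1)) != 0) {
        *p++ = byte;
        n--;
    }

    /* Body: whole 16-byte blocks, each starting at an aligned address. */
    while (n >= VEC_BYTES) {
        assert(((uintptr_t)p & (VEC_BYTES - 1)) == 0);
        memcpy(p, pattern, VEC_BYTES);
        p += VEC_BYTES;
        n -= VEC_BYTES;
    }

    /* Cleanup: the remaining 0 to 15 bytes. */
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}
```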