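Admittedly, I don't get it. Say you have a memory whose memory word is 1 byte long. Why can't you access a 4-byte variable in a single memory access at an unaligned address (i.e. one not divisible by 4), as you can with an aligned address?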


Current answer

If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
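The bank arrangement described above can be sketched in C. This is only an illustrative model, not a description of any real memory controller: the `banked_memory` type, the 64-row size, and the function names are all assumptions made for the example. Byte address `a` is assumed to live in bank `a % 4` at row `a / 4`, and `read32` imitates the "plus one" signal by letting some banks use the next row.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BANKS 4   /* one byte-wide bank per byte lane of a 32-bit bus */
#define NUM_ROWS  64  /* arbitrary size chosen for this sketch */

/* Hypothetical model: byte address a lives in bank (a % 4), row (a / 4). */
typedef struct {
    uint8_t bank[NUM_BANKS][NUM_ROWS];
} banked_memory;

/* Number of distinct row addresses a 4-byte read starting at addr must
   present to the banks: 1 when aligned, 2 otherwise. */
static int rows_touched(uint32_t addr)
{
    uint32_t first_row = addr / NUM_BANKS;
    uint32_t last_row  = (addr + 3) / NUM_BANKS;
    return (int)(last_row - first_row) + 1;
}

/* Read 4 bytes (little-endian) the way "plus one" hardware would:
   banks below the starting lane use row + 1, the others use row. */
static uint32_t read32(const banked_memory *m, uint32_t addr)
{
    assert(addr + 3 < NUM_BANKS * NUM_ROWS);
    uint32_t row  = addr / NUM_BANKS;
    uint32_t lane = addr % NUM_BANKS;
    uint32_t value = 0;
    for (uint32_t b = 0; b < NUM_BANKS; b++) {
        uint32_t r = (b < lane) ? row + 1 : row;          /* the "plus one" signal */
        uint32_t pos = (b + NUM_BANKS - lane) % NUM_BANKS; /* byte position in result */
        value |= (uint32_t)m->bank[b][r] << (8 * pos);
    }
    return value;
}
```

In this model an aligned read sends the same row to every bank, while an unaligned read needs two different rows at once, which is exactly what plain memory without the extra signal cannot do in one access.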

Other answers

On PowerPC you can load an integer from an odd address with no problems.

Sparc, I86 (when alignment checking is enabled), and (I think) Itanium raise hardware exceptions when you try this.

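On most modern processors, one 32-bit load versus four 8-bit loads isn't going to make much difference. Whether the data is already in the cache will have a far greater effect.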

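It's a limitation of many underlying processors. It can usually be worked around by doing four inefficient single-byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw unaligned accesses and force everything to be aligned.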
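As a rough illustration of that workaround in C, the sketch below reads a 32-bit value from an arbitrary address without ever dereferencing a misaligned `uint32_t *` (which is undefined behavior in C). The function names are made up for this example. The first variant does the four single-byte fetches explicitly and assumes little-endian byte order; the second uses `memcpy`, which keeps host byte order and which compilers can often reduce to a single load on targets that tolerate unaligned access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assemble a little-endian 32-bit value from four single-byte reads.
   Byte accesses are always aligned, so this works at any address. */
static uint32_t load_le32_bytewise(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Copy four bytes into a properly aligned local, in host byte order. */
static uint32_t load_u32_memcpy(const unsigned char *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

For example, with a buffer holding the bytes `00 11 22 33 44`, `load_le32_bytewise(buf + 1)` yields `0x44332211` regardless of how `buf` itself is aligned.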

There is much more information in this link that the OP discovered.

Fundamentally, the reason is that the memory bus has some specific width that is much, much smaller than the memory size.

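So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU has the vastly smaller width of a cache line. This will be on the order of 128 bits.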

So:

262,144 bits - size of memory
    128 bits - size of bus

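Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to DRAM.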
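Whether a given access straddles two lines is simple address arithmetic. The helper below is a minimal sketch with a made-up name, and the line size is a parameter the caller supplies (16 bytes would match the 128-bit figure above; many real caches use other sizes), so nothing here is specific to a particular CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if an access of `size` bytes at `addr` spans two cache lines.
   line_size is an assumption supplied by the caller; it must be a power of two. */
static bool crosses_cache_line(uintptr_t addr, size_t size, size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    assert(size > 0 && size <= line_size);
    uintptr_t first_line = addr / line_size;              /* line holding the first byte */
    uintptr_t last_line  = (addr + size - 1) / line_size; /* line holding the last byte */
    return first_line != last_line;
}
```

With 16-byte lines, a 4-byte access at offset 12 stays inside one line, while the same access at offsets 13, 14, or 15 spills into the next. A 4-byte access at an address divisible by 4 can never cross a line, which is why alignment avoids the extra read entirely.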

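Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines, each of which holds a piece of the data. On one line it will be in the very high-order bits; on the other, in the very low-order bits.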
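The stitching step can be imitated in software with shifts. This is a sketch under stated assumptions: little-endian byte order, 32-bit aligned units standing in for the two cache lines, and a hypothetical function name. It is not how any particular CPU implements the merge.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild a 32-bit value that starts `offset` bytes (0..3) into lo_word and
   continues into hi_word, the next aligned word. Little-endian assumed. */
static uint32_t combine_le32(uint32_t lo_word, uint32_t hi_word, unsigned offset)
{
    assert(offset < 4);
    if (offset == 0) {
        return lo_word;  /* aligned case; also avoids an undefined shift by 32 */
    }
    /* The wanted bytes sit in the high-order end of lo_word and the
       low-order end of hi_word, so shift each into place and merge. */
    return (lo_word >> (8 * offset)) | (hi_word << (32 - 8 * offset));
}
```

If memory holds the bytes `00 01 02 03 04 05 06 07`, the two aligned words are `0x03020100` and `0x07060504`, and `combine_le32(0x03020100, 0x07060504, 1)` produces `0x04030201`, the value stored at byte address 1.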

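There will be dedicated hardware, fully integrated into the pipeline, that handles moving aligned objects onto the necessary bits of the CPU data bus. Such hardware may be lacking for misaligned objects, because it probably makes more sense to spend those transistors on speeding up correctly optimized programs.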

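In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.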

@joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects he described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments looks like. In addition, here's a link to a GitHub gist with the code for the test. The test code is adapted from the article written by Jonathan Rentzsch which @joshperry referenced. The tests were run on a MacBook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16GB of RAM.