Once upon a time, to write x86 assembler, for example, you would have instructions stating "load the EDX register with the value 5", "increment the EDX register", and so on.

With modern CPUs that have 4 cores (or even more), at the machine code level does it just look like there are 4 separate CPUs (i.e. are there just 4 distinct "EDX" registers)? If so, when you say "increment the EDX register", what determines which CPU's EDX register is incremented? Is there a "CPU context" or "thread" concept in x86 assembler now?

How does communication/synchronization between the cores work?

If you were writing an operating system, what mechanism is exposed via the hardware to allow you to schedule execution on different cores? Is it some special privileged instruction(s)?

If you were writing an optimizing compiler/bytecode VM for a multicore CPU, what would you need to know specifically about, say, x86 to make it generate code that runs efficiently across all the cores?

What changes have been made to x86 machine code to support multi-core functionality?


Current answer

This is not done in machine instructions at all; the cores pretend to be distinct CPUs and don't have any special capabilities for talking to one another. There are two ways they communicate:

- They share the physical address space. The hardware handles cache coherency, so one CPU writes to a memory address which another one reads.

- They share an APIC (programmable interrupt controller). This is memory mapped into the physical address space, and can be used by one processor to control the others, turn them on or off, send interrupts, and so on.

http://www.cheesecake.org/sac/smp.html is a good reference, silly URL notwithstanding.

Other answers

What has been added on every multiprocessing-capable architecture compared to the single-processor variants that came before them are instructions to synchronize between cores. Also, you have instructions to deal with cache coherency, flushing buffers, and similar low-level operations an OS has to deal with. In the case of simultaneous multithreaded architectures like IBM POWER6, IBM Cell, Sun Niagara, and Intel "Hyperthreading", you also tend to see new instructions to prioritize between threads (like setting priorities and explicitly yielding the processor when there is nothing to do).

But the basic single-threaded semantics are the same; you just add extra facilities to handle synchronization and communication with other cores.

If you were writing an optimizing compiler/bytecode VM for a multicore CPU, what would you need to know specifically about, say, x86 to make it generate code that runs efficiently across all the cores?

As someone who writes optimizing compilers/bytecode VMs, I may be able to help you here.

You do not need to know anything in particular about x86 to make it generate code that runs efficiently across all the cores.

However, you may need to know about cmpxchg and friends in order to write code that runs correctly across all the cores. Multicore programming requires the use of synchronization and communication between threads of execution.
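
For a flavor of what that means in practice, here is a minimal sketch of my own (not part of the original answer) of a compare-and-swap retry loop written with portable C++ std::atomic; on x86, the compare-exchange typically compiles down to lock cmpxchg:

#include <atomic>

// Lock-free increment built on compare-and-swap.
// On x86, compare_exchange_weak typically compiles to LOCK CMPXCHG.
void cas_increment(std::atomic<unsigned long> &counter) {
    unsigned long expected = counter.load();
    // If another core changed the value between our load and the CAS,
    // compare_exchange_weak reloads `expected` and we try again.
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
    }
}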

You may need to know something about x86 in general to make it generate code that runs efficiently on x86.

There are other things that would be useful for you to learn:

You should learn about the facilities the OS (Linux or Windows or OSX) provides to allow you to run multiple threads. You should learn about parallelization APIs such as OpenMP and Threading Building Blocks, or OSX 10.6 "Snow Leopard"'s forthcoming "Grand Central" (a minimal OpenMP sketch follows this list).

You should consider whether the compiler should auto-parallelize, or whether the author of the applications compiled by your compiler needs to add special syntax or API calls into his program to take advantage of the multiple cores.
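
As a taste of the explicit-syntax route, here is a minimal OpenMP sketch of my own (an illustration, not from the original answer; compile with g++ -fopenmp). A single pragma asks the compiler to spread the loop across the available cores:

#include <cstdio>

int main() {
    const int n = 1000000;
    double sum = 0.0;
    // OpenMP runs chunks of the iterations on different cores and
    // combines the per-thread partial sums at the end.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (i + 1);
    std::printf("sum = %f\n", sum);
    return 0;
}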

The main difference between a single- and a multi-threaded application is that the former has one stack and the latter has one for each thread. Code is generated somewhat differently since the compiler will assume that the data and stack segment registers (ds and ss) are not equal. This means that indirection through the ebp and esp registers that default to the ss register won't also default to ds (because ds!=ss). Conversely, indirection through the other registers which default to ds won't default to ss.

The threads share everything else including data and code areas. They also share lib routines so make sure that they are thread-safe.

A procedure that sorts an area in RAM can be multi-threaded to speed things up. The threads will then be accessing, comparing and ordering data in the same physical memory area and executing the same code but using different local variables to control their respective part of the sort. This of course is because the threads have different stacks where the local variables are contained. This type of programming requires careful tuning of the code so that inter-core data collisions (in caches and RAM) are reduced, which in turn results in a code which is faster with two or more threads than it is with just one. Of course, an un-tuned code will often be faster with one processor than with two or more.

To debug is more challenging because the standard "int 3" breakpoint will not be applicable since you want to interrupt a specific thread and not all of them. Debug register breakpoints do not solve this problem either, unless you can set them on the specific processor executing the specific thread you want to interrupt.
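
As a hedged sketch of the sort scenario just described (the names and structure are mine, not the answer's): two threads sort the two halves of the same array in shared memory, each driving its loop from locals on its own stack, and the caller merges the results:

#include <algorithm>
#include <thread>
#include <vector>

// Two threads sort halves of the same buffer in shared memory;
// each thread's counters and iterators live on its own private stack.
void parallel_sort(std::vector<int> &v) {
    auto mid = v.begin() + v.size() / 2;
    std::thread left([&] { std::sort(v.begin(), mid); });
    std::thread right([&] { std::sort(mid, v.end()); });
    left.join();
    right.join();
    // Merge the two sorted halves in place.
    std::inplace_merge(v.begin(), mid, v.end());
}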

Other multi-threaded code may involve different threads running in different parts of the program. This type of programming requires less of that kind of tuning and is therefore much easier to learn.

Intel x86 minimal runnable baremetal example

Runnable bare metal example with all the required boilerplate. All major parts are covered below.

Tested on Ubuntu 15.10 QEMU 2.3.0 and on a Lenovo ThinkPad T400 real hardware guest.

The Intel Manual Volume 3 System Programming Guide - 325384-056US September 2015 covers SMP in chapters 8, 9 and 10.

Table 8-1. "Broadcast INIT-SIPI-SIPI Sequence and Choice of Timeouts" contains an example that basically just works:

MOV ESI, ICR_LOW    ; Load address of ICR low dword into ESI.
MOV EAX, 000C4500H  ; Load ICR encoding for broadcast INIT IPI
                    ; to all APs into EAX.
MOV [ESI], EAX      ; Broadcast INIT IPI to all APs
; 10-millisecond delay loop.
MOV EAX, 000C46XXH  ; Load ICR encoding for broadcast SIPI IP
                    ; to all APs into EAX, where xx is the vector computed in step 10.
MOV [ESI], EAX      ; Broadcast SIPI IPI to all APs
; 200-microsecond delay loop
MOV [ESI], EAX      ; Broadcast second SIPI IPI to all APs
                    ; Waits for the timer interrupt until the timer expires

In that code:

- Most operating systems will make most of those operations impossible from ring 3 (user programs), so you need to write your own kernel to play freely with it: a userland Linux program will not work.

- At first, a single processor runs, called the bootstrap processor (BSP). It must wake up the other ones (called Application Processors (AP)) through special interrupts called Inter Processor Interrupts (IPI).

- Those interrupts can be done by programming the Advanced Programmable Interrupt Controller (APIC) through its Interrupt Command Register (ICR). The format of the ICR is documented at: 10.6 "ISSUING INTERPROCESSOR INTERRUPTS". The IPI happens as soon as we write to the ICR.

- ICR_LOW is defined at 8.4.4 "MP Initialization Example" as:

      ICR_LOW EQU 0FEE00300H

  The magic value 0FEE00300 is the memory address of the ICR, as documented at Table 10-1 "Local APIC Register Address Map".

- The simplest possible method is used in the example: it sets up the ICR to send broadcast IPIs which are delivered to all other processors except the current one. But it is also possible, and recommended by some, to get information about the processors through special data structures set up by the BIOS, like ACPI tables or Intel's MP configuration table, and only wake up the ones you need one by one.

- XX in 000C46XXH encodes the address of the first instruction that the processor will execute as:

      CS = XX * 0x100
      IP = 0

  Remember that CS multiplies addresses by 0x10, so the actual memory address of the first instruction is:

      XX * 0x1000

  So if for example XX == 1, the processor will start at 0x1000.

  We must then ensure that there is 16-bit real mode code to be run at that memory location, e.g. with:

      cld
      mov $init_len, %ecx
      mov $init, %esi
      mov $0x1000, %edi
      rep movsb

      .code16
      init:
          xor %ax, %ax
          mov %ax, %ds
          /* Do stuff. */
          hlt
      .equ init_len, . - init

  Using a linker script is another possibility.

- The delay loops are an annoying part to get working: there is no super simple way to do such sleeps precisely. Possible methods include:

  - PIT (used in my example)
  - HPET
  - calibrate the time of a busy loop with the above, and use it instead

  Related: How to display a number on the screen and sleep for one second with DOS x86 assembly?

- I think the initial processor needs to be in protected mode for this to work, as we write to address 0FEE00300H, which is too high for 16 bits.

- To communicate between processors, we can use a spinlock on the main process, and modify the lock from the second core.

- We should ensure that memory write back is done, e.g. through wbinvd.

Shared state between processors

8.7.1 "State of the Logical Processors" says:

The following features are part of the architectural state of logical processors within Intel 64 or IA-32 processors supporting Intel Hyper-Threading Technology. The features can be subdivided into three groups:

- Duplicated for each logical processor
- Shared by logical processors in a physical processor
- Shared or duplicated, depending on the implementation

The following features are duplicated for each logical processor:

- General purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, and EBP)
- Segment registers (CS, DS, SS, ES, FS, and GS)
- EFLAGS and EIP registers. Note that the CS and EIP/RIP registers for each logical processor point to the instruction stream for the thread being executed by the logical processor.
- x87 FPU registers (ST0 through ST7, status word, control word, tag word, data operand pointer, and instruction pointer)
- MMX registers (MM0 through MM7)
- XMM registers (XMM0 through XMM7) and the MXCSR register
- Control registers and system table pointer registers (GDTR, LDTR, IDTR, task register)
- Debug registers (DR0, DR1, DR2, DR3, DR6, DR7) and the debug control MSRs
- Machine check global status (IA32_MCG_STATUS) and machine check capability (IA32_MCG_CAP) MSRs
- Thermal clock modulation and ACPI Power management control MSRs
- Time stamp counter MSRs
- Most of the other MSR registers, including the page attribute table (PAT). See the exceptions below.
- Local APIC registers.
- Additional general purpose registers (R8-R15), XMM registers (XMM8-XMM15), control register, IA32_EFER on Intel 64 processors.

The following features are shared by logical processors:

- Memory type range registers (MTRRs)

Whether the following features are shared or duplicated is implementation-specific:

- IA32_MISC_ENABLE MSR (MSR address 1A0H)
- Machine check architecture (MCA) MSRs (except for the IA32_MCG_STATUS and IA32_MCG_CAP MSRs)
- Performance monitoring control and counter MSRs

Cache sharing is discussed at:

- How are cache memories shared in multicore Intel CPUs?
- http://stackoverflow.com/questions/4802565/multiple-threads-and-cpu-cache
- Can multiple CPUs / cores access the same RAM simultaneously?

Intel Hyperthreading has more cache and pipeline sharing than separate cores: https://superuser.com/questions/133082/hyper-threading-and-dual-core-whats-the-difference/995858#995858

Linux kernel 4.2

The main initialization action seems to be at arch/x86/kernel/smpboot.c.

ARM minimal runnable baremetal example

Here I provide a minimal runnable ARMv8 aarch64 example for QEMU:

.global mystart
mystart:
    /* Reset spinlock. */
    mov x0, #0
    ldr x1, =spinlock
    str x0, [x1]

    /* Read cpu id into x1.
     * TODO: cores beyond 4th?
     * Mnemonic: Main Processor ID Register
     */
    mrs x1, mpidr_el1
    ands x1, x1, 3
    beq cpu0_only
cpu1_only:
    /* Only CPU 1 reaches this point and sets the spinlock. */
    mov x0, 1
    ldr x1, =spinlock
    str x0, [x1]
    /* Ensure that CPU 0 sees the write right now.
     * Optional, but could save some useless CPU 1 loops.
     */
    dmb sy
    /* Wake up CPU 0 if it is sleeping on wfe.
     * Optional, but could save power on a real system.
     */
    sev
cpu1_sleep_forever:
    /* Hint CPU 1 to enter low power mode.
     * Optional, but could save power on a real system.
     */
    wfe
    b cpu1_sleep_forever
cpu0_only:
    /* Only CPU 0 reaches this point. */

    /* Wake up CPU 1 from initial sleep!
     * See: https://github.com/cirosantilli/linux-kernel-module-cheat#psci
     */
    /* PSCI function identifier: CPU_ON. */
    ldr w0, =0xc4000003
    /* Argument 1: target_cpu */
    mov x1, 1
    /* Argument 2: entry_point_address */
    ldr x2, =cpu1_only
    /* Argument 3: context_id */
    mov x3, 0
    /* Unused hvc args: the Linux kernel zeroes them,
     * but I don't think it is required.
     */
    hvc 0

spinlock_start:
    ldr x0, spinlock
    /* Hint CPU 0 to enter low power mode. */
    wfe
    cbz x0, spinlock_start

    /* Semihost exit. */
    mov x1, 0x26
    movk x1, 2, lsl 16
    str x1, [sp, 0]
    mov x0, 0
    str x0, [sp, 8]
    mov x1, sp
    mov w0, 0x18
    hlt 0xf000

spinlock:
    .skip 8

GitHub upstream.

Assemble and run:

aarch64-linux-gnu-gcc \
  -mcpu=cortex-a57 \
  -nostdlib \
  -nostartfiles \
  -Wl,--section-start=.text=0x40000000 \
  -Wl,-N \
  -o aarch64.elf \
  -T link.ld \
  aarch64.S \
;
qemu-system-aarch64 \
  -machine virt \
  -cpu cortex-a57 \
  -d in_asm \
  -kernel aarch64.elf \
  -nographic \
  -semihosting \
  -smp 2 \
;

In this example, we put CPU 0 in a spinlock loop, and it only exits it when CPU 1 releases the spinlock.

After the spinlock, CPU 0 then does a semihost exit call, which makes QEMU quit.

If you start QEMU with just one CPU with -smp 1, then the simulation just hangs forever on the spinlock.

CPU 1 is woken up with the PSCI interface, more details at: ARM: Start/Wakeup/Bringup the other CPU cores/APs and pass execution start address?

The upstream version also has a few tweaks to make it work on gem5, so you can experiment with performance characteristics as well.

I haven't tested it on real hardware, so I'm not sure how portable this is. The following Raspberry Pi bibliography might be of interest:

- https://github.com/bztsrc/raspi3-tutorial/tree/a3f069b794aeebef633dbe1af3610784d55a0efa/02_multicorec
- https://github.com/dwelch67/raspberrypi/tree/a09771a1d5a0b53d8e7a461948dc226c5467aeec/multi00
- https://github.com/LdB-ECM/Raspberry-Pi/blob/3b628a2c113b3997ffdb408db03093b2953e4961/Multicore/SmartStart64.S
- https://github.com/LdB-ECM/Raspberry-Pi/blob/3b628a2c113b3997ffdb408db03093b2953e4961/Multicore/SmartStart32.S

This document provides some guidance on using ARM synchronization primitives, which you can then use to do fun things with multiple cores: http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/DHT0008A_arm_synchronization_primitives.pdf

Tested on Ubuntu 18.10, GCC 8.2.0, Binutils 2.31.1, QEMU 2.12.0.

Next steps for more convenient programmability

The previous examples wake up the secondary CPU and do basic memory synchronization with dedicated instructions, which is a good start.

But to make multicore systems easy to program, e.g. like POSIX pthreads, you would also need to go into the following more involved topics:

- Set up interrupts and run a timer that periodically decides which thread will run now. This is known as preemptive multithreading. Such a system also needs to save and restore thread registers as they are started and stopped.

  It is also possible to have non-preemptive multitasking systems, but those might require you to modify your code so that every thread yields (e.g. with a pthread_yield implementation), and it becomes harder to balance workloads.

  Here are some simplistic bare metal timer examples: x86 PIT

- Deal with memory conflicts. Notably, each thread will need a unique stack if you want to code in C or other high level languages. You could just limit threads to have a fixed maximum stack size, but the nicer way to deal with this is with paging, which allows for efficient "unlimited size" stacks.

  Here is a naive aarch64 baremetal example that would blow up if the stack grows too deep.

Those are all good reasons to use the Linux kernel or some other operating system :-)

Userland memory synchronization primitives

Although thread start/stop/management is generally beyond the scope of userland, you can however use assembly instructions from userland threads to synchronize memory accesses without potentially more expensive system calls.

You should of course prefer to use libraries that portably wrap these low-level primitives. The C++ standard itself has made great advances with the <mutex> and <atomic> headers, and in particular with std::memory_order. I'm not sure if it covers all possible achievable memory semantics, but it just might.

The more subtle semantics are particularly relevant in the context of lock-free data structures, which can offer performance benefits in certain cases. To implement those, you will likely have to learn a bit about the different types of memory barriers: https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
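
For example, the classic message-passing pattern can be written with release/acquire ordering; this is a minimal sketch of mine, not from the original answer:

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready(false);

void producer() {
    payload = 42;
    // Release: all writes above become visible to whoever acquires `ready`.
    ready.store(true, std::memory_order_release);
}

void consumer() {
    // Acquire: once we observe true, the write to `payload` is visible.
    while (!ready.load(std::memory_order_acquire)) {}
    assert(payload == 42);
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}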

Boost, for example, has some lock-free container implementations at https://www.boost.org/doc/libs/1_63_0/doc/html/lockfree.html

Such userland instructions also appear to be used to implement the Linux futex system call, which is one of the main synchronization primitives in Linux. man futex 4.15 reads:

The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. A user-space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition.

The syscall name itself means "fast userspace mutex".
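
To give a flavor of the raw interface, here is a minimal Linux-only sketch of my own (illustrative; real code should use pthreads or higher-level wrappers). glibc provides no futex wrapper, so it goes through syscall directly:

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <thread>
#include <unistd.h>

std::atomic<int> futex_word(0);

// Raw futex syscall: glibc does not provide a wrapper.
long futex(int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, nullptr, nullptr, 0);
}

int main() {
    int *addr = reinterpret_cast<int *>(&futex_word);
    std::thread waker([addr] {
        futex_word.store(1);
        // Wake at most one thread sleeping on the word.
        futex(addr, FUTEX_WAKE, 1);
    });
    // FUTEX_WAIT sleeps in the kernel only while the word still holds 0,
    // so a wakeup between the check and the sleep is not lost.
    while (futex_word.load() == 0)
        futex(addr, FUTEX_WAIT, 0);
    waker.join();
}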

Here is a minimal useless C++ x86_64 / aarch64 example with inline assembly that illustrates basic usage of such instructions, mostly for fun:

main.cpp

#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

std::atomic_ulong my_atomic_ulong(0);
unsigned long my_non_atomic_ulong = 0;
#if defined(__x86_64__) || defined(__aarch64__)
unsigned long my_arch_atomic_ulong = 0;
unsigned long my_arch_non_atomic_ulong = 0;
#endif
size_t niters;

void threadMain() {
    for (size_t i = 0; i < niters; ++i) {
        my_atomic_ulong++;
        my_non_atomic_ulong++;
#if defined(__x86_64__)
        __asm__ __volatile__ (
            "incq %0;"
            : "+m" (my_arch_non_atomic_ulong)
            :
            :
        );
        // https://github.com/cirosantilli/linux-kernel-module-cheat#x86-lock-prefix
        __asm__ __volatile__ (
            "lock;"
            "incq %0;"
            : "+m" (my_arch_atomic_ulong)
            :
            :
        );
#elif defined(__aarch64__)
        __asm__ __volatile__ (
            "add %0, %0, 1;"
            : "+r" (my_arch_non_atomic_ulong)
            :
            :
        );
        // https://github.com/cirosantilli/linux-kernel-module-cheat#arm-lse
        __asm__ __volatile__ (
            "ldadd %[inc], xzr, [%[addr]];"
            : "=m" (my_arch_atomic_ulong)
            : [inc] "r" (1),
              [addr] "r" (&my_arch_atomic_ulong)
            :
        );
#endif
    }
}

int main(int argc, char **argv) {
    size_t nthreads;
    if (argc > 1) {
        nthreads = std::stoull(argv[1], NULL, 0);
    } else {
        nthreads = 2;
    }
    if (argc > 2) {
        niters = std::stoull(argv[2], NULL, 0);
    } else {
        niters = 10000;
    }
    std::vector<std::thread> threads(nthreads);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i] = std::thread(threadMain);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i].join();
    assert(my_atomic_ulong.load() == nthreads * niters);
    // We can also use the atomics directly through `operator T` conversion.
    assert(my_atomic_ulong == my_atomic_ulong.load());
    std::cout << "my_non_atomic_ulong " << my_non_atomic_ulong << std::endl;
#if defined(__x86_64__) || defined(__aarch64__)
    assert(my_arch_atomic_ulong == nthreads * niters);
    std::cout << "my_arch_non_atomic_ulong " << my_arch_non_atomic_ulong << std::endl;
#endif
}

GitHub upstream.

Possible output:

my_non_atomic_ulong 15264
my_arch_non_atomic_ulong 15267

From this we see that the x86 LOCK prefix / aarch64 LDADD instruction made the addition atomic: without it we have race conditions on many of the adds, and the total count at the end is less than the synchronized 20000.

See also:

- x86
  - What does the "lock" instruction mean in x86 assembly?
  - How does the x86 pause instruction work in a spinlock *and* can it be used in other scenarios?
- ARM
  - LDXR/STXR, LDAXR/STLXR: ARM64: LDXR/STXR vs LDAXR/STLXR
  - LDADD and other atomic v8.1 load modify store instructions: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0801g/alc1476202791033.html
  - WFE/SEV: WFE instruction handling in ARM
- What exactly is std::atomic?

Tested on Ubuntu 19.04 amd64 and with QEMU aarch64 user mode.

Unofficial SMP FAQ

Once upon a time, to write x86 assembler, for example, you would have instructions stating "load the EDX register with the value 5", "increment the EDX register", etc. With modern CPUs that have 4 cores (or even more), at the machine code level does it just look like there are 4 separate CPUs (i.e. are there just 4 distinct "EDX" registers)?

Exactly. There are 4 sets of registers, including 4 separate instruction pointers.

If so, when you say "increment the EDX register", what determines which CPU's EDX register is incremented?

The CPU that executed that instruction, of course. Think of it as 4 entirely different microprocessors that simply share the same memory.

Is there a "CPU context" or "thread" concept in x86 assembler now?

No. The assembler just translates instructions like it always did. No changes there.

How does communication/synchronization between the cores work?

Since they share the same memory, it's mostly a matter of program logic. Although there now is an inter-processor interrupt mechanism, it's not necessary and was not originally present in the first dual-CPU x86 systems.

If you were writing an operating system, what mechanism is exposed via the hardware to allow you to schedule execution on different cores?

The scheduler actually doesn't change, except that it is slightly more careful about critical sections and the types of locks used. Before SMP, kernel code would eventually call the scheduler, which would look at the run queue and pick a process to run as the next thread. (Processes to the kernel look a lot like threads.) The SMP kernel runs the exact same code, one thread at a time, it's just that now critical section locking needs to be SMP-safe to be sure two cores can't accidentally pick the same PID.
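
As a sketch of the kind of SMP-safe lock involved (a userland illustration of mine, not the kernel's actual implementation):

#include <atomic>

// Simplified spinlock of the sort an SMP kernel wraps around its run
// queue so that two cores cannot pick the same task at the same time.
class Spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set is one atomic read-modify-write: exactly one
        // core wins; the others spin until the winner unlocks.
        while (flag.test_and_set(std::memory_order_acquire)) {}
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};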

Is it some special privileged instruction(s)?

No. The cores are all running in the same memory with the same old instructions.

If you were writing an optimizing compiler/bytecode VM for a multicore CPU, what would you need to know specifically about, say, x86 to make it generate code that runs efficiently across all the cores?

You run the same code as before. It's the Unix or Windows kernel that had to change.

You could sum up my question as "What changes have been made to x86 machine code to support multi-core functionality?"

Nothing was necessary. The first SMP systems used the exact same instruction set as uniprocessors. Now, there has been a great deal of x86 architecture evolution and zillions of new instructions to make things go faster, but none were necessary for SMP.

For more information, see the Intel Multiprocessor Specification.

Update: all the follow-up questions can be answered by just completely accepting that an n-way multicore CPU is almost[1] exactly the same thing as n separate processors that just share the same memory.[2]

There was an important question not asked: how is a program written to run on more than one core for more performance? And the answer is: it is written using a thread library like Pthreads. Some thread libraries use "green threads" that are not visible to the OS, and those won't get separate cores, but as long as the thread library uses kernel thread features then your threaded program will automatically be multicore.

[1] For backwards compatibility, only the first core starts up at reset, and a few driver-type things need to be done to fire up the remaining ones.
[2] They also share all the peripherals, naturally.