F-CPU Architecture: Register Organization
Andrew D. Balsa (andrebalsa@altern.org)
August 1998

A description of the register organization of the Freedom CPU
architecture.

1. Introduction

One of the most important features defining a CPU architecture is the
internal register organization, and the F-CPU architecture breaks with
tradition here: we have chosen a memory-to-memory architecture (with
some twists).

Among the advantages of this architecture, one of the most important is
that any patents relating to register-register architectures are
automatically avoided. Also, we are innovating, and innovation has a
value by itself.

But the choice of a memory-to-memory architecture was also motivated by
technical performance reasons:

1. We wanted to avoid the usual cycle where variables are loaded from
   memory into registers, then processed, then returned to memory.
   These memory-to-CPU-to-memory cycles have three disadvantages:
   a) they increase the latency for processing any variable, b) they do
   no useful work, yet they increase the number of instructions needed
   for even the simplest assignment, and c) the compiler has to work
   harder at optimizing away these register load/unload cycles.

2. The presence of large, fast L1 caches inside CPUs and the constant
   improvements in VLSI technology allow more choices when it comes to
   an efficient memory hierarchy, compared to the "traditional" memory
   hierarchy in Von Neumann machines:

      main memory <---> caches <---> CPU registers

   In fact, we skip an entire level, since all we have now is:

      main memory <---> caches

3. A large register set means a longer context switch latency, because
   of the time needed to save the clobbered registers. Doing without
   registers sidesteps the issue entirely.

4. The issue of register windows has been previously researched, and
   the conclusion is that the benefits brought by this technique are
   independent of the basic architectural choice between CISC and RISC.
   So we decided to include this "twist" in our memory-to-memory
   architecture. In fact, it is very easy to implement a "memory
   window" feature on top of a memory-to-memory architecture.

5. Since all data move instructions are now basically memory-to-memory
   move instructions, the resulting instruction set is simpler and more
   orthogonal. The general purpose registers are truly general purpose.
   This greatly simplifies the user-visible machine.

6. We also have the advantages of low register pressure and graceful
   performance degradation in those special cases where many variables
   are needed.

7. The internal CPU architecture is simplified: the L0 data cache and
   the ALU are directly connected to the internal CPU bus. The control
   unit structure is simplified as well. We get the advantages of a
   large register set without spending any silicon real estate on it.

8. We can now tie our basic CPU clock frequency to how fast we can make
   the L0 data cache operate. This simplifies the basic job of the F-1
   VLSI implementation team.

2. Register Organization Implementation

The F-CPU architecture has the standard 64-bit Program Counter or
instruction pointer (PC) register, and also a standard 32-bit Status or
flags (ST) register.

It also has a 64-bit Memory Window (MW) register, which contains the
base address of the active memory window. This memory window is a set
of 32 8-byte blocks that can be accessed using short versions of the
standard instructions. Data pointed to by a memory window is usually
already in the L0 data cache, providing zero-latency accesses. Since
each window spans 32 x 8 = 256 bytes, at any instant there are 32
possible memory windows that can be accessed (with an 8KB L0 data
cache).

The F-CPU architecture also has many dedicated registers to control:

a) CPU Configuration
b) Memory Regions
c) FPU Control
d) Multiprocessing
e) Paging
f) Segmentation
g) Interrupt Processing
h) Coprocessor Control
i) Performance Monitoring
j) TimeStamp Counter Control
k) Reconfigurable Logic Control

3. Instruction Set

3.1. Addressing Modes

In a memory-to-memory architecture there is obviously no distinction
between a register-based addressing mode and a memory-based addressing
mode. An interesting consequence is that there is not much sense in
including immediate addressing modes, since these can just as well be
thought of as PC-relative memory-to-memory operations.

Consequently, the addressing modes of the F-CPU architecture are few
and simple:

· direct.
· indirect.
· PC-relative direct.
· PC-relative displacement.

Other addressing modes can be synthesized using parallelizable
instruction sequences, hence at near-zero cost in terms of performance.

3.2. Instruction Format

With a 64-bit instruction format, few addressing modes, external FPU
and coprocessors, and a generally regular instruction set, the F-CPU
architecture has a very simple instruction format.

All instructions are 64 bits long. Two bits decode into the following
four classes of instructions:

· Standard ALU and branch instructions.
· Control instructions.
· DMA instructions.
· FPU and coprocessor instructions.

Since the F-CPU instruction set is regular with respect to the size of
data, all instructions can address either a byte (8 bits), a word (16
bits), a double word (or double, meaning 32 bits) or a quad word (or
quad, meaning 64 bits). Two bits are thus spent encoding the operand
size. There are no alignment requirements.

Referencing any one of the memory addresses in the current active
window takes 5 bits per argument. Referencing an address in any of the
other 31 windows takes an extra 5 bits. Referencing an arbitrary
location anywhere in memory obviously takes a full 64-bit address.
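
The window arithmetic and the per-operand bit counts given above can be
checked with a short Python sketch. It is only an illustration, not
part of the F-CPU specification: the names slot_address and
operand_bits and the three operand kind labels are hypothetical, and
the mapping "slot n lives at MW + 8*n" is an assumption, since the text
says only that MW holds the base address of a window of 32 8-byte
blocks.

```python
# Figures taken from the text.
SLOT_BYTES = 8               # each window entry is an 8-byte block
SLOTS_PER_WINDOW = 32        # 32 blocks per memory window
L0_CACHE_BYTES = 8 * 1024    # 8KB L0 data cache

# Derived sizes.
WINDOW_BYTES = SLOT_BYTES * SLOTS_PER_WINDOW      # bytes covered by one window
WINDOWS_IN_L0 = L0_CACHE_BYTES // WINDOW_BYTES    # windows that fit in the L0 cache

ADDRESS_MASK = (1 << 64) - 1  # addresses are 64 bits wide


def slot_address(mw, slot):
    """Return the address of one slot in the active window.

    ASSUMPTION: slots are laid out contiguously from the base address
    held in MW, so slot n lives at MW + 8*n.
    """
    if not 0 <= slot < SLOTS_PER_WINDOW:
        raise ValueError("slot index must fit in 5 bits (0..31)")
    return (mw + slot * SLOT_BYTES) & ADDRESS_MASK


def operand_bits(kind):
    """Return the bits needed to name one operand, per the text.

    The kind labels are hypothetical names chosen for this sketch.
    """
    costs = {
        "active_window": 5,     # 5-bit slot index
        "other_window": 5 + 5,  # 5-bit slot index plus 5 extra bits
        "absolute": 64,         # full 64-bit address
    }
    return costs[kind]
```

For example, slot_address(0x1000, 3) gives 0x1018 under the assumed
layout, and a two-operand instruction that stays inside the active
window spends 2 * operand_bits("active_window") = 10 bits on operand
references.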