|
The architecture described in this document is the result of the efforts and ideas of many people. Consequently this section is an attempt to list those people (in alphabetical order) and if possible, the particular area they contributed to.
[if your name is not here, please drop me a note and I'll add it - ADB]
The F-CPU architecture will be unique in many ways by the time it is finalized, but certainly its distinctive characteristic is that it is being designed over the Internet by dozens of volunteers in a free, collaborative effort.
Engineers in general and CPU architects in particular won't easily recognize that an organizational model ultimately influences the resulting design of any large-scale engineering project: there is a widespread (mis)belief that the technical results are independent of human issues, guided only by rational choices based either on technical constraints or economic factors.
Another widespread myth is that good CPU architectures are the work of highly talented individuals working in total isolation, much like Seymour Cray designed the Cray-1 alone in the mountains. A slightly modified version of this myth states that keeping a small team of talented CS and EE engineers in isolation and under pressure will ultimately lead to revolutionary CPU architectures.
The F-CPU architecture is being developed within the framework of the Freedom Project, which is itself based on ideas borrowed from the Free Software movement. The cornerstone of the Free Software movement is the GNU/GPL (General Public License), a document that expresses in legal terms deep-rooted beliefs and ideals related to individual freedom in the Information Age. We cannot summarize the foundations of the Free Software movement in a few phrases, but we can at least mention one noteworthy aspect of Free Software projects: the source code is available for all to view, examine, understand and improve. When applied to a hardware project, this would translate into making the design and its implementation (in the form of VHDL, Verilog or any other hardware description language) as well as the masks corresponding to any particular IC process, freely available.
For the Freedom Project, not only will the final results of the development process be made freely available over the Internet, but the entire development process itself is free, in the sense that anyone can join and contribute by exchanging ideas, discussing design choices, and implementing them.
The F-CPU architecture and its implementation(s), like a (collective) work of art, will be the expression in silicon of these beliefs and ideals, as they reflect on our organizational, development and communication methodologies.
The present document is also being written taking into account this pervasive freedom to think, learn and create that runs through the Freedom Project: it can be changed, updated, corrected and extended at any time. The information herein contained can and should always be improved. The Freedom Project defines the framework in which these improvements can take place.
One last remark: we are doing our best to separate the implementation-dependent parts of the CPU design from the architecture-dependent ones, when this logical separation makes sense. Sometimes the architecture and its implementation cannot be separated conceptually and/or in practice (well, perhaps they can, but we don't know how to do it - yet).
The mission of the Freedom Project is to develop a new high-performance 64-bit CPU architecture, suitable for Personal Workstation use, and to implement this architecture in the available process technology. The new F-CPU architecture must support gcc and the Linux kernel (customizing these pieces of software is part of the Freedom Project). The implementation(s) must also conform to commodity PC technical requirements (e.g. an F-CPU mainboard has to fit in a standard PC case, with a standard PC keyboard connector, standard PCI bus connectors, etc.).
It is perhaps useful to mention development issues that are not part of the Freedom Project, e.g.:
This mission statement pretty much delimits the performance envelope of the F1 implementation, and somehow also influences the final CPU architecture.
Even though the F-CPU word size is 64 bits, we prefer to stick to H&P's and standard RISC notation. Since this is a matter of convention, we believe keeping with tradition is more befitting in this case.
The word exception is used to designate traps, interrupts, faults and exceptions. Exceptions can be divided into various classes, as described by H&P.
Many authors do not make a distinction between the expressions context switch and process switch (e.g. Tanenbaum or H&P). However, in this document we shall use the definitions found in [Stallings, William, Operating Systems, Macmillan 1992, p. 148]. Quoting:
It is clear, then, that a context switch is a concept distinct from that of a process switch. A context switch may occur without changing the state of the process that is currently in the Running state. In that case, the context saving and subsequent restoral involve little overhead. However, if the currently running process is to be moved to another state (Ready, Blocked, etc.), then the operating system [the kernel] must make substantial changes in the environment.
...
Thus, the process switch, which involves a state change, requires considerably more effort than a context switch.
Again, we'll stick to H&P terminology: jump will be used when the PC is changed unconditionally and branch will be used when it is changed conditionally.
H&P's notation conventions will be used throughout to define the F-CPU instruction set.
Early in the development of the F-CPU architecture, some hard technical choices were made. It is likely that the F-CPU architecture will have the following features:
Some other architectural choices are being dictated by the desire to innovate and improve upon current CPU architecture design practices. These choices are more questionable than the hard choices listed in the previous subsection, and yet once a consensus is reached over them, the F-CPU architecture will be in many ways determined. As we write these lines, these firm choices are:
Many ideas are still floating around on the various Freedom Project mailing lists.
The DLX CPU architecture was designed by H&P as a logical outcome of their research into RISC. Although it was never implemented exactly as described in their book Quantitative Approach, it is so similar to the many commercial RISC implementations that at least a parallel can be drawn between the DLX projected performance and the actual performance of MIPS, SPARC, etc.
The similarity between the Freedom CPU architecture and the DLX hypothetical CPU is not an accident. Basing ourselves on RISC concepts is a natural consequence of the desire to obtain the best performance from a (conceptually) simple design.
In designing the F-CPU architecture, we also feel compelled to make use of GCC DLX machine descriptions, DLX simulation and emulation tools, and the vast amount of information freely available on the Web about DLX. If possible, we'll avoid reinventing the wheel.
However, the F-CPU design is right now so different from DLX that it cannot be called a DLX descendant.
We'll use the following broad instruction classes defined by H&P for the DLX architecture:
In fact H&P had defined a separate instruction class for FP, but we managed to fold that into the Moves instruction class (we are really pushing RISC philosophy, here: simplify, simplify). ;-)
Move instructions support all data sizes up to and including 64 bits, as expected. Signed and unsigned byte, half-word and word are accounted for, as well as double. (128-bit moves?)
The F-CPU recognizes three distinct address spaces: I/O, coprocessor and memory.
Logical and arithmetic operations take either three registers or two registers and a 16-bit immediate operand.
For lack of imagination, we adopted almost exactly the same control flow instructions as DLX:
Easily the most controversial and misunderstood feature of the (proposed) F-CPU architecture to date is the choice of a memory-to-memory architecture. Nevertheless, it is simpler to think of the F-CPU in terms of a standard RISC machine - without registers!
"What? No registers? Is this a joke?" or "Well, if it's not a RISC CPU, it will perform like a dog." are the typical remarks. Some people with more experience notice that the TMS9900 microprocessor (one of the first 16-bit microprocessors) was a memory-to-memory machine, but they too observe that this was in the good old days when CPUs were slow and DRAM was fast. Today, we have exactly the opposite: CPUs are very fast and DRAM is slow.
We know all that.
Now please turn to page C-0 of H&P's Quantitative Approach. It says:
"RISC: any computer announced after 1985.", quoting Steven Przybylski, a designer of the Stanford MIPS. Why do H&P begin their description of various RISC CPUs with this humorous remark? Because it's true: you can't design a microprocessor architecture nowadays without taking into account the work done by the early RISC pioneers.
The term m2m/RISC was coined to describe the F-CPU architecture, so let's see how it differs from a standard RISC.
The rationale behind the concept of registers is, as we know, that of a memory hierarchy. This concept is discussed extensively in H&P, section 1.7. Each level of storage has smaller sizes, faster access times, greater bandwidth and smaller latencies as we go from disk storage to main memory, to L3 cache (if any), to L2 cache (if any), to L1 cache (all modern CPUs have this), to registers (all modern CPUs have these).
The sole memory-to-memory microprocessor architecture known to the authors was the Texas Instruments TMS9900. The idea sounds counter-intuitive given H&P's description of memory hierarchies in modern RISC CPUs.
However, the choice of a m2m/RISC architecture stems from a few observations of modern CPU designs:
The design objectives of the Freedom CPU architecture WRT its data path organization were:
The proposed Freedom architecture has banks of virtual registers, which are used to shadow 256-byte regions of memory (earlier on we used the terms to cache or to mirror, but these led to confusion and are now deprecated). How does that work and why is it good?
The idea is really very simple: each memory window (virtual register bank) has an associated 64-bit special register which defines the (virtual, as opposed to physical) base address in memory of a block of 32 (64-bit) words. We'll call this register MWBn, where 0 <= n <= (number of windows of the implementation - 1). The MWB registers are control registers, which means their value can only be changed in supervisor mode.
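As a rough illustration of this mapping, the sketch below computes the virtual address shadowed by each virtual register of a window. The function and constant names are ours, and we assume (the text does not say so explicitly) that register n shadows the n-th 64-bit word above the base address.

```python
WORD_BYTES = 8                                # one virtual register = one 64-bit word
VRS_PER_WINDOW = 32                           # a memory window shadows 32 words
WINDOW_BYTES = WORD_BYTES * VRS_PER_WINDOW    # 256 bytes per window
MASK64 = (1 << 64) - 1

def vr_address(mwb: int, vr: int) -> int:
    """Virtual address shadowed by register `vr` of a window based at `mwb`.

    Assumption: registers map to consecutive ascending words from the base.
    """
    if not 0 <= vr < VRS_PER_WINDOW:
        raise ValueError("virtual register index must be in 0..31")
    return (mwb + vr * WORD_BYTES) & MASK64
```

With a window based at 0x1000, the last register of the window would shadow address 0x10F8 under this assumption.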
A single MW (memory window) is active at any time, meaning that a user program only ever sees 32 memory addresses that can be directly operated upon (other memory addresses must be accessed indirectly). We'll call the virtual registers in the active memory window VR0 - VR31. VR0 is always set to 0, following standard RISC tradition.
The active memory window is defined by the least significant bits in register AMW, which is one of the control registers.
MW0 is reserved for the operating system kernel. Switching between memory windows (we call this a context switch) is achieved by changing the value of AMW. Since AMW is one of the control registers, it is only accessible to the OS kernel.
Whenever MWBn is loaded with a new value (a new base address), a 256-byte block is brought in from memory into the corresponding 32 internal VRs, and the previous values in the VRs are simultaneously written back to the 256-byte block (memory window) previously pointed to. This we define as a process switch. Note that the OS kernel must also save paging information during a process switch. We'll see how this can be achieved.
After a process switch, the contents of the VRs are _not_ maintained coherent with their corresponding memory addresses anymore.
So, as can be seen, each register set is in fact a "memory window", in the sense that its contents reflect the state of a memory block.
How does the hardware work? Just like any RISC CPU. Are there additional delays to use any of the internal virtual registers? No. Is this slower in any way compared to a RISC? No. Does this add complexity to the data path, compared to a RISC? No.
What is gained, compared to a standard RISC? Mainly lower context switch latencies: assuming the OS wants to switch from one MW to another, all it takes is to load a new value in the corresponding bits in the AMW register. This can be done in a single clock cycle, and obviously has a very low latency, compared to saving/restoring 32 registers.
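The difference in cost between the two kinds of switch can be seen in a toy software model. This is only a sketch under our own assumptions: the class and method names are hypothetical, memory is modelled as a sparse dictionary, the hard-wiring of VR0 to zero is left out, and a window that has never been loaded has nothing to write back.

```python
class WindowedCPU:
    """Toy model of memory windows; memory maps word address -> value."""

    def __init__(self, n_windows: int = 32):
        self.memory = {}                                 # sparse store of 64-bit words
        self.mwb = [None] * n_windows                    # MWBn; None = never loaded
        self.vr = [[0] * 32 for _ in range(n_windows)]   # internal virtual registers
        self.amw = 0                                     # active window (MW0 = kernel)
        self.words_moved = 0                             # memory traffic counter

    def context_switch(self, window: int) -> None:
        """Select another window: only AMW changes, no memory traffic."""
        self.amw = window

    def process_switch(self, window: int, new_base: int) -> None:
        """Load MWBn: write the old block back, then bring the new block in."""
        regs = self.vr[window]
        old_base = self.mwb[window]
        if old_base is not None:
            for i in range(32):
                self.memory[old_base + 8 * i] = regs[i]
                self.words_moved += 1
        for i in range(32):
            regs[i] = self.memory.get(new_base + 8 * i, 0)
            self.words_moved += 1
        self.mwb[window] = new_base
```

In this model a context switch moves no words at all, while reloading an already loaded window moves 64 words (32 written back, 32 brought in), which is the asymmetry the text relies on.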
For the compiler, the F-CPU is very much like many RISC CPUs: 32 general purpose registers (in reality special memory locations).
From the software point of view, the F-CPU is a pure memory-to-memory machine. From the hardware point of view, the F-CPU is a standard RISC machine with (32 x n) registers, where n is the number of register sets (32 in the F1 CPU implementation), and each register set is made up of 32 registers. The F-CPU doesn't have a flags register. A mechanism similar to the DLX/MIPS is used to test for condition codes.
The only conventional register is the 64-bit Program Counter register. Interestingly, during a context switch, the value of the PC is saved in VR0. Context switches can be achieved without any use of a stack to save the CPU state.
We shall refer to the addressing modes as though they would use standard registers. On the other hand, one must keep in mind the fact that each virtual register shadows a memory location. In this sense, there is no register addressing!
Like H&P, we establish a distinction between data addressing modes (which operate on data in the virtual registers and/or memory) and program flow (instruction) addressing modes (which always take as their destination operand the Program Counter).
The 32-bit instruction format allows for 16-bit immediate operands (according to H&P, 16-bit immediate data covers 75-80% of all uses of immediate data). This is basically similar to the DLX I-type instruction format, but some elements are re-arranged:
The opcode occupies the 6 most significant bits.
The destination virtual register occupies the following 5 bits.
The source VR follows, with another 5 bits.
The immediate operand comes last, with 16 bits.
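The field layout above can be sketched as a pair of pack/unpack helpers. The function names and the opcode value used in the example are hypothetical, and treating the immediate as a signed (sign-extended) quantity is our assumption, borrowed from DLX practice.

```python
def encode_itype(opcode: int, dest: int, src: int, imm: int) -> int:
    """Pack an I-type instruction: opcode(6) | dest VR(5) | source VR(5) | imm(16)."""
    if not 0 <= opcode < 64:
        raise ValueError("opcode must fit in 6 bits")
    if not (0 <= dest < 32 and 0 <= src < 32):
        raise ValueError("virtual register numbers must fit in 5 bits")
    if not -0x8000 <= imm <= 0x7FFF:
        raise ValueError("immediate must fit in 16 signed bits")
    return (opcode << 26) | (dest << 21) | (src << 16) | (imm & 0xFFFF)

def decode_itype(word: int) -> tuple:
    """Unpack a 32-bit I-type word into (opcode, dest, src, imm)."""
    opcode = (word >> 26) & 0x3F
    dest = (word >> 21) & 0x1F
    src = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:            # assumption: the immediate is sign-extended
        imm -= 0x10000
    return opcode, dest, src, imm
```

For instance, `encode_itype(0x08, 1, 2, -4)` places a made-up opcode 0x08 in the top six bits, VR1 as destination, VR2 as source and 0xFFFC in the low half-word; `decode_itype` recovers the same four fields.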
These are really three memory operands instructions. The encoding is again similar to the DLX R-type instructions, but as with the I-type instructions, the destination register field comes next to the opcode field.
Uses exactly the same format as the immediate addressing mode, but the 16-bit field is used to hold the displacement. According to H&P, 16-bit displacements capture 99% of all displacements.
Can be synthesized from the displacement addressing mode by setting the displacement field to zero.
Can be synthesized from the displacement addressing mode by using VR0 as the base register. This addressing mode is only useful for the I/O and coprocessor address spaces (for OS kernel use).
Obviously, other more sophisticated addressing modes can be synthesized from the basic register and displacement modes, with a slight performance penalty. According to H&P, the basic addressing modes provided above cover from 75% to 99% of the addressing modes in standard benchmark programs.
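The way the simpler modes fall out of displacement addressing can be sketched in a few lines. The helper names are ours, and we assume the 16-bit displacement is sign-extended before being added to the base register, as in DLX; the text does not state this.

```python
MASK64 = (1 << 64) - 1

def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of `value` as a signed quantity (assumption)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def effective_address(vr: list, base: int, disp16: int) -> int:
    """Displacement mode: contents of the base VR plus the displacement."""
    return (vr[base] + sign_extend16(disp16)) & MASK64

def register_deferred(vr: list, base: int) -> int:
    """Register deferred mode: displacement mode with a zero displacement."""
    return effective_address(vr, base, 0)

def absolute(vr: list, addr16: int) -> int:
    """Absolute mode: displacement mode with VR0 (always 0) as the base."""
    return effective_address(vr, 0, addr16)
```

Both derived modes are plain calls to `effective_address`, which mirrors the claim that no extra encoding is needed for them.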
PC-relative Jumps and Branches are available with a 26-bit offset, plus 2 bits since our instructions are aligned on 32-bit words. This 28-bit range provides ample program space (256MB) and ease of linking.
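As a worked example of the arithmetic, the sketch below turns a 26-bit word offset into a byte offset and applies it to the PC. We assume the offset is signed and is relative to whatever PC value is passed in; whether the hardware uses the address of the jump itself or of the following instruction is not settled here.

```python
MASK64 = (1 << 64) - 1
RANGE_BYTES = 1 << (26 + 2)     # 26-bit word offset, 4-byte instructions

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def pc_relative_target(pc: int, offset26: int) -> int:
    """New PC for a PC-relative jump or branch (offset assumed signed)."""
    byte_offset = sign_extend(offset26, 26) * 4   # word offset -> byte offset
    return (pc + byte_offset) & MASK64
```

`RANGE_BYTES` evaluates to 2**28 bytes, i.e. the 256MB span quoted above; with a signed offset that span is split between forward and backward targets.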
The destination address is taken from one of the registers, and a 16-bit (+2) displacement is added, resulting in the new PC value.
This is synthesized from the previous mode by setting the immediate field to zero.
The F-CPU architecture does not implement segmentation: pure paging is used, with 8KB pages. The MMU is particularly simple: a 64-entry TLB is used like in MIPS.
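With 8KB pages, the low 13 bits of a virtual address are the offset within the page and the remaining bits form the virtual page number. A minimal sketch of that split follows; the function names are ours.

```python
PAGE_SIZE = 8 * 1024
OFFSET_BITS = PAGE_SIZE.bit_length() - 1    # 13 bits for an 8KB page

def split_address(vaddr: int) -> tuple:
    """Split a virtual address into (virtual page number, offset in page)."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)

def join_address(frame: int, offset: int) -> int:
    """Rebuild an address from a page (or frame) number and an offset."""
    return (frame << OFFSET_BITS) | offset
```

For example, address 0x12345 falls in page 9 at offset 0x345.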
In parallel with the fixed-size 8KB paging scheme, the F-CPU architecture makes use of an implementation-dependent number (minimum of two) of variable-sized pages, ranging (in powers of 2) from 16KB up to the entire address space. These variable-sized pages can be used, for example, to map an entire video frame buffer in a single step, or to assign a fixed mapping to the OS kernel.
Like in MIPS, a TLB miss will trigger an exception, which is handled by context switching to the kernel. With a 64-bit address space, it was judged that an inverted hash table implementation would be more effectively implemented in software rather than in hardware, and would yield adequate performance.
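The software-managed refill can be sketched as follows. This is an illustration only: the class and function names are hypothetical, a Python dictionary stands in for the kernel's inverted hash table, and the first-in-first-out replacement policy is our assumption, since no policy is specified.

```python
class TLBMiss(Exception):
    """Raised when a virtual page number is not present in the TLB."""
    def __init__(self, vpn: int):
        super().__init__(f"TLB miss on page {vpn:#x}")
        self.vpn = vpn

class TLB:
    def __init__(self, entries: int = 64):
        self.entries = entries
        self.map = {}                      # vpn -> frame, kept in insertion order

    def translate(self, vaddr: int) -> int:
        vpn, offset = vaddr >> 13, vaddr & 0x1FFF     # 8KB pages
        if vpn not in self.map:
            raise TLBMiss(vpn)
        return (self.map[vpn] << 13) | offset

    def insert(self, vpn: int, frame: int) -> None:
        if vpn not in self.map and len(self.map) >= self.entries:
            del self.map[next(iter(self.map))]        # evict oldest (assumed FIFO)
        self.map[vpn] = frame

def access(tlb: TLB, page_table: dict, vaddr: int) -> int:
    """Translate `vaddr`, refilling the TLB in software on a miss."""
    try:
        return tlb.translate(vaddr)
    except TLBMiss as miss:
        # Kernel handler: look the page up (KeyError here would be a page fault).
        tlb.insert(miss.vpn, page_table[miss.vpn])
        return tlb.translate(vaddr)
```

The hardware side is reduced to `translate`; everything in the `except` branch corresponds to work the kernel would do after the exception.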
Modern CPUs have a plethora of special registers, used to control various features such as performance monitoring, virtual memory, cache control, timestamp counters/timers, etc. The F-CPU architecture is no exception, but it innovates in that there is a single, standard mechanism for accessing such registers that is at once fast, secure and simple to implement. In the F-CPU architecture, all CPU control registers are mapped in the first half (32KB) of the special coprocessor address space. Moving a value between any CCR and any virtual register takes a single instruction which executes in a single clock cycle.
The CCRs are only accessible in supervisor mode. Trying to access a CCR in user mode results in an illegal instruction exception.
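The access rule can be modelled in a few lines. The names below are hypothetical, the register store is a plain dictionary, and returning zero for an unimplemented CCR is our assumption rather than specified behaviour.

```python
CCR_REGION_BYTES = 32 * 1024    # CCRs occupy the first 32KB of coprocessor space

class IllegalInstruction(Exception):
    """Models the illegal instruction exception raised on a user-mode access."""

def is_ccr_address(coproc_addr: int) -> bool:
    """True if the coprocessor-space address falls in the CCR half."""
    return 0 <= coproc_addr < CCR_REGION_BYTES

def read_ccr(ccrs: dict, coproc_addr: int, supervisor: bool) -> int:
    """Move a CCR value towards a virtual register, enforcing the privilege check."""
    if not is_ccr_address(coproc_addr):
        raise ValueError("address is outside the CCR region")
    if not supervisor:
        raise IllegalInstruction("CCR access attempted in user mode")
    return ccrs.get(coproc_addr, 0)     # assumption: unimplemented CCRs read as 0
```

A user-mode call such as `read_ccr(ccrs, 0x10, supervisor=False)` raises `IllegalInstruction`, while the same call in supervisor mode returns the stored value.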
It was felt that Transport Triggered Architecture would provide a good conceptual base for the development of a highly parallelizable FPU architecture. Moving FP data from any virtual register to any TTA functional unit in the FP coprocessor address space takes a single instruction and a single clock cycle.
Block moves are frequent operations in modern personal workstations. Most CISC architectures implement block move instructions; early RISC designs, however, didn't. The F-CPU architecture does not specify block moves as part of the instruction set; instead, a general-purpose Block Move Coprocessor (BMC) will handle lengthy block move operations, freeing the main processing unit for other tasks. As with the FP coprocessor, initiating a block move command (by transferring the command word from any of the virtual registers to the BMC) takes a single instruction and a single clock cycle.
|