<!-- LinuxDoc file was created by LyX 0.12 (C) 1995-1998 by <root> Thu Sep 24 11:10:17 1998
 -->
<!-- Export filter v0.6 by Pascal Andre/Bernhard Iselborn-->

<!doctype linuxdoc system>

<article>

<title>The Freedom CPU ISA and Register Organization
<author>Andrew D. Balsa (maintainer/coordinator)
<date>Rev. 0.0.2, 24 September 1998
<abstract>Describes the F-CPU ISA (Instruction Set Architecture) and Register Organization, and discusses current choices. Supersedes all earlier documents!
<toc>
<sect>Acknowledgments
<p>

The architecture described in this document is the result of the efforts
 and ideas of many people. Consequently this section is an attempt to list those
 people (in alphabetical order) and if possible, the particular area they contributed
 to.
<itemize>
<item>Jecel Assumpcão Jr.: software assisted MMU, inverted hash table paging.
<item>Andrew D. Balsa: coordination.
<item>Alan Pinkerton: cleaning up the concept of virtual registers.
<item>Rafael Reilova: Block Move Coprocessor.
<item>Teemu Suutari: TTA logic.
<item>Brion Vibber: instruction format discussions.
<item>?: dropping 64-bit instructions in favour of 32-bit instructions.
</itemize>

&lsqb;if your name is not here, please drop me a note and I'll add it -
 ADB&rsqb;
<sect>Introduction
<p>

The F-CPU architecture will be unique in many ways by the time it is finalized,
 but certainly its distinctive characteristic is that it is being designed over
 the Internet by dozens of volunteers in a free, collaborative effort.

Engineers in general and CPU architects in particular won't easily recognize
 that an organizational model ultimately influences the resulting design of
 any large-scale engineering project: there is a widespread (mis)belief that
 the technical results are independent of human issues, guided only by rational
 choices based either on technical constraints or economic factors.

Another widespread myth is that good CPU architectures are the work of
 highly talented individuals working in total isolation, much like Seymour Cray
 designed the Cray-1 alone in the mountains. A slightly modified version of
 this myth states that keeping a small team of talented CS and EE engineers
 in isolation and under pressure will ultimately lead to revolutionary CPU architectures.

The F-CPU architecture is being developed within the framework of the Freedom
 Project, which is itself based on ideas borrowed from the Free Software movement.
 The cornerstone of the Free Software movement is the GNU/GPL (General Public
 License), a document that expresses in legal terms deep-rooted beliefs and
 ideals related to individual freedom in the Information Age. We cannot summarize
 the foundations of the Free Software movement in a few phrases, but we can
 at least mention one noteworthy aspect of Free Software projects: the source
 code is available for all to view, examine, understand and improve. When applied
 to a hardware project, this would translate into making the design and its
 implementation (in the form of VHDL, Verilog or any other hardware description
 language) as well as the masks corresponding to any particular IC process,
 freely available.

For the Freedom Project, not only will the final results of the development
 process be made freely available over the Internet, but the entire development
 process itself is free, in the sense that anyone can join and contribute by
 exchanging ideas, discussing design choices, and implementing them. 



The F-CPU architecture and its implementation(s), like a (collective) work
 of art, will be the expression in silicon of these beliefs and ideals, as they
 reflect on our organizational, development and communication methodologies.

The present document is also being written taking into account this pervasive
 freedom to think, learn and create that runs through the Freedom Project: it
 can be changed, updated, corrected and extended at any time. The information
 herein contained can and should always be improved. The Freedom Project defines
 the framework in which these improvements can take place.

One last remark: we are doing our best to separate the implementation-dependent
 parts of the CPU design from the architecture-dependent ones, <bf><em>when this logical
 separation makes sense</em></bf>. Sometimes the architecture and its implementation cannot
 be separated conceptually and/or in practice (well, perhaps they can, but we
 don't know how to do it - yet).
<sect>Mission statement
<p>

The mission of the Freedom Project is to develop a new high-performance
 64-bit CPU architecture, suitable for Personal Workstation use, and to implement
 this architecture in the available process technology. The new F-CPU architecture
 must support gcc and the Linux kernel (customizing these pieces of software
 is part of the Freedom Project). The implementation(s) must also conform to
 commodity PC technical requirements (e.g. an F-CPU mainboard has to fit in
 a standard PC case, with a standard PC keyboard connector, standard PCI bus
 connectors, etc.).

It is perhaps useful to mention development issues that are <bf>not</bf> part of
 the Freedom Project, e.g.:
<itemize>
<item>Develop a new microcontroller-type 8, 12, 14 or 16-bit microprocessor (e.g.
 PIC devices). Reason: outside the scope of the project.
<item>Develop yet another 32-bit RISC processor for low-power/low cost embedded
 applications (e.g. ARM). Reason: there is an infinite variety of such processors
 on the market now, with excellent characteristics.
<item>Develop new IC process technologies. Reasons: not feasible and outside
 the scope of the project. In fact we want to use exactly the most cost-effective
 technology available for each implementation of the F-CPU architecture.
<item>Build a foundry. Reasons: obvious.
<item>Develop a new compiler or OS kernel. Reasons: gcc and the Linux kernel
 nowadays constitute a standard. Developing a new compiler and/or kernel would
 mean throwing away years of development, and would present insurmountable difficulties.
<item>Develop new PC standards. Reasons: not feasible and not needed.
</itemize>

This mission statement pretty much delimits the performance envelope of
 the F1 implementation, and somehow also influences the final CPU architecture.
<sect>Terminology
<p>
<sect1>Data Size
<p>

Even though the F-CPU word size is 64 bits, we prefer to stick to the standard
 RISC notation used by Hennessy &amp; Patterson (H&amp;P). Since this is a matter
 of convention, we believe keeping with tradition is the better choice here.
<itemize>
<item>8 bits: byte.
<item>16 bits: half-word.
<item>32 bits: word.
<item>64 bits: double.
</itemize>
<sect1>Exceptions
<p>

The word &dquot;exception&dquot; is used to designate traps, interrupts, faults and exceptions.
 Exceptions can be divided into various classes, as described by H&amp;P.
<sect1>Context switch vs. Process switch
<p>

Many authors do not make a distinction between the expressions &dquot;context
 switch&dquot; and &dquot;process switch&dquot; (e.g. Tanenbaum or H&amp;P). However, in this document
 we shall use the definitions found in &lsqb;Stallings, William, Operating Systems,
 Macmillan 1992, p. 148&rsqb;. Quoting:
<quote>It is clear, then, that a context switch is a concept distinct from that
 of a process switch. A context switch may occur without changing the state
 of the process that is currently in the Running state. In that case, the context
 saving and subsequent restoral involve little overhead. However, if the currently
 running process is to be moved to another state (Ready, Blocked, etc.), then
 the operating system &lsqb;the kernel&rsqb; must make substantial changes in
 the environment.
</quote>
<quote>...
</quote>
<quote>Thus, the process switch, which involves a state change, requires considerably
 more effort than a context switch.
</quote>
<sect1>Jumps and Branches
<p>

Again, we'll stick to H&amp;P terminology: <bf><em>jump</em></bf> will be used when the PC
 is changed unconditionally and <bf><em>branch</em></bf> will be used when it is changed conditionally.
<sect>Notation
<p>

H&amp;P's notation conventions will be used throughout to define the F-CPU
 instruction set.
<sect>Early architectural choices
<p>
<sect1>Hard choices
<p>

Early in the development of the F-CPU architecture, some hard technical
 choices were made. It is likely that the F-CPU architecture will have the following
 features:
<enum>
<item>64-bit internal data paths (i.e. this is a 64-bit architecture).
<item>Standard Von Neumann CPU&lt;-&gt;memory architecture.
<item>64-bit address space.
<item>Virtual memory support.
<item>Kernel and user operating modes.
<item>Co-processor support (FP, Graphics, etc...) on a separate (internal) bus.
<item>Designed for Y2K-current process technologies (0.25 micron or smaller).
</enum>
<sect1>Firm choices
<p>

Some other architectural choices are being dictated by the desire to innovate
 and improve upon current CPU architecture design practices. These choices are
 more questionable compared to the hard choices listed in the previous subsection,
 and yet once a consensus is reached over these choices, the F-CPU architecture
 will be in many ways determined. As we write these lines, these firm choices
 are:
<enum>
<item>A memory-to-memory (m2m) / RISC mixed architecture (this is explained below).
<item>Both big-endian and little-endian modes supported (H&amp;P pages 73 and
 C-12).
<item>Memory windows with 32 user-visible virtual registers. The number of memory
 windows is implementation dependent, with a minimum of 2 windows. These memory
 windows can be thought of as virtual register banks. They are used to accelerate
 context switching (they are <bf>not</bf> used to accelerate procedure calls, as are
 register windows in the SPARC architecture).
<item>Three operand instructions, as per standard RISC practice.
<item>A 32-bit fixed length instruction format, also as per standard RISC practice.
<item>Separate I/O, coprocessor and memory addressing spaces. The I/O and coprocessor
 addressing spaces (16 bits each) only allow Moves. The I/O addressing space
 is protected (only accessible in supervisor mode). The coprocessor addressing
 space is user-visible.
<item>n-stage pipelined operation, where n is implementation dependent.
<item>Hardware integer multiply and divide.
<item>FP coprocessor implemented as coprocessor mapped Transport Triggered Architecture
 (TTA) logic (the presence of an FPU is implementation dependent).
<item>Block Move coprocessor also implemented as coprocessor mapped TTA logic
 (the presence of a Block Move coprocessor is implementation dependent).
<item>No condition codes, a la MIPS.
<item>Immediate operands are limited to 16-bit size (due to the fixed 32-bit
 instruction length).
<item>CPU Control Registers (CCRs) are mapped in the first half (32KB range)
 of the coprocessor address space.
</enum>
<sect1>Soft choices
<p>

Many ideas are still floating around on the various Freedom Project mailing
 lists.
<sect>DLX and the F-CPU
<p>

The DLX CPU architecture was designed by H&amp;P as a logical outcome of
 their research into RISC. Although it was never implemented exactly as described
 in their book Computer Architecture: A Quantitative Approach, it is so similar
 to the many commercial RISC implementations that at least a parallel can be drawn
 between the DLX projected performance and the actual performance of MIPS, SPARC, etc.

The similarity between the Freedom CPU architecture and the DLX hypothetical
 CPU is not an accident. Basing ourselves on RISC concepts is a natural consequence
 of the desire to obtain the best performance from a (conceptually) simple design.

In designing the F-CPU architecture, we also feel compelled to make
 use of GCC DLX machine descriptions, DLX simulation and emulation tools, and
 the vast amount of information freely available on the Web about DLX. If possible,
 we'll avoid reinventing the wheel.

However, the F-CPU design is right now so different from DLX that it cannot
 be called a DLX descendant.
<sect>Instruction Classes
<p>

We'll use the following broad instruction classes defined by H&amp;P for
 the DLX architecture:
<enum>
<item>Moves (H&amp;P defines this as load/store instructions). Since we specified
 that our coprocessors would be designed using TTA logic, our FP and block move
 instructions are included in this instruction class.
<item>ALU operations.
<item>Branches and Jumps.
<item>We'll add a fourth class for special instructions e.g. HALT, so: Special.
</enum>

In fact H&amp;P had defined a separate instruction class for FP, but we
 managed to fold that into the Moves instruction class (we are really pushing
 RISC philosophy, here: simplify, simplify). ;-)
<sect1>Move instructions
<p>

Move instructions support all data sizes up to and including 64 bits, as
 expected. Signed and unsigned byte, half-word and word are accounted for, as
 well as double. (128-bit moves?)

The F-CPU recognizes three distinct address spaces: I/O, coprocessor and
 memory.
<sect1>ALU instructions
<p>

Logical and arithmetic operations operate on three registers, or on two registers
 and a 16-bit immediate operand.
<itemize>
<item>Logical: AND, OR, XOR, NAND, NOR, NXOR.
<item>Arithmetic: ADD, SUB, MULT, DIV; signed and unsigned. MULT and DIV store
 their result in two consecutive virtual registers (unlike MIPS, but very much
 like SPARC).
<item>Shifts and Rotates: rotates as well as logical and arithmetic shifts, left
 and right, 1 to 64 bits. 
<item>Set conditional.
</itemize>
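
To illustrate the MULT convention above, here is a small Python sketch that splits
 the 128-bit unsigned product of two 64-bit operands into the two 64-bit halves that
 would land in consecutive virtual registers. Which register of the pair receives the
 high half is not settled in this document: the ordering below (high half first) and
 the function name are our own assumptions.

```python
MASK64 = (1 << 64) - 1  # a virtual register holds 64 bits


def mult_unsigned(a, b):
    """Return the (high, low) 64-bit halves of the unsigned product a * b."""
    product = (a & MASK64) * (b & MASK64)
    return (product >> 64) & MASK64, product & MASK64


# 2**63 * 4 does not fit in 64 bits, so the high half is non-zero.
high, low = mult_unsigned(2**63, 4)
```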
<sect1>Branches and Jumps
<p>

For lack of imagination, we adopted almost exactly the same control flow
 instructions as DLX:
<itemize>
<item>Branch on zero/not zero.
<item>Unconditional Jump.
<item>Jump and Link.
<item>SYS. System call exception (H&amp;P use TRAP). Switches to MW0, entering
 supervisor mode. Unlike DLX, this is not a vectored trap: the OS kernel is
 assumed to have a single entry point.
<item>RFE. Return from exception. Switches back to the previously active MW,
 restores user mode.
</itemize>
<sect1>Special
<p>
<itemize>
<item>HALT. Stops and powers down the CPU.
<item>STI. Enable maskable interrupts.
<item>CLI. Disable maskable interrupts.
<item>LCX. Locked compare and exchange. Compares and exchanges the contents of
 a register with the contents of a memory address, pointed to by another register
 plus 16-bit displacement.
</itemize>
<sect>Memory Windows and Virtual Registers
<p>

Easily the most controversial and misunderstood feature of the proposed F-CPU
 architecture to date is the choice of a memory-to-memory architecture. Nevertheless,
 it is simpler to think of the F-CPU in terms of a standard RISC machine - without
 registers!

&dquot;What? No registers? Is this a joke?&dquot; or &dquot;Well, if it's
 not a RISC CPU, it will perform like a dog.&dquot; are the typical remarks.
 Some people with more experience note that the TMS9900 microprocessor (one
 of the first 16-bit microprocessors) was a memory-to-memory machine, but they
 too observe that this was in the good old days when CPUs were slow and DRAM
 was fast. Today, we have exactly the opposite: CPUs are very fast and DRAM
 is slow.

We know all that. 

Now please turn to page C-0 of H&amp;P's Quantitative Approach. It says:

&dquot;RISC: any computer announced after 1985.&dquot;, quoting Steven
 Przybylski, a designer of the Stanford MIPS. Why do H&amp;P begin their description
 of various RISC CPUs with this humorous remark? Because it's true: you can't
 design a microprocessor architecture nowadays without taking into account the
 work done by the early RISC pioneers.

The term m2m/RISC was coined to describe the F-CPU architecture, so let's
 see how it differs from a standard RISC.
<sect1>Registers - what do we need them for?
<p>

The rationale behind the concept of registers is, as we know, that of a
 memory hierarchy. This concept is discussed extensively in H&amp;P, section
 1.7. Each level of storage has smaller sizes, faster access times, greater
 bandwidth and smaller latencies as we go from disk storage to main memory,
 to L3 cache (if any), to L2 cache (if any), to L1 cache (all modern CPUs have
 this), to registers (all modern CPUs have these).

The sole memory-to-memory microprocessor architecture known to the authors
 was the Texas Instruments TMS9900. The idea sounds counter-intuitive given
 H&amp;P's description of memory hierarchies in modern RISC CPUs.

However, the choice of an m2m/RISC architecture stems from a few observations
 of modern CPU designs:
<enum>
<item>On-chip L1 CPU caches can be dual-ported and still operate at full CPU
 speed. These can be as large as 64KB using 0.35 micron technology (e.g. the
 Cyrix 6x86MX, the Centaur C6 or the AMD K6), and probably larger using 0.25
 micron technology. 
<item>The number of registers available is an important architectural parameter,
 obeying sometimes contradictory requirements/limitations: 
<itemize>
<item>On the one hand, compiler writers would prefer to have as many registers
 as possible. In fact, they would even prefer to view registers and memory as
 the same. 
<item>Instruction set designers will limit the number of accessible registers
 to avoid using too many bits for register selection. 
<item>Operating systems designers will want to save as few registers as possible,
 to avoid expensive context-switches. 
<item>Hardware designers can implement as many registers as required, but the
 complexity of the design usually increases in direct proportion to the number
 of registers. 
<item>The SPARC architecture inaugurated the efficient use of register windows.
 However, SPARC like all RISC CPUs is a register-to-register machine. It has
 been proved that register windows can increase the efficiency of CISC designs
 as well as RISC ones.
</itemize>
</enum>

The design objectives of the Freedom CPU architecture with respect to its data path
 organization were: 
<itemize>
<item>Minimize latencies for memory accesses. 
<item>Maximize bandwidth for memory accesses. 
<item>Reduce &dquot;register pressure&dquot; in compilers, particularly gcc.
 
<item>Minimize context switch latency. 
<item>Present a simple, regular memory/registers addressing instruction encoding.
 
<item>Exploit modern cache design.
</itemize>
<sect1>Virtual registers
<p>

The proposed Freedom architecture has banks of virtual registers, which
 are used to shadow 256-byte regions of memory (earlier on we used the terms
 &dquot;to cache&dquot; or &dquot;to mirror&dquot;, but these led to confusion and are now deprecated).
 How does that work and why is it good? 

The idea is really very simple: each memory window (virtual register bank)
 has an associated 64-bit special register which defines the (virtual, as opposed
 to physical) base address in memory of a block of 32 (64-bit) words. We'll call
 this register MWBn, where 0 &lt;= n &lt;= (number of windows in the implementation - 1).
 The MWB registers are control registers, which means their value can only be
 changed in supervisor mode.

A single MW (memory window) is active at any time, meaning that a user
 program only ever sees 32 memory addresses that can be directly operated upon
 (other memory addresses must be accessed indirectly). We'll call the virtual
 registers in the active memory window VR0 - VR31. VR0 is always set to 0, following
 standard RISC tradition.

The active memory window is defined by the least significant bits in register
 AMW, which is one of the control registers.

MW0 is reserved for the operating system kernel. Switching between memory
 windows (we call this a context switch) is achieved by changing the value of
 AMW. Since AMW is one of the control registers, it is only accessible to the
 OS kernel.

Whenever MWBn is loaded with a new value (a new base address), a 256-byte
 block is brought in from memory into the corresponding 32 internal VRs, and
 the previous values in the VRs are simultaneously written back to the 256-byte
 block (memory window) previously pointed to. This we define as a process switch.
 Note that the OS kernel must also save paging information during a process
 switch. We'll see how this can be achieved.

After a process switch, the contents of the VRs are <bf>not</bf> maintained coherent
 with their corresponding memory addresses anymore.

So, as can be seen, each register set is in fact a &dquot;memory window&dquot;,
 in the sense that its contents reflect the state of a memory block.

How does the hardware work? Just like any RISC CPU. Are there additional
 delays to use any of the internal virtual registers? No. Is this slower in
 any way compared to a RISC? No. Does this add complexity to the data path,
 compared to a RISC? No.

What is gained, compared to a standard RISC? Mainly lower context switch
 latencies: assuming the OS wants to switch from one MW to another, all it takes
 is to load a new value in the corresponding bits in the AMW register. This
 can be done in a single clock cycle, and obviously has a very low latency,
 compared to saving/restoring 32 registers.
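
The mechanism described above can be modelled in a few lines of Python. This is
 only a sketch under our own assumptions: the class and method names are invented
 for illustration, memory is a sparse dictionary, VR0 is modelled as always reading
 zero, and a window whose MWB has never been loaded has nothing to write back.

```python
WORDS_PER_WINDOW = 32   # 32 virtual registers of 64 bits each, i.e. 256 bytes
WORD_BYTES = 8


class MemoryWindows:
    def __init__(self, n_windows=2):
        self.memory = {}                     # byte address -> 64-bit word (sparse)
        self.mwb = [None] * n_windows        # MWBn base addresses
        self.vr = [[0] * WORDS_PER_WINDOW for _ in range(n_windows)]
        self.amw = 0                         # active window; MW0 belongs to the kernel

    def load_mwb(self, n, base):
        """Process switch: write back the old block, then fetch the new one."""
        old = self.mwb[n]
        if old is not None:
            for i in range(WORDS_PER_WINDOW):
                self.memory[old + i * WORD_BYTES] = self.vr[n][i]
        self.mwb[n] = base
        self.vr[n] = [self.memory.get(base + i * WORD_BYTES, 0)
                      for i in range(WORDS_PER_WINDOW)]

    def switch_window(self, n):
        """Context switch: only AMW changes, no register is copied."""
        self.amw = n

    def read(self, r):
        return 0 if r == 0 else self.vr[self.amw][r]

    def write(self, r, value):
        if r != 0:                           # writes to VR0 are ignored
            self.vr[self.amw][r] = value & ((1 << 64) - 1)
```

In this model a context switch is a single assignment to <tt>amw</tt>, while a process
 switch moves 2 x 256 bytes; that difference is the whole point of the scheme.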

For the compiler, the F-CPU is very much like many RISC CPUs: 32 general
 purpose registers (in reality special memory locations).

From the software point of view, the F-CPU is a pure memory-to-memory machine.
 From the hardware point of view, the F-CPU is a standard RISC machine with
 (32 x n) registers, where n is the number of register sets (32 in the F1 CPU
 implementation) and each register set is made up of 32 registers. The F-CPU
 doesn't have a flags register. A mechanism similar to that of DLX/MIPS is used to
 test conditions.

The only conventional register is the 64-bit Program Counter register.
 Interestingly, during a context switch, the value of the PC is saved in VR0.
 Context switches can be achieved without any use of a stack to save the CPU
 state.
<sect>Addressing modes
<p>

We shall refer to the addressing modes as though they would use standard
 registers. On the other hand, one must keep in mind the fact that each virtual
 register shadows a memory location. In this sense, there is no register addressing!

Like H&amp;P, we establish a distinction between data addressing modes
 (which operate on data in the virtual registers and/or memory) and program
 flow (instruction) addressing modes (which always take as their destination
 operand the Program Counter).
<sect1>Data addressing modes
<p>
<sect2>Immediate
<p>

The 32-bit instruction format allows for 16-bit immediate operands (according
 to H&amp;P, 16-bit immediate data covers 75-80&percnt; of all uses of immediate
 data). This is basically similar to the DLX I-type instruction format, but
 some elements are re-arranged:

<itemize>
<item>The opcode occupies the 6 most significant bits.
<item>The destination virtual register occupies the following 5 bits.
<item>The source VR follows, with another 5 bits.
<item>The immediate operand comes last, with 16 bits.
</itemize>
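
The field layout just described can be written down as a pair of Python helpers.
 This is a sketch, not a definitive encoding: the function names and the opcode value
 used in the example are hypothetical, and we assume the immediate is treated as a
 signed (two's complement) quantity, which this document does not state.

```python
def encode_itype(opcode, dest, src, imm):
    """Pack opcode(6) | dest(5) | src(5) | imm(16) into a 32-bit word."""
    if not (0 <= opcode < 64 and 0 <= dest < 32 and 0 <= src < 32):
        raise ValueError("field out of range")
    return (opcode << 26) | (dest << 21) | (src << 16) | (imm & 0xFFFF)


def decode_itype(word):
    """Unpack a 32-bit word; the immediate is returned sign-extended."""
    imm = word & 0xFFFF
    if imm & 0x8000:
        imm -= 0x10000
    return (word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F, imm


# Hypothetical opcode 0x08, destination VR1, source VR2, immediate -4.
word = encode_itype(0x08, 1, 2, -4)
```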
<sect2>Register
<p>

These are really three-memory-operand instructions. The encoding is again
 similar to the DLX R-type instructions, but as with the I-type instructions,
 the destination register field comes next to the opcode field.
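
As a sketch of what such an encoding could look like, the Python helper below takes
 the DLX R-type fields (6-bit opcode, three 5-bit register fields, 11-bit function
 field) and simply moves the destination next to the opcode. The exact order of the
 two source fields and the use of the remaining 11 bits are our assumptions; the
 function name is hypothetical.

```python
def encode_rtype(opcode, dest, src1, src2, func=0):
    """Pack opcode(6) | dest(5) | src1(5) | src2(5) | func(11) into 32 bits."""
    for value, bits in ((opcode, 6), (dest, 5), (src1, 5), (src2, 5), (func, 11)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field out of range")
    return (opcode << 26) | (dest << 21) | (src1 << 16) | (src2 << 11) | func


# VR3 = VR1 op VR2, with a made-up function code of 0x20.
word = encode_rtype(0, 3, 1, 2, 0x20)
```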
<sect2>Displacement
<p>

Uses exactly the same format as the immediate addressing mode, but the
 16-bit field is used to hold the displacement. According to H&amp;P, 16-bit
 displacements capture 99&percnt; of all displacements.
<sect2>Register indirect or register deferred
<p>

Can be synthesized from the displacement addressing mode by setting the
 displacement field to zero.
<sect2>Direct or absolute
<p>

Can be synthesized from the displacement addressing mode by using VR0 as
 the base register. This addressing mode is only useful for the I/O and coprocessor
 address spaces (for OS kernel use).
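
The three modes above differ only in the values fed to the same address computation,
 as the following Python sketch shows. We assume here that the displacement is
 sign-extended and that VR0 reads as zero; the function names are ours.

```python
MASK64 = (1 << 64) - 1


def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(vr, base, disp16):
    """Displacement mode: address = VR[base] + sign-extended displacement."""
    base_value = 0 if base == 0 else vr[base]    # VR0 always reads as zero
    return (base_value + sign_extend16(disp16)) & MASK64


vr = [0] * 32
vr[4] = 0x1000
displacement = effective_address(vr, 4, 0xFFF8)      # VR4 - 8
register_indirect = effective_address(vr, 4, 0)      # displacement set to zero
absolute = effective_address(vr, 0, 0x0120)          # VR0 as the base register
```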
<sect1>Data addressing effectiveness
<p>

Obviously, other more sophisticated addressing modes can be synthesized
 from the basic register and displacement modes, with a slight performance penalty.
 According to H&amp;P, the basic addressing modes provided above cover from
 75&percnt; to 99&percnt; of the addressing modes in standard benchmark programs.
<sect1>Program flow addressing modes
<p>
<sect2>PC-relative
<p>

PC-relative Jumps and Branches are available with a 26-bit offset, plus
 2 bits since our instructions are aligned on 32-bit words. This 28-bit range
 provides ample program space (256MB) and ease of linking.
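
The arithmetic behind the 28-bit range can be sketched in Python as follows. We
 assume the 26-bit offset is signed and counted in instruction words relative to
 the address of the jump itself; neither point is fixed by this document, and the
 function names are ours.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pc_relative_target(pc, offset26):
    """New PC = PC + (signed 26-bit word offset * 4), wrapped to 64 bits."""
    return (pc + (sign_extend(offset26, 26) << 2)) & ((1 << 64) - 1)


REACH_BYTES = 1 << 28    # 26 offset bits + 2 alignment bits = 256MB in total
```

With a signed offset, the 256MB are split into roughly 128MB backwards and 128MB forwards.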
<sect2>Register indirect with displacement
<p>

The destination address is taken from one of the registers, and a 16-bit
 (+2) displacement is added, resulting in the new PC value.
<sect2>Register indirect
<p>

This is synthesized from the previous mode by setting the immediate field
 to zero.
<sect>Virtual memory: pure paging
<p>

The F-CPU architecture does not implement segmentation: pure paging is
 used, with 8KB pages. The MMU is particularly simple: a 64-entry TLB is used,
 as in MIPS.

In parallel with the fixed-size 8KB paging scheme, the F-CPU architecture
 makes use of an implementation dependent number (minimum of two) of variable-sized
 pages, ranging (in powers of 2) from 16KB up to the entire address space. These
 variable-sized pages can be used, for example, to map an entire video frame
 buffer in a single step, or to assign a fixed mapping to the OS kernel.

As in MIPS, a TLB miss will trigger an exception, which is handled by
 context switching to the kernel. With a 64-bit address space, it was judged
 that an inverted hash table would be more effectively implemented
 in software than in hardware, and would yield adequate performance.
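
The software-refilled TLB described above can be sketched in Python. Everything
 beyond the 8KB page size and the 64 entries is an assumption made for illustration:
 the names, the oldest-first replacement policy, and the plain dictionary standing
 in for the kernel's inverted hash table. Variable-sized pages are left out.

```python
PAGE_SHIFT = 13                      # 8KB pages
PAGE_SIZE = 1 << PAGE_SHIFT
TLB_ENTRIES = 64


class TLBMiss(Exception):
    """Stands in for the exception that transfers control to the kernel."""


class TLB:
    def __init__(self):
        self.entries = {}            # virtual page number -> physical frame number

    def translate(self, vaddr):
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & (PAGE_SIZE - 1)
        if vpn not in self.entries:
            raise TLBMiss(vpn)
        return (self.entries[vpn] << PAGE_SHIFT) | offset

    def refill(self, vpn, frame):
        """Called by the software miss handler once it has found the mapping."""
        if vpn not in self.entries and len(self.entries) >= TLB_ENTRIES:
            self.entries.pop(next(iter(self.entries)))   # evict the oldest entry
        self.entries[vpn] = frame


def access(tlb, page_table, vaddr):
    """Translate vaddr, refilling the TLB in software on a miss."""
    try:
        return tlb.translate(vaddr)
    except TLBMiss as miss:
        vpn = miss.args[0]
        tlb.refill(vpn, page_table[vpn])   # page_table plays the hash table's role
        return tlb.translate(vaddr)
```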
<sect>CPU control registers (CCRs)
<p>

Modern CPUs have a plethora of special registers, used to control various
 features such as performance monitoring, virtual memory, cache control, timestamp
 counters/timers, etc. The F-CPU architecture is no exception, but it innovates
 in that there is a single, standard mechanism for accessing such registers
 that is at once fast, secure and simple to implement. In the F-CPU architecture,
 all CPU control registers are mapped in the first half (32KB) of the special
 coprocessor address space. Moving a value between any CCR and any virtual register
 takes a single instruction which executes in a single clock cycle.

The CCRs are only accessible in supervisor mode. Trying to access a CCR
 in user mode results in an illegal instruction exception.
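
The protection rule can be summarized by the following Python sketch. We assume
 that &dquot;first half&dquot; means coprocessor addresses below 0x8000 in the 16-bit
 coprocessor address space; the names below are invented for the example.

```python
CCR_LIMIT = 0x8000          # first half of the 16-bit coprocessor address space


class IllegalInstruction(Exception):
    """Models the exception raised on a user-mode CCR access."""


def check_coprocessor_access(address, supervisor):
    """Classify a coprocessor-space access, enforcing the CCR protection rule."""
    if not 0 <= address <= 0xFFFF:
        raise ValueError("coprocessor addresses are 16 bits wide")
    if address < CCR_LIMIT and not supervisor:
        raise IllegalInstruction(hex(address))
    return "ccr" if address < CCR_LIMIT else "coprocessor"
```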
<sect>FP coprocessor TTA architecture
<p>

It was felt that Transport Triggered Architecture would provide a good
 conceptual base for the development of a highly parallelizable FPU architecture.
 Moving FP data from any virtual register to any TTA functional unit in the
 FP coprocessor address space takes a single instruction and a single clock
 cycle.
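
For readers unfamiliar with TTA, the principle is that an operation starts as a side
 effect of moving data to a functional unit's trigger port. The Python toy model below
 shows a single adder unit; the port names and the class are purely illustrative and
 say nothing about the actual F-CPU FP coprocessor layout.

```python
class TTAAdder:
    """One functional unit with an operand port, a trigger port and a result."""

    def __init__(self):
        self.operand = 0.0
        self.result = 0.0

    def move_to(self, port, value):
        if port == "operand":
            self.operand = value                  # only latches the value
        elif port == "trigger":
            self.result = self.operand + value    # the move itself starts the add
        else:
            raise KeyError(port)


fu = TTAAdder()
fu.move_to("operand", 1.5)    # first Move: nothing is computed yet
fu.move_to("trigger", 2.25)   # second Move: triggers the addition
```

A third Move would then bring <tt>fu.result</tt> back into a virtual register.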
<sect>Block Move coprocessor architecture
<p>

Block moves are frequent operations in modern personal workstations. Most
 CISC architectures implement block move instructions, but early RISC designs didn't.
 The F-CPU architecture does not specify a Block Move Coprocessor (BMC) as part
 of the instruction set, but instead as a general-purpose coprocessor that will
 handle lengthy block move operations, freeing the main processing unit for
 other tasks. As with the FP coprocessor, initiating a block move command (by
 transferring the command word from any of the virtual registers to the BMC)
 takes a single instruction and a single clock cycle.

</article>
