Simba
Simba is the codename for an Epiphany-like processor designed for multicore GPGPU.
Technical overview
The Simba processor is designed with 256 compute cores. Each core will have 256 KiB of IVRAM, for a total of 64 MiB of fast, core-local memory. There will also be at least 64 MiB of slower, on-board EVRAM that can be shared between the cores and the CPU.
IVRAM instead of cache
Each core's IVRAM is accessible only by the core it is paired with. Dedicated DMA instructions are used to read and write data in bulk. Additionally, the IVRAM is conceptually split into four 64 KiB quadrants with 16-bit addressing, which the core can assign and reassign at will. One quadrant contains code, one contains the stack, and the other two are "general-purpose". Dedicated instructions let the running compute kernel switch execution to another quadrant and move the stack as desired.
This architecture was chosen over an automatically managed cache system because well-written code can exploit it to roughly double its performance per joule.[1] Such gains are not physically possible with typical industry ISAs that use MMUs, virtual memory, out-of-order execution, and long pipelines that demand sophisticated speculative execution modules.
Simba does not provide MMU functions of any kind, nor does it provide speculative or out-of-order execution. All of these were ruled out as inherently power-hungry. The silicon is, however, intended to provide fixed-function execution of common operations, including cryptographic functions.
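The memory figures and quadrant layout described above can be sketched in a few lines of Python. This is only an illustration: the source does not define a flat IVRAM address, so the 18-bit byte address used here (quadrant number in the top two bits, 16-bit offset below) and the helper names `split_ivram_address` and `join_ivram_address` are assumptions made for the example.

```python
CORE_COUNT = 256                   # compute cores, from the text
QUADRANT_COUNT = 4                 # quadrants per core, from the text
QUADRANT_SIZE = 64 * 1024          # 64 KiB per quadrant, from the text
OFFSET_BITS = 16                   # 16-bit addressing inside a quadrant
IVRAM_SIZE = QUADRANT_COUNT * QUADRANT_SIZE   # 256 KiB per core
TOTAL_IVRAM = CORE_COUNT * IVRAM_SIZE         # 64 MiB across the chip


def split_ivram_address(addr):
    """Split a hypothetical flat IVRAM byte address into (quadrant, offset)."""
    if not 0 <= addr < IVRAM_SIZE:
        raise ValueError(f"address {addr:#x} is outside the 256 KiB IVRAM")
    return addr >> OFFSET_BITS, addr & (QUADRANT_SIZE - 1)


def join_ivram_address(quadrant, offset):
    """Inverse of split_ivram_address, under the same assumed layout."""
    if not 0 <= quadrant < QUADRANT_COUNT:
        raise ValueError("quadrant must be in the range 0-3")
    if not 0 <= offset < QUADRANT_SIZE:
        raise ValueError("offset must fit in 16 bits")
    return (quadrant << OFFSET_BITS) | offset


if __name__ == "__main__":
    print(TOTAL_IVRAM // (1024 * 1024), "MiB of IVRAM in total")
    print(split_ivram_address(0x2ABCD))
```

Because every offset fits in 16 bits, a pointer into the current quadrant needs no more than a half-word of storage, which is consistent with the emphasis on compactness elsewhere in the design.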
ISA design
The main design focus of the ISA is extreme code density. A strictly fixed-width 16-bit instruction size was imposed as a hard requirement, because larger instructions would make less efficient use of the limited IVRAM space.
This made it challenging to accommodate the now-typical 5-bit register file indices (32 registers), a cost incurred thrice over once floating-point and vector registers are counted. To conserve ISA encoding space, register files are conceptually divided into quadrants, and most instructions operate on source and destination registers within a single given quadrant. Dedicated memory instructions can address all registers freely.
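The register quadrant scheme can be illustrated with a short Python sketch. It assumes that the four quadrants are contiguous blocks of eight registers, so that a 5-bit register number splits into a 2-bit group and a 3-bit index. The source does not state this layout explicitly, and the helper names are hypothetical.

```python
REGISTERS_PER_FILE = 32
GROUP_COUNT = 4
REGISTERS_PER_GROUP = REGISTERS_PER_FILE // GROUP_COUNT  # 8 registers


def split_register(index):
    """Return (group, index_in_group) for a 5-bit register number.

    Assumes groups are contiguous: registers 0-7 form group 0,
    8-15 form group 1, and so on.
    """
    if not 0 <= index < REGISTERS_PER_FILE:
        raise ValueError("register number must be in the range 0-31")
    return divmod(index, REGISTERS_PER_GROUP)


def same_group(dst, src):
    """True if both registers can be named by one shared group field."""
    return split_register(dst)[0] == split_register(src)[0]


if __name__ == "__main__":
    print(split_register(13))   # group and index of register 13
    print(same_group(8, 15))    # both in the second block of eight
    print(same_group(7, 8))     # straddles a group boundary
```

Under this assumed split, a two-operand instruction spends 2 + 3 + 3 = 8 bits on register selection instead of 5 + 5 = 10, at the cost of requiring both operands to sit in the same group.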
Opcode map
Mnemonic | Encoding (bits 0-15, left to right) | Explanation |
---|---|---|
AND | 0 0 0 0 0 0 0 0 G Rdst Rsrc | Rdst = Rdst AND Rsrc |
XOR | 0 0 0 0 0 0 0 1 G Rdst Rsrc | Rdst = Rdst XOR Rsrc |
NOT | 0 0 0 0 0 0 1 0 G Rdst Rsrc | Rdst = NOT Rsrc |
ORR | 0 0 0 0 0 0 1 1 G Rdst Rsrc | Rdst = Rdst OR Rsrc |
ANDN | 0 0 0 0 0 1 0 0 G Rdst Rsrc | Rdst = Rdst AND NOT Rsrc |
XNOR | 0 0 0 0 0 1 0 1 G Rdst Rsrc | Rdst = Rdst XOR NOT Rsrc |
PCT | 0 0 0 0 0 1 1 0 G Rdst Rsrc | Rdst = pop_count( Rsrc ) |
NOR | 0 0 0 0 0 1 1 1 G Rdst Rsrc | Rdst = Rdst OR NOT Rsrc |
CLZ | 0 0 0 0 1 0 0 0 G Rdst Rsrc | Rdst = leading_zeroes( Rsrc ) |
CTZ | 0 0 0 0 1 0 0 1 G Rdst Rsrc | Rdst = trailing_zeroes( Rsrc ) |
PAKLO | 0 0 0 0 1 0 1 0 G Rdst Rsrc | Rdst = ( ( Rsrc AND 0xFFFFFFFF ) LSL 32 ) OR ( Rdst AND 0xFFFFFFFF ) |
PAKHI | 0 0 0 0 1 0 1 1 G Rdst Rsrc | Rdst = ( ( ( Rsrc LSR 32 ) AND 0xFFFFFFFF ) LSL 32 ) OR ( ( Rdst LSR 32 ) AND 0xFFFFFFFF ) |
PAKHLO | 0 0 0 0 1 1 0 0 G Rdst Rsrc | Rdst = ( ( Rsrc AND 0xFFFF ) LSL 16 ) OR ( Rdst AND 0xFFFF ) |
PAKHHI | 0 0 0 0 1 1 0 1 G Rdst Rsrc | Rdst = ( ( ( Rsrc LSR 16 ) AND 0xFFFF ) LSL 16 ) OR ( ( Rdst LSR 16 ) AND 0xFFFF ) |
PAKBLO | 0 0 0 0 1 1 1 0 G Rdst Rsrc | Rdst = ( ( Rsrc AND 255 ) LSL 8 ) OR ( Rdst AND 255 ) |
PAKBHI | 0 0 0 0 1 1 1 1 G Rdst Rsrc | Rdst = ( ( ( Rsrc LSR 8 ) AND 255 ) LSL 8 ) OR ( ( Rdst LSR 8 ) AND 255 ) |
SIGNX | 0 0 0 1 0 0 0 0 G Rdst Rsrc | Rdst = sign_ext_32to64( Rsrc ) |
SIGNXH | 0 0 0 1 0 0 0 1 G Rdst Rsrc | Rdst = sign_ext_16to64( Rsrc ) |
SIGNXB | 0 0 0 1 0 0 1 0 G Rdst Rsrc | Rdst = sign_ext_8to64( Rsrc ) |
ASR | 0 0 0 1 0 0 1 1 G Rdst Rsrc | Rdst = Rdst ASR Rsrc |
LSL | 0 0 0 1 0 1 0 0 G Rdst Rsrc | Rdst = Rdst LSL Rsrc |
LSR | 0 0 0 1 0 1 0 1 G Rdst Rsrc | Rdst = Rdst LSR Rsrc |
ROL | 0 0 0 1 0 1 1 0 G Rdst Rsrc | Rdst = Rdst ROL Rsrc |
ROR | 0 0 0 1 0 1 1 1 G Rdst Rsrc | Rdst = Rdst ROR Rsrc |
GRV | 0 0 0 1 1 0 0 0 G Rdst Rsrc | Rdst = generalised_reverse( Rdst, Rsrc ) |
SHF | 0 0 0 1 1 0 0 1 G Rdst Rsrc | Rdst = generalised_shuffle( Rdst, Rsrc ) |
BEXT | 0 0 0 1 1 0 1 0 G Rdst Rsrc | Rdst = bit_extract( Rdst, Rsrc ) |
BDEP | 0 0 0 1 1 0 1 1 G Rdst Rsrc | Rdst = bit_deposit( Rdst, Rsrc ) |
ADD | 0 0 0 1 1 1 0 0 G Rdst Rsrc | Rdst = Rdst + Rsrc |
SUB | 0 0 0 1 1 1 0 1 G Rdst Rsrc | Rdst = Rdst - Rsrc |
MUL | 0 0 0 1 1 1 1 0 G Rdst Rsrc | Rdst = Rdst • Rsrc |
DIV | 0 0 0 1 1 1 1 1 G Rdst Rsrc | Rdst = Rdst ÷ Rsrc |
MOV | 0 0 1 0 Sdst Gdst Rdst Ssrc Gsrc Rsrc | Rdst = Rsrc |
LDRD | 0 0 1 1 0 0 0 0 G Rdst Rsrc | Rdst = [ Rsrc ] |
LDR | 0 0 1 1 0 0 0 1 G Rdst Rsrc | Rdst = [ Rsrc ] AND 0xFFFFFFFF |
LDRH | 0 0 1 1 0 0 1 0 G Rdst Rsrc | Rdst = [ Rsrc ] AND 0xFFFF |
LDRB | 0 0 1 1 0 0 1 1 G Rdst Rsrc | Rdst = [ Rsrc ] AND 255 |
LDRQ | 0 0 1 1 0 1 0 0 G Fdst Rsrc | Fdst = [ Rsrc ] (quad-precision) |
LDRD | 0 0 1 1 0 1 0 1 G Fdst Rsrc | Fdst = [ Rsrc ] (double-precision) |
LDR | 0 0 1 1 0 1 1 0 G Fdst Rsrc | Fdst = [ Rsrc ] (single-precision) |
LDRH | 0 0 1 1 0 1 1 1 G Fdst Rsrc | Fdst = [ Rsrc ] (half-precision) |
STRD | 0 0 1 1 1 0 0 0 G Rdst Rsrc | [ Rdst ] = Rsrc |
STR | 0 0 1 1 1 0 0 1 G Rdst Rsrc | [ Rdst ] = Rsrc AND 0xFFFFFFFF |
STRH | 0 0 1 1 1 0 1 0 G Rdst Rsrc | [ Rdst ] = Rsrc AND 0xFFFF |
STRB | 0 0 1 1 1 0 1 1 G Rdst Rsrc | [ Rdst ] = Rsrc AND 255 |
STRQ | 0 0 1 1 1 1 0 0 G Rdst Fsrc | [ Rdst ] = Fsrc (quad-precision) |
STRD | 0 0 1 1 1 1 0 1 G Rdst Fsrc | [ Rdst ] = Fsrc (double-precision) |
STR | 0 0 1 1 1 1 1 0 G Rdst Fsrc | [ Rdst ] = Fsrc (single-precision) |
STRH | 0 0 1 1 1 1 1 1 G Rdst Fsrc | [ Rdst ] = Fsrc (half-precision) |
HALT | 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | Halt core execution |
B{cond} | 0 1 0 0 cond imm9* | Branch on {cond} to signed PC-relative offset of ( imm9 + 1 ) • 4 (*imm9 cannot be zero) |
LDA | 0 1 0 1 imm7 G Rdst | Load the 7-bit immediate imm7 into Rdst |
WFLO | 0 1 1 0 0 flag11 | Wait until synchronization flag flag11 changes to LOW |
WFHI | 0 1 1 0 1 flag11 | Wait until synchronization flag flag11 changes to HIGH |
CF | 0 1 1 1 0 flag11 | Clear synchronization flag flag11 |
SF | 0 1 1 1 1 flag11 | Set synchronization flag flag11 |
ADD | 1 0 0 0 0 P G Fdst Fsrc | Fdst = Fdst + Fsrc |
SUB | 1 0 0 0 1 P G Fdst Fsrc | Fdst = Fdst - Fsrc |
MUL | 1 0 0 1 0 P G Fdst Fsrc | Fdst = Fdst • Fsrc |
DIV | 1 0 0 1 1 P G Fdst Fsrc | Fdst = Fdst ÷ Fsrc |
SQRT | 1 0 1 0 0 P G Fdst Fsrc | Fdst = square_root( Fsrc ) |
CBRT | 1 0 1 0 1 P G Fdst Fsrc | Fdst = cube_root( Fsrc ) |
CMP | 1 0 1 1 0 P G Fdst Fsrc | Compare Fdst to Fsrc |
MOD | 1 0 1 1 1 P G Fdst Fsrc | Fdst = Fdst % Fsrc |
MIN | 1 1 0 0 0 P G Fdst Fsrc | Fdst = min( Fdst, Fsrc ) |
MAX | 1 1 0 0 1 P G Fdst Fsrc | Fdst = max( Fdst, Fsrc ) |
FTOI | 1 1 0 1 0 P G Fdst Fsrc | Rdst = Fsrc (converted) |
ITOF | 1 1 0 1 1 P G Fdst Fsrc | Fdst = Rsrc (converted) |
NEG | 1 1 1 0 0 P G Fdst Fsrc | Fdst = - Fsrc |
POW | 1 1 1 0 1 P G Fdst Fsrc | Fdst = pow( Fdst, Fsrc ) |
CMPC | 1 1 1 1 0 P G Fdst Fsrc | Rdst = compare_class( Fsrc ) |
JUMP | 1 1 1 1 1 0 0 0 0 Q 0 0 0 0 0 | Jump code execution to IVRAM quadrant Q |
STDMA | 1 1 1 1 1 0 0 0 1 Q G Rdst* | Start DMA transfer from IVRAM quadrant Q into [ Rdst • 0x1000 ] (*Rdst cannot be R0) |
STACK | 1 1 1 1 1 0 0 1 0 Q 0 0 0 0 0 | Set program stack location to IVRAM quadrant Q |
LDDMA | 1 1 1 1 1 0 0 1 1 Q G Rsrc* | Start DMA transfer from [ Rsrc • 0x1000 ] into IVRAM quadrant Q (*Rsrc cannot be R0) |
POP | 1 1 1 1 1 1 0 0 G Rstart Roffs | Pop a range of one or more registers, from Rstart to Roffs (relative, exclusive) |
PUSH | 1 1 1 1 1 1 0 1 G Rstart Roffs | Push a range of one or more registers, from Rstart to Roffs (relative, exclusive) |
MOD | 1 1 1 1 1 1 1 0 G Rdst Rsrc | Rdst = Rdst % Rsrc |
CMP | 1 1 1 1 1 1 1 1 G Rdst Rsrc | Compare Rdst to Rsrc |
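As a rough illustration of how a two-operand integer row of the opcode map packs into a 16-bit word, the following Python sketch encodes and decodes a handful of instructions. Several things here are assumptions rather than facts from the source: that table column 0 is the most significant bit of the word, that G is 2 bits wide, and that Rdst and Rsrc are 3 bits each (widths inferred from the 16-bit total). The function names are hypothetical.

```python
# 8-bit opcodes copied from the opcode map (columns 0-7).
INTEGER_OPCODES = {
    "AND": 0b00000000,
    "XOR": 0b00000001,
    "ADD": 0b00011100,
    "SUB": 0b00011101,
    "MUL": 0b00011110,
    "DIV": 0b00011111,
}
MNEMONICS = {code: name for name, code in INTEGER_OPCODES.items()}


def encode_rr(mnemonic, group, rdst, rsrc):
    """Pack opcode(8) | G(2) | Rdst(3) | Rsrc(3) into one 16-bit word.

    Field widths and bit order are assumptions, see the text above.
    """
    opcode = INTEGER_OPCODES[mnemonic]
    if not 0 <= group < 4:
        raise ValueError("group must be in the range 0-3")
    if not (0 <= rdst < 8 and 0 <= rsrc < 8):
        raise ValueError("register indices must be in the range 0-7")
    return (opcode << 8) | (group << 6) | (rdst << 3) | rsrc


def decode_rr(word):
    """Unpack a word produced by encode_rr into its four fields."""
    opcode = (word >> 8) & 0xFF
    if opcode not in MNEMONICS:
        raise ValueError(f"opcode {opcode:#04x} is not in this sketch")
    return MNEMONICS[opcode], (word >> 6) & 0b11, (word >> 3) & 0b111, word & 0b111


if __name__ == "__main__":
    word = encode_rr("ADD", group=1, rdst=2, rsrc=5)
    print(f"{word:#06x}", decode_rr(word))
```

Only six mnemonics are listed to keep the sketch short; the remaining register-to-register rows with the same field layout would be added to `INTEGER_OPCODES` in the same way.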
ISA variables
Variable | Explanation |
---|---|
R | Integer register |
F | Floating-point register |
G | Register group |
S | Register state (0=integer, 1=floating-point) |
P | Precision (0=single, 1=double, 2=half, 3=quad) |
Q | Quadrant (0-3) |
Conditional values
# | Explanation |
---|---|
0 | Always |
1 | Less than or equal to (signed) |
2 | Greater than (signed) |
3 | Less than or equal to (unsigned) |
4 | Greater than (unsigned) |
5 | Equal to |
6 | Not equal to |
7 | Overflow |
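The B{cond} row of the opcode map and the condition codes above can be combined into a small decoder. The sketch below is illustrative only and assumes that table column 0 is the most significant bit, that cond occupies 3 bits, and that imm9 is a two's-complement value; none of these details are stated outright in the source. It also shows why imm9 must not be zero: with cond 0, that bit pattern is the HALT encoding.

```python
CONDITIONS = {
    0: "always",
    1: "less than or equal to (signed)",
    2: "greater than (signed)",
    3: "less than or equal to (unsigned)",
    4: "greater than (unsigned)",
    5: "equal to",
    6: "not equal to",
    7: "overflow",
}


def decode_branch(word):
    """Decode 0100 | cond(3) | imm9(9); return (condition, offset) or None.

    None is returned when imm9 is zero, which the opcode map reserves
    (0x4000, i.e. cond 0 with imm9 0, is HALT).
    """
    if (word >> 12) & 0xF != 0b0100:
        raise ValueError("not a B{cond} or HALT encoding")
    cond = (word >> 9) & 0b111
    imm9 = word & 0x1FF
    if imm9 == 0:
        return None
    # Assumed two's-complement interpretation of the 9-bit immediate.
    signed = imm9 - 0x200 if imm9 & 0x100 else imm9
    return CONDITIONS[cond], (signed + 1) * 4


if __name__ == "__main__":
    print(decode_branch(0x4000))                     # reserved / HALT
    print(decode_branch(0x4000 | (5 << 9) | 3))      # branch if equal
    print(decode_branch(0x4000 | (6 << 9) | 0x1FE))  # backward branch
```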
Precision values
# | Explanation |
---|---|
0 | Quad precision (128-bit) |
1 | Double precision, bits 0-63 |
2 | Double precision, bits 64-127 |
3 | Single precision, bits 0-31 |
4 | Single precision, bits 32-63 |
5 | Single precision, bits 64-95 |
6 | Single precision, bits 96-127 |
7 | Half precision, bits 0-15 |
8 | Half precision, bits 16-31 |
9 | Half precision, bits 32-47 |
10 | Half precision, bits 48-63 |
11 | Half precision, bits 64-79 |
12 | Half precision, bits 80-95 |
13 | Half precision, bits 96-111 |
14 | Half precision, bits 112-127 |
15 | Full 128-bit, as an integer |
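The precision values above follow a regular pattern: each value selects a lane of a given width at a given bit offset within a 128-bit register. A minimal Python sketch of that mapping follows, treating the register as a plain 128-bit integer. The function names are hypothetical, and the sketch only extracts raw bits; it does not interpret them as floating-point numbers.

```python
def lane_for(p):
    """Return (kind, low_bit, width) for a precision value 0-15."""
    if p == 0:
        return ("quad", 0, 128)
    if 1 <= p <= 2:
        return ("double", (p - 1) * 64, 64)
    if 3 <= p <= 6:
        return ("single", (p - 3) * 32, 32)
    if 7 <= p <= 14:
        return ("half", (p - 7) * 16, 16)
    if p == 15:
        return ("integer", 0, 128)
    raise ValueError("precision value must be in the range 0-15")


def extract_lane(register, p):
    """Return the raw bits of the lane selected by p from a 128-bit value."""
    if not 0 <= register < (1 << 128):
        raise ValueError("register value must fit in 128 bits")
    _, low_bit, width = lane_for(p)
    return (register >> low_bit) & ((1 << width) - 1)


if __name__ == "__main__":
    reg = 0x11112222333344445555666677778888
    print(lane_for(5))                   # single precision, bits 64-95
    print(hex(extract_lane(reg, 4)))     # single precision, bits 32-63
    print(hex(extract_lane(reg, 2)))     # double precision, bits 64-127
```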
References
- [1] Epiphany-V performance benchmarks. Adapteva's 1,024-core Epiphany V mega-chip packs a serious wallop. PCWorld. Retrieved 30 November 2021.