Simba

From XionKB

Simba is the codename for an Epiphany-like processor designed for multicore GPGPU.

Technical overview

The Simba processor is designed with 256 compute cores. Each core will have 256 KiB of IVRAM for a total of 64 MiB of fast, core-local memory. There will also be at least 64 MiB of slower, on-board EVRAM which is shareable between cores and the CPU.

IVRAM instead of cache

Each core's IVRAM is accessible only by the core paired with it. Dedicated DMA instructions are used to move data in and out in bulk. Additionally, the IVRAM is conceptually split into four 64 KiB quadrants with 16-bit addressing, which the core can purpose and repurpose dynamically. One quadrant contains code, one contains the stack, and the other two are "general-purpose". Dedicated instructions switch execution and move the stack between quadrants as the running compute kernel requires.
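As a rough illustration of the quadrant layout, the sketch below splits a flat byte offset into a core's 256 KiB IVRAM into a quadrant number and a 16-bit address within that quadrant. The flat-offset view and the function names are assumptions made for illustration only; the design itself only states that there are four 64 KiB quadrants with 16-bit addressing.

```python
from typing import Tuple

QUADRANT_SIZE = 64 * 1024   # 64 KiB per quadrant, addressable with 16 bits
QUADRANT_COUNT = 4          # 4 x 64 KiB = 256 KiB of IVRAM per core


def split_ivram_offset(offset: int) -> Tuple[int, int]:
    """Split a flat IVRAM byte offset into (quadrant, 16-bit address).

    The flat offset is a hypothetical convenience, not part of the ISA.
    """
    if not 0 <= offset < QUADRANT_SIZE * QUADRANT_COUNT:
        raise ValueError("offset outside the 256 KiB IVRAM")
    return offset >> 16, offset & 0xFFFF


def join_ivram_offset(quadrant: int, address: int) -> int:
    """Inverse of split_ivram_offset."""
    if not 0 <= quadrant < QUADRANT_COUNT:
        raise ValueError("quadrant must be 0-3")
    if not 0 <= address < QUADRANT_SIZE:
        raise ValueError("address must fit in 16 bits")
    return (quadrant << 16) | address
```

For example, under these assumptions the flat offset 0x2ABCD falls in quadrant 2 at address 0xABCD.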

This architecture was chosen over an automatically managed cache because well-written code can exploit it to achieve roughly twice the performance per joule.[1] Such gains are not attainable with typical industry ISAs, which rely on MMUs, virtual memory, out-of-order execution, and long pipelines that demand sophisticated speculative execution modules.

Simba does not provide MMU functions of any kind, nor does it provide speculative or out-of-order execution. All of these were ruled out because they are inherently power-hungry. The silicon is, however, intended to provide fixed-function execution of common operations, including cryptographic functions.

ISA design

The main design focus of the ISA is extreme code density. A strictly fixed-width 16-bit instruction size was imposed as a hard requirement, because larger instructions would make less efficient use of the limited IVRAM space.

This made it challenging to accommodate the now-typical 5-bit register file indices (32 registers), a cost incurred three times over once floating-point and vector registers are included. To avoid exhausting the ISA's encoding space, register files are conceptually divided into quadrants, and most instructions operate on source and destination registers within the same quadrant. Dedicated memory instructions can address all registers freely.
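The quadrant scheme can be sketched as follows. This assumes that the 32 registers are split into four groups of eight, and that the group is the upper two bits of the 5-bit register index; neither detail is stated explicitly in the text, and the function names are hypothetical.

```python
from typing import Tuple

REGISTER_COUNT = 32     # 5-bit register index
REGS_PER_GROUP = 8      # assumed: 32 registers / 4 quadrants


def split_register(index: int) -> Tuple[int, int]:
    """Return (group, slot) for a register index in 0..31.

    Assumes the group is the upper 2 bits and the slot the lower 3 bits.
    """
    if not 0 <= index < REGISTER_COUNT:
        raise ValueError("register index must be 0-31")
    return index >> 3, index & 0b111


def same_group(dst: int, src: int) -> bool:
    """True if both registers can be named by one group field."""
    return split_register(dst)[0] == split_register(src)[0]
```

Under these assumptions an instruction needs 2 + 3 + 3 = 8 operand bits instead of 5 + 5 = 10, at the cost of requiring both operands to share a quadrant.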

Opcode map

Mnemonic 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Explanation
AND 0 0 0 0 0 0 0 0 G Rdst Rsrc Rdst = Rdst AND Rsrc
XOR 0 0 0 0 0 0 0 1 G Rdst Rsrc Rdst = Rdst XOR Rsrc
NOT 0 0 0 0 0 0 1 0 G Rdst Rsrc Rdst = NOT Rsrc
ORR 0 0 0 0 0 0 1 1 G Rdst Rsrc Rdst = Rdst OR Rsrc
ANDN 0 0 0 0 0 1 0 0 G Rdst Rsrc Rdst = Rdst AND NOT Rsrc
XNOR 0 0 0 0 0 1 0 1 G Rdst Rsrc Rdst = Rdst XOR NOT Rsrc
PCT 0 0 0 0 0 1 1 0 G Rdst Rsrc Rdst = pop_count( Rsrc )
NOR 0 0 0 0 0 1 1 1 G Rdst Rsrc Rdst = Rdst OR NOT Rsrc
CLZ 0 0 0 0 1 0 0 0 G Rdst Rsrc Rdst = leading_zeroes( Rsrc )
CTZ 0 0 0 0 1 0 0 1 G Rdst Rsrc Rdst = trailing_zeroes( Rsrc )
PAKLO 0 0 0 0 1 0 1 0 G Rdst Rsrc Rdst = ( ( Rsrc AND 0xFFFFFFFF ) LSL 32 ) OR ( Rdst AND 0xFFFFFFFF )
PAKHI 0 0 0 0 1 0 1 1 G Rdst Rsrc Rdst = ( ( ( Rsrc LSR 32 ) AND 0xFFFFFFFF ) LSL 32 ) OR ( ( Rdst LSR 32 ) AND 0xFFFFFFFF )
PAKHLO 0 0 0 0 1 1 0 0 G Rdst Rsrc Rdst = ( ( Rsrc AND 0xFFFF ) LSL 16 ) OR ( Rdst AND 0xFFFF )
PAKHHI 0 0 0 0 1 1 0 1 G Rdst Rsrc Rdst = ( ( ( Rsrc LSR 16 ) AND 0xFFFF ) LSL 16 ) OR ( ( Rdst LSR 16 ) AND 0xFFFF )
PAKBLO 0 0 0 0 1 1 1 0 G Rdst Rsrc Rdst = ( ( Rsrc AND 255 ) LSL 8 ) OR ( Rdst AND 255 )
PAKBHI 0 0 0 0 1 1 1 1 G Rdst Rsrc Rdst = ( ( ( Rsrc LSR 8 ) AND 255 ) LSL 8 ) OR ( ( Rdst LSR 8 ) AND 255 )
SIGNX 0 0 0 1 0 0 0 0 G Rdst Rsrc Rdst = sign_ext_32to64( Rsrc )
SIGNXH 0 0 0 1 0 0 0 1 G Rdst Rsrc Rdst = sign_ext_16to64( Rsrc )
SIGNXB 0 0 0 1 0 0 1 0 G Rdst Rsrc Rdst = sign_ext_8to64( Rsrc )
ASR 0 0 0 1 0 0 1 1 G Rdst Rsrc Rdst = Rdst ASR Rsrc
LSL 0 0 0 1 0 1 0 0 G Rdst Rsrc Rdst = Rdst LSL Rsrc
LSR 0 0 0 1 0 1 0 1 G Rdst Rsrc Rdst = Rdst LSR Rsrc
ROL 0 0 0 1 0 1 1 0 G Rdst Rsrc Rdst = Rdst ROL Rsrc
ROR 0 0 0 1 0 1 1 1 G Rdst Rsrc Rdst = Rdst ROR Rsrc
GRV 0 0 0 1 1 0 0 0 G Rdst Rsrc Rdst = generalised_reverse( Rdst, Rsrc )
SHF 0 0 0 1 1 0 0 1 G Rdst Rsrc Rdst = generalised_shuffle( Rdst, Rsrc )
BEXT 0 0 0 1 1 0 1 0 G Rdst Rsrc Rdst = bit_extract( Rdst, Rsrc )
BDEP 0 0 0 1 1 0 1 1 G Rdst Rsrc Rdst = bit_deposit( Rdst, Rsrc )
ADD 0 0 0 1 1 1 0 0 G Rdst Rsrc Rdst = Rdst + Rsrc
SUB 0 0 0 1 1 1 0 1 G Rdst Rsrc Rdst = Rdst − Rsrc
MUL 0 0 0 1 1 1 1 0 G Rdst Rsrc Rdst = Rdst × Rsrc
DIV 0 0 0 1 1 1 1 1 G Rdst Rsrc Rdst = Rdst ÷ Rsrc
MOV 0 0 1 0 Sdst Gdst Rdst Ssrc Gsrc Rsrc Rdst = Rsrc
LDRD 0 0 1 1 0 0 0 0 G Rdst Rsrc Rdst = [ Rsrc ]
LDR 0 0 1 1 0 0 0 1 G Rdst Rsrc Rdst = [ Rsrc ] AND 0xFFFFFFFF
LDRH 0 0 1 1 0 0 1 0 G Rdst Rsrc Rdst = [ Rsrc ] AND 0xFFFF
LDRB 0 0 1 1 0 0 1 1 G Rdst Rsrc Rdst = [ Rsrc ] AND 255
LDRQ 0 0 1 1 0 1 0 0 G Fdst Rsrc Fdst = [ Rsrc ] (quad-precision)
LDRD 0 0 1 1 0 1 0 1 G Fdst Rsrc Fdst = [ Rsrc ] (double-precision)
LDR 0 0 1 1 0 1 1 0 G Fdst Rsrc Fdst = [ Rsrc ] (single-precision)
LDRH 0 0 1 1 0 1 1 1 G Fdst Rsrc Fdst = [ Rsrc ] (half-precision)
STRD 0 0 1 1 1 0 0 0 G Rdst Rsrc [ Rdst ] = Rsrc
STR 0 0 1 1 1 0 0 1 G Rdst Rsrc [ Rdst ] = Rsrc AND 0xFFFFFFFF
STRH 0 0 1 1 1 0 1 0 G Rdst Rsrc [ Rdst ] = Rsrc AND 0xFFFF
STRB 0 0 1 1 1 0 1 1 G Rdst Rsrc [ Rdst ] = Rsrc AND 255
STRQ 0 0 1 1 1 1 0 0 G Rdst Fsrc [ Rdst ] = Fsrc (quad-precision)
STRD 0 0 1 1 1 1 0 1 G Rdst Fsrc [ Rdst ] = Fsrc (double-precision)
STR 0 0 1 1 1 1 1 0 G Rdst Fsrc [ Rdst ] = Fsrc (single-precision)
STRH 0 0 1 1 1 1 1 1 G Rdst Fsrc [ Rdst ] = Fsrc (half-precision)
HALT 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Halt core execution
B{cond} 0 1 0 0 cond imm9 * Branch on {cond} to signed PC-relative offset of ( imm9 + 1 ) • 4 (imm9 cannot be zero)
LDA 0 1 0 1 imm7 G Rdst Load 7-bit immediate number of imm7 into Rdst
WFLO 0 1 1 0 0 flag11 Wait until synchronization flag flag11 changes to be LOW
WFHI 0 1 1 0 1 flag11 Wait until synchronization flag flag11 changes to be HIGH
CF 0 1 1 1 0 flag11 Clear synchronization flag flag11
SF 0 1 1 1 1 flag11 Set synchronization flag flag11
ADD 1 0 0 0 0 P G Fdst Fsrc Fdst = Fdst + Fsrc
SUB 1 0 0 0 1 P G Fdst Fsrc Fdst = Fdst − Fsrc
MUL 1 0 0 1 0 P G Fdst Fsrc Fdst = Fdst × Fsrc
DIV 1 0 0 1 1 P G Fdst Fsrc Fdst = Fdst ÷ Fsrc
SQRT 1 0 1 0 0 P G Fdst Fsrc Fdst = square_root( Fsrc )
CBRT 1 0 1 0 1 P G Fdst Fsrc Fdst = cube_root( Fsrc )
CMP 1 0 1 1 0 P G Fdst Fsrc Compare Fdst to Fsrc
MOD 1 0 1 1 1 P G Fdst Fsrc Fdst = Fdst % Fsrc
MIN 1 1 0 0 0 P G Fdst Fsrc Fdst = min( Fdst, Fsrc )
MAX 1 1 0 0 1 P G Fdst Fsrc Fdst = max( Fdst, Fsrc )
FTOI 1 1 0 1 0 P G Fdst Fsrc Rdst = Fsrc (converted)
ITOF 1 1 0 1 1 P G Fdst Fsrc Fdst = Rsrc (converted)
NEG 1 1 1 0 0 P G Fdst Fsrc Fdst = - Fsrc
POW 1 1 1 0 1 P G Fdst Fsrc Fdst = pow( Fdst, Fsrc )
CMPC 1 1 1 1 0 P G Fdst Fsrc Rdst = compare_class( Fsrc )
JUMP 1 1 1 1 1 0 0 0 0 Q 0 0 0 0 0 Jump code execution to IVRAM quadrant Q
STDMA 1 1 1 1 1 0 0 0 1 Q G Rdst* Start DMA transfer from IVRAM quadrant Q into [ Rdst • 0x1000 ] (Rdst cannot be R0)
STACK 1 1 1 1 1 0 0 1 0 Q 0 0 0 0 0 Set program stack location to IVRAM quadrant Q
LDDMA 1 1 1 1 1 0 0 1 1 Q G Rsrc* Start DMA transfer from [ Rsrc • 0x1000 ] into IVRAM quadrant Q (Rsrc cannot be R0)
POP 1 1 1 1 1 1 0 0 G Rstart Roffs Pop a range of one or more registers, from Rstart to Roffs (relative, exclusive).
PUSH 1 1 1 1 1 1 0 1 G Rstart Roffs Push a range of one or more registers, from Rstart to Roffs (relative, exclusive).
MOD 1 1 1 1 1 1 1 0 G Rdst Rsrc Rdst = Rdst % Rsrc
CMP 1 1 1 1 1 1 1 1 G Rdst Rsrc Compare Rdst to Rsrc
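To make the opcode map more concrete, here is a minimal encoder sketch for three of the formats above. It assumes that column 0 of the table is the most significant bit of the 16-bit word, and that G is 2 bits, R is 3 bits, and S is 1 bit; those widths are inferred from the formats summing to 16 bits and are not stated in the table. All function names are hypothetical.

```python
def _field(value: int, bits: int, name: str) -> int:
    """Validate that value fits in an unsigned field of the given width."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")
    return value


def encode_alu(opcode8: int, group: int, rdst: int, rsrc: int) -> int:
    """Integer ALU format: opcode (8) | G (2) | Rdst (3) | Rsrc (3)."""
    return (
        (_field(opcode8, 8, "opcode") << 8)
        | (_field(group, 2, "G") << 6)
        | (_field(rdst, 3, "Rdst") << 3)
        | _field(rsrc, 3, "Rsrc")
    )


def encode_mov(sdst: int, gdst: int, rdst: int,
               ssrc: int, gsrc: int, rsrc: int) -> int:
    """MOV format: 0010 | Sdst | Gdst | Rdst | Ssrc | Gsrc | Rsrc."""
    return (
        (0b0010 << 12)
        | (_field(sdst, 1, "Sdst") << 11)
        | (_field(gdst, 2, "Gdst") << 9)
        | (_field(rdst, 3, "Rdst") << 6)
        | (_field(ssrc, 1, "Ssrc") << 5)
        | (_field(gsrc, 2, "Gsrc") << 3)
        | _field(rsrc, 3, "Rsrc")
    )


def encode_lda(imm7: int, group: int, rdst: int) -> int:
    """LDA format: 0101 | imm7 | G | Rdst."""
    return (
        (0b0101 << 12)
        | (_field(imm7, 7, "imm7") << 5)
        | (_field(group, 2, "G") << 3)
        | _field(rdst, 3, "Rdst")
    )


OP_ADD = 0b00011100   # integer ADD row of the opcode map
HALT = 0b0100000000000000
```

With these assumptions, an integer ADD on group 1 with Rdst = 2 and Rsrc = 3 would encode as `encode_alu(OP_ADD, 1, 2, 3)`. MOV is the only format sketched here that carries a separate group for source and destination.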

ISA variables

Variable Explanation
R Integer register
F Floating-point register
G Register group
S Register state (0=integer or floating-point)
P Precision (0=single, 1=double, 2=half, 3=quad)
Q Quadrant (0-3)

Conditional values

# Explanation
0 Always
1 Less than or equal to (signed)
2 Greater than (signed)
3 Less than or equal to (unsigned)
4 Greater than (unsigned)
5 Equal to
6 Not equal to
7 Overflow
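The condition codes above feed the B{cond} instruction. The following sketch encodes a branch and recovers its offset using the ( imm9 + 1 ) • 4 rule from the opcode map. It assumes imm9 is a 9-bit two's-complement value and that column 0 of the opcode map is the most significant bit; the condition names in the dictionary are hypothetical labels, not official mnemonics.

```python
# Hypothetical labels for the eight condition values listed above.
COND = {
    "always": 0,
    "le_signed": 1,
    "gt_signed": 2,
    "le_unsigned": 3,
    "gt_unsigned": 4,
    "eq": 5,
    "ne": 6,
    "overflow": 7,
}


def encode_branch(cond: str, imm9: int) -> int:
    """B{cond} format: 0100 | cond (3) | imm9 (9)."""
    if imm9 == 0:
        # imm9 = 0 is reserved; with cond = 0 the word would be HALT.
        raise ValueError("imm9 cannot be zero")
    if not -256 <= imm9 <= 255:
        raise ValueError("imm9 must fit in 9 signed bits")
    return (0b0100 << 12) | (COND[cond] << 9) | (imm9 & 0x1FF)


def branch_offset(word: int) -> int:
    """Return the signed PC-relative offset encoded in a branch word."""
    imm9 = word & 0x1FF
    if imm9 & 0x100:          # sign-extend the 9-bit field (assumed signed)
        imm9 -= 0x200
    return (imm9 + 1) * 4
```

For instance, `encode_branch("eq", 3)` yields a word whose offset decodes to 16. Which instruction the offset is measured from is not specified, so the sketch only computes the offset itself.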

Precision values

# Explanation
0 Quad precision (128-bit)
1 Double precision, bits 0-63
2 Double precision, bits 64-127
3 Single precision, bits 0-31
4 Single precision, bits 32-63
5 Single precision, bits 64-95
6 Single precision, bits 96-127
7 Half precision, bits 0-15
8 Half precision, bits 16-31
9 Half precision, bits 32-47
10 Half precision, bits 48-63
11 Half precision, bits 64-79
12 Half precision, bits 80-95
13 Half precision, bits 96-111
14 Half precision, bits 112-127
15 Full 128-bit, as an integer
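The precision table is regular enough to decode arithmetically: each narrower format selects one lane of the 128-bit register. The sketch below maps a precision value to a format name and an inclusive bit range. It follows the 16-entry table directly; the function name and the format labels are illustrative assumptions.

```python
from typing import Tuple


def decode_precision(value: int) -> Tuple[str, int, int]:
    """Map a precision value (0-15) to (format, low_bit, high_bit)."""
    if value == 0:
        return ("quad", 0, 127)
    if 1 <= value <= 2:
        lane = value - 1                 # two 64-bit lanes
        return ("double", lane * 64, lane * 64 + 63)
    if 3 <= value <= 6:
        lane = value - 3                 # four 32-bit lanes
        return ("single", lane * 32, lane * 32 + 31)
    if 7 <= value <= 14:
        lane = value - 7                 # eight 16-bit lanes
        return ("half", lane * 16, lane * 16 + 15)
    if value == 15:
        return ("integer", 0, 127)       # full 128 bits, as an integer
    raise ValueError("precision value must be 0-15")
```

For example, value 10 selects the fourth half-precision lane, bits 48-63, matching the table.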

References

  1. Epiphany-V performance benchmarks. Adapteva’s 1,024-core Epiphany V mega-chip packs a serious wallop. PCWorld. Retrieved 30 November 2021.