Simba

Simba is the codename for an Epiphany-like processor designed for multicore GPGPU.

Technical overview

The Simba processor is designed with 256 compute cores. Each core will have 256 KiB of IVRAM for a total of 64 MiB of fast, core-local memory. There will also be at least 64 MiB of slower, on-board EVRAM which is shareable between cores and the CPU.

IVRAM instead of cache

Each core's IVRAM is only accessible by the core that is paired to it. Dedicated DMA instructions are used to read/write data at volume. Additionally, the IVRAM is conceptually split into four 64 KiB quadrants with 16-bit addressing, which are dynamically purposed and repurposed by the core at will. One quadrant contains code, one contains the stack, and the other two are "general-purpose", with dedicated instructions for switching execution and moving the stack as desired by the running compute kernel.
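The quadrant layout can be pictured as a 2-bit quadrant number paired with a 16-bit offset. The sketch below assumes a flat 18-bit core-local byte address, which this page does not specify; the function names are hypothetical and only illustrate the arithmetic.

```python
QUADRANT_SIZE = 64 * 1024                     # 64 KiB per quadrant (from the text)
QUADRANT_COUNT = 4                            # four quadrants per core
IVRAM_SIZE = QUADRANT_SIZE * QUADRANT_COUNT   # 256 KiB per core


def split_ivram_address(flat_address):
    """Split a hypothetical flat IVRAM byte address into (quadrant, offset)."""
    if not 0 <= flat_address < IVRAM_SIZE:
        raise ValueError("address outside the 256 KiB IVRAM")
    # Upper 2 bits select the quadrant, lower 16 bits address within it.
    return flat_address >> 16, flat_address & 0xFFFF


def join_ivram_address(quadrant, offset):
    """Inverse of split_ivram_address."""
    if not 0 <= quadrant < QUADRANT_COUNT:
        raise ValueError("quadrant must be 0-3")
    if not 0 <= offset < QUADRANT_SIZE:
        raise ValueError("offset must fit in 16 bits")
    return (quadrant << 16) | offset
```

Under this assumption, address 0x2ABCD falls in quadrant 2 at offset 0xABCD.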

This architecture was chosen over an automatically managed cache because well-written code can exploit it to roughly double performance per joule.^[1] Such gains are not attainable with typical industry ISAs, which rely on MMUs, virtual memory, out-of-order execution, and long pipelines that demand sophisticated speculative-execution hardware.

Simba provides no MMU functions of any kind, no speculative execution, and no out-of-order execution. All of these were ruled out because they are inherently power-hungry. The silicon is, however, intended to provide fixed-function execution of common operations, including cryptographic functions.

ISA design

The main design focus of the ISA is extreme code density. A strictly fixed-width 16-bit instruction size was imposed as a hard requirement, because larger instruction sizes would not be the most efficient use of the limited IVRAM space.

This made it difficult to accommodate the now-typical 5-bit register indices (32 registers), a cost that is tripled once floating-point and vector register files are included. To conserve opcode space, each register file is conceptually divided into quadrants (the register group field G in the opcode map), and most instructions operate on source and destination registers within a single quadrant. Dedicated memory instructions can address all registers freely.
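To illustrate the saving: two full 5-bit register indices would cost 10 bits, whereas one shared 2-bit group plus two 3-bit in-group indices cost 8 bits, which matches the G, R_dst and R_src fields in the opcode map. The sketch below assumes a flat register number is simply group × 8 + in-group index; that mapping and the function names are assumptions, not something this page states.

```python
REGISTERS_PER_GROUP = 8   # 3-bit in-group register field
GROUP_COUNT = 4           # 2-bit group (quadrant) field


def split_register(index):
    """Split a flat register number 0-31 into (group, in-group index)."""
    if not 0 <= index < REGISTERS_PER_GROUP * GROUP_COUNT:
        raise ValueError("register index must be 0-31")
    return index >> 3, index & 0b111


def same_group(dst, src):
    """True if both registers can be named by one group-restricted instruction."""
    return split_register(dst)[0] == split_register(src)[0]
```

With this mapping, registers 8 and 15 share group 1, while registers 7 and 8 sit in different groups and would need an instruction that can address all registers.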

Opcode map

Mnemonic	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	Explanation
`AND`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	G		R_dst			R_src			R_dst = R_dst AND R_src
`XOR`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	`1`	G		R_dst			R_src			R_dst = R_dst XOR R_src
`NOT`	`0`	`0`	`0`	`0`	`0`	`0`	`1`	`0`	G		R_dst			R_src			R_dst = NOT R_src
`ORR`	`0`	`0`	`0`	`0`	`0`	`0`	`1`	`1`	G		R_dst			R_src			R_dst = R_dst OR R_src
`ANDN`	`0`	`0`	`0`	`0`	`0`	`1`	`0`	`0`	G		R_dst			R_src			R_dst = R_dst AND NOT R_src
`XNOR`	`0`	`0`	`0`	`0`	`0`	`1`	`0`	`1`	G		R_dst			R_src			R_dst = R_dst XOR NOT R_src
`PCT`	`0`	`0`	`0`	`0`	`0`	`1`	`1`	`0`	G		R_dst			R_src			R_dst = pop_count( R_src )
`NOR`	`0`	`0`	`0`	`0`	`0`	`1`	`1`	`1`	G		R_dst			R_src			R_dst = R_dst OR NOT R_src
`CLZ`	`0`	`0`	`0`	`0`	`1`	`0`	`0`	`0`	G		R_dst			R_src			R_dst = leading_zeroes( R_src )
`CTZ`	`0`	`0`	`0`	`0`	`1`	`0`	`0`	`1`	G		R_dst			R_src			R_dst = trailing_zeroes( R_src )
`PAKLO`	`0`	`0`	`0`	`0`	`1`	`0`	`1`	`0`	G		R_dst			R_src			R_dst = ( ( R_src AND 0xFFFFFFFF ) LSL 32 ) OR ( R_dst AND 0xFFFFFFFF )
`PAKHI`	`0`	`0`	`0`	`0`	`1`	`0`	`1`	`1`	G		R_dst			R_src			R_dst = ( ( ( R_src LSR 32 ) AND 0xFFFFFFFF ) LSL 32 ) OR ( ( R_dst LSR 32 ) AND 0xFFFFFFFF )
`PAKHLO`	`0`	`0`	`0`	`0`	`1`	`1`	`0`	`0`	G		R_dst			R_src			R_dst = ( ( R_src AND 0xFFFF ) LSL 16 ) OR ( R_dst AND 0xFFFF )
`PAKHHI`	`0`	`0`	`0`	`0`	`1`	`1`	`0`	`1`	G		R_dst			R_src			R_dst = ( ( ( R_src LSR 16 ) AND 0xFFFF ) LSL 16 ) OR ( ( R_dst LSR 16 ) AND 0xFFFF )
`PAKBLO`	`0`	`0`	`0`	`0`	`1`	`1`	`1`	`0`	G		R_dst			R_src			R_dst = ( ( R_src AND 255 ) LSL 8 ) OR ( R_dst AND 255 )
`PAKBHI`	`0`	`0`	`0`	`0`	`1`	`1`	`1`	`1`	G		R_dst			R_src			R_dst = ( ( ( R_src LSR 8 ) AND 255 ) LSL 8 ) OR ( ( R_dst LSR 8 ) AND 255 )
`SIGNX`	`0`	`0`	`0`	`1`	`0`	`0`	`0`	`0`	G		R_dst			R_src			R_dst = sign_ext_32to64( R_src )
`SIGNXH`	`0`	`0`	`0`	`1`	`0`	`0`	`0`	`1`	G		R_dst			R_src			R_dst = sign_ext_16to64( R_src )
`SIGNXB`	`0`	`0`	`0`	`1`	`0`	`0`	`1`	`0`	G		R_dst			R_src			R_dst = sign_ext_8to64( R_src )
`ASR`	`0`	`0`	`0`	`1`	`0`	`0`	`1`	`1`	G		R_dst			R_src			R_dst = R_dst ASR R_src
`LSL`	`0`	`0`	`0`	`1`	`0`	`1`	`0`	`0`	G		R_dst			R_src			R_dst = R_dst LSL R_src
`LSR`	`0`	`0`	`0`	`1`	`0`	`1`	`0`	`1`	G		R_dst			R_src			R_dst = R_dst LSR R_src
`ROL`	`0`	`0`	`0`	`1`	`0`	`1`	`1`	`0`	G		R_dst			R_src			R_dst = R_dst ROL R_src
`ROR`	`0`	`0`	`0`	`1`	`0`	`1`	`1`	`1`	G		R_dst			R_src			R_dst = R_dst ROR R_src
`GRV`	`0`	`0`	`0`	`1`	`1`	`0`	`0`	`0`	G		R_dst			R_src			R_dst = generalised_reverse( R_dst, R_src )
`SHF`	`0`	`0`	`0`	`1`	`1`	`0`	`0`	`1`	G		R_dst			R_src			R_dst = generalised_shuffle( R_dst, R_src )
`BEXT`	`0`	`0`	`0`	`1`	`1`	`0`	`1`	`0`	G		R_dst			R_src			R_dst = bit_extract( R_dst, R_src )
`BDEP`	`0`	`0`	`0`	`1`	`1`	`0`	`1`	`1`	G		R_dst			R_src			R_dst = bit_deposit( R_dst, R_src )
`ADD`	`0`	`0`	`0`	`1`	`1`	`1`	`0`	`0`	G		R_dst			R_src			R_dst = R_dst + R_src
`SUB`	`0`	`0`	`0`	`1`	`1`	`1`	`0`	`1`	G		R_dst			R_src			R_dst = R_dst – R_src
`MUL`	`0`	`0`	`0`	`1`	`1`	`1`	`1`	`0`	G		R_dst			R_src			R_dst = R_dst • R_src
`DIV`	`0`	`0`	`0`	`1`	`1`	`1`	`1`	`1`	G		R_dst			R_src			R_dst = R_dst ÷ R_src
`MOV`	`0`	`0`	`1`	`0`	S_dst	G_dst		R_dst			S_src	G_src		R_src			R_dst = R_src
`LDRD`	`0`	`0`	`1`	`1`	`0`	`0`	`0`	`0`	G		R_dst			R_src			R_dst = [ R_src ]
`LDR`	`0`	`0`	`1`	`1`	`0`	`0`	`0`	`1`	G		R_dst			R_src			R_dst = [ R_src ] AND 0xFFFFFFFF
`LDRH`	`0`	`0`	`1`	`1`	`0`	`0`	`1`	`0`	G		R_dst			R_src			R_dst = [ R_src ] AND 0xFFFF
`LDRB`	`0`	`0`	`1`	`1`	`0`	`0`	`1`	`1`	G		R_dst			R_src			R_dst = [ R_src ] AND 255
`LDRQ`	`0`	`0`	`1`	`1`	`0`	`1`	`0`	`0`	G		F_dst			R_src			F_dst = [ R_src ] (quad-precision)
`LDRD`	`0`	`0`	`1`	`1`	`0`	`1`	`0`	`1`	G		F_dst			R_src			F_dst = [ R_src ] (double-precision)
`LDR`	`0`	`0`	`1`	`1`	`0`	`1`	`1`	`0`	G		F_dst			R_src			F_dst = [ R_src ] (single-precision)
`LDRH`	`0`	`0`	`1`	`1`	`0`	`1`	`1`	`1`	G		F_dst			R_src			F_dst = [ R_src ] (half-precision)
`STRD`	`0`	`0`	`1`	`1`	`1`	`0`	`0`	`0`	G		R_dst			R_src			[ R_dst ] = R_src
`STR`	`0`	`0`	`1`	`1`	`1`	`0`	`0`	`1`	G		R_dst			R_src			[ R_dst ] = R_src AND 0xFFFFFFFF
`STRH`	`0`	`0`	`1`	`1`	`1`	`0`	`1`	`0`	G		R_dst			R_src			[ R_dst ] = R_src AND 0xFFFF
`STRB`	`0`	`0`	`1`	`1`	`1`	`0`	`1`	`1`	G		R_dst			R_src			[ R_dst ] = R_src AND 255
`STRQ`	`0`	`0`	`1`	`1`	`1`	`1`	`0`	`0`	G		R_dst			F_src			[ R_dst ] = F_src (quad-precision)
`STRD`	`0`	`0`	`1`	`1`	`1`	`1`	`0`	`1`	G		R_dst			F_src			[ R_dst ] = F_src (double-precision)
`STR`	`0`	`0`	`1`	`1`	`1`	`1`	`1`	`0`	G		R_dst			F_src			[ R_dst ] = F_src (single-precision)
`STRH`	`0`	`0`	`1`	`1`	`1`	`1`	`1`	`1`	G		R_dst			F_src			[ R_dst ] = F_src (half-precision)
`HALT`	`0`	`1`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	Halt core execution
`B{cond}`	`0`	`1`	`0`	`0`	cond			imm₉ *									Branch on `{cond}` to signed PC-relative offset of ( imm₉ + 1 ) • 4 (imm₉ cannot be zero)
`LDA`	`0`	`1`	`0`	`1`	imm₇							G		R_dst			Load the 7-bit immediate imm₇ into R_dst
`WFLO`	`0`	`1`	`1`	`0`	`0`	flag₁₁											Wait until synchronization flag flag₁₁ changes to be LOW
`WFHI`	`0`	`1`	`1`	`0`	`1`	flag₁₁											Wait until synchronization flag flag₁₁ changes to be HIGH
`CF`	`0`	`1`	`1`	`1`	`0`	flag₁₁											Clear synchronization flag flag₁₁
`SF`	`0`	`1`	`1`	`1`	`1`	flag₁₁											Set synchronization flag flag₁₁
`ADD`	`1`	`0`	`0`	`0`	`0`	P				G	F_dst			F_src			F_dst = F_dst + F_src
`SUB`	`1`	`0`	`0`	`0`	`1`	P				G	F_dst			F_src			F_dst = F_dst – F_src
`MUL`	`1`	`0`	`0`	`1`	`0`	P				G	F_dst			F_src			F_dst = F_dst • F_src
`DIV`	`1`	`0`	`0`	`1`	`1`	P				G	F_dst			F_src			F_dst = F_dst ÷ F_src
`SQRT`	`1`	`0`	`1`	`0`	`0`	P				G	F_dst			F_src			F_dst = square_root( F_src )
`CBRT`	`1`	`0`	`1`	`0`	`1`	P				G	F_dst			F_src			F_dst = cube_root( F_src )
`CMP`	`1`	`0`	`1`	`1`	`0`	P				G	F_dst			F_src			Compare F_dst to F_src
`MOD`	`1`	`0`	`1`	`1`	`1`	P				G	F_dst			F_src			F_dst = F_dst % F_src
`MIN`	`1`	`1`	`0`	`0`	`0`	P				G	F_dst			F_src			F_dst = min( F_dst, F_src )
`MAX`	`1`	`1`	`0`	`0`	`1`	P				G	F_dst			F_src			F_dst = max( F_dst, F_src )
`FTOI`	`1`	`1`	`0`	`1`	`0`	P				G	R_dst			F_src			R_dst = F_src (converted)
`ITOF`	`1`	`1`	`0`	`1`	`1`	P				G	F_dst			R_src			F_dst = R_src (converted)
`NEG`	`1`	`1`	`1`	`0`	`0`	P				G	F_dst			F_src			F_dst = - F_src
`POW`	`1`	`1`	`1`	`0`	`1`	P				G	F_dst			F_src			F_dst = F_dst ^ F_src
`CMPC`	`1`	`1`	`1`	`1`	`0`	P				G	R_dst			F_src			R_dst = compare_class( F_src )
`JUMP`	`1`	`1`	`1`	`1`	`1`	`0`	`0`	`0`	`0`	Q		`0`	`0`	`0`	`0`	`0`	Jump code execution to IVRAM quadrant Q
`STDMA`	`1`	`1`	`1`	`1`	`1`	`0`	`0`	`0`	`1`	Q		G		R_dst*			Start DMA transfer from IVRAM quadrant Q into [ R_dst • 0x1000 ] (R_dst cannot be R0)
`STACK`	`1`	`1`	`1`	`1`	`1`	`0`	`0`	`1`	`0`	Q		`0`	`0`	`0`	`0`	`0`	Set program stack location to IVRAM quadrant Q
`LDDMA`	`1`	`1`	`1`	`1`	`1`	`0`	`0`	`1`	`1`	Q		G		R_src*			Start DMA transfer from [ R_src • 0x1000 ] into IVRAM quadrant Q (R_src cannot be R0)
`POP`	`1`	`1`	`1`	`1`	`1`	`1`	`0`	`0`	G		R_start			R_offs			Pop a range of one or more registers, from R_start to R_offs (relative, exclusive).
`PUSH`	`1`	`1`	`1`	`1`	`1`	`1`	`0`	`1`	G		R_start			R_offs			Push a range of one or more registers, from R_start to R_offs (relative, exclusive).
`MOD`	`1`	`1`	`1`	`1`	`1`	`1`	`1`	`0`	G		R_dst			R_src			R_dst = R_dst % R_src
`CMP`	`1`	`1`	`1`	`1`	`1`	`1`	`1`	`1`	G		R_dst			R_src			Compare R_dst to R_src
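As a rough illustration of how the two-operand integer rows pack into a 16-bit word, the sketch below encodes and decodes a few of them. It assumes that column 0 of the opcode map is the most significant bit, which the table does not state; the helper names are hypothetical and only a handful of opcodes are listed.

```python
# 8-bit opcodes copied from columns 0-7 of the opcode map.
INTEGER_OPCODES = {
    "AND": 0b00000000,
    "XOR": 0b00000001,
    "ADD": 0b00011100,
    "SUB": 0b00011101,
    "CMP": 0b11111111,
}


def encode_integer_op(mnemonic, group, r_dst, r_src):
    """Pack opcode (8 bits), G (2 bits), R_dst (3 bits), R_src (3 bits)."""
    opcode = INTEGER_OPCODES[mnemonic]
    if not 0 <= group < 4:
        raise ValueError("group must be 0-3")
    if not (0 <= r_dst < 8 and 0 <= r_src < 8):
        raise ValueError("in-group register indices must be 0-7")
    # Assumption: column 0 is bit 15 of the instruction word.
    return (opcode << 8) | (group << 6) | (r_dst << 3) | r_src


def decode_integer_op(word):
    """Return (opcode, group, r_dst, r_src) from a 16-bit word."""
    return (word >> 8) & 0xFF, (word >> 6) & 0b11, (word >> 3) & 0b111, word & 0b111
```

Under that bit-order assumption, `encode_integer_op("ADD", 1, 2, 5)` yields 0x1C55.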

ISA variables

Variable	Explanation
R	Integer register
F	Floating-point register
G	Register group
S	Register state (`0`=integer, `1`=floating-point)
P	Precision selector (4 bits; see Precision values)
Q	Quadrant (0-3)

Conditional values

#	Explanation
`0`	Always
`1`	Less than or equal to (signed)
`2`	Greater than (signed)
`3`	Less than or equal to (unsigned)
`4`	Greater than (unsigned)
`5`	Equal to
`6`	Not equal to
`7`	Overflow
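The sketch below shows one way the `B{cond}` row could be encoded using the condition numbers above. It assumes column 0 of the opcode map is the most significant bit and that imm₉ is a two's-complement value; neither is stated on this page, and the condition names are hypothetical labels. Under these assumptions, a branch with condition `0` and imm₉ of zero would produce the same bit pattern as `HALT`, which may be why imm₉ cannot be zero.

```python
# Hypothetical names for the condition numbers in the table above.
CONDITIONS = {
    "always": 0, "le_signed": 1, "gt_signed": 2, "le_unsigned": 3,
    "gt_unsigned": 4, "eq": 5, "ne": 6, "overflow": 7,
}

HALT_WORD = 0b0100_0000_0000_0000   # HALT row of the opcode map


def encode_branch(cond, imm9):
    """Pack opcode 0100, a 3-bit condition and a signed 9-bit immediate."""
    if not 0 <= cond < 8:
        raise ValueError("cond must be 0-7")
    if not -256 <= imm9 <= 255:
        raise ValueError("imm9 must fit in 9 signed bits")
    if imm9 == 0:
        raise ValueError("imm9 cannot be zero")
    return (0b0100 << 12) | (cond << 9) | (imm9 & 0x1FF)


def branch_offset(word):
    """PC-relative offset as given in the table: ( imm9 + 1 ) * 4."""
    raw = word & 0x1FF
    signed = raw - 0x200 if raw & 0x100 else raw   # assumed two's complement
    return (signed + 1) * 4
```

For example, `encode_branch(CONDITIONS["eq"], 3)` gives 0x4A03, and `branch_offset(0x4A03)` gives 16.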

Precision values

#	Explanation
`0`	Quad precision (128-bit)
`1`	Double precision, bits 0-63
`2`	Double precision, bits 64-127
`3`	Single precision, bits 0-31
`4`	Single precision, bits 32-63
`5`	Single precision, bits 64-95
`6`	Single precision, bits 96-127
`7`	Half precision, bits 0-15
`8`	Half precision, bits 16-31
`9`	Half precision, bits 32-47
`10`	Half precision, bits 48-63
`11`	Half precision, bits 64-79
`12`	Half precision, bits 80-95
`13`	Half precision, bits 96-111
`14`	Half precision, bits 112-127
`15`	Full 128-bit, as an integer
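The table above follows a regular pattern: each selector names a lane width and a bit position inside a 128-bit register. The sketch below decodes a selector into that pair and extracts the raw lane bits from a register value held as a Python integer. The function names and the label strings are hypothetical, and no floating-point interpretation of the bits is attempted.

```python
def decode_precision(p):
    """Map a precision value 0-15 to (kind, width_in_bits, low_bit)."""
    if p == 0:
        return ("quad", 128, 0)
    if 1 <= p <= 2:
        return ("double", 64, (p - 1) * 64)
    if 3 <= p <= 6:
        return ("single", 32, (p - 3) * 32)
    if 7 <= p <= 14:
        return ("half", 16, (p - 7) * 16)
    if p == 15:
        return ("integer", 128, 0)
    raise ValueError("precision value must be 0-15")


def extract_lane(register_value, p):
    """Return the raw bits of the lane selected by p from a 128-bit value."""
    _, width, low_bit = decode_precision(p)
    return (register_value >> low_bit) & ((1 << width) - 1)
```

For instance, selector `5` decodes to a 32-bit lane starting at bit 64, and selector `14` to a 16-bit lane starting at bit 112.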

References

[1] Epiphany-V performance benchmarks. Adapteva’s 1,024-core Epiphany V mega-chip packs a serious wallop. PCWorld. Retrieved 30 November 2021.
