The Doomchip ‘on-ice’

Designing hardware for 1992-1993 retro classics

Watch the talk recording here

 

Sylvain Lefebvre

A strange message

Mission statement

  • Design a portable console

  • Specialized graphics hardware

  • Early 1990s retro classics

  • On a Lattice iCE40 UP5K FPGA

  • Doom-Comanche crossover

Doom (1993)

by id Software

Comanche (1992)

by NovaLogic

 

Doom source ports

source ports

Micro-controller

Doom on lamp MCU
by Nicola Wrachien

Doom nRF5340
by Audun Wilhelmsen

More than fast enough (80+ MHz)
Memory usage is a primary concern
Uses external SPIflash memory (RO)

Not a source port

  • We’re tasked with creating a specialized GPU

    • Total re-creation
  • Adding a terrain

    • Means fully revising the rendering approach
  • Btw, a notable Doom re-creation

    • Frederic Souchu’s PooM on the Pico8

Design constraints

Lattice iCE40 UP5K

Great open-source FPGA toolchain support

Our target

 

Board

  • 16MB SPIflash

FPGA

  • 5280 logic cells
  • 128KB SPRAM

IceBreaker by @1bitsquared

Logic cells

  • Each cell:
    • a lookup table (LUT)
    • a flip-flop (FF)
  • LUT: given four inputs, sets the output
    • Configured at startup to ‘emulate’ a logic gate
  • FF: Holds output until clock ticks
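
As a rough software picture of one logic cell (a sketch only: the struct and function names are made up for illustration, and real iCE40 cells have extra features such as carry logic), the LUT can be seen as a 16-bit truth table indexed by the four inputs, and the flip-flop as a value that only changes on a clock tick:

```c
#include <stdint.h>

// One logic cell: a 4-input LUT (16-bit truth table) feeding a flip-flop.
// Names are hypothetical, chosen for this illustration.
typedef struct {
  uint16_t lut_config; // bit i = output for input pattern i (set at startup)
  uint8_t  ff;         // registered output, updated on clock ticks only
} logic_cell;

// Combinational part: the four inputs select one bit of the truth table.
static inline uint8_t lut4(uint16_t config, uint8_t in4) {
  return (uint8_t)((config >> (in4 & 15)) & 1);
}

// Sequential part: on a clock tick, the flip-flop captures the LUT output.
static inline void cell_clock(logic_cell *c, uint8_t in4) {
  c->ff = lut4(c->lut_config, in4);
}
```

With this encoding, a configuration of 0x8000 behaves like a 4-input AND gate and 0x6996 like a 4-input XOR.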

5K grid

Hardware description

  • Verilog / VHDL

    • HDL file 🡆 Yosys 🡆 NextPNR 🡆 OpenFPGALoader 🡆 FPGA
  algorithm main(output uint8 leds) {
    uint28 counter(0);
    always {
      leds    = counter[20,8];
      counter = counter + 1;
    }
  }
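
For readers more at home in C, here is an approximate software model of the design above, one function call per clock tick. It is a sketch: the struct and function names are invented, and it assumes the bit-select counter[20,8] means "8 bits starting at bit 20".

```c
#include <stdint.h>

// Software model of the LED counter; 'counter' only uses its low 28 bits.
typedef struct { uint32_t counter; uint8_t leds; } blinky;

// One clock tick: the LEDs show the top 8 bits of the 28-bit counter,
// so the visible pattern changes once every 2^20 ticks.
static inline void blinky_tick(blinky *s) {
  s->leds    = (uint8_t)((s->counter >> 20) & 0xFF); // counter[20,8]
  s->counter = (s->counter + 1) & 0x0FFFFFFFu;       // wrap at 28 bits
}
```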

Is 5K a lot?

  • Let’s compare to MiSTer

    • FPGA based hardware recreation
    • Amiga, NES, SNES, Megadrive, Atari, …
  • DE10-Nano board, Cyclone V FPGA

    • 110K logic cells
    • Also much faster (MHz ++)

Impossible?

Doom on IceBreaker

by Sylvain ‘@tnt’ Munaut

  • Source port

  • Architected around RISC-V

    • 25MHz CPU - VexRiscV
  • Pseudo SRAM mod

    • Extra RW memory (8MB)
    • QPI 100 MHz
  • FPGA 128KB SPRAM:

    • 64KB cache / 64 KB framebuffer

Doom on IceBreaker

  • Sets quite a standard!

  • But there is no GPU 😢

    • 🡆 Could we go faster at full 320x200?

    • 🡆 Can we provide a drawing ‘API’?

    • 🡆 Can we squeeze a voxel terrain in?

    • 🡆 Without the PSRAM mod?

Let's find out!

Oh, btw

Memory Layout

  • Constraints recap:

    • 128KB of fast internal memory (SPRAM)
    • 16MB of slower read only external memory (SPIflash)
  • What we would need:

    • 320x200 framebuffer (256 palette) 🡆 64KB
    • Level and texture data (E1M1 🡆 3.1MB, 35KB for level)
    • Some KB of RAM for the CPU part, data-structures, etc.

Memory layout

  • Conclusions:
    • – Textures can only go in SPIflash; good news, they are read-only

    • – Level is not strictly RO, could go to RAM

    • 🡆 But this would leave us with < 32KB for runtime

    •         128KB - 35KB(level) - 64KB(framebuffer)
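
The budget above is easy to check with a few constants. The names are made up for this sketch, and the sizes are the rounded figures from the slides:

```c
// Memory budget from the slides, in bytes (names are illustrative only).
enum {
  KB            = 1024,
  SPRAM_BYTES   = 128 * KB,  // fast internal RAM
  FB_EXACT      = 320 * 200, // one byte per pixel with a 256-color palette
  FB_BUDGET     = 64 * KB,   // the slides round the framebuffer up to 64KB
  LEVEL_BYTES   = 35 * KB,   // E1M1 level data
  RUNTIME_BYTES = SPRAM_BYTES - LEVEL_BYTES - FB_BUDGET
};
```

FB_EXACT is 64000 bytes, just under 64KB, and RUNTIME_BYTES comes out at 29KB, hence the "< 32KB" above.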

Get rid of the framebuffer?

Framebuffer

  • A portable console

    • Uses a small LCD/OLED screen (SPI protocol)
    • These have an internal framebuffer!
  • Still

    • Random access is slow
    • 🡆 To rely on the screen's framebuffer, we have to stream pixels

Streaming pixels

  • Means rendering top/bottom or left/right

    • Can hold a single row/column in memory
  • We’ll use left/right, rendering columns

    • There are several reasons, see later

Rendering like it’s 1990

Drawing Textured triangles

Incorrect interpolation

Perspective correct texturing

Per-pixel division

  • Cost of a division

    • 🡆 A standard design is 1 cycle per bit

    • 🡆 More bits per cycle cost logic and/or MHz

    • 🡆 Fine for a few vertices

But one div *per-pixel* 😱
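
To make the "1 cycle per bit" cost concrete, here is a minimal C sketch of a restoring divider, where each loop iteration stands for one hardware cycle. It illustrates the general technique, not the divider used in this design; the function name and the cycle counter are made up.

```c
#include <stdint.h>

// Restoring division: produces one quotient bit per iteration ("cycle").
// 'bits' is the operand width (at most 32); 'den' must be non-zero.
static uint32_t div_bitserial(uint32_t num, uint32_t den, int bits, int *cycles) {
  uint64_t rem = 0;
  uint32_t quo = 0;
  *cycles = 0;
  for (int i = bits - 1; i >= 0; --i) {
    rem = (rem << 1) | ((num >> i) & 1); // bring down the next bit
    quo <<= 1;
    if (rem >= den) {                    // does the divisor fit?
      rem -= den;
      quo |= 1;
    }
    ++(*cycles);
  }
  return quo;
}
```

A 16-bit division takes 16 iterations. Paying that for every pixel of a 320x200 frame would be 64000 divisions per frame.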

Perspective correct?

  • You could choose to not be perspective correct

  • But … nah (PS1 anyone?)

 

Are we Doomed?

Special cases to the rescue

  • Vertical / horizontal surfaces!

  • Z-constant along screen columns / rows

    • 🡆 Single division for all pixels
Sounds promising! Let's take a look
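
For a wall, the idea can be sketched in C as follows: one division per screen column gives the projection scale, after which every pixel of the column only needs an addition to step through the texture. This is an illustration under assumed conventions (8 fractional bits, a focal length of 128 pixels, heights measured from the eye), not the actual firmware code; all names are hypothetical.

```c
#include <stdint.h>

#define FP          8 // fractional bits (assumed fixed-point format)
#define FOCAL_SHIFT 7 // focal length = 128 pixels (assumed)

typedef struct { int y_top, y_btm; int32_t dv_fp; } wall_span;

// z_fp: wall depth at this column (fixed point, > 0).
// top_h / btm_h: wall top and bottom heights relative to the eye.
// horizon: screen row at eye level.
static wall_span wall_column(int32_t z_fp, int top_h, int btm_h, int horizon) {
  wall_span s;
  // The only true division for this column: pixels per world unit.
  int32_t scale_fp = (int32_t)(((int64_t)1 << (FOCAL_SHIFT + 2 * FP)) / z_fp);
  // Dividing by a constant power of two is just a shift in hardware.
  s.y_top = horizon - (top_h * scale_fp) / (1 << FP);
  s.y_btm = horizon - (btm_h * scale_fp) / (1 << FP);
  s.dv_fp = z_fp >> FOCAL_SHIFT; // world units per screen pixel
  return s;
}

// Texture row for screen row y: a linear step, no division per pixel.
static inline int wall_v(const wall_span *s, int y) {
  return (int)(((y - s->y_top) * s->dv_fp) >> FP);
}
```

With a wall at depth 2.0 spanning heights +1 to -1 and the horizon at row 100, the column covers rows 36 to 164 and steps one world unit of texture every 64 pixels.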

Walls

A good reason to stream columns

Flats (floors and ceilings)

// R_DrawSpan
// With DOOM style restrictions on view orientation,
//  the floors and ceilings consist of horizontal slices
//  or spans with constant z depth.
// However, rotation around the world z axis is possible,
// thus [..] has to traverse the texture at an angle

Flats column by column?

Other advantage

of columns

  • Depth and visibility

  • Doom levels are organized in a BSP tree

    • 🡆 Orders the walls from the viewpoint (also used for localization and collisions)
    • 🡆 Read all about it in the Black Books (M. Abrash, F. Sanglard)
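
As a reminder of how a BSP tree yields this ordering, here is a minimal C sketch of a front-to-back traversal: at each node, visit the child on the viewer's side of the splitting line first. The node layout and the "front" sign convention are assumptions made for this illustration, not Doom's actual data structures.

```c
#include <stddef.h>

// Hypothetical 2D BSP node: a splitting line through (x, y) with
// direction (dx, dy). Leaves have no children and carry an id.
typedef struct bsp_node {
  int x, y, dx, dy;
  const struct bsp_node *front, *back;
  int leaf_id;
} bsp_node;

// Side test: sign of the 2D cross product (convention assumed here).
static int on_front_side(const bsp_node *n, int px, int py) {
  long long cross = (long long)(px - n->x) * n->dy
                  - (long long)(py - n->y) * n->dx;
  return cross >= 0;
}

// Appends leaf ids to 'out', nearest to the viewer first.
static void bsp_front_to_back(const bsp_node *n, int px, int py,
                              int *out, int *count) {
  if (n == NULL) return;
  if (n->front == NULL && n->back == NULL) {
    out[(*count)++] = n->leaf_id;
    return;
  }
  if (on_front_side(n, px, py)) {
    bsp_front_to_back(n->front, px, py, out, count);
    bsp_front_to_back(n->back,  px, py, out, count);
  } else {
    bsp_front_to_back(n->back,  px, py, out, count);
    bsp_front_to_back(n->front, px, py, out, count);
  }
}
```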

Visibility

Doom

  • Front-to-back ordering from BSP-tree
    • But we don’t want to draw everything each frame
    • 🡆 Lets us render front to back and stop early

2D as seen from above

The Terrain

Rendering with columns

  1. Go through the scene BSP

    • 🡆 Sorted list of all potentially visible walls
  2. Project candidate walls on screen

    • 🡆 Gets a first/last column
  3. For each screen column

    • 🡆 Traverse walls
    • 🡆 Draw a segment if covered
    • 🡆 Lower / Upper / Middle
    • 🡆 Stop if middle is drawn

Rendering with columns

for (int c = 0 ; c != doomchip_width ; ++c) {
  // ..
  for ( ; v < v_end ; ++v ) {
    if (c >= vis[v].i0 && c <= vis[v].i1) {
      // lower wall
      if (bspSegs[seg].lwr) {
        // ..
      }
      // upper wall
      if (bspSegs[seg].upr) {
        // ..
      }
      // middle wall
      if (bspSegs[seg].mid) {
        // ..
        // close column?
        if ((bspSegs[seg].flags&1) == 0) {
          //              ^^^^^ transparent?
          top = btm; // opaque, close column
          break;
        }
      }
    }
  }
}

Graphics hardware

Software / Hardware

Our renderer design tradeoffs

Software / Hardware

  • CPU side:
    • 🡆 View
    • 🡆 Perspective
  • Hardware side:
    • 🡆 Column drawing (walls, flats, terrain)
    • 🡆 Texturing

Column texturing hardware

Draw queue (HW)

Draw queue (CPU side)

// draw column command
volatile unsigned int*  const COLDRAW0     = (unsigned int* )0x40014;
volatile unsigned int*  const COLDRAW1     = (unsigned int* )0x40010;
static inline void col_send(unsigned int t0,unsigned int t1) {
  *COLDRAW0 = t0;  *COLDRAW1 = t1;
}
// ceiling with flat texturing
col_send(COLDRAW_FLAT(-sec_c_h,cx),
         COLDRAW_COL(bspSectors[sec].c_T,c_h,top, seclight) | FLAT);
// upper wall
col_send(COLDRAW_WALL(y,tex_v,tc_u),
         COLDRAW_COL(bspSegs[seg].upr, c_o,top, seclight) | WALL);
// terrain
col_send(COLDRAW_TERRAIN(start_dist,end_dist,pick),
         COLDRAW_COL    (terrain_texture_id, btm, top, 15) | TERRAIN);
// end of column (EOC)
col_send(0, COLDRAW_EOC);
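
The COLDRAW_* macros are not shown in these slides. Going by the bit fields the hardware excerpts in this deck decode from the second word (texture id in bits 0-7, end-of-column tag in bit 9, column start in bits 10-17, column end in bits 18-25), a packing helper could look like the sketch below. All names are hypothetical, and the remaining bits (light level, segment type) are left out because their layout is not given.

```c
#include <stdint.h>

#define EX_EOC_BIT (1u << 9) // end-of-column tag (bit 9 in the HW decode)

// Packs the fields whose positions are visible in the hardware excerpts.
static inline uint32_t ex_pack_col(uint8_t tex_id, uint8_t start, uint8_t end) {
  return (uint32_t)tex_id | ((uint32_t)start << 10) | ((uint32_t)end << 18);
}

static inline uint8_t ex_col_start(uint32_t w) { return (uint8_t)((w >> 10) & 0xFF); }
static inline uint8_t ex_col_end(uint32_t w)   { return (uint8_t)((w >> 18) & 0xFF); }
static inline int     ex_is_eoc(uint32_t w)    { return (int)((w >> 9) & 1); }

// Mirrors the hardware test: a command is queued if it is non-empty or EOC.
static inline int ex_is_queued(uint32_t w) {
  return (ex_col_start(w) != ex_col_end(w)) || ex_is_eoc(w);
}
```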

Draw queue (HW side)

// memory mapping
if ((prev_mem_wenable != 0) & prev_mem_addr[16,1]) {
  switch (prev_mem_addr[2,4]) {
    case 4b0001: {
      if (prev_mem_addr[0,1]) {
        // received COLDRAW0
        coldraw.in_tex0 = prev_mem_wdata[0,32];
      } else {
        // received COLDRAW1
        uint8  start <: prev_mem_wdata[10,8];
        uint8  end   <: prev_mem_wdata[18,8];
        uint1  empty <: start == end;
        uint1  eoc   <: prev_mem_wdata[9,1];
        // send segment to drawer
        coldraw.in_tex1  = prev_mem_wdata[0,32];
        coldraw.in_ready = ~empty | eoc; // not null or eoc tag
      }
    }
    // ..

Column texturing hardware (HW)

Column drawer (HW)

algorithm column_drawer(
  input  uint1       in_ready, // pulse
  input  uint32      in_tex1,
  input  uint32      in_tex0,
  output uint1       scr_send(0),
  output uint17      scr_data,
  input  uint1       scr_full,
  output uint1       fifo_empty,
  output uint1       fifo_full,
  output uint8       pickedh,
  spiflash_user      sf,
  input view         vw,
) <autorun> {
$$log_n_fifo = 8
$$n_fifo     = 1<<log_n_fifo

  simple_dualport_bram uint64 fifo   [$n_fifo$]                      =uninitialized;
  simple_dualport_bram uint12 colbufs[$1 << (doomchip_height_p2+1)$] =uninitialized;
  // ...

Column drawer (HW)

  segment_drawer drawer<reginputs>(
    colbufs  <:> colbufs,
    sf       <:> sf,
    vw       <:> vw,
    pickedh   :> pickedh,
  );

  column_sender sender<reginputs>(
    colbufs <:> colbufs,
    scr_send :> scr_send,
    scr_data :> scr_data
  );

Column drawer (HW)

always {
  // ..
  if (in_ready) {
    // store draw command in FIFO
    fifo.wenable1 = 1;
    fifo.wdata1   = {in_tex0,in_tex1};
    fifo.addr1    = fifo.addr1 + 1;
    // ..
  } else {
    if ( ~is_empty & ~drawer.busy /*..*/ ) { // process next
      uint1 draw_seg <:: ~eoc;
      uint1 send_col <::  eoc & ~sender.busy;
      // ..
      // draw the next segment?
      drawer.in_start = draw_seg;
      // send the column?
      sender.in_start = send_col;
      draw_buffer     = send_col ^ draw_buffer;
      // ..
      fifo.addr0 = (draw_seg | send_col) ? fifo.addr0 + 1 : fifo.addr0;
    }
  }
}
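
The command FIFO behaves like a ring buffer with a write index and a read index that wrap around its 256 entries. A software model could look like the sketch below. It is an illustration only: the empty/full tests are elided in the excerpt above, so the ones used here are a common convention and an assumption, and the names are invented.

```c
#include <stdint.h>

#define LOG_N_FIFO 8
#define N_FIFO     (1 << LOG_N_FIFO)

typedef struct {
  uint64_t slots[N_FIFO];
  uint8_t  wr; // write index (plays the role of fifo.addr1)
  uint8_t  rd; // read index  (plays the role of fifo.addr0)
} cmd_fifo;

// Assumed convention: empty when indices match, full one slot before that.
static inline int fifo_is_empty(const cmd_fifo *f) { return f->wr == f->rd; }
static inline int fifo_is_full(const cmd_fifo *f) {
  return (uint8_t)(f->wr + 1) == f->rd;
}

// Stores {tex0, tex1} as one 64-bit word, tex0 in the upper half.
static inline int fifo_push(cmd_fifo *f, uint32_t tex0, uint32_t tex1) {
  if (fifo_is_full(f)) return 0;
  f->slots[f->wr] = ((uint64_t)tex0 << 32) | tex1;
  f->wr = (uint8_t)(f->wr + 1); // wraps at 256
  return 1;
}

static inline int fifo_pop(cmd_fifo *f, uint32_t *tex0, uint32_t *tex1) {
  if (fifo_is_empty(f)) return 0;
  *tex0 = (uint32_t)(f->slots[f->rd] >> 32);
  *tex1 = (uint32_t)(f->slots[f->rd] & 0xFFFFFFFFu);
  f->rd = (uint8_t)(f->rd + 1);
  return 1;
}
```

With these conventions the queue holds at most 255 pending commands. The CPU can keep pushing segments while the drawer is busy with earlier ones.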

Graphics hardware

Segment drawer

  • Responsible for drawing

    • – Walls
    • – Flats
    • – Terrain
  • All of this is interleaved in the same logic

  • Pipeline running in parallel

    • – Texture sampler (color from uv)
    • – Per-pixel computations (next uv)

Segment drawer

algorithm segment_drawer(
  input uint1                in_start(0), // pulse
  input uint32               in_tex1,
  input uint32               in_tex0,
  input uint1                buffer, // which buffer?
  simple_dualport_bram_port1 colbufs,
  output uint1               busy(0),
  output uint8               pickedh,
  spiflash_user              sf,
  input view                 vw,
)  {
  // ..

Segment drawer

sampler2D       sampler_io;
texture_sampler sampler(sf <:> sf, smplr <:> sampler_io);

// BRAM for single column depth buffer
simple_dualport_bram uint16 depths[$doomchip_height$] = uninitialized;

// BRAM for 1/y table (flats, terrain)
bram uint16 inv_y[2048] = {
  65535,
$$for hscr=1,2047 do
  $math.round(65535/hscr)$,
$$end
};

uint8  tex_id    <: in_tex1[0,8];
uint8  col_start <: in_tex1[10,8];
uint8  col_end   <: in_tex1[18,8] > 8d$doomchip_height-1$
                    ? 8d$doomchip_height-1$ : in_tex1[18,8];

Segment drawer

// multiply and add
int24 result     <:: (a * b) + c;

// goes through transform computations
// for both flats and terrain columns
// (sampler works in parallel)
always {
  switch ({terrain,state})
  {
    case 1: { // ---- computes v (flats)
      a     = __signed(inv_y.rdata);   //_  1/y_screen
      b     = __signed(in_tex0[0,14]); //_ *h
      c     = {24{1b0}};
    }
    case 2: { // ---- computes u (flats)
      mul_d = result >>> 6;
      a     = __signed(mul_d);          //_  h/y_screen
      b     = __signed(in_tex0[16,16]); //_ *x_screen
      c     = {24{1b0}};
    }
    // ..
  }
  state = state[3,1] ? state : (state+1);

Segment drawer

if (in_start) {
  end                = terrain ? col_start : col_end;
  current            = col_start;
  drawing            = 1;
  // bind texture
  sampler_io.do_bind = (tex_id != sampler_io.tex_id);
  sampler_io.tex_id  = tex_id;
  // init tc_u and tc_v
  tc_u               = __signed( in_tex0[24,8] );
  tc_v               = terrain ? __signed({in_tex0[16,11], 8b0})
                                : __signed({in_tex0[16, 8],11b0});
  // ..
} else {
  if (smplr_delay[$delay_bit$,1]) {
    // a texture sample is available
    drawing             = still_drawing;
    sampler_io.do_fetch = still_drawing;
    state               = 0;
    // ..
  } else {
    smplr_delay = (drawing & sampler_io.ready)
                ? {smplr_delay[0,$delay_bit$],smplr_delay[$delay_bit$,1]}
                : 1;
  }
}

SPI flash

  • QPI mode (4 wires used as IO)
  • Controller setup:
    • Each access sends address
    • Returns one byte
  • Running at 50 MHz, could go up to 100 MHz!
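
To get a feel for what "each access sends an address and returns one byte" costs, here is a back-of-the-envelope C sketch. The command, address and dummy-cycle counts are parameters because the controller's exact protocol is not given here; the values in the usage note are assumptions, not measurements.

```c
// Clocks for one single-byte read in QPI mode (4 bits per clock).
// cmd_bits, addr_bits and dummy_clocks depend on the flash chip and
// controller configuration; they are inputs, not known values.
static inline int qpi_clocks_per_byte(int cmd_bits, int addr_bits, int dummy_clocks) {
  return (cmd_bits + addr_bits + 8) / 4 + dummy_clocks;
}

// Upper bound on random texel fetches per second at a given SPI clock.
static inline long texels_per_second(long spi_hz, int clocks_per_access) {
  return spi_hz / clocks_per_access;
}
```

For example, assuming an 8-bit command, a 24-bit address and 2 dummy clocks, one byte costs 12 clocks, which caps fetches at about 4.2 million per second at 50 MHz and twice that at 100 MHz.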

CPU

The IceV-dual
 
  • RISC-V RV32I
  • dual core design
  • compact (~ 1K LUTs)
  • Modified:
    • 1 cycle shifts
    • 1 cycle MUL
    • 34 cycles DIV
The IceV-dual
 
  • 25 MHz
  • 4 cycles per instruction
  • More like 6.25 MHz ...
  • (but two cores)
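
The arithmetic behind these numbers, plus the instruction budget it implies per screen column, can be sketched as follows. The frame rate is a free parameter here, and the function names are made up.

```c
// Instructions per second for one core.
static inline long instr_per_second(long clock_hz, int cycles_per_instr) {
  return clock_hz / cycles_per_instr;
}

// Instructions one core can spend on each screen column at a given
// frame rate ('fps' is whatever frame rate one hopes to reach).
static inline long instr_per_column(long ips, int fps, int columns) {
  return ips / ((long)fps * columns);
}
```

At 25 MHz and 4 cycles per instruction, each core executes 6.25 million instructions per second. Aiming for 10 frames per second with 320 columns, for instance, would leave roughly 1950 instructions per column per core.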

Recap

Hardware specs

  • Dual-core 6.25 MHz RISC-V RV32IM-ish CPU

  • 320x240 SPI screen (LCD)

  • 16MB SPI flash

  • 128KB fast RAM (FPGA SPRAM)

  • Column drawer GPU with walls, flats, terrains

How much code is that?

   

  • Hardware: ~ 1700 lines of Silice

  • Firmware: ~ 1400 lines of C

 

(comments and all)

But does it work?

Synthesis, resources, MHz

MHz: 23-24/50-60 for main design, runs at 30/60 MHz

Fun fact:

How can we know the terrain height?

  • We can’t from the CPU

  • So we ask the hardware

CPU side

int pick = (col == 160 && start_dist == 0) ? PICK : 0;
  col_send(
    COLDRAW_TERRAIN(start_dist,end_dist,pick),
    COLDRAW_COL    (terrain_texture_id, btm, top, 15) | TERRAIN
  );

Hardware side

pickedh    = pickh & ~pickh_done ? sampler_io.texel : pickedh;
pickh_done = 1;

Gotchas

  • With the SPI flash in quad mode, the board can no longer be programmed!

Gotchas

  • DSP synthesis had a small bug
    • Took a deep dive into Yosys to fix!

Gotchas

  • Fixed point is not especially easy
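
A classic example of the kind of trap involved, sketched in C with an assumed 16.16 format (names are illustrative): multiplying two fixed-point numbers needs twice the bits for the intermediate product, otherwise the high bits are silently lost.

```c
#include <stdint.h>

#define FX_BITS 16
typedef int32_t fx16; // 16.16 fixed point (assumed format)

static inline fx16 fx_from_int(int v) { return (fx16)(v * (1 << FX_BITS)); }

// Too narrow: the product is truncated to 32 bits before the shift.
// (Unsigned arithmetic keeps the wrap-around well defined in C.)
static inline fx16 fx_mul_narrow(fx16 a, fx16 b) {
  return (fx16)(((uint32_t)a * (uint32_t)b) >> FX_BITS);
}

// Wide enough: keep the full 64-bit product, then drop 16 fractional bits.
// (Right-shifting a negative value is an arithmetic shift on common
// compilers; strictly speaking it is implementation-defined.)
static inline fx16 fx_mul_wide(fx16 a, fx16 b) {
  return (fx16)(((int64_t)a * (int64_t)b) >> FX_BITS);
}
```

With these definitions, 2.0 times 2.0 gives 4.0 through fx_mul_wide but 0 through fx_mul_narrow. In hardware, the same question shows up as choosing the width of every multiplier and every shift.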

How to survive?

Use simulation

Icarus Verilog and Verilator

What’s to improve

 

  • SPI flash could run at 100 MHz (texturing x2!)

  • Sprites …

  • Fixed point is not robust everywhere

  • API, documentation

So stay tuned! Follow @sylefeb

Thank you

@sylefeb