64Doom: Classic Doom on N64

As for changes to 64Doom in this release:

* very fast assembly implementations of memcpy and memset, lifted from MIPS Technologies code in the GNU C Library and Android

* optimal hand-rolled assembly routines for endianness conversion -- SwapSHORT and SwapLONG -- that I cannot find any way to make better or shorter; they are basically one MIPS instruction per C operation at this point (see "m_swap.S")

* optimal-ish hand-rolled assembly routine for FixedMul; I didn't get around to FixedDiv yet but will push a version soon (see "m_fixedmul.S")

* hand-rolled assembly for R_DrawColumn and R_DrawSpan, the two functions responsible for drawing the entire in-game display except for the status bar. I took a lot of time to write these so that they have no prologue/epilogue and never spill registers. They use only caller-saved registers along with the argument and return value registers, so nothing has to be saved or restored. They have early-return conditions that can be met after executing 4 to 5 instructions for columns and spans smaller than a pixel. Also, their texture-mapping loops (the "do { ... } while(count--);" loops at the end of the C versions) are shorter than any GCC-generated code at any optimization level, and the functions as a whole are 20 to 30 instructions shorter than anything GCC generates (see "R_DrawColumn.S" and "R_DrawSpan.S")
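
For readers who don't have the source handy, here is a rough C sketch of what the swap and fixed-point multiply routines compute. The `_ref` names are made up for illustration and this is not the code from "m_swap.S" or "m_fixedmul.S"; it only assumes the usual 16.16 fixed-point format Doom uses.

```c
#include <stdint.h>

typedef int32_t fixed_t;   /* 16.16 fixed point (assumed, as in the Doom source) */
#define FRACBITS 16

/* Swap the two bytes of a 16-bit value. */
uint16_t swap_short_ref(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap_long_ref(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* Multiply two 16.16 values through a 64-bit intermediate.
   Relies on arithmetic right shift of negative values, which is what
   common compilers do but is implementation-defined in C. */
fixed_t fixed_mul_ref(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * (int64_t)b) >> FRACBITS);
}
```

Each shift, mask and OR here is the kind of operation that can map to a single MIPS instruction, which is why there is so little room left to shrink the assembly.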

Edited by jnmartin84

Props!

 

Do you use a profiler to measure hotspots and gauge efficiency of your implementation?

 

On 12/13/2017 at 4:42 PM, _bruce_ said:

Props!

 

Do you use a profiler to measure hotspots and gauge efficiency of your implementation?

 

I had some test code to benchmark the memcpy/memset code. The two functions ran something like 5x faster than the previous versions when executing on a real Nintendo 64 console. Benchmarks running in MESS are wildly misleading :-D .

As for the video code, I'm going by instruction count, accesses to main memory, and pipeline knowledge to avoid stalls. Same for the word swap and FixedMul code.

Also, in regard to hotspots, I profiled the original linuxxdoom code a decade or so ago and profiled the renderer again a couple of months back when trying to code an RDP hardware renderer. There are also profiling results available online from other developers that I've referenced as needed.
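
A memcpy comparison like the one mentioned above can be timed with a very small harness. This is only a sketch and not the actual test code: `time_copy` and `byte_copy` are hypothetical names, and it uses the standard C `clock()` for portability, where on real hardware a cycle counter would be the better choice.

```c
#include <stddef.h>
#include <time.h>

/* Signature shared by memcpy and any replacement under test. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Run `iterations` copies of `n` bytes and return elapsed CPU seconds. */
double time_copy(copy_fn fn, void *dst, const void *src,
                 size_t n, unsigned iterations)
{
    clock_t start = clock();
    for (unsigned i = 0; i < iterations; i++) {
        fn(dst, src, n);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Naive byte-at-a-time copy, standing in for a slow baseline. */
void *byte_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) {
        *d++ = *s++;
    }
    return dst;
}
```

Calling `time_copy` once with the baseline and once with the new routine, on the same buffers and iteration count, gives a ratio to compare. As noted above, that ratio only means something when measured on the real console.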


Got FixedDiv implemented in assembly (and working... ) but have not attempted any optimization yet:

FixedDiv:
        .global FixedDiv
        .set    noreorder
        .set    nomacro

        # t1 = |a|, t3 = |b| (branch-free absolute value)
        sra     t0,     a0,     31
        xor     t1,     a0,     t0
        subu    t1,     t1,     t0
        sra     t2,     a1,     31
        xor     t3,     a1,     t2
        subu    t3,     t3,     t2
        # overflow check: saturate when (|a| >> 14) >= |b|
        srl     t1,     t1,     14
        slt     t4,     t1,     t3
        beq     t4,     zero,   _FixedDiv_test
        xor     t0,     a0,     a1              # delay slot: sign of the result
        # 64-bit divide: (a << 16) / b
        dadd    a0,     a0,     zero
        dadd    a1,     a1,     zero
        dsll    a0,     a0,     16
        ddiv    a0,     a1
        nop
        nop
        mflo    v0

_FixedDiv_end:
        jr      ra
        nop

_FixedDiv_test:
        bltz    t0,     _FixedDiv_return_INT_MIN
        lui     v0,     0x7FFF                  # delay slot: upper half of INT_MAX

_FixedDiv_return_INT_MAX:
        ori     v0,     v0,     0xFFFF          # v0 = 0x7FFFFFFF
        jr      ra
        nop

_FixedDiv_return_INT_MIN:
        lui     v0,     0x8000                  # v0 = 0x80000000
        jr      ra
        nop
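
For comparison, here is a C sketch of the same routine: an overflow guard that saturates to INT_MIN or INT_MAX, followed by a 64-bit divide. The name `fixed_div_ref` is made up for illustration, and widening to 64 bits before taking the absolute value is my own choice to keep the sketch well-defined for INT32_MIN; it is not a claim about how the original C is written.

```c
#include <stdint.h>

typedef int32_t fixed_t;   /* 16.16 fixed point (assumed) */

fixed_t fixed_div_ref(fixed_t a, fixed_t b)
{
    /* Widen first so that negating INT32_MIN cannot overflow. */
    int64_t abs_a = a < 0 ? -(int64_t)a : (int64_t)a;
    int64_t abs_b = b < 0 ? -(int64_t)b : (int64_t)b;

    /* If the quotient would not fit in 32 bits (this also covers b == 0),
       saturate according to the sign of the result. */
    if ((abs_a >> 14) >= abs_b) {
        return ((a ^ b) < 0) ? INT32_MIN : INT32_MAX;
    }

    /* (a << 16) / b, written as a multiply to avoid shifting a negative value. */
    return (fixed_t)(((int64_t)a * 65536) / b);
}
```

The guard is what the first ten or so instructions of the assembly are doing; the divide itself is only a handful of instructions on a CPU with a 64-bit `ddiv`.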


gcc's take on FixedDiv ... looks familiar :-o

 


        .file 1 "m_fixed.c"
        .set    nomips16
        .set    nomicromips
        .ent    FixedDiv
        .type   FixedDiv, @function
FixedDiv:
        .frame  $sp,0,$31               # vars= 0, regs= 0/0, args= 0, gp= 0
        .mask   0x00000000,0
        .fmask  0x00000000,0
        .set    noreorder
        .set    nomacro
        sra     $2,$4,31
        sra     $6,$5,31
        xor     $7,$2,$4
        subu    $7,$7,$2
        xor     $2,$6,$5
        sra     $7,$7,14
        subu    $6,$2,$6
        slt     $6,$7,$6
        bne     $6,$0,$L2
        dsll    $3,$4,16

        xor     $4,$4,$5

        bltz    $4,$L4
        nop

        li      $2,2147418112                   # 0x7fff0000
        j       $31
        ori     $2,$2,0xffff

$L2:
        move    $4,$5
        ddiv    $0,$3,$4
        teq     $4,$0,7
        mflo    $2
        j       $31
        sll     $2,$2,0

$L4:
        j       $31
        li      $2,-2147483648                  # 0xffffffff80000000

        .set    macro
        .set    reorder
        .end    FixedDiv


I feel like MIPS is one of those processor architectures where there aren't very many opportunities for "exotic" optimizations, just clever register usage / reusage (kind of like those old C tricks like swapping without a temp variable, etc).

 

I often find my code nearly identical to gcc -O2 output by my first or second cleanup pass.
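
For anyone who hasn't seen the swap trick mentioned above, it looks roughly like this in C. This is just an illustration with a made-up function name, not code from 64Doom.

```c
#include <stdint.h>

/* Swap two values using XOR instead of a temporary variable. */
void xor_swap(uint32_t *a, uint32_t *b)
{
    if (a == b) {
        return;   /* XORing a value with itself would zero it */
    }
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
```

It saves a register at the cost of a serial dependency between the three operations, which is the same flavor of trade-off as reusing registers in hand-written MIPS.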


To be fair to GCC, the MIPS architecture (the MIPS III ISA in the case of the Nintendo 64, not counting the RSP's vector instruction extensions) is so simple that most things in C map directly to small sequences of MIPS instructions, sometimes one-to-one. Unlike x86, to pick a particularly egregious example (the whole RISC vs CISC thing...), there aren't a half-dozen different ways to do something like copy a string. I'm not an Intel expert, but I can think of two different opcodes that will do just that: copy a string with a single instruction rather than a dozen-ish MIPS instructions with control flow transfer.
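
To make the string copy point concrete, this is roughly the loop a load/store machine has to run. It is a plain C sketch with a hypothetical name, not anything from the 64Doom tree.

```c
#include <stddef.h>

/* Copy a NUL-terminated string one byte at a time. */
char *copy_string(char *dst, const char *src)
{
    char *out = dst;
    /* Each iteration is a load, a store, two pointer increments,
       and a test-and-branch on the byte just copied. */
    while ((*out++ = *src++) != '\0') {
        /* nothing else to do */
    }
    return dst;
}
```

Every step in that loop body has to be its own instruction on MIPS, so there is basically one sensible way to write it and the compiler finds it.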


Can you be more specific than that? Are you saying the rom couldn't be made at all, or that it could be made but didn't work on your favorite emulator?

3 hours ago, Danfun64 said:

Can you be more specific than that? Are you saying the rom couldn't be made at all, or that it could be made but didn't work on your favorite emulator?

The rom cannot be made.


Can you be any more specific? Error messages? I would like to help.


I'm seeing reports on another forum that there might be line-ending issues with the shell script I packaged in the toolkit.

On 12/15/2017 at 7:40 AM, jnmartin84 said:

I feel like MIPS is one of those processor architectures where there aren't very many opportunities for "exotic" optimizations, just clever register usage / reusage (kind of like those old C tricks like swapping without a temp variable, etc).

Makes sense really - it's a RISC architecture (one of the original ones) and the central idea behind RISC is to have fewer, simpler instructions which can be better optimized in the CPU design. Most of the time there should really be only one way to do things.

On 1/3/2018 at 10:23 AM, fraggle said:

Makes sense really - it's a RISC architecture (one of the original ones) and the central idea behind RISC is to have fewer, simpler instructions which can be better optimized in the CPU design. Most of the time there should really be only one way to do things.

The antithesis of things like the VAX which could compute a polynomial with a single CPU instruction. :-D
