
FastDoom: DOS Vanilla Doom optimized for 386/486 processors

Recommended Posts

7 hours ago, Darkcrafter07 said:

Outstanding job, I got Doom 2 running at 15-35FPS in potato mode on a 386dx-33 in PCem. Not sure if it's doable if you could split rendering to two screens and render them with different detailizations, like the near field objects are rendered in low details and far field in high. Would it work? Kind of unnecessary with processors we have but just for a sake of curiosity. Maybe it could work even faster on such slow systems if split screen was introduced with potato and super-potato mode for close objects. 

 

It might be possible to split the screen into several vertical stripes with different horizontal resolutions. It is true that the central region is much more important than the rest of the screen.

You get a 90° horizontal FOV across 320 pixels of width. It could be nice to have a 110/100/110 split with the central region at high resolution and the sides at half resolution, or maybe 5 regions of 64 columns each: potato on the outermost ones, low detail on the sides, and high resolution in the center. I'm not familiar enough with VGA Mode X for that, but I guess it would just mean playing more with the map mask register?


Mixing detail levels is doable, as detail levels only exist to benefit from the VGA Mode Y trick that allows writing 2 or 4 bytes of VRAM data with a single byte write. Potato detail mode (in Mode Y executables) benefits even further, since it only has to select the map mask once when starting to render the scene, instead of once per column or four times per visplane scanline as in the low/high detail modes. Each OUT instruction costs around 10 cycles on a 486, so that takes a lot of time away from rendering.

 

Another possibility is to use different detail levels for scene rendering and sprite rendering; I guess this is easier to modify. Maybe this way it's possible to see enemies clearly while using lower detail for the scenery.

 

Again, this will take time to investigate. For now I'll be focusing on an interesting optimization that Ken Silverman (🤯) has pointed out to speed up the rendering code even further: https://github.com/viti95/FastDoom/issues/143

 

 



You've probably already looked into this, but Heretic is based on the older 1.2 code base. There might be some optimizations in there that were removed for the Doom 1.10 / Linux cleanup. Maybe Hexen did something clever here and there that is worth looking into. It's a shame the source code for Strife was lost, since that one is based on v1.666 and could potentially be closer to 1.9.

I'm no assembly programmer; I prefer to work with high-level optimizations. Have you looked into reordering the node tree to make it more cache-friendly? On all but the slowest platforms, this could be a nice boost.

I have sometimes thought about asking id to release the 1.9 source code if we can get DMX GPLed. We could then set up a fundraiser to buy the rights to DMX. Reverse engineering 1.9 shouldn't be too hard though, given that we have the 1.10 version and the Heretic 1.2-based engine. There can't be that many changes.

14 hours ago, viti95 said:

Mixing detail levels is doable, as detail levels are only used to benefit from the VGA mode Y trick ... Another possibility is to use different detail levels for scene rendering and sprite rendering ...

 

 

Pretty impressive that you got Ken's attention on this. He loves his assembly jobs. Make sure he sees this thread; it's great to see him take this kind of interest in his competitor :P


Yeah, the guy is a legend, and I enjoyed Duke Nukem 3D just as much as Doom when I was a kid.

10 hours ago, zokum said:

Reverse engineering 1.9 shouldn't be too hard though, given that we got the 1.10 version and the Heretic 1.2 based engine. There can't be that many changes.


Ken Silverman giving tips on how to speed up Doom is awesome!


DJGPP produces faster Doom code than Watcom. I thought it was because of its whole-program optimization, which Watcom lacks, so as a test I put everything in one big doom.c file. The Watcom code got a bit faster, but still not nearly as fast as DJGPP's code.

27 minutes ago, Frenkel said:

DJGPP produces faster Doom code than Watcom. I thought it was because of its whole program optimization, which Watcom lacks. So as a test I put everything in one big doom.c file. The Watcom code was a bit faster, but not nearly as fast as DJGPP's code. 

 

DJGPP is based on the latest gcc version, which is one of the best C compilers to date; even without link-time optimization gcc is nuts. Watcom's code generation was pretty good for the era, but today it is far behind, and OpenWatcom 2.0 did not improve the quality of the optimizer.

 

However, in those benchmarks you are using DOSBox, which is not a very good indicator of code performance, because it does not really emulate any real CPU; it just limits the speed of execution in instructions per second. You could use PCem, which can emulate a 386 much more precisely, for better benchmarks.

 

There are a ton of flags for gcc; I am sure we can find the best settings for a 386.

 

EDIT: well, the latest DJGPP is based on gcc 9.3.0, which is not the latest gcc but was released in 2020.



I use this release of DJGPP which is based on GCC 12.1.0.

Yeah, I shouldn't use DOSBox for benchmarking, but I just like how it can access my C: drive so I can quickly run new builds, and I like its built-in way to record videos.
In DOSBox the optimized-for-space version of the Digital Mars executable is faster than the optimized-for-speed version. :D
Also, DJGPP -march=i386 is faster than i486, which in turn is faster than i586, in DOSBox.


I tried djdoom on PCem with a 486DX2@66 and I got these results on -timedemo demo1 (Ultimate Doom, gametics/realtics, lower realtics = faster):

djdoom (your build): 1710/2000

djdoom build with my flags: 1710/1995

wcdoom: 1710/2597

DOOM.EXE: 1710/1897

FDOOM: 1710/1367

 

So GCC is faster than Watcom, but djdoom is much slower than even vanilla Doom.

I will try more build flags and post them to your GitHub.

 

I tried -O2 and -Os and they are slower on PCem.

Outside of the absence of real CPU timing emulation (which means the only thing that matters is code size), there is also the problem of memory access: in DOSBox all the RAM is as fast as it can physically be, which means a program reads and writes memory instantly, as if the whole RAM were L1 cache.

I do not know why djdoom is slower than vanilla; I will have to check the source.


86Box is much more accurate for emulation (in every aspect) than DOSBox. There is an easy way to access drive files from that emulator: just set up a CD-ROM drive in DOS (MSCDEX + any IDE driver) and mount your dev folder directly as that CD drive (no ISO required at all).

 

BTW, I tried to build DJDoom on Linux; it builds fine if all header files (.h) referenced in the code are renamed to lower-case, otherwise compilation fails.

 

21 hours ago, zokum said:

... Have you looked into reordering the nodes-tree to make it more cache friendly? On all but the slowest platforms, this could be a nice boost ...

 

Well, I tried optimizing the nodes with ZenNode and ZokumBSP and the results were clear, there is a performance uplift: https://www.doomworld.com/forum/post/2175041

 

Also the WadPtr tool made the timedemos run faster, but in both cases there were issues: WadPtr caused some animated textures to run much faster (for example, in E1M1 the rotating texture below the green armor spun like crazy), and reordering the nodes caused demos to desync.

 

@Ramon_Demestre DJDoom is slower due to non-optimized rendering routines; it's just non-unrolled C code.


22 minutes ago, Ramon_Demestre said:

I tried djdoom on PCem with a 486DX2@66 and i got on -timedemo demo1 (ultimate-doom):

djdoom (your build): 1710/2000

djdoom build with my flags: 1710/1995

wcdoom: 1710/2597

DOOM.EXE: 1710/1897

FDOOM: 1710/1367

 

It's insane to see a 29% performance uplift just from using DJGPP compared to OpenWatcom. Just add optimized ASM code for the rendering functions and I'm pretty sure DJDoom will be faster than FastDoom.


Yes indeed, this is insane. Compilers have become amazingly good.

FWIW:

djdoom -O2: 1710/2127

djdoom -Os: 1710/2473

So gcc -Os is still a little better than Watcom, but Watcom generates an even smaller exe: 376 KB for Watcom vs 390 KB for gcc -Os.


Maybe it's not required to use GAS at all (yeah, I also hate it). It's possible to generate DJGPP-compatible objects with NASM, so maybe FastDoom's rendering functions can be used directly in DJDoom.


I will work (when I feel like it) on some small cool functions, e.g.:

fixed_t FixedMul(fixed_t a, fixed_t b)
{
    __asm__ (
        "imull %2           \n\t"
        "shrdl $0x10, %%edx, %%eax"
    : "=a" (a)          /* OUTPUT: result in EAX       */
    : "a" (a), "r" (b)  /* INPUT: a in EAX, b anywhere */
    : "edx", "cc"       /* CLOBBER: EDX and the flags  */
    );
    return a;
}
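To check the asm against known-good values, a portable C reference with the same 16.16 semantics could sit next to it (FixedMul_ref is a made-up name for comparison only; a 64-bit intermediate type is assumed to be available):

```c
#include <stdint.h>

typedef int32_t fixed_t;
#define FRACBITS 16

/* Portable reference for the asm version: widen to 64 bits and take the
 * middle 32 bits of the product, which is exactly what imul + shrd $16
 * computes. */
fixed_t FixedMul_ref(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * (int64_t)b) >> FRACBITS);
}
```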

With this I go from 1995 -> 1979 realtics.

 

NASM should be able to produce GCC-compatible COFF objects, I guess.


Rebuilding the nodes does much more than reordering them. In Doom the root node is the last node; the ones below it tend to be the second-to-last and third-to-last, and so on. Apparently this isn't very cache-friendly on older hardware. Simply flipping it upside down, having it start at index 0 and hopefully also cache indices 1, 2, 3, etc., could lead to a performance increase. Ideally you want to put them in an order where the most-used nodes are near each other, so that you get the most out of caching. This was originally an idea by Linguica.

A node builder would still have to keep the last node as the root node, but could from there on use index 0 and 1 for the next ones, etc. There's probably some sort of optimal way to pack the node references that will lead to higher performance. It is one of those "maybe" things I could put into ZokumBSP. Almost no one would notice it, though.

It might have more merit on larger maps with many monsters and more traversal of the node tree per frame. Since you can freely reorder the tree, having some sort of optimization pass would make sense if the executable sees that the node tree is stored bottom-up. There might even be different access patterns that work better on some CPUs than on others.

Adding it to the node builder would speed up new maps in almost all ports. Adding it to the executable would speed up all maps in that port, which is even better than doing it in the node builder. So in an ideal world, we'd have both. But for FastDoom, I'd say a port improvement is the more relevant approach.
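The in-executable reordering could be sketched as a breadth-first renumbering that puts the root at index 0 (a hypothetical helper, not ZokumBSP code; Doom's NODES lump marks subsector leaves with bit 15, which this toy version keeps, and it omits all the other node fields and error handling):

```c
#define NF_SUBSECTOR 0x8000  /* child is a subsector leaf, as in Doom */
#define MAX_NODES 64

typedef struct { unsigned short children[2]; } node_t;

/* Reorder nodes breadth-first from the root (Doom stores the root as the
 * LAST node) so that a parent and its children end up at nearby indices.
 * Sketch only: fixed-size queue, no overflow checks. */
int reorder_nodes_bfs(node_t *nodes, int numnodes, node_t *out)
{
    int queue[MAX_NODES], remap[MAX_NODES];
    int head = 0, tail = 0, i, c;

    queue[tail++] = numnodes - 1;          /* root is the last node */
    while (head < tail) {
        int old = queue[head];
        remap[old] = head++;               /* new index = visit order */
        for (c = 0; c < 2; c++) {
            unsigned short ch = nodes[old].children[c];
            if (!(ch & NF_SUBSECTOR))      /* queue internal nodes only */
                queue[tail++] = ch;
        }
    }
    for (i = 0; i < numnodes; i++) {       /* emit remapped copies */
        node_t n = nodes[i];
        for (c = 0; c < 2; c++)
            if (!(n.children[c] & NF_SUBSECTOR))
                n.children[c] = (unsigned short)remap[n.children[c]];
        out[remap[i]] = n;
    }
    return 0;                              /* new root index is 0 */
}
```

An executable doing this at level-load time would also have to start its traversals at index 0 instead of numnodes - 1, of course.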


By the way, demos can probably desync if nodes are rebuilt, but it is a lot less likely. Rebuilding can make some sectors larger or smaller, and that can affect collision detection in rare edge cases. Blockmap compression is a complex field, with some types leading to desyncs and others not. ZokumBSP has the -bi flag to produce blockmaps that should not desync compared to id's, but take up less memory. It can optimize better than ZenNode's blockmap code and be 100% compatible with demos.

There's an oversight in the ZenNode code: it will only find a duplicate block list if it is in the same row or in the row above; it is unable to find one that is more than one row back. This typically happens when you have a continuous long wall and then a building near the wall. ZenNode does have special code for handling empty blocks, and will compress all of those into one list.

|       1 new block
|       identical to the previous block
|       identical to the previous block
| +--+  2 new blocks
| |  |  2 new blocks
| +--+  2 new blocks
|       1 new block, but this one is identical to an earlier block, the first one.
|       identical to previous block

I know it's a crappy drawing, but hopefully it makes clear why this goes wrong in ZenNode: the look-back loop only looks back as many entries as one row can consist of, which means the current row and usually part of the row above.
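The fix is conceptually just an unbounded look-back when sharing blockmap offsets. A toy version (single ints stand in for whole block line-lists; nothing like ZenNode's real data structures):

```c
#define MAX_BLOCKS 256

/* Full-history offset sharing: for each block, scan ALL earlier blocks for
 * an identical list, instead of only about one row back as ZenNode does.
 * Returns the number of unique lists; offset_of[i] gets the shared id. */
int dedup_offsets(const int *blocks, int n, int *offset_of)
{
    int unique = 0, i, j;
    for (i = 0; i < n && i < MAX_BLOCKS; i++) {
        offset_of[i] = -1;
        for (j = 0; j < i; j++)          /* unlimited look-back */
            if (blocks[j] == blocks[i]) {
                offset_of[i] = offset_of[j];
                break;
            }
        if (offset_of[i] < 0)
            offset_of[i] = unique++;     /* genuinely new list */
    }
    return unique;
}
```

In the drawing above, the wall-only blocks after the building would match the very first block again, which a one-row look-back can never see. A real implementation would hash the lists instead of the O(n²) scan.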



 

On 5/3/2023 at 12:01 AM, viti95 said:

Another possibility is to use different detail levels for scene rendering and sprite rendering, I guess this is easier to modify. Maybe this way is possible to see clearly enemies, while using lower detail for scenery.

 

Doom 2 on GBA, as well as the GBADoom (a port of PrBoom to GBA) source port, renders textures at low detail, but renders the actual walls themselves at high detail, so you still get high detail wall edges and such, if that makes sense. Would anything like that be feasible?

3 hours ago, Individualised said:

Doom 2 on GBA, as well as the GBADoom (a port of PrBoom to GBA) source port, renders textures at low detail, but renders the actual walls themselves at high detail, so you still get high detail wall edges and such, if that makes sense. Would anything like that be feasible?

Sounds like they downsampled the textures. That will mostly help with memory usage, but not so much with frame rate, I think. I think we have lower-hanging fruit in optimizing data structures and hand-tuning a few of the core functions for the CPU in use. If you tune it to be fast on a 386DX-40, it might be slower on a 486 than it could be, since the 486 has instructions that take fewer cycles in some cases, so the best optimizations can differ.

Michael Abrash has a lot about this kind of stuff in his Black Book.

Personally, I still think better algorithms and optimized data structures are what we need to improve. You want to hand-tune the parts that are slow, and if you change some of the algorithms to do things differently, you might end up with different hand-tuning needs.

9 minutes ago, zokum said:

Sounds like they downsampled the textures. That will mostly help on memory usage, but not so much on frame rate, I think. ...

I got it the wrong way round: textures are rendered at full resolution, but the level geometry is rendered at half resolution, so edges are jagged while the wall textures look fine. It's hard to explain without showing footage (ignore the mipmapping that Doom 2 GBA does additionally):

Some people might find it creates an ugly effect, though; I'm not too big a fan of it. Also keep in mind that Doom 2 GBA is a complete rewrite (Doom 1 on GBA uses the Jaguar Doom engine, while this uses a custom engine called Southpaw), so I'm not sure this approach would even be easy to implement in the Doom engine.

15 hours ago, Frenkel said:

 

There isn't much assembly code in Doom, but sure do they make a difference.

R_DrawColumn, R_DrawColumnLow, R_DrawSpan, R_DrawSpanLow

FixedMul, FixedDiv2

I_ReadJoystick

 

 

Yeah, it makes a great difference. I hope to create even faster R_Draw* routines with the hints Ken has given.

 

BTW, I think DJDoom is a great opportunity to create a great port from scratch. I mean, it builds fine on GCC and is 100% vanilla, so we can port it to multiple architectures, using ASM code for specific parts like rendering or sound. For example, I think the 68k Macintosh platform could benefit a lot (using the Retro68 SDK), since the Macintosh uses a linear VRAM layout and supports 256 colors (no chunky-to-planar penalty compared to the Amiga or Atari ST).


Yesterday I was reading through the whole thread and found out that video modes like 13h and VESA are even faster. I opened up PCem and tried 13h, and framerates skyrocketed: according to my tests this port is 1.67 times faster than vanilla on high detail with no sound, and that's 13h only. I couldn't run the VESA 2.0 modes, as I don't know what video card and driver I need. I downloaded and installed SciTech Display Doctor, and even though it reports VBE 3 support on Win 95 with an S3 ViRGE, it seems to make no difference for fdoom. 486 computers like the DX-33 and up really start to shine and deliver a groovy experience on high detail without degradation!


For FastDoom VESA modes you need a video card that supports VESA 2.0 and the 320x200 (8-bit, 256-color) video mode. There are two executables:

  • FDOOMVBR.EXE: This mode uses a backbuffer for rendering and doesn't require LFB support.
  • FDOOMVBD.EXE: This mode renders directly on the video card and supports triple buffering. It requires LFB support.

I've tested those executables successfully with these video cards, using SciTech Display Doctor 5.3a:

- S3 805, Trio32, Trio64, Virge

- Trident 9440

- Cirrus Logic GD-5426, GD-5428

- ATI Mach32, Mach64

- Cyrix MediaGX

- Rendition Verite 2200

 

Non-working video cards:

- Matrox video cards. I guess it's due to a different VRAM layout; the screen is corrupted on those cards.


1 hour ago, viti95 said:

For FastDoom Vesa modes you need a video card that supports VESA 2.0 and the 320x200 (8 bit, 256 color) video mode. ... I've tested those executables ... using Scitech Display Doctor 5.3a ...

Thank you! I had SDD 6.53 installed, and apparently it only had VBE 3.0 mode support. Installing just the DOS version of 5.3a helped to run fdoomvbr.

Btw, if that feature of split detail modes for geometry and sprites were implemented, could it be sped up even further by dividing sprites into far-field (high detail) and near-field (potato)?


On 5/5/2023 at 12:09 PM, Darkcrafter07 said:

Yesterday I was reading through the whole thread and found out that video modes like 13h and vba are even faster. ...

Mode X/Y allows you to use double buffering for tear-free rendering. Mode 13h doesn't have this and would therefore look worse on that era's hardware. As for which is faster, that depends on your hardware implementation; PCem is probably not the best representation of actual VGA hardware.

As I understood it, writing to graphics memory (VRAM) was quite slow, so you would want to avoid any overdraw. Overdraw in Doom typically happens when you draw the scene first and then add the monster sprites and the normal textures on two-sided lines, back to front. If the overdraw were significant, it would be faster to draw to system memory and then copy it over to VRAM without any overdraw; I think this is what Doom did.

It makes sense to code it to run faster on complex scenes instead of optimizing for simple scenes. You'd rather have a frame rate of 15-25 with an average of 20 than a frame rate of 10-30 with an average of 21. The timedemo might run faster on one demo, but when playing, you don't really care about the times it had 35 fps and cycles to spare; you care about the slowdowns that happened with many monsters on screen.


Any communication over the ISA bus is extremely slow, so the less you use it, the better. As you said @zokum, Doom writes directly to VRAM in all cases: first it renders the scene (no overdraw here), and then the objects (which include some walls) in back-to-front order. This is where the major overdraw happens, and there is no possible fix, simply because everything is stored in VRAM.


The other big problem is rendering fuzzy objects (spectres). Because everything is rendered in VRAM, creating the fuzz effect requires reading back from VRAM, which is also extremely slow (1 read + 1 write per pixel). That's why I decided to add the Sega Saturn rendering method; it's much faster since only half of the pixels are drawn to VRAM.
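To make the read-vs-write difference concrete, here is a toy model on a plain byte buffer (an assumption-laden sketch, not FastDoom's actual routines: `darken` stands in for the colormap lookup, and the alternating offset crudely mimics the fuzz table):

```c
#define SCREENWIDTH 320

/* Toy model of the two spectre effects on a linear byte buffer. */
static unsigned char darken(unsigned char c) { return c / 2; }

/* Vanilla-style fuzz: one READ + one write per pixel. On VRAM over ISA,
 * the read is what hurts. The alternating +/- one line stands in for
 * Doom's fuzzoffset[] table. */
void fuzz_classic(unsigned char *buf, int x, int y0, int y1)
{
    int y, sign = 1;
    for (y = y0; y <= y1; y++) {
        buf[y * SCREENWIDTH + x] =
            darken(buf[(y + sign) * SCREENWIDTH + x]);
        sign = -sign;
    }
}

/* Saturn-style: write only every other pixel, never read from the buffer.
 * Writing 0 here corresponds to a black-pixel variant; the real Saturn
 * render writes the sprite color instead. */
void fuzz_saturn(unsigned char *buf, int x, int y0, int y1)
{
    int y;
    for (y = y0; y <= y1; y += 2)
        buf[y * SCREENWIDTH + x] = 0;
}
```

Counting accesses per N-pixel column: the classic fuzz does N reads + N writes, while the Saturn method does roughly N/2 writes and no reads at all.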


Ideally you want to draw front to back, and back to front only for everything that is transparent. Modern GPUs handle this with a z-buffer, but implementing that in software is probably too slow. The transparency effect greatly complicates clever algorithmic optimizations. I tried hard to see if I could come up with a good algorithm to do some sort of front-to-back drawing, but I can't really see a way to optimize it.

There might be ways to draw the normal textures on two-sided walls, if they have no transparency, faster and more accurately than the current algorithm does.

Drawing spectres with an outline only would be an interesting way to render them, as long as you ensure the outline edges are at least 1 pixel and skip the rest of the sprite at any distance. Draw the outline in an alternating grey-shade pattern, and change the greys used based on the sector light level. If the grey colors also cycled in a way similar to the original effect, it could look quite interesting.

I have wondered why invisibility wasn't just a bit flag they could set on all monsters, like ambush and skill levels. On the other hand, that would mean pwads would be full of invisible cyberdemons.

10 minutes ago, zokum said:

I have wondered why invisible wasn't just a bit flag they could set on all monsters, like ambush and skill levels. On the other hand that would mean that pwads would be full of invisible cyber demons. 

Shooting invisible rockets, of course! Should be doable with DeHackEd, though.


I tried modifying the Sega Saturn renderer to output black pixels instead of the original sprite colors, and I think it looks kinda good and more faithful to the original render, while being even faster than the Sega Saturn render.

Just replace every `*dest = dc_colormap[dc_source[(frac >> FRACBITS)]];` with `*dest = 0;` in the `R_DrawFuzzColumnSaturn()` function:

 

[screenshot: spectres rendered with black pixels]

 

I think this transparent mode could be added and named something like "Black Saturn" or "Simple". Also, maybe a dark gray would be better than pure black.

