Redneckerz

FastDoom: DOS Vanilla Doom optimized for 386/486 processors


2 hours ago, viti95 said:

While looking into how the omgifol library works, I discovered that the Zennode tool can optimize the BSP tree and the nodes ( https://github.com/Doom-Utils/zennode ). The speedup is quite noticeable; here is a video of the current dev build (which also comes with optimizations in flat visplane rendering) running the optimized IWAD.

You may also want to look into ZokumBSP, which is a slot-in replacement for Zennode, fixes some bugs it has, and actually has some really advanced features related to blockmap compression.

 

It might even wind up faster. :)

Edited by Dark Pulse

13 hours ago, viti95 said:

EDIT:

 

@Dark Pulse I've tested ZokumBSP, but the WAD it generates causes the demos to desync. I haven't tried all the options, but I'm pretty sure the desync comes from changes in the blockmap.

Yes, a rebuilt blockmap will by definition cause a level to desync in terms of demos.

 

What it would be more useful for, though, is testing if ZokumBSP-optimized maps will run faster compared to the originals under FastDoom. I mean, if you're trying to squeeze every drop of performance, you can certainly do worse than reducing the blockmap complexity.

 

Just don't try to do the crunching on an actual 386/486, or you're gonna have a bad time. :)

On 8/5/2020 at 8:28 PM, Optimus said:

HQ=High quality

LQ=Low quality (F5)

PQ=Potato quality

UQ=Potato + flatsurfaces + flatskies + flatshadows +low sound + mono

 

Demo 1: E1M5

Doom: HQ (6.9fps) LQ (11fps)

FastDoom: HQ (7.8fps) LQ (12.9fps) PQ (15.4fps) UQ (19.0fps) 

Sorry to be that guy, but I think potato mode should be much faster considering what it does and how bad it looks (pretty much the same as low detail with half the view size). I've also tested it on a 486 SX-25 VM, and the results are consistent with what you get on real hardware. FastDoom is a bit faster than regular Doom, BTW, as expected. I compared low detail (17 fps) against potato (20 fps) at full view size, and in low detail at half size I get 35 fps. So you can get better performance by reducing the view a bit instead, and the game will look better. I think there must be something wrong with potato mode.

 My VM is here: https://www.vogons.org/viewtopic.php?f=9&t=75976

Edited by drfrag


For those who haven't been following the thread wherein Randall Linden has provided portions of the SNES Doom source and toolchain, a potential snag has been hit on the legal side. On the off chance that a solution can't be worked out, what's the likelihood that a graphics and/or gameplay option for this sourceport could be developed to fill that niche? Just asking...um, for a friend.

13 minutes ago, Job said:

For those who haven't been following the thread wherein Randall Linden has provided portions of the SNES Doom source and toolchain, a potential snag has been hit on the legal side. On the off chance that a solution can't be worked out, what's the likelihood that a graphics and/or gameplay option for this sourceport could be developed to fill that niche? Just asking...um, for a friend.

All Bethesda did was deny him the right to distribute the assets along with the source. RIPDOOM will still create the appropriate assets when fed a PC IWAD.


True, although for those out there who maybe want a more simplified approach, this sourceport seems like it does most of the SNES Doom stuff right out of the box. Granted, the shading needs tweaking, among a few other SNES Doom-specific traits, but it's mostly there. 

On 8/9/2020 at 6:27 PM, drfrag said:

Sorry to be that guy, but I think potato mode should be much faster considering what it does and how bad it looks (pretty much the same as low detail with half the view size). I've also tested it on a 486 SX-25 VM, and the results are consistent with what you get on real hardware. FastDoom is a bit faster than regular Doom, BTW, as expected. I compared low detail (17 fps) against potato (20 fps) at full view size, and in low detail at half size I get 35 fps. So you can get better performance by reducing the view a bit instead, and the game will look better. I think there must be something wrong with potato mode.

 My VM is here: https://forum.zdoom.org/viewtopic.php?f=97&t=69534

 

You're right, potato mode can be faster. The current implementation uses the original LQ mode, but it renders only the odd columns, 4 pixels wide, and skips rendering the even columns. This means the processing done for the even columns is useless and wastes processor cycles. On top of that, the implementation isn't as optimized as the original renderers, because the visplane rendering is done in C instead of a hand-optimized ASM version.


Well, technically you should render only the first of every four columns and write it to all four at once, but that's easier said than done. I don't know how to do it; I hope someone can help. @leileilol could do it, as she added low detail modes to Quake.

In this line I'm just guessing that 127 is half of 255.

*dest = dc_colormap[dc_source[(frac >> FRACBITS) & 127]];
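On the `& 127` guess: in this expression the constant works as a wrap mask rather than as half of 255. `frac` is a 16.16 fixed-point texture coordinate, `frac >> FRACBITS` is its integer part, and masking with 127 keeps that row index inside a 128-texel-tall column. A minimal sketch, where `texel_row` is a hypothetical helper name that does not exist in the Doom source:

```c
#include <stdint.h>

#define FRACBITS 16              /* 16.16 fixed point, as in the Doom source */
typedef int32_t fixed_t;

/* Hypothetical helper (not in the Doom source): the texel row that the
 * inner loop reads for a given fixed-point texture coordinate.
 * The mask makes the coordinate wrap every 128 rows. */
static int texel_row(fixed_t frac)
{
    return (frac >> FRACBITS) & 127;
}
```

So row 5 stays 5, while row 130 wraps around to 2. A mask like this only works because 128 is a power of two.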

BTW, my VM thread has been reported as warez, but it clearly isn't (I used Caldera DR-DOS), and the emulator itself isn't warez either. In any case, I didn't include the emulator or the ROMs in my file, and the disk image can be used with other emulators.


New release!

 

https://github.com/viti95/FastDoom/releases/tag/0.5

 

Changelog:

  • Faster Potato mode. It is now a 100% native potato mode and doesn't use the LQ mode to draw the screen.
  • Fixed Sega Saturn shadows in potato mode.
  • Added "-init" parameter, which forces the user to press a key to start the game. This makes it easier to see the initialization process.
  • It's now possible to use "-nomonsters" without "-warp level". This change was made to test the AI performance impact.
  • Fixed AWE32 music.
  • More rendering and main code optimizations.
  • Brought back gamma correction; lots of users with CRT monitors needed this functionality (F11 key).
  • Remapped autorun to the F12 key.

Thanks @drfrag for pushing me to make this mode be as fast as it should be :D

2 hours ago, viti95 said:

New release!

 

https://github.com/viti95/FastDoom/releases/tag/0.5

 

Changelog:

  • Faster Potato mode. It is now a 100% native potato mode and doesn't use the LQ mode to draw the screen.
  • Fixed Sega Saturn shadows in potato mode.
  • Added "-init" parameter, which forces the user to press a key to start the game. This makes it easier to see the initialization process.
  • It's now possible to use "-nomonsters" without "-warp level". This change was made to test the AI performance impact.
  • Fixed AWE32 music.
  • More rendering and main code optimizations.
  • Brought back gamma correction; lots of users with CRT monitors needed this functionality (F11 key).
  • Remapped autorun to the F12 key.

Thanks @drfrag for pushing me to make this mode be as fast as it should be :D

Looks epic man, congratulations. I have changed the release date and added the new features to the FastDoom page on DoomWiki. I think you may have missed the initial post about it :P

 

Thank you for making FastDoom.


Yeah, it's a bit faster now: performance in my "warez" VM goes from 20 to 23 with the status bar on. Turning low detail on went from 10 to 17, though.

I expected a massive performance boost; now I get the same performance in low detail with two screenblocks less.

I must confess I don't understand your C code very well; it's very different from the original low detail mode. I don't see how you can access the four VGA planes without two extra pointers. Specifically, low detail used dest and dest2, so I expected you would need dest3 and dest4. I have no idea about the ASM version. I also don't know what the performance difference between the two versions is.

    *dest2 = *dest = dc_colormap[dc_source[(frac>>FRACBITS)&127]];

 


The column drawing routine is exactly the same for HQ, LQ and Potato; the only thing that changes is how many planes you write at the same time (1, 2 or 4). That's done by writing the plane mask you need to port 0x3C5. The ASM versions of these routines work exactly the same as the C versions (but are much more optimized). VGA Mode X is really hard to understand, and it's very different from the normal chunky modes. I recommend this video by Root42 for an easy introduction to how it works, and Michael Abrash's Graphics Programming Black Book Special Edition for greater detail on what can be done with it (https://www.phatcode.net/res/224/files/html/ch47/47-01.html#Heading1).

 

void R_DrawColumn (void)
{
    int      count;
    byte*    dest;
    fixed_t  frac;
    fixed_t  fracstep;

    count = dc_yh - dc_yl;

    // Zero length, column does not exceed a pixel.
    if (count < 0)
        return;

#ifdef RANGECHECK
    if ((unsigned)dc_x >= SCREENWIDTH
        || dc_yl < 0
        || dc_yh >= SCREENHEIGHT)
        I_Error ("R_DrawColumn: %i to %i at %i", dc_yl, dc_yh, dc_x);
#endif

    // Select the single plane that holds this screen column.
    outp (SC_INDEX+1, 1<<(dc_x&3));

    // 80 bytes per row in each plane, 4 screen columns per byte offset.
    dest = destview + dc_yl*80 + (dc_x>>2);

    // Determine scaling,
    //  which is the only mapping to be done.
    fracstep = dc_iscale;
    frac = dc_texturemid + (dc_yl-centery)*fracstep;

    // Inner loop that does the actual texture mapping,
    //  e.g. a DDA-like scaling.
    // This is as fast as it gets.
    do
    {
        // Re-map color indices from wall texture column
        //  using a lighting/special effects LUT.
        *dest = dc_colormap[dc_source[(frac>>FRACBITS)&127]];

        dest += SCREENWIDTH/4;
        frac += fracstep;

    } while (count--);
}



void R_DrawColumnLow (void)
{
    int      count;
    byte*    dest;
    fixed_t  frac;
    fixed_t  fracstep;

    count = dc_yh - dc_yl;

    // Zero length.
    if (count < 0)
        return;

#ifdef RANGECHECK
    if ((unsigned)dc_x >= SCREENWIDTH
        || dc_yl < 0
        || dc_yh >= SCREENHEIGHT)
    {
        I_Error ("R_DrawColumn: %i to %i at %i", dc_yl, dc_yh, dc_x);
    }
    //	dccount++;
#endif

    // Select two adjacent planes at once: planes 2+3 (mask 12) for odd
    //  columns, planes 0+1 (mask 3) for even columns.
    if (dc_x & 1)
        outp (SC_INDEX+1, 12);
    else
        outp (SC_INDEX+1, 3);

    // 80 bytes per row in each plane, 2 low detail columns per byte offset.
    dest = destview + dc_yl*80 + (dc_x>>1);

    fracstep = dc_iscale;
    frac = dc_texturemid + (dc_yl-centery)*fracstep;

    do
    {
        // One write lands in both selected planes, doubling the pixel.
        *dest = dc_colormap[dc_source[(frac>>FRACBITS)&127]];

        dest += SCREENWIDTH/4;
        frac += fracstep;

    } while (count--);
}
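
To make the "1, 2 or 4 planes" point concrete, here is a small sketch of the mask values involved. The HQ and LQ values are taken from the two routines above; the potato value (15, all four planes enabled) is an assumption that follows from "4 planes at the same time", not something copied from the FastDoom source. The function names are hypothetical, and the code only computes the byte that would be written to port 0x3C5; it does not touch any hardware.

```c
#include <stdint.h>

/* Plane mask values for port 0x3C5: bit n set means that writes
 * reach plane n. All three function names are hypothetical. */

/* High detail: one plane per screen column, as in R_DrawColumn. */
static uint8_t mask_high(int dc_x)
{
    return (uint8_t)(1 << (dc_x & 3));
}

/* Low detail: two adjacent planes per column, giving 3 or 12,
 * the same values as the if/else in R_DrawColumnLow. */
static uint8_t mask_low(int dc_x)
{
    return (uint8_t)(3 << ((dc_x & 1) << 1));
}

/* Potato (assumption): all four planes enabled, so a single byte
 * write produces four identical pixels. */
static uint8_t mask_potato(void)
{
    return 15;
}
```

This would also explain why no dest3 or dest4 pointers are needed: the number of pixels produced per write is decided by the mask, not by how many pointers the loop carries.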

 


Thanks, interesting. The code you've posted is not from FastDoom, though; you removed the C versions. As for ASM, I did a little programming for microcontrollers many years ago and I don't remember any of it. I was checking the PCDoom code and I thought it already used Mode X, doesn't it? Mode 13h could be faster on a 486; at least it was on my 486 with an ISA card (I tried Boom, but it had no low detail).

I'm still surprised that potato is not much faster with pixel quadrupling.


Yes, both PCDoom and FastDoom use the original ASM versions of the column and span renderers, and all rendering is exactly the same as in Vanilla Doom (Mode X). This is faster on 386 and low-end 486 machines, as it writes directly to the video card and avoids writing to RAM first. On those PCs the RAM bandwidth is low, so it was faster to write directly to video memory. Once fast 486 processors arrived with faster and larger caches, it became faster to render the screen into a RAM buffer and then copy it to video memory, as there is lots of overdraw (sprites and transparent walls).

 

I've made some tests with my 386DX, here are the results:

 

Game: Ultimate Doom 1.9

Demo: Demo1

Screen size: Full with stats bar

Audio: Sound and music enabled (8 channel sound + opl3)

 

Processor: AMD 386DX 33MHz

RAM: 32 MB

L2 Cache: 256 KB

Video Card: Cirrus Logic GD-5422 ISA (1 MB)

Sound Card: Aztech AZT2320 ISA (Sound Blaster Pro compatible)

 

Ultimate Doom 1.9:

HQ: 5.750 fps

LQ: 9.341 fps

 

FastDoom 0.6 DEV:

HQ: 6.660 fps

LQ: 11.309 fps

PQ: 15.436 fps

 

HQ + FlatSurfaces: 7.395 fps

LQ + FlatSurfaces: 12.531 fps

PQ + FlatSurfaces: 19.725 fps

 

HQ + FlatSurfaces + Saturn + Near + Mono + LowSound: 7.736 fps

LQ + FlatSurfaces + Saturn + Near + Mono + LowSound: 13.170 fps

PQ + FlatSurfaces + Saturn + Near + Mono + LowSound: 20.824 fps
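
For reading these figures as ratios, a tiny sketch follows (the helper names are hypothetical). With the numbers above, FastDoom HQ over Ultimate Doom HQ is 6.660 / 5.750, roughly 1.16x; FastDoom PQ over FastDoom HQ is 15.436 / 6.660, roughly 2.32x; and PQ + FlatSurfaces over HQ + FlatSurfaces is 19.725 / 7.395, roughly 2.67x. In frame-time terms, going from 5.750 fps to 20.824 fps takes a frame from roughly 174 ms down to roughly 48 ms.

```c
/* Speedup of one measured frame rate over another (both in fps). */
static double speedup(double fps, double baseline_fps)
{
    return fps / baseline_fps;
}

/* Time spent on one frame, in milliseconds, at a given frame rate. */
static double frame_ms(double fps)
{
    return 1000.0 / fps;
}
```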

 

13 hours ago, viti95 said:

Yes, both PCDoom and FastDoom use the original ASM versions of the column and span renderers, and all rendering is exactly the same as in Vanilla Doom (Mode X). This is faster on 386 and low-end 486 machines, as it writes directly to the video card and avoids writing to RAM first. On those PCs the RAM bandwidth is low, so it was faster to write directly to video memory. Once fast 486 processors arrived with faster and larger caches, it became faster to render the screen into a RAM buffer and then copy it to video memory, as there is lots of overdraw (sprites and transparent walls).

 

Just to add something about old-school PCs, which you probably know already, but the wording could be misleading here.

While reading from and writing to RAM was slow, doing the same with VRAM was far slower. So when you say "ram bandwidth is slow, so it was faster to write directly to video memory", it might sound odd. The reality is that if you have a software buffer where you render everything first and then copy from there to VRAM, that's one write to RAM, one read from RAM and one write to VRAM. If you render directly to VRAM, it's only the slow write to VRAM. As long as the renderer avoids overdraw (which the Doom engine does pretty well, at least for the 3D environment; I'm not sure about the monsters or transparent textures that come in a second layer), skipping the intermediate software buffer ends up being a good solution.

Now, what I'm curious about is this: because of Mode X, you end up having to write individual bytes to VRAM instead of doing a 32-bit copy from the buffer to VRAM (4 pixels at once instead of 4 separate writes). This could have an impact on a slow ISA card, but I'm not sure. I was always curious whether the Mode X method bottlenecked Doom instead of helping, writing one byte to VRAM at a time instead of 2 or 4. (I think 16-bit ISA cards would bottleneck you anyway: while direct 16-bit writes would be twice as fast, 32-bit writes are almost the same as 16-bit ones in my asm rep stos/movs tests here.) I need to fork the code at some point and do some experiments: render everything into a RAM backbuffer in a certain way, switch Mode X planes only 4 times per frame, and use rep movsd.
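
The trade-off described above can be put into a toy cost model. Everything here is hypothetical: the struct, the function names and any cost values fed into it are illustration only, not measurements of real hardware. `overdraw` is the average number of times each screen pixel gets drawn in a frame (1.0 means no overdraw).

```c
/* Relative per-pixel access costs. Hypothetical units, not measurements. */
typedef struct {
    double ram_write;
    double ram_read;
    double vram_write;
} mem_cost;

/* Rendering straight to VRAM: every drawn pixel, including the ones
 * that get overdrawn later, costs one VRAM write. */
static double cost_direct(mem_cost c, double overdraw)
{
    return overdraw * c.vram_write;
}

/* Rendering to a RAM buffer first: overdraw only costs RAM writes, then
 * each screen pixel is read from RAM once and written to VRAM once. */
static double cost_buffered(mem_cost c, double overdraw)
{
    return overdraw * c.ram_write + c.ram_read + c.vram_write;
}
```

With made-up costs of 1 for RAM accesses and 4 for a VRAM write, direct rendering wins without overdraw (4 vs 6), and the buffer wins once every pixel is drawn twice on average (8 vs 7). The model ignores caches, bus width and the Mode X plane switching, which are exactly the things the experiments mentioned above would have to measure.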

On 8/15/2020 at 12:58 PM, drfrag said:

I'm still surprised that potato is not much faster with pixel quadrupling.

 

I might fork the code and have a look at it later (not before September, I'm on holiday). I think I have some ideas from my experience with OptiDoom for optimizing some things in this one.

In OptiDoom, when I originally halved the columns, I would skip the odd columns in a loop, render only the even ones, and scale them up horizontally to double width.

BUT there was another per-column loop, a second pass that went through the columns again to help construct the edges for the visplane calculations that are needed later. That was also CPU intensive. If I skipped half the columns there, there would be visplane spills (even when I tried to patch them in various ways).

So effectively, rendering half the columns never gave me twice the frame rate; more things happen per column, in loops in different places.

My current solution, though, is to simply render the Doom window at half width. If the actual resolution is 320*200, I set the window parameters (when you resize) to 160*200, so that everything (column rendering, the second column prepass for visplanes, the visplane calcs) does half the loops on X. Then I scale the result back horizontally to double width to reach 320*200 (I used the 3DO hardware to scale the backbuffer).

This actually gave somewhat better results than the previous method, but still, in some slower player views the result was not double the speed. There are far more things happening under Doom's hood, so cutting the columns to 1/4 will not necessarily give you 4x the speed.
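
For reference, the horizontal scale-back step of the half-width approach could look like this when done in software. This is only a sketch under assumptions: OptiDoom uses the 3DO hardware for the scaling, as stated above, and the function name and the linear 8-bit buffer layout here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand a half-width 8-bit image to full width by writing every source
 * pixel twice. dst must have room for 2 * src_w * h bytes. */
static void double_columns(const uint8_t *src, uint8_t *dst,
                           size_t src_w, size_t h)
{
    for (size_t y = 0; y < h; y++) {
        const uint8_t *s = src + y * src_w;      /* source row      */
        uint8_t *d = dst + y * src_w * 2;        /* destination row */
        for (size_t x = 0; x < src_w; x++) {
            d[2 * x]     = s[x];
            d[2 * x + 1] = s[x];
        }
    }
}
```

Called with a 160*200 source, this fills a 320*200 destination. The renderer itself only ever sees the 160-pixel-wide view, which is why all the per-column passes get halved and not just the column drawing.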


viti95: Those results with your 386 look clearly better now (BTW, your Cirrus card was very fast).

Optimus: I know that linear mode is faster on a fast 486 (I had a DX4-75 with a Trident 8900C, and that one had a 25 MHz bus). Boom was faster than vanilla, but not enough for high detail; low detail ran fine at full view. Heretic ran pretty well on a 486DX-33, but so did Doom. Main memory access was very slow on a 386.

I expected it to be faster, though not 4x, because low detail with a couple of screenblocks less gives the same performance as potato here; his results with the 386 are better now, though. Could you try the new version on your 386? The C version of the low detail routines also used Mode X in PCDoom, right?


I've done some testing after adding an fps counter to RUDE, and on big detailed maps low detail is still much faster. The code is the same as in PC Doom, but it's clearly a linear buffer. In thegiven.wad (a technological terror for the Doom engine) I get 11 fps vs 6 at hot spots in wide mode (I think it's 400x240). It also happens in Crispy, even in lowres (but the fps counter doesn't work for widescreen). It's also faster in Planisphere 2. Maybe I could try to add the 3x1 mode if I have the time.

This is on my E2-9010. Now I think the minimum requirement is a Pentium II with 128 MB, since it requires XP. WinDoom ran on a 486 and was apparently very fast.


@Optimus I think Mode X bottlenecked fast 486 (>75 MHz) and Pentium processors, and some video cards don't like Mode X at all (the Rendition Vérité and the Cyrix MediaGX GPU are painfully slow in that mode). I'll try to add VGA 13h chunky modes with a RAM buffer to FastDoom, as that should be faster on those processors. The main bottleneck in Doom is still the processor and RAM bandwidth, as all the BSP processing takes a lot of time. I'll also post benchmark results with different optimizations applied to the BSP trees in the WADs; you can gain a good number of fps with those optimizations (maybe that's also useful for OptiDoom and other ports such as GBADoom). Regarding 32-bit writes, they are only faster if you have a VLB or PCI video card; the ISA bus is pretty limited in every way. Even John Carmack stated that:

 

Quote

Our artwork was done in 8x8 blocks, or "ebes" as Tom called them. A 32-bit
loop would have needed more code to handle widths that were an odd number
of ebes. It might have been a speedup, but the bus and video card would have
to have handled full 32-bit writes, and I don’t recall that being common back
then. Many VL-Bus cards were still the same basic chipset used on the ISA
cards, and often still 16 bit, which meant that a 32-bit write just took 2x as long.
— John Carmack

 

@drfrag Potato mode is about 2.3 times faster on my 386DX, and if I use flat visplanes that increases to 2.6. As I said before, the main bottleneck is the BSP processing. Both PCDoom and PCDoom-v2 use Mode X, but PCDoom uses the C functions to render columns and visplanes, while PCDoom-v2 uses the ASM functions. Those ASM functions are really hard to understand (and very different from the C versions), as they rely on self-modifying code, and most decompilers just go nuts trying to decompile them (tested with IDA Pro and Ghidra). They are also way faster than the C versions. Any fast processor should be faster with a video buffer in RAM and chunky video modes.

 

BTW, the slowest processors that run Windows XP are the first Pentiums and the Cyrix MediaGX processors. I have PCs with those processors, so I can test how well RUDE runs on them, if you wish.


I was referring to PCDoom v2; I guess you can compile it without ASM, and the C code is still there. Unfortunately, I don't understand how the masks in R_DrawSpan work. If you're sure about your ASM code then it must be right, but I'm surprised that decreasing the screen size a bit makes it faster than potato.

That would be cool. I know about the Pentium thing, but a Pentium with 128 MB of RAM was rare. The XP requirement comes from SDL2, and I think it's SP3; the engine uses 32 MB of RAM by default, but I think it will run with -mb 16 or maybe -mb 8. XP itself will barely boot with 64 MB. It should be fast, though, if you disable the smooth scaling option (it's an intermediate buffer in Chocolate Doom). I could upload a devbuild with the fps counter (-showfps). Older 2.5 versions ran on 98 and would probably run on 95 using an older compiler and a makefile, but they were buggy. I could backport things if there was enough interest. BTW, did Choco run on 95 at some point?

I know the latest ZDoom was almost playable on a Pentium (there was a 95 version of ZDoom LE). I added some silly low detail modes such as 3x2 (I wanted to add 3x3 but I ended up with that, lol), 4x4... But they were slow; ZDoom copied the pixels instead of reusing the pointers. Doubling vertically is useless anyway.

Actually, I still own some dying old computers, but installing software on them was a PITA.

Edit: Thanks, here is the test build: https://gofile.io/d/1MxDr9

Edited by drfrag


Quick'n'dirty Ultimate Doom 1.9 WAD optimization benchmark. Tested on my 386DX-33, full screen with status bar and audio disabled. ZokumBSP is using the default optimization mode with blockmap optimization disabled (as it breaks the demos). The multi-tree mode should be even better, but I haven't had enough time to generate the optimized WAD and test it.

 

FastDoom + original wad demo1 + potato + nomelt + flatsurfaces + noaudio : 25.167 fps

FastDoom + zennode demo1 + potato + nomelt + flatsurfaces + noaudio : 26.250 fps

FastDoom + zokumBSP demo1 + potato + nomelt + flatsurfaces + noaudio : 26.135 fps

 

Ultimate DOOM 1.9 + original wad demo1 + noaudio : 10.862 fps

Ultimate DOOM 1.9 + zennode demo1 + noaudio : 11.328 fps

Ultimate DOOM 1.9 + zokumBSP demo1 + noaudio : 11.288 fps

 

 


I think a hefty boost would be obtained by rewriting the BSP algo to be non-recursive. It's hard, but if done right it would save a lot of heavy function call overheads and improve cache coherency/locality a lot.


You're right, @Maes: rewriting the recursive R_RenderBSPNode as a non-recursive function should increase performance, but I don't really know how much we will gain. GBADoom implements a non-recursive R_RenderBSPNode, and those commits weren't reverted (so it must perform better than the recursive one). I will try to implement these changes in FastDoom :D


And now I remember that SDL2 requires at least OpenGL 1.3 (I think); if you use the software driver (GDI), it would be too slow. GDI was very slow on XP. So it's not worth testing it on a Pentium.

Edit: it seems that it's 1.1, but accelerated.

Edited by drfrag


New video!

 

This shows a new visplane renderer, which is much faster but only allows flat colors to be drawn. The old one has been modified to re-enable diminished lighting, so everyone is happy now :D. Every visplane is stored as a set of columns, which are later transformed into spans, merged, and then drawn as spans. The new renderer just renders the columns with flat colors, avoiding the process of transforming the columns into spans and then drawing them with the span functions. Drawing a span in Mode X is much slower than drawing a column, so this is as fast as it can get.
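
A rough sketch of the "just render the columns" idea, under stated assumptions: `flat_plane` is a simplified stand-in for Doom's visplane_t (only the per-column top/bottom extents are kept), the target is a plain linear 320x200 buffer rather than Mode X memory, and the function name is hypothetical. This is not the FastDoom implementation.

```c
#include <stdint.h>

#define SCREENWIDTH  320
#define SCREENHEIGHT 200

/* Simplified stand-in for visplane_t: per-column vertical extents.
 * A column with top > bottom is treated as empty. */
typedef struct {
    int     minx, maxx;               /* inclusive column range        */
    uint8_t top[SCREENWIDTH];         /* first covered row per column  */
    uint8_t bottom[SCREENWIDTH];      /* last covered row per column   */
} flat_plane;

/* Fill every column of the plane with a single palette index,
 * with no conversion to horizontal spans and no texture lookup. */
static void draw_flat_plane(uint8_t *screen, const flat_plane *pl,
                            uint8_t color)
{
    for (int x = pl->minx; x <= pl->maxx; x++) {
        if (pl->top[x] > pl->bottom[x])
            continue;                 /* nothing visible in this column */
        for (int y = pl->top[x]; y <= pl->bottom[x]; y++)
            screen[y * SCREENWIDTH + x] = color;
    }
}
```

In Mode X the inner loop is attractive for the reason given above: a vertical run stays within one plane, so the plane mask is set once per column instead of being juggled along a horizontal span.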

 

I have also adapted the recursive BSP functions to be non-recursive, the same way it's done in GBADoom. The speedup wasn't large, but it was enough to be worth including.
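
For readers wondering what a non-recursive BSP walk looks like, here is a minimal sketch with an explicit stack. It is not the GBADoom or FastDoom code: `node_t` is cut down to the partition line and the two children, the bounding-box visibility check that the real R_RenderBSPNode performs on the far side is left out, the visited subsectors are written to an output array instead of being rendered, and `MAX_BSP_DEPTH`, `point_on_side` and `bsp_front_to_back` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NF_SUBSECTOR  0x8000   /* child flag: this child is a subsector */
#define MAX_BSP_DEPTH 128      /* assumption: enough for the map's tree */

typedef struct {
    int16_t  x, y, dx, dy;     /* partition line */
    uint16_t children[2];      /* 0 = front, 1 = back */
} node_t;

/* 0 if the point is on the front side of the partition line, 1 if back. */
static int point_on_side(const node_t *n, int x, int y)
{
    long left  = (long)n->dy * (x - n->x);
    long right = (long)(y - n->y) * n->dx;
    return right < left ? 0 : 1;
}

/* Visit subsectors nearest-first without recursion. Writes at most
 * max_out subsector numbers to out and returns how many were visited. */
static int bsp_front_to_back(const node_t *nodes, int root, int vx, int vy,
                             int *out, int max_out)
{
    int stack[MAX_BSP_DEPTH];
    int sp = 0, count = 0, bspnum = root;

    for (;;) {
        /* Walk down the near side, remembering each far child. */
        while (!(bspnum & NF_SUBSECTOR)) {
            const node_t *n = &nodes[bspnum];
            int side = point_on_side(n, vx, vy);
            assert(sp < MAX_BSP_DEPTH);
            stack[sp++] = n->children[side ^ 1];
            bspnum = n->children[side];
        }
        if (count < max_out)
            out[count] = bspnum & ~NF_SUBSECTOR;  /* render it here */
        count++;
        if (sp == 0)
            return count;
        bspnum = stack[--sp];                     /* most recent far side */
    }
}
```

The visiting order is the same as with the recursive version (near child first, then the far child of the deepest pending node); what goes away is a function call and return per node.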

 

 

Edited by viti95


New release! FastDoom 0.6

 

  • Added option "-nomelt" to skip the melting transition while loading a new level, for 386 PCs where this effect is really slow.
  • Fixed a bug that made the framerate choppy on fast processors.
  • Lots of internal optimizations (non-recursive R_RenderBSPNode, many ideas from GBADoom). This version is faster even without lowering image quality.
  • Flat surfaces can now have diminished lighting enabled or disabled. The option "-flatsurfaces" has it enabled, and the new option "-flattersurfaces" has it disabled. The option "-flattersurfaces" is MUCH faster than the previous one.

Grab it here: https://github.com/viti95/FastDoom/releases/tag/0.6

Edited by viti95


I am an epic PC gamer with a Pentium II 350 MHz, 128 MB of RAM, and an ATI 3D RAGE 2MB PCI so I can run DOOM at max settings easily ;)

 

Seriously though, this is really cool! :D  I might try this on my Compaq Presario 425 which couldn't even get past the title screen even though it meets the minimum system requirements.


I just tried a heavier vanilla WAD, Suspend in Dusk.

Sorry for the flickering CRT :P

 

Results:

Classic Doom HD: 4.8FPS, LD: 8FPS

Fast Doom HD: 5.3FPS, LD:9.6FPS

Fast Doom LD with untextured but shaded flats: 10.3FPS

Fast Doom Potato, untextured/unshaded flats: 17FPS

 

 

 


Interesting WAD, I'll try some benchmarks on both my 386 and my 486. I think your 386 is limited by the low amount of RAM; 4 MB will cause the game to stutter while loading and unloading assets from the HDD. That's why enabling or disabling SmartDrive changes the performance. SmartDrive uses part of the RAM to cache data from the HDD, which leaves less memory available for Doom (which itself caches all the data). It also uses more CPU cycles while loading from the HDD.

