Which are the ports with most optimized renderers

Optimus · October 13, 2012

I am asking this because there are specific maps with big cities and lot's of foliage (like 5till L1 Complex from this week reviews) that really crawl the engine, no matter if it's in Zdoom or GZdoom. There are also things that crawl unexpectedly in GZdoom but in Zdoom software renderer they are fine or just better.

So, I am curious, which ports could be more optimized for such situations? Also, NUTS.wad. Is there a port that plays this better? Or such ones that can handle more sprites better (I guess it's also the AI and clipping/visibility calculations and projectiles of sprites that make this bad also). This could be useful with 10x or 100x mutators I've recently tried too. In big levels with lot's of monsters this is an overkill.

Not that those WADs are very important and people would like to play, but it would be useful to have a port that can handle such stressing situations better. Although, as I am guessing most of the ports haven't ever drastically changed the original algorithms of Doom (beside accelerated ports that have to render with polygons instead of columns), I am curious if there was ever a thought on the community to create a port that handles these situations with a different or more optimized algorithm. Or which port do you think has slightly better performance with too many sprites?

Maes · October 13, 2012

I can only speak about mine, Mocha Doom, and maybe throw in some comparisons with prBoom+. In general, it has 50-70% of the speed of prBoom+, with the 70% end actually being achieved in NUTS.WAD type WADs: I suspect that is because of the lazy garbage collection mechanism giving it a slight advantage over thousands of fine-grained repeated mallod/dealloc operations.

In a map like NUTS, my tests showed that the map computations alone are big chunk of the computations. Some figures off the top of my head, for the mochadoom nuts.wad timedemo, on my Dell Inspiron 1720 with an Intel T8300:

Regular rendering: 60 fps
Parallel rendering, 2 threads (only walls and flats): 70 fps
Parallel rendering, 2 threads (walls and flats, then sprites, with per-sprite split-screen strategy): 75 fps
Disabling sprite rendering altogether (but leaving everything else on): 90 fps

So sprite rendering, while it's certainly a significant part of the time, even if you speed it up infinitely it wouldn't go beyond 90 fps, and walls/flats aren't really hard to render in nuts.wad, so the vast majority of the time is spent in the game logic in such maps. That one, you can only optimize so much before you start breaking things or forsaking compatibility (in particular if someone ever tries parallel logic....)

Edit: some more data, all also without rendering sprites, and all with status bar visible (and palette effects active):

Rendering just flats (BSP active, but no colfuncs execute): 97.6 fps
Rendering just walls: 113 fps.
No rendering of walls or flats, but BSP still active: 125 fps.
No rendering of walls or flats (BSP disabled): 160 fps
Not calling R_RenderPlayerView at all: 162 fps
Not calling Display() at all (NOTHING renders, and not even framebuffers update): 290.32 fps.

If anything, this highlights a few interesting bottlenecks: the BSP traverser itself is quite time consuming, and alone it makes up a big chunk of the time even in a map like nuts.wad which should be relatively simple (but perhaps it forces way too many subsector iterations). Rendering walls is faster than planes in this map, but that's to be expected: you see a lot of floor and sky, and very little walls, if ever.

However, even without rendering anything at all (meaning that not even the windowing subsystem is ever updated), the execution speed doesn't exactly jump into the 1000s of fps: even if by some magical means you had a zero-latency insta-renderer that took 1 CPU cycle to do everything and no need to move over some millions of pixels to the graphics card every second, you'd still have a sizeable latency just due to the sheer gameplay processing.

If you work out the figures I quoted you can come up with approximate percentages of how CPU time is used (assuming that 100% goes to Doom only).

wesleyjohnson · October 15, 2012

By whose standards, on which machine, for which renderer, which class of problem wads ???? Going to get at least 10 different answers.

I have spent considerable time optimizing DoomLegacy for software renderer on 486 class machines, but it could use more.

For the sprites, gave DoomLegacy a user control to tune the number of sprites displayed to the machine capability. You can turn it up until your machine starts to bog down. It displays the closest sprites to the player, with the rest getting random chances to display (they flicker).
I tested it on the "long day" wad. Frame rate of 1/2 frame/sec with all sprites displayed became almost normal when set to 512 (software render on 1.6Ghz Athlon).
Even a 300Mhz machine should work using 128 sprite setting. You would just get a LOT more flicker on those farther away sprites.

I favor such gradual degradation of display, to achieve some kind of play on difficult wads. I try to make it possible to play the difficult wads on even the slower older machines.

Optimus · October 18, 2012

So, in Legacy with sprite number limitation it goes like a charm? But does the same happen with many projectiles and monster AI for all the monsters, even if few are displayed? I would guess the AI and projectiles is what bogs this down, as Maes says (though it's strange that you still get such high numbers, or was it another Nuts WAD I tried?) I am going to download Legacy (I have a very old version) and try this again.

Then again, I should just try to compile my favorite port and try to see what the profiler says.

Maes · October 18, 2012

Nuts.wad has about 10000 monsters. Imagine each of them being updated and moving 35 times per second, and the vast majority of them also throw projectiles, so in reality there are way more than 10K map objects alive at any moment. Even if you kill several of them, the corpses are still in view and EVEN THEY get a "thinking turn" due to the way the engine works.

When the projectiles start flying....it's a memory allocation fuckfest of mobj_t's getting constantly created & destroyed. I once tried to monitor the rate at which this happens and it's in the order of several TENS of THOUSANDS per second, especially if you awake all monsters. Certainly a challenge for any memory allocation and garbage collection mechanism.

Optimus · October 18, 2012

I just did some tests with several ports. The tests were like, what's the speed when you look at the monsters before they are awaken, what if you look at the wall on the other side, or what happens when you suddenly awake them and either look at them or look at the wall. It's interesting how different each of the ports behave.

Legacy: With the sprite limiter or not, when I look at the wall and even if I awaken the monsters with a shot, it's still very smooth. But when I look towards the sprites it's quite slow with high limit. This means, projectile movement and AI is still fast. I see map and IDDT and lot's of projectiles are coming on me, regardles there is no sprite rendering.

Zdoom: Surprisingly if I wake up the monsters, even not looking at them, the frame rate drops dramatically to... 1FPS! No matter what, something is too slow with monster AI and stuff. Does it relate to the fact that Zdoom code has support for additional scripting (even though it's not used here). I am quite surprised by this result. However, sprite rendering (before waking up the monsters) is faster than what I have seen in Legacy.

GZdoom: Another strange thing. Before even waking monsters, whether I am looking towards them or away to the wall, frame rate is quite low, like 20fps (lower than Zdoom even when looking). Only when I went very close to the wall and only facing a single bonus item (the grey 200 health/armor sphere) suddenly it went up to 200. Strange. Maybe it's because of how they are handled in OpenGL, possibly regardless they are in your view or not, the quads are sent to the graphics card and let it zbuffer them automatically? And maybe even not with the most optimized way for the GPU to do this.

I've heard recomendations about PrBoom+ in one very slow city WAD (with lot's of sprites as layers of grass). I tried it and it was quite good so far. Either in software mode or opengl, I must have got 40-60 FPS (or more if I disable vsync) at the beginning, then I wake up the monsters and finally even with facing it I have some kind of semi-playable speed. 10-20fps at worse and then when some of them are killed I have a good speed. Facing them, waking them from above, or on the ground, finally it's playable. Then I guess I will be able to play that WAD with the city and foliage much better.

Interesting, I hadn't tried much PrBoom before and this version has many interesting options. Also smooth (above 35) movement I was used to in Zdoom/GZdoom ports. Maybe it's time for a port change? :)

Graf Zahl · October 18, 2012

Optimus said:
GZdoom: Another strange thing. Before even waking monsters, whether I am looking towards them or away to the wall, frame rate is quite low, like 20fps (lower than Zdoom even when looking). Only when I went very close to the wall and only facing a single bonus item (the grey 200 health/armor sphere) suddenly it went up to 200. Strange. Maybe it's because of how they are handled in OpenGL, possibly regardless they are in your view or not, the quads are sent to the graphics card and let it zbuffer them automatically? And maybe even not with the most optimized way for the GPU to do this.

What kind of graphics card do you have?

The big problem with sprites is that due to some translucency at the edges they need to be pre-sorted and drawn in order.

I never had any problems with this on NVidia but I have been told that ATI/AMD doesn't like it when it has to draw lots of stuff with constant state changes.

kb1 · October 19, 2012

Maes said:
When the projectiles start flying....it's a memory allocation fuckfest of mobj_t's getting constantly created & destroyed.

In my home port, I pre-allocate a large block for mobj_t's, and another block of the same number of elements for "delete pointers". When deleting a mobj_t, I simply mark the mobj_t "deleted", add an entry to the deleted pointer list, and increment the "deleted pointer" pointer. When adding a new mobj_t, I grab the last deleted pointer, use that slot, then decrement the "deleted pointer" pointer, if you know what I mean. It's extremely fast. The only realloc occurs when the block needs to grow. Don't know if this scheme is very popular - I haven't seen it before, but I'm quite proud of it :)

wesleyjohnson · October 19, 2012

Flying missiles last for at least a few seconds, so I expect the missile turnover per frame to be a reasonable load. Missiles are sprites, so drawing them (about 32x32 color pixel operations per missile) should take the most time. That is why pruning sprites as early as possible is effective. If the port tries to get the video card to do the pruning, then the upload time to the video card is wasted for any pruned sprite.
To do early pruning, the port must do a coordinate transform itself.

Malloc is known to be slow because it must search for a reasonable match among the free heap memory, so avoiding it where there is high turnover is highly recommended. Malloc can be sped up by only allocating large chunks so there are few small free memory blocks on the heap. This can be done by never releasing little memory allocations.

Saving unused objects on a free list is a popular technique that is used in several places in DoomLegacy and other ports too.
We just use one of the existing linked list pointers in the object for a linked free list. They are not being used for anything when the object is unused. That way there is no extra list to use memory. I have seen programs cast objects to a another record type to get the free list link. Anything taken off the list is ptr-cast to its correct type.
Get a unused object from the free-object-ptr, and then set the free-object-ptr from the next free object link. Very fast.
Releasing an object to the list is similar and fast.

If you use Zone memory with purge tags, then the objects can be allocated in blocks (say 128 at a time), and all the objects in the allocation put on the free list. To release the memory, use the Zone purge operation which finds all memory allocations that match the tag (whereever they are) and set the free-object-ptr to null.
Can do the same without Zone memory if you never try to release any allocated memory in the free list.

Quasar · October 19, 2012

kb1 said:
In my home port, I pre-allocate a large block for mobj_t's, and another block of the same number of elements for "delete pointers". When deleting a mobj_t, I simply mark the mobj_t "deleted", add an entry to the deleted pointer list, and increment the "deleted pointer" pointer. When adding a new mobj_t, I grab the last deleted pointer, use that slot, then decrement the "deleted pointer" pointer, if you know what I mean. It's extremely fast. The only realloc occurs when the block needs to grow. Don't know if this scheme is very popular - I haven't seen it before, but I'm quite proud of it :)

This is called a free list and is a common method for amortizing the cost of malloc/free - question: do you have any conclusive profilings that show an overall consistent improvement from this?

I ask because I have thought about doing this for Eternity many times in the past but decided not to do it because I felt it would be difficult to strike a balance with memory usage - I don't think it's really appropriate to let the free list grow to an indefinite size, for example.

Graf Zahl · October 19, 2012

I don't think that freelists are worth bothering unless you are talking about very small blocks that get very frequently allocated and deallocated. A good example would be nodes in a clipping list that gets created during rendering or, like in many Boom-compatible port the sector nodes list where actors get linked in.

Using it for a structure as large as a mobj_t in Doom is certainly not a good idea. First, they do not get allocated frequently enough - not even in Nuts.wad and second, they are hogging a lot of memory which would make normal heap use less efficient.

Maes · October 19, 2012

kb1 said:
In my home port, I pre-allocate a large block for mobj_t's, and another block of the same number of elements for "delete pointers". When deleting a mobj_t, I simply mark the mobj_t "deleted", add an entry to the deleted pointer list, and increment the "deleted pointer" pointer.

I once experimented with a pool of reusable mobj_t's (when requiring a new one, it was requested from the pool, which created or reused as necessary. When unlinked, it was returned to the pool) but there were several issues with "bringing them back to life": they needed to be scrubbed clean so their state was exactly that of a newly allocated one, otherwise you see very weird effects like fireballs hanging in mid air, and soon everything stopped moving, resulting in a surreal landscape of frozen monsters and missiles. Or I might have overlooked something.

Optimus · October 21, 2012

Graf Zahl said:
What kind of graphics card do you have?

The big problem with sprites is that due to some translucency at the edges they need to be pre-sorted and drawn in order.

I never had any problems with this on NVidia but I have been told that ATI/AMD doesn't like it when it has to draw lots of stuff with constant state changes.

Sorry for late answer, I have NVidia GTS 450.

p.s. Haven't tried to code something like that in the past, switching states frequently for lot's of sprite quads or something. Yes, last time I've heard switching states frequently would be a problem but didn't know it's worse on ATI. I am curious now what kind of sorting did original doom on the sprites too.

Maes · October 21, 2012

OpenGL (and hardware in general) renderers add another potential source of bottlenecks: depending on the overhead associated with certain types of geometry pushing, sometimes they can perform worse than pure software ones, as it had been shown in this thead.

Simply put, if the driver was not particularly optimized for sending e.g. 100k of billboard sprite images to it every second (which is an unusual usage by modern gaming standards), it's not inconceivable that it might perform worse than expected by common sense ("But it's a HARDWARE renderer, how can it ever be BAD???")

Also, don't forget that Doom maps are regarded as highly anomalous (by typical 3D geometry standards, that is) and require special processing and conversions to become displayable on standard hardware, yet another potential bottlenecking source.

However, all of the above are very circumstancial and can change easily between port, port versions, drivers and hardware.

kb1 · October 22, 2012

Quasar said:
This is called a free list and is a common method for amortizing the cost of malloc/free - question: do you have any conclusive profilings that show an overall consistent improvement from this?

Well, no real timings, but definitely a respectable improvement under contrived situations. Here's what I did:

I started with Boom's ZMalloc stuff (ugh), and added a on-screen "heap cycles" counter, that increments each time the heap pointer rolls over. Then I reduced heap allocation until thrashing occurred each frame while standing in the "Map02" room of dv.wad, map05, being bombarded by missiles. When thrashing occurred, the Boom ZMalloc code *really* sucked, in that it tossed and reloaded the same textures each frame, even when I moved out of line-of-sight (freeing missile mobj_t's), and moved into smaller fights. In other words, the ZMalloc code will typically purge the heap on each cycle unnecessarily!

I then modified the ZMalloc code to try a little harder to find free space. Even though it had more searching to do, it appeared to work better, faster, and cycle much less, given the same conditions.

Unsatisfied, I did the aforementioned approach - the "freelist". Man, what a difference that made - the heap stopped cycling altogether. See, the real benefit is that you're taking the thing that would wreak havoc on the heap (mobj_t allocations), and getting it out of the heap altogether. That lets the ZMalloc code do the only thing it's really designed well to do - load "static" pre-level objects, like textures, sounds, flats, etc.

Quasar said:
I ask because I have thought about doing this for Eternity many times in the past but decided not to do it because I felt it would be difficult to strike a balance with memory usage - I don't think it's really appropriate to let the free list grow to an indefinite size, for example.

Yeah, it's tricky. I add 4096 mobj_t slots dynamically as needed, which is probably way too much, but these counts are, of course, easily changable. I think the idea was that 4096 let me fit all but the slaughter maps into one malloc. The point is that, if you need more, you need more. It's not like you can justify failing on creating a missile :)

Graf Zahl said:
Using it for a structure as large as a mobj_t in Doom is certainly not a good idea. First, they do not get allocated frequently enough - not even in Nuts.wad and second, they are hogging a lot of memory which would make normal heap use less efficient.

Come on, man - heh. One allocation can push out a texture required by the renderer, requiring countless cycles to pull back in. Hogging a lot of memory? It uses as much memory as required to store a mobj_t (and some pointers). Yeah, it's a block of them at a time - that's the whole point.

Here's a excerpt of my C source - pretty straightforward stuff. (I should probably get it released, so it can be picked apart and ridiculed :)

Call P_FastMobj_Init() from G_InitNew.
Call mobj = P_AllocFastMobj(); from P_SpawnMobj() and P_UnarchiveThinkers().
Call P_DeallocFastMobj((mobj_t *)thinker); in P_RemoveThinkerDelayed.

I'd be interested to know what you think, how it performs, etc.

//
// [kb] Fast mobj allocation/removal that does not use Z_Malloc. Mobjs
//  are always the same size in memory, so this code allocates a block of
//  memory dedicated to mobjs. When exhausted, a new block is allocated.
//  A separate FIFO list of deleted mobjs is also maintained, and is
//  used to handle requests for new mobjs. Unitl the delete list
//  becomes empty, all new allocations will overwrite previously
//  removed mobjs.
//

#define FASTMOBJ_SPAWNBLOCKSIZE		4096
#define FASTMOBJ_DELETEBLOCKSIZE	4096


mobj_t			**fastmobjblocks = NULL;
int			fastmobjblock_index = 0;
int			fastmobjblock_alloc = 0;
int			fastmobj_index = 0;

mobj_t			**fastmobjdeletes = NULL;
int			fastmobjdelete_index = 0;
int			fastmobjdelete_alloc = 0;

void P_FastMobj_Init (void)
{
	int i;

	if (fastmobjblock_alloc)
	{
		for (i = 0; i <= fastmobjblock_index; i++)
		{
			free(fastmobjblocks[i]);
			fastmobjblocks[i] = NULL;
		}

		free(fastmobjblocks);
		fastmobjblocks = NULL;
		fastmobjblock_index = 0;
		fastmobjblock_alloc = 0;
		fastmobj_index = 0;
	}

	if (fastmobjdelete_alloc)
	{
		free(fastmobjdeletes);
		fastmobjdeletes = NULL;
		fastmobjdelete_alloc = 0;
		fastmobjdelete_index = 0;
	}

	P_ClearTmThing();
}


mobj_t *P_AllocFastMobj (void)
{
	if (fastmobjdelete_index)
	{
		fastmobjdelete_index--;
		return (mobj_t *) fastmobjdeletes[fastmobjdelete_index];
	}

	if (fastmobj_index >= FASTMOBJ_SPAWNBLOCKSIZE)
	{
		fastmobjblock_index++;
		fastmobj_index = 0;
	}

	if (fastmobjblock_index >= fastmobjblock_alloc)
	{
		fastmobjblock_alloc++;
		fastmobjblocks = realloc(fastmobjblocks, fastmobjblock_alloc * sizeof(mobj_t *));
		fastmobjblocks[fastmobjblock_index] = malloc(sizeof(mobj_t) * FASTMOBJ_SPAWNBLOCKSIZE);
	}

	return (mobj_t *)&(fastmobjblocks[fastmobjblock_index][fastmobj_index++]);
}

void P_DeallocFastMobj(mobj_t *mobj)
{
	if (fastmobjdelete_index >= fastmobjdelete_alloc)
	{
		fastmobjdelete_alloc += FASTMOBJ_DELETEBLOCKSIZE;
		fastmobjdeletes = realloc(fastmobjdeletes, fastmobjdelete_alloc * sizeof(mobj_t *));
	}
	
	fastmobjdeletes[fastmobjdelete_index++] = mobj;
}

Graf Zahl · October 22, 2012

kb1 said:
I then modified the ZMalloc code to try a little harder to find free space. Even though it had more searching to do, it appeared to work better, faster, and cycle much less, given the same conditions.

Unsatisfied, I did the aforementioned approach - the "freelist". Man, what a difference that made - the heap stopped cycling altogether. See, the real benefit is that you're taking the thing that would wreak havoc on the heap (mobj_t allocations), and getting it out of the heap altogether. That lets the ZMalloc code do the only thing it's really designed well to do - load "static" pre-level objects, like textures, sounds, flats, etc.

Yes, that happens if you combine

- a heap that's too small
- an inefficient memory allocation system
- bad use of purgeable data

The obvious solution should be to make the heap sufficiently large to hold all the data needed.
Don't forget that Doom's memory allocation system was made to make the game run on 4MB of RAM. Using the same mechanics for extra-huge WADs with very dynamic memory allocation is sure to cause slowdowns.

The zone heap is ok if talking about Vanilla maps but any 'modern' engine should not work off a small preallocated heap and do its own memory management in there. I can't imagine any situation where this would perform better than just using the system/runtime heap instead.

If you want to test performance gains by a free list, it is pointless to test against an extreme fringe case. Test against a more common situation where objects are constantly being allocated but not in amounts that the heap starts thrashing.

kb1 · October 23, 2012

Graf Zahl said:
...stuff...

Graf, I think you missed the point of my whole post.

Yes, it's obvious that if you can allocate an unlimited amount of memory, you at least solve the "out of memory" problem.

The point is that, combining media data with mobj_t's in the same heap is demonstrably problematic, without a brilliant allocator. Hell, the Boom allocator purges textures on each heap cycle, whether there's free space or not. Doom's does too. They just loop around happily stomping whatever's there, assuming that it'll just get reloaded, no big deal - heh. The low-memory scenario I consturcted was simply to make the heap activity as hectic as possible, and, to then observe the effects of the freelist. I design empirically. That bothers some people - so be it. I like to see the "real-world" effect of my changes, not just theorize about them.

What I posted was code that separates the spawn-happy mobj_t's out of the rather static texture data. This was obviously needed by my port, and I suspect that any port not using sophisticated, dedicated memory management will find some performance boost with this technique.

There's still other things that screw up the cache - sounds are a big one. They're big, and you can need a ton of them in a single tic. Especially if you've got a bunch of custom monsters each with custom sounds. Sprite images fit in this category as well. Let's face it - thrashing sucks. But mobj_t's are a special case: They are all the same size. That makes them especially insteresting as an optimization target.

I find this technique was an excellent choice for me, because:

1. mobj_t's are all the same size.
2. Alloc/dealloc is basically free as far a performance goes - no garbage collection time, no search for free blocks, no splits/merges, no compromise.
3. You may get some cache benefits from them all being located contiguously.
4. It's easy to implement.
5. It visually improved performance for me.
6. You can continue to run a crappy general heap (ZMalloc), because it will behave, since it's not being asked to do as much.

For me, it wasn't so much about handling low-memory conditions, as it was upping performance. In fact, my code doesn't really protect against low memory situations. But, if you are particularly sadistic, you *could* allocate the freelist using ZMalloc... (hides)

Please experiment (if possible with your code). I love to hear the success stories, or if you didn't notice a difference. Remember, the effects of bad memory allocation take time to materialize - the cache needs to be thoroughly populated and "scattered" before real detrimental problems appear.

Maes · October 23, 2012

All this talk made me want to bring back my pool of reusable mobj_t's...I hope I can work out what caused the "freezing" this time.

wesleyjohnson said:
Flying missiles last for at least a few seconds, so I expect the missile turnover per frame to be a reasonable load.

On nuts.wad it's not ;-) 10000 monsters, 5000 of which active in one room, 80% of them shooting multiple times at the player (or at each other...there are arachnotrons too!), plus there are mobj_t's for the projectile explosions, too. This easily results in a turnover of several hundred if not thousands of mobj_t's per frame, without even counting the time to actually draw them. Even just keeping the automap open is enough to trash memory.

With Mocha Doom I actually had the option to set the heap memory to a deliberate low size (e.g. 80 MB max for everything), which caused purging from the very first frames, and about 100K purges in the first minute alone. With default heap values (way larger than 80 MB), it took several a level reload to witness any mobj_t destructors being called (that's how I measured turnover), so there was the benefit of lazy deallocation as well, contrasted with forced fine-grained dealloc.

Graf Zahl · October 23, 2012

kb1 said:
The point is that, combining media data with mobj_t's in the same heap is demonstrably problematic, without a brilliant allocator. Hell, the Boom allocator purges textures on each heap cycle, whether there's free space or not. Doom's does too. They just loop around happily stomping whatever's there, assuming that it'll just get reloaded, no big deal - heh. The low-memory scenario I consturcted was simply to make the heap activity as hectic as possible, and, to then observe the effects of the freelist. I design empirically. That bothers some people - so be it. I like to see the "real-world" effect of my changes, not just theorize about them.

No, I know precisely what you did. But if you want to get useful data you cannot use results from fringe cases as a benchmark to test against.

If you want to know how much real improvement you get from your freelist you got to use a testing pattern that's at least realistic.

And deliberately running on low memory conditions to worsen performance is not realistic.
Also, testing a freelist against a known bad implementation doesn't show the superiority of the freelist either, it merely demonstrates that the bad implementation performs badly.
Of course the freelist performs better, there's not even need to demonstrate it - but that's not due to the greatness of the freelist but due to the badness of Doom's zone heap.

To get the real value of the freelist you still need to test it against a sane heap system with sufficient memory.

The ultimate question should be: How much improvement would switching to malloc/free bring, compared to the freelist?

And more importantly, how much are the improvements in relation to overall execution time?

And sorry, but my tests in this matter done years ago showed that:

- designing a heap that doesn't thrash is the most important thing of all.
- If you want to run maps that need lots of memory, use a system that's up to it. Even Doom.exe, despite being able to run with 4MB of heap didn't do too well. It needed 8 MB to avoid the constant disc accesses. It just needed to support 4MB to lower system requirements on the packaging.
- with the above taken into account, allocation and deallocation is more or less insignificant in relation to the other things being done. With lots of mobj_t's around the main bottleneck is mostly the movement/collision detection code.

The big problem with a freelist is that ultimately it requires more memory than proper handling of deallocation of these objects. You end up with (peak of mobj_t) + (peak of everything else) instead of peak of (mobj_t + everything else). (which makes testing against low memory conditions that induce thrashing even more meaningless.)

Maes · October 23, 2012

Out of curiosity, I fucked around a bit with my old mobj_t pool code. To my surprise, it did work without problems, only that it didn't give any measurable speed advantage (on the opposite, it caused some minor degradation, due to being forced to do a class checking before submitting a "dead" object back to the queue).

I was able to offset that by allocating more than one new object when the pool was empty, but still, it maybe gained a frame or so in speed. No difference in heap memory use either (around 72 MB for nuts.wad in either case) which means that Java is still comfortable enough to defer garbage collection. It might make more of a difference in a constrained memory scenario, where the GC will be forced to work overtime.

kb1 · October 24, 2012

Graf Zahl said:
No, I know precisely what you did.

Yeah, I know - I provided source, and a description.

Graf Zahl said:
But if you want to get useful data you cannot use results from fringe cases as a benchmark to test against.

If you want to know how much real improvement you get from your freelist you got to use a testing pattern that's at least realistic.

And deliberately running on low memory conditions to worsen performance is not realistic.

What's not realistic? It really happened. What suggests to you that I did not test hundreds of patterns? I already explained that the low memory condition caused increased heap cycling, which is precisely the condition that is alleviated when using the freelist. The low memory condition makes things apparent faster - 1 minute vs. 20. Why on earth do you think I sought a solution in the first place? Just cause I wanted to fuck around with things? It's because I noticed an effect that was undesirable - in normal playing conditions.

Graf Zahl said:
Also, testing a freelist against a known bad implementation doesn't show the superiority of the freelist either, it merely demonstrates that the bad implementation performs badly.

But...it doesn't perform badly...anymore - heh.

Graf Zahl said:
Of course the freelist performs better, there's not even need to demonstrate it - but that's not due to the greatness of the freelist but due to the badness of Doom's zone heap.

Better performance is not worth demonstrating? Not worth using? Yeah, a freelist is pretty great if it improves performance - what exactly are you trying to accomplish with this post?

Graf Zahl said:
To get the real value of the freelist you still need to test it against a sane heap system with sufficient memory.

No, you don't. With the freelist, the heap system is effectively taken out of the equation.

Graf Zahl said:
The ultimate question should be: How much improvement would switching to malloc/free bring, compared to the freelist?

The only lasting programming problems I encounter is when I'm using someone else's packaged code. There is no question that the freelist is extremely fast, as you noted. And, at absolute theoretical best, malloc will be comparably fast. And, at worst, who knows? Malloc is a "black box", whose performance characteristics can only be measured empirically. Sure, OSes/language compilers do their best, and have some pretty smart people writing them, but the best performance is gained when you don't make the call at all.

Graf Zahl said:
And more importantly, how much are the improvements in relation to overall execution time?

I'm damn sure I answered this more than once...something along the lines of "significant, noticable visual improvement."

Graf Zahl said:
And sorry, but my tests in this matter done years ago showed that:

- designing a heap that doesn't thrash is the most important thing of all.
- If you want to run maps that need lots of memory, use a system that's up to it. Even Doom.exe, despite being able to run with 4MB of heap didn't do too well. It needed 8 MB to avoid the constant disc accesses. It just needed to support 4MB to lower system requirements on the packaging.
- with the above taken into account, allocation and deallocation is more or less insignificant in relation to the other things being done. With lots of mobj_t's around the main bottleneck is mostly the movement/collision detection code.

Don't be sorry for me - I don't need it. Be sorry that you're:
- posting misinformation
- potentially keeping programmers from using a simple tool that provides real, demonstrable benefits.
- generally copping a crappy attitude

Be sorry that you're movement/collision detection code is too slow.

Want to know why Doom w/4MB needed constant disk access? It's because it uses the heap for objects that need spawning during the playing of a map (mobj_t's for example). As noted previously, the Doom Zone code doesn't do a good job of finding free memory, it just pushes the first thing it finds out of cache. Therefore, the problem cases are new allocations. The freelist greatly reduces the need for new allocations in main heap, therefore less heap rollovers, therefore less cache purging.

Graf Zahl said:
The big problem with a freelist is that ultimately it requires more memory than proper handling of deallocation of these objects. You end up with (peak of mobj_t) + (peak of everything else) instead of peak of (mobj_t + everything else). (which makes testing against low memory conditions that induce thrashing even more meaningless.)

How do you know what malloc is doing behind the scenes? Can you prove that it's not maintaining a pool of memory in anticipation of expansion. If it's not, it's a crappy implementation, cause it will be slow.
When a block has not yet been allocated, malloc always allocates a whole new page, even when you need 1 byte. That's exactly what I'm doing with the freelist (more than one page though). When you're looking for performance, you're typically not running on a 4Mb machine anyway. You say it requires more memory, how so? If you need 10,000 mobj_t's, you need 10,000 mobj_t's. So, yes, my implementation would have allocated space for 4096*3=12288 mobj_t slots. This space gets purged on each level load. I suppose you could have a function that checks the max number needed during that past 10 seconds, and use that to reduce the pool dynamically, but, what have you accomplished?

If you want to be a programmer respected for his knowledge and wisdom, try not to be disrespectful yourself, try to think about the actual goal, and learn to admit when you're wrong.

I offered up my algorithm on good faith, for good intentions, with the simple hope that it may be useful, and with the enthusiasm that comes from believing that I could assist. It took me considerable time to instrument and understand the problem, devise a solution, develop that solution, and test it.

With that much at stake, you had better believe that I will defend against such an idiotic, misguided attack. Having done so, I will not further respond to any more of your attacks on this matter.

To everyone else, sorry about the textfest - give the algorithm (or something similar) a shot - I think you'll be pleased :)

kb1 · October 24, 2012

Maes said:
Out of curiosity, I fucked around a bit with my old mobj_t pool code. To my surprise, it did work without problems, only that it didn't give any measurable speed advantage (on the opposite, it caused some minor degradation, due to being forced to do a class checking before submitting a "dead" object back to the queue).

I was able to offset that by allocating more than one new object when the pool was empty, but still, it maybe gained a frame or so in speed. No difference in heap memory use either (around 72 MB for nuts.wad in either case) which means that Java is still comfortable enough to defer garbage collection. It might make more of a difference in a constrained memory scenario, where the GC will be forced to work overtime.

Sorry, double-post. Yeah, I forgot to suggest that the mobj_t needs to be zeroed first, I thought that might have been what caused you problems the first time.

The code I submitted is very careful to never need to do any iterations on alloc and dealloc, and that made a big difference for me because of the aforementioned Boom (and Doom) ZMalloc quirks.
But, you're using Java, which I can't say I've had much experience with, but, yes, a good allocator can do wonders. Yes, you have to allocate an array of new blanks, otherwise you lose the benefit (until you've allocated the absolute max needed for the level, of course) Then again, many allocators recognize that you're malloc-ing a lot of the same size objects, and it will create a special pool just for objects of the same (or nearly the same) size. If your allocator is good, the freelist will only help sporatically, like when garbage collection would otherwise occur.

But, with your own implementation, you can control GC (by not needing it nearly as often), which is, in a way, what the freelist also does. You *do* have to try it both ways for a while, on a lot of maps and scenarios - as I'm sure you know, it can be tough to test conclusively (but it was obvious for me rather soon.)

DaniJ · October 24, 2012

I just want to point out a contextual difference between where Graf is coming from and you (kb1) and Maes are - in ZDoom (and it's derivatives) a mobj_t is not a single fixed size. In his port a mobj_t instance is a dynamic composition, inheriting from (potentially) several different base classes.

So in his project's case, its a little more complicated to implement a freelist because first they need to know the maximum amount of storage required by the most complex mobj_t composition.

However, as ZDoom implements a garbage collector there is little need to use the Zone for it's caching features. Thus they might as well just use malloc/new and in the process avoid the performance hit of allocating from the Zone. Thats not to say there is no potential benefit from using a freelist instead of malloc/new but I don't know enough about the details to say one way or the other.

As you noted kb1, DOOM's Zone is definitely not a general purpose heap allocator, it's too simple for that. It is basically a fixed-size memory block cache.

Graf Zahl · October 24, 2012

kb1 said:
Yeah, I know - I provided source, and a description.
What's not realistic? It really happened. What suggests to you that I did not test hundreds of patterns? I already explained that the low memory condition caused increased heap cycling, which is precisely the condition that is alleviated when using the freelist. The low memory condition makes things apparent faster - 1 minute vs. 20.
Why on earth do you think I sought a solution in the first place? Just cause I wanted to fuck around with things? It's because I noticed an effect that was undesirable - in normal playing conditions.

So...

As I pointed out, you have a higher peak because you got 2 distinctive blocks of memory, which both need to be able to keep the maximum required amount of RAM.
If you run out of memory in such situations, keeping a freelist would inevitably mean you allocate more memory - because you need to keep all the freed objects' memory allocated.
So, why not increase the heap's size to begin with?

Seriously, the first thing I'd do with Doom these days is dump the zone heap. It has very little advantages these days, the major one being the PU_LEVEL tag, but that can be taken care with a thin layer over malloc, too - but without the problem that you are restricted to a small section of your system's resources.

kb1 said:
Malloc is a "black box", whose performance characteristics can only be measured empirically. Sure, OSes/language compilers do their best, and have some pretty smart people writing them, but the best performance is gained when you don't make the call at all.

Uh... Sorry, no. This may have been the case in days long past, but with modern CPUs and memory caching on the CPU it's not that easy. Your freelist *may* be better, yet again it may not. It depends a lot on memory usage patterns.

Also, micro-optimization, meaning you optimize code that only runs less than 1% of the entire execution time will bring no realistic speed improvement at all. I did some extensive profiling years ago and found the general malloc/free to be superior to any homegrown heap substitute on nearly any scenario.

The only exception I have to make is when a significant number of very small objects gets frequently allocated and deallocated. By small objects I mean 20 bytes or so, not a big chunk like mobj_t. And by frequent I mean something like several hundreds of these that get thrown away in the same frame normally. For a clipping list which gets allocated and deallocated during the same frame and causes quite a bit of fragmentation having a preallocated pool makes sense. For longer lived stuff like mobj_t's, malloc is more than enough in all circumstances I can imagine. Obviously the zone heap is not and sticking to that will indeed necessitate the implementation of crutches to work around its limitations and shortcomings.

And I consider a freelist mostly a crutch because any memory added to it is lost for everything else.

Maes · October 24, 2012

kb1 said:
Sorry, double-post. Yeah, I forgot to suggest that the mobj_t needs to be zeroed first, I thought that might have been what caused you problems the first time.

In the working version of the code, I noticed that I didn't do any particular zeroing except maybe for the function codepointer. TBQH I don't remember what the actual issue was, however it seems solved now :-/

Quasar · October 24, 2012

Maes said:
In the working version of the code, I noticed that I didn't do any particular zeroing except maybe for the function codepointer. TBQH I don't remember what the actual issue was, however it seems solved now :-/

I feel it should be noted that establishment of rigorous referential integrity is an obvious prerequisite. If Mobj instances can still be freed with other objects referencing them as in the unmodified vanilla engine, then they could cause modifications not only to objects on the freelist, but to the new objects they become after an object in the freelist is put back into play (reincarnated, so-to-speak).

I mention this because I'm not aware of how strict Mocha Doom is with enforcing referential integrity between instances of its version of Mobj, nor the mechanisms you use to do so. I'd imagine you have it down pretty well, given the relative strictness of Java when it comes to not being able to access deleted objects.

Maes · October 24, 2012

Digging into the FORCED and PAINFUL history revealed this: I introduced the pooling system shortly before I introduced a proper "codepointer" system with separate classes for action functions (version 1.5).

Before that, I used a big-ass dispatch() method, that executed the proper action function depending on the value of each mobj_t's think_t field (it was an enum of all possible functions, so a huge switch statement was used).

Anyway, under the old system, objects for deletion got a "null" function pointer, instead of the special "NOP" value I introduced later. Apparently, using null was catastrophic when mixed with reusing for reasons I can't really remember even with repo history, and I had to introduce the special "NOP" value, which signaled an object to be removed definitively from the list. Maybe that wasn't the cause at all, but the code changed a lot since them, and I didn't commit the unsuccessful attempts.

This in turn triggered me to complete a proper "codepointer" system with rewirable per-mobj "function pointers".

kb1 · October 24, 2012

DaniJ said:
I just want to point out a contextual difference between where Graf is coming from and you (kb1) and Maes are - in ZDoom (and it's derivatives) a mobj_t is not a single fixed size. In his port a mobj_t instance is a dynamic composition, inheriting from (potentially) several different base classes.

Yes, absolutely - In my case, mobj_t's are fixed size, which is what makes them an obvious choice for me, I probably wouldn't have considered trying it otherwise. Excellent point.

Maes said:
In the working version of the code, I noticed that I didn't do any particular zeroing except maybe for the function codepointer. TBQH I don't remember what the actual issue was, however it seems solved now :-/

Quasar said:
I feel it should be noted that establishment of rigorous referential integrity is an obvious prerequisite.

Yes, I should have mentioned this. I'm using the Boom P_RemoveThinkerDelayed scheme, which is absolutely essential, and things of this nature when spawnning a new mobj:

  memset(mobj, 0, sizeof(*mobj));
  info = &mobjinfo[type];
  mobj->type = type;
  mobj->info = info;
  etc.

Graf Zahl said:
(things in a respectful manner)

Ok, you're being respectful :), I appreciate it - thank you, so I will respond.

Graf Zahl said:
So...

As I pointed out, you have a higher peak because you got 2 distinctive blocks of memory, which both need to be able to keep the maximum required amount of RAM.

This is true, but, then sizes of each pool are effectively smaller, aren't they? (of course, it's tricky to know what those are, which is partially alleviated by dynamic allocation)

Graf Zahl said:
If you run out of memory in such situations, keeping a freelist would inevitably mean you allocate more memory - because you need to keep all the freed objects' memory allocated.

If you're dynamically allocating space, and you run out of memory, too bad - you don't have enough memory to run this map! Seriously. I assume that peak memory usage is what is needed, because during one frame of the game, it *is* needed. I think you're getting thrown off by my first discussion about running a low-memory condition to speed up testing. I'm not concerned with low memory handling in this discussion - this is about maintaining performance throughout a game session.

Graf Zahl said:
So, why not increase the heap's size to begin with?

Because I had a very real need to get the mobj_t's out of Doom/Boom's crappy heap, because they were causing the poor z_zone heap to needlessly purge and reload itself frequently. Yes, I could use a better implementation, or simply malloc/new each object. But, WOW! What a difference it made to just use the freelist - I literally stopped worrying about the crappy heap once that was done - the very specific problems I was concentrating on at the time simply disappeared.

Graf Zahl said:
Seriously, the first thing I'd do with Doom these days is dump the zone heap.

Agreed, it's shit.

Graf Zahl said:
Uh... Sorry, no. This may have been the case in days long past, but with modern CPUs and memory caching on the CPU it's not that easy. Your freelist *may* be better, yet again it may not. It depends a lot on memory usage patterns.

What I was saying is that it's *not* easy to judge what the external allocation functions are going to do. That doesn't mean not to use them, but, I won't whole-heartedly trust them either. Interesting that you mention the memory caching schemes - the freelist certainly does not provide perfect cache performance, but it *does* stick all the mobj_t's together.

Graf Zahl said:
Also, micro-optimization, meaning you optimize code that only runs less than 1% of the entire execution time will bring no realistic speed improvement at all...The only exception I have to make is when a significant number of very small objects gets frequently allocated and deallocated...And by frequent I mean something like several hundreds of these that get thrown away in the same frame normally.

1% is a *huge* amount! The maximum amount of time you can spend on a frame is 1/35 second, otherwise you're missing frames. If all of your code for rendering, state management, collision, sound, without counting memory alloc/dealloc time, takes 100% of 1/35 of a second, that extra 1% needed to allocate/deallocate causes you to miss a frame, which is a catastrophe if you're also running synchronous network play - now, your lagging. 1% of a total frame's time is not what you should be concerned with, it's the balance of remaining time (or lack of) that determines how smooth the game is.

Imagine a scenario if you will: Running a "Brutal Doom"-like add-on, in a rather brutal map. You've walked into a room with, say, 200 monsters. These are custom monsters, that spawn multiple missiles at a time. You fire the BFG, and they fire their missiles. Your BFG shot lands, and the traces are spawnned. You now have 200 monsters, half of them have spawnned 3 missiles, and you've spawnned, say 30 BFG clouds. You splatter 30 monsters, each of which sprays arms, heads, legs, and countless blood gibs. Tons of these gibs fall into the lava, which spawns splashes, and smoke spawnners. As these gibs bounce and hit the walls and floor, they spawn extra blood splats.

Can you honestly say that, this intense, but not unheard-of scenario does not have the potential to call SpawnMobj multiple hundreds of times per frame? Also consider that, you may be running HD resolution, 32-bit color, uncapped frame-rate, in a 3-player coop game. In this scenario, you have *no* time to spend in memory allocation calls - calls that would intersperse mobj_t's in with textures, flats, sounds, sprites in memory, messing up the memory cache.

*Anything* you can do to enhance performance, that is not too convoluted to maintain, should be considered for inclusion. Otherwise, your users just suffer needlessly for your coding philosophy.

You can place "valid REJECT map building" in this category as well. I wish people would spend the small amount of extra time it takes to do this - it's almost always a win.

Maes said:
...but the code changed a lot since them, and I didn't commit the unsuccessful attempts.

This in turn triggered me to complete a proper "codepointer" system with rewirable per-mobj "function pointers".

Just curious, are you now maintaining original demo sync with your port? That would be incredible! Is that something you're considering?

Graf Zahl · October 24, 2012

kb1 said:
You can place "valid REJECT map building" in this category as well. I wish people would spend the small amount of extra time it takes to do this - it's almost always a win.

Is it?

Here's a better idea: Use a faster sight checking algorithm! And the good news: It already exists in the form of the Doom 1.2 version!

If the one critical bug in here is fixed it outclasses the newer version easily, even without ever using REJECT - because it immediately stops checking when it finds a one sided line. The later BSP algrithm has to step down deeply into the BSP before even starting to check anything.

Also, please point me to a map where REJECT does help. I have done some profiling with a few larger monster heavy maps over 10 years ago on PrBoom because I wanted to know how useful REJECT is. I couldn't see any measurable advantage. Sure, I could see that it took a tiny few milliseconds less - but as I said, using the alternative sighr checking code even without REJECT provided much better performance improvement.

kb1 · October 25, 2012

Graf Zahl said:
Is it?

Here's a better idea: Use a faster sight checking algorithm! And the good news: It already exists in the form of the Doom 1.2 version!

Yep, already using it, with the diagonal fix (conditionally, of course, based on demo sync needs).

Graf Zahl said:
Sure, I could see that it took a tiny few milliseconds less - but as I said, using the alternative sighr checking code even without REJECT provided much better performance improvement.

And, better yet with both. Once again, *any* time saved is a plus, as long as it doesn't make your code unmaintainable. You have to read it anyway (if you care about things like RMB tricks, and demo sync), so why fill it with nothing when you can let it improve performance?

Notice that Doom Press-release beta maps lack the REJECT lump, but it's included in the v0.99 release onward. However, the "old, fast" sight-checking was in there to at least v1.2.

Clearly, id found benefit in adding REJECT while using the old sight checking.

I could probably find countless examples of improvement without trying too hard. Once, I literally stopped a slow coop game, build a valid REJECT map, and started the game back up. We had an improved experience after that.

Did I measure frame rates? No. Do I know how much it improved? Not at all. But, I do know that the game ran quite a bit smoother, since, in our 3-player game, I sped up 3 instances of Doom, simply by building a required lump properly. That's what matters to me - good fluid gameplay under adverse conditions. That creates immersion.

If I were to find such a map, and post it, would it make a difference to you? If you were to test it and reach the same conclusions, would you post a reply stating that you were wrong? Would you stop telling users that REJECT is useless? Would you actually start telling your users to build valid REJECT lumps? Or, would I just be wasting my time?

At first I thought you just liked to fight, but I now tend to believe you are feeling personally attacked somehow by my posts. Relax, man, it's not a personal attack. My goals aren't to make you wrong, I want to make Doom playable, so I can have fun playing coop with my friends. And, suggesting how others can improve their experience is a plus. Isn't that what this site is all about?

Sign In

Which are the ports with most optimized renderers

Recommended Posts

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Share this post

Link to post

Create an account or sign in to comment

Create an account

Sign in