printz

Which is the fastest/most efficient Hexen port?


I'm trying to play a NUTS.WAD setup converted to Hexen. Which is the fastest port for it? ZDoom? GZDoom? DOSBox probably isn't efficient due to being a VM. I have an NVidia video card so I can use GZDoom.

We can't have PrBoom+ here.

Share this post


Link to post

Since you have a video card, GZDoom or PrBoom with the OpenGL renderer would probably be faster, since the software renderer would just eat up CPU cycles. Any other port with an OpenGL/DirectX renderer would probably be fast too.

Share this post


Link to post

Since this is about Nuts-style, other factors are also important.
ZDoom has been known for performance issues on extremely monster-heavy maps though.

For Hexen this means you'll probably have problems finding a suitable port, because there just isn't anything basic.

You may also want to try Doomsday with all light effects switched off if GZDoom doesn't work.

Share this post


Link to post
Graf Zahl said:

You may also want to try Doomsday with all light effects switched off if GZDoom doesn't work.

I think Doomsday with any settings will be 100x slower than GZDoom. Last time I tried it on nuts.wad it was <0.1 fps

Share this post


Link to post

Ok, good to know. I thought it may be better on the gameplay side but of course I forgot how much the renderer's performance sucks...

Share this post


Link to post

Basically, your choice is between Chocolate Hexen, Doomsday, GZDoom, and ZDoom. Other Hexen ports are dead. (I'm not sure whether Vavoom is entirely dead or not, but it kinda looks dead.)

Doomsday is very slow and will remain so until 2.0 is finalized and they start working on optimizing the code again.

Tosi said:

PrBoom with the opengl renderer would probably be faster

Hexen.

Share this post


Link to post

Exactly what makes some ports so slow when handling NUTS-like maps? I didn't make any particular effort or optimization for Mocha, and it runs nearly as fast as prBoom+ on such levels *wtf*

Share this post


Link to post
Maes said:

Exactly what makes some ports so slow when handling NUTS-like maps? I didn't make any particular effort or optimization for Mocha, and it runs nearly as fast as prBoom+ on such levels *wtf*

I don't know anything about the Doom engine, but I imagine that advanced features add a bunch of conditionals to the code. Multiply that by thousands of monsters, and it can have a large impact.

Share this post


Link to post

I seem to remember Entryway fixing a bug in PrBoom+ in MBF code that caused way too many monster-to-monster interactions to happen, but I've never tracked it down. Maybe Entryway can enlighten us.

Other culprits are:
* Added conditionals as exp(x) mentioned.
* slow line-of-sight checks
* slow memory allocation (missiles)
* believe it or not, slow sound mixing
* sprite overdraw. On modern CPUs, this is a BIG problem, thrashing memory cache many times over per frame.

Share this post


Link to post
kb1 said:

I seem to remember Entryway fixing a bug in PrBoom+ in MBF code that caused way too many monster-to-monster interactions to happen, but I've never tracked it down. Maybe Entryway can enlighten us.

Other culprits are:
* Added conditionals as exp(x) mentioned.
* slow line-of-sight checks
* slow memory allocation (missiles)
* believe it or not, slow sound mixing
* sprite overdraw. On modern CPUs, this is a BIG problem, thrashing memory cache many times over per frame.

It was one line of code from MBF that had never been adapted into the PrBoom codebase which removes dead monster corpses from the th_friends or th_enemies thinkerclass lists. This kept the search time for enemy targets excessively high on maps with large amounts of monsters. It was also responsible for the only known demo desync for an MBF demo in PrBoom-Plus.
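
To make the effect concrete, here is a minimal sketch of the idea of keeping monsters on per-class lists and taking corpses off the list that target searches walk. It is not the MBF or PrBoom code: the type `monster_t`, the class names, and the functions `set_class`, `kill_monster` and `search_cost` are all hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; NOT the real MBF/PrBoom structures. */
typedef enum { CLASS_NONE, CLASS_ENEMY, CLASS_FRIEND, NUM_CLASSES } list_class_t;

typedef struct monster_s {
    int health;
    list_class_t cls;                 /* class list this monster is on */
    struct monster_s *cprev, *cnext;  /* links within that class list  */
} monster_t;

/* One circular list head per class. */
static monster_t class_head[NUM_CLASSES];

static void init_class_lists(void)
{
    for (int i = 0; i < NUM_CLASSES; i++) {
        class_head[i].cprev = class_head[i].cnext = &class_head[i];
        class_head[i].cls = (list_class_t)i;
    }
}

/* Put a monster on the given class list, unlinking it from its old list.
   A zero-initialised monster (cnext == NULL) counts as "not linked yet". */
static void set_class(monster_t *m, list_class_t cls)
{
    assert(m != NULL && cls >= 0 && cls < NUM_CLASSES);
    if (m->cnext) {
        m->cprev->cnext = m->cnext;
        m->cnext->cprev = m->cprev;
    }
    monster_t *head = &class_head[cls];
    m->cls = cls;
    m->cprev = head->cprev;
    m->cnext = head;
    head->cprev->cnext = m;
    head->cprev = m;
}

/* The step under discussion: a corpse leaves the list that searches walk. */
static void kill_monster(monster_t *m)
{
    m->health = 0;
    set_class(m, CLASS_NONE);
}

/* Number of nodes a target search over one class has to visit. */
static int search_cost(list_class_t cls)
{
    int n = 0;
    for (monster_t *m = class_head[cls].cnext; m != &class_head[cls]; m = m->cnext)
        n++;
    return n;
}
```

If `kill_monster` skipped the `set_class` call, `search_cost(CLASS_ENEMY)` would never shrink, so every search on a NUTS-style map would keep walking thousands of corpses.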

Share this post


Link to post
kb1 said:

Other culprits are:
* Added conditionals as exp(x) mentioned.


That one is plausible, especially if they lead to complexities higher than O(n) (where n is the number of monsters). However, if they "simply" lead to O(cn) increases (where c is some constant << n), it's harder to swallow that the difference can span orders of magnitude.

kb1 said:

* slow line-of-sight checks


Same as above. I'd expect this to add a (fairly large) constant overhead to each thinker's processing time, but not enough to justify a degradation of an order of magnitude or more. Then again, ZDoom completely eschews REJECT map calculations, so it replaces an O(1) or O(c) operation with one of much higher complexity (a BSP search...dunno, O(n log n) where n = number of nodes in a map?)

kb1 said:

* slow memory allocation (missiles)


I think that's the only truly strong point of Mocha against all other ports: superior garbage collection and lazy deallocation to the extreme, as my tests have shown in NUTS.WAD: stuff doesn't get collected until memory runs really low (I had to force a low heap) or you force it with GC trickery. Compare this with having to pay for a free or dealloc operation for every object in a NUTS.WAD map (the rate of spawning/death of projectiles in that map can easily run in the 1000s per second).

kb1 said:

* believe it or not, slow sound mixing


More like taxing the sound channel allocation with thousands of requests that never get played back, but that depends on the channel management strategy used. Of course, the fewer spurious/brief requests reach the actual mixing stage, the better.

kb1 said:

* sprite overdraw. On modern CPUs, this is a BIG problem, thrashing memory cache many times over per frame.


Understandable, though, as I said, Mocha seems to handle that just as well as prBoom+. It really becomes a bottleneck when trying to parallelize, though.

Share this post


Link to post

What's sizeof(mobj_t) in ZDoom? I think mobj_t or whatever it's called is INSANELY bloated in ZDoom. No wonder Nuts runs so poorly.

Share this post


Link to post

That's got nothing to do with it.
The main problem is that the enemy logic is a lot more complex than the original one and there's probably some bug in it. These are hard to find though.

If cache misses caused this kind of slowdown, most software would run creepingly slow.

Share this post


Link to post
Graf Zahl said:

If cache misses caused this kind of slowdown, most software would run creepingly slow.


Fire up memtest. Divide the transfer rate of the level 1 or level 2 cache by the main memory's. That's the amount of slowdown you can expect from a complete cache miss ;-)

On modern CPUs, the ratio of L1 cache to main memory speed is about 10:1. It can be MUCH higher for older CPUs and memory technologies though.

Share this post


Link to post
Maes said:

Fire up memtest. Divide the transfer rate of the level 1 or level 2 cache by the main memory's. That's the amount of slowdown you can expect from a complete cache miss ;-)

On modern CPUs, the ratio of L1 cache to main memory speed is about 10:1. It can be MUCH higher for older CPUs and memory technologies though.



Sure, but the question isn't how much the effect of a single cache miss is but the increase in cache misses by certain operations.

Code that has to check lots of separate data already has lots of cache misses by default so what happens here isn't to go from 0 to 10 misses but maybe from 100 to 110 which has a far less pronounced effect.

For example, I was once toying in GZDoom's renderer with precalculating and caching some render data. Ultimately it caused a 10% speed decrease due to caching behavior - so saying that code that runs reasonably fast suddenly slows down to a crawl just by adding more cache misses is dubious. Sure, it may get slower but not by such large factors.

Share this post


Link to post
Graf Zahl said:

Sure, but the question isn't how much the effect of a single cache miss is but the increase in cache misses by certain operations.


It depends a lot on the memory layout of the data structures used, cache line length, cache size, main memory-to-cache size ratio and cache associativity. If someone REALLY has no life, they could work out exactly in what sequences/patterns data is accessed, and lay it out in memory beforehand in the most optimal way for a specific set of cache parameters, kinda like a Story of Mel on steroids.

Of course, almost no one does such a thing, AFAIK, not even hyper-specific compilers. However, some things in Doom are glaringly anti-cache, e.g. the column-based rendering is a killer when the screen buffer is row-first (and even if it wasn't, you'd have to pay for an expensive transpose operation at the end of each tic, unless you have column-first video hardware as well).

The only thing a "general" programmer can do is use some common sense: e.g. don't try to perform a matrix-vector multiplication starting from the last row and column and going backwards (that just fucks up cache locality), cache const values that are commonly recalculated inside loops, etc. In general, the less you access main memory and the less you go "against the grain", the better.
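
As a small illustration of going "with the grain" versus "against the grain", the following sketch walks the same row-major C array in two orders. Both functions return the same sum; only the memory access pattern differs. The names `fill_matrix`, `sum_with_the_grain` and `sum_against_the_grain` and the 64x64 size are arbitrary choices for the example, and no timing figures are claimed here.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 64, COLS = 64 };

/* Fill the matrix with distinct values so the sum is easy to check. */
static void fill_matrix(int m[ROWS][COLS])
{
    assert(m != NULL);
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            m[r][c] = r * COLS + c;
}

/* Row-by-row walk: consecutive accesses are adjacent in memory. */
static long sum_with_the_grain(int m[ROWS][COLS])
{
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += m[r][c];
    return sum;
}

/* Column-by-column walk over the same row-major array:
   each access jumps COLS ints ahead, much like Doom's column drawer
   stepping down a row-first screen buffer. */
static long sum_against_the_grain(int m[ROWS][COLS])
{
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += m[r][c];
    return sum;
}
```

On arrays much larger than the cache, the second loop order is the one that tends to suffer, because each access lands on a different cache line.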

Share this post


Link to post
Maes said:

(and even if it wasn't, you'd have to pay for an expensive transpose operation at the end of each tic, unless you have column-first video hardware as well).


You could let the graphics hardware do that.

But it wouldn't help. It's only walls and sprites that are drawn vertically, not flats. Of course flats will cause other types of cache misses because they aren't accessed sequentially.


It's all academic anyway. Yes, larger data structures will decrease cache performance to a degree - but I've yet to find an example where this decrease exceeds a few percentage points unless using deliberately constructed examples.

Share this post


Link to post

In Hexen (and Heretic) there needs to be more performance than in Doom or Strife, because the player is more likely to play on fast mode. In such a case, every monster will keep shooting, resulting in an immensely larger number of actors at a given time.

Share this post


Link to post
Graf Zahl said:

The main problem is that the enemy logic is a lot more complex than the original one

What does that mean exactly, are the ZDoom monsters smarter?

Share this post


Link to post

No, not smarter, but the logic contains quite a bit of code to support new ZDoom features, e.g. following a predefined path.

If you compare ZDoom's A_Look and A_Chase functions with the originals you'll see that ZDoom's versions are considerably larger.

Share this post


Link to post
Quasar said:

It was one line of code from MBF that had never been adapted into the PrBoom codebase which removes dead monster corpses from the th_friends or th_enemies thinkerclass lists. This kept the search time for enemy targets excessively high on maps with large amounts of monsters. It was also responsible for the only known demo desync for an MBF demo in PrBoom-Plus.

Very interesting, and good to know!

Maes said:

...Then again, ZDoom completely eschews REJECT map calculations, so it replaces an O(1) or O(c) operation with one of much higher complexity (a BSP search...dunno, O(n log n) where n = number of nodes in a map?)

Nah, ZDoom actually does something very smart here - it still uses REJECT, but if REJECT fails to stop a sight-check, ZDoom uses a fixed version of the old Doom 1.2 sight-checking that happened to make its way over to Heretic (or Hexen??). This code is generally much faster than the final stuff in later Doom versions, but it originally had an ugly bug that would have to be studied carefully to find (see ZDoom source). Randy found and fixed it, and has been using it ever since.
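
For reference, the REJECT early-out itself is just a bit lookup. The sketch below assumes the usual layout of one bit per (sector, sector) pair, packed least-significant bit first; the function name `reject_blocks_sight` and its parameters are made up for the example and are not taken from any port's source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when the REJECT table says sector s1 can never see sector s2,
   so the expensive line-of-sight trace can be skipped entirely.
   Assumed layout: bit number s1 * numsectors + s2, LSB-first within a byte. */
static bool reject_blocks_sight(const unsigned char *reject, int numsectors,
                                int s1, int s2)
{
    assert(reject != NULL);
    assert(s1 >= 0 && s1 < numsectors && s2 >= 0 && s2 < numsectors);

    size_t pnum = (size_t)s1 * (size_t)numsectors + (size_t)s2;
    return (reject[pnum >> 3] & (1u << (pnum & 7))) != 0;
}
```

Only when this returns false does a port need to fall through to its real sight check, whichever algorithm it uses for that.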

Maes said:

...I think that's the only truly strong point of Mocha against all other ports: superior garbage collection and lazy deallocation to the extreme, as my tests have shown in NUTS.WAD: stuff doesn't get collected until memory runs really low (I had to force a low heap) or you force it with GC trickery. Compare this with having to pay for a free or dealloc operation for every object in a NUTS.WAD map (the rate of spawning/death of projectiles in that map can easily run in the 1000s per second)...

My port uses a freelist for mobj_ts, so I enjoy a half-dozen or so allocations per game! Hard to beat.
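
A minimal sketch of the freelist idea, for anyone who hasn't seen one. This is not my port's actual code: `actor_t`, `actor_alloc`, `actor_release` and the `heap_allocs` counter are hypothetical names, and this version grows one object at a time instead of grabbing big chunks up front.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for mobj_t. */
typedef struct actor_s {
    int x, y, health;
    struct actor_s *next_free;   /* only meaningful while on the freelist */
} actor_t;

static actor_t *free_head = NULL;  /* released actors waiting for reuse   */
static int heap_allocs = 0;        /* how often we really called malloc() */

/* Reuse a released actor if there is one, otherwise go to the heap. */
static actor_t *actor_alloc(void)
{
    actor_t *a = free_head;
    if (a) {
        free_head = a->next_free;
    } else {
        a = malloc(sizeof *a);
        if (!a)
            return NULL;
        heap_allocs++;
    }
    memset(a, 0, sizeof *a);
    return a;
}

/* "Freeing" just pushes the actor onto the freelist; no free() call. */
static void actor_release(actor_t *a)
{
    assert(a != NULL);
    a->next_free = free_head;
    free_head = a;
}
```

With this, a projectile that spawns and dies a thousand times in a row costs one `malloc` in total, which is the whole point on a map like NUTS.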

Maes said:

...However some things in Doom are glaringly anti-cache e.g. the column-based rendering is a killer, when the screen buffer is row-first (and even if it wasn't, you'd have to pay for an expensive transpose operation at the end of each tic, unless you have column-first video hardware as well)...

Oh, you're right, it's horrible. If you want a REALLY diabolical case, set your horizontal resolution to a multiple of cache size, like a power of 2...like 1024x768 - then you guarantee a cache write/flush cycle each pixel write! Adding extra bytes to each line of your frame buffer will make a huge difference in this case (1028 bytes vs. 1024).
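
To put rough numbers on that, here is a sketch that counts how many distinct cache sets a single screen column lands in for a given pitch. The cache geometry is an assumption (64-byte lines and 64 sets, as in a typical 32 KiB 8-way L1), the buffer is assumed to start at a line-aligned address, and `sets_touched_by_column` is a made-up name. It models set mapping only, not real timings.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache geometry (hypothetical; typical of a 32 KiB 8-way L1). */
enum { LINE_BYTES = 64, NUM_SETS = 64 };

/* Count the distinct cache sets hit when writing one pixel per row,
   walking straight down a column of a buffer with the given pitch in bytes. */
static int sets_touched_by_column(size_t pitch, int height)
{
    unsigned char seen[NUM_SETS] = {0};
    int distinct = 0;

    assert(pitch > 0 && height > 0);
    for (int y = 0; y < height; y++) {
        size_t addr = (size_t)y * pitch;              /* byte offset of the pixel */
        size_t set  = (addr / LINE_BYTES) % NUM_SETS; /* set it maps to           */
        if (!seen[set]) {
            seen[set] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Under these assumptions a 768-row column with a 1024-byte pitch maps to only 4 of the 64 sets, so all its rows fight over a handful of line slots, while a 1028-byte pitch spreads the same column across all 64 sets.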

Now, add sprite overdraw to that. Now you're invalidating cache that was written quite some time ago - now you're flushing MULTIPLE cache layers.

ZDoom was the first (I think) port to try to write 4 horizontal pixels at once before switching lines. It used a mind-boggling algorithm to attempt to align 4 separate vertical runs horizontally. It's way more complicated than the original renderer, but, amazingly, can double renderer performance in some cases.

Eternity followed with its quad-renderer, which is a similar idea, but very different approach.

Modern CPUs have write instructions that deliberately avoid the cache, but, unless you write the assembly code yourself, it's tricky (if even possible) to get compilers to use the instructions. Of course, there's no portable way to code it anyway. A shame.

Share this post


Link to post
