Very interesting, and good to know!
It was one line of code from MBF that had never been adapted into the PrBoom codebase, which removes dead monster corpses from the th_friends or th_enemies thinkerclass lists. Without it, the search time for enemy targets stayed excessively high on maps with large numbers of monsters. It was also responsible for the only known desync of an MBF demo in PrBoom-Plus.
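To make the effect of that one line concrete, here is a minimal sketch in C of per-class lists where a dying monster is moved out of the list that target searches walk. Everything in it (`monster_t`, `update_class`, `kill_monster`, `count_class`) is a hypothetical simplification for illustration, not the actual MBF or PrBoom code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thinker classes, loosely modelled on the lists named above. */
typedef enum { TH_MISC, TH_FRIENDS, TH_ENEMIES, NUM_CLASSES } th_class_t;

/* Hypothetical, stripped-down monster with links for its class list. */
typedef struct monster_s {
    int health;
    int friendly;
    th_class_t cls;
    struct monster_s *cprev, *cnext;   /* NULL until first classified */
} monster_t;

/* One circular list per class; each head is a sentinel node. */
static monster_t class_heads[NUM_CLASSES];

void init_class_lists(void)
{
    for (int i = 0; i < NUM_CLASSES; i++)
        class_heads[i].cprev = class_heads[i].cnext = &class_heads[i];
}

/* Unlink m from its current list (if any) and append it to the list
 * that matches its current state. Dead monsters go to TH_MISC. */
void update_class(monster_t *m)
{
    assert(m != NULL);
    if (m->cnext != NULL) {
        m->cnext->cprev = m->cprev;
        m->cprev->cnext = m->cnext;
    }
    m->cls = m->health <= 0 ? TH_MISC
           : (m->friendly ? TH_FRIENDS : TH_ENEMIES);

    monster_t *head = &class_heads[m->cls];
    m->cprev = head->cprev;
    m->cnext = head;
    head->cprev->cnext = m;
    head->cprev = m;
}

void kill_monster(monster_t *m)
{
    m->health = 0;
    update_class(m);   /* the kind of call the post describes as missing */
}

/* Cost of a target search is proportional to this count. */
int count_class(th_class_t cls)
{
    int n = 0;
    for (monster_t *m = class_heads[cls].cnext; m != &class_heads[cls]; m = m->cnext)
        n++;
    return n;
}
```

If `kill_monster` skipped the `update_class` call, corpses would stay in the enemies list and every later target search would still have to step over them.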
Nah, ZDoom actually does something very smart here - it still uses REJECT, but if REJECT fails to stop a sight check, ZDoom uses a fixed version of the old Doom 1.2 sight-checking code, which happened to make its way over to Heretic (or Hexen??). This code is generally much faster than the final stuff in later Doom versions, but it originally had an ugly bug that would have to be studied carefully to find (see the ZDoom source). Randy found and fixed it, and has been using it ever since.
...Then again, ZDoom completely eschews REJECT map calculations, so it replaces an O(1) or O(c) operation with one of much higher complexity (a BSP search... dunno, O(n log n), where n = number of nodes in a map?)
My port uses a freelist for mobj_ts, so I enjoy a half-dozen or so allocations per game! Hard to beat.
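For anyone unfamiliar with the idea, a freelist can be sketched in C roughly like this. This is not the port's actual code: the cut-down `mobj_t`, the `next_free` link, the `heap_allocs` counter and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, stripped-down stand-in for Doom's mobj_t. */
typedef struct mobj_s {
    int health;
    struct mobj_s *next_free;   /* used only while the object sits on the freelist */
} mobj_t;

static mobj_t *free_head = NULL;   /* top of the stack of recycled objects */
static int heap_allocs = 0;        /* counts real malloc calls, for illustration */

/* Hand out a recycled object if one is available; otherwise fall back to malloc. */
mobj_t *mobj_alloc(void)
{
    mobj_t *m = free_head;
    if (m != NULL) {
        free_head = m->next_free;
    } else {
        m = malloc(sizeof *m);
        if (m == NULL)
            return NULL;
        heap_allocs++;
    }
    m->health = 0;
    m->next_free = NULL;
    return m;
}

/* "Free" an object by pushing it onto the freelist instead of calling free(). */
void mobj_release(mobj_t *m)
{
    assert(m != NULL);
    m->next_free = free_head;
    free_head = m;
}
```

Once the list has warmed up to the peak number of live objects, spawning and removing things stops touching the heap at all, which is where the "half-dozen allocations per game" kind of figure comes from.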
...I think that's the only truly strong point of Mocha against all other ports: superior garbage collection and lazy deallocation taken to the extreme. As my tests in NUTS.WAD have shown, stuff doesn't get collected until memory runs really low (I had to force a low heap) or you force it with GC trickery. Compare this with having to pay for a free or dealloc operation for every object in a NUTS.WAD map (the rate of spawning/death of projectiles in that map can easily run in the 1000s per second)...
Oh, you're right, it's horrible. If you want a REALLY diabolical case, set your horizontal resolution to a multiple of cache size, like a power of 2... like 1024x768 - then you guarantee a cache write/flush cycle on each pixel write! Adding extra bytes to each line of your frame buffer will make a huge difference in this case (1028 bytes per line vs. 1024).
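The padding trick amounts to keeping the row pitch separate from the visible width. A minimal sketch in C, assuming an 8-bit framebuffer; `framebuffer_t`, `fb_init`, `fb_draw_column` and the `pad_bytes` parameter are made-up names, and the right amount of padding depends on the cache in question:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 8-bit framebuffer whose row pitch may exceed its width. */
typedef struct {
    uint8_t *pixels;
    int width;      /* visible pixels per row */
    int height;
    size_t pitch;   /* bytes from one row to the next, >= width */
} framebuffer_t;

/* Allocate a zeroed buffer with pad_bytes of unused space after each row,
 * so consecutive rows no longer sit an exact power of two apart.
 * Returns 1 on success, 0 on allocation failure. */
int fb_init(framebuffer_t *fb, int width, int height, size_t pad_bytes)
{
    fb->width = width;
    fb->height = height;
    fb->pitch = (size_t)width + pad_bytes;
    fb->pixels = calloc(fb->pitch * (size_t)height, 1);
    return fb->pixels != NULL;
}

/* Doom-style vertical column: each pixel is one pitch further in memory. */
void fb_draw_column(framebuffer_t *fb, int x, int y1, int y2, uint8_t color)
{
    assert(x >= 0 && x < fb->width && y1 >= 0 && y2 < fb->height);
    uint8_t *dest = fb->pixels + (size_t)y1 * fb->pitch + (size_t)x;
    for (int y = y1; y <= y2; y++) {
        *dest = color;
        dest += fb->pitch;   /* 1028 rather than 1024 when pad_bytes is 4 */
    }
}
```

The catch is that whatever copies the buffer to the screen has to honour the pitch too, instead of assuming rows are packed back to back.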
...However, some things in Doom are glaringly anti-cache. The column-based rendering, for example, is a killer when the screen buffer is row-first (and even if it wasn't, you'd have to pay for an expensive transpose operation at the end of each tic, unless you have column-first video hardware as well)...
Now, add sprite overdraw to that. Now you're invalidating cache that was written quite some time ago - now you're flushing MULTIPLE cache layers.
ZDoom was the first (I think) port to try to write 4 horizontal pixels at once before switching lines. It used a mind-boggling algorithm to attempt to align 4 separate vertical runs horizontally. It's way more complicated than the original renderer, but, amazingly, can double renderer performance in some cases.
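Just to show the basic idea of writing four horizontal pixels per row, here is a deliberately naive sketch in C. It is not ZDoom's algorithm: it assumes the four columns already share the same vertical range, which is exactly the hard part the real code has to solve, and `draw_four_columns` and its parameters are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Draw four adjacent columns starting at screen column x over rows y1..y2.
 * src[i] points at column i's already-scaled texels, top to bottom.
 * Simplifying assumption: all four columns cover the same rows. */
void draw_four_columns(uint8_t *screen, size_t pitch, int x, int y1, int y2,
                       const uint8_t *const src[4])
{
    assert(y1 <= y2);
    uint8_t *dest = screen + (size_t)y1 * pitch + (size_t)x;
    for (int row = 0; row <= y2 - y1; row++) {
        uint8_t quad[4];
        for (int i = 0; i < 4; i++)
            quad[i] = src[i][row];        /* gather one texel from each column */
        memcpy(dest, quad, sizeof quad);  /* one 4-byte write per screen row */
        dest += pitch;
    }
}
```

Each screen row is now touched once for four pixels instead of four separate times, so the number of pitch-sized jumps through the framebuffer drops to a quarter.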
Eternity followed with its quad-renderer, which is a similar idea but takes a very different approach.
Modern CPUs have write instructions that deliberately avoid the cache, but, unless you write the assembly code yourself, it's tricky (if even possible) to get compilers to use the instructions. Of course, there's no portable way to code it anyway. A shame.