Everything posted by GooberMan

  1. Almost everything will still function as in the vanilla renderer - because it is the vanilla renderer, just at higher resolution and with speed optimisations. But we also need to be specific about what kind of HoM will work. If it's a built-in mapping trick? Absolutely. If it's relying on the MAXDRAWSEGS soft limit being hit to make automatic HoM in the distance? No, and virtually every limit-removing port will also behave like this. Handling errors for vanilla limits is something I've been mulling over, and this question has made me put real thought into it. I can implement an optional error highlighting feature, which will basically come down to "after the scene is rendered, give a full-screen render context some alternate column/span rendering functions that can render data into an overlay texture wherever a vanilla limit is breached". I was thinking of merging data from the threaded contexts after, but my profiles show that the act of writing data to the render buffer is the biggest time cost at high resolutions. So I'll keep the code simple, make some debug column renderers, and just parse the BSP again. (Keep asking questions like this, everyone. When I'm forced to stop and think about something, my thoughts turn from basic ideas into plans. The people mentioned in the first post have been asking questions all the time, for example.)
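A rough sketch of how such an overlay marker could look - all names here are hypothetical, and the one-byte-per-column overlay is my own simplification, not the port's actual code:

```c
#include <stdint.h>

#define MAXDRAWSEGS 256   /* vanilla soft limit */

/* Hypothetical overlay: one byte per screen column; nonzero means a
   vanilla limit was breached while rendering that column. */
static uint8_t limit_overlay[320];
static int     drawseg_count;

/* Called per column during the second, debug-only BSP parse. */
static void R_DebugMarkColumn(int x)
{
    if (drawseg_count > MAXDRAWSEGS)
        limit_overlay[x] = 1;   /* this column would HoM in vanilla */
}
```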
  2. Tiny update on the render graph. I find information much easier to digest if I don't have to flick between tabs, don't you? Oh, and I'm slowly replacing the ye olde faithfule setup program. I guess it's starting to turn into a real source port?
  3. No article yet, because I've been putting some real debugging abilities into the code. The Hellbound better-than-reasonably-expected performance got me thinking that I really want to know a ton of stats. And rather than go for that spreadsheet solution I've been running with, I sat down and integrated Dear ImGui into the codebase. Accessible like a console normally is, key-configurable in the setup program too if you're that way inclined. It certainly shows that at these resolutions, flat rendering is the bottleneck. More will be added as I deem it necessary. Next on the list is to update the build scripts so that compiling with the Chocolate Doom instructions works again. EDIT: Compiles and runs on my x64 Linux box. Compiles on my Pi4 running Ubuntu 20.04, but that's where things get tricky. I need to detect at compile time how to handle setting up OpenGL for ImGui differently, and then work out why it segfaults even if I hardcode the setup locally. Anyway, it should work for OSX again. I'll be setting up osxcross (thanks, sponge) soon to at least test compiles locally.
  4. This was actually a self-suggested title. Ling was all "I'm giving out titles" one weekend in the early 00's and I had a reputation for pushing what was possible with ZDoom scripting. So I was like, eh, sure, this will do. I don't think it's been a valid title for 15 years. No nodes in the wad. Top kek. And a bunch of other things I need to work out. But yeah, that's out of scope for now.
  5. That's what I get for doing stuff before I run out for work, I guess. Totally forgot to get the single-threaded comparison shot. So. Let's ignore that this is still with every other optimisation I've made so far, and assume the millisecond counters are indicative of an average frame from this location (it's as low as 58, as high as 72 without moving - welcome to thread scheduling and system bus/cache performance). Divide that number by 8, and you get 7.95. Take the worst performing thread in my prior shot, and it's 7.2. I want to do the multiple renderbuffers before I write that article. Funny thing about a cache: if you have threads writing to the same general area of memory, you end up triggering cache flushes all the time. Give each thread its own distinct - and cache-size-friendly - region of memory to render to and you eliminate that problem. But tonight is beers and Borat and Big Trouble In Little China, so y'all only get that screenshot for now. EDIT: And comparing the shots after making the post... yeah. So, uh, if I move the view to match, single-threaded takes 16ms longer than that shot. Which honestly makes my point even better (78 / 8 = 9.75ms as a per-thread target to beat, and I'm already well under with clear paths for further optimisation).
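A sketch of the per-thread buffer split described above - cache-line size and helper names are illustrative, not the port's actual code:

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64

/* Round each thread's slice up to whole cache lines so no two threads
   ever write to the same line (no false sharing, no forced flushes). */
static size_t slice_bytes(size_t total_pixels, int num_threads)
{
    size_t per_thread = (total_pixels + num_threads - 1) / num_threads;
    return (per_thread + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Each thread renders into its own aligned region; the regions are
   merged to the real backbuffer afterwards. */
static uint8_t* alloc_thread_buffer(size_t total_pixels, int num_threads)
{
    return aligned_alloc(CACHE_LINE, slice_bytes(total_pixels, num_threads));
}
```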
  6. Ah ha, the content I seek. I've been looking for good limit-removing maps to test with. And yeah, the nature of going threaded means that you're limit-removing by nature. Even if I keep the 128 visplane limit, for example, that's now 128 visplanes per thread the way I've threaded things. Altazimuth is also planning to get everything I'm doing into Eternity, so testing on something like Eviternity will tell me exactly what else we need to optimise. Anyway, here's Wonderwall. Just normal MAP29 for now. I had to patch the mapdata structures to stop using the original signed 16-bit integers and move over to unsigned 16-bit to even get that one to load, and I need to find what else I'm not bringing to unsigned before the full one will load. It also uses tons of visplanes per thread anyway, so I don't think I'll be able to run it at the 8 threads my machine has here (and to be clear, this is a 2560x1600 backbuffer, so over twice as many pixels as 1080p; many normal Doom maps really thrash performance at these resolutions) without dynamic visplane allocation and seriously getting memory usage back down to sanity for high resolutions. But I'm not finished. Notice how the main loop waits around for a full three milliseconds after the most expensive thread finishes rendering. That's just the easiest problem to fix. The fact that I'm still using one render buffer is also slowing things down. And I need to trigger some monsters at some point, the sprite renderer is currently unoptimised. So it'll go down further again.
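The signed-to-unsigned patch amounts to reinterpreting the on-disk 16-bit indices; a sketch (struct and helper names are mine, not the port's):

```c
#include <stdint.h>

/* Vanilla on-disk linedef: vertex indices are signed 16-bit, so any
   index above 32767 goes negative on load. Reading the same bits as
   unsigned - the usual limit-removing fix - doubles the usable range. */
typedef struct {
    int16_t v1, v2;        /* vanilla interpretation */
} maplinedef_vanilla_t;

static uint16_t SafeVertexIndex(int16_t raw)
{
    return (uint16_t)raw;  /* reinterpret the bits: 0..65535 */
}
```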
  7. This is my "fuck you" to everyone who ever argued that threading the software renderer is a bad idea. The "fuck you" part is that I'm not finished and that line is gonna keep going down. Article to come this weekend. EDIT: Bonus screenshot from my i7. Showing one reason why that red line should be lower. Loop time taking 1.4 milliseconds longer than the worst performing thread is simply because the loop is waiting for those threads to wake up. Got some ideas on how to deal with that.
  8. The CMake files are currently not up to date, so until I fix that in about 12 hours I can't compile it for my Raspberry Pi nor my Linux box and neither can you. So you might want to hold off a little there.
  9. And another sneak preview. Being Chocolate based (and testing that I don't break Vanilla every step of the way) means that I can just go ahead and load up Plutonia 2 to get a screenshot.
  10. So here's a sneak preview of something I'll be ready to talk about proper in a few days time, screencapped from the Pi used in the above post.
  11. Looking for a very specific sourceport

    And you'll need to put a similar check for visplanes in while you're at it. So instead of crashing in vanilla, you'll just get HoM on floors. I do not think "limit removing" means what OP thinks it means.
  12. Looking for a very specific sourceport

    HoM is fixed specifically by removing/increasing the MAXDRAWSEGS limit. You can't have what you want without fixing that bug.
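    For context, the vanilla behaviour and the limit-removing fix can be sketched like this (simplified; not any particular port's actual code):

```c
#include <stdlib.h>

#define MAXDRAWSEGS 256

typedef struct { int dummy; } drawseg_t;

static drawseg_t  drawsegs_static[MAXDRAWSEGS];
static drawseg_t* drawsegs    = drawsegs_static;
static int        numdrawsegs = MAXDRAWSEGS;
static int        ds_count;

/* Vanilla silently drops the seg once the array is full, so whatever was
   behind it never gets drawn - that's the HoM. A limit-removing port
   grows the array instead (sketch; real ports use their zone allocator). */
static drawseg_t* NewDrawSeg(void)
{
    if (ds_count == numdrawsegs) {
        numdrawsegs *= 2;
        drawseg_t* grown = malloc(sizeof(drawseg_t) * numdrawsegs);
        for (int i = 0; i < ds_count; ++i)
            grown[i] = drawsegs[i];
        drawsegs = grown;
    }
    return &drawsegs[ds_count++];
}
```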
  13. Oh, and just to highlight that cache really is the problem on modern systems. Here's performance against a non-transposed renderer at 2560x1600 on an ARM processor. Ignore the titlepic performance, I didn't patch the scaling code across to my clean Chocco build. But that graph is essentially the same as the original i7 graph I captured at lower resolutions. Notice how outright terrible ARM's cache performs on wall/sprite heavy parts of the DEMO1 loop. (The capture is 700 frames from program start, it ends around where the barrel in front of the secret wall is being shot)
  14. This is my exact plan, in fact. Well. In my experience with similar splitting of buffers, you need to pay attention to cache sizes on your system or else the L3 will trip over itself trying to propagate the buffers before it needs to. So I won't use a single buffer. There are extra advantages to not using a single buffer besides the complete avoidance of cache contention. I'll be doing threading next actually - it's time to take that break from SIMD - so I'll have more information on whether it actually works as I think it should soonish.
  15. Oh, to be clear, I'm Australian and haven't written a demo in my life. Working at Remedy and Housemarque though, I've been surrounded by demosceners. Getting arcane knowledge about bit twiddling is just a matter of finding the right person to ask.
  16. So, the advantage of working at a company with a strong demoscene culture/history: one of the graphics guys programs Atari ST demos in his spare time. He suggested just using a lookup table for a SIMD mask I was trying to calculate at runtime. Given that I've been trying to avoid loads, I didn't think of it. Or, as I've been putting it: "It's so obvious, it's unintuitive." Because the results speak for themselves. Before: And after: (It looks much clearer side-by-side, open in different tabs and switch back and forth)
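A portable sketch of the idea - index a precomputed mask table by the number of lanes left instead of computing the mask with shifts each iteration. The table here is 8 byte-lanes wide for illustration; the real thing would hold 128-bit SIMD masks:

```c
#include <stdint.h>

/* One table load replaces per-iteration shift arithmetic. Entry N masks
   in the low N byte-lanes of a 64-bit word. */
static const uint64_t tail_mask[9] = {
    0x0000000000000000ull, 0x00000000000000FFull, 0x000000000000FFFFull,
    0x0000000000FFFFFFull, 0x00000000FFFFFFFFull, 0x000000FFFFFFFFFFull,
    0x0000FFFFFFFFFFFFull, 0x00FFFFFFFFFFFFFFull, 0xFFFFFFFFFFFFFFFFull,
};

/* Load a word, keeping only the lanes that are still in range. */
static uint64_t MaskedLoad(const uint64_t* src, int lanes_left)
{
    return *src & tail_mask[lanes_left < 8 ? lanes_left : 8];
}
```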
  17. Fog is something I'll have to deal with when I get to making Hexen run again. Let's see what I come up with when I get to it.
  18. Which reminds me, how maintained is GZ's software renderer these days? I looked at the code the other day but not the history. Things like PNGs would definitely need special consideration to even run properly in this code path. (I've also stated how I'd do a hardware renderer previously on these forums. I'll get back to that at some point, but now that I'm learning the software renderer inside out this will honestly improve the methods I was going to employ.) I have had to bump the default page size to 128MiB thanks to REKKR. I'll rewrite the allocator one day to be a bit more modern, specifically grabbing new virtual pages when needed. Likely a solved problem in every other source port, but as noted above Chocco is so close to vanilla.
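A minimal sketch of grab-a-page-when-needed allocation - a trivial bump allocator that links in a fresh page when the current one is exhausted. Names and sizes are illustrative, not Chocolate's Z_Malloc:

```c
#include <stdlib.h>
#include <stddef.h>

#define PAGE_SIZE (4u * 1024u * 1024u)

typedef struct page_s {
    struct page_s* prev;
    size_t         used;
    unsigned char  data[PAGE_SIZE];
} page_t;

static page_t* current_page;

/* Bump-allocate out of the current page; chain a new one on demand.
   (A real implementation would use VirtualAlloc/mmap and handle
   oversized requests; this sketch does not.) */
static void* Zone_Alloc(size_t size)
{
    if (!current_page || current_page->used + size > PAGE_SIZE) {
        page_t* p = malloc(sizeof(page_t));
        p->prev = current_page;
        p->used = 0;
        current_page = p;
    }
    void* result = current_page->data + current_page->used;
    current_page->used += size;
    return result;
}
```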
  19. At a minimum, the backbuffer transpose should be applicable to every port with a software renderer. I am curious to see it profiled against ports that try to render multiple columns at a time, but my suspicion is that this will perform better because I'm not branching all over the place to handle multiple columns and it stays within one cache line for writes far longer than other methods. This really should have been done and made standard years ago IMO.
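The transpose amounts to changing the backbuffer's indexing so columns are contiguous; a sketch with illustrative helper names:

```c
#include <stddef.h>

/* Row-major (vanilla): pixel (x, y) lives at y * width + x, so drawing a
   column strides by `width` bytes and touches a new cache line nearly
   every pixel. Transposed (column-major): (x, y) lives at x * height + y,
   so a column write is perfectly sequential. One transpose pass at blit
   time pays for thousands of column writes per frame. */
static size_t RowMajorIndex(int x, int y, int width)
{
    return (size_t)y * width + x;
}

static size_t TransposedIndex(int x, int y, int height)
{
    return (size_t)x * height + y;
}
```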
  20. You think that's amazing? I just compared the high-res Chocco running on my i7-6700HQ to my optimisations running on the Raspberry Pi at the same resolution. aaaaahahahahahahaha an ARM with a maximum clockrate of 1.5GHz running my optimisations performs basically as well as an i7 running an uprezzed Chocco. *ahem* So. Uh. That red line is gonna go further down by the time I'm done.
  21. Just did a run against a nearly-stock Chocco I have here locally that adds high res support and not much else. And wow. I've been so focused on incremental improvements that I forgot how far it's already come. First post updated with the profile in question.
  22. DOS Doom Code Execution

    Tricks like this rely on stack overflows. Code reads more data into a local variable than it should; the end result is that the program executes the loaded data as if it were actually code. Chocolate Doom's save code has been rewritten to avoid this, basically by being explicit about the values it reads/writes instead of leaving it up to memcpy. You'd need an entirely different attack vector to get it to work. EDIT: Also, depending on compile settings, stack overflows can be detected these days.
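    A sketch of what explicit-field serialisation looks like, modeled on Chocolate Doom's saveg_read16/saveg_write16 (simplified here). Reading fields one at a time with explicit sizes means a malformed save can never write past the end of a struct, unlike a raw fread of sizeof(struct):

```c
#include <stdio.h>
#include <stdint.h>

/* Read a little-endian 16-bit value, one byte at a time. */
static int16_t saveg_read16(FILE* f)
{
    int lo = fgetc(f);
    int hi = fgetc(f);
    return (int16_t)((lo & 0xFF) | ((hi & 0xFF) << 8));
}

/* Write a little-endian 16-bit value, one byte at a time. */
static void saveg_write16(FILE* f, int16_t value)
{
    fputc(value & 0xFF, f);
    fputc((value >> 8) & 0xFF, f);
}
```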
  23. So, uh, what is it exactly that you expect AVX to do? Simulation? Rendering? What grounds do you have for listing anything with AVX as impossible, especially considering AVX is explicitly a superset of SSE 4.2 and is at least as capable as that instruction set?
  24. Kernel-mode anticheat is a huge nope

    "Ironic hyperbolic response gets serious hyperbolic response on internet forum. News at 11"