Everything posted by GooberMan

  1. How about a before and after of the Pi 4 running a 1080p backbuffer? This is the SDL renderpath: Aaaaaand the GL renderpath: It's a rock solid 60 FPS with Doom 2. Of course, you need to run it fullscreen. The windowed compositing on Ubuntu at least is atrocious. But there's more than enough there to show that a Switch can handle the IWADs at 60 FPS with Rum and Raisin Doom.

    There is one oddity though: black rendering as not-black. I've seen this somewhere before, and on the titlescreen too. I just cannot remember where. Will have to have a think about what might be causing it. You can check the GitHub log for things the Pi didn't like. This will just be another one of those things.

    EDIT: And yes. Confirmed to be a precision issue: depending on how the hardware feels, the lookup overflows to the next palette entry. TITLEPIC actually uses the blue colour ramp 240-247, so the black 247 is coming out as 248 when it tries to sample from the 8-bit texture. There are various ways to deal with it, and I'll settle on one of them in the near future.

    EDIT2: Yeah, so depending on GLES version your highp floats could be half floats instead. As in, 16-bit floats. Which leads to a complete precision breakdown when you're trying to get palette entry 247 and higher. So we solve this in two steps: since 0-255 is being mapped into 0-1, we bring that mapping back to where we want it by dividing by 256 and multiplying by 255 (represented by the 0.99609375 constant); and since that breaks precision elsewhere, we nudge the lookup by 1/1024. Bit dodgy, but you've just gotta think like a half-precision floating point to get to something precise enough for your needs. Either way, everything renders perfectly on the Pi now.
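    Mirrored on the CPU, the fix works out to something like the following. This is a minimal sketch to sanity-check the constants - the real thing lives in a GLES fragment shader, and the function names here are illustrative rather than the port's actual code.

      #include <cstdint>
      #include <cstdio>

      // A normalised 8-bit texture sample returns index / 255.
      float SampleBackbuffer( uint8_t paletteindex )
      {
          return paletteindex / 255.0f;
      }

      // Convert that sample into a lookup coordinate for a 256-entry palette.
      float PaletteCoordinate( float backbuffersample )
      {
          // 255/256 remaps the sample so that palette entry i lands on i/256,
          // the left edge of its texel in the palette texture.
          constexpr float remap = 0.99609375f;
          // Sitting exactly on a texel edge is what falls apart under
          // half-precision floats, so nudge a quarter texel (1/1024) inwards.
          constexpr float nudge = 1.0f / 1024.0f;
          return backbuffersample * remap + nudge;
      }

      int main()
      {
          // Entry 247 (the TITLEPIC black that was sampling as 248) now resolves
          // strictly inside texel 247, i.e. the range [247/256, 248/256).
          float coord = PaletteCoordinate( SampleBackbuffer( 247 ) );
          printf( "entry 247 -> %f (texel %d)\n", coord, (int)( coord * 256.0f ) );
          return 0;
      }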
  2. My laptop is also driving a 4k 65 inch HDR display. HDR off though, pretty sure this hardware doesn't support it. It does support 120Hz output though, but only at 1080p. But yeah, I'll have to work out how to get good performance metrics for dropping to kernel mode out of this thing without telling everyone interested in testing to run something like Superluminal and profile every session. Not only do you have bandwidth to worry about, but the OS is going to try to switch your active threads to idle every 4 milliseconds on Windows (pretty sure it's similar on Linux). So if we can line up the scheduling so that the OS would normally try to take focus away while it's down in kernel mode drivers transferring over the PCIe bus, then we will get a far more stable framerate.
  3. Allow me to give you an anecdote from my career, nearly 13 years ago now. I was working on Game Room, a commercial emulator service for Microsoft platforms. Now, it was certainly cool getting a Commodore 64 emulator running and both listening to Monty on the Run on a 360, and programming BASIC with a chatpad on the 360... but that emulator never saw release so that's just a cool moment.

    What I actually want to talk about is a bug that I had to track down. Microsoft's QA was adamant that we were getting choppy performance on a specific hardware configuration. We couldn't reproduce it, so we actually duplicated the hardware and software installs. Still couldn't get it. But Microsoft wouldn't waive the bug. End result, the EP put me on the problem. I really had no idea where to start on this one. Can't reproduce it on the exact hardware and software configuration? Could be anything. Failing video card? Nah, they'd already replaced it with an identical one. The hardware was entirely ruled out, so it had to be something with the software. Drivers? Identical.

    So I did what every average programmer does, and started googling about it. I really can't imagine googling for stuff these days, there's just SO. MUCH. JUNK. at the top of search results. Back then, it took me an hour of crawling the web to finally come across something that made me go "No, it couldn't be." I went over to the test machine. Opened up the power configuration settings. Put it on low power mode. Loaded up Game Room. Bam. Finally, we reproduced the bug. So I updated the bug report saying that the only way we could get this bug to happen was by forcing our identical hardware and software machine into low power mode. And I heard back from the EP the next day that Microsoft had decided to quietly waive the bug.

    Now, this little anecdote is a lead-in to the next point. Because my main development machine is a laptop. The i7 Skylake processor I refer to. Good CPU for the time... but afflicted with a GeForce 960M GPU. This was the year before they started putting desktop GPUs into laptops, and the performance of the 960M in benchmarks was about 10% lower than the original Xbox One's GPU. But never mind that. It's a laptop. Which means the primary workhorse GPU is the Intel Integrated GPU, and the GeForce only gets used by request. Which means that its normal state is about this:

    Now, that bit I've highlighted will be a bit confusing unless you know how to read it. It's basically saying that it is a PCIe 3.0 capable card with 16 lanes, but is currently running at PCIe 1.1 capabilities. That's power settings for you. Not only will this be incredibly common, but it's good for example purposes here. PCIe 1.x at 16 lanes does give you bandwidth of 4 gigabytes per second. Which means that you could hit 4K 120FPS with a 32-bit software renderer... if the system is doing literally nothing else but blindly pushing data over the PCIe bus. Which means ring 0 mode with the CPU doing nothing while DMA does its thing. Have your normal Windows session running? Not a chance. Not while there are other things sharing the PCIe bus (sound cards, M.2 NVMe drives are becoming more common, etc) and you're multitasking things like Discord and web browsers and what have you. Theoretical maximums like that are basically as useful as saying that a small and cheap hatchback car right off the factory line can easily hit 300km/h if you drop it from a plane and it approaches terminal velocity. It requires a very special set of circumstances to hit that speed.
    Did you know that Nvidia GPUs only need 8 PCIe lanes? Or that AMD cards only need 4? If you get one running at the full 16 then excellent, great job, but there's a chunk of people out there that will have lower bandwidth by default because of this, even if they have a PCIe 3 card. Power settings can ruin the day. The number of lanes in use can ruin the day. PCIe 3 16x has a maximum bandwidth of 16 gigabytes a second. tl;dr is that PCIe 3 4x (ie minimum AMD) is as good as PCIe 1 16x. Not good enough to handle 4K 32-bit 120FPS data transfers.

    But let's specify exactly why I said the bandwidth isn't enough. Because the theoretical maximum bandwidth per second does actually matter. Here's the thing: if we're playing Doom, we're running a simulation. And then we generate a visual representation of that so that the user can see and interact with the simulation. If we want that to hit 120FPS, we need to be doing all that in an 8.333 millisecond slice. A rule of thumb in the industry is to aim a millisecond slower than the target - performance spikes happen, so give it some wiggle room. In the case of Doom here, we can see that 4-thread rendering on a Doom 2 map will take over 4 milliseconds. Let's call it 4.5 milliseconds. That's already over half of our timeslice. Add in the simulation and everything else that goes on in a simulation frame, and we're really getting towards breaking that 7.333 millisecond barrier. Now, there are tricks like entirely kicking rendering off the main thread to get a good chunk of time back (at the expense of introducing input lag). But realistically I want my frame uploaded and displayed in 1 millisecond at most once I'm done rendering.

    And this is where faster PCIe speeds come in handy. How long does it take to upload a 4K 32-bit buffer? Well, for PCIe 1 16x/PCIe 3 4x, that's 1/120th of a second. 8.333 milliseconds. Yeah, that's never going to be good enough. PCIe 3 16x? 1/480th of a second. About 2.1 milliseconds. Still uncomfortable. And it certainly wouldn't fly with the more complicated maps I've been testing with even if I kicked rendering off the main thread. PCIe 4 16x is the first spec that hits my goal at a hardware level. (There's a quick sanity check of these numbers in the sketch at the end of this post.) So just like everything else with Rum and Raisin Doom - throw money at the problem and it's solved. Render with more threads. Buy a higher speed bus. The world is your oyster if you can afford oysters for dinner every night.

    I'm not looking to throw money at the problem though. I've mentioned that my primary GPU on my laptop here is the Intel integrated. The screenshot at the top of the post is using a 4K 8-bit backbuffer decompressed with a pixel shader on the GPU. As you can see there, it's knocked a huge amount of time off the previous 4K shot. 8 to 10 milliseconds on average. I didn't upload a shot of the 960M running that scene with 4K 32-bit backbuffers because it actually runs slower than the Intel Integrated. This will likely be very common on laptops where the discrete GPU is not in charge of the show. It's the Intel chip that's responsible for displaying the end product. So to use the GeForce, we need to upload the buffer to the GPU to create the final presentation and then the Intel needs to get it back to display on the monitor/over HDMI. Not a good setup at all. As it turns out though, reducing that bandwidth to 4K 8-bit is enough to finally push performance in favor of the GeForce. It's still an unstable framerate. As I've discussed, I'm not hitting my targets in other areas just yet.
    And the Intel GPU running the show for the system means that there's going to be a second bottleneck there that I can't control. But hey. Anyone on a desktop with a comparable CPU and a similarly crap GPU will get better performance than me as long as it's running in PCIe 3 16x mode. Tomorrow, I'll get the latest results running on the Raspberry Pi. Speaking of low-powered low-bandwidth systems.
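    And since I promised a sanity check of the upload arithmetic: the sketch below just multiplies the numbers out. Theoretical bus figures only - real transfers lose a chunk to protocol overhead and bus sharing, which is the whole point of this post.

      #include <cstdio>

      int main()
      {
          constexpr double frameBytes  = 3840.0 * 2160.0 * 4.0; // 4K at 32 bits per pixel
          constexpr double pcie1x16    = 4.0e9;                 // ~4 GB/s, same ballpark as PCIe 3 4x
          constexpr double pcie3x16    = 16.0e9;                // ~16 GB/s
          constexpr double frameBudget = 1000.0 / 120.0;        // 8.333 milliseconds

          printf( "4K 32-bit frame:      %.1f MB\n", frameBytes / 1.0e6 );
          printf( "Bandwidth at 120FPS:  %.2f GB/s\n", frameBytes * 120.0 / 1.0e9 );
          printf( "Upload on PCIe 1 16x: %.2f ms (budget %.3f ms)\n", frameBytes / pcie1x16 * 1000.0, frameBudget );
          printf( "Upload on PCIe 3 16x: %.2f ms\n", frameBytes / pcie3x16 * 1000.0 );
          return 0;
      }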
  4. Not the first time I've done it either. This method goes back to the early days of Xbox 360 development. Also very likely, I'm never happy with SDL any time I look at it so the more I move over to a hand-written backend the better. Consider this though: 4K with palette decompression on the GPU will use essentially the same bandwidth as my 1080p equivalent buffer there that's hitting 120FPS.
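    In fact, "essentially the same bandwidth" works out to be exact if you take 1080p as a literal 1920x1080 at 32 bits per pixel (the port's "equivalent" buffer dimensions differ a little, so this is an approximation of the real buffers). A trivial check:

      #include <cstddef>

      constexpr size_t FourK8Bit   = 3840 * 2160 * 1; // palettised, 1 byte per pixel
      constexpr size_t TenEighty32 = 1920 * 1080 * 4; // full colour, 4 bytes per pixel

      static_assert( FourK8Bit == TenEighty32, "both are 8,294,400 bytes per frame" );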
  5. https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.2.2

    So. Short story, I went to London late July/early August and returned with the rona in tow. That took me out for a few weeks, and then I couldn't get back into the swing of things. And then something happened today. A new mod release on the Unity port. I knew that all mapsets/mods had been converted to actually be IWADs instead of PWADs, so I decided to get the IWAD version running in Rum and Raisin. First things first is to get the IWAD. Which you'll find at %UserProfile%\Saved Games\id Software\DOOM 2\WADs\17 on your Windows machine. Take the file called 17 from inside that folder, copy and rename it to whateveryouwant.wad, and then load it up with the -iwad command line parameter.

    There were a few things that I had to fix to get it up and running. Importantly:
    • None of this works without the -removelimits command line parameter.
    • IWADs with a DEHACKED lump now autoload it.
    • Widescreen asset support was added. Tested with Doom 1 (shareware, registered and ultimate), Doom 2, TNT, and Plutonia; and with both Unity's 16:9 assets and Nash's 21:9 assets. Also tested in 4:3, 16:9, and 21:9 resolutions.

    That was pretty much all I had to do to get things up and running. I assume other ports should handle the IWAD just fine, but there's still so much of this codebase that is Chocolate.

    On the subject of widescreen assets. There is now an optional download, the WidePix Pack, linked in the release. These are just WAD files that I look for when -removelimits is specified. If the files are there, they're loaded. If they're not, then you continue on your merry way. I basically dumped Nash's lump files from GitHub into separate WADs. Literally didn't need to do any more than that, they're all set up with correct offset markers. Great stuff. I decided to keep them separate for the time being, but it's easy enough to use them. Note, however, a couple of IWAD gotchas with this:
    • If you're going to use the Unity doom.wad and doom2.wad (found in the DOOM_Data\StreamingAssets or DOOM II_Data\StreamingAssets subfolders), you need to use -removelimits since widescreen support is a limit-removing feature.
    • If you want to only use those Unity widescreen assets, you either need to not have the .widetex files alongside Rum and Raisin Doom or use the -nowidetex command line parameter.
    • If you're using an IWAD other than the id Software IWADs (say, for example, Harmony) then you will need to use -nowidetex since I haven't worked out a good way to automate detection of non-id IWADs.

    Anyway. I also did some work before I got the rona to remove hardcoded limits from the renderer entirely. As such, you can now select from some 4K backbuffer resolutions. Right now, if you turn on the render graphs with 4K backbuffers running you will notice something weird... So it's clearly rendering in enough time to hit 120 FPS, but it's only hitting 45 FPS (and unstable at that). What gives? The reason is pretty simple: the Chocolate Doom code that uploads the backbuffer to the GPU does so by converting to 32-bit on the CPU before uploading. So not only do you have that cost, but you have the cost of the PCIe bus to deal with. 4K backbuffer, multiplied by 32 bits, multiplied by 120 times a second, equals LOLNO not enough physical bandwidth. Perfectly fine for Vanilla sized buffers, but not for 4K.
    I'm still in the middle of refactoring things to upload the palette and the backbuffer without modification to the GPU so that I can run a shader and decompress to full colour. The bandwidth requirements then become realistic, so 4K at 120FPS will be quite reasonable. But I'll save that for the next minor version release instead of a revision release. Still. This is the same scene with a 1080p equivalent backbuffer. I like how the render thread times scale linearly against what is actually four times the output pixels.
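    For the curious, that refactor boils down to something like the sketch below, assuming a GL 3 style context. The texture formats, function names, and shader body are illustrative, not the port's actual code.

      #include <SDL_opengl.h>
      #include <cstdint>

      // One single-channel texture for the palettised backbuffer, and a 256x1
      // RGBA texture for the current palette. Created once at startup.
      void CreateTextures( GLuint& backbuffertex, GLuint& palettetex, int width, int height )
      {
          glGenTextures( 1, &backbuffertex );
          glBindTexture( GL_TEXTURE_2D, backbuffertex );
          glTexImage2D( GL_TEXTURE_2D, 0, GL_R8, width, height, 0, GL_RED, GL_UNSIGNED_BYTE, nullptr );
          glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST );
          glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST );

          glGenTextures( 1, &palettetex );
          glBindTexture( GL_TEXTURE_2D, palettetex );
          glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA8, 256, 1, 0, GL_RGBA, GL_UNSIGNED_BYTE, nullptr );
          glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST );
          glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST );
      }

      // Per frame: one byte per pixel plus 1KB of palette goes over the bus,
      // then a fullscreen pass does the palette lookup on the GPU.
      void UploadFrame( GLuint backbuffertex, GLuint palettetex,
                        const uint8_t* backbuffer, const uint8_t* palette_rgba,
                        int width, int height )
      {
          glPixelStorei( GL_UNPACK_ALIGNMENT, 1 ); // single-byte rows, no padding assumptions

          glBindTexture( GL_TEXTURE_2D, backbuffertex );
          glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RED, GL_UNSIGNED_BYTE, backbuffer );

          glBindTexture( GL_TEXTURE_2D, palettetex );
          glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, 256, 1, GL_RGBA, GL_UNSIGNED_BYTE, palette_rgba );
      }

      // The decompression shader is a single lookup per output pixel,
      // using the same remap constants discussed earlier in the thread.
      static const char* const decompress_frag =
          "#version 130\n"
          "uniform sampler2D backbuffer;\n"
          "uniform sampler2D palette;\n"
          "in vec2 uv;\n"
          "out vec4 colour;\n"
          "void main()\n"
          "{\n"
          "    float index = texture( backbuffer, uv ).r * 0.99609375 + ( 1.0 / 1024.0 );\n"
          "    colour = texture( palette, vec2( index, 0.5 ) );\n"
          "}\n";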
  6. I've done way more than that. I wrote a C header parser specifically using wxWidgets as a testbed so that I could automate bindings to other languages. (Protip: Never write your own C header parser, even if the reasons seem sound). But hey, I'm sure support is terrible on TempleOS or whatever it is you use, so you have a point there. Probably.
  7. Number one: There are cross-platform, GPL-compatible UI libraries that have been around for 30 years, like https://en.wikipedia.org/wiki/WxWidgets And number two: Expect to see a frontend in Rum and Raisin Doom in the future using ImGui - another cross-platform UI framework that proves your complaint wrong once again. These threads would be so much shorter and not endless if you stopped continually trying to gatekeep everything.
  8. I won't be attempting this until I've simplified the renderer some more. As mentioned above, sprite clipping is next in my sights. The flat rasteriser I wrote will also be the only rendering routine when it moves over to the GPU, no more wall rendering code. Which will also mean simplifying how the wall rendering is set up - most of the time right now is spent outside of the lowest rendering loop, just setting up things like wall offsets etc. But needless to say: I don't think the original renderer would be worth porting over to compute code. I know the kind of code that gets all those effects you see in Returnal; the simpler your code, the better it is on the GPU. And, as is displayed here, the better it is on the CPU. tl;dr - I've still got work to do.
  9. GooberMan

    >60 HZ broken in the Unity Port

    If you're going to be talking about refresh rate issues, you need to be talking about what hardware you're running. Doing so without mentioning it is borderline pointless. For example: I just got two different experiences running on my Intel® HD Graphics 530 and my GeForce 960M running on the same i7-6700HQ system. There are some weird timings going on though. Best results have been on the 960M. I've been getting it to run up to 120Hz on my 120Hz panel with vsync on, but it does require setting the frame limiter to off (ie 0). You can turn vsync off, but then you're at the mercy of whatever frame was last rendered so it can lead to jerky results. With vsync on though, it slowly drifts to timings where it gets locked to 60Hz. This may not be the easiest thing to fix, given that Unity Doom doesn't exactly have access to Unity's internal frame sync code to make it behave a bit better in borderless fullscreen/just plain old windowed modes. In every case though, toggling between fullscreen and windowed has reset the behavior. This suggests to me that the Windows scheduler is at least partly to blame. Otherwise, needless to say: the gameplay itself is capped at 35 actually, just like with every other source port.
  10. It initialises GL 3 or higher, yes. It's more for planned features than anything it requires right now. Buuuuuut having said that, I have stated that this is about getting the renderer to run efficiently on modern systems. So yeah, Core 2 Duo, that predates the first i7. What I've done so far should theoretically work just fine on that line of processors, but I'd definitely want to look at the threading performance with how I've got things set up thanks to the way-less-sophisticated cache. And it's an Intel integrated GPU there so it would certainly not have the capability for what I have in mind.
  11. Just poking my head in to say that MAP30 does things to my port. Short story: I treat every texture and flat as a composite and cache them into memory. Cool. But I also convert each texture to every light level before the level starts (saving a COLORMAP lookup at render time), which means I use 32 times more memory than a normal software renderer. End of the day: I chew up 4 gigs of memory. Had to do work on the zone allocator in fact, which was still using 32-bit signed integers for memory tracking instead of size_t. But I got it running at least. Doing the math, that means that a normal software renderer would need 128 megabytes of memory to keep everything loaded all at once. And after a bit of googling, it turns out that there were 486s back in the day that had 128 megs. So if you were super rich/freeloading off a work machine/etc you could have played this map back in the day without incurring constant cycling of textures from memory. I also found an instance of the midtex bleed bug, which eluded me for a while before I fixed it - and it made me unreasonably happy to see it in Chocolate Doom and not Rum & Raisin. Anyway, this mapset is looking like a beast from what I've seen. Looking forward to the full release. Probably won't have much time to do a proper playthrough and report bugs at the moment; as you might gather from this post, my main interest is in making sure the thing actually runs.
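    The up-front light-level baking works out to something like the sketch below. The structure and function names are illustrative, not the port's actual code - the point is that the COLORMAP remap happens once per texel per light level instead of once per rendered pixel.

      #include <cstdint>
      #include <vector>

      constexpr int NUMLIGHTCOLORMAPS = 32;

      struct CompositeTexture
      {
          int32_t width;
          int32_t height;
          std::vector< uint8_t > texels;   // palette indices, width * height of them
      };

      // colormaps points at 32 contiguous 256-entry remap tables from the COLORMAP lump.
      std::vector< CompositeTexture > BakeLightLevels( const CompositeTexture& source,
                                                       const uint8_t* colormaps )
      {
          std::vector< CompositeTexture > baked;
          baked.reserve( NUMLIGHTCOLORMAPS );

          for( int light = 0; light < NUMLIGHTCOLORMAPS; ++light )
          {
              const uint8_t* remap = colormaps + light * 256;

              CompositeTexture& out = baked.emplace_back();
              out.width  = source.width;
              out.height = source.height;
              out.texels.reserve( source.texels.size() );

              // The lookup the column/span renderers would normally do per pixel
              // happens exactly once here, per texel, per light level.
              for( uint8_t texel : source.texels )
              {
                  out.texels.push_back( remap[ texel ] );
              }
          }

          // 32 copies of every texture and flat: hence the 32x memory multiplier
          // (and the 4 gig MAP30 figure) mentioned above.
          return baked;
      }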
  12. This is intentionally broken until I finish player interpolation.

    So here's how it currently works. For all mobjs, the previous and current positions are interpolated according to where the display frame sits in between simulation frames. This does technically mean that you are guaranteed to see past data, except for one out of every <refresh rate> frames where a tic lines up exactly with a second. The one exception to this rule is the player. The angle is not interpolated at all. Instead, the most recent angle is used. And for each display frame, we peek ahead into the input command queue and add any mouse rotation found. This was the quick "it works" method that let me get everything up and running.

    Here's how it should work. Exactly as above, except do for the display player's movement exactly what we do for mouse look. The full solution will require properly decoupling the player view from the simulation. At that point, it will be the decoupler's responsibility to create correct position and rotation for the renderer instead of the renderer doing the job. There's some more setup I need to do for that to work correctly, but it will also cover all keyboard inputs once working.

    It may not seem like much to anyone, but I really got sensitive to input lag when implementing 144Hz support in Quantum Break. I tested a few other ports to see how they feel too, in fact. prBoom has the worst feel by far to me, with every other port tried feeling about the same. My intention is to ensure there's basically zero effective lag between when input is read and when it is displayed (ie read on the same frame you render) and thus remove input error considerations from skilled play. There's also one thing I've realised after playing Doom at 35Hz for so long: it cannot be overstated how much of an effect frame interpolation has had on raising the Doom skill ceiling.
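    The interpolation half of that is simple enough to sketch in Doom's 16.16 fixed-point terms. Field and function names below are illustrative rather than the port's actual code; the display player's angle is the exception described above, taking the latest angle plus any peeked mouse rotation instead of a lerp.

      #include <cstdint>

      typedef int32_t fixed_t;               // 16.16 fixed point
      #define FRACBITS 16
      #define FRACUNIT ( 1 << FRACBITS )

      inline fixed_t FixedLerp( fixed_t from, fixed_t to, fixed_t percent )
      {
          // percent is 0..FRACUNIT: how far the display frame sits between the
          // previous simulation tic and the current one.
          return from + (fixed_t)( ( (int64_t)( to - from ) * percent ) >> FRACBITS );
      }

      struct mobjsnapshot_t
      {
          fixed_t prev_x, prev_y, prev_z;
          fixed_t curr_x, curr_y, curr_z;
      };

      // Every display frame, each mobj gets drawn somewhere between its last two
      // simulated positions - i.e. you are always looking at slightly-past data.
      void InterpolatePosition( const mobjsnapshot_t& mobj, fixed_t percent,
                                fixed_t& out_x, fixed_t& out_y, fixed_t& out_z )
      {
          out_x = FixedLerp( mobj.prev_x, mobj.curr_x, percent );
          out_y = FixedLerp( mobj.prev_y, mobj.curr_y, percent );
          out_z = FixedLerp( mobj.prev_z, mobj.curr_z, percent );
      }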
  13. GooberMan

    Scientists prove that AAA gaming sucks.

    This game will kick your ass, ya filthy casul. The entire industry is expecting Elden Ring to take the statue next year.
  14. GooberMan

    Scientists prove that AAA gaming sucks.

    Having worked on a BAFTA-award-winning AAA game with no microtransactions and a reputation for requiring a high degree of skill, I feel I should point out that the AAA game likely to win the same award next year can be described in the same manner - and that it has many players saying how rewarding it is to progress after grinding out better weapons and stat points to get past their progression blocker. You basically can't release a mobile game without the practices highlighted here - unless you expect to make $bugger-all from your game. These practices continue to make money. The gnashing of teeth that they suck is a sentiment that I share, and entirely expected from a community based around a nearly-29-year-old game where satisfaction for most comes from freely-available user-made content.
  15. Maybe it's time for a Planisphere update too. Because I've been seeing threads drop below 9ms lately. But maybe more impressively: The Given. At 2560x1200, the original screenshots showed >40ms per render thread back on July 6. That's basically playable in software now. Even more so if you drop it to a Crispy-style resolution. Still todo: Fixing the load balancing code to not pile on the last thread when threads > 4. But I'm chasing something else right now: vissprites and masked textures. I decided to open up Comatose yesterday (runs, but seems to require some Boom line types so you can't leave the first room without noclipping). It's something of a dog on software renderers. Disproportionately on sprite draws. Running -skill 0 shows very reasonable render times. So I wanted to know what was going on. Threw some more profile markers in to see where the time was going. I'm seeing two problems here: 1) Sprite clipping is awful, it does a ton of work just to render nothing. 2) Sprite clipping is awful, it does a ton of work and then when it does draw stuff the rendering routines aren't ideal but aren't really the performance bottleneck here. So I'm currently grokking how sprite clipping works. I already have ideas on what I want to do to it, but I need to understand a few more bits of the code before I can dive in and do what I want with it.
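    For anyone wondering what "profile markers" means in practice: the general shape is a scoped timer like the sketch below. Illustrative only - the port has its own profiling setup feeding the in-game render graphs, and the names here are made up.

      #include <chrono>
      #include <cstdio>

      class ScopedProfileMarker
      {
      public:
          explicit ScopedProfileMarker( const char* name )
              : name( name )
              , start( std::chrono::high_resolution_clock::now() )
          {
          }

          ~ScopedProfileMarker()
          {
              auto end = std::chrono::high_resolution_clock::now();
              double ms = std::chrono::duration< double, std::milli >( end - start ).count();
              // A real implementation would push to a per-thread buffer that the
              // in-game graphs read from, rather than printing.
              printf( "%s: %.3fms\n", name, ms );
          }

      private:
          const char* name;
          std::chrono::high_resolution_clock::time_point start;
      };

      #define PROFILE_CONCAT_IMPL( a, b ) a##b
      #define PROFILE_CONCAT( a, b ) PROFILE_CONCAT_IMPL( a, b )
      #define PROFILE_SCOPE( name ) ScopedProfileMarker PROFILE_CONCAT( marker_, __LINE__ )( name )

      // Dropped around suspect sections, e.g. the sprite clipping pass:
      //   PROFILE_SCOPE( "SpriteClipping" );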
  16. Been going through the column rendering routines to get speed back on UI/sprite/etc elements. You know what that means: It's glitchcore time! Also I guess this shows off frame interpolation and all that. Some issues with SDL being unable to detect the highest refresh rate a duplicated display is running at mean I can't get 120FPS footage just yet. But it'll come.
  17. Latest release: 0.2.1 https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.2.1 Still the same deal as the last release, it's semi-supported. I want limit-removing maps that break this port so I can work out why and tighten it up. This release has some null pointer bug fixes, and some oddities I encountered when trying to -merge Alien Vendetta instead of -file. The big one y'all will be interested in though: I decided it was well past time I implemented frame interpolation. Now it hits whatever your video card can handle. As it's borderless fullscreen on Windows, it'll be limited to your desktop refresh rate.
  18. And also the ARM used in the Raspberry Pi. But I think I'm going to do a deep dive on how to handle division anyway. You can turn on a faster division at compile time on ARM; and there are also things like libdivide. I don't think it'll be a massive win at this point, but it'll still shave a bit of time off. My next focus on ARM though is just what in the heck is going on with thread time consistency. Only the final thread is performing in a consistent manner; every other thread wildly fluctuates in execution time. I can eliminate weirdness with the load balancing algorithm too, based on screenshots. With load balancing: And with no load balancing. Getting those to level out and not fluctuate should let the load balancer work better, and bring the total frame time down.
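    For reference, the libdivide approach looks roughly like the sketch below: pay a one-off setup cost per constant divisor, and every divide in the hot loop becomes multiplies and shifts, which is handy on ARM chips without a fast integer divide. The function here is a made-up example, not code from the port.

      #include <cstdint>
      #include <vector>

      #include "libdivide.h"   // https://libdivide.com

      void ScaleValues( const std::vector< int64_t >& numerators,
                        int64_t divisor,            // must be non-zero
                        std::vector< int64_t >& out )
      {
          // One-off: analyse the divisor and build the multiply/shift recipe.
          libdivide::divider< int64_t > fastdivisor( divisor );

          out.clear();
          out.reserve( numerators.size() );
          for( int64_t value : numerators )
          {
              // operator/ is overloaded to use the precomputed recipe instead of
              // a hardware divide instruction.
              out.push_back( value / fastdivisor );
          }
      }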
  19. https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.2.0 Release is out. Preliminary support is in for using flats and textures on any surface. Which means Vanilla Sky renders as intended. But it's not perfect. It has rendering routines based on powers-of-two. And MSVC absolutely cannot deal with the template shenanigans I'm doing, it takes half an hour to compile that file now topkek. Clang just does not give a fuck, even when compiling on my Raspberry Pi. Still got some work to do though. Vanilla Sky isn't exactly playable thanks to bad blockmap accesses. Still, this 0.2.0 release is the "break my port with some maps" release.
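    As a rough illustration of the power-of-two angle (a guess at the general shape, not the port's actual code): bake the texture height into a template parameter and the wrap in the inner loop becomes a compile-time mask instead of a divide or modulo.

      #include <cstdint>

      typedef int32_t fixed_t;   // 16.16 fixed point
      #define FRACBITS 16

      template< int32_t Height >   // Height must be a power of two (64, 128, ...)
      void DrawColumn( uint8_t* dest, int32_t destpitch, int32_t count,
                       const uint8_t* source, fixed_t frac, fixed_t fracstep )
      {
          static_assert( ( Height & ( Height - 1 ) ) == 0, "power-of-two heights only" );
          constexpr fixed_t mask = Height - 1;

          while( count-- )
          {
              // Wrapping the texture coordinate is a single AND here.
              *dest = source[ ( frac >> FRACBITS ) & mask ];
              dest += destpitch;
              frac += fracstep;
          }
      }

      // One instantiation per supported height (times light levels, times flag
      // combinations, and so on) is the kind of combinatorial explosion that
      // makes a compiler grind on a single file.
      template void DrawColumn< 128 >( uint8_t*, int32_t, int32_t, const uint8_t*, fixed_t, fixed_t );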
  20. Getting dangerously close to being a real port now... I'll do a 0.2.0 release once I've done a bit more work on the limit-removing side. I was incidentally pointed towards Vanilla Sky on Discord the other day. It needs a limit-removing port. And, yeah, Rum and Raisin doesn't break a sweat playing it. It's the complete opposite of Planisphere in that regard - it's a big city map where the entire map isn't visible half the time. What it does do, however, is mix flats and textures. Hacking something together isn't a huge task. Both are already stored as full textures in memory, although I will need to transpose flats and update code to match. And it's easy enough to use the index values there to indicate whether to look for a flat or a texture. But as I'm converting the code to C++, I can do it better. So I'll do that, then call it a build.
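    The flat transpose mentioned above is the simple part: flats are stored row-major (which suits span rendering), while composited wall textures are laid out as columns, so using a flat on a wall means flipping it to column-major. A sketch, with illustrative names:

      #include <cstdint>

      constexpr int32_t FLATSIZE = 64;   // vanilla flats are always 64x64

      void TransposeFlat( const uint8_t* rowmajor, uint8_t* columnmajor )
      {
          for( int32_t y = 0; y < FLATSIZE; ++y )
          {
              for( int32_t x = 0; x < FLATSIZE; ++x )
              {
                  // Column-major: all 64 texels of column x end up contiguous,
                  // matching how composited wall textures are stored.
                  columnmajor[ x * FLATSIZE + y ] = rowmajor[ y * FLATSIZE + x ];
              }
          }
      }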
  21. There's nothing technically standing in the way of doing Switch homebrew myself. Professionally, however, is a different matter. I'm a professional in the video game industry, and that means I need to play by those rules. Having said that, it's unlikely I'll ever actually use a Switch devkit myself. But still.
  22. This one's just for fun - I've used fraggle's text screen library to render the traditional vanilla loading screen messages.
  23. I got curious and decided to see how much of a performance difference it made. Change some defines in m_fixed.h so that rend_fixed_t is just an alias for fixed_t (itself an alias for int32_t) and: About 10 milliseconds saved on that scene... at the expense of reintroducing rendering bugs that have no good solution at 32-bit short of using 32-bit floats. Still, it does confirm to me that there's value in my intended approach - providing a renderer at 32-bit precision by default, and only using the 64-bit precision renderer when -removelimits is running. I'm 100% curious as to what some of the worst-performing 100% vanilla maps are now. One of my stated goals here is to get Doom's renderer running at 1080p on a Switch, and since I can't exactly do Switch homebrew, the Raspberry Pi is the next best thing. I suppose the obvious test WADs will be whatever's been released on the Unity port, since those currently do run on a Switch. EDIT: Well, how about NUTS.WAD? Yeah, I definitely need the 44.20 renderer for that.
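    For anyone wondering what that define switch looks like, it's roughly the shape below. This is not the actual contents of m_fixed.h, just an illustration with made-up names: toggle one define and the renderer's fixed point drops from the 64-bit type (the 44.20 layout joked about above) back to vanilla's 32-bit 16.16 type.

      #include <cstdint>

      typedef int32_t fixed_t;               // vanilla 16.16 fixed point

      #define RENDER_64BIT_FIXED 1

      #if RENDER_64BIT_FIXED
          typedef int64_t rend_fixed_t;      // extra headroom, e.g. a 44.20 split
          #define RENDFRACBITS 20
      #else
          typedef fixed_t rend_fixed_t;      // alias straight back to 16.16
          #define RENDFRACBITS 16
      #endif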
  24. Loaded up the ol' Pi 4 and ran the latest on it. That's with a pixel count slightly higher than 1080p. The IWADs should all be able to run at 60, just need to focus on some usability issues and do that "wake threads" thing I was talking about. The real question though is "What about Planisphere???" Not great. Not terrible. But actually playable in a "This reminds me of playing Doom on my 486SX 33 back in the day" way. Another way to look at it - that's "GTA4 on consoles" performance territory. I dropped down to a 1706x800 renderbuffer and gave it a bit of a play. Not bad. Some scenes absolutely murder the Pi though: