About sqpat
-
FastDoom: DOS Vanilla Doom optimized for 386/486 processors
sqpat replied to Redneckerz's topic in Source Ports
I'm in the middle of rewriting R_DrawPlanes by hand. I came across this code and it seems so obvious in hindsight, but...

    while (t2 < t1 && t2 <= b2)
    {
        spanstart[t2] = x;
        t2++;
    }
    while (b2 > b1 && b2 >= t2)
    {
        spanstart[b2] = x;
        b2--;
    }

This code can just turn into a couple of memset-style fills (at the C level, or a REP STOSW at the asm level). You just need to calculate the count from the minimum of the two loop conditions. At the 32-bit level maybe even DWORDs can be written at a time (I think memset handles this by default?). For RealDOOM the change was not a huge performance gain, but it's at least measurable for me (0.1-0.2% fps), and I see the same code in the FastDoom repo.
-
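A rough C sketch of the fill idea from the post above, not the actual RealDOOM or FastDoom code. It assumes 16-bit spanstart entries; `fill_words` and `fill_spanstart` are made-up names. Standard memset writes one byte value, so the sketch uses a small word-fill helper, which is the job REP STOSW does in asm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store `value` into `count` consecutive 16-bit slots. */
static void fill_words(int16_t *dest, int16_t value, size_t count)
{
    while (count--)
        *dest++ = value;
}

/* Same effect as the two while loops; t2 and b2 are updated in place. */
static void fill_spanstart(int16_t *spanstart, int16_t x,
                           int *t2, int t1, int *b2, int b1)
{
    assert(spanstart != NULL);

    /* Loop 1 runs while t2 < t1 and t2 <= b2, so it stops at min(t1, b2 + 1). */
    int top_end = (t1 < *b2 + 1) ? t1 : *b2 + 1;
    if (top_end > *t2) {
        fill_words(&spanstart[*t2], x, (size_t)(top_end - *t2));
        *t2 = top_end;
    }

    /* Loop 2 runs while b2 > b1 and b2 >= t2 (the updated t2),
       so it stops at max(b1, t2 - 1). */
    int bottom_end = (b1 > *t2 - 1) ? b1 : *t2 - 1;
    if (*b2 > bottom_end) {
        fill_words(&spanstart[bottom_end + 1], x, (size_t)(*b2 - bottom_end));
        *b2 = bottom_end;
    }
}
```

The counts are computed once up front, so each loop becomes a single block fill; when a range is empty the `if` skips it, matching a while loop that never runs.
-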
FastDoom: DOS Vanilla Doom optimized for 386/486 processors
sqpat replied to Redneckerz's topic in Source Ports
Ahh, that makes sense. In my case I'm using so many instructions like XLAT, LODSB, STOSB, etc. that are already one-byte instructions. I'm not sure whether any of those work in protected mode with 32-bit addresses or segments.
-
FastDoom: DOS Vanilla Doom optimized for 386/486 processors
sqpat replied to Redneckerz's topic in Source Ports
I'm not super familiar with the addressing modes, but I took another look. I think the instruction is actually smaller if the offset fits in one byte (between -128 and 127). But if it needs even two bytes, the instruction uses the whole 32-bit offset. Here is what I'm seeing in my compiler output:

    mov [edi-40],al  -> 88 47 d8
    mov [edi-160],al -> 88 87 60 ff ff ff

I suppose for real 'min-maxing' you could do four of these ([EDI-120], [EDI-40], [EDI+40], [EDI+120]) for four consecutive pixels and then add 320 to EDI once for the next four pixels. However, you'd have to offset EDI by the right starting amount on entry into the unrolled loop, which might be tricky to figure out. This could potentially also work without the extra register: maybe adding the constant 320 to EDI once (7 byte instruction) is worth it to avoid four instances of the 6 byte mov (12 bytes saved in total).

Neat, didn't know about those. I sometimes store backup values in ES to avoid push/pop and going to memory on the 286 side. And in some other spots I turn off interrupts and use BP/SP, which is pretty gross but sure helps, too.
-
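To make the size rule concrete, here is a small C sketch that builds the bytes of `mov [edi+disp], al` for 32-bit code. The function name is made up for illustration. It only covers this one instruction form (opcode 88, ModRM with reg=AL and rm=EDI), and shows the displacement taking 0, 1 or 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode `mov [edi+disp], al` as 32-bit code.
   Writes up to 6 bytes into out and returns the instruction length. */
static size_t encode_mov_edi_disp_al(int32_t disp, uint8_t out[6])
{
    size_t n = 0;
    assert(out != NULL);
    out[n++] = 0x88;                     /* opcode: MOV r/m8, r8 */
    if (disp == 0) {
        out[n++] = 0x07;                 /* ModRM mod=00: no displacement */
    } else if (disp >= -128 && disp <= 127) {
        out[n++] = 0x47;                 /* ModRM mod=01: signed 8-bit displacement */
        out[n++] = (uint8_t)disp;
    } else {
        uint32_t d = (uint32_t)disp;
        out[n++] = 0x87;                 /* ModRM mod=10: full 32-bit displacement */
        for (int i = 0; i < 4; i++)
            out[n++] = (uint8_t)(d >> (8 * i));   /* little-endian */
    }
    return n;
}
```

With this, a displacement of -40 gives the 3-byte form 88 47 d8 and -160 gives the 6-byte form 88 87 60 ff ff ff. There is no 16-bit displacement form in 32-bit addressing, which is why anything outside -128..127 jumps straight to four displacement bytes.
-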
I went through Ultimate Doom a bit today. In order to make RealDOOM work with it, I needed to account for a higher maximum number of nodes and subsectors to support E4M9, which is bigger in those fields than any other currently supported map. It wasn't too bad - one particular segment of memory overran by about 32 bytes, which was annoying. I decided to detect at runtime whether the Ultimate Doom wad was loaded (kind of hacky) and shift some fields around based on that.

There are a lot of small bugs to fix - the menu doesn't display the episode 4 title, but it's selectable. Help screens and such are buggy across different wads and different versions of the exe, and the episode 3 and 4 level end screens crash. I'll have to test the Ultimate Doom finale screen too; I'm sure it's buggy because it's different. But otherwise it seems to be working for the most part.

I didn't know that the Doom source was set up so that all these different versions of the exe existed, and that the different game versions didn't work properly under a single exe. I'm going to try to remove as many version #ifdefs from the code as possible and have alternate behaviors run based on the detected wad. That way there doesn't need to be multiple exe versions to maintain. It will make the exe a tad bigger, but oh well.

Aside from that, there are 15-20 small bugs I'll work on over the next couple of weeks to hopefully make the game behave better.
-
FastDoom: DOS Vanilla Doom optimized for 386/486 processors
sqpat replied to Redneckerz's topic in Source Ports
I don't know how important prioritizing 386sx performance is for FastDOOM, or if you have looked into 386sx-specific rendering loops, but I was doing some reading on the 386sx prefetch queue and some of the render loops I saw in the FastDoom code came to mind. Since the 386sx bus is 16 bit, the prefetch queue is probably struggling to keep up with some of the bigger 386 instructions - especially ones with 32-bit offsets and such. Correct me if I'm wrong, but I think a line like

    mov [edi-(LINE-1)*80],al

results in a six-byte instruction, possibly causing a prefetch bottleneck. But a pair of instructions like

    add edi, ebx    ; register stores 80 constant value
    mov [edi],al

results in four bytes' worth of instructions. Of course, this also assumes you have a spare register around to shove the "80" constant into.
-
I have four EMS pages (64k) in conventional memory planned for it that I haven't used yet. Those four pages are meant to hold 3 sound effects and one music file playing at a time. Then there is a 2 MB minimum of total EMS memory that can be paged in. Once the wad contents are loaded into those EMS pages, it's not that slow (maybe several hundred to a thousand cycles worst case) to page in to conventional memory, and I will probably do that mid-interrupt if necessary as sound files larger than 16kb approach the 16kb boundary. But this probably happens only every several seconds, so it's not too expensive. (I only have 150-250kb free of that 2 MB of EMS left actually, so when sound gets added I'll probably make the application require 4 MB of EMS. I'm pretty sure there will be enough memory at that point.)

To be clear, it is the 640k (ish) conventional memory limit that I'm struggling with right now with Final Doom. Since I already have some set aside for sound, I'm not worried about that. But generally speaking, level data is always in conventional memory when it is needed for a specific code region. There is no paging going on, and if I were to add it, it would cause many page swaps per frame. I think the easiest option is to just forget about Final Doom. I could probably stretch some things to make it work if I were willing to make things a lot slower, but I'm pretty happy with Doom 1 + 2 support.
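The 16kb boundary handling could look roughly like this in C. This is only a sketch under assumptions: a sound is taken to occupy consecutive 16 KB logical pages, and `ems_locate` and `ems_chunk_crosses_page` are made-up names, not RealDOOM functions or EMS driver calls.

```c
#include <assert.h>
#include <stdint.h>

#define EMS_PAGE_SIZE 16384u   /* one EMS page is 16 KB */

typedef struct {
    uint16_t logical_page;  /* which 16 KB page of the sound holds this byte */
    uint16_t offset;        /* byte offset inside that page */
} ems_pos_t;

/* Split a byte position inside a sound into (page, offset). */
static ems_pos_t ems_locate(uint32_t byte_pos)
{
    ems_pos_t p;
    p.logical_page = (uint16_t)(byte_pos / EMS_PAGE_SIZE);
    p.offset = (uint16_t)(byte_pos % EMS_PAGE_SIZE);
    return p;
}

/* Nonzero if reading chunk_len bytes starting at byte_pos would run past
   the end of the current page, i.e. the next page has to be mapped first. */
static int ems_chunk_crosses_page(uint32_t byte_pos, uint16_t chunk_len)
{
    assert(chunk_len > 0 && chunk_len <= EMS_PAGE_SIZE);
    return (byte_pos % EMS_PAGE_SIZE) + chunk_len > EMS_PAGE_SIZE;
}
```

A mixer interrupt could run the second check before each chunk and only ask for a page swap when it returns nonzero, which for short mixing chunks is rare compared to how often the interrupt fires.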
-
I've put in a bit of work over the past couple of weeks and converted a lot of smaller functions in the render pipeline to ASM in the first couple of days. It was about a 0.5% fps gain, but the functions were pretty minor. Then it became apparent that I probably needed to move the texture caching code to ASM next, but I wasn't happy with the texture caching code yet. I've spent a good two weeks on it now, and while it's not necessarily any faster, it is a lot more stable, and Doom 2 map 30 is stable again.

Some improvements: I used to have to store multi-page textures (16384 bytes and up) in sequential EMS pages, but I no longer have that requirement. I also used to store patch textures and composite textures in separate caches, but now they are in the same combined cache, so there's less wasted space in the EMS pages and less code. Overall the textures probably waste less memory now. It mostly matters for large complex scenes (like Doom 2 map 30 for sure) where there are hundreds of KBs of textures on screen at once.

I have mentioned a few times that I think I have more memory than I know what to do with. I think I will try to increase the conventional memory texture space (what I call the L1 cache) from 64kb to 96kb. I am pretty sure I have the space for it, but I need to shuffle around memory locations to make room for it. If I can, I'd also like to make space for 96kb instead of 64kb of L1 sprite cache.

In order to make these changes though, I need to make some final big decisions about max memory sizes for various fields (sectors, nodes, etc.), and that probably means doing Final DOOM support now instead of later, since it represents some of the 'worst case' memory scenarios. So the ASM conversions and speed improvements are taking a small break after all.

Which means the plans are probably shifting towards:
Next release: Final DOOM support and try to fix all known bugs
Following release: Finally do all the render pipeline in asm
Following releases: Savegames and sound

EDIT: Okay, upon second look Final DOOM is too big. I need 300kb instead of 190kb in memory to store level data. I might have 30-40kb extra available, but not 100kb. So let's just replace occurrences of "Final DOOM" in this post with "Ultimate DOOM" and call that good enough.
-
Questions About Column Drawing Code in Doom Rendering
sqpat replied to Sonim's topic in Doom General Discussion
OK, I took a quick look at the topic... let me see if I can sum it up simply. Let's ignore windows, steps, etc. for now and talk about the simple wall case.

In DOOM, basically everything is straight up and down in the z dimension (height). When the BSP is traversed relative to the player's position/angle (camera), the visible walls ("lines/sidedefs") are generated and those lines are mapped to (camera-space, not world-space) X pixel ranges. Each of these "wall ranges" maps a single wall to a single x pixel range. I believe StoreWallRange will basically copy the relevant data into a drawseg, if I'm not mistaken. RenderSegLoop will render the drawsegs, and R_DrawColumn will render the columns in the general case.

Anyway, this wall will be at some angle relative to the player view. If you imagine a single wall in the distance, you can imagine a left and right end point of the wall - per pixel as you render left to right, the top of the wall, the bottom of the wall, and the texture stepping per y pixel will change. But they do so in linear fashion, interpolating between values based on the left and right end points of the wall.

I'm not looking at the code at the moment, but in a given wall range being rendered (RenderSegLoop), I believe topfrac may just be the starting y value for a column draw, and topstep may be the per-pixel delta of topfrac. Aside from that, the u (texel x) is static per column rendered and is passed into R_GetColumn to get the texture column for that pixel column. Then the starting y, number of pixels to draw, starting v and v step are used by R_DrawColumn, and the texel per pixel is generated.

EDIT: As for HEIGHTBITS, I assume that's just the Z scale they designed the game for. Try changing it and see what happens. I assume the camera view will get either super stretched or squished.
-
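A heavily simplified C sketch of the stepping described above, for illustration only and not the actual Doom functions. The tiny framebuffer, the function names and the fixed bottom row are assumptions. HEIGHTBITS is 12 in the released r_segs.c, where it is the number of fraction bits in topfrac, so shifting topfrac right by it gives a screen row. The real R_DrawColumn also derives its starting v from the texture mid point, which this sketch replaces with 0.

```c
#include <assert.h>
#include <stdint.h>

#define FRACBITS    16                 /* 16.16 fixed point for texture v */
#define HEIGHTBITS  12                 /* fraction bits in topfrac */
#define HEIGHTUNIT  (1 << HEIGHTBITS)
#define SCREEN_W    8                  /* tiny hypothetical framebuffer */
#define SCREEN_H    8

static uint8_t screen[SCREEN_H][SCREEN_W];

/* Draw one vertical run: v starts at `frac` and advances by `fracstep`
   per screen row; the texture column is treated as 128 texels tall. */
static void draw_column(int x, int yl, int yh, const uint8_t *texcol,
                        const uint8_t *colormap, int32_t frac, int32_t fracstep)
{
    for (int y = yl; y <= yh; y++) {
        screen[y][x] = colormap[texcol[(frac >> FRACBITS) & 127]];
        frac += fracstep;
    }
}

/* Walk a wall range left to right. The top edge is interpolated linearly
   by adding topstep once per column; the bottom row yh is fixed here. */
static void render_wall_range(int x1, int x2, int32_t topfrac, int32_t topstep,
                              int yh, const uint8_t *texcol,
                              const uint8_t *colormap, int32_t fracstep)
{
    assert(x1 >= 0 && x2 < SCREEN_W && yh < SCREEN_H);
    for (int x = x1; x <= x2; x++) {
        int yl = (topfrac + HEIGHTUNIT - 1) >> HEIGHTBITS;  /* first whole row */
        if (yl < 0)
            yl = 0;
        if (yl <= yh)
            draw_column(x, yl, yh, texcol, colormap, 0, fracstep);
        topfrac += topstep;
    }
}
```

With topstep set to HEIGHTUNIT the top edge drops exactly one row per column, which makes the linear interpolation easy to see in the output buffer.
-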
RealDOOM v 0.23 is out!
- Fixed-in-place DS/SS segments at 0x3C00
- Fixed-in-place near variable address support
- Build scripts reworked
- ASM and C now share constants
- Render code removed from main binary and now dynamically loaded into high EMS addresses during the render phase
- 386 build with faster FixedMul (this can later be expanded a lot more...)
- Chipset-specific EMS builds for VLSI SCAMP/TOPCAT, C&T SCAT, Headland HT18 with direct chipset accesses for EMS pagination
- Fixed column rendering precision
- Binary ~20k smaller than version 0.22 (currently around 151k)
- Extra runtime data files reduced from 5 to 3 (just doomdata.bin, doomcode.bin, dstrings.txt)
- A little bit faster: 1-5% on 386, 486, Pentium (mostly thanks to the 386 build option). 286 is similar, 1-2% faster, but if you use SQEMM or a chipset build (built since v0.22) then things will be up to 10% faster.

I'm away from my hardware and can't confirm the chipset builds work right now, so I've left them out of the release package. I think I'll need to debug them again later before I can say they are stable and good to go.

The main goal for 0.24 is to do the whole render pipeline in ASM, maybe some of the most-run physics code in ASM, do some math optimization and hopefully squeeze out another 10% fps gain. If the scope feels too large I'll break that into two releases. After this next speed improvement push, I'll move on to remaining features (savegames, sound, maybe Final Doom/Ultimate support, maybe networking?, etc.) and try to release a real beta version.
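For the "ASM and C now share constants" item, one possible shape of such a generator in C. This is a guess at the approach, not the actual RealDOOM build script: the constants, the `ASM_CONST` macro and `write_inc` are made up for illustration, and the output uses plain `NAME EQU value` lines of the kind TASM accepts.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical constants; in a real build these would come from the game's headers. */
#define SCREENWIDTH   320
#define SCREENHEIGHT  200
#define PLANEWIDTH    (SCREENWIDTH / 4)

typedef struct {
    const char *name;
    long value;
} asm_const_t;

/* #c keeps the macro's name as text; (long)(c) lets the C compiler evaluate it. */
#define ASM_CONST(c) { #c, (long)(c) }

static const asm_const_t exported[] = {
    ASM_CONST(SCREENWIDTH),
    ASM_CONST(SCREENHEIGHT),
    ASM_CONST(PLANEWIDTH),
};

/* Format every exported constant as "NAME EQU value" into out.
   Returns the number of characters written, or -1 if out is too small. */
static int write_inc(char *out, size_t out_size)
{
    size_t used = 0;
    assert(out != NULL && out_size > 0);
    for (size_t i = 0; i < sizeof exported / sizeof exported[0]; i++) {
        int n = snprintf(out + used, out_size - used, "%s EQU %ld\n",
                         exported[i].name, exported[i].value);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Because the values are computed by the C compiler, anything the preprocessor derives (like PLANEWIDTH here) ends up in the .inc file without being copied by hand.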
-
I managed to get DS fixed at 0x3C00 by doing 2 things:
1. Copying the data from the starting DS segment to 0x3C00 (interrupts off for the stack portion).
2. This took me a while to figure out, but the DS segment must be re-set at the start of any interrupt function, because the (OpenWatcom) compiler manually sets it back (makes sense I guess).

These two tasks were pretty basic ASM tasks, but I had to stare at disassembly and read the OpenWatcom cstart source for a bit to understand what was going on and figure it out.

Once this was done, I fairly quickly made all references to thinkers (and thus mobj, plats, doors, glow, etc.) near instead of far (since the thinkerlist is positioned at the 0x4000 segment, they are within the near segment's addressable range). The code compiled 4700 bytes smaller. This is just OpenWatcom now turning all these far pointers into near ones in all the varied physics calls. On a Pentium it's a 0.3% fps gain, and on a 286 it's 0.7%. There's a handful of other things besides thinkers that reside in this memory area, and I'll soon be able to make the compiler generate smaller and faster code for them too. I'll work on that for a little bit.

Overall, the main tasks have been completed for the 0.23 release, and I mostly want to do some polish and bug cleanup (Doom 2's not loading again...) before a release. With this fixed DS segment and manual control of variable locations, so many more optimization targets (space-wise more so than speed-wise...) have become possible due to the control I have over memory. I want to focus on the render pipeline again once the release is out and try to squeeze another 10% fps gain out of the engine. I also need to figure out if I can create some better caches (column caches, texture caches, etc.) while that code is still in C and not asm. I feel there is some performance left on the table there. With so much memory available, it might make sense to increase some cache sizes.
Also I'm a big dummy and the player sprite is *not* always colormaps 0. Oh well!
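The reason the 0x4000 segment is reachable with near pointers from a DS of 0x3C00 is plain real-mode address arithmetic. A small C sketch of that check, where `far_to_near` is a made-up helper and not something from the RealDOOM source:

```c
#include <assert.h>
#include <stdint.h>

#define FIXED_DS_SEGMENT 0x3C00u   /* from the post: DS is pinned here */

/* Hypothetical helper: if seg:off falls inside the 64 KB window that starts
   at the fixed DS, write the DS-relative (near) offset and return 1.
   Otherwise return 0, meaning a far pointer is still needed. */
static int far_to_near(uint16_t seg, uint16_t off, uint16_t *near_off)
{
    uint32_t linear  = ((uint32_t)seg << 4) + off;       /* real-mode linear address */
    uint32_t ds_base = (uint32_t)FIXED_DS_SEGMENT << 4;  /* 0x3C000 */

    assert(near_off != 0);
    if (linear < ds_base || linear - ds_base > 0xFFFFu)
        return 0;
    *near_off = (uint16_t)(linear - ds_base);
    return 1;
}
```

Segment 0x4000 starts 0x400 paragraphs (0x4000 bytes) past the fixed DS, so data there lands at near offsets 0x4000 and up, and everything up to linear address 0x4BFFF stays within reach.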
-
Interesting. I'm not knowledgeable on 32-bit x86 at all yet (EAX addressing?! what?). I guess if I'm understanding right, the stall comes from EAX being overworked. Would you be able to alternate using EAX and EBX every other pixel to ease the AGI stall a little bit?

For player sprites, my observation is that the colormap is always 0 no matter how bright the area you are in is. I think it gets overridden by fixed colormaps though. So it would be a very, very high percentage of the time. For all other draws I did some investigating and posted about it earlier in the thread... colormaps 0 is definitely the most common in use, but in the doom1 shareware timedemos it's in use 15-30% of the time depending on level. I was wondering if it was worth having a colormaps 0 version of every texture in memory... it would be pretty wasteful but might up the fps a couple of percent. I think it depends on how much memory you are willing to waste. In theory you can have a copy of a texture for each colormap sitting in memory and be really wasteful - and then never have to do the extra lookup.

---

Meanwhile, I managed to get an extra build step pulling R_DrawSpan (and R_MapPlane) out of the binary into a file, and I load it during startup. Memory usage is down ~3kb from that. There are a lot of other functions in the render pipeline I can do this to, and I should be able to save another 5-10kb or so. These functions are all already loaded high, but they are copied (duplicated) from the main binary at runtime, so it's kind of wasteful.

I talked about the really nitty-gritty details on the vcfed forum thread. To sum it up, I'm using the linker/assembler to create a memory area in DGROUP near the start, and I put variables in there with fixed DS-offset addresses rather than use C variables. This way I can rip binary code out of the binary to a file, and the linker can't move the near variables I need access to around from build to build like it will with ordinary C variables, so it's safe to export this code out to a separate file and only update it when I update that code. The C compiler is really, really bad at optimizing this code, and r_segs got 1kb bigger and the build is 1% slower, but this is all stuff that's next on the list to be handwritten in asm, so I will get it all back soon.

The work on build scripts and process continues, and I hope to get the binary size down by another 20-30k and maybe fit the wad index into low conventional memory so I don't have to juggle it around in EMS at runtime. I'm not sure it will contribute to FPS much though, so maybe there is something more effective I can do with spare memory - lookup tables or somesuch.
-
Sorry, by the 1:1 part do you just mean avoiding the scale calculations in draw column? It'd probably not be a huge gain in RealDOOM either, but it also makes the code smaller, and the time it takes to fetch all the instructions adds up. Probably a 0.1 or 0.2% fps gain at best though. I definitely have to look into the masking thing and the colormaps 0 thing, as I think those would have much better gains.
-
Small things:
- Fixed the visual issue with plane lighting. It was a zlight data issue.
- Made a build script that takes C constants (especially stuff calculated by the preprocessor) and outputs them to a .inc file so asm code has access to the same stuff. It used to be manually copied over; now it's part of the build process and overall much cleaner.
- Added some optimizations like imul immediate, leave, and shift immediate in ifdef blocks of asm.
- Made the 386 build script with a 386 version of FixedMul. Surprisingly it just kind of worked first try once I figured out the proper tasm syntax for the 386 asm file. FixedDiv was not immediately working. On a Pentium it was 1% faster. If I want to take this to its eventual end for a 386sx-optimized version, there are also a lot of FixedMul variants I've made for when smaller parameters are used (helpful for the 286 but not the 386), which I probably need to alias back to the generic version. It's not an immediate priority; I just wanted the basic idea working.
- Moved variables (_DATA) to be after the null area and before where strings are stored (CONST/CONST2) in the DGROUP segment. This way the variable region does not move around from build to build as strings are added and removed. This was doable just with wlink options.

This is kind of a first step towards the main goal of the next release. Next I will try to get rid of near variables and replace them with casted constant memory locations (basically doing the linker's job). Then I will build this into the .inc generator and expose all the same memory locations to the .asm files, basically having the linker's job done by the C preprocessor (for variables, not functions). This will then make it easier for me to export code from the binary and make the main binary smaller. Since the external references in the asm will be under my control and the locations are not moving around from build to build, I can then kind of just stick code in a file, load the code at runtime into EMS memory, and page it in and out. For example, stuff like am_map and p_setup, which are large but run only in very specific instances, does not need to be constantly robbing the system of conventional memory. Improved build tooling is probably the main theme of this next release, before I get back to converting everything to asm.

I had some thoughts about another way to improve performance. Basically, player sprite rendering should be able to have some optimized draw functions. Depending on screen size the scale of the sprite changes, but in some sizes its scale is 1.0 and I can probably do something similar to the sky render function, where I remove a lot of the texture math. Even in the cases where the scale is not 1.0, the scale is a known number and I believe requires less precision, so I can probably get away with a faster function anyway. Since scale does not change from column to column in the render, I could also make the outer function faster. Also, the colormap is almost always 0, so I can possibly have a colormaps-lookup baked-in version that also renders faster. I also forget, and want to look into, whether or not the player sprite is ever used for masking. Maybe we can get away with drawing fewer pixels that will be drawn over by the sprite. Player sprite pixels take up a lot of the screen, so I think this is another type of improvement that could lead to something like a full percent improvement on average.
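On the 386 FixedMul item in the list above: a rough C illustration of why a 32-bit CPU has it easier. FixedMul is a 16.16 multiply, so with a 32x32->64 multiply it is one multiply and a shift, while a 16-bit CPU has to assemble the result from 16-bit halves. The two function names are made up, and the second function is just one way to do the decomposition, not necessarily how RealDOOM's 286 code works. It assumes the usual two's complement behavior (arithmetic right shift of negative values, wrapping unsigned-to-signed conversion).

```c
#include <assert.h>
#include <stdint.h>

#define FRACBITS 16

/* With a wide multiply available (a single IMUL on a 386). */
static int32_t fixed_mul_wide(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FRACBITS);
}

/* Same result built from 16-bit halves: a = ah*65536 + al, b = bh*65536 + bl,
   with ah/bh signed and al/bl unsigned. */
static int32_t fixed_mul_16(int32_t a, int32_t b)
{
    int32_t  ah = a >> 16;                 /* signed high half */
    uint32_t al = (uint32_t)a & 0xFFFFu;   /* unsigned low half */
    int32_t  bh = b >> 16;
    uint32_t bl = (uint32_t)b & 0xFFFFu;

    /* (a*b) >> 16 = ah*bh*65536 + ah*bl + al*bh + ((al*bl) >> 16), mod 2^32 */
    uint32_t r = (uint32_t)(ah * bh) << 16;
    r += (uint32_t)ah * bl;
    r += al * (uint32_t)bh;
    r += (al * bl) >> 16;
    return (int32_t)r;
}
```

The split version needs four multiplies plus adds, which also hints at why variants for small parameters help on a 286: if one operand is known to fit in 16 bits, some of those partial products are zero and can be skipped.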
-
Phew.. finally got this working on a real 5150 again:

The code hasn't really run on 8088s since I made the change to EMS 4.0 late last year, so it's been a while. (There was even a several-month period where it did not run on a 286.) I ported SQEMM to support the Intel Above Board - sort of. I still don't know all the ports on the card, so I have to boot with the Intel driver, then reboot with mine in the config.sys, otherwise I get parity errors - and the Intel driver doesn't work with RealDOOM, so it has to be rebooted in that order. But hey, I was able to get a real recording, so I'm happy. Since the 8088 build is always 6-7k bigger, saving all that memory off the binary in the past week really helped.

I'm going to be travelling for the next six weeks starting tomorrow, so no retro hardware access, but I'll probably keep doing a little bit of work on RealDOOM. If everything with 0.23 goes according to plan, I'm going to have like another 30-40k of memory freed up. I don't even know what to do with it anymore. Final DOOM etc. feels like a foregone conclusion at this point. I just wish I could use the memory to make things faster. Maybe a larger sprite or texture cache, not sure. There was always that crazy idea of keeping a copy of all textures etc. with palette/lightmap 0 prebaked in, to do a faster render like I do with sky textures, but it would require a lot of extra EMS pages. Maybe some more multiplication lookup tables would save a small number of frames. I'll have to think about it.
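The prebaked colormap idea in C terms, as a minimal sketch with made-up function names: the usual inner loop reads a texel and then looks it up in the colormap, while a baked copy has that lookup applied once up front, so drawing is a straight copy. The cost is one extra copy of the texels per colormap you bake.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Usual inner-loop work: texel -> colormap -> screen, two reads per pixel. */
static void draw_with_lookup(uint8_t *dest, const uint8_t *texels,
                             const uint8_t *colormap, size_t count)
{
    assert(dest != NULL && texels != NULL && colormap != NULL);
    for (size_t i = 0; i < count; i++)
        dest[i] = colormap[texels[i]];
}

/* One-time bake: apply the colormap to the texels and keep the result. */
static void bake_colormap(uint8_t *baked, const uint8_t *texels,
                          const uint8_t *colormap, size_t count)
{
    assert(baked != NULL && texels != NULL && colormap != NULL);
    for (size_t i = 0; i < count; i++)
        baked[i] = colormap[texels[i]];
}

/* Drawing from the baked copy: one read per pixel, no colormap lookup. */
static void draw_baked(uint8_t *dest, const uint8_t *baked, size_t count)
{
    assert(dest != NULL && baked != NULL);
    for (size_t i = 0; i < count; i++)
        dest[i] = baked[i];
}
```

Both draw paths produce the same pixels; the baked one just trades memory for one fewer memory read per pixel in the hot loop.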
-
It turned out I needed to upgrade Watcom from 1.9 to 2.0 to fix my printf dependency issues - it was an old OW 1.9 bug. (Thanks to Jiri Malak from OpenWatcom for helping me debug this.) So I did that, rewrote the uses of fprintf, and redid the printf function as a new smaller version (doesn't support everything, but enough...), and now the binary is 2-3kb smaller. The binary is around 8-9k smaller than a week ago. I'll eventually rewrite this stuff in ASM and make it even smaller.

I think I got a decent implementation of the HUD text optimization working. I have a feeling bugs might eventually pop up with it, but I can't find any right now. I didn't really do a full rewrite of the HUD code, but I think I still may down the road.

So while those changes improved performance at smaller screen sizes (9 and lower), I've gone ahead and done the opposite now. I rewrote V_DrawPatch and V_DrawPatchDirect (only the 2nd was really necessary) so that patches render a little faster. It makes about an 80 realtic difference on screen size 10 with a fast 286. This is sort of regardless of quality level, so it's hard to attribute a real percentage fps increase, but it's 1% on potato, 10 screenblocks, and probably less on high detail.

In these functions I experimented with some crazier self-modifying code to modify loop instructions directly rather than stack variables. I think it is a little faster and really helps in juggling fewer things in registers, but its uses are somewhat limited. It might find use in R_DrawSpan in the outer loops, so I may examine it later.