Maes Posted July 1, 2010 Sorry for the bump, but this is relevant: while tinkering with my Java source port, tentatively named Mocha Doom (WIP), I tried to see what would happen if I rendered columns "horizontally" into memory (i.e. if I stored the screen buffer columns-first, kind of like a FORTRAN array). First, I used the straightforward code to paint TITLEPIC on a 320x200 canvas 40000 times (we're talking just byte copying here, not displaying) and it took about 22.5 seconds, which boils down to about 0.56 ms per "rendering". However, by arranging the screen buffer in "columns first" order the time dropped to 15.5 seconds, roughly a 30% reduction just from memory arrangement. And using System.arraycopy instead of a hand-written loop resulted in just 2.5 seconds(!), about nine times faster than the original (however that's because of the use of pure native code AND favourable memory arrangement). Even 30% seemed impressive enough. The downside? In order to actually display the picture on screen or save it to a file, I had to transpose rows/columns...and that CAN'T be sped up with any tricks -_- In fact it took 2.5 ms to "unscramble" the columns-first picture into a rows-first one, which is about four to five times slower than the slowest renderer, thus nullifying any potential speedups arising from a columns-first arrangement. So here's your answer: at most a 30% improvement, or around that much, and only on an architecture that supports columns-first bitmaps natively. As for my port, I'll probably allow for patches to "decode" to a rows-first form and have an auxiliary renderer, which can use System.arraycopy during patch writes, along with a "traditional" one.
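For readers who want to see the two layouts side by side, here is a minimal Java sketch of writing one full-height column into a rows-first and into a columns-first 8-bit buffer. It is not taken from Mocha Doom; the class and method names are made up for illustration, and the 320x200 size is simply the canvas size mentioned in the post.

```java
/** Sketch only: one column written into a rows-first vs. a columns-first buffer. */
final class ColumnWriteDemo {
    static final int WIDTH = 320;
    static final int HEIGHT = 200;

    /** Rows-first: pixel (x, y) lives at y * WIDTH + x, so each step jumps WIDTH bytes. */
    static void drawColumnRowsFirst(byte[] screen, int x, byte[] column) {
        int dest = x;
        for (int y = 0; y < HEIGHT; y++) {
            screen[dest] = column[y];
            dest += WIDTH; // next row, same column
        }
    }

    /** Columns-first: pixel (x, y) lives at x * HEIGHT + y, so the column is contiguous. */
    static void drawColumnColumnsFirst(byte[] screen, int x, byte[] column) {
        System.arraycopy(column, 0, screen, x * HEIGHT, HEIGHT);
    }
}
```

Both methods store the same pixels; only the address pattern differs, which is where the timing difference reported above would come from.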
40oz Posted July 1, 2010 I'm imagining your suggestion to make the maps shaped like traditional wolf3d maps except that the floor and ceiling can bend into many different directions in weird sloping patterns. I have no idea what the benefit of mapping this way would be.
Maes Posted July 1, 2010 Err..nothing like that. On a related note, repeating the tests (the "suites" are in the project's CVS) on a Core 2 T8300 (original timings were on a Pentium IV 3.00 GHz, under 32-bit Windows XP and Eclipse IDE, these ones are under 64-bit Ubuntu) revealed faster times by a factor of 4-5, but the general ratio between rendering modes was the same: an even less marked speed advantage for columns-first, and the unscrambling process itself is still half an order of magnitude slower than the slowest renderer. So there. To better illustrate this... here's how the TITLEPIC looks when "scrambled" in memory: Essentially, I save columns in ROWS, one after another. It's possible to switch back and forth between the two representations. Yeah, it's FASTER to write columns-first in memory (instead of skipping rows with += SCREENWIDTH), but then unscrambling the buffer negates all advantages of doing so. Sorry for the FUGLY Java default 8-bit indexed palette :-p
RestlessRodent Posted July 2, 2010 If you do decide to go horizontal you might as well start drawing polygons on the screen.
Maes Posted July 2, 2010 GhostlyDeath said:If you do decide to go horizontal you might as well start drawing polygons on the screen. Not that, but it's possible to convert all flats and sprites into "normal" bitmaps once in memory -or in "rows major" patches-, and draw them in that fashion without changing the essence of the renderer. Whatever optimizations might be nullified by that approach might be recuperated in cache locality -which however adds up only between 10%-25% speed from what I've tested-.
RestlessRodent Posted July 2, 2010 Maes said:Not that, but it's possible to convert all flats and sprites into "normal" bitmaps once in memory -or in "rows major" patches-, and draw them in that fashion without changing the essence of the renderer. Whatever optimizations might be nullified by that approach might be recuperated in cache locality -which however adds up only between 10%-25% speed from what I've tested-. I'm talking about walls. Unless as you draw a wall you have a drifting pointer that exists on a texture's x and y axis.
Maes Posted July 2, 2010 GhostlyDeath said:I'm talking about walls. Unless as you draw a wall you have a drifting pointer that exists on a texture's x and y axis. Not quite sure where you're trying to get to here...however yeah, I actually do have a "drifting pointer" that goes inside patches, columns, posts and linear screen buffers (after all, I'm following the C code quite closely, especially in number-crunching stuff -> thank my unspeakable days of converting C and FORTRAN number crunching stuff to Java for that!!!). Perhaps you meant that I could use some sort of 2D acceleration/do direct bitblts to the screen/linear buffer? That's another option I'm considering -at least the various DrawPatch methods in v_video.c (now fully converted to Java) always seem to draw entire patches with no column clipping etc. so there's no reason not to draw "horizontalized" patches with those. However when scaling is involved, there are other functions in r_draw.c and others that are used. Those access the screen LFB and draw one column at a time, scaling it if needed. Rewriting those for horizontal rendering would be more of a chore. It would be interesting to see if in modern architectures the same column-major format that allowed for easy clipping back in the 486 days is actually a bottleneck, and whether drawing and overwriting a few extra pixels is now actually preferable to trashing the cache. I haven't reached that part of the code yet (just noted how it works) however a scaled texture could presumably be "accelerated" by drawing a tiled polygon directly or somesuch, instead of multi-passing. So far however, I'm pleased to see that in general rendering appears fast enough not to be a bottleneck, so if this is ever completed it should have framerates comparable with other software-rendered ports.
arrrgh Posted July 3, 2010 Spleen said:The reason to go with the software renderer rather than OpenGL (or using Quake's renderer like Vavoom does) is to preserve how DOOM looks in general. Extending the Doom renderer to have full freelook, even if very difficult, is still much less work than extending it until it has all the features of a 3D renderer! Hopefully, if I rework small pieces at a time, I can avoid changing the appearance too much. A 3D renderer using nearest-neighbor scaling and no mipmaps for a classic pixelated appearance looks almost identical to software rendering, except for the lighting(which is admittedly a big difference). If you are going to render horizontally then the colourmap lighting will be slow because it has to flip between many entries as it renders across the screen, as opposed to column based rendering.
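The colourmap point is easier to see with a concrete column drawer in front of you. The following is a rough Java sketch modeled on the inner loop of R_DrawColumn in Doom's r_draw.c; the method signature and class name are assumptions made for this example, not code from any port. Note that a single colormap array is chosen once and reused for every pixel of the column, while the destination index jumps by SCREENWIDTH each step.

```java
/** Sketch only: a Doom-style scaled column drawer with one colormap per column. */
final class ColumnDrawer {
    static final int FRACBITS = 16;
    static final int SCREENWIDTH = 320;

    /**
     * Draws screen rows yl..yh (inclusive) of column x into a rows-first buffer.
     * frac is the 16.16 fixed-point texture position, fracstep the step per screen pixel.
     */
    static void drawColumn(byte[] screen, int x, int yl, int yh,
                           byte[] source, byte[] colormap, int frac, int fracstep) {
        int dest = yl * SCREENWIDTH + x;
        for (int y = yl; y <= yh; y++) {
            // Texture columns wrap at 128 texels; mask to an unsigned palette index.
            int texel = source[(frac >> FRACBITS) & 127] & 0xFF;
            screen[dest] = colormap[texel]; // same colormap for the whole column
            dest += SCREENWIDTH;            // next row: the cache-unfriendly stride
            frac += fracstep;
        }
    }
}
```

A renderer that walked across a wall horizontally would, under this scheme, need a different colormap for each column it crosses, which is the cost described in the post above.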
tempun Posted July 3, 2010 arrrgh said:A 3D renderer using nearest-neighbor scaling and no mipmaps for a classic pixelated appearance looks almost identical to software rendering, except for the lighting... ...and also sprite clipping.
arrrgh Posted July 4, 2010 True, but IMO it doesn't really change the feel of the game much.
RestlessRodent Posted July 4, 2010 arrrgh and tempun have some pretty good points there.
Mr. T Posted July 4, 2010 The only reason I can imagine someone wanting to use a software renderer in this day and age is for nostalgia, which is fair enough considering how damn old Doom is. Personally I always choose hardware at around 800x600 in a window with filtering turned on, that looks nice.
Kira Posted July 4, 2010 Mr. T said:The only reason I can imagine someone wanting to use a software renderer in this day and age is for nostalgia, which is fair enough considering how damn old Doom is. I use the software renderer for older wads because some have glitches with OpenGL (including AV).
Mr. T Posted July 4, 2010 K!r4 said:I use the software renderer for older wads because some have glitches with OpenGL (including AV). Yeah, for wads that have "effects" that break GL I will use the software renderer, but other than that no. The jaggies and horrible colors (reds for example look crap in dark areas) turn me off, even if that is how I played Doom back in 1996.
Xaser Posted July 5, 2010 I've always found GL in Doom surprisingly effective at not making it look totally wrong, though generally that's referring to GZDoom, which is somewhat nicer (though still not perfect) about preserving old rendering-oddity-features. 'Course, that's with filtering off and light differences ignored, but it works. ;)
Maes Posted July 7, 2010 After implementing a "fast transpose" algorithm, the time it takes to transpose one framebuffer from column-major to row-major format is comparable to a "slow" vertical renderer, and slightly better than it for large resolutions (i.e. the time to do a column-major rendering + transposing to row-major for displaying on the screen is better than just rendering the screen "vertically" directly in row-major format). But the canvas must be forcibly blown up to a square matrix with its sides being a power of two (although there is probably a workaround), thus making a memory/speed tradeoff. Now, if EVERYTHING was made to be rows-first (patches, scaled sprite drawing, column drawing, flat drawing) then there would be no need for transposing and fast memcpy (or System.arraycopy) calls could be used, in C, C++, Java or whatever. The problem is that the renderer scatters this functionality across many functions and modules, and thus you must not "miss" any. The good thing is that you can have mixed methods (there are already DrawBlock methods that work with a favourable rows-first memory arrangement), as long as the final canvas is of one kind only.
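The post does not show the "fast transpose" itself, so the following is not Maes's algorithm. It is a generic cache-blocked (tiled) out-of-place transpose in Java, given here only as a sketch of the general idea of working on small square tiles. The tile size of 32 and all names are assumptions; unlike the version described above, this variant clamps tiles at the edges and so does not need a square power-of-two canvas.

```java
/** Sketch only: tiled out-of-place conversion from columns-first to rows-first. */
final class TiledTranspose {
    static final int TILE = 32; // assumed tile edge; tune for the target cache

    /**
     * src is columns-first: pixel (x, y) at x * height + y.
     * dst is rows-first:    pixel (x, y) at y * width + x.
     */
    static void columnsToRows(byte[] src, byte[] dst, int width, int height) {
        for (int x0 = 0; x0 < width; x0 += TILE) {
            int x1 = Math.min(x0 + TILE, width);
            for (int y0 = 0; y0 < height; y0 += TILE) {
                int y1 = Math.min(y0 + TILE, height);
                // Within one tile both the reads and the writes stay in a small window.
                for (int x = x0; x < x1; x++) {
                    int s = x * height + y0;
                    for (int y = y0; y < y1; y++) {
                        dst[y * width + x] = src[s++];
                    }
                }
            }
        }
    }
}
```

Whether this beats a naive double loop on a given machine has to be measured; the post above reports that its own blocked version only roughly matched the "slow" vertical renderer.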
Porsche Monty Posted July 10, 2010 Viewtiful-Chris said:What do you mean by classic? I think zDOOM can play in perfect classic-style. Just because it saves at the beginning of a level for you (saving those few seconds) and it has some fairly minor technical differences, it's still very much 1993's masterpiece. It just looks nicer and runs at silky-smooth framerates. One of the biggest problems with using ZDoom for "classic" dooming (again, favoring ZDoom over the vanilla-focused alternatives for a vanilla experience makes little sense to me) is the screen (more specifically the architecture and the offset of the weapon) which were distorted/displaced around version 2.0, possibly after a significant render update. Try comparing a 320x200 ZDoom screenshot against a vanilla screenshot; the differences are unpleasant to say the least.
beetlejoose Posted July 10, 2010 If you don't care about efficiency because any computer nowadays is going to be fast enough, in principle you could achieve real vertical rotation without having to convert the engine to scanline rendering - probably to about +-45 degrees. Use a larger frame buffer than will fit on the screen. Render to it as normal. Next, map the result onto an equirectangular texture using a buffer of precomputed spherical coordinates. This result can be rendered onto the inside of a sphere. Position the viewpoint at the origin of the sphere. Rotating the sphere 'up and down' will give you free mouse look, but you would only have a limited viewable vertical angle. It could be implemented using a wrapper library and no changes to the original engine itself. Maybe a bit overkill - but a different way to think about it...
Maes Posted July 11, 2010 beetlejoose said:If you don't care about efficiency because any computer nowadays is going to be fast enough... No. beetlejoose said:in principle you could achieve real vertical rotation without having to convert the engine to scanline rendering - probably to about +-45 degrees. Of course you can, but the "naive" transposition algorithms are incredibly slow, and while they may be OK in MS Paint and Photoshop, they are NOT in a real-time game. Faster is better, always. That's what I'm working on right now, and judging by the number of scientific papers dealing with "fast matrix transposition" (that's what the "vertical rotation" I'm trying to do here actually is) this is an ongoing problem bound by logical/architectural limits. You will have to go from a memory layout that's cache-friendly (e.g. storing columns-first in memory and rendering as columns) to one that's cache-unfriendly (converting to rows-first requires skipping columns in a way that causes cache misses). The only way to speed this up is to segment a matrix into small squares sized as powers of two (and whose width can fit into a cache line) and use different algorithms for diagonals/off-diagonals, if we're talking about full matrices. beetlejoose said:Stuff about -/+ 45 degrees Would work for what you proposed, but efficiency aside, it's not what I've been trying to address here. As of now, whoever is implementing the software renderer is faced with two perverse options: Keep the screen buffer in a rows-first format (which is also compatible with the graphics card) which however will SLOW DOWN the column-based rendering which DOOM uses for sprites/wall textures because of cache-unfriendly arrangement.
Keep the screen buffer in a columns-first format which speeds up column rendering greatly REGARDLESS OF LANGUAGE USED AND REGARDLESS OF COMPUTER SPEED, IT'S JUST FASTER as long as you have cache memory, BUT you have to transpose THE WHOLE SCREEN BUFFER before sending it to the graphics card (it's NOT exactly the same as rotating it 90 degrees, it's more like mirroring along the diagonal). Doing it with a precomputed table and polar coordinates would be even slower than procedurally copying rows/columns: what causes slowdowns here are the cache misses, so introducing further lookups or computations in between isn't going to help. The only way to speed it up is to arrange data in memory in a specific way (e.g. 64 bytes apart, so that EVERYTHING is cached) at the expense of wasting extra unusable memory. Some gotchas: The columns-first rendering that DOOM uses (coupled with a rows-first screen buffer) is actually not that disadvantaged when it comes to scaled columns (because you can draw "enlarged" pixels by just drawing on pixels of the same row, which will be in the cache). Then again a fully horizontal renderer will ALWAYS have that advantage, even for unscaled sprites/graphics. The columns-first rendering that DOOM uses (coupled with a rows-first screen buffer) is advantageous for quickly clipping portions of walls/rooms/sprites that are NOT visible, because of the 2.5D nature of the engine. At that point, it all depends on whether more time is spent on the game/clipping logic or on actually drawing stuff on the screen, and which one of the two causes more cache misses.
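The remark that transposing is "more like mirroring along the diagonal" than rotating by 90 degrees can be checked on a tiny matrix. This Java sketch uses made-up names and a 2x3 example; it is only meant to show that the two operations send the same element to different places.

```java
/** Sketch only: transposition compared with a 90-degree clockwise rotation. */
final class TransposeVsRotate {
    /** Transpose: element (r, c) moves to (c, r), a mirror along the main diagonal. */
    static int[][] transpose(int[][] m) {
        int rows = m.length, cols = m[0].length;
        int[][] out = new int[cols][rows];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                out[c][r] = m[r][c];
        return out;
    }

    /** Clockwise rotation: element (r, c) moves to (c, rows - 1 - r). */
    static int[][] rotateClockwise(int[][] m) {
        int rows = m.length, cols = m[0].length;
        int[][] out = new int[cols][rows];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                out[c][rows - 1 - r] = m[r][c];
        return out;
    }

    public static void main(String[] args) {
        int[][] m = { {1, 2, 3}, {4, 5, 6} };
        System.out.println(java.util.Arrays.deepToString(transpose(m)));       // [[1, 4], [2, 5], [3, 6]]
        System.out.println(java.util.Arrays.deepToString(rotateClockwise(m))); // [[4, 1], [5, 2], [6, 3]]
    }
}
```

The rotated result is the transposed one with each row reversed, so a display path that expects rows-first data needs the transpose, not the rotation.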
beetlejoose Posted July 11, 2010 I may have stated what I had in mind a bit too casually - but then again it was just something that I dreamed up in the 2 minutes while reading the post. It sounds like you've been thinking about this for quite some time, Maes. It seems like an interesting, challenging problem - one that I'd like to think about some more. At the moment when I think of Doom I automatically think of Doom95 - compiled and no source code. Granted, that may seem pointless - but it's what I was introduced to a few months ago and now I use it to exercise my brain in my spare time... So we're talking about different problems - at least as an implementation issue - but I still think it's relevant. Anyway, I realize there are hard problems to do with this and I don't propose to use some 'naive' method. I have programmed the mapping of photographs onto spherical textures and rendering them on spheres in realtime before - this is similar and at a much lower resolution. Also it only has to deal with a particular case which means all the slow stuff can be done beforehand. The transposition would happen in a new layer after Doom had done all of its rendering, by hooking and overriding the DirectDraw library. The only time the slow mapping using matrices will be required is at startup when the precomputed array is generated. The data in the frame buffer will already be laid out in scanlines - Doom has already done the column to row conversion. The mapping from the frame buffer to sphere texture per frame will require no slow arithmetic operations. The precomputed buffer will be a list of pointer pairs to the pixels in the framebuffer source and the sphere texture destination. This list will be read through in sequence once per frame. The pairs can be sorted to minimize cache misses on the source and destination arrays - although a certain amount will have to be tolerated. The mapping will not reach all the way to the poles of the sphere so the distortion will be less.
The destination texture simply receives the colour pointed to in the source buffer. The final step will be to do the textured sphere (OpenGL or DirectX...). I think using hardware acceleration in the final step should not matter from a purist point of view since this is already quite a departure...
Maes Posted July 12, 2010 beetlejoose said:The data in the frame buffer will already be laid out in scanlines - Doom has already done the column to row conversion. That alone is shit-slow, and comparable to the time it takes to render a frame columns-first to begin with. And N.B., it needs no computations other than pointer arithmetic, whether explicit or implicit, to access the screen buffer array. DirectDraw doesn't really enter into it at this point, and sending the screen buffer to a graphics device is another problem in itself. beetlejoose said:The mapping from the frame buffer to sphere texture per frame will require no slow arithmetic operations. The precomputed buffer will be a list of pointer pairs to the pixels in the framebuffer source and the sphere texture destination. Again, this is mostly irrelevant to the problem at hand, but I'll elaborate a bit more. So basically you will have a LUT that maps spherical coords to a linear screen buffer (actually, it will just distort a "full" rectangular screen buffer into a sparser, spherical framebuffer). The problem with this is that such a table will pollute even the largest cache memory: simply put, for a 1024*768 resolution, you would need to store X,Y pairs for nearly a million points. Even with 16-bit integers, that would mean constructing a big-ass table of 4 MB in size, which means that the cache will be trashed by it and nearly nothing else will be able to use it. Unless you're working on an architecture so shitty that it has no cache to begin with, this is an issue you can't overlook. Then there's another problem: the "sphere" will map to -you guessed it- your screen, which has a finite number of points. If you just map the Doom screen buffer to spherical coordinates, you will get a rather sparse "expansion": it will look OK near the center view (pixels will be close together), but near the edges pixels will start to form gaps between them, because they are not actually pixels but SURFACES.
Good luck drawing properly deformed polygonal SPHERE SURFACE FRAGMENTS to fill in the gaps...so actually you will need TWO LUTs: one that maps from the final screen to the sphere (so that EVERY pixel in the final screen that you see has a spherical equivalent), and THEN a mapping from ALL visible pieces of the sphere to the Doom screen buffer. I am designing GIS software where you have to do this sort of transformation between two geographical map projections and a rasterized form, and well, it just is not something you'd like to do in software for real-time stuff. It can only become viable by using hacks that trade pixel-perfect accuracy for a smaller memory footprint, dumbed-down approximations and so on.
Spleen Posted July 12, 2010 Thanks for all the useful info, Maes! Maybe this is why (almost?) nobody who tries to make games for money still makes 3D software-mode games despite the fact that as of 2007, 90+% (source: http://computershopper.com/feature/the-right-gpu-for-you) of computers come with horrible integrated graphics cards - it sounds like a huge task to optimize. The market is huge and the graphics should probably be around Wii-quality by now. Now here's a question which will probably sound silly to you and other experienced programmers, however I am not experienced enough to answer it for myself, so I'm going to go ahead and ask it. How much exactly does using Java instead of C/C++ affect performance in this particular instance? Also, maybe at least the spherical-mapping part could be GPU-accelerated (with OpenCL or CUDA probably)? Using a GPU to handle the map geometry ruins the lighting, but if you do it on an already-rendered image, then maybe it will be fine.
Maes Posted July 12, 2010 Spleen said:Now here's a question which will probably sound silly to you and other experienced programmers, however I am not experienced enough to answer it for myself, so I'm going to go ahead and ask it. How much exactly does using Java instead of C/C++ affect performance in this particular instance? You will hear a lot of speculation and FUD in this field, however I have used C, C++, FORTRAN and Java for heavy number-crunching (that's one reason I spent so much time trying to optimize the transposer, even though it's probably not worth it) and I feel more entitled than most to have a say in this matter :-p If you google for "C vs Java" and "Java vs C++" you'll get a lot of links to some infamous benchmarks that show Java pulling ahead of C/C++ or even kicking its ass, which is perfectly possible under certain circumstances and vs certain compilers. In general, pure number-crunching such as floating-point operations is potentially equally fast in both, because of Java's just-in-time compilation. Things change when there's a lot of array access/pointer magic: Java has strict bounds checking, so there's a small penalty each time you access an array. Now, if you also do heavy maths for every element you access, that's not much of a problem. E.g. if for each element in Java it's 10% access time and 90% computations, and in C/C++ it's 5% access and 95% computations, that's nothing to be too excited about. If the access time is more important though (such as copying random shit from all over memory to other random places) then yeah, you would have a net 50% advantage with C/C++ off the bat. What I noticed is that in practice with Java you will have 60-80% of optimized C/C++'s performance in number crunching involving large arrays and tables. A practical example was timing the Client JVM vs gcc with -O3 for Doom's fixed point (implemented in Java) over large arrays: Java had 60-70% of the best possible C performance.
If you just brutally copy data around without much manipulation, C/C++ will be faster (but then you can use System.arraycopy in Java too for naive copies, which is written in native code, and probably it will be some suck-assy code to begin with). Also, things change when you allocate/deallocate a lot of objects, which however is not a typical situation in number crunching (and if there ever is a speed tradeoff, Java makes up for it with type safety/runtime robustness). Take note that there's also a Server JVM which applies much more aggressive optimizations to the code during runtime than the "tame" Client JVM that runs the Applets in your browser, and that can often apply runtime optimizations that are impossible for a static compiler to foresee. And finally, there's always GCJ, the GNU Compiler for Java ;-) Spleen said:Also, maybe at least the spherical-mapping part could be GPU-accelerated (with OpenCL or CUDA probably)? Using a GPU to handle the map geometry ruins the lighting, but if you do it on an already-rendered image, then maybe it will be fine. Yes, that would be an ideal application for GPGPU: you isolate a repetitive, parallelizable, CPU and cache intensive problem from the main system, and let it grind away while you do more general/logic stuff.
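For context on the "Doom's fixed point (implemented in Java) over large arrays" benchmark mentioned above, here is a small Java sketch of 16.16 fixed-point multiplication as Doom defines it (FRACBITS is 16, and the product is computed in 64 bits before shifting). The benchmark code itself is not shown in the thread, so the array loop and the names below are assumptions for illustration only.

```java
/** Sketch only: Doom-style 16.16 fixed-point multiply applied over an array. */
final class FixedPoint {
    static final int FRACBITS = 16;
    static final int FRACUNIT = 1 << FRACBITS; // 1.0 in fixed point

    /** Multiply two 16.16 values, widening to long so the intermediate cannot overflow. */
    static int fixedMul(int a, int b) {
        return (int) (((long) a * (long) b) >> FRACBITS);
    }

    /** The kind of loop such a benchmark might time: one array access plus one multiply per element. */
    static void scaleAll(int[] values, int factor) {
        for (int i = 0; i < values.length; i++) {
            values[i] = fixedMul(values[i], factor); // each access is bounds-checked by the JVM
        }
    }
}
```

In a loop like this the per-element work is small, so the cost of array access matters relatively more, which is the situation the post describes as less favourable to Java.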
beetlejoose Posted July 12, 2010 Sorry for hijacking your thread Spleen. Errm, sorry Maes - 'shit slow', 'irrelevant'? Maybe you should take a little extra time to understand somebody else's ideas before shooting off at the mouth like that! I don't mind constructive criticism but what you are saying sounds like you've got the wrong end of the stick. First of all, please understand that I'm not saying that spherical rendering is a fast way of accomplishing anything at all. I don't suggest that you or anyone else should use it for writing a new software renderer! I am merely saying that, based on what I have tinkered around with before, I think it can be done sufficiently well using today's hardware as an add-on to a piece of software that was COMPILED for the machines of 15 years ago. The faster CPUs, memories, caches and buses provide a gap into which I can add a limited vertical rotation specifically for Doom95 by providing a wrapper library that overrides DDRAW.dll... Maybe I should have started a new thread for these ideas - anyway. I don't understand your objection to what I intend to do with DDRAW.dll so I'll let you 'elaborate' on that a bit more before I respond. Now a few points about my irrelevant lookup table. First of all, I specifically said that this table will be read 'sequentially' once per frame. I said nothing of the table being cached entirely. It's read in a simple cache coherent highly efficient sequence - not random access! It doesn't matter how big the table is, it will not flood the cache. Besides, there are size optimizations applicable to this table if I use array indices rather than pointers. The mapping is symmetrical horizontally and vertically so the total size can be reduced by a factor of 4. Also the destination pixel spacing can match the sphere texture so those 'pointers' will not be needed. That'll reduce the size by a further factor of 2 at the expense of having to ignore a few pixels outside of the mapping.
There will be no need for two tables! If the lookup table is made to match every pixel in the spherical buffer there will be no gaps closer to the edges. The pixels in the source image will be mapped to more than one pixel in the spherical image where there would otherwise be gaps. So I will not need any luck with spherical polygon filling algorithms! Having a precomputed buffer is MOST relevant! It reduces a whole matrix multiplication per pixel to a simple lookup. I am not stepping on your toes here, Maes. My ideas are irrelevant to your software renderer in Java - but I wasn't talking about what you are doing. The GIS software sounds interesting though. I should note that these are only ideas that need testing - and that is what I shall do.
Maes Posted July 13, 2010 beetlejoose said:Errm, sorry Maes - 'shit slow', 'irrelevant'? Yes to both. beetlejoose said:Now a few points about my irrelevant lookup table. First of all, I specifically said that this table will be read 'sequentially' once per frame. So the table will need to have at least as many entries as pixels on the screen if you want to do a 1:1 mapping without symmetries. That's 300K+ for 640*480 resolution. Said entries must be at the very least pairs of 16-bit integers (so 4 bytes per entry), so even by cutting everything by 1/4th we're looking at one big-ass 600K+ table, in addition to the actual image data itself. beetlejoose said:I said nothing of the table being cached entirely. It's read in a simple cache coherent highly efficient sequence - not random access! It doesn't matter how big the table is, it will not flood the cache. You can't decide what will be cached and what will not. Since you'll need to read ALL of the table at some point, parts of it will flood all of the cache, just as with any other table, like the screen buffer itself. LUTs don't work well when you use too many of them. beetlejoose said:There are size optimizations applicable to this table if I use array indices rather than pointers. ...which are one and the same, in the end. beetlejoose said:The mapping is symmetrical horizontally and vertically so the total size can be reduced by a factor of 4. That's the only thing that makes sense, but it's not enough to cut the cache onslaught back enough. It's just too big of a table anyway. beetlejoose said:If the lookup table is made to match every pixel in the spherical buffer there will be no gaps closer to the edges. That is true, but you are putting the cart before the horse here. beetlejoose said:Stuff about needing, then not needing, then needing again two tables OK... let's see: Doom renders its shit in a typical rectangular fashion.
You somehow map this to a spherical surface -> that WILL introduce gaps and distortions, and some pixels from the original screen buffer will not map directly to any pixel in the spherical texture (which should actually be thought of as a non-rectangular texel). You need an inverse mapping (from the "sphere" to the screen buffer) so that, as you said, some pixels from the screen buffer map to more than one sphere point. In GIS software, you often need to work with "pluggable" pairs of geographic/cartesian coordinate systems which can both be non-linear (e.g. Conical input, Mercator output, with or without a specified target and input raster). You can't dumb it down to a single table. OK, so unlike the GIS software, in your -hypothetical- renderer you can get away with a single table as long as the resolution of the renderer and the geometry and intended output resolution of the spherical texture itself don't change. You would have to recreate it as soon as either of these parameters changes, and, quite unintuitively, it should be mapped from the sphere to Doom's screen buffer, and not vice versa.
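The direction of the table (from destination back to source) is the key point of this exchange, so here is a minimal Java sketch of an inverse-mapping lookup table. The spherical math is deliberately left out: the mapping function is passed in as a parameter, and the stretch used in the example is a stand-in, not a sphere projection. All names are assumptions for illustration.

```java
import java.util.function.IntUnaryOperator;

/** Sketch only: a destination-to-source lookup table, built once and applied per frame. */
final class InverseLut {
    /** lut[d] holds the index of the source pixel that destination pixel d should display. */
    static int[] build(int destSize, IntUnaryOperator destToSource) {
        int[] lut = new int[destSize];
        for (int d = 0; d < destSize; d++) {
            lut[d] = destToSource.applyAsInt(d);
        }
        return lut;
    }

    /** The table and dest are walked sequentially; only the reads from src are scattered. */
    static void apply(int[] lut, byte[] src, byte[] dest) {
        for (int d = 0; d < lut.length; d++) {
            dest[d] = src[lut[d]];
        }
    }

    public static void main(String[] args) {
        byte[] src = {10, 20, 30, 40};
        int[] lut = build(8, d -> d / 2); // stand-in mapping: stretch 4 pixels to 8
        byte[] dest = new byte[8];
        apply(lut, src, dest);
        System.out.println(java.util.Arrays.toString(dest)); // [10, 10, 20, 20, 30, 30, 40, 40]
    }
}
```

Because every destination pixel gets exactly one entry, there are no gaps, and a source pixel may be reused several times. The table has one int per destination pixel, so the size concerns raised above still apply at real resolutions.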
beetlejoose Posted July 13, 2010 Admittedly the table will be big. But I still suspect that the transformation can be made efficient enough to achieve an acceptable framerate - admittedly not the fastest possible. This is the ONLY method I can think of that might work to give Doom95 vertical rotation. I'm doing this because it's a challenge, not because it's sensible! I cannot directly control what parts of the table are cached. But at least by using strong sequential locality when reading, the cache has the option to prefetch the table one or several cache lines at a time, discarding the previous ones because they are not referenced again. The same will apply to the writing of the spherical texture. That will be sequential. Once each written cache line is finished with, it will not be written again and will be flushed to main memory as the space is needed. The reading of the flat buffer will not strictly be sequential but there will still be long stretches of data referenced next to each other. My point of array indices versus raw pointers is that it will 'slightly' simplify calculating the reflections in the mapping in my program - but point taken - this is probably a moot point to make as the results will still boil down to pointers. Correct, the mapping will need to be regenerated if the screen size is changed - but I don't think that will happen very often. I am going to try a test case and it will be interesting to see what the results are.
Maes Posted July 13, 2010 beetlejoose said:I cannot directly control what parts of the table are cached. Basically, whatever you read + whatever fits on the same cache line. For typical tables accessed rows-first, that means that going e.g. first horizontal and then vertical works wonders, and works even better for "flattened" arrays. The Doom renderer "scrambles" this nice order by jumping from row to row -> different cache line. beetlejoose said:But at least by using strong sequential locality when reading, the cache has the option to prefetch the table one or several cache lines at a time, discarding the previous ones because they are not referenced again. The problem is that with a large enough table, all of it will eventually have to "pass" through the cache and then start rolling over. At the same time, it will have to write stuff to the screen buffer (which is ALSO a large table, and will be cached with the same rules). And of course there is still code to be executed between a LUT read and a screen buffer write...so that further smudges things up. In general, moving stuff about between two relatively large tables is about the worst thing you can do to a cache memory, unless your tables are very small and can fit wholly into the cache without associative overlapping (e.g. for a 320*200 screen, and you can control the memory layout so that the tables are sequential and don't map to the same cache lines).
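The "whatever fits on the same cache line" rule can be turned into simple arithmetic. The Java sketch below counts how many distinct cache lines one 200-pixel column touches for a given stride. The 64-byte line size and the assumption that the buffer starts on a line boundary are illustrative choices, not facts about any particular CPU, and the class name is made up.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch only: count distinct cache lines touched by a strided run of byte accesses. */
final class CacheLineCount {
    /** Assumes byte offset 0 of the buffer is aligned to the start of a cache line. */
    static int linesTouched(int start, int stride, int count, int lineSize) {
        Set<Integer> lines = new HashSet<>();
        for (int i = 0; i < count; i++) {
            lines.add((start + i * stride) / lineSize);
        }
        return lines.size();
    }

    public static void main(String[] args) {
        // One 200-pixel column in a rows-first 320-wide buffer: stride 320.
        System.out.println(linesTouched(0, 320, 200, 64)); // 200
        // The same column in a columns-first buffer: stride 1.
        System.out.println(linesTouched(0, 1, 200, 64));   // 4
    }
}
```

Under these assumptions a rows-first column write touches a fresh line for every pixel, while a columns-first write touches only four, which is the locality difference being argued over in this thread.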
beetlejoose Posted July 13, 2010 I've thought of another way to do this! You don't need to create an intermediate spherical texture at all! See, I was thinking in terms of using a predefined sphere primitive in OpenGL or DirectX that requires the texture to be in the spherical format. But this problem is really about how the texture coordinates are generated for the sphere in the first place. If I create a new sphere object that generates its own texture coordinates, taking into account the spherical distortion needed, then the flat buffer can be used directly. That could all be executed on the GPU and no large transform arrays will be needed! The quality of the transform can be adjusted by setting the number of slices and segments in the sphere, as usual, to create larger or smaller polygons. It pays to be critical because it makes people think harder - thanks Maes!
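One way the custom texture coordinates could be generated, sketched in Python. It assumes the flat buffer is an ordinary perspective image taken looking straight ahead, so a direction (x, y, z) with z pointing forward lands at (x/z, y/z) on the image plane. The function names, the 90/60 degree fields of view and the vertex layout are illustrative assumptions, not something taken from Doom95, OpenGL or DirectX.

```python
import math


def direction_to_uv(yaw, pitch, half_hfov, half_vfov):
    """Project a view direction onto the flat image; (0.5, 0.5) is the centre.

    For a unit direction, x/z = tan(yaw) and y/z = tan(pitch) / cos(yaw).
    """
    u = 0.5 + math.tan(yaw) / (2 * math.tan(half_hfov))
    v = 0.5 - math.tan(pitch) / (math.cos(yaw) * 2 * math.tan(half_vfov))
    return u, v


def sphere_patch(slices, segments, hfov_deg=90.0, vfov_deg=60.0):
    """Vertices (x, y, z, u, v) of the unit-sphere patch covering the view."""
    half_h = math.radians(hfov_deg) / 2
    half_v = math.radians(vfov_deg) / 2
    vertices = []
    for j in range(segments + 1):
        pitch = half_v - 2 * half_v * j / segments   # top to bottom
        for i in range(slices + 1):
            yaw = -half_h + 2 * half_h * i / slices  # left to right
            x = math.cos(pitch) * math.sin(yaw)
            y = math.sin(pitch)
            z = math.cos(pitch) * math.cos(yaw)      # z points forward
            u, v = direction_to_uv(yaw, pitch, half_h, half_v)
            vertices.append((x, y, z, u, v))
    return vertices


mesh = sphere_patch(slices=16, segments=8)
```

Vertices near the corners of the patch get a v outside the 0..1 range, meaning they point outside the rendered frame, so a real implementation would have to clip or clamp there. More slices and segments give smaller polygons and a closer approximation, since the GPU only interpolates linearly between vertices.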
tempun Posted July 13, 2010 beetlejoose said: I've thought of another way to do this! You don't need to create an intermediate spherical texture at all! See, I was thinking in terms of using a predefined sphere primitive in OpenGL or DirectX that requires the texture to be in the spherical format. But this problem is really about how the texture coordinates are generated for the sphere in the first place. If I create a new sphere object that generates its own texture coordinates, taking into account the spherical distortion needed, then the flat buffer can be used directly. That could all be executed on the GPU and no large transform arrays will be needed! The quality of the transform can be adjusted by setting the number of slices and segments in the sphere, as usual, to create larger or smaller polygons. It pays to be critical because it makes people think harder - thanks Maes! Why is a sphere needed at all? Imagine a tall box with vertical walls. Its walls are textured with the output of Doom's software renderer. We only need to adjust the view angle.
beetlejoose Posted July 13, 2010 I'll answer based on what I think you mean, but if I'm wrong then please give some more details. Do you mean to render the output onto a series of large flat surfaces? Simple, but unfortunately that won't produce the desired effect. As you rotate the view, the image on the surfaces becomes compressed. Imagine you have rotated so that your viewpoint is in the plane of a surface: then you won't see anything at all. A sphere produces the correct projection. But even when rendering a sphere on a graphics card, it's an approximation made out of small flat surfaces - triangles. Each triangle has the same distortion as above, but because they can be made small and numerous, a better approximation can be achieved and the distortion minimized.
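Both effects in the post above can be put into rough numbers. The sketch below is a simplification that ignores perspective: a flat surface turned away from the viewer shrinks with the cosine of the angle, and a flat facet spanning 360/n degrees of a unit circle misses the true curve by at most 1 - cos(pi/n). The function names are made up for illustration.

```python
import math


def apparent_width(width, angle_deg):
    """Projected width of a flat surface turned angle_deg away from face-on."""
    return width * math.cos(math.radians(angle_deg))


def facet_gap(slices):
    """Largest gap between a unit circle and a flat facet spanning 360/slices degrees."""
    return 1 - math.cos(math.pi / slices)


# Foreshortening of a 320-pixel-wide surface as it turns edge-on.
for angle in (0, 45, 80, 90):
    print(angle, round(apparent_width(320, angle), 1))

# Facet error shrinks quickly as the number of slices grows.
for n in (8, 16, 64):
    print(n, facet_gap(n))
```

At 90 degrees the surface is seen edge-on and its projected width drops to zero, which is the "you won't see anything at all" case. The gap falls roughly with the square of the slice count, so a moderately fine sphere already comes close.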