antares031

Frame rate drop in ZDoom

Recommended Posts

Whenever I play complex levels with tons of linedefs in ZDoom, the frame rate drops more than in any other source port, except Doomsday. I can understand that GZDoom has a better frame rate, since it uses OpenGL rendering while ZDoom uses software rendering. But PrBoom+ has the best frame rate without OpenGL rendering, even better than GZDoom. I believe that PrBoom+ also uses software rendering, so I can't understand why there's such a huge difference between PrBoom+ and ZDoom. Both have the same resolution, 1280x960, with default display options. The screenshot below shows the frame counter in MAP28 of Speed of Doom.



I couldn't find an fps counter for PrBoom+, but I'm pretty sure the performance was better than GZDoom (maybe around 40~50 fps). Are there any options to optimize the frame rate in ZDoom? I tried turning options off in the display menu and switching to a lower resolution, with no success.


No surprise there; prBoom+ barged onto the scene with loud fanfares and much rejoicing years ago precisely because, for the first time in nearly 15 years of Dooming, it made NUTS.WAD-like maps actually playable.

In such maps, there's much more that affects speed than just the renderer, and prBoom+, besides having heavily optimized software and OpenGL renderers, also has the most optimized implementation of the Doom/Boom engine around. Also, don't forget that, unlike ZDoom, it doesn't have to support scripting or 10 different games: it only plays Doom (with or without Boom extensions) and that's it.

As to why using OpenGL doesn't automatically guarantee rubber-burnin' framerates at all times, that's another can of worms.


There are a few things at play here:

PrBoom+ is faster than ZDoom and GZDoom for two reasons: 1. the renderer is a lot simpler, and 2. Entryway has access to the Intel compiler, which can make quite a difference (comparing the official build with one compiled with Visual Studio can show a difference of around 20%).

When a large amount of translucency has to fill the entire screen, software rendering can break down quite dramatically, especially at high resolutions.

And of course: ZDoom's software renderer really isn't the best thing around; it does have some performance issues that other ports may just shrug off.


@Maes, @Graf Zahl: Ah, I see. Obviously I was underestimating PrBoom+. I've just realized that I can actually play nuts.wad in PrBoom+ with almost no frame drops, even though my PC is not a good one. Thanks for the answer, guys. :)


Just to note: testing that map, the performance in the development builds (or at least 2.9pre-137-g01bed05) is far worse than it is in, say, ZDoom 2.7.1.


That may also be a factor. The recent portal submission has some performance issues that haven't been analyzed yet.


That's why there was a 2.8 release made before merging in these big changes. I know it's generally accepted to use the development builds instead of the official ones, but in this one case, if you're worried about stability or performance, please use the official build until the kinks get hammered out.


The most feature-packed port is also the slowest? Who would have thought?

Anyway, for Speed of Doom you can use a lot of other Boom-compatible ports, so no probs.


It's a temporary setback. Use the official release and it'll be ok. The new extensions still need some fixing.


I think we're now experiencing the downside of not having an official release for 2.5 years, with people assuming that it's all still the same...


That frame at the top is a terrible comparison with the frame below it, due to the overlaid translucent fireballs. Rendering those fireballs tends to destroy the memory cache. That scenario wreaks havoc on all software renderers, including PrBoom+. Just turning off translucency on that scene should add a few fps.

Just drawing non-translucent overlaid sprites taxes a software engine drastically, again due to memory cache invalidation and flush.

ZDoom's software renderer is actually extremely fast, as far as Doom software renderers go, as it implements a quad-column drawing scheme that paints 4 horizontally contiguous pixels at a time, greatly reducing the cache issues.

Some engines will have unusual slowdowns in certain video modes. The width of the video mode is something to be aware of, as certain widths sit unusually poorly within the memory cache scheme. Sometimes a higher video mode can be even faster than modes with less resolution. 1024x768 used to be a big problem on older systems.

But, yes, the renderer is only part of the story. The game simulation speed becomes significant in levels with many hundreds of monsters throwing fireballs, tons of sectors/linedefs, huge layouts, lots of lifts, an empty or poorly-built reject map, etc.

If the engine allows mobj freezing, turning that feature on, in combination with toggling translucency and/or changing resolution, should help isolate the cause of the slowness.

kb1 said:

Some engines will have unusual slowdowns in certain video modes. The width of the video mode is something to be aware of, as certain widths sit unusually poorly within the memory cache scheme. Sometimes a higher video mode can be even faster than modes with less resolution. 1024x768 used to be a big problem on older systems.


Memory Stall problem: https://en.wikibooks.org/wiki/Microprocessor_Design/Cache#Memory_Stall_Cycles

jval said:

Memory Stall problem: https://en.wikibooks.org/wiki/Microprocessor_Design/Cache#Memory_Stall_Cycles

Wow, nice page, jval!

Now, there are ways to mitigate the issue somewhat. By separately attacking each aspect of the problem, some solutions become apparent. I'll list a few ideas below.

(note: I'm not getting too technical here. I present the basic issue, but reality is more complex, as usual.)

. The wall issue: Doom paints its walls vertically, which is fine on a 486. But modern processors are optimized to encourage programs to write "horizontally", or contiguously. When Doom paints a wall on a modern processor, the memory subsystem accesses that pixel and starts caching subsequent memory positions in a "horizontal" direction. Unfortunately, that cached data is useless when Doom writes the next pixel vertically. Actually, it's not only useless but detrimental, because it now needs to be flushed, which can waste time. Depending on the resolution and the sizes of the various cache buffers, this flush/reload may occur every 4 vertical pixels, every 2, every pixel, etc., but definitely way too frequently.

. The 1024 issue: Width 1024 (and 2048, etc.) sometimes compounds the problem. Since 1024 is a power of 2, successive rows frequently map onto exactly the same cache lines. This virtually guarantees a flush on each pixel. One fix is to use buffers that are a bit wider, so the cache lines don't line up perfectly: if the user wants a 1024-wide resolution, build your off-screen buffers 1028 wide. This change alone can improve performance noticeably (though unintuitively).

. "Quad" drawers: One way to avoid cache issues is to reduce cache misses by painting horizontally instead of vertically. A few ports are doing this, with varying degrees of benefit. The problem is, it's really hard to paint horizontally, when Doom is designed to paint vertically. One approach is to treat your off-screen buffer as if it consisted of sets of 4 pixels rotated 90 degrees. If it sounds complicated, that's because it is. This convoluted buffer is read in blocks of 4 when dumping the scene to the actual video buffer. This can actually be faster than the straight-forward approach.

. Modern processor instructions: Modern processors have special instructions that specifically bypass the cache during data access. I guess the cross-platform goal gets in the way of hand-writing assembly renderer primitives optimized for each port's target processors. It's a shame, because there are real performance gains to be had, and it's easy enough to include them conditionally (a rough sketch follows at the end of this post).

Once written, they would probably not have to be touched again for a long time, and they are easy enough to do.

. Game AI: All possible renderer optimizations are for nothing if your port has slow AI. This can happen when adding features to the port. Swapping hard-coded constants for runtime-changeable variables can cause a performance hit. But some of the biggest culprits are conditional statements: if, switch, etc. Modern processors try to predict whether a conditional will turn out true or false before it is even reached in normal program flow. When the processor guesses correctly, programs tend to run faster, because it can do some work ahead of time. But when it predicts incorrectly, that can actually cause a large slowdown.

Programmers know how to optimize conditionals. The problem is, if you micro-optimize too much, the code can become difficult to manage or read. As far as ZDoom goes, I don't know how optimal the codebase is. But ZDoom supports the most editing features (read: conditionals) of any port. I would suggest that, if there are optimizations to be had, the programmers are doing the right thing by holding off on micro-optimization and instead waiting for, say, a major release. I do know that, a few years back, when I was profiling renderers, ZDoom's was at the top of the list.
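For illustration, here's roughly what such a non-caching write could look like using compiler intrinsics rather than hand-written assembly. This is only a sketch with made-up names; a real drawer would sample a texture instead of filling with a flat color:

#include <emmintrin.h>   /* SSE2 intrinsics: _mm_stream_si32, _mm_sfence */
#include <stdint.h>

/* Hypothetical column fill using non-temporal stores.  Each store goes
   around the cache, so painting a vertical strip doesn't evict data the
   rest of the renderer still needs. */
static void fill_column_nt(uint32_t *dest, int pitch_pixels, int count,
                           uint32_t color)
{
  while (count-- > 0)
  {
    _mm_stream_si32((int *)dest, (int)color);   /* MOVNTI: bypass the cache */
    dest += pitch_pixels;                       /* step down one screen row */
  }
  _mm_sfence();   /* make the streamed writes visible before the buffer is read */
}

Whether this actually wins depends on the CPU and on how soon the written pixels get read back, so it's the kind of thing you'd want to benchmark per machine.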


ZDoom's software renderer should still be among the fastest, but the development branch has the issue of the recently merged portal code, which has some performance problems: apparently it creates gratuitous overdraw in non-portal areas. Once this is resolved, performance should go back to what it was before.

kb1 said:

. The wall issue: Doom paints its walls vertically, which is fine on a 486. But modern processors are optimized to encourage programs to write "horizontally", or contiguously. When Doom paints a wall on a modern processor, the memory subsystem accesses that pixel and starts caching subsequent memory positions in a "horizontal" direction. Unfortunately, that cached data is useless when Doom writes the next pixel vertically. Actually, it's not only useless but detrimental, because it now needs to be flushed, which can waste time. Depending on the resolution and the sizes of the various cache buffers, this flush/reload may occur every 4 vertical pixels, every 2, every pixel, etc., but definitely way too frequently.

(citation needed)


I am not sure if you're being sarcastic, if you're asking for (or advocating) more information to enable further research into the phenomenon, or if this is a new forum policy on the posting of vaguely technical information. Can you please be more specific? The source of the information I posted is my own knowledge and experience. I did state that my goal was to avoid getting too technical, so that I could state the issues in layman's terms.

Anyone more interested in cache performance can find plenty of technical information on the subject, stated much more eloquently than I can manage. A quick Google search for "pipeline burst cache" or "mapping memory into cache lines" yields lots of results.

Can you please provide some clues on when a "(citation needed)" response can be expected? (And exactly what is really being asked of me :)

Thanks.


I always assumed it was because CPU caches use the least significant bits of the memory address as the cache key. If you're writing vertically into a framebuffer and the pitch is a power of two, you'll therefore end up always writing into the same cache entry/entries and invalidating the cache with every write.

Problems like this (non-randomly distributed keys) are why good hash table implementations tend to use prime numbers for the table size, but of course it's very difficult to do an efficient divide by a non-power-of-two in hardware.

fraggle said:

I always assumed it was because CPU caches use the least significant bits of the memory address as the cache key. If you're writing vertically into a framebuffer and the pitch is a power of two, you'll therefore end up always writing into the same cache entry/entries and invalidating the cache with every write.

Problems like this (non-randomly distributed keys) are why good hash table implementations tend to use prime numbers for the table size, but of course it's very difficult to do an efficient divide by a non-power-of-two in hardware.

That is exactly the way I understand it, and you describe it very clearly and directly, which is something I tend to have difficulty with. Thank you! In fact, it doesn't have to be a power of 2. The number "64" is a cache size I remember. So, 320, 640, 960, 1280, as well as 1024, cause an invalidation quickly (not as quickly as a power of 2 would, but maybe every few pixels).

kb1 said:

In fact, it doesn't have to be a power of 2. The number "64" is a cache size I remember.

Am I misreading this part or are you suggesting 64 isn't a power of two?

kb1 said:

Can you please provide some clues on when a "(citation needed)" response can be expected? (And exactly what is really being asked of me :)

I was under the impression that a cache write miss has a significantly smaller speed penalty than a cache read miss, because it didn't block the CPU from moving on. Or am I misinformed?


One of the things that modern CPUs do to mitigate cache misses is to do the transfer in the 'background', so the next CPU instruction can be executed even though the previous one hasn't actually completed. When reading memory, the CPU can execute the following instruction if it doesn't require the register that was just read.
This is why modern compilers try to reorganise code to move the reads/writes away from the actual usage.

Linguica said:

I was under the impression that a cache write miss has a significantly smaller speed penalty than a cache read miss, because it didn't block the CPU from moving on. Or am I misinformed?

Revenant said:

Am I misreading this part or are you suggesting 64 isn't a power of two?

It has to do with the address bits being equal, not necessarily being all zero. I'll try to describe it hypothetically, because I don't know the inner workings of modern memory management well enough to be exact.

So our hypothetical machine has, say, 8 cache lines of 64 bytes each. Those 8 lines are blocks of memory that could be described as follows:

struct CacheLine
{
  _i64 PhysicalAddress;
  byte Data[64];
  bool Valid;
  bool Dirty;
} CacheLines[8];
When the CPU wants to read from memory, the address is masked as follows:
mask = (Address / 64) & 0000111b;
This returns a value 0 to 7, representing one of our 8 cache lines. Then, the following logic happens:
ReadMemory(_i64 Address)
{
  If ((Address & 11111111 11111111 11111111 11000000b) == CacheLines[mask].PhysicalAddress)
  {
    If (CacheLines[mask].Valid &&
        !CacheLines[mask].Dirty)
    {
      ReadMemory = CacheLines[mask].Data[Address & 00111111b];
    }
    Else
    {
      ReadMemory = PhysicalMemory[Address];
      RebuildCacheLine(Address);
    }
  }
  Else
  {
    ReadMemory = PhysicalMemory[Address];
    RebuildCacheLine(Address);
  }
}
This is really simplified, to the point that it's not correct, but it hopefully presents the idea that the address you read from or write to can have a devastating effect on how the cache works. If you read memory sequentially, you remain in the same cache line throughout all 64 bytes. But if you jump around, touching one byte here and one byte there, you are asking the cache subsystem to pull 64 bytes into a line, use a single byte, then toss all 64 bytes and reload the line with a new 64 bytes. Rinse, repeat.

So, when your horizontal resolution is a multiple of 64, and you are drawing a wall at the left-most position of the screen, here's what's happening:

[Resolution 640x]
Draw 1st wall pixel at screen/memory address 0
CacheLine = ((Addr / 64) and 7) = 0
RebuildCache[0] from address 0 to 63.

Draw 2nd pixel below prev. pixel (position 640)
CacheLine = ((Addr / 64) and 7) = 2
RebuildCache[2] from address 640 to 703.

Draw 3rd pixel below prev. pixel (position 1280)
CacheLine = ((Addr / 64) and 7) = 4
RebuildCache[4] from address 1280 to 1343.

Draw 4th pixel below prev. pixel (position 1920)
CacheLine = ((Addr / 64) and 7) = 6
RebuildCache[6] from address 1920 to 1983.

Then the pattern repeats at pixel 5: CacheLine 0, 2, 4, then 6. So, for resolution 640, you must flush every 4 vertical pixels, which is bad enough.

Now, how about 1024 resolution?

CacheLine = ((Addr / 64) and 7) = 0
CacheLine = (((Addr + 1024) / 64) and 7) = 0
CacheLine = (((Addr + 2048) / 64) and 7) = 0

Yep, a flush for each and every pixel. This is the worst-case scenario.

Now, I know I have been discussing reads intermixed with writes, but a similar issue occurs with writes. Eventually, data that's written to the cache needs to appear in main memory, and the whole system has to be able to offer the proper value regardless of where the data may reside at any given moment: it may be in L1, L2, or L3 cache, or in main memory. That's what the Valid and Dirty bits are for. Writes to main memory may occur in the background, as long as the cache line is still valid and not being forced out. That's what a flush does: finish pending writes, and pull a new block of memory into the cache.

That background writeback process is thwarted when the CPU is constantly forcing the cache to be flushed. It's kind of amazing that it somehow keeps it all together when this is happening, if you consider what it is trying to accomplish.

Think of sequential reading/writing as moving along with traffic, and vertical writing as crossing a major highway: forcing oncoming cars to stop and backing up the whole highway while you crawl across to get to the other side.

This is very easy for programmers to test:

1. Time this loop: Increase the 1000 enough so that it takes a few seconds, making it easy to get an accurate timing result.
int buffer[1024*768];

int main(void)
{
  for (int count = 0; count < 1000; count++)
  {
    for (int y = 0; y < (768-1)*1024; y += 1024)
    {
      for (int x = 0; x < 1024; x++)
      {
        buffer[y+x] = 0;
      }
    }
  }
  return 0;
}
2. Now, swap the "for y" line and the "for x" line, and time it again:
int buffer[1024*768];

int main(void)
{
  for (int count = 0; count < 1000; count++)
  {
    for (int x = 0; x < 1024; x++)
    {
      for (int y = 0; y < (768-1)*1024; y += 1024)
      {
        buffer[y+x] = 0;
      }
    }
  }
  return 0;
}
This second run is much slower, as the cache lines are being thrashed. Your system may vary, depending on your cache sizes and other factors, but the difference should be obvious. Be sure that your compiler is not optimizing anything out of existence; maybe compile without optimization to be sure.

Once again, this post is greatly simplified, and modern memory handling is much more involved. There are also other cache schemes; in some cases cache lines are preserved or evicted based on last-use time and other factors. However, the above should apply as a general rule. By the way, for a 3rd test, modify the second test as follows:
int buffer[1028*768];

int main(void)
{
  for (int count = 0; count < 1000; count++)
  {
    for (int x = 0; x < 1024; x++)
    {
      for (int y = 0; y < (768-1)*1028; y += 1028)
      {
        buffer[y+x] = 0;
      }
    }
  }
  return 0;
}
Time this version. With any luck, it runs much faster, even though it writes the same amount of data.
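If you want to see why the padding helps without running the benchmark, here's a toy calculation. The cache geometry is an assumption for illustration (a typical 32 KB, 8-way L1 with 64-byte lines has 64 sets), and it only models which cache set the first pixel of each row lands in:

#include <stdio.h>

/* Toy model: 64-byte lines, 64 sets (e.g. a 32 KB, 8-way L1).  These
   numbers are assumptions for illustration, not any particular CPU. */
#define LINE_SIZE 64
#define NUM_SETS  64

static void count_sets(const char *name, long pitch_bytes, int rows)
{
  int used[NUM_SETS] = {0};
  int distinct = 0;

  for (int r = 0; r < rows; r++)
  {
    long addr = (long)r * pitch_bytes;               /* first pixel of row r */
    int set = (int)((addr / LINE_SIZE) % NUM_SETS);  /* which set it maps to */
    if (used[set]++ == 0)
      distinct++;
  }
  printf("%-26s touches %2d of %d sets over %d rows\n",
         name, distinct, NUM_SETS, rows);
}

int main(void)
{
  count_sets("1024 ints/row (4096 bytes)", 1024 * 4, 768);
  count_sets("1028 ints/row (4112 bytes)", 1028 * 4, 768);
  return 0;
}

With these assumed numbers, the 4096-byte pitch lands the first pixel of every row in the same set, while the 4112-byte pitch spreads the rows across all 64 sets, which is the whole point of the 1028 trick.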

By the way, please do some more research and find some better articles that describe the process more closely; I have great difficulty trying to explain concepts like this, especially in a forum post. I'd love to hear about your results. It would be really nice to devise a most-cache-friendly renderer that works well at many resolutions on many architectures, maybe even a self-adjusting renderer. I've had some pretty good results with my experimental renderers, but I know we have a long way to go, and software rendering needs to be a bit faster for today's resolutions.

Wow, I wrote another book.


linguica@hissy:~$ gcc fart1.c -O0 -o fart1
linguica@hissy:~$ gcc fart2.c -O0 -o fart2
linguica@hissy:~$ gcc fart3.c -O0 -o fart3

linguica@hissy:~$ time ./fart1

real    0m2.846s
user    0m2.610s
sys     0m0.007s

linguica@hissy:~$ time ./fart2

real    0m29.733s
user    0m25.973s
sys     0m0.110s

linguica@hissy:~$ time ./fart3

real    0m14.570s
user    0m13.397s
sys     0m0.093s

good times...


So these results confirm what kb1 said: the first example takes 3 seconds, the second 30 (tenfold!), and the last 15 (halved).

Linguica said:

linguica@hissy:~$ gcc fart1.c -O0 -o fart1
linguica@hissy:~$ gcc fart2.c -O0 -o fart2
linguica@hissy:~$ gcc fart3.c -O0 -o fart3

linguica@hissy:~$ time ./fart1

real    0m2.846s
user    0m2.610s
sys     0m0.007s

linguica@hissy:~$ time ./fart2

real    0m29.733s
user    0m25.973s
sys     0m0.110s

linguica@hissy:~$ time ./fart3

real    0m14.570s
user    0m13.397s
sys     0m0.093s

good times...


Thanks, Linguica! Wow, even more of a slowdown than I expected. You can't really fault id for vertical drawing, though. The algorithm makes a lot of sense, taking advantage of the fact that wall strips only need to be stretched, never rotated, unlike floor textures. It was a brilliant discovery that seems obvious now.

Interestingly enough, on 1993 machines (386/486), I think ALL memory references were slow and equivalent, so it didn't really matter which direction you read from/wrote to. And, without the extra cache wait states and penalties, reading and writing ran at a more or less constant speed.

I was shocked at how slowly my port displayed 1024x768 vs. 800x600. The 1028 back buffer is a neat trick. Here's another one:

Say you have a wall texture that's 3x8:

Aa0
Bb1
Cc2
Dd3
Ee4
Ff5
Gg6
Hh7
Assume that you're close enough, and perpendicular enough, that it will be drawn unscaled/unskewed, zoomed 100%, so the output on screen should look just like the texture definition above:
Aa0
Bb1
Cc2
Dd3
Ee4
Ff5
Gg6
Hh7
The straightforward approach for your back buffer is to model it just like the screen buffer and write to it just like above (I won't repeat it again).

The quad-back-buffer approach would have you write to the back buffer like this:
Aa0 Bb1 Cc2 Dd3
Ee4 Ff5 Gg6 Hh7
Or, maybe this (using the power-of-2 disruptor trick, aka the 1028 trick):
Aa0 Bb1 Cc2 Dd3
    Ee4 Ff5 Gg6 Hh7
NOTE: You always want to offset with a granularity of 4 bytes to keep data aligned on 32-bit boundaries. That's why you use 1028 for 1024, not 1025. Also, make sure all of your buffer's physical (logical) addresses are aligned on at least 32-bit (maybe 256-bit) boundaries. Many processors have extra wait states when reading/writing unaligned data.

This reduces the 8 flushes to 2, and may help even more on short textures. Upon the final blit from back buffer to screen, you write to the screen in 4-byte chunks, reassembling the blocks in the correct order. It's efficient because you paint with doublewords (32 bits) rather than bytes (8 bits). It's difficult to explain without a picture, but I hope you get the drift.
At one time (and maybe it still does), ZDoom went a bit further, drawing 4 texture columns at the same time into this quad buffer scheme. The complexity of doing this reduces its performance somewhat, but time is still saved by further avoiding cache issues.
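Here's a bare-bones sketch of that 4-columns-at-a-time idea. The names and layout are made up for illustration (this is not ZDoom's actual code), and it just fills with a flat color instead of sampling a texture:

#include <stdint.h>
#include <string.h>

enum { SCREEN_HEIGHT = 200 };

/* Temporary buffer holding 4 adjacent screen columns, row-interleaved,
   so pixels that sit side by side on screen are contiguous here. */
static uint8_t quad_temp[SCREEN_HEIGHT * 4];

/* Draw one column (col = 0..3 within the group) into the temp buffer.
   Consecutive vertical pixels are only 4 bytes apart, so one 64-byte
   cache line now holds 16 of them instead of 1. */
static void draw_column_to_temp(int col, int y_top, int count, uint8_t color)
{
  uint8_t *dest = &quad_temp[y_top * 4 + col];
  while (count-- > 0)
  {
    *dest = color;
    dest += 4;
  }
}

/* Final blit: one contiguous 4-byte write per screen row puts the
   finished columns back into the real frame buffer. */
static void blit_quad(uint8_t *screen, int pitch, int x, int y_top, int count)
{
  const uint8_t *src = &quad_temp[y_top * 4];
  uint8_t *dest = screen + y_top * pitch + x;
  while (count-- > 0)
  {
    memcpy(dest, src, 4);
    src += 4;
    dest += pitch;
  }
}

The win is that both the column drawing and the blit touch memory mostly sequentially; the price is the extra copy and the bookkeeping around columns that don't all start and end at the same height.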

IMHO, modern processors should allow the programmer more control over the cache, but you'd really have to know what you were doing. In fact, there is a very small set of instructions that allow a bit of control over the cache; I haven't studied them in depth yet. I think there are a couple of special "read/write memory without updating cache" instructions, and there is a non-blocking "prefetch" hint instruction that tells the memory subsystem to fill a cache line for future use. You could, for example, prefetch a texture strip before trying to draw it, or write to screen memory without disturbing the cache.
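For the prefetch side, the SSE intrinsic version might look something like this (again just a sketch with made-up names):

#include <xmmintrin.h>   /* SSE: _mm_prefetch */
#include <stdint.h>

/* Hypothetical drawer that asks the cache to start pulling in the next
   texture column while the current one is being drawn.  The hint is
   non-blocking; if it turns out to be wrong, nothing breaks, it's just
   wasted work. */
static void draw_column_prefetch_next(const uint8_t *cur_col,
                                      const uint8_t *next_col,
                                      uint8_t *dest, int pitch, int count)
{
  _mm_prefetch((const char *)next_col, _MM_HINT_T0);

  for (int i = 0; i < count; i++)
  {
    dest[i * pitch] = cur_col[i];
  }
}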

The BIGGEST cache crime in Doom rendering is sprite/extra-floor overdraw. Double points for translucency. Rendering a translucent explosion on top of another explosion is cache murder! The buffer's current color has to be read, blended, and rewritten, vertically! The read must return the correct value whether the pixel is in cache or in main memory, and it may not have been written to main memory yet. But now we overwrite it! In this case, main memory is doubly stale; I have no idea how that is handled :)

I tremble in fear when I think of these new 4K monitors. 3840x2160? A cache flush every 2 pixels. 3840x2160 in truecolor = 33,177,600 bytes x (back buffer + screen buffer + source data) = 99,532,800 bytes x 35 fps = 3,483,648,000 bytes, about 3.5 GB/sec of data transfer, just to paint the screen without sprites. The naive method suggests that at least 100,000 cache flushes would occur in that second. That can't be good.

A few other points: This is different for each system. It would be interesting to run the samples on a 486 - I imagine they would all run about the same.

The most recent processors have what's called a code cache. This is a special, super-fast on-chip cache for running tight loops. This allows loops to run extremely fast - if they fit within the cache. Unfortunately, these code caches are currently really small, like 64 bytes. Intel's Sandy Bridge has a code cache, for example.

Traditionally, programmers have been taught to unroll loops, and I have not only seen, but tried out some unrolled texture render loops, with some success. But, with this code cache, you may be better off not unrolling, if it allows the loop to stay within the cache.
A cool approach would be for ports to offer a range of different render functions, each optimized a bit differently from the others. Then the port could have an "Optimize" function that runs at startup, or on user request. This function would swap out renderers, timing the results of each, and keep the fastest one for your CPU and motherboard. That would be a nice feature.
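A rough sketch of what that startup "Optimize" step could look like, with two dummy drawer variants standing in for the real ones (everything here is hypothetical):

#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*column_drawer_t)(uint8_t *dest, int pitch, int count, uint8_t color);

/* Two stand-in implementations; a real port would plug in its standard,
   unrolled, quad-buffer, SIMD, etc. variants behind the same signature. */
static void draw_column_standard(uint8_t *dest, int pitch, int count, uint8_t color)
{
  while (count-- > 0)
  {
    *dest = color;
    dest += pitch;
  }
}

static void draw_column_unrolled4(uint8_t *dest, int pitch, int count, uint8_t color)
{
  while (count >= 4)
  {
    dest[0] = color;
    dest[pitch] = color;
    dest[pitch * 2] = color;
    dest[pitch * 3] = color;
    dest += pitch * 4;
    count -= 4;
  }
  while (count-- > 0)
  {
    *dest = color;
    dest += pitch;
  }
}

static column_drawer_t candidates[] = { draw_column_standard, draw_column_unrolled4 };

/* Time each candidate filling a whole frame of columns and keep the fastest. */
static column_drawer_t pick_fastest_drawer(uint8_t *scratch, int width, int height)
{
  column_drawer_t best = candidates[0];
  double best_time = 1e30;

  for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++)
  {
    clock_t start = clock();
    for (int x = 0; x < width; x++)
      candidates[i](scratch + x, width, height, 0);
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (elapsed < best_time)
    {
      best_time = elapsed;
      best = candidates[i];
    }
  }
  return best;
}

In practice you'd run each candidate several times, on a buffer the size of the user's actual resolution, since that's exactly where the cache behaviour differs.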

There are a ton of different ways to draw a wall strip: standard, unrolled 2/4/6/8, FPU moves, conditional moves (CMOV), SIMD memory moves, aligned code, inline vs. called, pointer vs. array, double/quad/8-way buffer configurations, C vs. machine language.

Some of these are not portable, so you'd have to have CPU-specific/compiler-specific add-on code.


Of course, we were not originally talking about unoptimized code; we were talking about modern ports with modern compilers.

Andrews-Air:~ andrewstine$ gcc fart2.c -O0 -o fart2
Andrews-Air:~ andrewstine$ time ./fart2

real	0m14.679s
user	0m14.365s
sys	0m0.135s

Andrews-Air:~ andrewstine$ gcc fart2.c -O3 -o fart2
Andrews-Air:~ andrewstine$ time ./fart2

real	0m0.009s
user	0m0.001s
sys	0m0.003s

Linguica said:

Of course, we were not originally talking about unoptimized code; we were talking about modern ports with modern compilers.



Which is ultimately meaningless if you have a piece of code that can be optimized away.
It should be quite clear from these numbers that the compiler just either reordered the loops or completely folded them into a memcpy, which for real column drawing is simply not possible.


Yeah, it's dangerous to optimise benchmarks. When the compiler decides that some data/variable/result is never used, sometimes it just omits whole pieces of the "useless" code. :D

Check the optimised binary in a disassembler, Linguica.

Graf Zahl said:

It should be quite clear from these numbers that the compiler just either reordered the loops or completely folded them into a memcpy, which for real column drawing is simply not possible.


From time to time this prompts the question of whether a "completely horizontal" renderer for Doom would be possible or worthwhile to write.

IMO, in a sense such renderers already exist in the form of OpenGL/accelerated 3D renderers, which do away with the column-based rendering, so they can be used as a sort of yardstick/upper limit on any -potential- performance benefits.

Short of a complete rewrite, another possibility would be to draw columns on a transposed screen (swap x with y), and then transpose it again when actually displaying. The catch is that transposing such a large chunk of data is itself a very expensive and cache-unfriendly operation, but it just might be that beyond/below a certain resolution, an optimized transposed renderer could be slightly more efficient.
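The usual way to soften the transpose cost is to do it in small tiles, so both the reads and the writes stay within a handful of cache lines at a time. A rough sketch, with made-up example sizes:

#include <stdint.h>

enum { W = 1024, H = 768, TILE = 16 };   /* example sizes; W and H are multiples of TILE */

/* Cache-blocked transpose of a W x H byte buffer: src is row-major W wide,
   dst ends up row-major H wide. */
static void transpose_blocked(const uint8_t *src, uint8_t *dst)
{
  for (int y0 = 0; y0 < H; y0 += TILE)
  {
    for (int x0 = 0; x0 < W; x0 += TILE)
    {
      for (int y = y0; y < y0 + TILE; y++)
      {
        for (int x = x0; x < x0 + TILE; x++)
        {
          dst[x * H + y] = src[y * W + x];
        }
      }
    }
  }
}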

Another possibility would be to use a "hybrid" approach and have cache-optimized transposed software rendering and hardware-accelerated transposing for the final display.

Either that, or become super-hardcore and learn to play Doom on a transposed display, because "it's leaner and meaner that way!" ;-)

VGA said:

Yeah, it's dangerous to optimise benchmarks. When the compiler decides that some data/variable/result is never used, sometimes it just omits whole pieces of the "useless" code. :D

OK, I added an fprintf to stdout in both to ensure the code doesn't get thrown away entirely. (And reduced it from a loop of 1000 to a loop of 10, since it's much slower in general with the output...)

Andrews-Air:~ andrewstine$ gcc fart1.c -O3 -o fart1
Andrews-Air:~ andrewstine$ gcc fart2.c -O3 -o fart2

Andrews-Air:~ andrewstine$ time ./fart1 >/dev/null

real	0m1.200s
user	0m1.187s
sys	0m0.008s

Andrews-Air:~ andrewstine$ time ./fart2 >/dev/null

real	0m1.223s
user	0m1.211s
sys	0m0.007s
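Another way to keep the optimizer from discarding (or reordering) the stores, if anyone else wants to repeat the experiment, is to make the buffer volatile. A sketch of the column-order test done that way (not the exact fart2.c above):

#include <stdio.h>

/* volatile forces every store to actually happen, in the order written,
   so the access pattern being measured can't be optimized away. */
static volatile int buffer[1024 * 768];

int main(void)
{
  for (int count = 0; count < 10; count++)
  {
    for (int x = 0; x < 1024; x++)
    {
      for (int y = 0; y < (768 - 1) * 1024; y += 1024)
      {
        buffer[y + x] = 0;
      }
    }
  }
  printf("done\n");
  return 0;
}

The flip side is that volatile also stops the compiler from vectorizing the row-order version, so it's best used when you specifically want to measure the raw access pattern.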

