Reasons for success or failure of source ports?


kb1 said:

There is nothing even remotely obfuscated enough in Killough's code to make it difficult to make a beneficial algorithmic change. And, once you write your nifty new algorithm, copy and paste it, then optimize the hell out of it - why not? All the player cares about is the fidelity of the gameplay, and of performance, and the player should always come first. 10% saved, here and there, means 10% more time to maintain frame rate, especially in net games.


This simply does not work. If you optimize too heavily for one compiler, the next one may not like what you did and create even worse code.
The best way to have good results across multiple compilers is to simply write clean and readable code and be done with it. I'll just point you to Killough's R_PointToAngle2 as a textbook example of an optimizer-dependent code design. Some versions of Visual C++ create utterly horrendous machine code from it, while they create nearly perfect code from id's version.

Also, your math doesn't add up. In general, even on complex maps, ZDoom spends 4-5 ms at most on the playsim per frame. That means if you manage to improve matters by 10%, you gain 0.5 ms. And 10% is a lot; you'll never get there with micro/peephole optimization, because code execution is spread out too far, with the most heavily hit functions mostly being those affected by CPU cache stalls (i.e. those which iterate over other data). The code you are trying to optimize here is not where time is normally lost. What you show here is typical 1990's thinking, from when computers were a lot slower and this stuff could quickly add up to multiple milliseconds - what today amounts to microseconds. How do you think ZDoom gets away with having large amounts of play code scriptified? Switching on the profiler shows that this code runs in less than 0.1 ms per frame; even if you assume that interpreted VM code is 10-15x slower than native code, the savings here are simply too marginal.

kb1 said:

Killough was a bit lax on commenting (which is really all that would be required). But, the thing is (my guess anyway): Killough has no problem reading code of this complexity, so I don't think he felt like it needed a lot of comments.


I cannot disagree with that guess. But behold this epitome of code that's unnecessarily hard to follow:

dboolean PTR_NoWayTraverse(intercept_t* in)
{
  line_t *ld = in->d.line;                   // This linedef
  return ld->special || !(                   // Ignore specials
   ld->flags & ML_BLOCKING || (              // Always blocking
   P_LineOpening(ld),                        // Find openings
   openrange <= 0 ||                         // No opening
   openbottom > usething->z+24*FRACUNIT ||   // Too high it blocks
   opentop < usething->z+usething->height    // Too low it blocks
   )
  );
}
Yes, that's just six lines of simple code that, despite the comments, requires a disproportionately large amount of thinking to follow.
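
For contrast, here is the same logic spelled out as plain statements - a sketch for illustration only, not code from any actual port:

dboolean PTR_NoWayTraverse_unrolled(intercept_t *in)
{
  line_t *ld = in->d.line;

  if (ld->special)                               // Ignore specials
    return true;
  if (ld->flags & ML_BLOCKING)                   // Always blocking
    return false;

  P_LineOpening(ld);                             // Find openings

  if (openrange <= 0)                            // No opening
    return false;
  if (openbottom > usething->z + 24*FRACUNIT)    // Too high, it blocks
    return false;
  if (opentop < usething->z + usething->height)  // Too low, it blocks
    return false;

  return true;                                   // The line can be crossed
}
Any modern optimizer should produce essentially the same machine code for both versions, but this one can be read top to bottom without untangling a comma operator buried inside a boolean expression.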

kb1 said:

Academic code is nice for us humans, but, at the end of the day, I'd rather play the source that the compiler liked the best, if you get my meaning.


It's great to know that you value the machine more than the people who have to work with it. But in the end, if people do not understand your code, they either won't touch it or just replace it at the earliest possible convenience.

At the end of the day, what's important is that the code can be used and/or reused. No such luck if you start optimizing to the metal in pointless places.

kb1 said:

Yes, it is parsed and converted. But it takes an aggressive optimization pass to remove intermediaries, wisely dole out precious resources (registers), take redundant code out of loops, etc. There are literally hundreds of optimization opportunities being taken advantage of in today's compilers. Without those, the compiler does a stand-up job at converting your code to assembly, but misses a bunch of these techniques.


There's a difference between writing clean and concise code that groups its work properly, and taking clean code and mucking it up, trading human accessibility for marginal performance gains. I think every software engineer out there will tell you that this stuff is not what really makes software faster; for that you have to plan bigger, i.e. at the algorithmic level, not the implementation level.

kb1 said:

Writing good optimizing compilers is a massively complex task, cause if they change the meaning of the code one small bit, they are useless.


That's a given. But today we do have the luxury of being able to work with good optimizing compilers that are perfectly capable of taking some human readable code and optimizing it for the machine. Nothing these days should ever require writing long one-liners in Killough style just to squeeze out one more free register. And even if that were possible, it's not where time is lost.

The first rule of source optimization should always be to optimize the 1% of code that runs 90% of the time, not the 90% that only runs 1% of the time combined; investing work there is not going to make any difference.
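
And to find that 1% you measure rather than guess. A minimal scoped timer of the kind any engine can embed - just a sketch with hypothetical names; ZDoom's actual profiler is more elaborate:

#include <chrono>
#include <cstdio>

struct ScopeTimer
{
	const char *name;
	std::chrono::steady_clock::time_point start;

	explicit ScopeTimer(const char *n)
		: name(n), start(std::chrono::steady_clock::now()) {}

	~ScopeTimer()	// prints the elapsed time when the scope ends
	{
		double ms = std::chrono::duration<double, std::milli>(
			std::chrono::steady_clock::now() - start).count();
		std::printf("%s: %.3f ms\n", name, ms);
	}
};

void RunPlaysimTick()	// hypothetical wrapper around one playsim tick
{
	ScopeTimer t("playsim");
	// ... run the tick here; the destructor reports where the time went.
}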

As for code optimization vs. algorithmic refactoring here's a recent example from ZDoom:

Here's one of the drawer functions ZDoom had until December:

void vlinec4 ()
{
	BYTE *dest = dc_dest;
	int count = dc_count;
	int bits = vlinebits;
	DWORD place;

	do
	{
		dest[0] = palookupoffse[0][bufplce[0][(place=vplce[0])>>bits]]; vplce[0] = place+vince[0];
		dest[1] = palookupoffse[1][bufplce[1][(place=vplce[1])>>bits]]; vplce[1] = place+vince[1];
		dest[2] = palookupoffse[2][bufplce[2][(place=vplce[2])>>bits]]; vplce[2] = place+vince[2];
		dest[3] = palookupoffse[3][bufplce[3][(place=vplce[3])>>bits]]; vplce[3] = place+vince[3];
		dest += dc_pitch;
	} while (--count);
}
This was a bit slow, especially on 64 bit. The 64 bit assembly version of this was twice as fast. After checking the assembly I realized that accessing global variables was the main culprit, so I did this:
// Optimized version for 64 bit. In 64 bit mode, accessing global variables is very expensive, so even though
// this exceeds the register count, loading all those values into local variables is faster than not doing so.
void vlinec4()
{
	BYTE *dest = dc_dest;
	int count = dc_count;
	int bits = vlinebits;
	DWORD place;
	auto pal0 = palookupoffse[0];
	auto pal1 = palookupoffse[1];
	auto pal2 = palookupoffse[2];
	auto pal3 = palookupoffse[3];
	auto buf0 = bufplce[0];
	auto buf1 = bufplce[1];
	auto buf2 = bufplce[2];
	auto buf3 = bufplce[3];
	const auto vince0 = vince[0];
	const auto vince1 = vince[1];
	const auto vince2 = vince[2];
	const auto vince3 = vince[3];
	auto vplce0 = vplce[0];
	auto vplce1 = vplce[1];
	auto vplce2 = vplce[2];
	auto vplce3 = vplce[3];

	do
	{
		dest[0] = pal0[buf0[(place = vplce0) >> bits]]; vplce0 = place + vince0;
		dest[1] = pal1[buf1[(place = vplce1) >> bits]]; vplce1 = place + vince1;
		dest[2] = pal2[buf2[(place = vplce2) >> bits]]; vplce2 = place + vince2;
		dest[3] = pal3[buf3[(place = vplce3) >> bits]]; vplce3 = place + vince3;
		dest += dc_pitch;
	} while (--count);
}
Result: just as fast as the assembly. But the 100% speed gain in this function translated to only a 2-3% performance improvement overall, because even though this is a low-level drawer, it is just one of many.

Sounds good? How about this: the multithreaded drawer dpJudas wrote shortly afterward brought a >50% improvement, which would not have been possible with how this code depended on global variables - much less with the assembly version of this function. This was a nice bit of optimization, but not really the best way to approach the problem as a whole; I only went through the exercise to justify getting rid of the assembly code.
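
To make the point concrete: once the drawer state lives in a struct instead of in globals, each worker thread can simply own its own copy. This is only a sketch - the actual QZDoom drawer structures are more involved - but it shows the shape of it:

#include <cstdint>

struct DrawerState	// per-thread copy of what used to be globals
{
	uint8_t *dest;
	int count, bits, pitch;
	const uint8_t *pal[4];
	const uint8_t *buf[4];
	uint32_t vince[4], vplce[4];
};

void vlinec4_mt(DrawerState &s)	// hypothetical thread-safe variant
{
	do
	{
		for (int i = 0; i < 4; i++)
		{
			uint32_t place = s.vplce[i];
			s.dest[i] = s.pal[i][s.buf[i][place >> s.bits]];
			s.vplce[i] = place + s.vince[i];
		}
		s.dest += s.pitch;
	} while (--s.count);
}
Two threads can now call vlinec4_mt with two different DrawerState objects - say, one per half of the screen - without trampling shared state, which is exactly what the global-variable version ruled out.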

Jon said:

I couldn't disagree more.


I dare say that every programmer worth their money disagrees with that statement...
Back in 1992 I was developing a game and had to write code like that, just to ensure that the executable remained small - the whole thing had to be shipped on 5 1/4'' floppies, so the stuff I did to trim it down was absolutely ludicrous.
Today that code only serves as a warning of 'how not to write code'. It has become completely unusable, because way too much optimization had to be done at the source level - the compilers of the time were simply not good enough.

kb1 said:

Academic code is nice for us humans, but, at the end of the day, I'd rather play the source that the compiler liked the best, if you get my meaning.

That means that you have to try various different implementations of the same general algorithm, including an academic one, and benchmark how they fare after compilation.

The result might surprise you.

Gez said:

The result might surprise you.



Probably as much as I was surprised when I compared ZDoom's highly optimized assembly code with its C equivalents and found that the assembly's benefit had completely evaporated over the years, giving the C version a decided advantage on the current line of CPUs.

kb1 said:

There is nothing even remotely obfuscated enough in Killough's code to make it difficult to make a beneficial algorithmic change. And, once you write your nifty new algorithm, copy and paste it, then optimize the hell out of it - why not?


For two reasons. First, it is not a given that you picked the best algorithm in the first place. Once you go down the route of heavily optimizing a function, you've usually gone past the point of no return - if the algorithm has to be changed after this, you almost always have to rewrite the function. You therefore want to keep the "damage" as local as possible - the fewer functions you need to heavily optimize, the better.

The second reason is that speed isn't the only important property of code. Take Cardboard vs. ZDoom's software renderer. Which one would you rather work on? Any sane developer would pick Cardboard, because its code is far more clear and readable. Why is it more readable? Well, because it didn't decide to set the texture global pointer 10,000 miles away from where it is used. Why does ZDoom set the texture global so early? Because it was theoretically faster.

Okay, so if you can read the code yourself, what's the problem with the second reason? If you look at the libraries around in the late 80's and early 90's, you'll notice that all the libraries that went for a 10% speed gain by FUBAR'ing their code died out. The ones that didn't do that survived the transition. Even if your optimization gave users in 1998 a 25% speed increase, you really did them no favor, as for the next decade they didn't get new features because the code had become unmaintainable.

dpJudas said:

The second reason is that speed isn't the only important property of code. Take Cardboard vs. ZDoom's software renderer. Which one would you rather work on? Any sane developer would pick Cardboard, because its code is far more clear and readable. Why is it more readable? Well, because it didn't decide to set the texture global pointer 10,000 miles away from where it is used. Why does ZDoom set the texture global so early? Because it was theoretically faster.


And this messed-up variable handling was the reason why I wasn't able to implement per-tier wall lighting after managing to get per-tier wall scrolling in: I was never able to track down the dependencies of the variables involved. It was a huge clusterfuck of stuff that made no sense.

Rachael said:

If you want to have the option to be able to expand upon code in the future though, or have others help you with it, readability is really helpful.

Absolutely. You can always put the unoptimized code into comments. You can describe functionality, the optimization methods used, etc. You can also use conditional compilation, or you can even swap out the functions at runtime. You can be as verbose as you wish. But, if the coder cannot understand code, there's not much you can do about it.
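
For instance, keeping both versions alive costs next to nothing - a sketch, with made-up function names:

// Readable reference implementation; doubles as the documentation.
void R_DrawColumn_plain() { /* straightforward loop goes here */ }

// Hand-tuned variant, verified against the plain one in tests.
void R_DrawColumn_fast()  { /* optimized loop goes here */ }

#ifdef USE_OPTIMIZED_DRAWERS
void (*R_DrawColumn)() = R_DrawColumn_fast;
#else
void (*R_DrawColumn)() = R_DrawColumn_plain;
#endif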

Jon said:

I couldn't disagree more.

You don't like academic code? I don't like code the compiler understands best? Oh, you were making a funny! ha.

Gez said:

That means that you have to try various different implementations of the same general algorithm, including an academic one, and benchmark how they fare after compilation. The result might surprise you.

Exactly!

dpJudas said:

For two reasons. First, it is not a given that you picked the best algorithm in the first place. Once you go down the route of heavily optimizing a function, you've usually gone past the point of no return - if the algorithm has to be changed after this, you almost always have to rewrite the function. You therefore want to keep the "damage" as local as possible - the fewer functions you need to heavily optimize, the better.

The second reason is that speed isn't the only important property of code. Take Cardboard vs. ZDoom's software renderer. Which one would you rather work on? Any sane developer would pick Cardboard, because its code is far more clear and readable. Why is it more readable? Well, because it didn't decide to set the texture global pointer 10,000 miles away from where it is used. Why does ZDoom set the texture global so early? Because it was theoretically faster.

All good points (up to the sane part, anyway). Programming is full of trade-off decisions. Speed, accuracy, reliability, and maintainability are all important goals. I argue that they do not all need to be mutually exclusive. The best code maximizes each as much as possible. And, of course, you'll want backups, regression tests, mock-ups, profiling, speculation, debugging.

dpJudas said:

Okay, so if you can read the code yourself, what's the problem with the second reason? If you look at the libraries around in the late 80's and early 90's, you'll notice that all the libraries that went for a 10% speed gain by FUBAR'ing their code died out. The ones that didn't do that survived the transition. Even if your optimization gave users in 1998 a 25% speed increase, you really did them no favor, as for the next decade they didn't get new features because the code had become unmaintainable.

So they half-assed the optimization, and the guys upgrading sucked. Not really a fault of the theory. Usually, the programmers get pissed and leave for greener pastures, and the company hires a bunch of fresh, dewy-eyed kids who think they can solve any problem by throwing a database engine, a web server, and a home-grown script engine into the mix, and, #3 Profit! :)

For every great programmer, there's typically 10 lousy programmers. So, I am promoting the "be great" theory, and expecting you guys to "get it".

Graf Zahl said:

If you optimize too heavily for one compiler, the next one may not like what you did and create even worse code.

Of course you make the best out of whatever situation you're currently in. If you switch compilers, obviously you may need to adjust your techniques a bit.

Graf Zahl said:

The best way to have good results across multiple compilers is to simply write clean and readable code and be done with it.

You be done with it. I don't want good code - I want great code. And I get it by doing the work, not making up reasons why I can't.

Graf Zahl said:

I'll just point you to Killough's R_PointToAngle2 as a textbook example of an optimizer-dependent code design.

We already discussed it, the reasons behind it, and you agreed that it was done for performance. He used the tools he had available to the best of their ability. Good for him (and his users, of course).

Graf Zahl said:

Some versions of Visual C++ create utterly horrendous machine code from it, while they create nearly perfect code from id's version.

Again, each compiler has strengths and weaknesses, which is why you optimize based on the tools you have available.

Graf Zahl said:

Also, your math doesn't add up. In general, even on complex maps, ZDoom spends 4-5 ms at most on the playsim per frame. That means if you manage to improve matters by 10%, you gain 0.5 ms. And 10% is a lot; you'll never get there with micro/peephole optimization, because code execution is spread out too far, with the most heavily hit functions mostly being those affected by CPU cache stalls (i.e. those which iterate over other data). The code you are trying to optimize here is not where time is normally lost.

There is so much BS in that quote...where to begin? First, the raw pseudo-science: "ZDoom gets X ms speed in playsim". On a 486 with 2Mb memory and 10,000 monsters? On a 12-Core i7, with 1 monster? You're breaking out some impossibly exact speeds of 4 to 5 ms, comparing it to a 10% number dpJudas pulled out of a hat (he did not mean it to be exact, he was trying to make a point), and then you multiply them together, and use his number times your number to tell me that I am incapable of optimizing a yet-unnamed port beyond this magic amount, because, apparently, I do not understand the cache subsystems, and that the code "I am trying to optimize" (you're psychic) is not the code that should be optimized (suggesting that you do know what needs to be optimized). WiTF?

Look, man, you injected yourself into this conversation, apparently intent on discrediting me with some fake-ass numbers and guesswork like that?

Graf Zahl said:

...1990's fiction...

I'll tell you what happened in the 90's: That's when Graf became an "expert" coder. "Expert" is a term I use to mean that you, at some point, "learned everything you needed to know about computers and programming." Because you're an "expert", no one can tell you how to write code, or that you made a mistake, because you're an "expert."

My guess was that you were employed as an over-worked, under-appreciated, under-paid code jockey, and this was your life for an extended period of time. That's when you decided to stop taking risks, stop going above and beyond the call of duty, 'cause it's just not appreciated. You stopped listening to the advice of your peers and superiors, 'cause, what did they know? You were the one in the trenches.

Yeah, I could be wrong. It sure seems to match a lot of your responses and reactions, though.

You see, I didn't give up. And, I'll be the first to admit what I know, and what I don't know. And, when I'm wrong. I am still learning, and will continue to learn until the day I die.

However, there are people, here, and at other places, who depend on my honesty, integrity, and knowledge, and I'll be damned if I'm going to let them down, by letting you drag me down with you. I have never questioned your ability to code, and, in fact, I will claim that you've said and done some pretty sharp stuff. And, you've also let out some incredibly ignorant statements and non-facts, a few of which were directed at me recently. Appreciate that.

Don't try to step to me, bud. I've been making money writing code for over 3/4th of my life, and you're gonna have to do a lot better than that flabby, flimsy, substance-less shit stain of a story you presented above to have a chance.

Save the pseudo-science for your script groupies, who love you unconditionally, as you rubber-stamp "No" onto their change requests. I don't need them telling me that I'm great, or that I'm doing a good job (or a bad job), and I sure don't need you telling me what code you cannot read, when you've decided to stop optimizing.

And, no, I'm not an "expert". But, I haven't given up.

@Graf, I will give you this:
Yes, in general, for the day-to-day, run-of-the-mill commercial application being built at a for-profit software house: yes, yes, throw good algos, lots of RAM, and fast hardware at the problem. Do not be too clever, it ain't worth it.

But, come on, man, it's Doom! It deserves real consideration, extra effort, the works. Real blood, sweat, and tears, not "be safe". Those things you're calling "rules" are just guidelines, and can, carefully, be broken, for the sake of the user - the player - the game. A faster, more vivid output *is* a better game.

By the way, regarding the drawer:

. Is the source data properly aligned? Properly staggered to skip cache aliasing?
. Is the destination properly aligned?
. Do you properly prefetch the data into the cache?
. Do you properly invalidate cache lines at the right time?
. Does the code interleave (pair) instructions to take advantage of all execution units?
. Does it provide branch hints?
. Does it check cache configuration/cache line size?
. Does it use different instruction sets based on what's available from the CPU?
. Do you prevent register dependencies and allow for quick instruction retirement?
. Does it profile to determine which set is faster on that particular CPU stepping?
. Does it use the above info to choose a routine that works best on that system?
. Does the code fit into code cache? Completely, or only partially? Do you check?
. Do you actually check the number of logical cores, and adjust accordingly, or do you let the OS work it out?
. Do you check the debug registers to see cache miss count, cache flushes, pipeline flushes, dependency stalls, memory stalls?
. Is the code using hand-written assembler, or is it depending on the compiler to do the right thing, regardless of compiler?

dp, don't get me wrong - I give credit where it's due. You guys have done great things with your renderer, despite your tendency to write tasteless, nasty forum responses at times...

My point is that, yes, today's optimization is not the same as it used to be, but that's been true since computers began. And, if you haven't done the steps above, you can continue to go further, if you wish.

@ those people arguing "code written exclusively for humans":
If you don't want to spend the extra effort to support performance as a feature, more power to you. But, why on Earth would anyone try to discourage me, or anyone else, from doing so themselves? Is it intimidation? Jealousy of a code you've never even seen? Spreading discontent? Fun to disrupt?

Optimized code does not have to be difficult for anyone to read. I think some people just like to be defiant. Code can be aligned and indented, horizontally and vertically, comments can be added. The syntax follows simple rules. Unless you try to make it difficult, what's the big deal? Some guys will complain about anything...

kb1 said:

..long list of very CPU specific things..

My point is that, yes, today's optimization is not the same as it used to be, but that's been true since computers began. And, if you haven't done the steps above, you can continue to go further, if you wish.

I'd like to remind you that QZDoom is targeting the following CPU architectures simultaneously: Netburst, Pentium M, Prescott, Intel Core, Nehalem, Bonnell, Sandy Bridge, Silvermont, Haswell, Skylake, Goldmont and the upcoming Kabylake. On top of that, the Raspberry Pi uses an ARM architecture. And I didn't list all the AMD architectures released in the same time period.

My point? It is not realistic for anyone to write assembly optimized for all of the above at the same time unless you have an army of developers. Clearly nobody in the Doom community has that. Virtually all the things you listed, except making sure that you're 16-byte aligned, are so architecture-specific that optimizing for them will almost certainly be a loss on some of the other architectures targeted.

If you think you can do it, more power to you. Unfortunately I'm quite convinced that 99.99% of all developers could not do this.

kb1 said:

Don't try to step to me, bud. I've been making money writing code for over 3/4th of my life, and you're gonna have to do a lot better than that flabby, flimsy, substance-less shit stain of a story you presented above to have a chance.

Save the pseudo-science for your script groupies, who love you unconditionally, as you rubber-stamp "No" onto their change requests. I don't need them telling me that I'm great, or that I'm doing a good job (or a bad job), and I sure don't need you telling me what code you cannot read, when you've decided to stop optimizing.

You aren't doing yourself any favors with comments like this. A lot of those [No]'s have a very good reason for them - I am not about to say that every piece of code submitted should be included, nor am I about to say that I agree with every single decision that Graf has made.

But I will tell you this: Graf and I agree a *LOT* more often than we ever disagree. And the reason for that is not because he or I are up each other's asses - it's because we have common goals and philosophies - it's because we understand each other better than you understand us. The same goes for dpJudas, as well.

kb1 said:

@ those people arguing "code written exclusively for humans":
If you don't want to spend the extra effort to support performance as a feature, more power to you. But, why on Earth would anyone try to discourage me, or anyone else, from doing so themselves? Is it intimidation? Jealousy of a code you've never even seen? Spreading discontent? Fun to disrupt?

Optimized code does not have to be difficult for anyone to read. I think some people just like to be defiant. Code can be aligned and indented, horizontally and vertically, comments can be added. The syntax follows simple rules. Unless you try to make it difficult, what's the big deal? Some guys will complain about anything...

It's simple, really, and I can't believe you haven't picked up on this so far: When you do that, you are the only one who can use that code. And even that is questionable - what happens when a decade goes by, code untouched, and you wonder just what in the flying fuck you were thinking when you made that?!

As mentioned also, compilers change, project managers change, etc. If you intend your source port to be usable by you alone - more power to you. Have fun with that. I am sure running it on your ancient 386SX will be worth every minute spent optimizing it.

But in reality - you are doing yourself a favor by creating human-readable code. One of the things that actually killed ZDoom was back when I played with its RGB555 blending drawers. Let me just get this off my chest now: THAT WAS A FUCKING NIGHTMARE! (I don't know who coded that monstrosity - whether it was Killough or Randy - it could have been both)

As soon as I figured out the algorithm and how it worked - obviously, that made it a lot easier for me to work with and understand what was going on. But do you think for one second when I rewrote that code with RGB666 drawers that I repeated those optimizations? Fuck no! As a result, performance on those drawers DID suffer - yes - they took twice as long as before. But I think it was worth it - primarily because blend drawers are so rarely ever even called in the first place. FPS did not tank on maps that were heavy on translucent walls unless the entire map was nothing but. And - like Eternity's Cardboard renderer - they were more precise, more accurate, looked better, and even had the option to switch back to the optimized quickies. Even better - someone who wants to look upon that code now knows what the fuck is going on with it.

However, that was the point where I put my foot down and said I am not touching ZDoom's drawers any more. I am not maintaining what would effectively become two different renderers. That's what killed ZDoom - that's when Graf said, "enough's enough; with nobody touching the software renderer any more, ZDoom will die."

Honestly - I wish it had not gotten to that point. I wish it would've been easier to work with that code - but guess what? This all stems from someone else's desire to optimize the code right up to the point of unreadability.

You want to know why everyone's discouraging you from doing it? That's why. Now you know.


Sigh...

I can only second Rachael about what was ZDoom's undoing - and it's really just that the code hadn't been optimized to be readable, it had been optimized to be fast - where 'fast' directly translates to 'fast on the one system it was tested on'.

The drawbacks of this approach were horrendous: we only had one person who could really make sense of that code, and most attempts by others to expand on it ended in failure. Yes, it gave ZDoom some advantage 15 years ago, but after that attrition set in: the code was barely maintained, new features got nixed because nobody could implement them, and the few things that did get added were nightmarish to do. The one bigger addition - the 3D floor code - required some intense fixing because it ran into subtle anomalous situations, all of which were caused by stupid decisions to shave off a few cycles here and there. The best example: in order to save a comparison in some deeper loop, the renderer estimated the number of openings up front, allocated just enough, and went on hoping that this was sufficient. A classic optimization based on fixed assumptions. The overall effect of it was zero, though. And the code was stuffed with such things.
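
The pattern in miniature - an illustrative sketch, not the actual renderer code:

#include <vector>

// The "classic optimization": size the buffer from an up-front estimate
// to save a comparison in the inner loop - and overrun silently if the
// estimate is ever wrong.
static short *openings;
void InitOpenings(int estimated) { openings = new short[estimated]; }

// The boring alternative: pay the nearly-free amortized bounds check.
static std::vector<short> safe_openings;
void AddOpening(short v) { safe_openings.push_back(v); }	// cannot overrun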

And now imagine we had had not just one piece of code of such quality, but 5, 6, 7 or even 8(?) of them for different CPU architectures. All those features would not just have to be written once, but multiple times!

I think the very trivial conclusion here is that nothing would have been done at all (except, maybe, optimizing the existing mess for the 9th, 10th or 11th(?) CPU architecture).

Sorry, but that's just a shortsighted attitude that would have gained nothing. Just have a look at the monstrous amount of cleanup work dpJudas had to do to implement multithreading in the renderer. And unlike your CPU fiddling, this brought some real gains - but a prerequisite was to clean out most of those 15 years of mess first!

kb1 said:

Don't try to step to me, bud. I've been making money writing code for over 3/4th of my life, and you're gonna have to do a lot better than that flabby, flimsy, substance-less shit stain of a story you presented above to have a chance.


I think I can return that quote verbatim to you.
But unlike you I grew up, and I set my priorities where it matters: favoring human-readable code wherever possible, NOT optimizing each fragment of code to the metal, and generally preferring something I can look at 10 years from now and still work with.
None of the optimized code I wrote in the 90's is even remotely usable now. I recently dug out an old game I wrote around 1993, and it's a mess. I would like to fix it up to compile on today's systems, but what it first needs is a serious cleanup pass, just like the one ZDoom's software renderer had to receive to make it workable again. The reason I wrote such a mess back then? Simply because I practically followed every single piece of advice about 'optimization' you have posted here. I wrote code that went for being small or being fast, but not code designed to be portable and future-proof.
Sorry, I am past that and don't buy into that horseshit anymore. These days the two most important properties I value are that code a) is cleanly written and comprehensible and b) does not overly depend on specific external libraries. Such external access should be abstracted well enough that it can easily be transitioned to another library if the first one ceases to be an option.

kb1 said:

You don't like academic code? I don't like code the compiler understands best? Oh, you were making a funny! ha.


No, I was being deadly serious. To try and be clear: I think that the primary audience of source code is humans, not compilers, in this day and age. Readability and maintainability are of the utmost importance.


Sorry for quoting a post from the previous page :)

kb1 said:

The simplest compiler simply converts the code as it has been typed.

x = y / 4;
A simple compiler might divide y by 4 and store it into x. A better compiler would multiply y by .25, or even right-shift y 2 places if the vars are integers.


Wrong. Think about a right shift if y = -1
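
To spell it out (assuming the usual two's-complement behavior, where a signed right shift is arithmetic):

#include <cstdio>

int main()
{
	int y = -1;
	std::printf("%d\n", y / 4);	// prints 0:  division truncates toward zero
	std::printf("%d\n", y >> 2);	// prints -1: the shift rounds toward -infinity
	return 0;
}
The two are only interchangeable for non-negative values (or unsigned types), so the "better compiler" in the quote would actually miscompile negative inputs.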

jeff-d said:

Wrong. Think about a right shift if y = -1


Here's an old post I made on the subject.

Enter the wonderful world of logical shifts and arithmetic shifts. An arithmetic shift will preserve the sign on a two's-complement architecture (aka 99.999999% of practical CPUs in use today). Every CPU/ALU worth its salt can distinguish between the two (and some even allow you to ROTate bits ;-).

The tricky aspect is that most programming languages don't allow explicitly distinguishing between signed/unsigned shifts or even directly using bit rotations, and what the compiler will actually do when shifting certain kinds of variables (e.g. should shifts on unsigned integers be dealt with differently than those on signed ones?) is implementation-dependent.

FWIW, I always found it crazy that Java has an unsigned right shift operator (>>>) whereas C and C++ do not.
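
A quick illustration of how the variable's type drives the choice - a sketch; strictly speaking the signed case is implementation-defined in C, but every mainstream compiler does the arithmetic shift:

#include <cstdio>
#include <cstdint>

int main()
{
	int32_t  s = -8;		// bit pattern 0xFFFFFFF8
	uint32_t u = 0xFFFFFFF8u;	// same bits, unsigned type

	std::printf("%d\n", s >> 1);		// -4: arithmetic shift, sign bit copied in
	std::printf("0x%08X\n", u >> 1);	// 0x7FFFFFFC: logical shift, zero shifted in

	// Casting at the point of the shift flips the behavior:
	std::printf("0x%08X\n", (uint32_t)s >> 1);	// 0x7FFFFFFC again
	return 0;
}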

Maes said:

FWIW, I always found it crazy that Java has an unsigned right shift operator (>>>) whereas C and C++ do not.



C has unsigned variables, so it never needed an unsigned shift operator. I find it far more insane that shifts of negative numbers are officially undefined. Find me any code that doesn't take them for granted.

The fun thing is: how can Java implement those distinct operators if the underlying language doesn't even define them? Ultimately, I believe this particular 'undefined' tidbit can be safely ignored; any system that handled it differently from the established convention would be in for a very rough ride when trying to port software to it.

Graf Zahl said:

C has unsigned variables, so it never needed an unsigned shift operator. I find it far more insane that shifts of negative numbers are officially undefined. Find me any code that doesn't take them for granted.


Well, consider this situation:

unsigned int a=0xFFFFFFFF; // Maximum unsigned int value (2^32-1)
int b=0xFFFFFFFF; // This is just "-1" in two's complement arithmetic. Some compilers may throw a fit about shoehorning this particular value in this way, but that's beside the point.

printf("0x%8x 0x%8x\n",a>>1,b>>1);
What should be the "correct" values to print in such a situation?
  1. 0xFFFFFFFF 0xFFFFFFFF
  2. 0x7FFFFFFF 0xFFFFFFFF
  3. 0x7FFFFFFF 0x7FFFFFFF
OK, it's easy (if somewhat involved) to read the documentation of each compiler on each platform and/or build the example and see what actually happens. But to you, as a programmer, what should the ideal outcome be? Which one would be closer to what you meant?

Both GCC and VS produce output "B", which is "ideal" for me, at least. It shows that at least some attempt is made to distinguish between a logical and an arithmetic shift, based on the type of the variable being shifted. Interestingly, the outcome of a shift can be modified even at the last minute, with an explicit cast of the shifted variable. So in that sense, yeah, an explicit distinction between operators is not required.


The sane outcome should be that signed values receive a signed shift and unsigned values receive an unsigned shift and that's what all good compilers do, and have been doing since I started using C.

Strictly speaking about the standard, the issue is moot because apparently the caveman-attitude of leaving it undefined always wins, because - oh God - it may break that compiler on some exotic 40-year-old system that may still be in use somewhere.


By the way - to go back to an earlier point (about optimizing) - I just remembered I found this comment while working with ZDoom's original drawers:

** Will I be able to even understand any of this if I come back to it later?
** Let's hope so. :-)
This, by itself, is exactly why over-optimizing the code is a bad idea. Again, I don't know who wrote that - whether it was Randy, or Killough, but either way, if you have to even consider such statements, chances are you're better off writing a "clean" version and just sticking with that, instead. Ironically, the statement is in direct reference to exactly what sounded ZDoom's death knell.

Rachael said:

By the way - to go back to an earlier point (about optimizing) - I just remembered I found this comment while working with ZDoom's original drawers:

** Will I be able to even understand any of this if I come back to it later?
** Let's hope so. :-)
This, by itself, is exactly why over-optimizing the code is a bad idea. Again, I don't know who wrote that - whether it was Randy, or Killough, but either way, if you have to even consider such statements, chances are you're better off writing a "clean" version and just sticking with that, instead. Ironically, the statement is in direct reference to exactly what sounded ZDoom's death knell.



FYI, that line first appears in ZDoom 1.17 and dates from early 1999. It's also in a file that contains no Boom or MBF code, so it's entirely Randi's doing.

And as is par for the course for such code, it made sense on hardware from 18 years ago, but somewhere in between, the entire optimization this was about was rendered utterly obsolete. I have to wonder when the scales tipped toward the simpler and more readable version. But this is a phenomenon I have seen in a lot of code: the better the hardware and the compilers get, the less this kind of old-school optimization matters.

Rachael said:

This, by itself, is exactly why over-optimizing the code is a bad idea. Again, I don't know who wrote that - whether it was Randy, or Killough, but either way, if you have to even consider such statements, chances are you're better off writing a "clean" version and just sticking with that, instead. Ironically, the statement is in direct reference to exactly what sounded ZDoom's death knell.

That's Randy's sense of humor. Killough would have just commented it as obfuscated.

Graf Zahl said:

And as is par for the course for such code, it made sense on hardware from 18 years ago, but somewhere in between, the entire optimization this was about was rendered utterly obsolete. I have to wonder when the scales tipped toward the simpler and more readable version. But this is a phenomenon I have seen in a lot of code: the better the hardware and the compilers get, the less this kind of old-school optimization matters.

I was pretty surprised that this gave no speed increase whatsoever and actually didn't believe it until Rachael proved it to me.

I'd still love to know the technical explanation for why it stopped making a difference on newer CPU architectures. Even today I find it fascinating that neither dword aligned drawing (pal mode) or 16-byte aligned drawing (truecolor mode) gains any speed from this. But the numbers don't lie - there was no speed gain.

Btw, about a compiler converting a 1/4 floating point divide into a 0.25 multiplication: no compiler will do this, as the outcome is not the same. When it comes to floating point, such stuff must always be done by the developer, but even then, unless you're in a critical loop, it will be a waste of time.

For integer I always type "x / 256" instead of "x >> 8" because I find the former easier to read. If a compiler can't optimize something as trivial as this, then it is time to find a new compiler. :)

dpJudas said:

I was pretty surprised that this gave no speed increase whatsoever and actually didn't believe it until Rachael proved it to me.

I'd still love to know the technical explanation for why it stopped making a difference on newer CPU architectures. Even today I find it fascinating that neither dword aligned drawing (pal mode) or 16-byte aligned drawing (truecolor mode) gains any speed from this. But the numbers don't lie - there was no speed gain.


I was just as surprised - but I was also surprised by how little difference the assembly drawers made, and that the ones where they had the biggest advantage were backed by just poorly written C code that got its data directly from global variables and forfeited register optimization.
My guess is that CPU caches have become so good that the additional copy loop is simply too expensive for the subsequent optimized copying to be worthwhile. But it's overall a good demonstration that many textbook optimizations may become obsolete as the hardware gets better.

Btw, about a compiler converting a 1/4 floating point divide into a 0.25 multiplication: no compiler will do this, as the outcome is not the same. When it comes to floating point, such stuff must always be done by the developer, but even then, unless you're in a critical loop, it will be a waste of time.


A good compiler lets you specify whether it is allowed to aggressively optimize float math, and ZDoom uses this for the renderer because that kind of precision doesn't really matter there.

For integer I always type "x / 256" instead of "x >> 8" because I find the former easier to read. If a compiler can't optimize something as trivial as this, then it is time to find a new compiler. :)


Be careful: -1 / 256 == 0; -1 >> 8 == -1! You may be surprised what the compiler actually does here.
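
What the compiler actually does for the signed case is worth seeing once: typically a shift plus a sign fixup, roughly equivalent to this sketch (the exact instructions vary, and x >> 31 assumes the usual arithmetic-shift behavior):

int div256(int x)
{
	// Add 255 to negative values first, so that the arithmetic shift
	// truncates toward zero the way C division requires:
	return (x + ((x >> 31) & 255)) >> 8;
}
For x = -1 this gives (-1 + 255) >> 8 == 0, matching -1 / 256, whereas a bare -1 >> 8 stays -1.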

Graf Zahl said:

My guess is that CPU caches have become so good that the additional copy loop is simply too expensive for the subsequent optimized copying to be worthwhile.

That would explain it for the rt4 family of drawers, but the wall 4col drawers copy directly to the destination. Maybe there still was a slight speed drop for the wall drawers, but so little that it didn't matter.

Graf Zahl said:

Be careful: -1 / 256 == 0; -1 >> 8 == -1! You may be surprised what the compiler actually does here.

Ugh, missed that detail. Well, good thing I actually intend it to be a divide; if I had shift-optimized it, I'd have been generating broken code. (Except for stuff in the drawers, where the types are unsigned.)


On the topic of readability, my actual opinion is the same as that of the people who have been saying that nowadays we can afford to trust the compiler. Sacrificing future contributors' ability to figure out what the heck is going on just isn't worth it, unless you can show me the performance increase and justify that it is large enough to be worth the readability trade-off (I'll be more likely to sympathise with extreme optimisation if you're working on some embedded system with a crappy CPU).

Personally, I would have been turned off of working on Eternity if the code weren't so clean. I still had a hard time figuring out what much of the code did when I first came on board (mostly due to being a C++ novice), but I was supported through my first contribution and was better able to help after that. Even now there is some code in there that I just can't read.

Without clarity of code you risk alienating people who are interested in helping. Documentation is tedious and commenting can be tiresome, but when there's inevitably a bug that's spotted somewhere down the line, you - or whoever is working on your code - will be thankful that you know what's going on and can fix it without a great deal of fuss.


@Rachael: Those ZDoom drawers were quite ingenious. The goal was to draw 4 columns at once, reducing cache invalidation by the same factor, which is why they were fast. All but one column were halted until, one by one, they all lined up. Then the renderer could push all 4 downwards in sync. The other advantage was that the different types of drawers could be written as a function that handled a special type of render of a single pixel, without all of the walk-down-a-column stuff around it, which presented the code in a uniquely clean way. Unconventional, yes. Convoluted? Not at all, depending on your skill set.

dpJudas said:

I'd like to remind you that QZDoom is targeting the following CPU architectures simultaneously: Netburst, Pentium M, Prescott, Intel Core, Nehalem, Bonnell, Sandy Bridge, Silvermont, Haswell, Skylake, Goldmont and the upcoming Kabylake. On top of that, the Raspberry Pi uses an ARM architecture. And I didn't list all the AMD architectures released in the same time period.

My point? It is not realistic for anyone to write assembly optimized for all of the above at the same time unless you have an army of developers. Clearly nobody in the Doom community has that. Virtually all the things you listed, except making sure that you're 16-byte aligned, are so architecture-specific that optimizing for them will almost certainly be a loss on some of the other architectures targeted.

If you think you can do it, more power to you. Unfortunately I'm quite convinced that 99.99% of all developers could not do this.

Army? Come on, man. Yeah, I can do it, and so can you. Intel publishes excellent manuals on their instruction sets, as do others. You have a compiler that spits out its best attempt - you can start with that. Build a nice profiling rig that paints a bunch of columns, profile, adjust, try again. It's not like you have to have a dedicated function for each and every processor ever made, in version 1.0. You add specialized support gradually, as you build it. I don't imagine you have a fleet of PCs, representing each of those architectures you listed. That's where the public can help you. Ask your users to run tests which generate files describing their processor, memory, etc, and the performance of each test you include. Use that info to tweak your engine. Or, better yet, build the profiler/function swapper directly into the engine, so it chooses the best functions for each PC it is run on.
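
Something like this, presumably - benchmark the candidates once at startup and route every later call through a pointer. A sketch only; the drawer names are made up:

#include <chrono>

typedef void (*DrawSpanFunc)();

void DrawSpan_C()    { /* portable drawing loop would go here */ }
void DrawSpan_SSE2() { /* hand-tuned drawing loop would go here */ }

DrawSpanFunc DrawSpan = DrawSpan_C;	// safe default

static double TimeIt(DrawSpanFunc fn, int iterations)
{
	auto start = std::chrono::steady_clock::now();
	for (int i = 0; i < iterations; i++)
		fn();
	return std::chrono::duration<double>(
		std::chrono::steady_clock::now() - start).count();
}

void PickFastestDrawer()	// run once at startup
{
	// Whichever candidate wins on *this* machine handles all later calls.
	if (TimeIt(DrawSpan_SSE2, 1000) < TimeIt(DrawSpan_C, 1000))
		DrawSpan = DrawSpan_SSE2;
}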

The instruction sets are massive, and can be daunting at first, especially with their god-awful mnemonics, but it ain't as bad as you describe. I can't see each function being 100 instructions long - it's not like you're rewriting Doom in assembly!

If it's worth doing, it's worth doing right: that's my motto.


Gee, I don't know, guys. A lot of you are saying that you can't read tight code, or you forget how it works later, or it bogs you down later...that just doesn't happen to me. I don't feel special. Maybe I can anticipate problems down the road a bit more, maybe I just have a skill in reading code, I don't know. You're saying all of these problems are going to happen, but they don't for me. Make of that what you will.

It's difficult to make suggestions to people who experience all of these roadblocks and issues when faced with the task of moving forward, so I guess I cannot help but share my experiences. I don't experience these problems, and I don't have future programmers claiming such issues with my old code. One guy told me that he learned how to program by working on my code - I'm not sure if that was a good or bad thing.

Yes, you can build yourself into a corner, and yes, it is easy to write a fragile framework, but there's no need to. One thing I have learned is to flat out refuse to build a half-assed implementation when a bit more work can produce a nice, self-contained, self-maintaining subsystem. You should always take the time to allow for expansion, never hard-code "magic" values, make backups, and name entities properly. Code should basically document itself. When it doesn't, a comment or two can explain the purpose.

But, no, I don't forget what my old code is doing, and yes, almost all of my old code still runs... even better today. The idea that optimizing is bad is at least as bad as the idea that optimizing is always good. Both are absolutes, and there are no absolute rules. Every technique has a place, and the rules you were taught should be thought of as guidelines, not concrete rules set in stone. How else can you innovate?

I have to speak up when someone is laying out rules as if they should not be questioned, evaluated, and chosen (or not), in every situation. You must use your brain in everything you do, not follow someone's rules from the past.

dpJudas said:

I was pretty surprised that this gave no speed increase whatsoever and actually didn't believe it until Rachael proved it to me.

I'd still love to know the technical explanation for why it stopped making a difference on newer CPU architectures. Even today I find it fascinating that neither dword aligned drawing (pal mode) or 16-byte aligned drawing (truecolor mode) gains any speed from this. But the numbers don't lie - there was no speed gain.

Three big reasons:
1. It depends on the video mode. More specifically, on the relationship between screen width and cache layout, or rather where subsequent lines fall within individual cache lines. Aliasing can occur, since specific cache lines are not tied to specific memory addresses; rather, the memory address is masked, and the masked value is what indexes the cache. The higher the resolution, the better the chance you are taking advantage of more cache lines. And if you hit the same address mod 64, there's a better chance of invalidating cache, causing a cache flush. 1024 width is the worst.

2. Working with bytes incurs a penalty on newer processors, because they want to work in 32/64-bit chunks. Reading a byte puts that 32/64-bit word in a half-used/half-unused state, as far as the memory subsystem sees it. In other words, there's a possible dependency on the other bytes in that word, so you incur a penalty. (Please read up on it - I can't explain it that well right now).

3. It's also possible that the speed benefit had been broken for a while. Or that the new code is somehow also optimal.

kb1 said:

The other advantage was that the different types of drawers could be written as a function that handled a special type of render of a single pixel, without all of the walk-down-a-column stuff around it, which presented the code in a uniquely clean way.

At the price of copying the data twice: first to the working buffer, then to the frame buffer. Which is also why the plain drawers eventually became the faster ones as CPU hardware improved.

One guy told me that he learned how to program by working on my code - I'm not sure if that was a good or bad thing.

If you say so.

kb1 said:

1. It depends on the video mode. [..talk about cache lines..] 1024 width is the worst.

ZDoom already takes care of all of this when it allocates the frame buffer and calculates the ideal pitch. If anything, this should just have given the "optimized" rt drawers an even greater speed advantage. It did not on modern processors, which is the whole point.

kb1 said:

(Please read up on it - I can't explain it that well right now).

I think you should read up on it yourself. Modern Intel CPUs use 64-byte cache lines, with 16-byte boundaries being the thing you should avoid crossing.

Not that debating any of this changes the facts, which are that the old optimized code lost its speed advantage a long, long time ago and had become nothing more than a maintenance headache. Let's look at the facts for a moment:

1) The old rt drawers doubled the amount of drawer code. 1200 extra lines of code
2) The old 4col wall drawers doubled the drawer code once again. Approx. 800 lines of code
3) The setup code for the 4col wall drawers was an additional 500 lines of code
4) The rt sprite setup code was copied and pasted over 7 different places, approx. 150 more lines of code
5) About 700 lines of assembly code added
6) It was only faster for a few years

Was it worth it? That's certainly up for debate. It is not that I want to piss all over Randi's contributions, because some of the stuff she did was truly cool. But part of being good at optimizing is also to know when to stop and do a proper cost/benefit analysis.

Just because something can theoretically be done doesn't mean you should do it.

kb1 said:

Gee, I don't know, guys. A lot of you are saying that you can't read tight code, or you forget how it works later, or it bogs you down later...that just doesn't happen to me. I don't feel special. Maybe I can anticipate problems down the road a bit more, maybe I just have a skill in reading code, I don't know. You're saying all of these problems are going to happen, but they don't for me. Make of that what you will.

Where's your source port again?

dpJudas said:

If you say so. (and other nuggets)

Geez, try to say something nice and provide a couple of ideas, and I get thrown a bowl of sarcasm and attitude - wow.

By the way, if ZDoom's renderer code was so enormous, complex, and doing so much double work, how do you figure that it practically doubled the frame rate over a more generic setup?

Taking this question a bit further, what do you suppose really happened? Did this code suddenly become slower? Is your code faster?

Stated differently, say you've got these results:

Code          Proc      Units of time (less is better)
----          --------  ----
Original:     P-II      250
Original:     i7        80
Randi's:      P-II      125
Randi's:      i7        100 (less than expected)
Yours:        P-II      ?   <- What do you suppose this would be?
Yours:        i7        100 (you said it was similar to Randi's on a modern proc)
[the numbers are estimates, people - don't break out your calculators :)]

So, what do you think would go in the missing slot?

Do you believe that your code is faster, by matching Randi's speed on a modern processor, since you're writing 32-bit vs. 8-bit?

There's a lot of possibilities for those results, and only a couple of them have anything to do with "Randi's algo being convoluted":

. New compiler fails on Randi's algo / succeeds on your code.
. Randi's code doesn't fit into the code cache, which is a big problem these days, not so much earlier.
. Penalty for manipulating bytes.
. The whole program has been modified extensively, so maybe Randi's algo got broken along the way.
. Lots of others.

dpJudas said:

Was it worth it? That's certainly up for debate. It is not that I want to piss all over Randi's contributions, because some of the stuff she did was truly cool. But part of being good at optimizing is also to know when to stop and do a proper cost/benefit analysis.

Of course it was worth it. People benefitted for years. Randi did stop, and did analyze cost/benefit. And then moved forward with the right decision, and everyone got to play with high frame rate, and none of them gave two shits about the way it was written. And, why should they?

dpJudas said:

Just because something can theoretically be done doesn't mean you should do it.

Doesn't mean you shouldn't either. Why are you studying and experimenting with ever-faster renderers, then? Why are we even discussing this? You're optimizing, and it's a good thing. It was a good thing when Randi did it. Obviously, you were able to adapt it to modern processors, so it wasn't unreadable either. That which can be done is being done. What point are you trying to make? That you should be able to stop whenever it exceeds your ability to understand it? When it exceeds someone else's ability?

You can stop whenever you want, of course. And, you can go further, if you want to. And you can write nice code, or you can write ugly code. What if really ugly code gave you a 20 fps boost - would you use it? 50 fps? What if it was only a bit ugly? Is your code the most beautiful code ever written?

What is this magic formula? Number of chars per line * fps divided by decibels_of_cheering_fans - pounds_per_square_inch of developers pounding the table?

Programmer Paul and Gamer Gary:

Paul: "Hey, check out my new game with a new renderer! I cleaned it up!"
Gary: "Cool! Let's play some co-op!"

(game starts up, in MAP01)

Paul: "Isn't this fun?"
Gary: "Eat it, monkey! Bam! Bam!"

Paul: "I'm going in to the arena!"
Gary: "I'm there, dude!"

Paul: "Woah! It's a zoo!"
Gary: "Ugh - LAG!"

Paul: "Huh?"
Gary: "I'm getting, like, 10 fps. This sucks..."

Paul: "But...but, I rewrote the renderer and made it clean!"
Gary: "Ugh!"

Paul: "But, it's easy to read!"
Gary: "Uzzz!"

Paul: "It's easy to maintain!!"
Gary: "So, maintain it. Cause it sucks."

---------------------------------------

Dude, write it how you want. Make it clean if need be, but make sure it performs, cause that's what everyone cares about, not your 4-space alignment, and not your beautiful var names. No one cares. Yeah, another coder might appreciate it, but they won't need to change it if it works well, and it's fast. And, if they do need to change it, they are going to have to understand all of it anyway, to be able to do a good job. I'm a coder, and I couldn't care less how you write it, if you're not deliberately trying to make it difficult. And, if I want to change it, and it's too tough, I'll just rewrite it anyway, like you did. Isn't it better? Aren't you more satisfied?

Besides, I'm not really learning anything from you telling me not to optimize... while you're optimizing. What's up with that?

Above all else, just do what you do, and enjoy it yourself - that's all that matters anyway.

Linguica said:

Where's your source port again?

It's at home. Come on by, we'll eat barbecue chicken and play co-op :) My port will then heal you and get you laid, or so they say.

kb1 said:

Geez, try to say something nice and provide a couple of ideas, and I get thrown a bowl of sarcasm and attitude - wow.

Insulting people by hinting that they are less skilled coders if they can't read any given code is a compliment? Wow. And after that you went on a sad parade of trying to hint that other people think your code is so great you can learn how to code from it. So yeah, you got a flame back - shouldn't really surprise you.

By the way, if ZDoom's renderer code was so enormous, complex, and doing so much double work, how do you figure that it practically doubled the frame rate over a more generic setup?

You'll have to show actual profiled evidence on a Pentium 2 that this was the case. I'm sorry, but I will not take your word for it.

Even if it DID literally double the frame rate, it could still be argued not to have been worth it in the long term, the way it was implemented. What your table is missing is the loss of 15 years of useful modding features once the code complexity got so bad that not even Randi could add new features to it. If we go by your hypothetical 100% speed improvement on a Pentium 2, how many features was that worth? And don't give me this bullshit about how you could have added those features despite this, because clearly you didn't, and nobody else did either.

Do you believe that your code is faster, by matching Randi's speed on a modern processor, since you're writing 32-bit vs. 8-bit?

I'm not sure I understand what you're asking here exactly, but the 4 column true color drawers I added wrote 128-bit whenever the palette version wrote 32-bit. For the sake of this entire discussion, both the palette and true color drawers showed no speed improvements on a modern CPU.

Of course it was worth it.

Your opinion. Obviously we disagree.

Why are we even discussing this?

My thoughts exactly. Think I'll go back and code on my dpdoom port. It can actually run the entire Doom engine on the GPU. Pretty cool, isn't it?

