Maes

Source ports and fixed_t


After a long time thinking about it, I started some work on my pure Java source port, using a bottom-up approach.

Some design choices:

  • Similar to Jake2, organize everything into packages (e.g. m_-prefixed source files go to package "m" and lose the prefix).
  • Merging .h and .c file contents into .java
  • Static imports instead of C's #includes for globals
  • C-style enums may become java enums or bunches of final consts, depending on how they are used/convenience.
  • Several reusable structs become classes/objects.
  • Per-case adaptation of struct copies/pointer copies with proper Java conventions.
  • etc.
So far, I'm grinding away at the basic defines, includes and functions, which of course soon lead to the dreaded fixed_t...

Now, I'm using the original GPL'd Doom source code as a reference. I thought that modern source ports did away with that and jumped straight to saner floats and doubles but no, ZDoom has it, prBoom has it, prBoom+ has it etc.

So I'm asking any source port developers: Is using that same exact fixed point type necessary? Because it may be trivial reusing it with C/C++ on different compilers, but it has several drawbacks when emulated in Java:
  • Java doesn't have the unsigned integer type that fixed_t piggybacks on in C (it's just a typedef), so right shifts will drag the sign bit along.
  • Integer.MAX_VALUE and Integer.MIN_VALUE of course have different meanings and values than MAXINT and MININT that appear in the source code.
  • fixed_t is supposed to be unsigned and relies on standard C-style unsigned int additions/subtractions, which are performed directly while overflows are blissfully ignored. This can't be done in Java for several reasons. C's "abs" function is used on it at some point, too...
  • Multiplication/division is performed within special functions... these would be easy to implement in Java if it weren't for the bit shifting on signed numbers fucking things up a bit.
  • I can make a fixed_t object/wrapper that uses longs internally, with special masks to emulate an unsigned 32-bit integer, which would then emulate a 16.16 fixed point... but I have no idea what the overhead will be.
just to list a few of the potential problems. As a side note, Jake2 followed Quake2 which had a "normal" floating point arithmetic inside, so none of all this fucked up shit.
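For what it's worth, the multiplication case turns out to be less painful than it looks, since Java's long gives you the 64-bit intermediate for free. A minimal sketch (class and method names are mine, not Doom's, and this assumes fixed_t maps onto Java's signed int):

```java
// Hypothetical sketch: Doom's 16.16 FixedMul on Java's signed int.
// Widening both operands to long provides the 64-bit intermediate
// that the original C/asm gets from a 32x32->64 multiply, so no
// unsigned emulation or masking is needed for this operation.
public final class Fixed {
    public static final int FRACBITS = 16;
    public static final int FRACUNIT = 1 << FRACBITS;

    public static int fixedMul(int a, int b) {
        return (int) (((long) a * (long) b) >> FRACBITS);
    }

    private Fixed() {}
}
```

Usage: `Fixed.fixedMul(2 << 16, 3 << 16)` yields `6 << 16`, i.e. 2.0 × 3.0 = 6.0 in 16.16 units.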

Is it really necessary to use fixed_t? Will the mapping/coordinate system break entirely if e.g. 32-bit floats are used instead?


ZDoom had a floating point rewrite once. It was, however, a waste of time, as Randy reverted to fixed_t. Using floats would, however, allow removing more limits, notably map size, because with 16.16 fixed point you're limited to 16 bits for the integral part. The only way to expand these limits within the current architecture without breaking everything would be, I think, to use a 48.16 fixed-point value, and that's 64-bit. (Of course it's not too much of a concern, because a map would need UDMF to reach such dimensions. See this thread.)

There is at least one point in the source code where overflows are not "ignored", they are required. I'm speaking about the angle_t.
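Incidentally, that required overflow is one thing Java gives you for free: int addition wraps modulo 2^32 in two's complement, exactly like C's unsigned arithmetic. Only unsigned comparison needs a workaround. A sketch, with names of my own choosing (the ANG constants match Doom's values):

```java
// Hypothetical sketch: BAM angles (angle_t) on Java's signed int.
// Addition/subtraction wrap identically to C unsigned int, so the
// overflow that angle_t *requires* just works. Unsigned comparison
// is emulated by flipping the sign bit of both operands.
public final class Bam {
    public static final int ANG90  = 0x40000000;
    public static final int ANG180 = 0x80000000;

    // Wraps around 2^32 exactly like C's unsigned addition.
    public static int add(int a, int b) { return a + b; }

    // True iff a < b when both are interpreted as unsigned 32-bit.
    public static boolean ltUnsigned(int a, int b) {
        return (a ^ 0x80000000) < (b ^ 0x80000000);
    }

    private Bam() {}
}
```

For example, `Bam.add(0xC0000000, 0x40000000)` wraps to 0, just as ANG270 + ANG90 would in C.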


Doomsday is currently in the process of switching to floating point. It's a lot of work updating and then re-tailoring to ensure original behavior, but definitely worth it.

As for angle_t, you can also replace it with another representation.


*sigh* it's going to be a long task... anyway, my priority is to keep the original ABI as much as possible, to result in something that compiles and -hopefully- works, even if it just outputs still images.

Once that is done and a C-free codebase is a reality, I will release the SC so other, non-C ports can be based on it. It's really going to be interesting, because in the SC there are parts that literally scream "TURN ME INTO AN OBJECT AND TURN THOSE FUGLY PARAMETRIZED INSTANCE METHODS INTO GLOBAL METHODS!!!" and others which are just... well... complete C-isms.

Edit: WTF?

I was looking into dstrings.h and found this:

// FinalDOOM?
"fuck you, pussy!\nget the fuck out!",
"you quit and i'll jizz\nin your cystholes!",
"if you leave, i'll make\nthe lord drink my jizz.",
"hey, ron! can we say\n'fuck' in the game?",
"i'd leave: this is just\nmore monsters and levels.\nwhat a load.",
"suck it down, asshole!\nyou're a fucking wimp!",
"don't quit now! we're \nstill spending your money!",

Probably it's already known, but heh... is that something that made its way into the .exe/source ports, or did they add it to the GPL'd code for shits 'n giggles?


fixed_t is signed, not unsigned. angle_t is the one you'll need to watch out for. There are also a lot of operations that rely on overflow and on mixing signed/unsigned types. Java may handle these differently than C.

I don't see the point in switching to floats, they have their own problems. You'd also have to give up the speed of the lookup tables.

Scet said:

I don't see the point in switching to floats, they have their own problems. You'd also have to give up the speed of the lookup tables.


I didn't check whether there are LUTs that use fixed_t as their index; that would cause problems. However, if it's just their contents, then there would be no difference whether they contain floats, doubles or objects.

For example, the random table can be ported to Java almost effortlessly, although with widened operands (short instead of char, because char is something entirely different in Java, and byte is signed). In fact it's perhaps the easiest thing to port.
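To illustrate the widening trick: storing the table entries as int (or masking a signed byte with & 0xff) sidesteps the lack of unsigned bytes entirely. A sketch, with only the first few entries of Doom's rndtable shown for brevity (a full port would carry all 256):

```java
// Hypothetical sketch of Doom's M_Random in Java. The C version
// stores unsigned chars 0..255; here the values are kept as plain
// ints, so no sign-extension issues ever arise.
public final class DoomRandom {
    // First entries of Doom's rndtable, for illustration only;
    // the real table has 256 entries.
    private static final int[] RNDTABLE = {
        0, 8, 109, 220, 222, 241, 149, 107, 75, 248, 254, 140, 16, 66
    };
    private int rndindex = 0;

    public int mRandom() {
        rndindex = (rndindex + 1) & 0xff;
        // The modulo is only here because the table is truncated;
        // with the full 256-entry table it would be rndtable[rndindex].
        return RNDTABLE[rndindex % RNDTABLE.length];
    }
}
```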

Some of the most fucked up aspects to handle in Java are the "thinkers": Doom defines a think_t type... only to typedef it into action_t, which is further split into three possibilities. This would roughly translate into an interface and different implementing classes in Java.
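The interface-plus-implementations translation might look something like this minimal sketch (names are mine, not Doom's; a real port would carry the mobj/sector state the C code passes through pointers):

```java
// Hypothetical sketch: the think_t/action_t function-pointer tangle
// as a Java interface. Each concrete thinker closes over its own
// state instead of receiving it through a void* as in C.
interface Thinker {
    void think();
}

class Lift implements Thinker {
    int floorHeight;

    Lift(int initialHeight) { floorHeight = initialHeight; }

    // Stand-in behavior: raise the platform one unit per tic.
    @Override public void think() { floorHeight += 1; }
}
```

The game loop then just iterates a list of Thinker and calls think() on each, which replaces the C-side dispatch through function pointers.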

Java enums are also fugly to use: it's possible to use their elements as ordinals if you append .ordinal() to them (which happens a lot in Doom), but even with a static import you can't pull them into your namespace: you still have to prefix them with the enum type, so you end up with a fucking cluttered syntax. I think Jake2 just autogenerated a big-ass file with:

public static final int CONST_1 = 1;
public static final int CONST_2 = 2;
public static final int CONST_3 = 4;
etc.

Making everything an "int" without a typedef facility makes reading difficult, ofc.

so I might have to do the same.
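As a concrete example of that flat-constants style, here is how a few of Doom's mobj flags could come out (the values 1/2/4 match Doom's mobjflag_t; the class name and comments are mine):

```java
// Hypothetical sketch: Jake2-style flat int constants instead of a
// Java enum, so bitmask code reads like the original C with no
// .ordinal() clutter.
public final class MobjFlags {
    public static final int MF_SPECIAL   = 1;  // call P_SpecialThing on touch
    public static final int MF_SOLID     = 2;  // blocks movement
    public static final int MF_SHOOTABLE = 4;  // can be hit

    private MobjFlags() {}
}
```

With a static import, the flags combine exactly as in C: `int flags = MF_SOLID | MF_SHOOTABLE;`.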

Maes said:

I didn't see if there are LUTs that use fixed_t as their index -that would cause problems.


Well, they have to be right-shifted first; the tables only have a few thousand elements, not enough to cover the whole range. For floats you'd have to convert to an integer and then shift that.

Maes said:

Some of the most fucked up aspects to handle in Java are the "thinkers": Doom defines a think_t type... only to typedef it into action_t, which is further split into three possibilities. This would roughly translate into an interface and different implementing classes in Java.


I seriously suggest you look at the code I gave you on how to solve this, mainly Actor.hpp and Thinker.hpp. The three function pointer types can actually be made into one. Some functions called will receive an argument they won't use, but I thought it was the easiest approach.

Scet said:

I seriously suggest you look at the code I gave you on how to solve this, mainly Actor.hpp and Thinker.hpp. The three function pointer types can actually be made into one. Some functions called will receive an argument they won't use, but I thought it was the easiest approach.


I will do that too, but don't forget that I'm talking about Java here, so some C-isms that can be directly slipstreamed from C into C++ won't work well at all. Function pointers will have to become function objects (actually, that's one point where a Java implementation would gain in readability) and skip some intermediate passages.

Perversely, Java now supports some of the "advanced" C stuff like variadic functions etc., and there's even a semi-deprecated "Unsafe" library.

Anyway, my first priority is getting the thing to work ASAP, even if that means preserving some of the ugly (if it's easier) or rewriting parts (again, if it's easier). So far I've had a mixed impression: some things that are at a higher level (monsters, objects, etc.) translate well to pure OO, while some things like bit arithmetic, globals etc. not so much.

After that, someone may clean it up as much as he wants.


Please note that fixed_t is *not* unsigned. It represents values in the range -32768 to 32767 with 16 bits of fractional precision.

angle_t is the type which is unsigned, and therefore has no direct analogue in Java.

AFAIK there are no lookup tables indexed by fixed_t's. angle_t, which is a Binary Angle Measurement system (uses an implicit multiplier of 0xb60b60 == 2^32 / 360), is shifted right 19 bits using the define ANGLETOFINESHIFT in order to index the finesine, finecosine, and finetangent tables, which return fixed_t values.
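In Java terms, that ANGLETOFINESHIFT indexing works as long as the unsigned shift is used. A sketch of the mechanism (a real port would carry Doom's precomputed finesine table; here a stand-in entry is computed on the fly purely to demonstrate the indexing, and the names are mine):

```java
// Hypothetical sketch of the BAM-to-table lookup described above:
// angle_t >>> ANGLETOFINESHIFT indexes a sine table whose entries
// are 16.16 fixed-point values.
public final class FineSine {
    public static final int ANGLETOFINESHIFT = 19;
    public static final int FINEANGLES = 8192;   // 2^32 >> 19

    // Stand-in for finesine[i]: sin(2*pi*i / FINEANGLES) in 16.16.
    public static int finesine(int angle) {
        int i = angle >>> ANGLETOFINESHIFT;       // unsigned shift!
        double s = Math.sin(2.0 * Math.PI * i / FINEANGLES);
        return (int) Math.round(s * 65536.0);
    }

    private FineSine() {}
}
```

The `>>>` matters: with `>>` a "negative" angle like ANG270 would sign-extend and produce an out-of-range index.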


Just watch out for how Java handles signed shifts; they may not be used in the tables, but I remember there being some. C likely just shifts and ignores the sign bit, but Java may sign-extend the result.

Edit:

http://java.sun.com/docs/books/tutorial/java/nutsandbolts/op3.html
the leftmost position after ">>" depends on sign extension


Edit 2: Nevermind, seems most C implementations do the same thing.


Yeah java extends the sign on right shift, so I need to use masks wherever that occurs.

Interestingly, Java has an unsigned 16-bit type: char, which can be used as any other numeric type, the only exception being when it's printed.

In any case, I'll make a test suite for fixed_t to see how accurately it's possible to emulate them, and also for performance comparison.
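A first accuracy check along those lines might look like the following sketch (names are mine, and it assumes the widen-to-long FixedMul approach): compare the fixed-point result against plain double arithmetic over some sample values and report the worst error in 1/65536 units.

```java
// Hypothetical accuracy probe for a Java fixed_t emulation.
public final class FixedCheck {
    static int fixedMul(int a, int b) {
        return (int) (((long) a * b) >> 16);
    }

    // Worst absolute deviation from double-precision math, measured
    // in fixed-point units (1/65536). Should never exceed 1, since
    // the only difference is truncation vs rounding.
    public static long worstError() {
        long worst = 0;
        int[] samples = { 1 << 16, 3 << 16, -(5 << 16), 123456, -987654 };
        for (int a : samples) {
            for (int b : samples) {
                long exact = Math.round((double) ((long) a * b) / 65536.0);
                worst = Math.max(worst, Math.abs(exact - fixedMul(a, b)));
            }
        }
        return worst;
    }

    private FixedCheck() {}
}
```

A performance comparison would need a proper harness (JIT warm-up, etc.), so only accuracy is probed here.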

Perversely, the same workaround that gave Doom more speed (?) on FPU-less machines back in 1993 is probably the least optimized hack ever on modern source ports, as an IEEE floating-point multiplication or division (as well as sine/cosine/exp etc.) is now done in 1 CPU/FPU cycle, while Doom substitutes a function call that in the best of cases is 5-6 CPU ops plus overhead... for less overall accuracy.

Maes said:

Yeah java extends the sign on right shift, so I need to use masks wherever that occurs.



Yes, but to my knowledge Java also has an unsigned shift operator ('>>>' instead of '>>').
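The difference between the two operators on a negative int can be shown in a couple of lines (a trivial demo of my own; not from any port):

```java
// >> is arithmetic (sign-extending), >>> is logical (zero-filling).
public final class ShiftDemo {
    public static int arith(int x)   { return x >> 2;  } // keeps sign
    public static int logical(int x) { return x >>> 2; } // zero-fills

    private ShiftDemo() {}
}
```

For x = -16 (0xFFFFFFF0), `arith` gives -4 (0xFFFFFFFC) while `logical` gives 0x3FFFFFFC.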

Maes said:

Perversely, the same workaround that gave doom more speed (?) on FPU-less machines back in 1993, is probably the least optimized hack ever on modern source ports, as a IEEE floating-point multiplication or division (as well as sine/cosine/exp etc.) are now done in 1 CPU/FPU cycle, while Doom substitutes a function call that in the best of cases is 5-6 CPU ops plus overhead...for less overall accuracy.


Not quite. The FixedMul and FixedDiv functions are inlined, and overall the speed difference is negligible. When profiling a larger block of code it's very hard to detect any difference, because most of the time is apparently spent on cache stalls and other things that are no different when using floats.

And in some cases the floating-point code is considerably slower, a good example being R_PointToAngle. Doom's version of this function is twice as fast as the float version; the downside is a significant lack of precision, which for gameplay purposes doesn't matter, though.


Yay, a Maes source port. Another good news besides Kristus working on his Hexen project.


Doesn't FixedDiv almost always call FixedDiv2, which uses floating-point division?

I always thought that was somewhat strange, considering how expensive the conversions and double-precision floating-point division must have been on those old computers. It's also called quite a bit, from what I remember.

I've always wanted to see if a true fixed-point division could also work.

Edit: I just tried this alternate FixedDiv function:

fixed_t FixedDiv( fixed_t a, fixed_t b )
{
	if( ( abs( a ) >> 14 ) >= abs( b ) )
	{
		return ( ( ( a ^ b ) < 0 ) ? INT32_MIN : INT32_MAX );
	}
	return (fixed_t)( ( (int64_t)a << 16 ) / b );
}
Everything worked fine, I didn't see any problems. I don't know if it's faster though. On a modern computer floating-point division is much faster than integer division.
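For the Java port's purposes, that same saturating division translates almost verbatim, since the int64_t intermediate maps directly onto long (class name is mine):

```java
// Hypothetical Java rendering of the alternate FixedDiv above: the
// 64-bit intermediate becomes a long, and the saturation to
// MIN_VALUE/MAX_VALUE on overflow carries over unchanged.
public final class FixedDivPort {
    public static int fixedDiv(int a, int b) {
        if ((Math.abs(a) >> 14) >= Math.abs(b)) {
            return ((a ^ b) < 0) ? Integer.MIN_VALUE : Integer.MAX_VALUE;
        }
        return (int) (((long) a << 16) / b);
    }

    private FixedDivPort() {}
}
```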

Scet said:

Doesn't FixedDiv almost always call FixedDiv2, which uses floating-point division?



No, it was (and still is in most ports) inline assembly.

Graf Zahl said:

Yes, but to my knowledge Java also has an unsigned shift operator ('>>>' instead of '>>')


Yes, but it only works on ints and longs. Smaller types are promoted to int and automatically sign-extended before the shift :-(

Graf Zahl said:

Not quite. The FixedMul and FixedDiv functions are inlined and overall the speed difference is negligible.


Even Java inlines frequently used functions, but that doesn't change the fact that by now a floating-point multiplication is a single FPU opcode with two operands, at least on Intel.

Meanwhile, e.g. Lee Killough's disassembly of FixedMul (from the readme.asm file) is 5-6 opcodes. More opcodes = slower = worse, inline or not. 5-6 times slower, to be precise, especially considering that the FPU can be pipelined separately and runs in parallel with ALU operations.

Graf Zahl said:

And in some cases the floating point code is considerably slower, a good example being R_PointToAngle.


Haven't come to that yet, but if floats are slower for single atomic operations like multiplication or division on a modern CPU (practically, on anything made after the Pentium I) then you're doing something wrong.

If that function is designed in order to sacrifice accuracy in favor of speed (not all that impossible in the case of Doom), then re-implementing it with floats and adding hacks to it so it can reproduce the same gross results as the original implementation will potentially be slower (gotta see it though).

Maes said:

Yes but only works on ints and longs. Smaller types are promoted to int and automatically sign extended before the shift :-(


Where would you need it for shorts?

Maes said:

Even Java inlines frequently used functions, but that doesn't change the fact that by now a floating-point multiplication is a single FPU opcode with two operands, at least on Intel.

Meanwhile, e.g. Lee Killough's disassembly of FixedMul (from the readme.asm file) is 5-6 opcodes. More opcodes = slower = worse, inline or not. 5-6 times slower, to be precise, especially considering that the FPU can be pipelined separately and runs in parallel with ALU operations.


Still, gaining 5 cycles out of 100 is only 5%. And if you consider that the code spends > 80% of its time in the renderer, which in ZDoom is highly optimized assembly, you end up speeding up a fraction of the time used by a fraction. It doesn't amount to much in the end. And don't forget that floating-point comparisons need to exchange data between FPU and CPU, which often negates any advantage you might gain by speeding up the multiplications. In the end it roughly evens out.


Maes said:

Haven't come to that yet, but if floats are slower for single atomic operations like multiplication or division on a modern CPU (practically, on anything made after the Pentium I) then you're doing something wrong.



R_PointToAngle's float version calls atan2, and that's not particularly fast either as x87 or SSE2.


Maes said:

If that function is designed in order to sacrifice accuracy in favor of speed (not all that impossible in the case of Doom), then re-implementing it with floats and adding hacks to it so it can reproduce the same gross results as the original implementation will potentially be slower (gotta see it though).


You only need the inaccuracy for demo compatibility. The game works just as well if it's precise, so there's no point emulating it. In fact, if you allow larger maps you need a fallback to the precise method, because the original implementation is prone to overflows if the values being passed get too large.

Graf Zahl said:

Where would you need it for shorts?


Not for shorts, as much as for bytes/chars.

Graf Zahl said:

Still, gaining 5 cycles out of 100 is only 5% And if you consider that the code spends > 80% in the renderer which in ZDoom is highly optimized assembly you end up speeding up a fraction of the time used by a fraction.


[citation needed]

OK, so you're "The" authority regarding GZDoom, but where/how did you come up with those figures? And what makes you think they would be valid across source ports? What if the renderer takes less time in other ports? (In vanilla it was important but not dominant; level complexity and sprite count mattered much more.)

Graf Zahl said:

R_PointToAngle's float version calls atan2 and that's not particularly fast, either as x87 or SEE2.


I think that's still 1 CPU cycle ever since superscalar x86 became a reality (Pentium I), but I could be wrong. And can the register->register copies involved in x86<->x87 really be compared to the memory<->register transfers present in pure x86, unless you're referring to the 8087?

Maes said:

OK, so you're "The" authority regarding GZDoom, but where/how did you come up with those figures?

Entryway did, actually.

Gez said:

Entryway did, actually.


Wouldn't trust those figures much after seeing how entryway confuses 1e8 (100 million) with 1e11 (100 billion).

In any case, don't forget we're talking about a language (Java) where you can't make 100% certain assumptions about the final form of the code that will be executed. Will it be 100% compiled? Will it be 100% interpreted? Will it map 100% to the best possible opcodes on a given architecture?

Then again, Java may actually smoke both C and C++ in actual benchmarks (omitting links to several tests that are certain to spark vitriolic responses).

Maes said:

I think that's still 1 CPU cycle ever since superscalar x86 became a reality (Pentium I) but I could be wrong.

lol. 1 cpu cycle for atan2? it's not possible even in theory. fpu trig functions like sinf(), cosf(), etc take about 150-200 cycles each. can be faster with SSE (about 30-40 cycles i think), but these instructions cannot be effective in all cases.

float division is also still slower than integer. at least in real life.

http://pastebin.com/d2j4uTgF

3400 msec for 'int' and 5200 for 'float'.

entryway said:

float division is also still slower than integer. at least in real life.
http://pastebin.com/d2j4uTgF
3400 msec for 'int' and 5200 for 'float'.


Calling shenanigans on this, something is messing up your results. You're not running this from the editor with debugging are you?

Scet said:

You're not running this from the editor with debugging are you?

lol. yes, i am running this from the debugger, with three copies of Crysis in the background and a few video encodes


According to an Intel instruction timing sheet, fcos *may* only use as few as 14 cycles. It depends on the value of the argument. Angles < pi/2 are frequently handled via a hardware lookup table.

entryway said:

lol. 1 cpu cycle for atan2? it's not possible even in theory. fpu trig functions like sinf(), cosf(), etc take about 150-200 cycles each


What if you take pipelining/superscalarity into account? Aren't modern CPUs supposed to pump out 2-3 instructions per cycle (on average) when everything is properly "primed", no matter what you throw at them (excluding pathological cases)? What are the chances that branch prediction and cache will fail so badly that you'll need the full 150 cycles for an atan2 instruction?

In any case, FMUL takes between 1 and 7 cycles on the P5 architecture, depending on the input arguments. FDIV is less forgiving, since it may take 39 or 42 cycles.

Trigonometric and transcendental functions can take as few as a couple dozen cycles on special input arguments, but not in the general case, where they indeed fall in the 70-120 cycle range, again on the simple P5.

In any case, the key here is pipelining and branch prediction which allows those "incredible" throughputs of multiple instructions per cycle (IPC) rather than cycles per instruction (CPI). Unless you're programming on pre-P5 Intel or Pre-PPC motorola, you can assume as little as 1-2 cycles per instruction on average.


I ran entryway's test using clock() instead of SDL_GetTicks() and it had floats at ~280% faster than integers.

Maes: They execute several instructions per cycle; that doesn't mean they'll all complete in one cycle. atan2 is one of the most expensive operations.

Scet said:

I ran entryway's test using clock() instead of SDL_GetTicks() and it had floats at ~280% faster than integers.

AMD? Try with Intel then.

Scet said:

I ran entryway's test using clock() instead of SDL_GetTicks() and it had floats at ~280% faster than integers.


To make things more difficult, the performance may change dramatically between PI, PII, PIII etc. and Intel/AMD/Cyrix/whatever. What did you test it on?

Scet said:

They execute several instructions per cycle; that doesn't mean they'll all complete in one cycle. atan2 is one of the most expensive operations.


Yeah, but if you start a new instruction on every cycle, then on average a full, non-stalled pipeline will also start churning out one completed instruction (or more) per cycle, with some latency.

E.g. if I pump 20 different instructions into my 20-cycle pipeline, after 20 cycles of nothing I will start getting an apparent one-cycle-per-instruction throughput. With superscalarity and no stalls, I can get even more. And if I keep the pipeline full, then the miraculous one cycle per instruction becomes an effective reality even without a RISC architecture, and without the 20-cycle latency. Unless, ofc, the pipeline stalls and flushes everything...

Welcome to 1995, where even Wintel platforms can beat RISC at general-purpose computing, with da powers of the Pentium!!!

