Quasar

Future-safing C code?


4mer said:

The number of bits required to store your result 100000000 exactly: log2(100000000) ≈ 26.6, which rounds up to 27.

32-bit floats only have a 24-bit significand, so they don't have enough bits.

// Includes added for a self-contained build; the Accumulate<> template
// itself came from the original gamedev.ru post and is not shown here.
#include <cstdio>
#include <cstdlib>
#include <windows.h>   // GetTickCount

int main(int argc, char* argv[])
{
  unsigned int i, k, size, count, t;
  double total;
  float f, val;
  float *a;

  count = 1000;
  size = 1000000;
  val = 100.0f;
  a = (float*)malloc(size * sizeof(a[0]));
  for (i = 0; i < size; i++)
    a[i] = val;

  total = 0.0;
  t = GetTickCount();
  for (k = 0; k < count; k++)
  {
    f = Accumulate<float, float, SumPolicy>::up<1>(a, 0, size);
    total += (double)f;
  }
  printf("1) sum: %f, total: %f; time %d\n", f, total, GetTickCount() - t);

  total = 0.0;
  t = GetTickCount();
  for (k = 0; k < count; k++)
  {
    f = 0.0f;
    for (i = 0; i < size; i++)
      f += a[i];
    total += (double)f;
  }
  printf("2) sum: %f, total: %f; time %d\n", f, total, GetTickCount() - t);
  
  getchar();
  free(a);
  return 0;
}
1) sum: 100000000.000000, total: 100000000000.000000; time 2511
2) sum: 98684352.000000, total: 98684352000.000000; time 3042

Template code is precise and faster. And it still uses floats.


The template code is black magic. Unless someone can explain what this mess does, it's close to useless, even if in this example it's more precise.

In the end it's just a convoluted workaround for having used the wrong data type: 32-bit floats are not meant for number crunching. Outside of graphics hardware they offer no advantage whatsoever aside from saving a few bytes.

4mer said:

That's cheating, you need to do this:

  size = 100000000;
  val = 1.0f;

  count = 10;
  size = 100000000;
  val = 1.0f;
1) sum: 100000000.000000, total: 1000000000.000000; time 2559
2) sum: 16777216.000000, total: 167772160.000000; time 3073


Code like that is worthless if not well commented. I'm sure there's some sound theory buried under all that verboseness but I sure can't see it.


Ok, ok. You can accumulate floats with PureC too.

/* Pairwise (divide-and-conquer) summation: rounding error grows
   O(log n) instead of O(n) for the linear loop. */
float PureC(float *a, unsigned int count)
{
  if (count == 1)
    return *a;
  unsigned int cnt = count >> 1;
  return PureC(a, cnt) + PureC(a + cnt, count - cnt);
}

1) sum: 100000000.000000, total: 1000000000.000000; time 2589 (templates)
2) sum: 16777216.000000, total: 167772160.000000; time 3073 (linear sum)
3) sum: 100000000.000000, total: 1000000000.000000; time 13822 (PureC recursion)

13822 msec - so slooow.


Fastest (and precise) version if you want:

template <size_t indx>
inline float my_summ(const float* f) {
  return *f + my_summ<indx - 1>(f + 1);
}
template <>
inline float my_summ<1u>(const float* f) {
  return *f;
}
// Sums 16-element blocks through the unrolled template, then the tail.
float my_t_summ(const float* f, size_t n) {
  float res = 0.0f;
  const size_t block_size = 16;
  const size_t blocks = n / block_size;
  for (size_t i = 0; i < blocks; ++i) {
    res += my_summ<block_size>(f);
    f += block_size;
  }
  const size_t tail = n % block_size;
  for (size_t i = 0; i < tail; ++i) {
    res += *f;
    f++;
  }
  return res;
}

  count = 10;
  size = 100000000;
  val = 1.0f;
  ...
  t = GetTickCount();
  for (k = 0; k < count; k++)
  {
    f = my_t_summ(a, size);
    total += (double)f;
  }
  printf("4) sum: %f, total: %f; time %d\n", f, total, GetTickCount() - t);
4) sum: 100000000.000000, total: 1000000000.000000; time 1763


You forgot to provide one important piece of info:

How large is the fucking code created by the template?

I already had the suspicion that its main purpose is to force full inline expansion on what it does and seeing the time difference between the recursive and the template method just confirms it.

Also, I'm sure the plain C version can be optimized without sacrificing readability.

Graf Zahl said:

How large is the fucking code created by the template?

test02.exe - 15 360 bytes (includes all 4 methods)


For what it's worth...

public class floats {

public static void main(String[] argv)
{
  float f;
  long a,b;
 
  f = 0.0f;
  a=System.nanoTime();
  for (int i = 0; i < 100000000; i++)
    f += 1.0f;
  b=System.nanoTime();
  
  System.out.println("summ1: "+ f +" time "+((b-a)/1000000.0)+" ms");

  f = 0.0f;
  a=System.nanoTime();
  for (int i = 0; i < 1000000; i++)
    f += 100.0f;
  b=System.nanoTime();
  System.out.println("summ2: "+ f+" time "+((b-a)/1000000.0)+" ms");
}
}
Output:

summ1: 1.6777216E7 time 122.819241 ms
summ2: 9.8684352E7 time 1.198755 ms

entryway said:

When you want to sum array of floats you must use well designed understandable C++ code:

scary gobbledygook
NOT SOMETHING STUPID LIKE THAT!
trivially simple code


Haha!


Wait, you're serious?

entryway said:

1) sum: 100000000.000000, total: 100000000000.000000; time 2511
2) sum: 98684352.000000, total: 98684352000.000000; time 3042

Template code is precise and faster. And it still uses floats.


Hot damn, he's serious.

Gez said:

Hot damn, he's serious.


What he did was use "split accumulators", i.e., limiting each sum to smaller buckets where the error isn't as likely to accumulate that much.

Even if you split the sum only five- or tenfold between loops, you can obtain great improvements.

public class floats {

  final static int BIG_SUM = 100000000;
  final static int SMALL_SUM = 1000000;
  final static int PARTIAL_SUMS = 10;

  public static void main(String[] argv)
  {
    float f, fp[];
    long a, b;

    f = 0.0f;
    fp = new float[PARTIAL_SUMS];

    a = System.nanoTime();
    for (int p = 0; p < PARTIAL_SUMS; p++) {
      for (int i = 0; i < BIG_SUM / PARTIAL_SUMS; i++)
        fp[p] += 1.0f;
      f += fp[p];
    }
    b = System.nanoTime();

    System.out.println("summ1: " + f + " time " + ((b - a) / 1000000.0) + " ms");

    f = 0.0f;
    java.util.Arrays.fill(fp, 0.0f);

    a = System.nanoTime();
    for (int p = 0; p < PARTIAL_SUMS; p++) {
      for (int i = 0; i < SMALL_SUM / PARTIAL_SUMS; i++)
        fp[p] += 1.0f;
      f += fp[p];
    }
    b = System.nanoTime();
    System.out.println("summ2: " + f + " time " + ((b - a) / 1000000.0) + " ms");
  }
}
summ1: 1.0E8 time 368.46359 ms
summ2: 1000000.0 time 3.566375 ms

;-)


Ah indeed, sorry but earlier I thought you were asking for an explanation rather than demonstrating something.

entryway said:

Not me btw. From gamedev.ru topic.


I had actually encountered that same problem during my master's thesis: using floats caused incoherent results when summations were split across different numbers of parallel threads (using 2 threads was particularly nasty, as the splits occurred at error maxima points).

The best course of action in my case was to use doubles, since having threaded split summations for EVERY loop that used them would make the code harder to maintain, plus there were matrix-vector operations involved too.

But if you really need all the precision you can get from just floats, then you must split. In the case of 100,000,000 sums, a split factor of 10 was required.

Also, what's with the times everybody's posting? Are they ms or us?
With gcc I got 375.0 and 15.0 ms on the simple summations (close to Java when using the split sum; otherwise Java can be even faster (!)). Of course there's the 15 ms timer resolution limit in Windows, which skews times a bit... but the longer summation is clearly slower in C, and -O3 actually makes it slower.

BTW...

      program floats
      implicit none
      
      integer,parameter:: BIG_SUM=100000000, SMALL_SUM=1000000
      integer i
      real f,a,b
      
      f=0.0
      
      call cpu_time(TIME=a)
      do i=1,BIG_SUM
           f=f+1.0
      end do
      call cpu_time(TIME=b)

      print*,"summ1",f,"time",(b-a)
      
      f=0.0
      call cpu_time(a)
      do i=1,SMALL_SUM
           f=f+1.0
      end do
      call cpu_time(b)
      
      print*,"summ2",f,"time",(b-a)
      end program floats
Even that nasty piece of code results in:

summ1 1.6777216E+7 time 0.359375
summ2 1000000. time 0.

Which means seconds (so again we see the 15 ms resolution at play, and near 360 ms for the 100000000 stuff, while the smaller summation is unmeasurable). Surprisingly, good old Fortran did better on the small summation without any compensation/split computation.

Maes said:

summ1: 1.6777216E7 time 122.819241 ms
summ2: 9.8684352E7 time 1.198755 ms

lol, so fast... I had to use Intel Compiler to beat Java :)

  f = 0.0;
  t1 = GetTickCount();
  for (i = 0; i < 100000000; i++)
    f += 1.0f;
  t2 = GetTickCount();
  printf("sum: %f; time %d\n", f, t2 - t1);
msvc 2008
sum: 16777216.000000; time 312

Intel Compiler 10.x
sum: 100000000.000000 (lol!); time 94

MSVC does loop unrolling:
loc_40101A:
sub     eax, 1
fld     [esp+0Ch+var_4]
fadd    st, st(1)
fstp    [esp+0Ch+var_4]
fld     [esp+0Ch+var_4]
fadd    st, st(1)
fstp    [esp+0Ch+var_4]
fld     [esp+0Ch+var_4]
fadd    st, st(1)
fstp    [esp+0Ch+var_4]
fld     [esp+0Ch+var_4]
fadd    st, st(1)
fstp    [esp+0Ch+var_4]
fld     [esp+0Ch+var_4]
fadd    st, st(1)
fstp    [esp+0Ch+var_4]
fld     [esp+0Ch+var_4]
fadd    st, st(1)
fstp    [esp+0Ch+var_4]
fld     [esp+0Ch+var_4]
fadd    st, st(1)
fstp    [esp+0Ch+var_4]
fld     [esp+0Ch+var_4]
fadd    st, st(1)
fstp    [esp+0Ch+var_4]
fld     [esp+0Ch+var_4]
fadd    st, st(1)
fstp    [esp+0Ch+var_4]
fld     [esp+0Ch+var_4]
fadd    st, st(1)
fstp    [esp+0Ch+var_4]
jnz     short loc_40101A

ASM listing by Intel Compiler is simple:
loc_401029:
fadd    st, st(1)
add     ecx, 1
cmp     ecx, 5F5E100h
jb      short loc_401029


MSVC is only so slow because its floating point precision model is set to 'strict' or 'precise'. Set it to 'fast' and you'll be surprised what kind of difference it makes. Of course it will also mean that many internal results are kept as doubles (which, btw, is why the Intel compiler's result is correct.)


Actually, with the 'precise' model, the use of single-precision floats is orders of magnitude less efficient than using doubles.

Graf Zahl said:

MSVC is only so slow because its floating point precision model is set to 'strict' or 'precise'. Set it to 'fast' and you'll be surprised what kind of difference it makes. Of course it will also mean that many internal results are kept as doubles (which, btw, is why the Intel compiler's result is correct.)

Heh. MSVC 'fast' model:
sum: 100000000.000000; time 93

ASM listing

loc_40101E:
sub     eax, 1
fadd    st, st(1)
fadd    st, st(1)
fadd    st, st(1)
fadd    st, st(1)
fadd    st, st(1)
fadd    st, st(1)
fadd    st, st(1)
fadd    st, st(1)
fadd    st, st(1)
fadd    st, st(1)
jnz     short loc_40101E


Wow, Intel sure is fast, no wonder they charge a pretty penny for their compilers. I doubt the Intel Fortran compiler could do much better here; the GNU Fortran compiler is no better than gcc, but somehow it managed to get better precision from plain "real" types, which are 32-bit floats.

Java can only come close to that with the non-split version, and then again there's the timing uncertainty of +/- 15 ms for Intel (Java is timed with nanoTime); but as far as GCC is concerned, it's not the first time it got its ass handed to it by Java.

BTW, my tests were run on a Core 2 Duo T8300 (2.40 GHz, single threaded).

Maes said:

Wow, Intel sure is fast

In this test, only because it uses the 'fast' math model by default.

Without
fstp [esp+0Ch+var_4]
fld [esp+0Ch+var_4]
after each
fadd st, st(1)

I am using Core2Duo E6750 @ 2.66GHz

Maes said:

But if you really need all the precision you can afford from just floats, then you must split. In the case of 100.000.000 sums, a split factor of 10 was required.

Depends on the data, I think. You should sort the array of floats and sum them in pairs from smallest to biggest.

0.01
0.02
0.02
0.05
10.0
1000.0

((0.01+0.02) + (0.02+0.05)) + (10.0+1000.0)

Or smarter


Yeah, the actual splits should be done wherever there are cumulative error minima, which in turn depend on the data being summed.

Also, the cumulative errors from the different splittings should ideally cancel each other out; I wouldn't be too surprised to learn that doing this with the maximum possible accuracy is an NP-complete optimization problem.

However, you can't know that a priori, nor am I sure there is a general estimation method that doesn't itself take too much time compared to the sum.

Also, a specific splitting designed to "best fit" a specific set of data, or similar ones, isn't likely to be very useful.


So if Quasar is going to use the overly complex and cumbersome STL templates, what about when one of his favorite compiler or aspiring favorite compiler has a broken implementation of it?

Is he going to use STL at all beyond the basic stuff?

GhostlyDeath said:

So if Quasar is going to use the overly complex and cumbersome STL templates, what about when one of his favorite compiler or aspiring favorite compiler has a broken implementation of it?

Is he going to use STL at all beyond the basic stuff?

First off, nobody said a word about me using STL, and so far I have not in fact used it. Why? Because STL containers do not implement the kind of access-time complexity guarantees OR, most importantly, the cache performance that is needed in the code I am redesigning.

But I object to your attitude toward the STL anyway. There's nothing particularly cumbersome or poorly designed about vector, set, map, multimap, etc. The entire program I work on at my job is built on almost nothing but STL objects, outside of the VCL framework that's used for the UI. maps on string to string and even vectors of maps on string to string are regularly used, and even passed by value in a lot of instances, and the program runs with excellent responsiveness and a minimum of memory leaks (most of those that exist are in the VCL framework itself).

STL, like all programming tools, is fine when used in the proper circumstances. That does not include inside the inner loops of a game engine, or in game sim code that must run thousands of times per second IMO.

Quasar said:

But I object to your attitude toward the STL anyway. There's nothing particularly cumbersome or poorly designed about vector, set, map, multimap, etc. The entire program I work on at my job is built on almost nothing but STL objects, outside of the VCL framework that's used for the UI. maps on string to string and even vectors of maps on string to string are regularly used, and even passed by value in a lot of instances, and the program runs with excellent responsiveness and a minimum of memory leaks (most of those that exist are in the VCL framework itself).


I'm not saying that STL is bad (I do use it! yeah oh yeah). However, since the STL uses templates, it inherits both the advantages and the disadvantages of templates. Templates cause code bloat, especially if you instantiate one for a single type used only once, and they increase the binary complexity of the code. The STL has useful and not-so-useful things in it (to me anyway), but in the time tradeoff I prefer to write my own implementations, which can be very simple and live where they're defined, so I know what's going on and can debug without worrying about third-party code. So in short I don't have anything against the core of the STL, except that the templates mess it up a bit.

At least the AP C++ Library was not standardized, that completely sucks, trust me. My school was throwing away old C++ books so I took one. Of course the book is an AP C++ Book, but AP sucks. The only thing AP was good for was saving me money in College by allowing me to use my high score in the test to skip a course and have it count as a transfer, thus throwing me in the next Java course. Regardless I flew through that and the next Java class with simple ease (and I was one of the few Linux users in the Room, however they were not using Debian). But I digress, maybe...

There's also the problem of broken STL implementations on some compilers, I've seen it happen and usually writing the code to make it work on them all is very very ugly.

On the C++ side of things, I prefer to do OOP the way I do it in Java (I don't mix OOP styles, though). If you're writing a C program and you just use C++ stuff like new and delete and a small set of things you're better off sticking with pure C.

Uh oh, I hope no library war is brewing over the horizon in the coffee pot.


I think the STL is like any library. If it suits your needs then use it, if it doesn't then don't. Generally, unless you're writing extremely tight code (like a software graphics renderer), the STL is probably fine. In fact, outside the renderer does any of this performance stuff really affect anything? Plus, Eternity's had its own data structure definitions for a long time, mdllist and e_hash work for basically everything, and there's QStr too (or qstr, I forget what it is).

Ladna said:

I think the STL is like any library. If it suits your needs then use it, if it doesn't then don't. Generally, unless you're writing extremely tight code (like a software graphics renderer), the STL is probably fine. In fact, outside the renderer does any of this performance stuff really affect anything? Plus, Eternity's had its own data structure definitions for a long time, mdllist and e_hash work for basically everything, and there's QStr too (or qstr, I forget what it is).

qstring, actually - Quasar's String Module ;) I'll probably be giving that some class lovin' soon as well. I think it is actually more featured than std::string at this point :P

GhostlyDeath said:

If you're writing a C program and you just use C++ stuff like new and delete and a small set of things you're better off sticking with pure C.



No, you are not!

One of the bigger advantages of C++ over C is stricter type checking. That alone is worth a lot, even if you don't need OOP at all.


BTW, ZDoom also does not use STL but its own implementations of vector, map and string. Why?

1. To guarantee that all platforms use the same code
2. To not depend on potentially buggy and incompatible implementations

And if there's one big thing against STL it's that STL code is almost impossible to debug. The STL libraries that come with MSVC look like utter Nazi-OOP and are close to unreadable, building template upon template upon template. And if you need to debug such code for whatever reason it's really going to get nasty.


STL is childish babble in comparison with boost. Eternity needs more boost! boost::thread, boost::asio for network, boost::spirit for parser, boost::regex, algorithms, smart pointers, etc, etc.

