Fastest Sine / Cosine Table Possible
Kris Asick
Member #1,424
July 2001
NOTE: Accidentally put this here instead of under "Programming Questions"... might want to move it... After some experimenting I've deduced that the following is the fastest way to implement sine/cosine tables through a function that is interchangeable with the built-in sin() and cos() functions included in MATH.H.

// NOTE: Code to build sine table not included in this excerpt.
#define SIN_TABLE_SIZE 4096
#define SIN_TABLE_BITMASK 4095

double _sin_t[SIN_TABLE_SIZE];

double Get_SinT (double d)
{
    return _sin_t[(int)(d * 651.8986469f) & SIN_TABLE_BITMASK];
}

My tests showed this routine to be almost twice as fast as the normal sin() function with no compiler optimizations enabled. Does anyone have any ideas how to make this even faster, or is that about as fast as it's going to get while still being compatible with radians?

--- Kris Asick (Gemini)
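(The table-building code isn't shown above; a minimal sketch of what it might look like, given the post's constants. Build_SinT is a hypothetical name. Each entry i would hold the sine of i scaled to one full period, so that d * 651.8986469, i.e. d * SIN_TABLE_SIZE / (2*pi), indexes it.)

#include <math.h>  /* sin(); reuses SIN_TABLE_SIZE and _sin_t from above */

void Build_SinT (void)
{
    int i;
    for (i = 0; i < SIN_TABLE_SIZE; i++)
        _sin_t[i] = sin(i * (6.28318530717958647692 / SIN_TABLE_SIZE));
}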
orz
Member #565
August 2000
You cast a floating point value to an integer. C requires that casts use round-towards-zero rules. Most x86 hardware, last I checked, required a slow switch of rounding modes to enter that mode, then a float->int conversion, then a slow switch back to normal mode (round-to-nearest, I think). Possibly that's been improved in one of the more recent instruction set extensions, I don't know. Some compilers support a rint() or something for float->int conversion with different rounding rules, but I can't find it atm. My inline asm code for float->int conversion:
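(The listing itself was lost from this excerpt; the following is a hypothetical reconstruction of the kind of MSVC-style inline asm meant here. fistp stores using the FPU's current rounding mode, round-to-nearest by default, which skips the mode switches a C cast needs.)

int iround(double a)
{
    int i;
    __asm {
        fld   a    ; push the double onto the FPU stack
        fistp i    ; store as int using the current rounding mode
    }
    return i;
}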
Benchmark that against a regular C float->int cast, see if it helps. edit: note, for supporting sin and cos, you can either add a constant before the masking, or you can use a table 25% larger and add a constant after the masking, depending upon whether you want to reduce instruction count to the minimum or save a little extra space (memory, cache, and TLB usage). Also, for your table size, there's no real advantage to using doubles instead of floats for the table.
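(A sketch of the first option, with a hypothetical function name, reusing the table and constants from the first post: since cos(x) = sin(x + pi/2), adding a quarter period to the index before the masking turns the sine table into a cosine lookup with no extra storage.)

double Get_CosT (double d)
{
    /* SIN_TABLE_SIZE / 4 entries corresponds to pi/2 radians */
    return _sin_t[((int)(d * 651.8986469f) + SIN_TABLE_SIZE / 4) & SIN_TABLE_BITMASK];
}

(The larger-table variant would instead mask first and add the offset afterwards, letting the constant fold into the memory addressing as a displacement when both sine and cosine are fetched for the same angle.)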
HoHo
Member #4,534
April 2004
Don't most newer CPUs have a special instruction that returns sin/cos values from a HW table? I think I saw something about it in the Intel architecture manuals.
tobing
Member #5,213
November 2004
You might want to use functions from FFTW, which has superfast implementations of sin and cos, but also much more... Edit: Beware, it's GPL!
Kris Asick
Member #1,424
July 2001
orz: I tried implementing your code in place of the integer type-casting. There was a measurable drop in speed. My benchmark (100,000,000 sin-table operations with no compiler optimizations) went from about 4.5 seconds to complete its calculations to 6 seconds. Though that's still better than the 7.8 seconds the regular sin() and cos() routines take. In case it makes a difference (since your code was assembler after all), I am using MSVC 6.0 under Windows 98SE. My processor is an AMD Sempron 2.0 GHz. And I'm only using doubles because the sin() and cos() routines do. I plan to switch to regular floats when I add this to my game engine.

tobing: I always prefer to do things myself. A testament to that is the random number generator I made which I use in my games.

--- Kris Asick (Gemini)
HoHo
Member #4,534
April 2004
Quote: ... with no compiler optimizations

Why don't you allow compiler optimizations? It's kind of stupid to not let compilers do their work. After all, un-optimized code is not what you will use in release, is it?

Quote: I am using MSVC 6.0

That might explain some things. This compiler is one of the slowest there is. I suggest getting GCC 4.1 (MinGW) and doing the benchmarks again.

Quote: And I'm only using doubles because the sin() and cos() routines do

For floats there are sinf and cosf that should be quite a bit faster than their double counterparts. I for one haven't yet seen a place where floats don't give enough precision, especially as you only use x87. I would understand that 32-bit SSE floats can cause some problems when you are not careful, but even those are usable if you are careful. Also, if you decide to use GCC, you can try out some of its built-in functions that should map directly to machine instructions, possibly providing a significant boost over the standard library implementation. Also, I would like to see your benchmark program. It would be interesting to see how it behaves on different CPUs.
tobing
Member #5,213
November 2004
Quote: I always prefer to do things myself.

Uh-oh. Good luck with that. I would get nowhere if I had to do everything myself, without relying on powerful libraries, allegro being just one of many. I would also not be keen to write my own compiler, OS, etc...
orz
Member #565
August 2000
Quote: I tried implementing your code in place of the integer type-casting. There was a measurable drop in speed. My benchmark (100,000,000 sin-table operations with no compiler optimizations) went from about 4.5 seconds to complete its calculations to 6 seconds. Though that's still better than the 7.8 seconds the regular sin() and cos() routines take.
Benchmarking should be done with full optimization enabled. The limiting factors in debug mode can be wildly different from the limiting factors in release mode in some cases. Of course, you also need to take care that your code does something that can't be optimized away to nothing, otherwise release mode will be even further from what you want than debug mode.

Quote: In case it makes a difference (since your code was assembler after all), I am using MSVC 6.0 under Windows 98SE. My processor is an AMD Sempron 2.0 GHz.

I used MSVC 7.1 under win2k on an Athlon XP 2600+.

Quote: And I'm only using doubles because the sin() and cos() routines do. I plan to switch to regular floats when I add this to my game engine.

The table itself can be floats without affecting the type of the return value of your lookup function. I should further note that I prefer not to use lookup table sin values in my own code, except when I'm doing everything in fixed point math. I avoid them because the imprecision, while minor most of the time, sometimes becomes quite relevant, and such times may not always be immediately obvious; I consider it not worth the effort involved in figuring out when sin tables won't hurt, so I never use them in regular code. A more accurate version can be done using table lookups and interpolating between adjacent table entries (see the sketch at the end of this post), but that loses much of the speed advantage and still, on very rare occasions, has precision issues. Instead, to optimize my sin() and cos() calls, I use the fsincos opcode, which calculates both values with a single opcode, and is faster than calling regular sin() and cos(), though not nearly as fast as simple table lookups (almost as fast as interpolated lookups though):
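(The listing was lost from this excerpt; below is a hypothetical reconstruction of the kind of wrapper meant here. fsincos leaves the cosine in ST(0) and the sine in ST(1), so a single opcode yields both values.)

#ifdef _i386_
void sin_cos(double angle, double *s, double *c)
{
    double s_, c_;
    __asm {
        fld     angle   ; push the angle
        fsincos         ; ST(0) = cos(angle), ST(1) = sin(angle)
        fstp    c_      ; pop the cosine
        fstp    s_      ; pop the sine
    }
    *s = s_;
    *c = c_;
}
#endif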
note that MSVC does not by default define the _i386_ symbol; I don't know how you're supposed to recognize the target instruction set on that compiler, so I just define it myself.

edit: Quote: For floats there are sinf and cosf that should be quite a bit faster than their double counterparts. I for one haven't yet seen a place where floats don't give enough precision, especially as you only use x87. I would understand that 32-bit SSE floats can cause some problems when you are not careful, but even those are usable if you are careful.

I have never heard of sinf or cosf before. Are these standard, portable? My benchmarking finds them to be identical in speed to regular sin() and cos() on my compiler.
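(The interpolated lookup referred to above, as a sketch with hypothetical names, reusing the table from the first post; it assumes non-negative angles so that the truncated integer part and the fractional part stay consistent.)

double Get_SinT_Lerp (double d)
{
    double pos  = d * 651.8986469;      /* SIN_TABLE_SIZE / (2*pi) */
    int    i    = (int)pos;             /* or the asm iround above */
    double frac = pos - (double)i;      /* weight of the next entry */
    double a = _sin_t[i & SIN_TABLE_BITMASK];
    double b = _sin_t[(i + 1) & SIN_TABLE_BITMASK];
    return a + (b - a) * frac;          /* linear blend of neighbours */
}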
Kris Asick
Member #1,424
July 2001
HoHo:
1. MSVC vs. GCC is one of those arguments I am not going to be dragged into. It's like comparing MS Word to WordPerfect.
2. sinf() benchmarks at about 11 seconds using my program... worse than sin().
3. My benchmark program is just a for-loop and the command whose speed I wish to test, assigned to a variable. My program doesn't actually do the timing part; I time it myself using a stopwatch a few times, then figure out the average. It was faster to program that way, and at the time I wanted it programmed in a few minutes using the console instead of a half hour using Allegro's timer functions.

tobing: Well, obviously I use Allegro, but the only thing stopping me from learning to write my own library of functions is the fact that I don't have access to enough variations of computers to test such an implementation, whereas Allegro has been through its paces on so many computers that I've never heard of my games crashing horribly on anyone's system due to Allegro's fault. (Let alone my own, actually.)

orz: MSVC is optimizing the area of my benchmark that is just a fixed command over and over to only process once, thus no matter how many times I loop, it's only taking a split second to execute with either the built-in commands or the table commands, thus making it impossible for me to figure out how my functions will perform when optimized.

--- Kris Asick (Gemini)
gnolam
Member #2,030
March 2002
HoHo said: I suggest getting GCC 4.1 (MinGW) and doing the benchmarks again.

Err... there is no GCC 4.1 for MinGW.
orz
Member #565
August 2000
Quote: MSVC is optimizing the area of my benchmark that is just a fixed command over and over to only process once, thus no matter how many times I loop, it's only taking a split second to execute with either the built-in commands or the table commands, thus making it impossible for me to figure out how my functions will perform when optimized.

The problem is that your code is written in such a way that discarding your calls is legal for the compiler, because your calls don't actually do anything except set a variable which is never looked at. The trick is to do something such that the code cannot behave correctly without calling your function the correct number of times. Try adding up all return values and printing the sum. This is a sample of the code I benchmarked with:

sum = 3.1;
for (int i = 0; i < n; i++) {
    for (int j = 0; j < 1000; j++) {
        sum += sinf(sum);
        sum += sinf(sum);
        sum += sinf(sum);
        sum += sinf(sum);
    }
}
double time4 = get_time2();
double dt3 = (time4 - time3) / (n * 4000.0) * 1000000;
printf("algorithm3: %.3f ns / sin; (check:%.9f)\n", dt3, sum);
Quote: 1. MSVC vs. GCC is one of those arguments I am not going to be dragged into. It's like comparing MS Word to WordPerfect.

It's not just MSVC vs GCC. MSVC 6.0 is old, and extremely buggy if you use C++ templates beyond STL. If you don't use C++ templates, it's probably adequate. MSVC 7.1 is much less buggy in that regard, as are most recent and semi-recent versions of GCC.

edit: more recent MSVCs and GCCs are also faster than MSVC 6.0

edit: just for fun, this is the output of my current code:

Get_SinT                                27.525 ns / call; (check:2.510948387)
get_sin_via_table2                      18.895 ns / call; (check:1.646245552)
get_sin_via_table2_with_interpolation   34.012 ns / call; (check:2.000266046)
_sinf                                   74.413 ns / call; (check:2.000186454)
sin                                     84.446 ns / call; (check:2.000186454)
_fsincos                                95.225 ns / call; (check:2.000186454)

I've modified it to make imprecisions in the results more obvious in the "check" value, and to simplify the code and make the output more readable. Notice how the top two (your original code and a modification of it to use inline assembly float->int conversion) produce large deviations, while the interpolating version (which also uses asm float->int) produces small deviations, and the remaining three produce identical results? Also notice that _fsincos is only slightly slower than sin despite calculating both sin and cos.
Kris Asick
Member #1,424
July 2001
Well, at the moment, MSVC 6.0 is what I use. I've never had any problems with it, and to use any higher version I would have to upgrade my OS as well. And in the past, GCC hated me. Partly why I stuck with MSVC.

Making the sin/cos table larger would probably help reduce the deviation values without having to add that extra little overhead of interpolation. After all, a 4096-entry table only takes up 16 KB of memory using floats and 32 KB using doubles. Making the table 16,384 entries big would only take 64 KB or 128 KB, which is still not that much when you consider the MBs worth of graphics, sounds and music.

And I get what you mean about what the compiler's doing with optimizations. Actually, I never really paid much attention to optimizations. I just normally turn 'em to full-speed and let them do their thing.

--- Kris Asick (Gemini)
Paul Pridham
Member #250
April 2000
Not trying to derail your thread Kris, but...

HoHo said: That might explain some things. This compiler is one of the slowest there is. I suggest getting GCC 4.1 (MinGW) and doing the benchmarks again.

I recently attempted to switch from MSVC6 to the latest mingw32 package. My current game, which uses a lot of sin/cos and software polygon rendering, experienced a noticeable drop in framerate, in the neighbourhood of 10 FPS, with mingw32. Otherwise, the switch went well, using MinGW Visual Studio, which is a lot like MSVC6. Unless I can see some metrics on the supposed superiority of mingw32 over MSVC6, and an explanation of how to access this superior performance, I'll be sticking with MSVC6. STL and C++ I avoid like the plague anyway, so MSVC6 works perfectly fine.
Bob
Free Market Evangelist
September 2000
So, what happens if you supply a value outside the range [0..2pi]? What precision do you really get with a table lookup?
orz
Member #565
August 2000
Kris Asick:

Get_SinT                                26.865 ns / call; (check:2.141740928)
get_sin_via_table2                      18.751 ns / call; (check:1.982911766)
get_sin_via_table2_with_interpolation   33.851 ns / call; (check:2.000186454)
_sinf                                   72.959 ns / call; (check:2.000186454)
sin                                     85.588 ns / call; (check:2.000186454)
_fsincos                                96.583 ns / call; (check:2.000186454)

The deviations by this measure got noticeably smaller. In fact, for the interpolating version, the deviation disappeared. But I am not a believer in large table sizes, at least for general purpose things; they may do fine on benchmarks until they start overflowing the cache, but in real world circumstances there's other stuff that wants a share of the cache too.

Bob:

Quote: So, what happens if you supply a value outside the range [0..2pi]? What precision do you really get with a table lookup?
edit: corrected these results, and the description; a bug initially had the table based stuff performing better

Get_SinT                                29.411 ns / call; check:1.015412770
get_sin_via_table2                      22.424 ns / call; check:0.738985208
get_sin_via_table2_with_interpolation   36.800 ns / call; check:0.778233193
sin                                     89.889 ns / call; check:0.778233193

bias: PI * 256
Get_SinT                                31.876 ns / call; check:1.015412770
get_sin_via_table2                      23.182 ns / call; check:0.738985208
get_sin_via_table2_with_interpolation   39.232 ns / call; check:0.778233193
sin                                     92.577 ns / call; check:0.778233193

bias: PI * 65536
Get_SinT                                29.303 ns / call; check:1.015412770
get_sin_via_table2                      22.898 ns / call; check:0.738985208
get_sin_via_table2_with_interpolation   36.886 ns / call; check:0.778233179
sin                                     93.601 ns / call; check:0.778233353

bias: PI * 4294967296
Get_SinT                                29.197 ns / call; check:0.962381428
get_sin_via_table2                      22.253 ns / call; check:4.964428596
get_sin_via_table2_with_interpolation   36.993 ns / call; check:-1.#IND00000
sin                                     98.563 ns / call; check:0.789689926

bias: PI * 1099511627776
Get_SinT                                29.090 ns / call; check:0.956744473
get_sin_via_table2                      22.055 ns / call; check:4.964428596
get_sin_via_table2_with_interpolation   37.330 ns / call; check:-1.#IND00000
sin                                     100.243 ns / call; check:3.776719549

bias: PI * 281474976710656
Get_SinT                                29.282 ns / call; check:2.899386226
get_sin_via_table2                      22.070 ns / call; check:4.964428596
get_sin_via_table2_with_interpolation   37.951 ns / call; check:-1.#IND00000
sin                                     103.442 ns / call; check:-4.451114322
HoHo
Member #4,534
April 2004
Just one remark: orz, could you attach your code? It would be interesting to test on other architectures and compilers. I can provide a 32-bit Core2Duo with GCC 4.1. I'm too lazy to write it myself.

If you want to know how badly MSVC sucks then go here and search for posts containing "msvc". Also keep in mind that they talk about MSVC8 there. According to MS it is a giant leap over MSVC6.
Arthur Kalliokoski
Second in Command
February 2005
Quote: slow switch of rounding modes to enter that mode (cast float to int)

I don't have the code here on this computer, but you can multiply the number by some magic value and grab the int off the mantissa and mask off the fp exponent & sign bits. If this doesn't allow enough range, you can cast to a double to get more accuracy. (A store to double is still faster than fistp.)

Quote: My current game, which uses a lot of sin/cos and software polygon rendering, experienced a noticeable drop in framerate, in the neighbourhood of 10 FPS, with mingw32.

Using a particular compiler can be like using a particular language: you have to know your way around to get the best performance. Did you try -ffast-math, -fomit-frame-pointer, -O3, -march=pentium4, -mfpmath=sse,387, and countless others? I haven't been able to outrun (within 2%) MinGW 3.4.2 with the free VC++ download thing if I spent a half hour on a particular program.

Quote: value outside the range [0..2pi]

Even the fpu sin/cos functions have limits, huge to be sure, but there nonetheless.

They all watch too much MSNBC... they get ideas.
Kikaru
Member #7,616
August 2006
Lookup table based things work well only if they sit in L1 or at least L2 cache. When they go from L1 to L2 speed drops around 3-4x, if they go to main memory speed drops about 100x compared to L1 assuming that memory prefetch doesn't work. With AMD64 things are a bit better but not that much.
orz
Member #565
August 2000
Quote: Lookup table based things work well only if they sit in L1 or at least L2 cache. When they go from L1 to L2 speed drops around 3-4x, if they go to main memory speed drops about 100x compared to L1 assuming that memory prefetch doesn't work. With AMD64 things are a bit better but not that much.

Indeed. And I forgot to make sure my memory accesses were unpredictable. My performance numbers were based upon the assumption that memory access mostly went to the L1 (though, on an Athlon-based system, L1 is big enough to fit decent tables; on an intel system IIRC L1 is generally smaller but the speed hit for dropping to L2 is a lot better). Unfortunately, it's hard to make sure it can't prefetch my memory without sticking in frequent random number generator calls, which slows things down a bit independently of the trig speed.

Quote: I don't have the code here on this computer, but you can multiply the number by some magic value and grab the int off the mantissa and mask off the fp exponent & sign bits. If this doesn't allow enough range, you can cast to a double to get more accuracy. (A store to double is still faster than fistp.)

I don't really like that method, as the behavior is kinda stupid when the input goes outside of the intended range. But it does seem to improve speed a little more.

#elif defined USE_BINARY_FLOAT_CONVERSION
/* 6755399441055744.0 is 1.5 * 2^52; adding it forces the rounded
   integer into the low bits of the double's mantissa, so the low
   32 bits can be read back directly (little-endian x86 assumed). */
#define BIGDOUBLE 6755399441055744.0
int iround(double a)
{
    int i;
    a = a + (BIGDOUBLE + 0);
    i = *((int*)&a);
    return i;
}
Quote: orz, could you attach your code? It would be interesting to test on other architectures and compilers. I can provide a 32-bit Core2Duo with GCC 4.1. I'm too lazy to write it myself.

Done. Be warned though, this is kinda a mess that was quickly assembled from pieces of several different projects. There's a ton of code you'll just want to skip over to get to the relevant stuff.

edit: The output on my computer from the code as posted:

dummy                                    5.672 ns / call; check:2.1907804172
Get_SinT                                33.177 ns / call; check:1.7299634784
get_sin_via_table2                      23.802 ns / call; check:1.7279857059
get_sin_via_table2_binary_iround        18.555 ns / call; check:1.7279857059
get_sin_via_table2_with_interpolation   36.615 ns / call; check:1.7273836151
_sinf                                   61.737 ns / call; check:1.7273835937
sin                                     71.317 ns / call; check:1.7273835937
_fsincos                                82.553 ns / call; check:1.7273835937
Dummy does nothing except the overhead: it calls the random number generator, adds things up, but no sin/cos stuff.
Fladimir da Gorf
Member #1,565
October 2001
Do any of these benchmarks take into account that such lookup tables rely on the CPU cache...

OpenLayer has reached a random SVN version number ;) | Online manual | Installation video! | MSVC projects now possible with cmake | Now available as a Dev-C++ Devpack! (Thanks to Kotori)
orz
Member #565
August 2000
Quote: Do any of these benchmarks take into account that such lookup tables rely on the CPU cache...

The timing numbers in the last set include plenty of L1 misses. They don't include many L2 misses though, as even the 256 kilobyte table fits into the L2 cache footprint. The L1 miss rate is worse than real world in most cases, since it's largely based upon randomly generated angles. The L2 miss rate is better than real world in most cases, since no other major program data is used simultaneously. If you want a graph of log(table size) vs performance a la linpack, you could simply adjust the table size and run again a few times.
Tobias Dammers
Member #2,604
August 2002
Not quite. Most real-world applications do a lot of other stuff between calls to sin() or cos(), and the more that happens, the more cache misses you get.