|
double precision for al_transform_coordinates? |
Edgar Reynaldo
Major Reynaldo
May 2007
|
I'm guessing this is moot because of implementation details, but I was just wondering if there could be a use for a double precision version of al_transform_coordinates? I'm guessing Allegro uses floats in its transformation matrices, so there wouldn't be much point if that is true. It might matter if there was a version that took double pointers instead of float pointers, because right now I have to declare two floats and perform assignment to get the data back into my double types. It's a data-intensive operation in this case, so it might matter at least a little bit.

My Website! | EAGLE GUI Library Demos | My Deviant Art Gallery | Spiraloid Preview | A4 FontMaker | Skyline! (Missile Defense) Eagle and Allegro 5 binaries | Older Allegro 4 and 5 binaries | Allegro 5 compile guide |
Mark Oates
Member #1,146
March 2001
|
What are you working on that needs double instead of float?
Edgar Reynaldo
Major Reynaldo
May 2007
|
I'm working on my Spiraloid program again, and I need super high precision angles for the spiral's theta value and theta offset, as well as rotation. Something I'm doing now is using integer decimals and exponents to keep the values the same and prevent precision loss when adding values, then I convert to doubles when I go to actually use the value. But I need high precision for the transformations that I'm applying. I suppose I have a matrix class lying around here somewhere that I could use..., but I really like Allegro's TRANSFORMs.

Edit: For example, with a radial_delta of 1 and a theta_delta of 1, that is 815,000 data points. Running at 60 Hz, that gives about 2*2*50 million float-to-double assignments and 2*50 million transformation calculations per second, which is enough to stress the CPU.

Edit 2: Here are some 11x17 prints on the wall that I made of some of my Spiraloid images today using the color copier print service at Staples. Only about $15 for 10 images, and the lady was nice enough to give me 10 free sheets of glossy photo paper to use.

[Image attachment 610268]
Elias
Member #358
May 2000
|
I actually use my own version which uses double. Float just doesn't really work at all above values of about 20,000, or you lose 1-pixel accuracy. And that's when not manipulating coordinates; longer chains of transformations basically don't work with float, period. Even with double it's easy to hit accuracy problems when you're not careful about the order of operations. So basically, I'd be for converting all floats in Allegro to double.
Edgar Reynaldo
Major Reynaldo
May 2007
|
Would that impact the FPU performance at all? Are floats significantly faster than doubles?
Chris Katko
Member #1,881
January 2002
|
Edgar Reynaldo said:
Are floats significantly faster than doubles?

Everyone on the web keeps giving out B.S. answers. I think we need to do an actual benchmark to get an answer to that. The best I could find is this: http://brandon.northcutt.net/article/double+VS+float+Speed+Comparison/20150625.html

In the synthetic test, ever-so-slightly slower. In a "real world" test, it was twice as slow. Of course, "twice as slow" is meaningless to a 4.0 GHz server with 802,351 cores.

[edit] Someone linked this talk on a Reddit post: I'm gonna watch it when I get back home. Supposedly it covers float/double performance.
Edgar Reynaldo
Major Reynaldo
May 2007
|
Chris Katko said:
Everyone on the web keeps giving out B.S. answers. I think we need to do an actual benchmark to get an answer to that. The best I could find is this: http://brandon.northcutt.net/article/double+VS+float+Speed+Comparison/20150625.html

The first thing I saw was -pg and gprof. That did not inspire confidence in me. gprof is hopelessly broken and no longer in development AFAIK (at least for MinGW).

Brandon Northcutt said:
I used a second method to evaluate a more "in the wild" performance and it yielded interesting results. For this method I compiled the program without the CPU profiling switch "-pg" and then made two binaries, one which ran only the float benchmark and one that ran only the double benchmark.
$ time ./float_bench
$ time ./double_bench

These results carry far more weight with me. But do they mean I should sacrifice the precision of doubles for the speed of floats? I don't know.

Something to note is that there were not any optimization flags passed to the compiler. It might be worth retesting the second method with optimizations enabled. I'm not on Linux so I can't use 'time' to measure it though, and I don't know how to use high performance counters on Windows yet.

Edit

Chris Katko said:
[edit] Someone linked this talk on a Reddit post: I'm gonna watch it when I get back home. Supposedly it covers float/double performance.

I watched the slideshow, and it gave some juicy tidbits about new instruction sets like AVX and AVX2 and about how 'optimizations' on one architecture can be 'stalls' on another.
Erin Maus
Member #7,537
July 2006
|
Double precision will be anywhere from slow to glacial on the GPU when compared to single precision; on CPUs (x86 at least), not so much.

GLM is a great math library. It's pretty much standalone, very portable, and has support for most everything you'd need for rendering. And it supports single and double precision matrices (and vectors, and so on).

Since Allegro's transforms are geared towards GPUs, or so I think, single precision is probably best.
Edgar Reynaldo
Major Reynaldo
May 2007
|
I modified Brandon's benchmarking program (the second method) slightly and fixed a minor bug (he was initializing a float array with 0.0, not 0.0f), then compiled it with different optimization levels and ran the tests with 1000 calls.

Zip file of code and batch scripts :

Here are the results :

c:\ctwoplus\progcode\BenchmarksAndProfiling>CompileFPUbenchmark.bat

c:\ctwoplus\progcode\BenchmarksAndProfiling>echo on

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem Compiling fpubenchmark.cpp
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O0 -o fpu32-0.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O1 -o fpu32-1.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O2 -o fpu32-2.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>mingw32-g++ -Wall -m32 -O3 -o fpu32-3.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem -m64 not supported on mingw32

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O0 -o fpu64-0.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O1 -o fpu64-1.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O2 -o fpu64-2.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem mingw32-g++ -Wall -m64 -O3 -o fpu64-3.exe -Ic:\mingw\LIBS\A5113distro\include -Lc:\mingw\LIBS\A5113distro\lib fpubenchmark.cpp -lallegro_monolith.dll
c:\ctwoplus\progcode\BenchmarksAndProfiling>RunFPUbenchmark.bat

c:\ctwoplus\progcode\BenchmarksAndProfiling>echo on
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem Running 32 bit fpu benchmarks

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-0.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results ( 43.69694183 seconds) : Total allocation time 18.57704912 seconds , total math time 23.58766864 seconds , total dealloc time 1.53222407
Float result averages : Allocation average 0.01857705 , math average 0.02358767 , dealloc average 0.00153222
double result 9674583494400.500000
Double results ( 54.04342045 seconds) : Total allocation time 19.55067594 seconds , total math time 31.37647679 seconds , total dealloc time 3.11626772
Double result averages : Allocation average 0.01955068 , math average 0.03137648 , dealloc average 0.00311627

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-1.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results ( 25.86970711 seconds) : Total allocation time 6.87207194 seconds , total math time 17.43963017 seconds , total dealloc time 1.55800499
Float result averages : Allocation average 0.00687207 , math average 0.01743963 , dealloc average 0.00155800
double result 9674583494400.500000
Double results ( 33.06386690 seconds) : Total allocation time 12.41526336 seconds , total math time 17.54913144 seconds , total dealloc time 3.09947210
Double result averages : Allocation average 0.01241526 , math average 0.01754913 , dealloc average 0.00309947

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-2.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results ( 24.68116194 seconds) : Total allocation time 6.41169159 seconds , total math time 16.69645412 seconds , total dealloc time 1.57301624
Float result averages : Allocation average 0.00641169 , math average 0.01669645 , dealloc average 0.00157302
double result 9674583494400.500000
Double results ( 32.01803220 seconds) : Total allocation time 12.20064342 seconds , total math time 16.72717455 seconds , total dealloc time 3.09021423
Double result averages : Allocation average 0.01220064 , math average 0.01672717 , dealloc average 0.00309021

c:\ctwoplus\progcode\BenchmarksAndProfiling>fpu32-3.exe
Testing 1000 calls and 6220800 memory allocations :
float result 9679922003968.000000
Float results ( 24.45988656 seconds) : Total allocation time 6.31435281 seconds , total math time 16.57656673 seconds , total dealloc time 1.56896702
Float result averages : Allocation average 0.00631435 , math average 0.01657657 , dealloc average 0.00156897
double result 9674583494400.500000
Double results ( 32.33114045 seconds) : Total allocation time 12.42040700 seconds , total math time 16.77953203 seconds , total dealloc time 3.13120142
Double result averages : Allocation average 0.01242041 , math average 0.01677953 , dealloc average 0.00313120
ECHO is on.

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem Running 64 bit fpu benchmarks

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-0.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-1.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-2.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>rem fpu64-3.exe

c:\ctwoplus\progcode\BenchmarksAndProfiling>
Here's the code I used :

//CREATED: 2015-06-25 15:11 - -BDN
//UPDATED: 2015-06-25 15:11 - -BDN
//AUTHORS: Brandon D. Northcutt (brandon@northcutt.net)
//
//This is a program intended to illustrate the relative efficiency of double versus single precision floating point numbers.
#include "allegro5/allegro.h"

#include <cstdio>
#include <cstdlib>

// volatile so the compiler cannot optimize the allocation size away
volatile int MEM = 6220800;///2015-06-25 14:27 - An RGB 1920x1080 image. -BDN
const int CALLS = 1000;

double * ad;
float * af;

int CALLNUM = 0;

// Per-call timings and running totals for the double benchmark
double double_alloc_time[CALLS];
double double_math_time[CALLS];
double double_dealloc_time[CALLS];
double total_double_alloc_time = 0.0;
double total_double_math_time = 0.0;
double total_double_dealloc_time = 0.0;

// Per-call timings and running totals for the float benchmark
double float_alloc_time[CALLS];
double float_math_time[CALLS];
double float_dealloc_time[CALLS];
double total_float_alloc_time = 0.0;
double total_float_math_time = 0.0;
double total_float_dealloc_time = 0.0;

void double_memory_allocation()
{
    ad=new double[MEM];
    for(int i=0;i<MEM;i++) ad[i]=0.0;
}

double double_math(void)
{
    double t;
    for(int i=1;i<MEM;i++)
    {
        t=(double)i;
        ad[i]=(t*t - t)/(t + t);
        ad[0]+=ad[i];
    }
    return ad[0];
}

void double_memory_deallocation()
{
    delete[] ad;// was "delete ad"; memory from new[] must be released with delete[]
}

double double_benchmark(void)
{
    double r;
    double t1 = al_get_time();
    double_memory_allocation();
    double t2 = al_get_time();
    r=double_math();
    double t3 = al_get_time();
    double_memory_deallocation();
    double t4 = al_get_time();

    total_double_alloc_time += double_alloc_time[CALLNUM] = t2 - t1;
    total_double_math_time += double_math_time[CALLNUM] = t3 - t2;
    total_double_dealloc_time += double_dealloc_time[CALLNUM] = t4 - t3;

    return r;
}

void float_memory_allocation()
{
    af=new float[MEM];
    for(int i=0;i<MEM;i++) af[i]=0.0f;
}

float float_math(void)
{
    float t;
    for(int i=1;i<MEM;i++)
    {
        t=(float)i;
        af[i]=(t*t - t)/(t + t);
        af[0]+=af[i];
    }
    return af[0];
}

void float_memory_deallocation()
{
    delete[] af;// was "delete af"; memory from new[] must be released with delete[]
}

float float_benchmark(void)
{
    float r;
    double t1 = al_get_time();
    float_memory_allocation();
    double t2 = al_get_time();
    r=float_math();
    double t3 = al_get_time();
    float_memory_deallocation();
    double t4 = al_get_time();

    total_float_alloc_time += float_alloc_time[CALLNUM] = t2 - t1;
    total_float_math_time += float_math_time[CALLNUM] = t3 - t2;
    total_float_dealloc_time += float_dealloc_time[CALLNUM] = t4 - t3;

    return r;
}

int main (void)
{
    al_init();

    float tmpf = 0.0f;
    double tmpd = 0.0;

    printf("Testing %d calls and %d memory allocations :\n" , CALLS , MEM);

    for(CALLNUM=0;CALLNUM<CALLS;CALLNUM++) tmpf=float_benchmark();
    printf("float result %f\n",tmpf);

    double total_float_time = total_float_alloc_time + total_float_math_time + total_float_dealloc_time;
    printf("Float results (%14.8lf seconds) : Total allocation time %14.8lf seconds , total math time %14.8lf seconds , total dealloc time %14.8lf\n",
           total_float_time , total_float_alloc_time , total_float_math_time , total_float_dealloc_time);
    printf("Float result averages : Allocation average %14.8lf , math average %14.8lf , dealloc average %14.8lf\n",
           total_float_alloc_time/CALLS , total_float_math_time/CALLS , total_float_dealloc_time/CALLS);

    for(CALLNUM=0;CALLNUM<CALLS;CALLNUM++) tmpd=double_benchmark();
    printf("double result %lf\n",tmpd);

    double total_double_time = total_double_alloc_time + total_double_math_time + total_double_dealloc_time;
    printf("Double results (%14.8lf seconds) : Total allocation time %14.8lf seconds , total math time %14.8lf seconds , total dealloc time %14.8lf\n",
           total_double_time , total_double_alloc_time , total_double_math_time , total_double_dealloc_time);
    printf("Double result averages : Allocation average %14.8lf , math average %14.8lf , dealloc average %14.8lf\n",
           total_double_alloc_time/CALLS , total_double_math_time/CALLS , total_double_dealloc_time/CALLS);

    return 0;
}

As expected, -O0 took the longest. -O1, -O2, and -O3 were all comparable. With optimizations enabled, memory allocation took nearly twice as long for doubles as for floats, and deallocation took about twice as long at every optimization level (doubles are twice as big). Deallocation times were constant across optimizations.

Something important to note is that I used volatile for the memory allocation size so it couldn't be optimized away. I used al_get_time for measurements.

Allocation and deallocation can be quite costly, and should be avoided if possible.

The math times are comparable on my laptop with any optimization other than -O0 (Intel i7-5700HQ @ 2.70 GHz).

I'm running Windows 10 64 bit and I wanted to test with -m64 architecture but mingw32 doesn't support it.

Edit

TL;DR, total time per call (allocation + math + deallocation) :

-O0 float  : 43.70 ms per op = 22.88 FPS
-O0 double : 54.04 ms per op = 18.50 FPS
-O1 float  : 25.87 ms per op = 38.65 FPS
-O1 double : 33.06 ms per op = 30.25 FPS
-O2 float  : 24.68 ms per op = 40.52 FPS
-O2 double : 32.02 ms per op = 31.23 FPS
-O3 float  : 24.46 ms per op = 40.88 FPS
-O3 double : 32.33 ms per op = 30.93 FPS

And a table of the results for just the computations :

-O0 float  : 23.59 ms per op = 42.39 FPS
-O0 double : 31.38 ms per op = 31.87 FPS
-O1 float  : 17.44 ms per op = 57.34 FPS
-O1 double : 17.55 ms per op = 56.98 FPS
-O2 float  : 16.70 ms per op = 59.88 FPS
-O2 double : 16.73 ms per op = 59.77 FPS
-O3 float  : 16.58 ms per op = 60.31 FPS
-O3 double : 16.78 ms per op = 59.59 FPS

So you can see that if you wanted to process 6220800 (1920x1080x3) floating point elements per frame on my laptop's CPU, it would just barely keep up with a 60 Hz refresh rate with optimizations enabled. But the difference between single precision floating point math and double precision floating point math is almost negligible.
Chris Katko
Member #1,881
January 2002
|
Aaron Bolyard said:
Since Allegro's transforms are geared towards GPUs, or so I think, single precision is probably best.

OpenGL also supports half-precision floats and integer coordinate systems. I don't see any clear reason why Allegro shouldn't support them.

The Gamecube runs with integer math. Now that OpenGL supports it, the Dolphin emulator was ported to integer math and tons of bugs have gone away. https://dolphin-emu.org/blog/2014/03/15/pixel-processing-problems/

[edit] ALSO, I had no idea there was a difference between 0.0 and 0.0f. There's REALLY such a thing as a float vs double literal, and the compiler will silently convert them if you have the wrong one. ... I think? This is insanity!

Bringing back to another of my threads: Somehow, a std::string implicitly converting to a c_string is terrible, but doubles to floats, and floats to ints are OKAY being implicit?! COME ON C++. COME ON.
Edgar Reynaldo
Major Reynaldo
May 2007
|
See my last edit for FPS results of ops with and without allocations included.
SiegeLord
Member #7,827
October 2006
|
ALLEGRO_TRANSFORM indeed has floats inside it, and since its internals are public, we're kind of stuck with it that way. It is that way primarily because that's what is supported across platforms (the culprit in this case is Direct3D). "For in much wisdom is much grief: and he that increases knowledge increases sorrow."-Ecclesiastes 1:18 |
Erin Maus
Member #7,537
July 2006
|
Chris Katko said:
OpenGL also supports half-precision floats and integer coordinate systems. I don't see any clear reason why Allegro shouldn't support them.

If I remember correctly, half precision is only useful on mobile platforms. It's a no-op on most desktop GPUs. Similarly, native integer support is slow, like doubles.

But most of all, such features are useless for anyone using Allegro for rendering.

Quote:
The Gamecube runs with integer math.

The classic Xbox had a bizarre programmable GPU unlike otherwise equivalent Nvidia chips before and after. The SNES had a terribly weak CPU, only a minor step up from the NES. The Nintendo 64 was pretty much an SGI workstation. The Wii has a small ARM processor on the same die as the GPU that controls various security and I/O processes.

Consoles used to have strange quirks unlike PCs, and that was nice, but that doesn't have any relevance to modern hardware.
Edgar Reynaldo
Major Reynaldo
May 2007
|
It would be possible to create a function called al_transform_coordinates_d that took double pointers though. That would at least save the allocation of two floats. But I guess if they're on the stack it wouldn't matter, even in a heavy loop. Don't mind me. Just thinking out loud.

My only concern is this part of my code. GeneratePlotData only gets called if the theta_delta or the radial_delta change, as that affects the number of data points in the spiral. But the transform and the modified coordinates change every time the rotation changes, which is quite often in my program.

void Spiral2D::Refresh() {
    if (spiral_needs_refresh) {
        GeneratePlotData();
    }
    if (transform_needs_refresh) {
        /// Refresh modified data from original using transform
        al_identity_transform(&transform);
        al_rotate_transform(&transform , rotation_degrees*(M_PI/180.0));
        al_scale_transform(&transform , scalex , scaley);
        al_translate_transform(&transform , centerx , centery);
        for (unsigned int i = 0 ; i < Size() ; ++i) {
            Pos2D mod = DataOriginal(i);
            /// TODO : This is a hack
            float x = mod.x;
            float y = mod.y;
            al_transform_coordinates(&transform , &x , &y);
            mod.x = x;
            mod.y = y;
            DataModified(i) = mod;
        }
        transform_needs_refresh = false;
    }
}
|