How to generate SIMD code for the math function “exp” using OpenMP?
I have a simple C function as follows:
#include <math.h>

void calculate_exp(float *out, float *in, int size) {
    for (int i = 0; i < size; i++) {
        out[i] = exp(in[i]);
    }
}
I want to optimize it using OpenMP SIMD. I am new to OpenMP and have tried a few pragmas such as 'omp simd' and 'omp simd safelen', but I am unable to get the compiler to generate SIMD code. Can anybody help?
openmp simd
asked Nov 13 '18 at 11:58 by mandar s
This doesn't appear to fall within the scope of OpenMP. You would explicitly call a library vector exponentiation function, or use a compiler such as icc which implements a short vector math library. You would also want to avoid the mixed data types, e.g. by substituting expf() for exp(), unless you require the promotion to double.
– tim18
Nov 13 '18 at 13:53
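For illustration, a minimal sketch of that substitution applied to the function from the question (an editorial sketch, not part of the original thread; it assumes OpenMP SIMD pragmas are enabled at compile time, e.g. with -fopenmp or -fopenmp-simd):

#include <math.h>

/* Sketch of tim18's suggestion: call the float version expf() so the loop
   stays in single precision instead of promoting each element to double
   and truncating the result back to float. */
void calculate_exp(float *out, float *in, int size) {
    #pragma omp simd
    for (int i = 0; i < size; i++) {
        out[i] = expf(in[i]);
    }
}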
I want the code to run independent of the compiler (at least gcc and clang) and independent of the architecture (like ARM NEON or Intel SSE/AVX).
– mandar s
Nov 14 '18 at 5:10
Example exp_vect_d is actually standard OpenMP/C code, nothing compiler-specific or platform-specific. The answer shows that some compilers will generate better code if your arrays happen to be aligned at 32-byte boundaries and if N is a multiple of 8, but you can forget about that if you want compiler/platform-independent code. Nevertheless, not all compilers have the same #pragma omp simd capabilities. What works with one compiler does not necessarily work with another.
– wim
Nov 14 '18 at 14:25
You did not specify a compiler. GCC and ICC can both vectorize math functions. Clang can do it with -fveclib.
– Z boson
Nov 15 '18 at 8:18
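As a rough sketch of those two routes (the exact flags are assumptions and depend on the compiler version; -fveclib=SVML additionally requires Intel's SVML library at link time, and the file name is a placeholder):

/* Hypothetical build commands, not taken from this thread:

   Clang, selecting a vector math library explicitly:
       clang -O2 -ffast-math -fopenmp-simd -fveclib=SVML -c calculate_exp.c

   GCC with glibc's libmvec, as used in the answer below:
       gcc -Ofast -fopenmp -march=skylake -c calculate_exp.c
*/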
1 Answer
You can use one of the following four alternatives to vectorize the exp function. Note that I have used expf (float) instead of exp, which is a double function.
This Godbolt link shows that these functions are vectorized: search for call _ZGVdN8v___expf_finite in the compiler-generated code.
#include <math.h>

int exp_vect_a(float* x, float* y, int N) {
    /* Inform the compiler that N is a multiple of 8; this leads to shorter code */
    N = N & 0xFFFFFFF8;
    x = (float*)__builtin_assume_aligned(x, 32);  /* gcc 8.2 doesn't need aligned x and y to generate `nice` code */
    y = (float*)__builtin_assume_aligned(y, 32);  /* with gcc 7.3 it improves the generated code                  */
    #pragma omp simd
    for (int i = 0; i < N; i++) y[i] = expf(x[i]);
    return 0;
}

int exp_vect_b(float* restrict x, float* restrict y, int N) {
    N = N & 0xFFFFFFF8;
    x = (float*)__builtin_assume_aligned(x, 32);  /* gcc 8.2 doesn't need aligned x and y to generate `nice` code */
    y = (float*)__builtin_assume_aligned(y, 32);  /* with gcc 7.3 it improves the generated code                  */
    for (int i = 0; i < N; i++) y[i] = expf(x[i]);
    return 0;
}

/* This also vectorizes, but it doesn't lead to `nice` code */
int exp_vect_c(float* restrict x, float* restrict y, int N) {
    for (int i = 0; i < N; i++) y[i] = expf(x[i]);
    return 0;
}

/* This also vectorizes, but it doesn't lead to `nice` code */
int exp_vect_d(float* x, float* y, int N) {
    #pragma omp simd
    for (int i = 0; i < N; i++) y[i] = expf(x[i]);
    return 0;
}
Note that Peter Cordes' comment is very relevant here: the function _ZGVdN8v___expf_finite might give slightly different results than expf, because its focus is on speed and not on special cases such as inputs which are infinite, subnormal, or NaN. Moreover, its accuracy is a 4-ulp maximum relative error, which is probably slightly less accurate than the standard expf function.
Therefore you need optimization level -Ofast (which allows less accurate code) instead of -O3 to get the code vectorized with gcc. See this libmvec page for further details.
The following test code compiles and runs successfully with gcc 7.3:
#include <math.h>
#include <stdio.h>

/* gcc expv.c -m64 -Ofast -std=c99 -march=skylake -fopenmp -lm */

int exp_vect_d(float* x, float* y, int N) {
    #pragma omp simd
    for (int i = 0; i < N; i++) y[i] = expf(x[i]);
    return 0;
}

int main(){
    float x[32];
    float y[32];
    int i;
    int N = 32;

    for (i = 0; i < N; i++) x[i] = i/100.0f;
    x[10] = -89.0f;            /* expf(-89.0f) = 2.227e-39, a subnormal number */
    x[11] = -1000.0f;          /* output: 0.0                                  */
    x[12] = 1000.0f;           /* output: Inf.                                 */
    x[13] = 0.0f/0.0f;         /* input: NaN: Not a Number                     */
    x[14] = 1e20f*1e20f;       /* input: Infinity                              */
    x[15] = -1e20f*1e20f;      /* input: -Infinity                             */
    x[16] = 2.3025850929940f;  /* expf(2.3025850929940f) = 10.0...             */
    exp_vect_d(x, y, N);
    for (i = 0; i < N; i++) printf("x=%11.8e, y=%11.8e\n", x[i], y[i]);
    return 0;
}
answered Nov 13 '18 at 15:47 by wim (edited Nov 13 '18 at 23:36)
Important to point out that you had to use -Ofast (-O3 -ffast-math) to enable auto-vectorization of expf, and that's why it's directly calling _ZGVdN8v___expf_finite, which only works for finite non-NaN inputs. With just -O3, you get vmovss scalar loads/stores.
– Peter Cordes
Nov 13 '18 at 16:29
@PeterCordes: Unfortunately, the accuracy of the standard expf is not in this table. Indeed the documentation suggests that the vectorized version is worse than the scalar version. I think 0.5 ulp would be too expensive for the standard exp function (even a correctly rounded double-precision exp is not exactly 0.5 ulp). I don't know the exact details on glibc's math functions.
– wim
Nov 14 '18 at 0:21
Ok, better-than-1-ulp was kind of a tangent. I was thinking that glibc scalar math functions actually were 0.5 ulp at a large speed cost, but I think you're right that they're not that good. Still, the question is whether scalar expf is less accurate than scalar _expf_finite (non-vectorized -ffast-math), and/or vector _ZGVdN8v___expf_finite. I thought expf and _expf_finite gave the same results for finite values (and that scalar _expf_finite was actually used internally by expf), but I'm not sure and haven't actually checked.
– Peter Cordes
Nov 14 '18 at 14:07
Yes, the question about the accuracy of expf vs. expf_finite vs. _ZGVdN8v___expf_finite is quite interesting. Maybe I'll have time to figure this out later on.
– wim
Nov 14 '18 at 15:33
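One rough way to probe this is sketched below (an untested editorial sketch, not from the thread): build it twice, once with -O3 and once with -Ofast -fopenmp -march=skylake, and diff the printed bit patterns; even 1-ulp differences between expf and the _finite/vector variants would show up.

#include <math.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define N 32

int main(void) {
    float x[N], y[N];
    for (int i = 0; i < N; i++) x[i] = 0.37f * (float)i - 4.0f;

    #pragma omp simd     /* intended to take the vector expf path under -Ofast -fopenmp */
    for (int i = 0; i < N; i++) y[i] = expf(x[i]);

    for (int i = 0; i < N; i++) {
        uint32_t bits;
        memcpy(&bits, &y[i], sizeof bits);  /* reinterpret the float result as its raw bit pattern */
        printf("x=% .6e  expf(x)=% .6e  bits=0x%08x\n",
               (double)x[i], (double)y[i], (unsigned)bits);
    }
    return 0;
}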
godbolt.org/z/JUCVfW
– Z boson
Nov 15 '18 at 8:23