How to generate simd code for math function “exp” using openmp?

I have a simple C function as follows:

    void calculate_exp(float *out, float *in, int size) {
        for (int i = 0; i < size; i++) {
            out[i] = exp(in[i]);
        }
    }

I want to optimize it using OpenMP SIMD. I am new to OpenMP and have tried a few pragmas such as `omp simd` and `omp simd safelen`, but I am unable to get the compiler to generate SIMD code. Can anybody help?
  • This doesn't appear to fall within the scope of OpenMP. You would explicitly call a library vector exponentiation function, or use a compiler such as icc that implements a short-vector math library. You would also want to avoid mixed data types, e.g. by substituting expf() for exp(), unless you require the promotion to double. – tim18, Nov 13 '18 at 13:53











  • I want the code to run independent of the compiler (at least gcc and clang) and of the architecture (e.g. ARM NEON or Intel SSE/AVX). – mandar s, Nov 14 '18 at 5:10






  • Example exp_vect_d is actually standard OpenMP/C code, nothing compiler- or platform-specific. The answer shows that some compilers will generate better code if your arrays happen to be aligned on 32-byte boundaries and N is a multiple of 8, but you can forget about that if you want compiler/platform-independent code. Nevertheless, not all compilers have the same #pragma omp simd capabilities; what works with one compiler does not necessarily work with another. – wim, Nov 14 '18 at 14:25











  • You did not specify a compiler. GCC and ICC can both vectorize math functions; Clang can do it with -fveclib. – Z boson, Nov 15 '18 at 8:18
Tags: openmp, simd
asked Nov 13 '18 at 11:58 by mandar s
1 Answer
You can use one of the following four alternatives to vectorize the exp function.
Note that I have used expf (the float version) instead of exp, which operates on double.
This Godbolt link shows that these functions are vectorized: search for call _ZGVdN8v___expf_finite in the compiler-generated code.

    #include <math.h>

    int exp_vect_a(float* x, float* y, int N) {
        /* Inform the compiler that N is a multiple of 8; this leads to shorter code */
        N = N & 0xFFFFFFF8;
        x = (float*)__builtin_assume_aligned(x, 32); /* gcc 8.2 doesn't need aligned x and y to generate `nice` code */
        y = (float*)__builtin_assume_aligned(y, 32); /* with gcc 7.3 it improves the generated code */
        #pragma omp simd
        for (int i = 0; i < N; i++) y[i] = expf(x[i]);
        return 0;
    }

    int exp_vect_b(float* restrict x, float* restrict y, int N) {
        N = N & 0xFFFFFFF8;
        x = (float*)__builtin_assume_aligned(x, 32); /* gcc 8.2 doesn't need aligned x and y to generate `nice` code */
        y = (float*)__builtin_assume_aligned(y, 32); /* with gcc 7.3 it improves the generated code */
        for (int i = 0; i < N; i++) y[i] = expf(x[i]);
        return 0;
    }

    /* This also vectorizes, but it doesn't lead to `nice` code */
    int exp_vect_c(float* restrict x, float* restrict y, int N) {
        for (int i = 0; i < N; i++) y[i] = expf(x[i]);
        return 0;
    }

    /* This also vectorizes, but it doesn't lead to `nice` code */
    int exp_vect_d(float* x, float* y, int N) {
        #pragma omp simd
        for (int i = 0; i < N; i++) y[i] = expf(x[i]);
        return 0;
    }

Note that Peter Cordes' comment is very relevant here:
the function _ZGVdN8v___expf_finite may give slightly different results than expf,
because its focus is on speed rather than on special cases such as inputs that are
infinite, subnormal, or NaN.
Moreover, its accuracy is a 4-ulp maximum relative error,
which is probably slightly less accurate than the standard expf function.
Therefore you need optimization level -Ofast (which allows less accurate code)
instead of -O3 to get the code vectorized with gcc.



See this libmvec page for further details.

The following test code compiles and runs successfully with gcc 7.3:

    #include <math.h>
    #include <stdio.h>
    /* gcc expv.c -m64 -Ofast -std=c99 -march=skylake -fopenmp -lm */

    int exp_vect_d(float* x, float* y, int N) {
        #pragma omp simd
        for (int i = 0; i < N; i++) y[i] = expf(x[i]);
        return 0;
    }

    int main(void) {
        float x[32];
        float y[32];
        int i;
        int N = 32;

        for (i = 0; i < N; i++) x[i] = i / 100.0f;
        x[10] = -89.0f;           /* exp(-89.0f) = 2.227e-39, a subnormal number */
        x[11] = -1000.0f;         /* output: 0.0                                 */
        x[12] = 1000.0f;          /* output: Inf                                 */
        x[13] = 0.0f/0.0f;        /* input: NaN, not a number                    */
        x[14] = 1e20f*1e20f;      /* input: Infinity                             */
        x[15] = -1e20f*1e20f;     /* input: -Infinity                            */
        x[16] = 2.3025850929940f; /* exp(2.3025850929940f) = 10.0...             */
        exp_vect_d(x, y, N);
        for (i = 0; i < N; i++) printf("x=%11.8e, y=%11.8e\n", x[i], y[i]);
        return 0;
    }
  • Important to point out that you had to use -Ofast (-O3 -ffast-math) to enable auto-vectorization of expf, and that's why it's directly calling _ZGVdN8v___expf_finite, which only works for finite non-NaN inputs. With just -O3, you get vmovss scalar loads/stores. – Peter Cordes, Nov 13 '18 at 16:29

  • @PeterCordes: Unfortunately, the accuracy of the standard expf is not in this table. Indeed the documentation suggests that the vectorized version is worse than the scalar version. I think 0.5 ulp would be too expensive for the standard exp function (even a correctly rounded double-precision exp is not exactly 0.5 ulp). I don't know the exact details of glibc's math functions. – wim, Nov 14 '18 at 0:21

  • Ok, better-than-1-ulp was kind of a tangent. I was thinking that glibc's scalar math functions actually were 0.5 ulp at a large speed cost, but I think you're right that they're not that good. Still, the question is whether scalar expf is less accurate than scalar _expf_finite (non-vectorized -ffast-math) and/or vector _ZGVdN8v___expf_finite. I thought expf and _expf_finite gave the same results for finite values (and that scalar _expf_finite was actually used internally by expf), but I'm not sure and haven't actually checked. – Peter Cordes, Nov 14 '18 at 14:07

  • Yes, the question about the accuracy of expf vs. expf_finite vs. _ZGVdN8v___expf_finite is quite interesting. Maybe I'll have time to figure this out later on. – wim, Nov 14 '18 at 15:33

  • godbolt.org/z/JUCVfW – Z boson, Nov 15 '18 at 8:23
answered Nov 13 '18 at 15:47 by wim; edited Nov 13 '18 at 23:36
  • 2
    Important to point out that you had to use -Ofast (-O3 -ffast-math) to enable auto-vectorization of expf, and that's why it's directly calling _ZGVdN8v___expf_finite, which only works for finite non-NaN inputs. With just -O3, you get vmovss scalar loads/stores.
    – Peter Cordes
    Nov 13 '18 at 16:29

  • 1
    @PeterCordes: Unfortunately, the accuracy of the standard expf is not in this table. Indeed, the documentation suggests that the vectorized version is worse than the scalar version. I think 0.5 ulp would be too expensive for the standard exp function (even a correctly rounded double-precision exp is not exactly 0.5 ulp). I don't know the exact details of glibc's math functions.
    – wim
    Nov 14 '18 at 0:21

  • 1
    Ok, better-than-1-ulp was kind of a tangent. I was thinking that glibc scalar math functions actually were 0.5 ulp at a large speed cost, but I think you're right that they're not that good. Still, the question is whether scalar expf is less accurate than scalar _expf_finite (non-vectorized -ffast-math), and/or vector _ZGVdN8v___expf_finite. I thought expf and _expf_finite gave the same results for finite values (and that scalar _expf_finite was actually used internally by expf), but I'm not sure and haven't actually checked.
    – Peter Cordes
    Nov 14 '18 at 14:07

  • 1
    Yes, the question about the accuracy of expf vs. expf_finite vs. _ZGVdN8v___expf_finite is quite interesting. Maybe I'll have time to figure this out later on.
    – wim
    Nov 14 '18 at 15:33

  • 1
    godbolt.org/z/JUCVfW
    – Z boson
    Nov 15 '18 at 8:23













