How to generate simd code for math function “exp” using openmp?

I have a simple C function as follows:

    void calculate_exp(float *out, float *in, int size) {
        for (int i = 0; i < size; i++) {
            out[i] = exp(in[i]);
        }
    }

I want to optimize it using OpenMP SIMD. I am new to OpenMP and have tried a few pragmas such as `omp simd` and `omp simd safelen`, but I am unable to get the compiler to generate SIMD code. Can anybody help?
  • This doesn't appear to fall within the scope of OpenMP. You would explicitly call a library vector exponentiation function, or use a compiler such as icc that implements a short-vector math library. You would also want to avoid mixed data types, e.g. by substituting expf() for exp(), unless you require the promotion to double. – tim18, Nov 13 '18 at 13:53











  • I want the code to run independent of the compiler (at least gcc and clang) and of the architecture (e.g. ARM NEON or Intel SSE/AVX). – mandar s, Nov 14 '18 at 5:10






  • Example exp_vect_d is actually standard OpenMP/C code, nothing compiler- or platform-specific. The answer shows that some compilers will generate better code if your arrays happen to be aligned on 32-byte boundaries and N is a multiple of 8, but you can forget about that if you want compiler/platform-independent code. Nevertheless, not all compilers have the same #pragma omp simd capabilities; what works with one compiler does not necessarily work with another. – wim, Nov 14 '18 at 14:25











  • You did not specify a compiler. GCC and ICC can both vectorize math functions; Clang can do it with -fveclib. – Z boson, Nov 15 '18 at 8:18
Tags: openmp, simd
asked Nov 13 '18 at 11:58 by mandar s
1 Answer
You can use one of the following four alternatives to vectorize the exp function.
Note that I have used expf (the float version) instead of exp, which operates on double.
This Godbolt link shows that these functions are vectorized: search for call _ZGVdN8v___expf_finite in the compiler-generated code.

    #include <math.h>

    int exp_vect_a(float* x, float* y, int N) {
        /* Inform the compiler that N is a multiple of 8; this leads to shorter code */
        N = N & 0xFFFFFFF8;
        x = (float*)__builtin_assume_aligned(x, 32); /* gcc 8.2 doesn't need aligned x and y to generate `nice` code */
        y = (float*)__builtin_assume_aligned(y, 32); /* with gcc 7.3 it improves the generated code */
        #pragma omp simd
        for (int i = 0; i < N; i++) y[i] = expf(x[i]);
        return 0;
    }

    int exp_vect_b(float* restrict x, float* restrict y, int N) {
        N = N & 0xFFFFFFF8;
        x = (float*)__builtin_assume_aligned(x, 32); /* gcc 8.2 doesn't need aligned x and y to generate `nice` code */
        y = (float*)__builtin_assume_aligned(y, 32); /* with gcc 7.3 it improves the generated code */
        for (int i = 0; i < N; i++) y[i] = expf(x[i]);
        return 0;
    }

    /* This also vectorizes, but it doesn't lead to `nice` code */
    int exp_vect_c(float* restrict x, float* restrict y, int N) {
        for (int i = 0; i < N; i++) y[i] = expf(x[i]);
        return 0;
    }

    /* This also vectorizes, but it doesn't lead to `nice` code */
    int exp_vect_d(float* x, float* y, int N) {
        #pragma omp simd
        for (int i = 0; i < N; i++) y[i] = expf(x[i]);
        return 0;
    }

Note that Peter Cordes' comment is very relevant here:
the function _ZGVdN8v___expf_finite may give slightly different results than expf,
because its focus is on speed rather than on special cases such as inputs that are
infinite, subnormal, or NaN.
Moreover, its accuracy is a 4-ulp maximum relative error,
which is probably slightly less accurate than the standard expf function.
Therefore you need optimization level -Ofast (which allows less accurate code)
instead of -O3 to get the code vectorized with gcc.



See this libmvec page for further details.

The following test code compiles and runs successfully with gcc 7.3:

    #include <math.h>
    #include <stdio.h>
    /* gcc expv.c -m64 -Ofast -std=c99 -march=skylake -fopenmp -lm */

    int exp_vect_d(float* x, float* y, int N) {
        #pragma omp simd
        for (int i = 0; i < N; i++) y[i] = expf(x[i]);
        return 0;
    }

    int main(void) {
        float x[32];
        float y[32];
        int i;
        int N = 32;

        for (i = 0; i < N; i++) x[i] = i / 100.0f;
        x[10] = -89.0f;           /* exp(-89.0f) = 2.227e-39, a subnormal number */
        x[11] = -1000.0f;         /* output: 0.0                                 */
        x[12] = 1000.0f;          /* output: Inf                                 */
        x[13] = 0.0f/0.0f;        /* input: NaN, not a number                    */
        x[14] = 1e20f*1e20f;      /* input: Infinity                             */
        x[15] = -1e20f*1e20f;     /* input: -Infinity                            */
        x[16] = 2.3025850929940f; /* exp(2.3025850929940f) = 10.0...             */
        exp_vect_d(x, y, N);
        for (i = 0; i < N; i++) printf("x=%11.8e, y=%11.8e\n", x[i], y[i]);
        return 0;
    }
  • Important to point out that you had to use -Ofast (-O3 -ffast-math) to enable auto-vectorization of expf, and that's why it's directly calling _ZGVdN8v___expf_finite, which only works for finite non-NaN inputs. With just -O3, you get vmovss scalar loads/stores. – Peter Cordes, Nov 13 '18 at 16:29

  • @PeterCordes: Unfortunately, the accuracy of the standard expf is not in this table. Indeed the documentation suggests that the vectorized version is worse than the scalar version. I think 0.5 ulp would be too expensive for the standard exp function (even a correctly rounded double-precision exp is not exactly 0.5 ulp). I don't know the exact details of glibc's math functions. – wim, Nov 14 '18 at 0:21

  • Ok, better-than-1-ulp was kind of a tangent. I was thinking that glibc's scalar math functions actually were 0.5 ulp at a large speed cost, but I think you're right that they're not that good. Still, the question is whether scalar expf is less accurate than scalar _expf_finite (non-vectorized -ffast-math) and/or vector _ZGVdN8v___expf_finite. I thought expf and _expf_finite gave the same results for finite values (and that scalar _expf_finite was actually used internally by expf), but I'm not sure and haven't actually checked. – Peter Cordes, Nov 14 '18 at 14:07

  • Yes, the question about the accuracy of expf vs. expf_finite vs. _ZGVdN8v___expf_finite is quite interesting. Maybe I'll have time to figure this out later on. – wim, Nov 14 '18 at 15:33

  • godbolt.org/z/JUCVfW – Z boson, Nov 15 '18 at 8:23
answered Nov 13 '18 at 15:47 by wim; edited Nov 13 '18 at 23:36
  • 2
    Important to point out that you had to use -Ofast (-O3 -ffast-math) to enable auto-vectorization of expf, and that's why it's directly calling _ZGVdN8v___expf_finite, which only works for finite non-NaN inputs. With just -O3, you get vmovss scalar loads/stores.
    – Peter Cordes
    Nov 13 '18 at 16:29

  • 1
    @PeterCordes: Unfortunately, the accuracy of the standard expf is not in this table. Indeed, the documentation suggests that the vectorized version is worse than the scalar version. I think 0.5 ulp would be too expensive for the standard exp function (even a correctly rounded double-precision exp is not exactly 0.5 ulp). I don't know the exact details of glibc's math functions.
    – wim
    Nov 14 '18 at 0:21

  • 1
    Ok, better-than-1-ulp was kind of a tangent. I was thinking that glibc scalar math functions actually were 0.5 ulp at a large speed cost, but I think you're right that they're not that good. Still, the question is whether scalar expf is less accurate than scalar _expf_finite (non-vectorized -ffast-math), and/or vector _ZGVdN8v___expf_finite. I thought expf and _expf_finite gave the same results for finite values (and that scalar _expf_finite was actually used internally by expf), but I'm not sure and haven't actually checked.
    – Peter Cordes
    Nov 14 '18 at 14:07

  • 1
    Yes, the question about the accuracy of expf vs. expf_finite vs. _ZGVdN8v___expf_finite is quite interesting. Maybe I'll have time to figure this out later on.
    – wim
    Nov 14 '18 at 15:33

  • 1
    godbolt.org/z/JUCVfW
    – Z boson
    Nov 15 '18 at 8:23













