Transposing a 4x4 matrix represented as a ulong value(as fast as possible)

I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.

The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.

Until recently I was using a 4x4 byte matrix:

var field = new byte[4,4];

Each value was an exponent of 2, so 0=0, 1=2, 2=4, 3=8, and so forth. The 2048 tile would be represented by 11.

Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong value.

It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:

    //flip horizontally

    const byte SZ = 4;

    public static byte[,] Invert(this byte[,] squares)

    {

        var tmp = new byte[SZ, SZ];

        for (byte x = 0; x < SZ; x++)

            for (byte y = 0; y < SZ; y++)

                tmp[x, y] = squares[x, SZ - y - 1];

        return tmp;

    }

I can do this inversion to a ulong ~15x faster:

    public static ulong Invert(this ulong state)

    {

        ulong c1 = state & 0xF000F000F000F000L;

        ulong c2 = state & 0x0F000F000F000F00L;

        ulong c3 = state & 0x00F000F000F000F0L;

        ulong c4 = state & 0x000F000F000F000FL;



        return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);

    }

Note the use of hex, which is extremely useful because each character represents a tile.

The operation I've having the most trouble with is Transpose, which flipped the x and y coordinates of values in the 2d array, like this:

    public static byte[,] Transpose(this byte[,] squares)

    {

        var tmp = new byte[SZ, SZ];

        for (byte x = 0; x < SZ; x++)

            for (byte y = 0; y < SZ; y++)

                tmp[y, x] = squares[x, y];

        return tmp;

    }

The fastest way I've found to do this is using this bit of ridiculousness:

    public static ulong Transpose(this ulong state)

    {

        ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals



        result |= (state & 0x0F00000000000000L) >> 12;

        result |= (state & 0x00F0000000000000L) >> 24;

        result |= (state & 0x000F000000000000L) >> 36;

        result |= (state & 0x0000F00000000000L) << 12;

        result |= (state & 0x000000F000000000L) >> 12;

        result |= (state & 0x0000000F00000000L) >> 24;

        result |= (state & 0x00000000F0000000L) << 24;

        result |= (state & 0x000000000F000000L) << 12;

        result |= (state & 0x00000000000F0000L) >> 12;

        result |= (state & 0x000000000000F000L) << 36;

        result |= (state & 0x0000000000000F00L) << 24;

        result |= (state & 0x00000000000000F0L) << 12;



        return result;

    }

Shockingly, this is still nearly 3x faster than the loop version. However, I'm looking for a more performant method either using leveraging a pattern inherent in transposition or more efficient management of the bits I'm moving around.

asked Nov 12 '18 at 20:03

Daniel

701721

1

not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)

– dlatikay
Nov 12 '18 at 20:11

I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | .... By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.

– elgonzo
Nov 12 '18 at 20:17

(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)

– elgonzo
Nov 12 '18 at 20:19

@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).

– Daniel
Nov 12 '18 at 21:24

@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on ulong result with the ulong state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...

– elgonzo
Nov 12 '18 at 21:29

add a comment |

I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.

The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.

Until recently I was using a 4x4 byte matrix:

var field = new byte[4,4];

Each value was an exponent of 2, so 0=0, 1=2, 2=4, 3=8, and so forth. The 2048 tile would be represented by 11.

Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong value.

It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:

    //flip horizontally

    const byte SZ = 4;

    public static byte[,] Invert(this byte[,] squares)

    {

        var tmp = new byte[SZ, SZ];

        for (byte x = 0; x < SZ; x++)

            for (byte y = 0; y < SZ; y++)

                tmp[x, y] = squares[x, SZ - y - 1];

        return tmp;

    }

I can do this inversion to a ulong ~15x faster:

    public static ulong Invert(this ulong state)

    {

        ulong c1 = state & 0xF000F000F000F000L;

        ulong c2 = state & 0x0F000F000F000F00L;

        ulong c3 = state & 0x00F000F000F000F0L;

        ulong c4 = state & 0x000F000F000F000FL;



        return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);

    }

Note the use of hex, which is extremely useful because each character represents a tile.

The operation I've having the most trouble with is Transpose, which flipped the x and y coordinates of values in the 2d array, like this:

    public static byte[,] Transpose(this byte[,] squares)

    {

        var tmp = new byte[SZ, SZ];

        for (byte x = 0; x < SZ; x++)

            for (byte y = 0; y < SZ; y++)

                tmp[y, x] = squares[x, y];

        return tmp;

    }

The fastest way I've found to do this is using this bit of ridiculousness:

    public static ulong Transpose(this ulong state)

    {

        ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals



        result |= (state & 0x0F00000000000000L) >> 12;

        result |= (state & 0x00F0000000000000L) >> 24;

        result |= (state & 0x000F000000000000L) >> 36;

        result |= (state & 0x0000F00000000000L) << 12;

        result |= (state & 0x000000F000000000L) >> 12;

        result |= (state & 0x0000000F00000000L) >> 24;

        result |= (state & 0x00000000F0000000L) << 24;

        result |= (state & 0x000000000F000000L) << 12;

        result |= (state & 0x00000000000F0000L) >> 12;

        result |= (state & 0x000000000000F000L) << 36;

        result |= (state & 0x0000000000000F00L) << 24;

        result |= (state & 0x00000000000000F0L) << 12;



        return result;

    }

asked Nov 12 '18 at 20:03

Daniel

701721

1

not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)

– dlatikay
Nov 12 '18 at 20:11

I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | .... By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.

– elgonzo
Nov 12 '18 at 20:17

(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)

– elgonzo
Nov 12 '18 at 20:19

@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).

– Daniel
Nov 12 '18 at 21:24

@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on ulong result with the ulong state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...

– elgonzo
Nov 12 '18 at 21:29

add a comment |

I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.

The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.

Until recently I was using a 4x4 byte matrix:

var field = new byte[4,4];

Each value was an exponent of 2, so 0=0, 1=2, 2=4, 3=8, and so forth. The 2048 tile would be represented by 11.

Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong value.

It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:

    //flip horizontally

    const byte SZ = 4;

    public static byte[,] Invert(this byte[,] squares)

    {

        var tmp = new byte[SZ, SZ];

        for (byte x = 0; x < SZ; x++)

            for (byte y = 0; y < SZ; y++)

                tmp[x, y] = squares[x, SZ - y - 1];

        return tmp;

    }

I can do this inversion to a ulong ~15x faster:

    public static ulong Invert(this ulong state)

    {

        ulong c1 = state & 0xF000F000F000F000L;

        ulong c2 = state & 0x0F000F000F000F00L;

        ulong c3 = state & 0x00F000F000F000F0L;

        ulong c4 = state & 0x000F000F000F000FL;



        return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);

    }

Note the use of hex, which is extremely useful because each character represents a tile.

The operation I've having the most trouble with is Transpose, which flipped the x and y coordinates of values in the 2d array, like this:

    public static byte[,] Transpose(this byte[,] squares)

    {

        var tmp = new byte[SZ, SZ];

        for (byte x = 0; x < SZ; x++)

            for (byte y = 0; y < SZ; y++)

                tmp[y, x] = squares[x, y];

        return tmp;

    }

The fastest way I've found to do this is using this bit of ridiculousness:

    public static ulong Transpose(this ulong state)

    {

        ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals



        result |= (state & 0x0F00000000000000L) >> 12;

        result |= (state & 0x00F0000000000000L) >> 24;

        result |= (state & 0x000F000000000000L) >> 36;

        result |= (state & 0x0000F00000000000L) << 12;

        result |= (state & 0x000000F000000000L) >> 12;

        result |= (state & 0x0000000F00000000L) >> 24;

        result |= (state & 0x00000000F0000000L) << 24;

        result |= (state & 0x000000000F000000L) << 12;

        result |= (state & 0x00000000000F0000L) >> 12;

        result |= (state & 0x000000000000F000L) << 36;

        result |= (state & 0x0000000000000F00L) << 24;

        result |= (state & 0x00000000000000F0L) << 12;



        return result;

    }

asked Nov 12 '18 at 20:03

Daniel

701721

I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.

The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.

Until recently I was using a 4x4 byte matrix:

var field = new byte[4,4];

Each value was an exponent of 2, so 0=0, 1=2, 2=4, 3=8, and so forth. The 2048 tile would be represented by 11.

Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong value.

It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:

    //flip horizontally

    const byte SZ = 4;

    public static byte[,] Invert(this byte[,] squares)

    {

        var tmp = new byte[SZ, SZ];

        for (byte x = 0; x < SZ; x++)

            for (byte y = 0; y < SZ; y++)

                tmp[x, y] = squares[x, SZ - y - 1];

        return tmp;

    }

I can do this inversion to a ulong ~15x faster:

    public static ulong Invert(this ulong state)

    {

        ulong c1 = state & 0xF000F000F000F000L;

        ulong c2 = state & 0x0F000F000F000F00L;

        ulong c3 = state & 0x00F000F000F000F0L;

        ulong c4 = state & 0x000F000F000F000FL;



        return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);

    }

Note the use of hex, which is extremely useful because each character represents a tile.

The operation I've having the most trouble with is Transpose, which flipped the x and y coordinates of values in the 2d array, like this:

    public static byte[,] Transpose(this byte[,] squares)

    {

        var tmp = new byte[SZ, SZ];

        for (byte x = 0; x < SZ; x++)

            for (byte y = 0; y < SZ; y++)

                tmp[y, x] = squares[x, y];

        return tmp;

    }

The fastest way I've found to do this is using this bit of ridiculousness:

    public static ulong Transpose(this ulong state)

    {

        ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals



        result |= (state & 0x0F00000000000000L) >> 12;

        result |= (state & 0x00F0000000000000L) >> 24;

        result |= (state & 0x000F000000000000L) >> 36;

        result |= (state & 0x0000F00000000000L) << 12;

        result |= (state & 0x000000F000000000L) >> 12;

        result |= (state & 0x0000000F00000000L) >> 24;

        result |= (state & 0x00000000F0000000L) << 24;

        result |= (state & 0x000000000F000000L) << 12;

        result |= (state & 0x00000000000F0000L) >> 12;

        result |= (state & 0x000000000000F000L) << 36;

        result |= (state & 0x0000000000000F00L) << 24;

        result |= (state & 0x00000000000000F0L) << 12;



        return result;

    }

c# bit-manipulation 64bit

asked Nov 12 '18 at 20:03

Daniel

701721

asked Nov 12 '18 at 20:03

Daniel

701721

asked Nov 12 '18 at 20:03

Daniel

701721

asked Nov 12 '18 at 20:03

Daniel

701721

asked Nov 12 '18 at 20:03

Daniel

701721

1

not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)

– dlatikay
Nov 12 '18 at 20:11

I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | .... By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.

– elgonzo
Nov 12 '18 at 20:17

(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)

– elgonzo
Nov 12 '18 at 20:19

@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).

– Daniel
Nov 12 '18 at 21:24

@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on ulong result with the ulong state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...

– elgonzo
Nov 12 '18 at 21:29

add a comment |

1

not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)

– dlatikay
Nov 12 '18 at 20:11

I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | .... By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.

– elgonzo
Nov 12 '18 at 20:17

(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)

– elgonzo
Nov 12 '18 at 20:19

@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).

– Daniel
Nov 12 '18 at 21:24

@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on ulong result with the ulong state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...

– elgonzo
Nov 12 '18 at 21:29

not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)

– dlatikay
Nov 12 '18 at 20:11

I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | .... By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.

– elgonzo
Nov 12 '18 at 20:17

(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)

– elgonzo
Nov 12 '18 at 20:19

@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).

– Daniel
Nov 12 '18 at 21:24

@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on ulong result with the ulong state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...

– elgonzo
Nov 12 '18 at 21:29

add a comment |

2 Answers
2

active

oldest

votes

you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:

public static ulong Transpose(this ulong state)

        {

            ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals



            result |= (state & 0x0F0000F0000F0000L) >> 12;

            result |= (state & 0x00F0000F00000000L) >> 24;

            result |= (state & 0x000F000000000000L) >> 36;

            result |= (state & 0x0000F0000F0000F0L) << 12;

            //result |= (state & 0x000000F000000000L) >> 12;

            //result |= (state & 0x0000000F00000000L) >> 24;

            result |= (state & 0x00000000F0000F00L) << 24;

            //result |= (state & 0x000000000F000000L) << 12;

            //result |= (state & 0x00000000000F0000L) >> 12;

            result |= (state & 0x000000000000F000L) << 36;

            //result |= (state & 0x0000000000000F00L) << 24;

            //result |= (state & 0x00000000000000F0L) << 12;



            return result;

        }

edited Nov 12 '18 at 22:09

answered Nov 12 '18 at 21:45

Aldert

1,2331212

How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

– Daniel
Nov 12 '18 at 22:23

add a comment |

An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".

For example the moves left by 12 and 24 could be done as:

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));

r0 |= t & 0x0FF000FF000F0000UL;

That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:

public static ulong Transpose(this ulong state)

{

    ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals



    ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));

    ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;

    r0 |= (state & 0x00F0000F00000000L) >> 24;

    r1 |= (state & 0x000F000000000000L) >> 36;

    r0 |= (state & 0x000000000000F000L) << 36;

    r1 |= t & 0x0FF000FF000F0000UL;



    return r0 | r1;

}

answered Nov 12 '18 at 23:24

harold

41.4k357109

I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

– Daniel
Nov 13 '18 at 15:43

@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

– harold
Nov 13 '18 at 15:51

I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

– Daniel
Nov 13 '18 at 21:04

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53269297%2ftransposing-a-4x4-matrix-represented-as-a-ulong-valueas-fast-as-possible%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:

public static ulong Transpose(this ulong state)

        {

            ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals



            result |= (state & 0x0F0000F0000F0000L) >> 12;

            result |= (state & 0x00F0000F00000000L) >> 24;

            result |= (state & 0x000F000000000000L) >> 36;

            result |= (state & 0x0000F0000F0000F0L) << 12;

            //result |= (state & 0x000000F000000000L) >> 12;

            //result |= (state & 0x0000000F00000000L) >> 24;

            result |= (state & 0x00000000F0000F00L) << 24;

            //result |= (state & 0x000000000F000000L) << 12;

            //result |= (state & 0x00000000000F0000L) >> 12;

            result |= (state & 0x000000000000F000L) << 36;

            //result |= (state & 0x0000000000000F00L) << 24;

            //result |= (state & 0x00000000000000F0L) << 12;



            return result;

        }

edited Nov 12 '18 at 22:09

answered Nov 12 '18 at 21:45

Aldert

1,2331212

How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

– Daniel
Nov 12 '18 at 22:23

add a comment |

you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:

public static ulong Transpose(this ulong state)

        {

            ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals



            result |= (state & 0x0F0000F0000F0000L) >> 12;

            result |= (state & 0x00F0000F00000000L) >> 24;

            result |= (state & 0x000F000000000000L) >> 36;

            result |= (state & 0x0000F0000F0000F0L) << 12;

            //result |= (state & 0x000000F000000000L) >> 12;

            //result |= (state & 0x0000000F00000000L) >> 24;

            result |= (state & 0x00000000F0000F00L) << 24;

            //result |= (state & 0x000000000F000000L) << 12;

            //result |= (state & 0x00000000000F0000L) >> 12;

            result |= (state & 0x000000000000F000L) << 36;

            //result |= (state & 0x0000000000000F00L) << 24;

            //result |= (state & 0x00000000000000F0L) << 12;



            return result;

        }

edited Nov 12 '18 at 22:09

answered Nov 12 '18 at 21:45

Aldert

1,2331212

How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

– Daniel
Nov 12 '18 at 22:23

add a comment |

you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:

public static ulong Transpose(this ulong state)

        {

            ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals



            result |= (state & 0x0F0000F0000F0000L) >> 12;

            result |= (state & 0x00F0000F00000000L) >> 24;

            result |= (state & 0x000F000000000000L) >> 36;

            result |= (state & 0x0000F0000F0000F0L) << 12;

            //result |= (state & 0x000000F000000000L) >> 12;

            //result |= (state & 0x0000000F00000000L) >> 24;

            result |= (state & 0x00000000F0000F00L) << 24;

            //result |= (state & 0x000000000F000000L) << 12;

            //result |= (state & 0x00000000000F0000L) >> 12;

            result |= (state & 0x000000000000F000L) << 36;

            //result |= (state & 0x0000000000000F00L) << 24;

            //result |= (state & 0x00000000000000F0L) << 12;



            return result;

        }

edited Nov 12 '18 at 22:09

answered Nov 12 '18 at 21:45

Aldert

1,2331212

you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:

public static ulong Transpose(this ulong state)

        {

            ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals



            result |= (state & 0x0F0000F0000F0000L) >> 12;

            result |= (state & 0x00F0000F00000000L) >> 24;

            result |= (state & 0x000F000000000000L) >> 36;

            result |= (state & 0x0000F0000F0000F0L) << 12;

            //result |= (state & 0x000000F000000000L) >> 12;

            //result |= (state & 0x0000000F00000000L) >> 24;

            result |= (state & 0x00000000F0000F00L) << 24;

            //result |= (state & 0x000000000F000000L) << 12;

            //result |= (state & 0x00000000000F0000L) >> 12;

            result |= (state & 0x000000000000F000L) << 36;

            //result |= (state & 0x0000000000000F00L) << 24;

            //result |= (state & 0x00000000000000F0L) << 12;



            return result;

        }

edited Nov 12 '18 at 22:09

answered Nov 12 '18 at 21:45

Aldert

1,2331212

edited Nov 12 '18 at 22:09

answered Nov 12 '18 at 21:45

Aldert

1,2331212

answered Nov 12 '18 at 21:45

Aldert

1,2331212

answered Nov 12 '18 at 21:45

Aldert

1,2331212

How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

– Daniel
Nov 12 '18 at 22:23

add a comment |

How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

– Daniel
Nov 12 '18 at 22:23

How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

– Daniel
Nov 12 '18 at 22:23

add a comment |

An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".

For example the moves left by 12 and 24 could be done as:

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));

r0 |= t & 0x0FF000FF000F0000UL;

public static ulong Transpose(this ulong state)

{

    ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals



    ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));

    ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;

    r0 |= (state & 0x00F0000F00000000L) >> 24;

    r1 |= (state & 0x000F000000000000L) >> 36;

    r0 |= (state & 0x000000000000F000L) << 36;

    r1 |= t & 0x0FF000FF000F0000UL;



    return r0 | r1;

}

answered Nov 12 '18 at 23:24

harold

41.4k357109

I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

– Daniel
Nov 13 '18 at 15:43

@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

– harold
Nov 13 '18 at 15:51

I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

– Daniel
Nov 13 '18 at 21:04

add a comment |

An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".

For example the moves left by 12 and 24 could be done as:

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));

r0 |= t & 0x0FF000FF000F0000UL;

public static ulong Transpose(this ulong state)

{

    ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals



    ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));

    ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;

    r0 |= (state & 0x00F0000F00000000L) >> 24;

    r1 |= (state & 0x000F000000000000L) >> 36;

    r0 |= (state & 0x000000000000F000L) << 36;

    r1 |= t & 0x0FF000FF000F0000UL;



    return r0 | r1;

}

answered Nov 12 '18 at 23:24

harold

41.4k357109

I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

– Daniel
Nov 13 '18 at 15:43

@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

– harold
Nov 13 '18 at 15:51

I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

– Daniel
Nov 13 '18 at 21:04

add a comment |

An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".

For example the moves left by 12 and 24 could be done as:

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));

r0 |= t & 0x0FF000FF000F0000UL;

public static ulong Transpose(this ulong state)

{

    ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals



    ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));

    ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;

    r0 |= (state & 0x00F0000F00000000L) >> 24;

    r1 |= (state & 0x000F000000000000L) >> 36;

    r0 |= (state & 0x000000000000F000L) << 36;

    r1 |= t & 0x0FF000FF000F0000UL;



    return r0 | r1;

}

answered Nov 12 '18 at 23:24

harold

41.4k357109

An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".

For example the moves left by 12 and 24 could be done as:

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));

r0 |= t & 0x0FF000FF000F0000UL;

public static ulong Transpose(this ulong state)

{

    ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals



    ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));

    ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;

    r0 |= (state & 0x00F0000F00000000L) >> 24;

    r1 |= (state & 0x000F000000000000L) >> 36;

    r0 |= (state & 0x000000000000F000L) << 36;

    r1 |= t & 0x0FF000FF000F0000UL;



    return r0 | r1;

}

answered Nov 12 '18 at 23:24

harold

41.4k357109

answered Nov 12 '18 at 23:24

harold

41.4k357109

answered Nov 12 '18 at 23:24

harold

41.4k357109

answered Nov 12 '18 at 23:24

harold

41.4k357109

I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

– Daniel
Nov 13 '18 at 15:43

@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

– harold
Nov 13 '18 at 15:51

I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

– Daniel
Nov 13 '18 at 21:04

add a comment |

I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

– Daniel
Nov 13 '18 at 15:43

@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

– harold
Nov 13 '18 at 15:51

I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

– Daniel
Nov 13 '18 at 21:04

I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

– Daniel
Nov 13 '18 at 15:43

@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

– harold
Nov 13 '18 at 15:51

I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

– Daniel
Nov 13 '18 at 21:04

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

pdZMTbtrcn8L,Oq3,mhG,IeSpvGHF,u

搜尋此網誌

Nrthugu