Transposing a 4x4 matrix represented as a ulong value(as fast as possible)












3















I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.



The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.



Until recently I was using a 4x4 byte matrix:



var field = new byte[4,4];


Each value was an exponent of 2, so 0=0, 1=2, 2=4, 3=8, and so forth. The 2048 tile would be represented by 11.



Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong value.



It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:



    //flip horizontally
const byte SZ = 4;
public static byte[,] Invert(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[x, y] = squares[x, SZ - y - 1];
return tmp;
}


I can do this inversion to a ulong ~15x faster:



    public static ulong Invert(this ulong state)
{
ulong c1 = state & 0xF000F000F000F000L;
ulong c2 = state & 0x0F000F000F000F00L;
ulong c3 = state & 0x00F000F000F000F0L;
ulong c4 = state & 0x000F000F000F000FL;

return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);
}


Note the use of hex, which is extremely useful because each character represents a tile.



The operation I've having the most trouble with is Transpose, which flipped the x and y coordinates of values in the 2d array, like this:



    public static byte[,] Transpose(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[y, x] = squares[x, y];
return tmp;
}


The fastest way I've found to do this is using this bit of ridiculousness:



    public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

result |= (state & 0x0F00000000000000L) >> 12;
result |= (state & 0x00F0000000000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F00000000000L) << 12;
result |= (state & 0x000000F000000000L) >> 12;
result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000000L) << 24;
result |= (state & 0x000000000F000000L) << 12;
result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
result |= (state & 0x0000000000000F00L) << 24;
result |= (state & 0x00000000000000F0L) << 12;

return result;
}


Shockingly, this is still nearly 3x faster than the loop version. However, I'm looking for a more performant method either using leveraging a pattern inherent in transposition or more efficient management of the bits I'm moving around.










share|improve this question


















  • 1





    not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)

    – dlatikay
    Nov 12 '18 at 20:11











  • I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | .... By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.

    – elgonzo
    Nov 12 '18 at 20:17













  • (Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)

    – elgonzo
    Nov 12 '18 at 20:19













  • @elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).

    – Daniel
    Nov 12 '18 at 21:24











  • @Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on ulong result with the ulong state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...

    – elgonzo
    Nov 12 '18 at 21:29


















3















I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.



The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.



Until recently I was using a 4x4 byte matrix:



var field = new byte[4,4];


Each value was an exponent of 2, so 0=0, 1=2, 2=4, 3=8, and so forth. The 2048 tile would be represented by 11.



Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong value.



It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:



    //flip horizontally
const byte SZ = 4;
public static byte[,] Invert(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[x, y] = squares[x, SZ - y - 1];
return tmp;
}


I can do this inversion to a ulong ~15x faster:



    public static ulong Invert(this ulong state)
{
ulong c1 = state & 0xF000F000F000F000L;
ulong c2 = state & 0x0F000F000F000F00L;
ulong c3 = state & 0x00F000F000F000F0L;
ulong c4 = state & 0x000F000F000F000FL;

return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);
}


Note the use of hex, which is extremely useful because each character represents a tile.



The operation I've having the most trouble with is Transpose, which flipped the x and y coordinates of values in the 2d array, like this:



    public static byte[,] Transpose(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[y, x] = squares[x, y];
return tmp;
}


The fastest way I've found to do this is using this bit of ridiculousness:



    public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

result |= (state & 0x0F00000000000000L) >> 12;
result |= (state & 0x00F0000000000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F00000000000L) << 12;
result |= (state & 0x000000F000000000L) >> 12;
result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000000L) << 24;
result |= (state & 0x000000000F000000L) << 12;
result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
result |= (state & 0x0000000000000F00L) << 24;
result |= (state & 0x00000000000000F0L) << 12;

return result;
}


Shockingly, this is still nearly 3x faster than the loop version. However, I'm looking for a more performant method either using leveraging a pattern inherent in transposition or more efficient management of the bits I'm moving around.










share|improve this question


















  • 1





    not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)

    – dlatikay
    Nov 12 '18 at 20:11











  • I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | .... By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.

    – elgonzo
    Nov 12 '18 at 20:17













  • (Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)

    – elgonzo
    Nov 12 '18 at 20:19













  • @elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).

    – Daniel
    Nov 12 '18 at 21:24











  • @Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on ulong result with the ulong state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...

    – elgonzo
    Nov 12 '18 at 21:29
















3












3








3


1






I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.



The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.



Until recently I was using a 4x4 byte matrix:



var field = new byte[4,4];


Each value was an exponent of 2, so 0=0, 1=2, 2=4, 3=8, and so forth. The 2048 tile would be represented by 11.



Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong value.



It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:



    //flip horizontally
const byte SZ = 4;
public static byte[,] Invert(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[x, y] = squares[x, SZ - y - 1];
return tmp;
}


I can do this inversion to a ulong ~15x faster:



    public static ulong Invert(this ulong state)
{
ulong c1 = state & 0xF000F000F000F000L;
ulong c2 = state & 0x0F000F000F000F00L;
ulong c3 = state & 0x00F000F000F000F0L;
ulong c4 = state & 0x000F000F000F000FL;

return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);
}


Note the use of hex, which is extremely useful because each character represents a tile.



The operation I've having the most trouble with is Transpose, which flipped the x and y coordinates of values in the 2d array, like this:



    public static byte[,] Transpose(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[y, x] = squares[x, y];
return tmp;
}


The fastest way I've found to do this is using this bit of ridiculousness:



    public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

result |= (state & 0x0F00000000000000L) >> 12;
result |= (state & 0x00F0000000000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F00000000000L) << 12;
result |= (state & 0x000000F000000000L) >> 12;
result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000000L) << 24;
result |= (state & 0x000000000F000000L) << 12;
result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
result |= (state & 0x0000000000000F00L) << 24;
result |= (state & 0x00000000000000F0L) << 12;

return result;
}


Shockingly, this is still nearly 3x faster than the loop version. However, I'm looking for a more performant method either using leveraging a pattern inherent in transposition or more efficient management of the bits I'm moving around.










share|improve this question














I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.



The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.



Until recently I was using a 4x4 byte matrix:



var field = new byte[4,4];


Each value was an exponent of 2, so 0=0, 1=2, 2=4, 3=8, and so forth. The 2048 tile would be represented by 11.



Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong value.



It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:



    //flip horizontally
const byte SZ = 4;
public static byte[,] Invert(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[x, y] = squares[x, SZ - y - 1];
return tmp;
}


I can do this inversion to a ulong ~15x faster:



    public static ulong Invert(this ulong state)
{
ulong c1 = state & 0xF000F000F000F000L;
ulong c2 = state & 0x0F000F000F000F00L;
ulong c3 = state & 0x00F000F000F000F0L;
ulong c4 = state & 0x000F000F000F000FL;

return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);
}


Note the use of hex, which is extremely useful because each character represents a tile.



The operation I've having the most trouble with is Transpose, which flipped the x and y coordinates of values in the 2d array, like this:



    public static byte[,] Transpose(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[y, x] = squares[x, y];
return tmp;
}


The fastest way I've found to do this is using this bit of ridiculousness:



    public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

result |= (state & 0x0F00000000000000L) >> 12;
result |= (state & 0x00F0000000000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F00000000000L) << 12;
result |= (state & 0x000000F000000000L) >> 12;
result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000000L) << 24;
result |= (state & 0x000000000F000000L) << 12;
result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
result |= (state & 0x0000000000000F00L) << 24;
result |= (state & 0x00000000000000F0L) << 12;

return result;
}


Shockingly, this is still nearly 3x faster than the loop version. However, I'm looking for a more performant method either using leveraging a pattern inherent in transposition or more efficient management of the bits I'm moving around.







c# bit-manipulation 64bit






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 12 '18 at 20:03









DanielDaniel

701721




701721








  • 1





    not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)

    – dlatikay
    Nov 12 '18 at 20:11











  • I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | .... By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.

    – elgonzo
    Nov 12 '18 at 20:17













  • (Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)

    – elgonzo
    Nov 12 '18 at 20:19













  • @elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).

    – Daniel
    Nov 12 '18 at 21:24











  • @Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on ulong result with the ulong state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...

    – elgonzo
    Nov 12 '18 at 21:29
















  • 1





    not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)

    – dlatikay
    Nov 12 '18 at 20:11











  • I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | .... By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.

    – elgonzo
    Nov 12 '18 at 20:17













  • (Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)

    – elgonzo
    Nov 12 '18 at 20:19













  • @elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).

    – Daniel
    Nov 12 '18 at 21:24











  • @Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on ulong result with the ulong state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...

    – elgonzo
    Nov 12 '18 at 21:29










1




1





not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)

– dlatikay
Nov 12 '18 at 20:11





not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)

– dlatikay
Nov 12 '18 at 20:11













I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | .... By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.

– elgonzo
Nov 12 '18 at 20:17







I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | .... By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.

– elgonzo
Nov 12 '18 at 20:17















(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)

– elgonzo
Nov 12 '18 at 20:19







(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)

– elgonzo
Nov 12 '18 at 20:19















@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).

– Daniel
Nov 12 '18 at 21:24





@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).

– Daniel
Nov 12 '18 at 21:24













@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on ulong result with the ulong state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...

– elgonzo
Nov 12 '18 at 21:29







@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on ulong result with the ulong state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...

– elgonzo
Nov 12 '18 at 21:29














2 Answers
2






active

oldest

votes


















2














you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:



public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

result |= (state & 0x0F0000F0000F0000L) >> 12;
result |= (state & 0x00F0000F00000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F0000F0000F0L) << 12;
//result |= (state & 0x000000F000000000L) >> 12;
//result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000F00L) << 24;
//result |= (state & 0x000000000F000000L) << 12;
//result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
//result |= (state & 0x0000000000000F00L) << 24;
//result |= (state & 0x00000000000000F0L) << 12;

return result;
}





share|improve this answer


























  • How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

    – Daniel
    Nov 12 '18 at 22:23



















1














An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".



For example the moves left by 12 and 24 could be done as:



ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;


That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:



public static ulong Transpose(this ulong state)
{
ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
r0 |= (state & 0x00F0000F00000000L) >> 24;
r1 |= (state & 0x000F000000000000L) >> 36;
r0 |= (state & 0x000000000000F000L) << 36;
r1 |= t & 0x0FF000FF000F0000UL;

return r0 | r1;
}





share|improve this answer
























  • I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

    – Daniel
    Nov 13 '18 at 15:43













  • @Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

    – harold
    Nov 13 '18 at 15:51











  • I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

    – Daniel
    Nov 13 '18 at 21:04













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53269297%2ftransposing-a-4x4-matrix-represented-as-a-ulong-valueas-fast-as-possible%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









2














you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:



public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

result |= (state & 0x0F0000F0000F0000L) >> 12;
result |= (state & 0x00F0000F00000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F0000F0000F0L) << 12;
//result |= (state & 0x000000F000000000L) >> 12;
//result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000F00L) << 24;
//result |= (state & 0x000000000F000000L) << 12;
//result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
//result |= (state & 0x0000000000000F00L) << 24;
//result |= (state & 0x00000000000000F0L) << 12;

return result;
}





share|improve this answer


























  • How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

    – Daniel
    Nov 12 '18 at 22:23
















2














you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:



public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

result |= (state & 0x0F0000F0000F0000L) >> 12;
result |= (state & 0x00F0000F00000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F0000F0000F0L) << 12;
//result |= (state & 0x000000F000000000L) >> 12;
//result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000F00L) << 24;
//result |= (state & 0x000000000F000000L) << 12;
//result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
//result |= (state & 0x0000000000000F00L) << 24;
//result |= (state & 0x00000000000000F0L) << 12;

return result;
}





share|improve this answer


























  • How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

    – Daniel
    Nov 12 '18 at 22:23














2












2








2







you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:



public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

result |= (state & 0x0F0000F0000F0000L) >> 12;
result |= (state & 0x00F0000F00000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F0000F0000F0L) << 12;
//result |= (state & 0x000000F000000000L) >> 12;
//result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000F00L) << 24;
//result |= (state & 0x000000000F000000L) << 12;
//result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
//result |= (state & 0x0000000000000F00L) << 24;
//result |= (state & 0x00000000000000F0L) << 12;

return result;
}





share|improve this answer















you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:



public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

result |= (state & 0x0F0000F0000F0000L) >> 12;
result |= (state & 0x00F0000F00000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F0000F0000F0L) << 12;
//result |= (state & 0x000000F000000000L) >> 12;
//result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000F00L) << 24;
//result |= (state & 0x000000000F000000L) << 12;
//result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
//result |= (state & 0x0000000000000F00L) << 24;
//result |= (state & 0x00000000000000F0L) << 12;

return result;
}






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 12 '18 at 22:09

























answered Nov 12 '18 at 21:45









AldertAldert

1,2331212




1,2331212













  • How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

    – Daniel
    Nov 12 '18 at 22:23



















  • How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

    – Daniel
    Nov 12 '18 at 22:23

















How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

– Daniel
Nov 12 '18 at 22:23





How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.

– Daniel
Nov 12 '18 at 22:23













1














An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".



For example the moves left by 12 and 24 could be done as:



ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;


That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:



public static ulong Transpose(this ulong state)
{
ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
r0 |= (state & 0x00F0000F00000000L) >> 24;
r1 |= (state & 0x000F000000000000L) >> 36;
r0 |= (state & 0x000000000000F000L) << 36;
r1 |= t & 0x0FF000FF000F0000UL;

return r0 | r1;
}





share|improve this answer
























  • I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

    – Daniel
    Nov 13 '18 at 15:43













  • @Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

    – harold
    Nov 13 '18 at 15:51











  • I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

    – Daniel
    Nov 13 '18 at 21:04


















1














An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".



For example the moves left by 12 and 24 could be done as:



ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;


That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:



public static ulong Transpose(this ulong state)
{
ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
r0 |= (state & 0x00F0000F00000000L) >> 24;
r1 |= (state & 0x000F000000000000L) >> 36;
r0 |= (state & 0x000000000000F000L) << 36;
r1 |= t & 0x0FF000FF000F0000UL;

return r0 | r1;
}





share|improve this answer
























  • I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

    – Daniel
    Nov 13 '18 at 15:43













  • @Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

    – harold
    Nov 13 '18 at 15:51











  • I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

    – Daniel
    Nov 13 '18 at 21:04
















1












1








1







An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".



For example the moves left by 12 and 24 could be done as:



ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;


That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:



public static ulong Transpose(this ulong state)
{
ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
r0 |= (state & 0x00F0000F00000000L) >> 24;
r1 |= (state & 0x000F000000000000L) >> 36;
r0 |= (state & 0x000000000000F000L) << 36;
r1 |= t & 0x0FF000FF000F0000UL;

return r0 | r1;
}





share|improve this answer













An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".



For example the moves left by 12 and 24 could be done as:



ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;


That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:



public static ulong Transpose(this ulong state)
{
ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
r0 |= (state & 0x00F0000F00000000L) >> 24;
r1 |= (state & 0x000F000000000000L) >> 36;
r0 |= (state & 0x000000000000F000L) << 36;
r1 |= t & 0x0FF000FF000F0000UL;

return r0 | r1;
}






share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 12 '18 at 23:24









haroldharold

41.4k357109




41.4k357109













  • I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

    – Daniel
    Nov 13 '18 at 15:43













  • @Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

    – harold
    Nov 13 '18 at 15:51











  • I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

    – Daniel
    Nov 13 '18 at 21:04





















  • I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

    – Daniel
    Nov 13 '18 at 15:43













  • @Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

    – harold
    Nov 13 '18 at 15:51











  • I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

    – Daniel
    Nov 13 '18 at 21:04



















I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

– Daniel
Nov 13 '18 at 15:43







I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.

– Daniel
Nov 13 '18 at 15:43















@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

– harold
Nov 13 '18 at 15:51





@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.

– harold
Nov 13 '18 at 15:51













I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

– Daniel
Nov 13 '18 at 21:04







I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!

– Daniel
Nov 13 '18 at 21:04




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53269297%2ftransposing-a-4x4-matrix-represented-as-a-ulong-valueas-fast-as-possible%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Full-time equivalent

Bicuculline

さくらももこ