Transposing a 4x4 matrix represented as a ulong value(as fast as possible)
I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.
The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.
Until recently I was using a 4x4 byte matrix:
var field = new byte[4,4];
Each value was an exponent of 2, so 0=0
, 1=2
, 2=4
, 3=8
, and so forth. The 2048 tile would be represented by 11.
Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong
value.
It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:
//flip horizontally
const byte SZ = 4;
public static byte[,] Invert(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[x, y] = squares[x, SZ - y - 1];
return tmp;
}
I can do this inversion to a ulong
~15x faster:
public static ulong Invert(this ulong state)
{
ulong c1 = state & 0xF000F000F000F000L;
ulong c2 = state & 0x0F000F000F000F00L;
ulong c3 = state & 0x00F000F000F000F0L;
ulong c4 = state & 0x000F000F000F000FL;
return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);
}
Note the use of hex, which is extremely useful because each character represents a tile.
The operation I've having the most trouble with is Transpose
, which flipped the x
and y
coordinates of values in the 2d array, like this:
public static byte[,] Transpose(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[y, x] = squares[x, y];
return tmp;
}
The fastest way I've found to do this is using this bit of ridiculousness:
public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals
result |= (state & 0x0F00000000000000L) >> 12;
result |= (state & 0x00F0000000000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F00000000000L) << 12;
result |= (state & 0x000000F000000000L) >> 12;
result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000000L) << 24;
result |= (state & 0x000000000F000000L) << 12;
result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
result |= (state & 0x0000000000000F00L) << 24;
result |= (state & 0x00000000000000F0L) << 12;
return result;
}
Shockingly, this is still nearly 3x faster than the loop version. However, I'm looking for a more performant method either using leveraging a pattern inherent in transposition or more efficient management of the bits I'm moving around.
c# bit-manipulation 64bit
add a comment |
I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.
The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.
Until recently I was using a 4x4 byte matrix:
var field = new byte[4,4];
Each value was an exponent of 2, so 0=0
, 1=2
, 2=4
, 3=8
, and so forth. The 2048 tile would be represented by 11.
Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong
value.
It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:
//flip horizontally
const byte SZ = 4;
public static byte[,] Invert(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[x, y] = squares[x, SZ - y - 1];
return tmp;
}
I can do this inversion to a ulong
~15x faster:
public static ulong Invert(this ulong state)
{
ulong c1 = state & 0xF000F000F000F000L;
ulong c2 = state & 0x0F000F000F000F00L;
ulong c3 = state & 0x00F000F000F000F0L;
ulong c4 = state & 0x000F000F000F000FL;
return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);
}
Note the use of hex, which is extremely useful because each character represents a tile.
The operation I've having the most trouble with is Transpose
, which flipped the x
and y
coordinates of values in the 2d array, like this:
public static byte[,] Transpose(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[y, x] = squares[x, y];
return tmp;
}
The fastest way I've found to do this is using this bit of ridiculousness:
public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals
result |= (state & 0x0F00000000000000L) >> 12;
result |= (state & 0x00F0000000000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F00000000000L) << 12;
result |= (state & 0x000000F000000000L) >> 12;
result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000000L) << 24;
result |= (state & 0x000000000F000000L) << 12;
result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
result |= (state & 0x0000000000000F00L) << 24;
result |= (state & 0x00000000000000F0L) << 12;
return result;
}
Shockingly, this is still nearly 3x faster than the loop version. However, I'm looking for a more performant method either using leveraging a pattern inherent in transposition or more efficient management of the bits I'm moving around.
c# bit-manipulation 64bit
1
not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)
– dlatikay
Nov 12 '18 at 20:11
I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do somethingulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | ...
. By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.
– elgonzo
Nov 12 '18 at 20:17
(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)
– elgonzo
Nov 12 '18 at 20:19
@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).
– Daniel
Nov 12 '18 at 21:24
@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations onulong
result with theulong
state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...
– elgonzo
Nov 12 '18 at 21:29
add a comment |
I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.
The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.
Until recently I was using a 4x4 byte matrix:
var field = new byte[4,4];
Each value was an exponent of 2, so 0=0
, 1=2
, 2=4
, 3=8
, and so forth. The 2048 tile would be represented by 11.
Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong
value.
It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:
//flip horizontally
const byte SZ = 4;
public static byte[,] Invert(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[x, y] = squares[x, SZ - y - 1];
return tmp;
}
I can do this inversion to a ulong
~15x faster:
public static ulong Invert(this ulong state)
{
ulong c1 = state & 0xF000F000F000F000L;
ulong c2 = state & 0x0F000F000F000F00L;
ulong c3 = state & 0x00F000F000F000F0L;
ulong c4 = state & 0x000F000F000F000FL;
return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);
}
Note the use of hex, which is extremely useful because each character represents a tile.
The operation I've having the most trouble with is Transpose
, which flipped the x
and y
coordinates of values in the 2d array, like this:
public static byte[,] Transpose(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[y, x] = squares[x, y];
return tmp;
}
The fastest way I've found to do this is using this bit of ridiculousness:
public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals
result |= (state & 0x0F00000000000000L) >> 12;
result |= (state & 0x00F0000000000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F00000000000L) << 12;
result |= (state & 0x000000F000000000L) >> 12;
result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000000L) << 24;
result |= (state & 0x000000000F000000L) << 12;
result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
result |= (state & 0x0000000000000F00L) << 24;
result |= (state & 0x00000000000000F0L) << 12;
return result;
}
Shockingly, this is still nearly 3x faster than the loop version. However, I'm looking for a more performant method either using leveraging a pattern inherent in transposition or more efficient management of the bits I'm moving around.
c# bit-manipulation 64bit
I have been working on a C# implementation of 2048 for the purpose of implementing reinforcement learning.
The "slide" operation for each move requires that tiles be moved and combined according to specific rules. Doing so involves a number of transformations on a 2d array of values.
Until recently I was using a 4x4 byte matrix:
var field = new byte[4,4];
Each value was an exponent of 2, so 0=0
, 1=2
, 2=4
, 3=8
, and so forth. The 2048 tile would be represented by 11.
Because the (practical) maximum value for a given tile is 15 (which only requires 4 bits), it is possible to shove the contents of this 4x4 byte array into a ulong
value.
It turns out that certain operations are vastly more efficient with this representation. For example, I commonly have to invert arrays like this:
//flip horizontally
const byte SZ = 4;
public static byte[,] Invert(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[x, y] = squares[x, SZ - y - 1];
return tmp;
}
I can do this inversion to a ulong
~15x faster:
public static ulong Invert(this ulong state)
{
ulong c1 = state & 0xF000F000F000F000L;
ulong c2 = state & 0x0F000F000F000F00L;
ulong c3 = state & 0x00F000F000F000F0L;
ulong c4 = state & 0x000F000F000F000FL;
return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);
}
Note the use of hex, which is extremely useful because each character represents a tile.
The operation I've having the most trouble with is Transpose
, which flipped the x
and y
coordinates of values in the 2d array, like this:
public static byte[,] Transpose(this byte[,] squares)
{
var tmp = new byte[SZ, SZ];
for (byte x = 0; x < SZ; x++)
for (byte y = 0; y < SZ; y++)
tmp[y, x] = squares[x, y];
return tmp;
}
The fastest way I've found to do this is using this bit of ridiculousness:
public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals
result |= (state & 0x0F00000000000000L) >> 12;
result |= (state & 0x00F0000000000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F00000000000L) << 12;
result |= (state & 0x000000F000000000L) >> 12;
result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000000L) << 24;
result |= (state & 0x000000000F000000L) << 12;
result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
result |= (state & 0x0000000000000F00L) << 24;
result |= (state & 0x00000000000000F0L) << 12;
return result;
}
Shockingly, this is still nearly 3x faster than the loop version. However, I'm looking for a more performant method either using leveraging a pattern inherent in transposition or more efficient management of the bits I'm moving around.
c# bit-manipulation 64bit
c# bit-manipulation 64bit
asked Nov 12 '18 at 20:03
DanielDaniel
701721
701721
1
not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)
– dlatikay
Nov 12 '18 at 20:11
I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do somethingulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | ...
. By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.
– elgonzo
Nov 12 '18 at 20:17
(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)
– elgonzo
Nov 12 '18 at 20:19
@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).
– Daniel
Nov 12 '18 at 21:24
@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations onulong
result with theulong
state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...
– elgonzo
Nov 12 '18 at 21:29
add a comment |
1
not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)
– dlatikay
Nov 12 '18 at 20:11
I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do somethingulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | ...
. By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.
– elgonzo
Nov 12 '18 at 20:17
(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)
– elgonzo
Nov 12 '18 at 20:19
@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).
– Daniel
Nov 12 '18 at 21:24
@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations onulong
result with theulong
state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...
– elgonzo
Nov 12 '18 at 21:29
1
1
not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)
– dlatikay
Nov 12 '18 at 20:11
not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)
– dlatikay
Nov 12 '18 at 20:11
I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something
ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | ...
. By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.– elgonzo
Nov 12 '18 at 20:17
I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something
ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | ...
. By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.– elgonzo
Nov 12 '18 at 20:17
(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)
– elgonzo
Nov 12 '18 at 20:19
(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)
– elgonzo
Nov 12 '18 at 20:19
@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).
– Daniel
Nov 12 '18 at 21:24
@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).
– Daniel
Nov 12 '18 at 21:24
@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on
ulong
result with the ulong
state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...– elgonzo
Nov 12 '18 at 21:29
@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on
ulong
result with the ulong
state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...– elgonzo
Nov 12 '18 at 21:29
add a comment |
2 Answers
2
active
oldest
votes
you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:
public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals
result |= (state & 0x0F0000F0000F0000L) >> 12;
result |= (state & 0x00F0000F00000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F0000F0000F0L) << 12;
//result |= (state & 0x000000F000000000L) >> 12;
//result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000F00L) << 24;
//result |= (state & 0x000000000F000000L) << 12;
//result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
//result |= (state & 0x0000000000000F00L) << 24;
//result |= (state & 0x00000000000000F0L) << 12;
return result;
}
How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.
– Daniel
Nov 12 '18 at 22:23
add a comment |
An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".
For example the moves left by 12 and 24 could be done as:
ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;
That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul
would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:
public static ulong Transpose(this ulong state)
{
ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals
ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
r0 |= (state & 0x00F0000F00000000L) >> 24;
r1 |= (state & 0x000F000000000000L) >> 36;
r0 |= (state & 0x000000000000F000L) << 36;
r1 |= t & 0x0FF000FF000F0000UL;
return r0 | r1;
}
I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.
– Daniel
Nov 13 '18 at 15:43
@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.
– harold
Nov 13 '18 at 15:51
I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!
– Daniel
Nov 13 '18 at 21:04
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53269297%2ftransposing-a-4x4-matrix-represented-as-a-ulong-valueas-fast-as-possible%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:
public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals
result |= (state & 0x0F0000F0000F0000L) >> 12;
result |= (state & 0x00F0000F00000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F0000F0000F0L) << 12;
//result |= (state & 0x000000F000000000L) >> 12;
//result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000F00L) << 24;
//result |= (state & 0x000000000F000000L) << 12;
//result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
//result |= (state & 0x0000000000000F00L) << 24;
//result |= (state & 0x00000000000000F0L) << 12;
return result;
}
How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.
– Daniel
Nov 12 '18 at 22:23
add a comment |
you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:
public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals
result |= (state & 0x0F0000F0000F0000L) >> 12;
result |= (state & 0x00F0000F00000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F0000F0000F0L) << 12;
//result |= (state & 0x000000F000000000L) >> 12;
//result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000F00L) << 24;
//result |= (state & 0x000000000F000000L) << 12;
//result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
//result |= (state & 0x0000000000000F00L) << 24;
//result |= (state & 0x00000000000000F0L) << 12;
return result;
}
How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.
– Daniel
Nov 12 '18 at 22:23
add a comment |
you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:
public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals
result |= (state & 0x0F0000F0000F0000L) >> 12;
result |= (state & 0x00F0000F00000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F0000F0000F0L) << 12;
//result |= (state & 0x000000F000000000L) >> 12;
//result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000F00L) << 24;
//result |= (state & 0x000000000F000000L) << 12;
//result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
//result |= (state & 0x0000000000000F00L) << 24;
//result |= (state & 0x00000000000000F0L) << 12;
return result;
}
you can skip 6 steps by combining, i commented them out to show you result, should make it twice as fast:
public static ulong Transpose(this ulong state)
{
ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals
result |= (state & 0x0F0000F0000F0000L) >> 12;
result |= (state & 0x00F0000F00000000L) >> 24;
result |= (state & 0x000F000000000000L) >> 36;
result |= (state & 0x0000F0000F0000F0L) << 12;
//result |= (state & 0x000000F000000000L) >> 12;
//result |= (state & 0x0000000F00000000L) >> 24;
result |= (state & 0x00000000F0000F00L) << 24;
//result |= (state & 0x000000000F000000L) << 12;
//result |= (state & 0x00000000000F0000L) >> 12;
result |= (state & 0x000000000000F000L) << 36;
//result |= (state & 0x0000000000000F00L) << 24;
//result |= (state & 0x00000000000000F0L) << 12;
return result;
}
edited Nov 12 '18 at 22:09
answered Nov 12 '18 at 21:45
AldertAldert
1,2331212
1,2331212
How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.
– Daniel
Nov 12 '18 at 22:23
add a comment |
How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.
– Daniel
Nov 12 '18 at 22:23
How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.
– Daniel
Nov 12 '18 at 22:23
How did I miss this! Great idea. Marking this as the answer as it is now reasonably close to the speed of my other transformations.
– Daniel
Nov 12 '18 at 22:23
add a comment |
An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".
For example the moves left by 12 and 24 could be done as:
ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;
That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul
would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:
public static ulong Transpose(this ulong state)
{
ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals
ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
r0 |= (state & 0x00F0000F00000000L) >> 24;
r1 |= (state & 0x000F000000000000L) >> 36;
r0 |= (state & 0x000000000000F000L) << 36;
r1 |= t & 0x0FF000FF000F0000UL;
return r0 | r1;
}
I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.
– Daniel
Nov 13 '18 at 15:43
@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.
– harold
Nov 13 '18 at 15:51
I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!
– Daniel
Nov 13 '18 at 21:04
add a comment |
An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".
For example the moves left by 12 and 24 could be done as:
ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;
That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul
would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:
public static ulong Transpose(this ulong state)
{
ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals
ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
r0 |= (state & 0x00F0000F00000000L) >> 24;
r1 |= (state & 0x000F000000000000L) >> 36;
r0 |= (state & 0x000000000000F000L) << 36;
r1 |= t & 0x0FF000FF000F0000UL;
return r0 | r1;
}
I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.
– Daniel
Nov 13 '18 at 15:43
@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.
– harold
Nov 13 '18 at 15:51
I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!
– Daniel
Nov 13 '18 at 21:04
add a comment |
An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".
For example the moves left by 12 and 24 could be done as:
ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;
That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul
would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:
public static ulong Transpose(this ulong state)
{
ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals
ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
r0 |= (state & 0x00F0000F00000000L) >> 24;
r1 |= (state & 0x000F000000000000L) >> 36;
r0 |= (state & 0x000000000000F000L) << 36;
r1 |= t & 0x0FF000FF000F0000UL;
return r0 | r1;
}
An other trick is that sometimes it is possible to move disjoint sets of bit-groups left by different amounts using a multiplication. This requires that the partial products don't "overlap".
For example the moves left by 12 and 24 could be done as:
ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;
That reduces 6 operations to 4. The multiplication shouldn't be slow, on a modern processor it takes 3 cycles, and while it is working on that multiply the processor can go ahead and work on the other steps too. As a bonus, on Intel the imul
would go to port 1 while the shifts go to ports 0 and 6, so saving two shifts with a multiply is a good deal, opening up more room for the other shifts. The AND and OR operations can go to any ALU port and aren't really the problem here, but it may help for latency to split up the chain of dependent ORs:
public static ulong Transpose(this ulong state)
{
ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals
ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
r0 |= (state & 0x00F0000F00000000L) >> 24;
r1 |= (state & 0x000F000000000000L) >> 36;
r0 |= (state & 0x000000000000F000L) << 36;
r1 |= t & 0x0FF000FF000F0000UL;
return r0 | r1;
}
answered Nov 12 '18 at 23:24
haroldharold
41.4k357109
41.4k357109
I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.
– Daniel
Nov 13 '18 at 15:43
@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.
– harold
Nov 13 '18 at 15:51
I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!
– Daniel
Nov 13 '18 at 21:04
add a comment |
I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.
– Daniel
Nov 13 '18 at 15:43
@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.
– harold
Nov 13 '18 at 15:51
I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!
– Daniel
Nov 13 '18 at 21:04
I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.
– Daniel
Nov 13 '18 at 15:43
I don't fully understand this, but I've confirmed that the results are identical. However, I have benchmarked the two and it appears that this solution is slower than the accepted solution by a factor of 2 or 3. My linqpad performance test is here: pastebin.com/ST8jXPFT, if you want to try. I am on an i7-7700.
– Daniel
Nov 13 '18 at 15:43
@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.
– harold
Nov 13 '18 at 15:51
@Daniel it was a little faster for me, are you accidentally testing this in 32bit mode? I used this to test.
– harold
Nov 13 '18 at 15:51
I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!
– Daniel
Nov 13 '18 at 21:04
I think it was just a peculiarity of LinqPad. I ran the same thing inside Visual Studio and your version ran a little bit faster, like it did for you. Thanks!
– Daniel
Nov 13 '18 at 21:04
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53269297%2ftransposing-a-4x4-matrix-represented-as-a-ulong-valueas-fast-as-possible%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
not ridiculous. brillant already. the 05AB1E golf language probably has a single-character command for it, a pity that it needs to be C# :)
– dlatikay
Nov 12 '18 at 20:11
I guess you used "shockingly" in an ironic manner. Anyway, the purpose of my comment is not to doubt my irony detector. ;-P I don't know how clever the JIT optimizer is, but perhaps you could make your unrolled Transpose fractionally faster by avoiding setting result repeatedly but rather do something
ulong result = (state & mask1) | ((state & mask 2) >> 12) | ((state & mask3) >> 24) | ...
. By the way, .NETCore 3.0 (release planned in 2019) will likely feature support for SSE/AVX hardware intrinsics, which would probably be of benefit to accelerate things further.– elgonzo
Nov 12 '18 at 20:17
(Link to MS blog post about SSE/AVX SIMD support in .NET core 3.0: blogs.msdn.microsoft.com/dotnet/2018/10/10/…)
– elgonzo
Nov 12 '18 at 20:19
@elgonzo, I tried returning a single statement, but the results was never better and usually very minutely worse than multiple assigns for some reason I don't understand. Much of the advantage of this method comes from how easily the CPU can keep pipelines filled; this is evident when switching to from debug performance testing (where a single statement is faster) to release (where it is not).
– Daniel
Nov 12 '18 at 21:24
@Daniel, yes, part of the the performance difference is that the non-looped methods don't risk stalling the pipeline. Another perfomance advantage is that the operations on
ulong
result with theulong
state (both local variables) can be all done within/accross registers, minimizing cache accesses. Unless the JIT is really genious, i don't think the byte array would be hold entirely in a register, thus leading to more cache accesses and more cycles spent...– elgonzo
Nov 12 '18 at 21:29