Difference between PREFETCH and PREFETCHNTA instructions












PREFETCHNTA is a software prefetch instruction: it brings data from main memory into the caches ahead of use. But instructions with the NT (non-temporal) suffix are known to skip the caches and avoid cache pollution.



So what does PREFETCHNTA do differently from the PREFETCH instruction?










      caching assembly x86 prefetch isa






      asked Nov 12 '18 at 21:33 by Abhishek Nikam (edited Nov 12 '18 at 22:08)
























          1 Answer














          prefetchNTA can't bypass caches, only reduce (not avoid) pollution. It can't break cache coherency or violate the memory-ordering semantics of a WB (Write-Back) memory region. (Unlike NT stores, which do fully bypass caches and are weakly-ordered even on normal WB memory.)



          On paper, the x86 ISA doesn't specify how the NT hint is implemented.
          http://felixcloutier.com/x86/PREFETCHh.html says: "NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution." How any specific CPU microarchitecture chooses to implement that is completely up to the architects.





          prefetchNTA from WB memory (see footnote 1) on Intel CPUs populates L1d normally, allowing later loads to hit in L1d normally (as long as the prefetch distance is large enough that the prefetch completes, and small enough that the line isn't evicted again before the demand load). The correct prefetch distance depends on the system and other factors, and can be fairly brittle.
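As a rough illustration of what "prefetch distance" means in practice, here is a minimal C sketch of a read-once loop that issues an NTA hint a fixed number of elements ahead of the demand loads. The function name `sum_streaming`, the `PREFETCH_NTA` macro, and the 64-byte line size are assumptions made for this example, not something taken from the answer. On non-x86 targets the sketch falls back to the GCC/Clang `__builtin_prefetch` builtin with locality 0.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
/* GCC/Clang builtin: rw = 0 (read), locality = 0 (no temporal locality). */
#define PREFETCH_NTA(p) __builtin_prefetch((p), 0, 0)
#endif

/* Assumption: 64-byte cache lines, i.e. 8 doubles per line. */
#define DOUBLES_PER_LINE (64 / sizeof(double))

/* Sum an array that is read exactly once. `dist` is the prefetch distance
 * in elements: how far ahead of the current demand load the hint is issued. */
double sum_streaming(const double *a, size_t n, size_t dist)
{
    assert(a != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Roughly once per cache line, hint the line `dist` elements ahead.
         * The bounds check keeps the address inside the array. */
        if (i % DOUBLES_PER_LINE == 0 && i + dist < n)
            PREFETCH_NTA(&a[i + dist]);
        total += a[i];
    }
    return total;
}
```

Calling `sum_streaming(a, n, 64)` hints 64 doubles (8 lines) ahead. That number is only a placeholder: a useful distance has to be found by measuring on the target machine.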



          What it does do on Intel CPUs is skip non-inclusive outer caches. So on Intel before Skylake-AVX512 (SKX), it bypasses L2 and populates L1d + L3. But on SKX it also skips the L3 cache entirely, because the L3 there is smaller and non-inclusive. See the Stack Overflow question "Do current x86 architectures support non-temporal loads (from "normal" memory)?"



          On Intel CPUs with inclusive L3 caches (which it can't bypass), it reduces L3 pollution by only ever allocating into one "way" of each set in the inclusive L3. (The L3 is usually something like 16-way associative, so the total capacity that prefetchnta can pollute is only ~1/16th of the total L3 size.)
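To put a number on the ~1/16th figure, the arithmetic can be sketched as below. The helper name `nta_l3_footprint` and the 8 MiB cache size used in the usage note are hypothetical, picked only to make the calculation concrete.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on the L3 bytes that NTA prefetches can occupy when they are
 * restricted to `nta_ways` ways of a `total_ways`-way set-associative cache. */
size_t nta_l3_footprint(size_t l3_bytes, unsigned total_ways, unsigned nta_ways)
{
    assert(total_ways > 0 && nta_ways <= total_ways);
    return l3_bytes / total_ways * nta_ways;
}
```

For a hypothetical 8 MiB, 16-way L3 with one way available to NTA prefetches, `nta_l3_footprint(8u * 1024 * 1024, 16, 1)` gives 512 KiB.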





          @HadiBrais commented on this answer with some info on AMD CPUs.



          Instead of limiting pollution by fetching into only one way of the cache, apparently AMD allocates lines fetched with NT prefetch with a "quick eviction" marking. Probably this means allocating in the LRU position instead of the Most-Recently-Used position. So the next allocation in that set of the cache will evict the line.
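The LRU-position guess above can be illustrated with a toy model of a single cache set. This is only a sketch of that guess, not a description of any real AMD replacement policy: the 4-way associativity, the tag values, and all names (`cache_set`, `fill_mru`, `fill_lru`, `hot_lines_surviving`) are made up for the example, and hits that would refresh a line's recency are not modeled.

```c
#include <assert.h>
#include <string.h>

#define WAYS 4  /* hypothetical associativity, kept small for readability */

/* One cache set as a recency stack: tag[0] is MRU, tag[WAYS - 1] is LRU. */
struct cache_set { int tag[WAYS]; };

static void set_init(struct cache_set *s)
{
    for (int i = 0; i < WAYS; i++)
        s->tag[i] = -1;  /* -1 marks an empty way */
}

/* Normal fill: the LRU line falls off the end, the new line becomes MRU. */
static void fill_mru(struct cache_set *s, int tag)
{
    memmove(&s->tag[1], &s->tag[0], (WAYS - 1) * sizeof s->tag[0]);
    s->tag[0] = tag;
}

/* "Quick eviction" fill: replace the LRU line and stay in the LRU position,
 * so the next allocation in this set evicts the new line again. */
static void fill_lru(struct cache_set *s, int tag)
{
    s->tag[WAYS - 1] = tag;
}

static int contains(const struct cache_set *s, int tag)
{
    for (int i = 0; i < WAYS; i++)
        if (s->tag[i] == tag)
            return 1;
    return 0;
}

/* Fill the set with hot lines 1..WAYS, stream `stream_len` one-shot lines
 * through it, and count how many hot lines are still cached afterwards. */
int hot_lines_surviving(int nt_fill_at_lru, int stream_len)
{
    assert(stream_len >= 0);
    struct cache_set s;
    set_init(&s);
    for (int t = 1; t <= WAYS; t++)
        fill_mru(&s, t);
    for (int k = 0; k < stream_len; k++) {
        if (nt_fill_at_lru)
            fill_lru(&s, 1000 + k);
        else
            fill_mru(&s, 1000 + k);
    }
    int alive = 0;
    for (int t = 1; t <= WAYS; t++)
        alive += contains(&s, t);
    return alive;
}
```

In this model, streaming 100 lines with LRU-position fills costs the set only one hot line (`hot_lines_surviving(1, 100)` is 3), while normal MRU fills push out all four (`hot_lines_surviving(0, 100)` is 0).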





          Footnote 1: prefetchNTA from WC (write-combining) memory, I think, prefetches into an LFB (line fill buffer), allowing SSE4.1 movntdqa loads to hit an already-populated LFB. But note that movntdqa from WB memory is not useful.






          • Are you sure that on SKX the instruction skips the L3? According to Section 7.3.2 of the Intel optimization manual, on Nehalem and later it may/must be fetched into the L3. I think the most important property of prefetchnta is not which cache level it gets fetched into, but rather that it's marked in the cache set for quicker eviction. This applies to most or all Intel and AMD processors that support the instruction. The level into which it gets fetched is also important, however, but that is microarchitecture-dependent across Intel and AMD processors.

            – Hadi Brais
            Nov 13 '18 at 2:17











          • Fetching it into the L3 is an intuitive design because the L3 is much larger than the L1, especially when real programs may have low hit rates in the prefetched lines and the main memory latency is relatively very high in modern systems.

            – Hadi Brais
            Nov 13 '18 at 2:18






          • According to the AMD optimization manual for the 17h family, Section 2.6.4, prefetchnta fetches the line into the L2 with a quick-eviction marking. But older AMD processors, like some older Intel processors, fetch into the L1.

            – Hadi Brais
            Nov 13 '18 at 2:22






          • If you have an SKX we can easily test this as follows. First, disable all prefetchers, perform clflush on a specific cache line, execute an empty loop for about 20,000 iterations, then perform prefetchnta on that same line, then execute the same loop, then use a demand load to access the same line and measure the access latency. This would tell us the nearest cache level in which the line got filled. Then migrate the thread to a different physical core and perform a demand load to the same line and measure that latency. We can compare this latency to the L3 latency and...

            – Hadi Brais
            Nov 13 '18 at 23:24






          • ...to the cache-to-cache or main memory latencies (the Intel Memory Latency Checker tool can be used to measure these). If it was close to the L3 latency, then the line was filled in the L3. If it was close to the C2C latency or main memory latency, then the line was not in the L3. This test would enable us to determine conclusively if the line was filled into the L3 and/or the L1.

            – Hadi Brais
            Nov 13 '18 at 23:25
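The first half of the test described in these comments could be sketched in C roughly as follows, assuming GCC or Clang on x86-64. The function name `nta_then_load_ticks` and the `spin` helper are hypothetical. Disabling the hardware prefetchers and migrating the thread to another core are both left out, because they depend on the OS and on model-specific registers. On other targets the sketch just returns `UINT64_MAX`, since no measurement is possible there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>  /* _mm_clflush, _mm_prefetch, _mm_mfence, _mm_lfence, __rdtsc */

/* Empty delay loop; the volatile counter keeps the compiler from deleting it. */
static void spin(unsigned iters)
{
    for (volatile unsigned i = 0; i < iters; i++) { }
}

/* Flush a line, prefetch it with the NTA hint, then time one demand load.
 * Returns the elapsed time in TSC ticks. */
uint64_t nta_then_load_ticks(const char *line)
{
    assert(line != NULL);

    _mm_clflush(line);                   /* evict the line from the caches */
    _mm_mfence();                        /* order the flush before what follows */
    spin(20000);

    _mm_prefetch(line, _MM_HINT_NTA);    /* the prefetch under test */
    spin(20000);                         /* give the prefetch time to finish */

    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();
    (void)*(const volatile char *)line;  /* timed demand load */
    _mm_lfence();
    uint64_t end = __rdtsc();
    return end - start;
}
#else
/* Not x86-64: nothing to measure. */
uint64_t nta_then_load_ticks(const char *line)
{
    assert(line != NULL);
    return UINT64_MAX;
}
#endif
```

A single sample is noisy, so in practice the measurement would be repeated many times and the minimum or median compared against the known L1, L2, L3 and memory latencies of the machine. On recent CPUs the TSC ticks at a constant reference rate, not at the current core clock, which matters when converting ticks to cycles.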











          answered Nov 12 '18 at 23:03 by Peter Cordes (edited Nov 13 '18 at 18:07)












