store_forwarding

Commits

List of commits on branch main.
Verified
2d4ea375d30637ddb17839273b07454ad560a4de

Merge pull request #1 from leviska/main

danlark1 committed 2 years ago
Unverified
e6f542b12f1ca3b648f9a3e18ea1c2c739106a6e

made generics instead of macros

leviska committed 2 years ago
Unverified
a94ffe96a8a6019127986e7a0d2f0a9d989d21e0

More data

danlark1 committed 2 years ago
Unverified
027898ed50edfe094f5c0d667170d38d466075f1

Better wording

danlark1 committed 2 years ago
Unverified
b85be7346c521cc92930ba7fd065ed769f5aa3a7

First commit of the forward store benchmark

danlark1 committed 2 years ago

README

Store-forwarding benchmark

This repo benchmarks store forwarding. Modern processors can forward the data of a memory store directly to a subsequent load from the same address. This is called “store-to-load forwarding”, and it improves performance because the load does not have to wait for the data to be written to the cache and then read back again. For reference, see Agner Fog's guide.

An example of a store followed by a forwarded load is:

movaps  xmmword ptr [rsp], xmm0
mov     eax, dword ptr [rsp + 2]

It stores 16 bytes to the stack and then loads 4 bytes starting at offset 2 within the stored bytes.
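The same access pattern can be sketched in Rust. This is a minimal illustration, not code from the repository: `Slot` and `store16_load4` are hypothetical names, and whether the compiler emits exactly the two instructions above depends on the compiler version and flags, so the generated assembly should be inspected before timing anything.

```rust
use std::ptr;

/// 16-byte, 16-byte-aligned slot standing in for the stack location `[rsp]`.
/// Hypothetical type, introduced only for this sketch.
#[repr(align(16))]
struct Slot([u8; 16]);

/// Stores 16 bytes, then loads 4 bytes at `offset` from the same slot.
/// Hypothetical helper, introduced only for this sketch.
fn store16_load4(bytes: [u8; 16], offset: usize) -> u32 {
    assert!(offset + 4 <= 16, "the load must lie inside the stored bytes");
    let mut slot = Slot([0u8; 16]);
    let base: *mut [u8; 16] = &mut slot.0;
    unsafe {
        // The 16-byte store; volatile so the compiler cannot drop it.
        ptr::write_volatile(base, bytes);
        // The 4-byte load at an arbitrary (possibly unaligned) offset.
        ptr::read_unaligned((base as *const u8).add(offset) as *const u32)
    }
}
```

Calling `store16_load4(bytes, 2)` corresponds to the example above: the load returns bytes 2 through 5 of the value that was just stored.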

Benchmarks are named store_X_load_Y_offset_Z: we store X bytes, then load Y bytes at offset Z. The load must lie entirely within the stored bytes, so Y + Z $\leq$ X. Latency tables for a few Intel, AMD and Arm processors:
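To make the naming scheme concrete, the sketch below lists every name the constraint Y + Z $\leq$ X permits for a given store and load size. `benchmark_names` is a hypothetical helper, and the list it produces is what the scheme allows, not necessarily the set of benchmarks the repository actually runs.

```rust
/// Lists all `store_X_load_Y_offset_Z` names with Y + Z <= X.
/// Hypothetical helper, introduced only for this sketch.
fn benchmark_names(store: usize, load: usize) -> Vec<String> {
    // A load wider than the store (or an empty load) has no valid offset.
    if load == 0 || load > store {
        return Vec::new();
    }
    // Valid offsets run from 0 up to and including X - Y.
    (0..=store - load)
        .map(|offset| format!("store_{store}_load_{load}_offset_{offset}"))
        .collect()
}
```

For a 16-byte store and a 4-byte load this yields 13 names, from `store_16_load_4_offset_0` to `store_16_load_4_offset_12`.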

Rome

Skylake

Graviton 2

The takeaway for library code: avoid loading the last bytes of a preceding store, and avoid loads that cross an 8-byte boundary; both perform badly on Intel and Arm. AMD has been quite good overall. Examples for that:
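As a rough illustration of the two patterns named above (not one of the repository's examples), the following sketch classifies a load by its offset inside the stored bytes. Both function names are hypothetical. The sketch assumes the store itself starts on an 8-byte boundary, that the load is at most 8 bytes wide, and it reads "last bytes" as "the load ends exactly where the store ends", which is one interpretation of the advice.

```rust
/// True if a load of `len` bytes at `offset` spans two 8-byte chunks.
/// Assumes the store starts on an 8-byte boundary (an assumption of this sketch).
fn crosses_8_byte_boundary(offset: usize, len: usize) -> bool {
    // Compare the chunk of the first byte with the chunk of the last byte.
    len > 0 && offset / 8 != (offset + len - 1) / 8
}

/// True if the load ends exactly at the end of a `store_len`-byte store.
fn reads_last_bytes(store_len: usize, offset: usize, len: usize) -> bool {
    len > 0 && offset + len == store_len
}
```

Under these assumptions, a 4-byte load at offset 2 of a 16-byte store triggers neither check, a 4-byte load at offset 6 crosses the 8-byte boundary, and a 4-byte load at offset 12 reads the last bytes.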

TBD: more processors, 32-byte loads, etc.