This blog post demonstrates how to use Cachegrind to count the cache hits and misses your code generates, with per-level cache and branch-prediction analysis.
Given below are some technical terms and abbreviations that are useful to know before we explore Cachegrind.
- Ir : Instruction cache reads (i.e. instructions executed).
- I1mr : L1 instruction cache read misses.
- ILmr : Last-level (LL, typically L3) instruction cache read misses.
- Dr : Data cache reads.
- D1mr : L1 data cache read misses.
- DLmr : Last-level data cache read misses.
- Dw : Data cache writes.
- D1mw : L1 data cache write misses.
- DLmw : Last-level data cache write misses.
- Bc : Conditional branches executed.
- Bcm : Conditional branches mis-predicted.
- Bi : Indirect branches executed.
- Bim : Indirect branches mis-predicted.
Cachegrind is one of the tools in the Valgrind suite, so Valgrind must be installed on your system.
If you are using an Ubuntu machine, enter this command in the Terminal :
sudo apt-get install valgrind
For other OS platforms, you need to build Valgrind from source; follow the instructions on the Valgrind website.
Cache insights along with branch prediction analysis :
$ valgrind --tool=cachegrind --branch-sim=yes ./a.out
Function level analysis :
$ cg_annotate <cachegrind.out.pid>
The output file is generated in the same directory where the valgrind tool was run.
Line by line analysis :
$ cg_annotate <cachegrind.out.pid> --auto=yes
For demonstration purposes, I am using a C program that performs an in-place transposition of an M x N matrix with a fixed input.
Insights (Output) :
The output shows an instruction cache miss rate of 0.10% and a branch mis-prediction rate of 5.1% for the transpose code.
- It doesn’t account for activity from other processes.
- It doesn’t account for virtual-to-physical address mappings; hence the entire simulation is not a true representation of what’s happening in the cache.
- Valgrind’s custom threads implementation schedules threads differently from the standard one. This can warp the results for threaded programs.