Cache and Branch-Prediction Profiler.

This blog post demonstrates how to use Cachegrind to count how often your code hits and misses in the cache. It provides per-level cache statistics as well as branch-prediction analysis.

Overview :

Given below are some of the technical terms and abbreviations that are useful to know before we explore Cachegrind; a short code example after the list illustrates the difference between conditional and indirect branches.

  • Ir : Instruction cache reads (i.e. instructions executed).
  • I1mr : L1 instruction cache read misses.
  • ILmr : Last-level (LL, typically L3) instruction cache read misses.
  • Dr : Data cache reads.
  • Dw : Data cache writes.
  • D1mr : L1 data cache read misses.
  • DLmr : Last-level (LL) data cache read misses.
  • D1mw : L1 data cache write misses.
  • DLmw : Last-level (LL) data cache write misses.
  • Bc : Conditional branches executed.
  • Bcm : Conditional branches mis-predicted.
  • Bi : Indirect branches executed.
  • Bim : Indirect branches mis-predicted.

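To make the branch counters concrete, below is a minimal sketch (not from the original post): the loop test and the if statement are conditional branches that count towards Bc (and towards Bcm when mis-predicted), while the call through the function pointer is an indirect branch that counts towards Bi (and Bim when its target is guessed wrong).

#include <stdio.h>

/* Two tiny functions reached through a function pointer. */
static int add_one(int x) { return x + 1; }
static int sub_one(int x) { return x - 1; }

int main(void)
{
    int (*op)(int) = add_one;
    int total = 0;

    for (int i = 0; i < 1000; ++i) {   /* loop test: conditional branch (Bc) */
        if (i % 3 == 0)                /* another conditional branch; hard-to-predict ones raise Bcm */
            op = sub_one;
        else
            op = add_one;
        total = op(total);             /* call through a pointer: indirect branch (Bi/Bim) */
    }
    printf("%d\n", total);
    return 0;
}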

Installation :

Cachegrind is one of the tools in the Valgrind suite, so Valgrind has to be installed on your system.

If you are on an Ubuntu machine, enter this command in the terminal :

sudo apt-get install valgrind
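Once installed, you can confirm the installed version with :

$ valgrind --version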

For other platforms, you need to build Valgrind from source; follow the instructions on the Valgrind website (valgrind.org).
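For reference, a typical build from a release tarball looks like this (the version number is only illustrative) :

$ tar xjf valgrind-3.13.0.tar.bz2
$ cd valgrind-3.13.0
$ ./configure
$ make
$ sudo make install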


Usage :

Cache insights along with branch prediction analysis :

$ valgrind --tool=cachegrind --branch-sim=yes ./a.out
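By default Cachegrind picks up the cache configuration of the host machine; the simulated caches can also be set explicitly as size,associativity,line-size triples. The values below are just an example :

$ valgrind --tool=cachegrind --branch-sim=yes --I1=32768,8,64 --D1=32768,8,64 --LL=8388608,16,64 ./a.out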

Function level analysis :

$ cg_annotate <cachegrind.out.pid>
The output file, cachegrind.out.<pid> (where <pid> is the process ID of the profiled run), is written to the directory from which valgrind was run.

Line by line analysis :

$ cg_annotate <cachegrind.out.pid> --auto=yes
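Line-level annotation relies on debugging information in the binary, so the program should be compiled with -g (the source file name here is just an example) :

$ gcc -g transpose.c -o a.out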


Demonstration :

For demonstration purposes, I am using a C program that performs an in-place transposition of an M x N matrix on a fixed input.
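The original source is only visible in the screenshots below, so here is a minimal sketch of such a program, assuming a row-major M x N matrix transposed in place by following permutation cycles; the dimensions and input values are chosen arbitrarily for illustration.

#include <stdio.h>
#include <stdbool.h>

#define ROWS 3
#define COLS 4

/* In-place transpose of a ROWS x COLS matrix stored row-major in a 1-D
 * array. The element at index i moves to index (i * ROWS) mod (ROWS*COLS - 1);
 * the first and last elements stay where they are. */
static void transpose(double *a)
{
    const int size = ROWS * COLS;
    bool moved[ROWS * COLS] = { false };

    for (int start = 1; start < size - 1; ++start) {
        if (moved[start])
            continue;
        double carry = a[start];
        int cur = start;
        do {
            int next = (int)(((long)cur * ROWS) % (size - 1));
            double tmp = a[next];
            a[next] = carry;   /* drop the carried value at its destination */
            carry = tmp;       /* pick up the value that was there */
            moved[next] = true;
            cur = next;
        } while (cur != start);
    }
}

int main(void)
{
    /* Fixed input, as in the original post. */
    double a[ROWS * COLS] = { 1, 2,  3,  4,
                              5, 6,  7,  8,
                              9, 10, 11, 12 };

    transpose(a);

    /* The array now holds the COLS x ROWS transpose. */
    for (int r = 0; r < COLS; ++r) {
        for (int c = 0; c < ROWS; ++c)
            printf("%6.1f", a[r * ROWS + c]);
        printf("\n");
    }
    return 0;
}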

Compilation :

[Screenshots: compiling the transpose program]

Insights (Output) :

[Screenshot: Cachegrind output for the transpose program]

The output shows an instruction cache miss rate of 0.10% and a branch mis-prediction rate of 5.1% for the transpose code.


Limitations :

  • It doesn’t account for the activity of other processes running on the system.
  • It doesn’t account for virtual-to-physical address mappings, so the simulation is not a fully faithful picture of what happens in the real cache.
  • Valgrind’s custom threading implementation schedules threads differently from the standard one, which can skew the results for threaded programs.