Abstract
This document describes two methods for measuring the performance of GNU APL. The purpose of such measurements is to find locations in the source code that perform poorly.
Methods
gprof profiling
gprof is a command line tool that measures how the CPU time of an APL session is distributed over the different functions in the GNU APL source code.
Preparation
Profiling with gprof requires that the entire GNU APL code is compiled with specific compiler flags. The GNU APL build system sets these flags if gprof profiling is desired. The simplest way to set these flags is to run:
make gprof # set proper compiler flags in make
make -C src clean all
in the top-level GNU APL directory.
Measurement
For example:
cd src
./apl -T testcases_3/*tc # produce file gmon.out
gprof ./apl gmon.out > profile # produce file 'profile' from file gmon.out
The example above profiles GNU APL over all testcases in directory src/testcases_3. It is up to the user to determine what is being profiled (a somewhat broad profile over all GNU APL primitives in the example above). To obtain a narrower profile, say for specific APL primitives, you should:
- Write an APL workspace (or a testcase file) that executes the function of interest in a narrow APL loop, then
- run apl with that workspace, and finally
- run gprof with the output file created by step 2 above (by convention this file is named gmon.out).
As another, more detailed, example, say we want to profile the APL matrix division, i.e. the APL code A ⌹ B. The workspace in step 1. above could then be this:
#!./apl --script
∇LOOP COUNT
L: ⊣A⌹B ◊ →(0 < COUNT←COUNT-1)⍴L
∇
loop_count←1000000 ⍝ keep APL busy with A ⌹ B
N←10
A←⎕RVAL 1 (,N) (0 0 0 1) ⍝ a random 10 element complex vector
B←⎕RVAL 2 (N,N) (0 0 0 1) ⍝ a random 10×10 complex matrix
LOOP loop_count ⍝ run the loop loop_count times
In step 2. we run the workspace (in directory src):
./apl -f workspaces/GPROF_LOOP.apl # invoke apl with the above script file
In step 3. we analyze the result gmon.out of step 2. The file gmon.out is a hard-to-read binary file that contains the measurement results. Step 3 converts this binary file into a human-readable statistics file named profile:
gprof ./apl gmon.out > profile
The text file profile then contains a breakdown of the CPU time spent in different source code functions. Some notes may help to achieve a meaningful profile:
- make sure that the APL code focuses on the function(s) of interest,
- avoid producing any output (the interpreter blocks on output, which may distort the profile), and
- make the loop run long enough, particularly when the profiled footprint is small; otherwise some small function may be missing entirely from the profile. The longer the loop runs, the more reliable the result. An execution time of one minute is a good starting point for a simple APL primitive.
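If the file profile is too verbose, the flat profile and the call graph can also be produced separately. This is only a sketch, assuming a standard GNU gprof installation (whose -p option selects the flat profile and -q the call graph):
gprof -p ./apl gmon.out > flat_profile   # flat profile: CPU time per function
gprof -q ./apl gmon.out > call_graph     # call graph: callers and callees of each function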
CPU cycle counters
GNU APL has a number of built-in CPU cycle counters that are spread over some GNU APL source files of interest. The existing counters are defined in src/Performance.def. The counters generate a small runtime overhead and therefore have to be enabled explicitly. For example with:
./configure PERFORMANCE_COUNTERS_WANTED=yes ...
The ./configure option DEVELOP_WANTED=yes automatically sets PERFORMANCE_COUNTERS_WANTED=yes, but also other debug features that may have performance impacts.
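For example (a sketch based on the option just described):
./configure DEVELOP_WANTED=yes   # also implies PERFORMANCE_COUNTERS_WANTED=yes (plus other debug features)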
The counters are managed from APL via ⎕FIO. Every counter has a unique ID and can be cleared (before a measurement is started) and read (after a measurement). For example:
⊣ ⎕FIO[200] 16 ⍝ clear CPU cycle counter for dyadic A ÷ B
⊣ 42.1 ÷ ⍳ 1000 ⍝ perform 1000 divisions
8 ⎕CR (⎕FIO[201] 16) ⍝ read counter for dyadic A ÷ B
┌→────────────────────────────────────────────┐
│2 ┌→────┐ 1 3087 9529569 999 370713 182857673│
│ │A ÷ B│ │
│ └─────┘ │
└ϵ────────────────────────────────────────────┘
The result Z of Z←⎕FIO[201] ID (where ID is the counter ID, 16 in the example above) is an 8-item vector:
- Z[1] the statistics type. The statistics types are:
  - 1 for the Cell operations of a monadic scalar APL function, i.e. the time for one vector item,
  - 2 for the Cell operations of a dyadic scalar APL function, i.e. the time for one vector item,
  - 3 for a monadic APL primitive, i.e. the time for one vector, or
  - 4 for a dyadic APL primitive, i.e. the time for one vector.
- Z[2] the statistics name (the APL operation that was measured),
- Z[3] the number of executions of the first APL vector item. Z[3] is N if, e.g., N vectors were measured in a loop, and is 1 if no loop is used (as in the example above),
- Z[4] the sum of the CPU cycles of the first APL vector item; therefore Z[4] ÷ Z[3] is the average number of CPU cycles for the first item,
- Z[5] the sum of the squares of the CPU cycles of all first APL vector items (needed to compute the variance). If Z[3] ←→ 1 (one vector measured) then Z[4]×Z[4] ←→ Z[5],
- Z[6] the number of executions of all subsequent APL vector items,
- Z[7] the sum of the CPU cycles of all subsequent APL vector items, and
- Z[8] the sum of the squares of the CPU cycles of all subsequent APL vector items (needed to compute the variance).
In 1990, a particular APL operation, when executed again and again, would have used the same number of CPU cycles in every pass. These days, modern CPUs feature a number of optimizations (caching, branch prediction, etc.) that cause the number of CPU cycles for the same APL operation to differ considerably. To get a better understanding of the varying CPU cycle counts:
- Z[3 4 5] describe the first item of every measured APL vector, while
- Z[6 7 8] describe all subsequent vector items. The CPU cycle count of the first item is normally considerably larger due to the instruction and data caches of the CPU,
- Z[4] ÷ Z[3] and Z[7] ÷ Z[6] are the average numbers of CPU cycles for the respective (i.e. first vs. subsequent) items, and
- (Z[5] ÷ Z[3]) - (Z[4] ÷ Z[3])⋆2 and (Z[8] ÷ Z[6]) - (Z[7] ÷ Z[6])⋆2 are the statistical variances of the first and of the subsequent CPU cycle counts respectively.
The output above, i.e.:
┌→────────────────────────────────────────────┐
│2 ┌→────┐ 1 3087 9529569 999 370713 182857673│
│ │A ÷ B│ │
│ └─────┘ │
└ϵ────────────────────────────────────────────┘
tells us: The statistics type Z[1] was 2 (which, by the way, corresponds to the macro perfo_2() in Performance.def). The operation Z[2] measured by these statistics was A ÷ B. The first division, i.e. 42.1 ÷ ↑⍳1000 ←→ 42.1 ÷ 1, took Z[4] ←→ 3087 CPU cycles. The Z[6] ←→ 999 subsequent divisions, i.e. 42.1 ÷ 1↓⍳1000 ←→ 42.1 ÷ 2 3 4 … 1000, took 370713 CPU cycles in total, or 370713 ÷ 999 ←→ 371 CPU cycles per division. The comparison between the first division (3087 CPU cycles) and the average of the subsequent divisions (371 CPU cycles) demonstrates the effect of caching, etc. Since there was only one first vector item, the sum of squares is simply the square of that item, i.e. Z[5] ←→ 9529569 ←→ 3087×3087 ←→ Z[4]×Z[4] (and a single item always has a variance of 0). If we had run A ÷ B in a loop, say N times, then:
- Z[3] ←→ N,
- Z[4] ÷ N ←→ the average CPU cycles for the first item,
- Z[7] ÷ (N×999) ←→ the average CPU cycles for the subsequent items, and
- usually N×Z[5] ≠ Z[4]×Z[4] (i.e. a non-zero variance) because the CPU cycles for the first item vary between the N iterations.
In statistics we learn that:
- Variance(X) ←→ Expectation(X²) - (Expectation(X))², and
- StandardDeviation(X) ←→ Variance(X) ⋆ 0.5.
The triples Z[3 4 5] and Z[6 7 8] therefore provide all the information needed to compute the Expectation and the Variance of the first and of all subsequent vector items. GNU APL ships with a handful of workspaces (search for get_statistics in the .apl files) that benchmark some scalar APL functions in this way and plot the results.
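As an illustration, the following is a minimal sketch of that computation, assuming Z holds the result of ⎕FIO[201] 16 from the example above (the variable names E1, V1, ES, and VS are chosen here purely for illustration):
E1 ← Z[4] ÷ Z[3]            ⍝ expectation (average CPU cycles) of the first vector items
V1 ← (Z[5] ÷ Z[3]) - E1⋆2   ⍝ variance of the first vector items
ES ← Z[7] ÷ Z[6]            ⍝ expectation of the subsequent vector items
VS ← (Z[8] ÷ Z[6]) - ES⋆2   ⍝ variance of the subsequent vector items
VS⋆0.5                      ⍝ standard deviation of the subsequent vector items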