How rr works

Low-overhead recording and replay of applications (trees of processes and threads).

Record nondeterministic inputs, replay deterministically.

Why?

Offline debugging: record intermittent test failures "at scale" online, debug the recordings offline at leisure.

Deterministic debugging: record nondeterministic failure once, replay deterministically forever.

Omniscient debugging: step backwards in time; issue queries over program state changes.

Overview

rr record prog --args
saves recording to trace/

rr replay trace/
debugger socket drives replay

Most of an application's execution is deterministic.

rr records the nondeterministic parts.

Examples of nondeterministic inputs

Then during replay, emulate system calls and rdtsc by writing the saved nondeterministic data back to the tracee.
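
The record/replay split can be illustrated with a toy model. This is only a sketch under loud assumptions: `Recorder`, `Replayer`, and `program` are hypothetical names, a Python clock and `os.urandom` stand in for rdtsc and a read() syscall, and nothing here uses ptrace. During recording the nondeterministic operation really runs and its result is logged; during replay the operation is skipped and the logged result is handed back.

```python
import os
import time


class Recorder:
    """Runs each nondeterministic operation for real and logs its result."""

    def __init__(self):
        self.trace = []  # ordered (name, result) events

    def call(self, name, fn):
        result = fn()  # actually perform the operation
        self.trace.append((name, result))
        return result


class Replayer:
    """Never performs the operation; returns the logged result instead."""

    def __init__(self, trace):
        self._events = iter(trace)

    def call(self, name, fn):
        recorded_name, result = next(self._events)
        if recorded_name != name:
            # The program asked for something the recording never saw.
            raise RuntimeError(f"diverged: trace has {recorded_name}, got {name}")
        return result


def program(env):
    """Deterministic apart from the two values obtained through env."""
    stamp = env.call("rdtsc", time.perf_counter_ns)   # stand-in for rdtsc
    data = env.call("read", lambda: os.urandom(4))    # stand-in for read()
    return stamp, data.hex()


recorder = Recorder()
recorded_run = program(recorder)
replayed_run = program(Replayer(recorder.trace))
```

Here `replayed_run` equals `recorded_run` no matter how often the replay is repeated, because the only nondeterministic values come from the trace.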

Shared-memory multitasking is a nondeterministic "input".

... but modern hardware can't record it efficiently. So rr doesn't record truly parallel executions.

Scheduling tasks

Can switch tasks at syscalls. Must preempt straight-line code too, and replay the preemptions deterministically.

Hardware performance counters (HPCs)

Recent chips count instructions-retired, branches-retired, ..., and can be programmed to interrupt after a count of x.

Simulate task preemption with HPC interrupts.

Idea: program the insns-retired counter to interrupt after k. That k approximates a time slice.

Replaying preemption

Record the insn-retired counter value v to the trace file. During replay, program the interrupt for v. Voilà.
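
A toy simulation of this idea, assuming (hypothetically) that one generator resumption counts as one retired instruction and that a random skid models the interrupt landing near k rather than exactly at k. `worker`, `run_slice`, `record`, and `replay` are illustrative names, not rr functions. The recorder logs the actual count v for every slice; the replayer runs each task for exactly v.

```python
import random


def worker(name, steps, log):
    """A task; each resumption stands for one retired instruction."""
    for i in range(steps):
        log.append(f"{name}{i}")  # a write to memory shared by all tasks
        yield


def run_slice(task, budget):
    """Run task for up to budget instructions; return (retired, finished)."""
    retired = 0
    while retired < budget:
        try:
            next(task)
        except StopIteration:
            return retired, True
        retired += 1
    return retired, False


def record(make_tasks, k=3):
    rng = random.Random()
    log = []
    tasks = make_tasks(log)
    trace = []  # (task index, v) per time slice
    runnable = list(range(len(tasks)))
    while runnable:
        for idx in list(runnable):
            budget = k + rng.randint(0, 2)  # interrupt fires near k, not at k
            v, finished = run_slice(tasks[idx], budget)
            trace.append((idx, v))
            if finished:
                runnable.remove(idx)
    return log, trace


def replay(make_tasks, trace):
    log = []
    tasks = make_tasks(log)
    for idx, v in trace:
        run_slice(tasks[idx], v)  # "program the interrupt for v"
    return log


def make_tasks(log):
    return [worker("A", 5, log), worker("B", 4, log)]


recorded_log, trace = record(make_tasks)
replayed_log = replay(make_tasks, trace)
```

The interleaving in `recorded_log` differs from run to run, but `replayed_log` always matches the run that produced `trace`.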

UNIX signals are recorded and replayed like task preemptions.

Record counter value v and signum. Replay by interrupting after v and "delivering" signum.
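
The same bookkeeping for a signal, again as a hedged toy: `Task` is a hypothetical tracee whose `retired` field stands in for the hardware counter, and `deliver` stands in for running a signal handler. The only real API used is Python's `signal.SIGUSR1` constant.

```python
import random
import signal


class Task:
    """Toy tracee: a counting loop whose handler observes its progress."""

    def __init__(self):
        self.retired = 0  # stand-in for the instructions-retired counter
        self.seen = []    # (signum, progress at delivery) pairs

    def step(self):
        self.retired += 1

    def deliver(self, signum):
        self.seen.append((signum, self.retired))


def record(total, rng):
    task, trace = Task(), []
    arrival = rng.randrange(total)  # asynchronous: may land anywhere
    for _ in range(total):
        if task.retired == arrival:
            signum = int(signal.SIGUSR1)
            trace.append((task.retired, signum))  # counter value v and signum
            task.deliver(signum)
        task.step()
    return task, trace


def replay(total, trace):
    task = Task()
    for v, signum in trace:
        while task.retired < v:  # interrupt after v
            task.step()
        task.deliver(signum)     # "deliver" signum at the same point
    while task.retired < total:
        task.step()
    return task


recorded, trace = record(100, random.Random())
replayed = replay(100, trace)
```

`replayed.seen` matches `recorded.seen`: the handler observes the same progress in both runs.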

System requirements

Basic requirements

rr touches low-level details of machine architecture, by necessity; e.g., the kernel syscall ABI.

Supporting more ISAs is "just work"; expect x86-64 in the future.

Precise HPC events identify points in execution.

Precise replay of signals and preemption requires interrupting tracees at these events.

Performance counters are messier in reality

seccomp-bpf enables rr to selectively trace syscalls.

Only trap to rr for syscalls that can't be handled in the tracee. Over 100x faster in µbenchmarks.

Buffer syscalls; flush buffer as "super event"

TODO DIAGRAM

No ASLR or ptrace hardening

TODO

Recorder implementation

Tasks are controlled through the ptrace API.

HPCs are controlled through the perf event API.

The first traced task is forked from rr. After that, clone() and fork() from tracees add new tasks.

And tasks die at exit().

Simplified recorder loop

    while live_task():
        task t = schedule()
        if not status_changed(t):
            resume_execution(t)
        handle_event(t)
    

Scheduling a task

    task schedule():
        for each task t, round-robin:
            if is_runnable(t)
               or status_changed(t):
                return t
        tid = waitpid(ANY_CHILD_TASK)
        return task_map[tid]
  

Tasks changing status

    bool status_changed(task t):
        # Non-blocking
        return waitpid(t.tid, WNOHANG)

    # Deceptively simple: includes
    # syscalls, signals, ptrace
    # events ...
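
A runnable counterpart to the non-blocking check, sketched with Python's real `os.fork`, `os.pipe`, and `os.waitpid` on Linux. It does not use ptrace, so the only status change it can observe is the child's exit; `spawn_blocked_child` and `demo` are hypothetical helpers. Note that `os.waitpid` returns a `(pid, status)` tuple, which is always truthy, so the pid has to be compared explicitly.

```python
import os


def status_changed(pid):
    """Non-blocking poll: return the wait status if pid changed state, else None."""
    reaped, status = os.waitpid(pid, os.WNOHANG)
    return status if reaped == pid else None  # reaped is 0 when nothing changed


def spawn_blocked_child(exit_code):
    """Fork a child that sleeps in read() until the parent writes one byte."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        os.close(write_end)
        os.read(read_end, 1)  # blocks inside a syscall
        os._exit(exit_code)
    os.close(read_end)
    return pid, write_end


def demo():
    pid, release = spawn_blocked_child(exit_code=7)
    before = status_changed(pid)    # child cannot have exited yet
    os.write(release, b"x")         # let the child run to exit
    os.close(release)
    _, status = os.waitpid(pid, 0)  # blocking wait, as in resume_execution
    return before, os.WEXITSTATUS(status)
```

`demo()` returns `(None, 7)`: the poll sees no change while the child is blocked, and the blocking wait later collects exit code 7.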
  

Resuming task execution

Invariant: At most one task is running userspace code. All other tasks are either idle or awaiting completion of a syscall.
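
A minimal simulation of the loop under this invariant, assuming tasks can be modelled as Python generators that yield one event per resumption. `task_body` and `recorder_loop` are hypothetical names; the comments map each step to the pseudocode functions of the recorder loop.

```python
from collections import deque


def task_body(name, script):
    """Yield one event per resumption; script is a list of event names."""
    for event in script:
        yield (name, event)


def recorder_loop(tasks):
    """Resume exactly one task at a time, round-robin, and log its events."""
    queue = deque(tasks.items())     # (name, generator) pairs
    events = []
    while queue:                     # while live_task()
        name, gen = queue.popleft()  # schedule()
        try:
            event = next(gen)        # resume_execution(t): only this task runs
        except StopIteration:
            continue                 # task exited; drop it from the rotation
        events.append(event)         # handle_event(t)
        queue.append((name, gen))
    return events


events = recorder_loop({
    "A": task_body("A", ["syscall", "syscall"]),
    "B": task_body("B", ["signal"]),
})
```

Because only one generator is ever resumed at a time, the order of `events` is fixed by the scheduler alone: A, then B, then A again.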

Multiple running tasks are nondeterministic

TODO EXAMPLE RACE, SYSCALL DIVERGENCE

Resuming a task, simplified

    void resume_execution(task t):
        ptrace(PTRACE_SYSCALL, t.tid)
        waitpid(t.tid)  # Blocking

    # Again, deceptively simple: traps
    # for syscalls, signals, ptrace
    # events ...
  

Most recorder work is done in handle_event(task t).

But before looking at it, a few digressions ...

Generating time-slice interrupts

Trapping tracees at rdtsc

Tracees generate ptrace events by executing fork, clone, exit, and some other syscalls.

ptrace events exist for Linux reasons that aren't interesting.

(rr tracees can share memory mappings with other processes.

Not possible to record efficiently in SW; needs kernel and/or HW support. Unsupported until then.)

Tracee events seen by handle_event()

handle_event() structure

TODO

Non-nestable events

TODO

Some syscalls must be executed atomically; can't switch task until syscall finishes.

TODO mmap example

On the other hand, some syscalls require switching; syscall can't finish until the task switches.

TODO waitpid example

Scratch buffers for blocking syscalls

TODO

POSIX signals 101

TODO

Linux signals 202

TODO

Recording signal delivery

TODO

Finishing signal handlers

TODO

Delivering unhandled signals

TODO

This breaks the rr scheduling invariant.

Syscall buffer

ptrace traps are expensive. Do as much work in tracee process as possible.

Use seccomp-bpf to selectively trap syscalls.

Syscall hooks are LD_PRELOAD'd into tracees.

Hooks record kernel return value and outparam data to the syscall buffer.

rr monkeypatches __kernel_vsyscall() in vdso to jump to rr trampoline.

Trampoline calls dispatcher, which calls rr hook if available.

Untraced syscalls are recorded to syscallbuf by tracee. Traced events recorded by the rr process "flush" the tracee's syscallbuf.

The preloaded library falls back on traced syscalls when a call can't be buffered.
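
The buffering and flush ordering can be modelled in a few lines. This is a toy under stated assumptions: `ToyRecorder` and `ToyTracee` are hypothetical, the set of hooked syscalls is invented for illustration, and the real syscallbuf is memory inside the tracee rather than a Python list.

```python
class ToyRecorder:
    """Counts traps and keeps the trace; a stand-in for the rr process."""

    def __init__(self):
        self.trace = []
        self.traps = 0

    def traced_syscall(self, tracee, name, result):
        self.traps += 1      # every trap is the expensive path
        self.flush(tracee)   # buffered records happened first
        self.trace.append(("syscall", name, result))

    def flush(self, tracee):
        if tracee.syscallbuf:
            # All buffered records enter the trace as one "super event".
            self.trace.append(("syscallbuf_flush", list(tracee.syscallbuf)))
            tracee.syscallbuf.clear()


class ToyTracee:
    """Buffers syscalls that have a hook; falls back to tracing otherwise."""

    def __init__(self, recorder, hooked):
        self.recorder = recorder
        self.hooked = set(hooked)
        self.syscallbuf = []

    def syscall(self, name, result):
        if name in self.hooked:
            self.syscallbuf.append((name, result))  # no trap to the recorder
        else:
            self.recorder.traced_syscall(self, name, result)
        return result


rec = ToyRecorder()
tracee = ToyTracee(rec, hooked={"close", "gettimeofday"})  # hypothetical hook set
tracee.syscall("gettimeofday", 0)
tracee.syscall("close", 0)
tracee.syscall("execve", 0)  # no hook: traced, and flushes the buffer
```

Three syscalls cost one trap, and the trace keeps the buffered records ahead of the traced event that flushed them.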

Simplified example of wrapper function

    static int sys_close(int fd)
    {
        long ret;
        if (!start_buffer_syscall(SYS_close)) {
            /* Fall back on traced syscall. */
            return syscall(SYS_close, fd);
        }
        /* Buffer this close() call. */
        ret = untraced_syscall1(SYS_close, fd);
        return commit_syscall(SYS_close, ret);
    }
  

How untraced syscalls are made

resume_execution changes for PTRACE_SECCOMP events

TODO

Syscallbuf hooks of may-block syscalls

TODO

perf events to the rescue: "descheduled" event

TODO

Handling "desched notifications"

TODO

Saved traces

Trace directory contents

Replayer implementation

Emulate most syscalls using trace data.

Actually execute a small number.

Built around PTRACE_SYSEMU

TODO show difference from PTRACE_SYSCALL

Main loop overview

TODO

Replaying time-slice interrupts, in theory

TODO

Replaying time-slice interrupts, in practice

Delivering signals

TODO

Replaying buffered syscalls

TODO

Debugger interface

(gdb) set i 10 can cause replay divergence.

So you're not allowed to do it.
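
Why a write diverges, as a toy: `ReplayEnv` and `tracee` are hypothetical, and the `poke` parameter stands in for the debugger changing `i`. The replayer can only serve the syscall sequence that was recorded, so a changed loop bound asks for an event the trace does not contain.

```python
class Divergence(Exception):
    pass


class ReplayEnv:
    """Replays a recorded syscall sequence and rejects any other sequence."""

    def __init__(self, trace):
        self.trace = list(trace)
        self.pos = 0

    def syscall(self, name):
        if self.pos >= len(self.trace) or self.trace[self.pos] != name:
            raise Divergence(f"event {self.pos}: tracee made {name!r}")
        self.pos += 1


def tracee(env, poke=None):
    i = 3
    if poke is not None:
        i = poke  # stands in for the debugger writing to i
    for _ in range(i):
        env.syscall("write")
    env.syscall("exit")


recorded = ["write", "write", "write", "exit"]
tracee(ReplayEnv(recorded))  # unmodified state replays cleanly

diverged = False
try:
    tracee(ReplayEnv(recorded), poke=10)  # fourth write is not in the trace
except Divergence:
    diverged = True
```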

Light wrapper around gdb protocol

TODO

Replayer core passes through to ptrace requests of tracee

TODO

SIGTRAP, SIGTRAP, SIGTRAP; breakpoints, int3, stepi

TODO distinguishing causes of traps

Future work

Roadmap for near future

TODO

Checkpointing

TODO

Omniscient debugging

TODO

Exploratory scheduling; targeted recording

TODO

Copy traces across machines

TODO

Record shared-memory multithreading

TODO

Record ptrace API

TODO

Port, port, port

TODO

Thanks from the rr team!

rr for RnR people

Release 0.4 available today at

rr-project.org

Use cases

Design concerns

rr recorder overview

Trade-off: scheduling from userspace

Headache: kernel writes racing with userspace

rr replayer overview

Replayer headache: slack in counter interrupts

Recorder "fast mode": syscall buffering

Headache: many syscalls made internally in glibc

Headache: buffering syscalls that may block

Fun debugging tricks