
Performance Comparison of DynamoRIO and Pin

Posted by Derek Bruening on Sep 18, 2008 9:36:16 AM

Since DynamoRIO has just been re-released (see "DynamoRIO Dynamic Instrumentation Tool Platform" and "DynamoRIO Dynamic Instrumentation Tool Platform for Linux"), let's re-evaluate where it stands performance-wise against another popular dynamic instrumentation system, Pin.

 

Pin's 2005 PLDI paper (Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation) gives performance numbers for DynamoRIO version 0.9.3 and Valgrind version 2.2.0 versus the latest Pin at the time.  The paper shows base performance as well as performance with a simple basic block execution count tool.  Let's model our experiments on those measurements.

 

I'm using DynamoRIO version 0.9.6 build 9601 and Pin pin-2.5-20751-gcc.4.0.0-ia32_intel64-linux.  My compiler is:

gcc version 4.3.0 20080428 (Red Hat 4.3.0-8) (GCC)

 

I ran the SPEC CPU2000 integer benchmarks, as the paper did (and also because I have yet to set up a copy of SPEC CPU2006).  I used an Intel Core 2 Q9300 quad-core processor at 2.50GHz on a machine with 4GB RAM running Fedora 9.  The benchmarks and tools were all 32-bit.

 

Here are the results with no tool (just measuring the base infrastructure):

 

 

 

DynamoRIO is consistently faster than Pin, with an average slowdown of 34% versus Pin's 71%.

 

And here are the results with the basic block execution count tools (source code displayed below):

 

 

 

For DynamoRIO, two different versions of the tool are shown.  The first always saves the arithmetic flags, as was done in the Pin paper.  The second performs a simple analysis and only saves the flags when necessary.  Both outperform the Pin tool.

 

Pin Client

 

I followed the advice at http://rogue.colorado.edu/Pin/docs/20751/Pin/html/index.html#PERFORMANCE and used IPOINT_ANYWHERE and PIN_FAST_ANALYSIS_CALL.  Note that both of these clients ignore racy increments to the counter from multiple threads.

 

#include "pin.H"
#include <iostream>

int bbcount;

VOID PIN_FAST_ANALYSIS_CALL docount() { bbcount++; }

VOID Trace(TRACE trace, VOID *v) {
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
        BBL_InsertCall(bbl, IPOINT_ANYWHERE, AFUNPTR(docount),
                       IARG_FAST_ANALYSIS_CALL, IARG_END);
    }
}

VOID Fini(int, VOID * v) {
#ifdef SHOW_RESULTS
    std::cout << "Count is " << bbcount << std::endl;
#endif
}

int main(int argc, CHAR *argv[]) {
    PIN_InitSymbols();
    PIN_Init(argc, argv);
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();
    return 0;
}

 

DynamoRIO Client

 

As noted above, both of these clients ignore racy increments to the counter from multiple threads.  For DynamoRIO, adding a LOCK prefix to the inc instruction is easy to do; however, it has a significant performance impact (roughly a 3x slowdown in a quick test).  Using thread-private counters and aggregating the counts at the end would be faster.

 

#include "dr_api.h"

#define TESTALL(mask, var) (((mask) & (var)) == (mask))
#define TESTANY(mask, var) (((mask) & (var)) != 0)

static int global_count;

static dr_emit_flags_t
event_basic_block(void *drcontext, void *tag, instrlist_t *bb,
                  bool for_trace, bool translating)
{
    instr_t *instr, *first = instrlist_first(bb);
    uint flags;
    /* Our inc can go anywhere, so find a spot where flags are dead. */
    for (instr = first; instr != NULL; instr = instr_get_next(instr)) {
        flags = instr_get_arith_flags(instr);
        /* OP_inc doesn't write CF but not worth distinguishing */
        if (TESTALL(EFLAGS_WRITE_6, flags) && !TESTANY(EFLAGS_READ_6, flags))
            break;
    }
    if (instr == NULL)
        dr_save_arith_flags(drcontext, bb, first, SPILL_SLOT_1);
    instrlist_meta_preinsert
        (bb, (instr == NULL) ? first : instr,
         INSTR_CREATE_inc(drcontext, OPND_CREATE_ABSMEM
                          ((byte *)&global_count, OPSZ_4)));
    if (instr == NULL)
        dr_restore_arith_flags(drcontext, bb, first, SPILL_SLOT_1);
    return DR_EMIT_DEFAULT;
}

static void event_exit(void)
{
#ifdef SHOW_RESULTS
    dr_printf("Count is %d\n", global_count);
#endif
}

DR_EXPORT void dr_init(client_id_t id)
{
    dr_register_exit_event(event_exit);
    dr_register_bb_event(event_basic_block);
}
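
The thread-private alternative mentioned in the note above can be sketched outside of any instrumentation framework.  The following is a minimal standalone illustration using pthreads and C11 atomics, not DynamoRIO or Pin API code: the names NUM_THREADS, INCS_PER_THREAD, private_counter_t, worker and run_and_aggregate are all hypothetical, and the 64-byte padding assumes a typical x86 cache line.  Each thread bumps a shared atomic counter (the analogue of a LOCK-prefixed inc) and also its own private counter, and the private counters are summed once the threads finish.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4           /* hypothetical thread count for the demo */
#define INCS_PER_THREAD 100000  /* hypothetical "basic blocks" per thread */

/* Shared counter updated atomically (the LOCK-prefixed inc analogue). */
static atomic_long shared_count;

/* One private counter per thread, padded to an assumed 64-byte cache
 * line so that neighbouring counters do not share a line. */
typedef struct {
    long count;
    char pad[64 - sizeof(long)];
} private_counter_t;

static private_counter_t private_counts[NUM_THREADS];

static void *worker(void *arg)
{
    private_counter_t *mine = (private_counter_t *)arg;
    for (long i = 0; i < INCS_PER_THREAD; i++) {
        atomic_fetch_add(&shared_count, 1); /* contended, atomic */
        mine->count++;                      /* uncontended, plain increment */
    }
    return NULL;
}

/* Runs the workers, then returns the sum of the private counters. */
static long run_and_aggregate(void)
{
    pthread_t threads[NUM_THREADS];
    long total = 0;

    atomic_store(&shared_count, 0);
    for (int i = 0; i < NUM_THREADS; i++) {
        private_counts[i].count = 0;
        int rc = pthread_create(&threads[i], NULL, worker, &private_counts[i]);
        assert(rc == 0);
        (void)rc; /* silence unused warning when NDEBUG is defined */
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        total += private_counts[i].count; /* aggregate at the end */
    }
    return total;
}
```

Both approaches arrive at the same total; the difference is that the private increments never contend for a cache line.  In a real client the per-thread counter would live in thread-local storage provided by the framework rather than in a fixed array.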


Sep 18, 2008 11:30 AM Michael E. Locasto says:

Any conjecture as to the main source of the difference between the bases of both frameworks? Pin seems to be doing consistently more work. Is Pin doing a bunch of record-keeping that supports some type of efficient checkpointing, or thread tracking, or ... ? That's probably a question for the Pin folks to answer, but I was wondering if you had any insights here. What do the results from twolf, vpr, and mcf (and to some extent gzip) indicate about that difference?

Sep 18, 2008 12:19 PM Derek Bruening says in response to Michael E. Locasto:

The difference is likely a combination of macro design differences (e.g., whether application code is simply copied to the code cache (DynamoRIO) or is instead optimized (Pin)) and micro design differences (e.g., the exact machine instruction sequences used in performance-critical in-cache operations).  AFAIK no user-visible features such as checkpointing are at play here.  Note that the extra wall-clock time spent need not all come from a higher instruction count and thus what we usually think of as "more work": greater data cache pressure and subsequently more cache misses can also account for performance differences.

 

For gcc and perlbmk, in particular, Pin incurs extra overhead beyond DynamoRIO by spending too much time trying to optimize, at least according to the Pin paper I point to above.

 

For DynamoRIO, the performance of these types of benchmarks with a lot of code reuse and a lot of indirect branches is dictated by indirect branch handling.  The time spent in DynamoRIO proper (the decoder, dispatcher, code generator, etc.) is negligible here because the amount of code is (relatively) small, and the time spent in the code cache on non-indirect branches is not very different from native (with some exceptions).  Indirect branches are where a code caching tool must do some work and access some data that is not present natively.  There are many design choices in precisely how application addresses are mapped to code cache addresses, with some choices prevailing on some application types and some microarchitectures but losing on others.
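
To make the address-mapping step concrete, here is a minimal sketch of one possible design: an open-addressed hash table from application addresses to code cache addresses.  This is purely illustrative and is not how DynamoRIO or Pin actually implements its lookup; the names ibl_table, ibl_add and ibl_lookup, the table size, and the low-bits hash are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IBL_TABLE_BITS 10  /* hypothetical size: 1024 entries */
#define IBL_TABLE_SIZE (1u << IBL_TABLE_BITS)
#define IBL_TABLE_MASK (IBL_TABLE_SIZE - 1)

typedef struct {
    uintptr_t app_pc;   /* application target address (0 marks an empty slot) */
    uintptr_t cache_pc; /* address of the translated copy in the code cache */
} ibl_entry_t;

static ibl_entry_t ibl_table[IBL_TABLE_SIZE];

/* Simplest possible hash: the low bits of the application address. */
static unsigned ibl_hash(uintptr_t app_pc)
{
    return (unsigned)(app_pc & IBL_TABLE_MASK);
}

/* Records a mapping; returns 0 on success, -1 if the table is full. */
static int ibl_add(uintptr_t app_pc, uintptr_t cache_pc)
{
    unsigned idx = ibl_hash(app_pc);
    assert(app_pc != 0); /* 0 is reserved as the empty-slot marker */
    for (unsigned probes = 0; probes < IBL_TABLE_SIZE; probes++) {
        ibl_entry_t *e = &ibl_table[idx];
        if (e->app_pc == 0 || e->app_pc == app_pc) {
            e->app_pc = app_pc;
            e->cache_pc = cache_pc;
            return 0;
        }
        idx = (idx + 1) & IBL_TABLE_MASK; /* linear probing */
    }
    return -1;
}

/* Returns the code cache address for app_pc, or 0 on a miss.  On a miss,
 * a real system would leave the cache and build the missing fragment. */
static uintptr_t ibl_lookup(uintptr_t app_pc)
{
    unsigned idx = ibl_hash(app_pc);
    for (unsigned probes = 0; probes < IBL_TABLE_SIZE; probes++) {
        const ibl_entry_t *e = &ibl_table[idx];
        if (e->app_pc == app_pc)
            return e->cache_pc;
        if (e->app_pc == 0)
            return 0;
        idx = (idx + 1) & IBL_TABLE_MASK;
    }
    return 0;
}
```

Every indirect branch executed from the cache pays for something like ibl_lookup, which touches data that the native program never reads.  The choices of hash function, table size, collision policy, and whether the lookup is inlined or shared are the kind of design choices that win on some workloads and lose on others.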

 

Sep 28, 2008 4:37 AM Tevi Devor says:

Please see comment posted by tevi (from PIN team at Intel) at:

http://www.govirtual.org/docs/DOC-1400#cf

 

The Pin team at Intel is getting better performance numbers for Pin than those presented here. We would like to resolve the difference.

Jan 30, 2009 4:44 AM Robert Cohn says:

I don't think the flags-are-dead optimization in the DynamoRIO client is correct. 'add' instructions modify the flags, but if an add instruction reads memory it could fault, in which case eflags would not be modified. I don't see how DynamoRIO can fix up the state so that the fault handler sees the correct eflags value if the client is responsible for handling eflags.