=head1 TITLE

Calling C Functions

=head1 STATUS

Proposal, supersedes parts of F. This document does not describe HLL
issues of interfaces, but rather the internals that might be needed to
implement them.

=head1 AUTHOR

Leopold Toetsch

=head1 ABSTRACT

Parrot calls C functions on behalf of user code in a vast variety of
ways: opcode and vtable functions, (multi-)subs and -methods. All
these functions are called with differing calling schemes, which
causes a lot of issues.

This document covers mainly opcodes. The description of the current
state also shows some problems with vtables and methods, but a
generalization to vtables/methods and interfaces still needs more
specification from the namespace front.

This proposal tries to describe the current state and a possible
solution.

=head1 DESCRIPTION OF CURRENT STATE

Classification of function types.

=head2 Opcode Functions

Parrot currently has more than 1200 opcodes. There are basically two
types of opcodes. First: simple, easily-JITtable [1] (mostly
number-related) opcodes, which have a representation at the hardware
level. Second: opcodes that a) just call another function, b) are too
complex to deserve an opcode, or c) would better be methods.

=head3 a) just calling another function

E.g.

  op find_cclass(out INT, in INT, in STR, in INT, in INT)

calls the API C function with the very same arguments in the very
same order. The C argument of the opcode is the return value of the
called function.

=head3 b) too complex

E.g.

  op gcd(out INT, out INT, out INT, in INT, in INT) :advanced_math

Please have a look at the generated code in the function runcore at F
(or any other runcore). (Line numbers may vary a bit due to code
changes.)

=head3 c) better a method

A lot of the C or C opcodes should rather be methods. E.g.

  op split(out PMC, in STR, in STR) :base_core

=head2 Sidenote: opcode permutations

A major problem with the current scheme is opcode permutations. Due
to variable and constant arguments, we get C<2^n> opcodes instead of
just one.
C<n> is the number of incoming arguments that allow both variables and
constants. E.g. the user-visible opcode (F):

  B<sub>(out INT, in INT, in INT)

actually is one of:

  sub_i_i_i    # sub I0, I1, I2
  sub_i_i_ic   # sub I0, I1, 100
  sub_i_ic_i   # sub I0, 100, I2
  sub_i_ic_ic  # sub I0, 100, 10  -> set I0, 90

The last one doesn't exist in reality; it is precalculated at compile
time by evaluating C<sub> with the constants stuffed into registers.

There are B<16> incarnations of the mentioned C<sub> opcode B. See
e.g. F. The switched, CGoto, and CGP runcores also have
implementations of this I<one> opcode, so that we finally have B<64>
variants that all just call the same function.

This is all well hidden behind (Perl) magic, if you don't look at the
generated code. The C<*.ops> files are easy to maintain, but we should
also be aware of the generated code.

The JIT runtime doesn't and won't [2] implement such opcodes; it just
calls the normal opcode function inside F, which adds another call
level to reach the target function.

=head2 Sidenote 2: Opcode permutations and predereferencing/JIT

There are 2 implementation choices:

=over 4

=item a) thread-independent code

The code is reusable by multiple threads in the interpreter, but
therefore cannot be dereferenced fully. A constant is dereferenced to
the memory address of the constant, but a register is represented as
an offset from the (per-(thread)interpreter) register base pointer.
This produces slower code and suffers from opcode permutations.

=item b) fully dereferenced

Registers are also dereferenced to addresses, which collapses the code
for constants and registers into just one case. The opcode permutation
is gone and the runops core size is just a quarter of the current one,
but each thread needs another copy of the predereferenced or compiled
opcode stream(s).

=back

Currently a) is implemented. JIT code generation could use either
option at runtime with some adjustments.
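The permutation problem described above can be illustrated with a
minimal C sketch of the kind of code an ops-to-C generator emits. All
names here (C<FakeInterp>, C<IREG>, C<ICONST>) are illustrative
stand-ins, not Parrot's actual macros; the point is that the bodies
differ only in how each operand is fetched, so C<n> variable/constant
positions multiply into C<2^n> near-identical functions:

```c
#include <assert.h>

typedef long INTVAL;
typedef long opcode_t;

/* Simplified interpreter state: a register file and a constant table. */
typedef struct {
    INTVAL *int_reg;
    INTVAL *const_tab;
} FakeInterp;

/* Operand i of the current instruction, as register or constant. */
#define IREG(i)   (interp->int_reg[cur_op[(i)]])
#define ICONST(i) (interp->const_tab[cur_op[(i)]])

/* All three operands are registers. */
static opcode_t *sub_i_i_i(opcode_t *cur_op, FakeInterp *interp)
{
    IREG(1) = IREG(2) - IREG(3);
    return cur_op + 4;              /* next pc */
}

/* Identical body, except the third operand is a constant. */
static opcode_t *sub_i_i_ic(opcode_t *cur_op, FakeInterp *interp)
{
    IREG(1) = IREG(2) - ICONST(3);
    return cur_op + 4;
}
```

Two of the sixteen C<sub> bodies are shown; the other fourteen (and
their copies in the other runcores) follow the same pattern.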
=head2 Opcode function complexity and slowness

An opcode function is basically a multi-sub, which gets dispatched by
the runcore depending on (type, type_const) per opcode argument that
allows constants. The argument is then dereferenced according to its
type.

But dereferencing arguments is done in each opcode, and while it's
specialized to the type, it is still fairly complex. Dereferencing an
C<INTVAL> register on x86 takes 3 CPU instructions; on PPC (or other
CPUs that lack (base,index,scale) memory access) it's 4 CPU
instructions. This of course needs additional registers to hold
intermediate results, which causes extra pain on the register-starved
x86 architecture. 5 more instructions are needed to preserve and
restore non-volatile registers on the hardware stack.

Here is a disassembly of the opcode function C<Parrot_sub_i_i_i>
(optimized compile of course):

  (gdb) disas Parrot_sub_i_i_i
  Dump of assembler code for function Parrot_sub_i_i_i:
  0x080c9860 <+0>:   push   %ebp
  0x080c9861 <+1>:   mov    %esp,%ebp
  0x080c9863 <+3>:   sub    $0x8,%esp
  0x080c9866 <+6>:   mov    %esi,(%esp)
  0x080c9869 <+9>:   mov    %edi,0x4(%esp)
  0x080c986d <+13>:  mov    0x8(%ebp),%eax
  0x080c9870 <+16>:  mov    0xc(%ebp),%edx
  0x080c9873 <+19>:  mov    0xc(%eax),%esi
  0x080c9876 <+22>:  mov    0x4(%edx),%ecx
  0x080c9879 <+25>:  mov    0x8(%eax),%edx
  0x080c987c <+28>:  mov    0x4(%eax),%edi
  0x080c987f <+31>:  add    $0x10,%eax
  0x080c9882 <+34>:  mov    (%ecx,%edx,4),%edx
  0x080c9885 <+37>:  sub    (%ecx,%esi,4),%edx
  0x080c9888 <+40>:  mov    %edx,(%ecx,%edi,4)
  0x080c988b <+43>:  mov    (%esp),%esi
  0x080c988e <+46>:  mov    0x4(%esp),%edi
  0x080c9892 <+50>:  mov    %ebp,%esp
  0x080c9894 <+52>:  pop    %ebp
  0x080c9895 <+53>:  ret

We have:

=over 4

=item * call ABI including register frame setup/teardown: 0, 1, 50-53

=item * register preservation and restore: 3-9, 43, 46

=item * argument dereferencing: 13-28, 34-40

=item * setting the next pc: 31

=item * the actual work: 37 (and only partly; the rest of that
instruction is dereferencing)

=back

As such opcode functions are also called by the JIT core for
non-JITted or non-JITtable opcodes, the usually fastest runcore also
pays a very big penalty caused by these opcode functions.
=head2 Opcode Roundup

I have been in favor of keeping only simple opcodes for a long time.
JIT/x86 currently has around 200 implemented opcodes, which is about
the amount that is JITtable. OTOH I'd like to have more opcodes, e.g.
to access int8, int16, int32, int64, float, and vectors of these.

=head2 It sometimes doesn't even compile

  >One more note: be sure to compile Parrot optimized

  I can't. My dev machine's running gcc 2.95.4, and gcc throws
  lisp error messages compiling the switch core if I turn on
  optimizations.
                                                        -- Dan

As said, we currently have more than 1200 opcode variants. Continuing
the "everything is an opcode" approach (but see below) could end with
opcode counts and runloop functions that are by far beyond C compiler
optimization or even compilation limits.

=head2 A Remark WRT Loadable Opcode Libraries

A proposed scheme to circumvent the mentioned compiler troubles is to
load parts of the opcodes as shared libraries. This would of course
reduce the core opcode count, but it doesn't change the overall
problem: these loaded opcodes would still exhibit the permutation cost
of operand addressing and - depending on the implementation - some of
the code multiplication by providing differing runcore versions of the
opcodes.

Only the slowest runcore (function-based) can call these opcode
functions with no speed loss; all other runcores need some extra code
or would just call the loaded functions anyway, which renders the
achievable opcode speed 'optimization' moot. This means that we can
call a plain runtime library function at the same speed that a loaded
opcode function would provide, just without the overhead of code
permutation inherently present with current opcodes.

=head2 VTABLE Functions

Vtable functions implement the object and class behaviour of PMCs.
Calling a vtable function is quite fast. I see several disadvantages
though.

=head3 Vtable size

A I<vtable> provides all the builtin interface methods of all PMCs.
When you look at F you see a lot of vtable methods, like C, which just
call the default implementation, which then throws an exception. All
PMCs have all vtable slots allocated, independent of any usage. This
takes a lot of memory, around 1 KByte (32-bit CPU) per class. There
are a lot of utility PMCs like C that implement almost no
functionality via the C. The whole vtable is basically unused memory
and just increases Parrot's memory footprint.

=head3 Extendibility

As the C structure with all its slots is strictly defined at compile
time, you can't just extend the vtable for one class or PMC to get
more functionality. All PMCs and all classes would get that new slot.
This means that we can't create yet unknown interfaces at runtime,
which would be based on vtables.

=head3 Inheritance

C method inheritance is done at compile time by the PMC compiler F.
Therefore you can't override any of these C methods at runtime. You
can create a new derived class, though, which inherits from a PMC,
e.g.

  cl = subclass 'Integer', 'MyInt'

and then override e.g. the C<__absolute> method:

  .namespace ['MyInt']
  .sub __absolute :method
    debug('MyInt.abs() called')
    ...
  .end

But this works only statically for already existing methods during
creation of the C<MyInt> class. Multiple inheritance, like e.g. needed
for the C PMC, doesn't work at all.

=head3 Introspection

Builtin vtables aren't visible with the C<can> opcode.

  .local pmc Int, MyInt            # a .Integer and a 'MyInt' PMC
  Int   = new .Integer
  MyInt = new 'MyInt'
  $I0 = can MyInt, '__absolute'    # 1
  $I1 = can Int,   '__absolute'    # 0

and you can't use method call syntax at all for vtables:

  $I1 = Int.'__absolute'()         # Method '__absolute' not found

There is also no metadata about vtable function arguments or the
return value type (if any).

=head3 Opcode and vtable permutations

Vtable functions partly suffer from the same argument permutation
problem as opcodes. From a HLL point of view a vtable function is part
of some interface. E.g. the C<Array> PMC I<does> the I<array>
interface.
One part is to fetch an item from the array:

  elem = ar[idx]

When we now have a look at vtable.tbl, we get:

  $ grep get_\.\*keyed vtable.tbl | wc -l
  15

a long list of 15 C<get_*_keyed> functions that together build one
part of the I<array> interface, namely 'fetch' an element. It's of
course a nice optimization to be able to use native integers for the
index, and convenient to directly fetch a C element out of the PMC
array, but imagine now that Perl6 wants to create a tied array with a
user-defined C<FETCH> method. Overriding just the all-PMC variant
would work if all arguments are PMCs, but it would already fail for a
native integer key. That is, a C<FETCH> implementation would have to
override more or less all of these fetch methods.

The same is true for the C<set_*_keyed> operation, which consists of
another bunch of 15 vtable functions. Due to opcode permutation
(constant vs variable) we need 28 opcodes for each of these 2
interface parts.

=head3 Unimplemented vtable slots

There is almost no support for C<STRING> operations within vtables.
Currently a typical opcode sequence to perform a string function on a
PMC is:

  set S0, P0          # e.g. a .String PMC
  length I0, S0       # get (codepoint) length of String P0
  new P1, .Integer
  set P1, I0

This becomes of course worse if more strings are involved:

  set S0, P0
  set S1, P1
  substr S2, S0, 1, 2, S1
  new P2, .String
  set P2, S2
  set P0, S0

That is six opcode dispatches for a plain C<substr> functionality
instead of just one.

Also a lot of PMC I<conversion> functions are missing, e.g. a plain:

  q = p.get_numeric()    # return a number PMC: $q = +$p;

Only C<get_integer> or C<get_number> are in the vtable, which return a
native C<INTVAL> or C<FLOATVAL> respectively. For completeness, we'd
still have to extend the vtable by a fair amount of slots, which would
only add to the mentioned size and permutation problems.

=head2 Builtin Methods

To work around the inability to extend vtables, there is another
calling scheme, implemented via the C<METHOD> syntax in F<.pmc> files.
E.g. in F:

  METHOD PMC* lower() {
    ...
  }

and we can use this like any other method from PASM/PIR:

  .local pmc s0, s1
  ...
  s1 = s0."lower"()

The drawback is that calling these methods works totally differently,
as these methods are created as NCI PMCs. There is no symmetry between
vtable methods like C and "real" methods like C<lower>.

=head2 Builtin Multi-Subs

All the infix operations call (possibly builtin) multi subroutines.
The infix operations are usable as opcodes:

  add P0, P1, P2

which actually gets translated to:

  infix .MMD_ADD, P0, P1, P2

They can be used in a function call:

  "__add"(l, r, d)

or as a method:

  d = l."__add"(r)

Multi-subs are the most flexible way to implement a given
functionality, as a dedicated function is called that already deals
with the correct argument types.

=head2 Method Overloading and GC Issues

Methods and vtables can be overridden with PASM/PIR code. This
currently needs the invocation of an extra run loop. We then have a C
stack layout like e.g. this (assuming the stack grows downwards):

  switch_core()                 # e.g. when running -S
  ParrotObject_get_integer()    # in ParrotObject 2)
  run_meth_fromc_args_reti()    # 2)
  runops_args()                 # src/inter_run.c
  runops()
  runops_int()
  switch_core()                 # next run loop

  2) Parrot_ prefix omitted, but you get the picture.

With this C stack layout, we basically have two problems:

=head3 The C stack isn't traced if GC is called from the runloop

A C<sweep> opcode issued in the runloop doesn't trace the C stack.
Here is the reason why:

=over

=item Diagram: stack depth vs execution time

  C stack, time A         C stack, time B
  ------------------------------------------
  run_core                run_core
  [ Vtable function ]     [ some function ]
  [ stack alignment ]     [ another function ]
  local PMC variable A    [ stack alignment ]
  local PMC variable B
                                          ---> time

=item Stack alignment

The ABIs of C or C all define a stack alignment of 16 bytes. To keep
that alignment, the stack pointer is just decremented by some amount
if needed. This can create 'holes' in the C stack, which aren't filled
by any local variable. The memory just keeps its old contents.
=item Timely destruction and conservative GC

Parrot has a conservative GC, which basically means: every memory
location that looks like a pointer to an object keeps this object
alive. If there is an INTVAL that holds the address of a PMC, that PMC
is marked as alive. When such a PMC needs timely destruction, the GC
would find it alive, even if no real reference to that PMC exists
anymore, and timely destruction doesn't work anymore.

=item All together

Due to the stack alignment there is a fair (and fully unpredictable)
chance that "old" memory looks like a PMC. In the diagram above, the
C<local PMC variable B> is still visible on the C stack at time C<B>.
This leads to subtle bugs, which are almost impossible to track down.
It took me 3 days to even grasp what was going on here and some more
to fully analyze the problem (yes, this bug did occur, and it's the
reason for not tracing the C stack at the runloop level).

=back

=head3 Not tracing the C stack is also wrong

When there is a secondary run loop, all items further up the stack
aren't traced, *if* GC is triggered by the C<sweep> opcode. This can
lead to infant mortality F, unless extra care is taken to anchor
objects manually. This is of course error-prone as hell, because you
don't always know that somewhere down the stack another runloop will
be started.

=head2 MMD issues

Currently the MMD lookup is fully dynamic, but it is cached in a
quadratic table per type pair. Besides the size issues (the table is
already 1 MByte just for Parrot core types) there is also another
problem: the table holds pointers to C code and to PMCs. The kind of
pointer is encoded by setting the low bit of the pointer. But this
hack fails miserably if the architecture doesn't align C functions at
least at word boundaries, or if it uses the low bits for permissions
(e.g. hppa).
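The low-bit tagging hack can be sketched in a few lines of C. This is
an illustration of the technique as described above, not Parrot's
actual code; the macro and type names (C<mmd_slot_t> etc.) are
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* A cache slot holds either a C function pointer or a PMC pointer,
 * distinguished by the least significant bit of the pointer value. */
typedef void *mmd_slot_t;

#define MMD_IS_PMC(slot)  ((((uintptr_t)(slot)) & 1u) != 0)
#define MMD_TAG_PMC(p)    ((mmd_slot_t)(((uintptr_t)(p)) | 1u))
#define MMD_UNTAG_PMC(s)  ((void *)(((uintptr_t)(s)) & ~(uintptr_t)1u))

/* The hack silently assumes that genuine pointers never have bit 0
 * set. On an architecture that doesn't align C functions to even
 * addresses, or that uses low pointer bits itself (e.g. hppa),
 * MMD_IS_PMC() misclassifies entries - which is exactly the failure
 * mode described above. */
```

The design relies entirely on an alignment guarantee the C standard
does not give, which is why it breaks on some platforms.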
=head1 Function Summary

=head2 Table of C Functions and Major Characteristics

            speed   extend   inherit   introspect   call scheme
  ---------------------------------------------------------------
  Opcode      +       1)        -          +         Opcode
  Vtable      +       -       static       -         Vtable
  METHOD      -       2)        +          +         NCI
  Multi       -       2)        +          +         C 3)

  1) Dynamic opcodes, but only usable as functions.
  2) Without opcode rewriting/caching.
  3) Or NCI if a multi-sub is obtained by find_name.

=head2 Pros for the current scheme

=over 4

=item * It works (more or less) and is implemented.

=item * Some items are fixable. E.g. we could create an introspection
add-on for vtables.

=back

=head2 Cons

=over 4

=item * Code size

=item * May hit hard compiler limits by adding further opcodes
(permutations)

=item * Code complexity due to 4 differing schemes

=item * Fixing current issues needs still more code and adds
complexity

=item * Big memory footprint with bad cache locality

=item * Implementing interfaces is complex when it comes to mixing
different vtables and methods to create new functionality

=back

=head1 PROPOSAL

=head2 Summary

The 4 differing call schemes are replaced by just one, named B<CSub>.
The calling conventions for a B<CSub> are standardized.

=head2 B<CSub>

Well, there is already an existing F which is unused. I'm just reusing
it. Anyway, a B<CSub> provides at least these fields:

  function_ptr       ... pointer to the C function
  name               ... function name
  function_signature ... kind of params / return value

The B<CSub> replaces C for calls inside the interpreter (to known
functions).

=head2 Calling conventions: B<CSub>

All functions pointed to by a B<CSub> have the same function
prototype:

  typedef int (*csub_fun)(Interp *, void **args);

The C<args> array is defined to hold these items:

  args[0]          ... address of the return value, if the function
                       returns something
  args[0/1..n-1/n] ... the n arguments the function is getting. These
                       are either INTVALs or pointers to other types.
  args[n/n+1]      ... pc of this instruction

The return value is set to 1 if the function changed the C<pc>, i.e.
it causes a branch, and to 0 otherwise.
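As a concrete illustration, here is a minimal sketch of the C<sub>
opcode body under this calling convention, matching the disassembly
shown below: incoming INTVALs are passed by value in the C<args>
array, the C<out> argument as a destination address. C<INTVAL> and the
opaque C<Interp> are simplified stand-ins for this sketch:

```c
#include <assert.h>

typedef long INTVAL;
typedef struct Interp Interp;   /* opaque for this sketch */
typedef int (*csub_fun)(Interp *, void **args);

/* 'sub' as a CSub: args[0] is the address of the destination,
 * args[1] and args[2] carry the INTVAL operands by value, and
 * args[3] would hold the pc (unused here). */
static int csub_sub_i_i_i(Interp *interp, void **args)
{
    (void)interp;                       /* not needed for plain math */
    INTVAL a = (INTVAL)args[1];
    INTVAL b = (INTVAL)args[2];
    *(INTVAL *)args[0] = a - b;
    return 0;                           /* pc unchanged: no branch */
}
```

Because the body no longer dereferences registers itself, it compiles
to roughly the handful of instructions shown in the disassembly that
follows.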
The C<pc> is used for functions that are called from the runloop and
is usually just ignored. But it also allows running arbitrary PASM/PIR
code by a) creating a new register context, b) creating a return
continuation, c) updating the C<pc> to point to the user code (a-c is
exactly what C does), and d) setting up the C and C pointers. This
means that any CSub called from the runloop can easily call a PASM/PIR
implementation without starting an extra run loop.

=head2 Opcode functions become B<CSub>s

Here is the same C<sub> opcode as a B<CSub>:

  (gdb) disas F_sub_i_i_i
  Dump of assembler code for function F_sub_i_i_i:
  0x080484d0 <+0>:   push   %ebp
  0x080484d1 <+1>:   mov    %esp,%ebp
  0x080484d3 <+3>:   mov    0xc(%ebp),%eax
  0x080484d6 <+6>:   mov    0x4(%eax),%edx
  0x080484d9 <+9>:   mov    (%eax),%ecx
  0x080484db <+11>:  sub    0x8(%eax),%edx
  0x080484de <+14>:  xor    %eax,%eax
  0x080484e0 <+16>:  mov    %edx,(%ecx)
  0x080484e2 <+18>:  pop    %ebp
  0x080484e3 <+19>:  ret

It's obvious that getting the args is now as efficient as getting
arguments from the stack. No extra non-volatile registers are used
(and thus none have to be preserved). The opcode count is halved, and
code size is reduced even more due to simpler (and faster) opcodes.

Actually, the above assembler code is also (almost [3]) the same as
the code generated by C<-Cj>, i.e. a PASM/PIR subroutine compiled to
hardware assembler code.

This one function is now also the same for C and C and can therefore
be emitted just once. This reduces the code size of C to about a
quarter of the current size.

=head3 The runcore dereferences arguments

The reduced code size in the opcode function of course comes at some
price. Dereferencing the arguments per opcode is specialized per
argument type. Doing it just once in the runloop needs more general
code that dereferences the arguments in a loop over all the opcode's
arguments and fills the args array. That is, we have a switch
statement over argument types, very similar to argument passing to a
PASM subroutine. This is of course slower in the general case.
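The argument-setup loop described above can be sketched as follows.
This is a guess at the shape of the scheme under heavily simplified,
assumed data structures (C<FakeInterp>, C<op_info_t>, the C<arg_kind>
enum are all invented for this sketch; in the real design the operand
kinds would come from the opcode metadata, and branch handling is
elided):

```c
#include <assert.h>

typedef long INTVAL;
typedef long opcode_t;

enum { OP_END = 0, OP_SUB_I = 1 };
enum arg_kind { ARG_OUT_I, ARG_IN_I, ARG_IN_IC };  /* out reg, in reg, in const */

typedef struct FakeInterp { INTVAL *int_reg; } FakeInterp;
typedef int (*csub_fun)(FakeInterp *, void **);

/* The single opcode body: uniform csub_fun prototype. */
static int csub_sub(FakeInterp *I, void **args)
{
    (void)I;
    *(INTVAL *)args[0] = (INTVAL)args[1] - (INTVAL)args[2];
    return 0;
}

typedef struct {
    csub_fun      fn;
    int           n_args;
    enum arg_kind kinds[4];
} op_info_t;

static const op_info_t op_table[] = {
    [OP_SUB_I] = { csub_sub, 3, { ARG_OUT_I, ARG_IN_I, ARG_IN_IC } },
};

/* One generic setup loop replaces the per-permutation opcode bodies:
 * register vs constant is decided here, once, per operand. */
static void run(FakeInterp *I, opcode_t *pc)
{
    void *args[5];
    while (*pc != OP_END) {
        const op_info_t *info = &op_table[*pc];
        for (int i = 0; i < info->n_args; i++) {
            opcode_t operand = pc[i + 1];
            switch (info->kinds[i]) {
            case ARG_OUT_I: args[i] = &I->int_reg[operand];        break;
            case ARG_IN_I:  args[i] = (void *)I->int_reg[operand]; break;
            case ARG_IN_IC: args[i] = (void *)operand;             break;
            }
        }
        args[info->n_args] = pc;   /* pass the pc along */
        info->fn(I, args);         /* branch handling elided */
        pc += info->n_args + 1;
    }
}
```

The switch over argument kinds is the extra cost per dispatch; the
payoff is that C<csub_sub> is the one and only body for all operand
permutations.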
I've timed the I<mops> benchmark, which measures pure opcode dispatch
speed:

  ./parrot -f          3.25 s
  ./np2 mops          10.4  s

(np2.c is a proof-of-concept program implementing the major ideas of
opcode dispatch.) But there is also:

  ./np2 mops           1.45 s

Now what: is it 3 times slower or faster? The answer can be: both. The
first and slow variant uses the general code all the time. The second
timing comes from partially inlining the most-used opcodes into a
switch statement and not calling any function at all. That is, the
function runcore becomes a partially switched one, but the now smaller
and faster opcode functions are still there when the JIT core needs
them.

The I<mops> benchmark is of course just measuring pure opcode dispatch
with very-low-overhead function bodies. More complex functionality in
opcodes is by far not that much subject to dispatch speed. And a final
remark: the JIT core is more than 300 times faster than C for the
I<mops> benchmark.

Issues from above:

=head3 a) just calling another function

As the call signature of the opcode and the called function are the
same, we can just call the function directly. "Everything is an
opcode" (Dan S.) gets a revival, yeah. But in a much more general and
useful sense: we get rid of the duplication due to argument
permutation and additionally avoid an extra function call.

=head3 b) too complex

We can denote opcode functions with an extra attribute stating how
they'll be implemented. There is already C (which is currently just
ignored). It would mean that a) all runcores implement it and b) the
function body is also inlined in the switch statement of the function
core. The opposite of that would be C, which means that the one and
only incarnation of this opcode is in F (or any other file) and other
runcores just call this function. Opcodes with a plain C have one
function body in each runcore.
"Opcode" for a C has the meaning: the C structure provides the metadata for function arguments and a possible return value and there is one and only one C that implements the function body. =head3 Prederefed runcores, loadable opcodes As opcode functions are much more light-weight with this calling scheme, we can call opcode functions from the function runcore with less penalty. If we are keeping interpreter independent code, we can denote opcodes with e.g. C and always call the function implementation instead of duplicating and permutating code in switched, direct threaded, and JIT core. The latter does exactly that anyway, if the architecture doesn't provide a JITted version of the opcode. =head2 Vtable functions and METHODS become C Most of the vtable variants and all the NCI methods would be replaced by Bs. The vtable would have some basic information about the PMC like C or C. The C are kept in the method or namespace hash of the PMC (basically an C), which permits indexed access too. E.g. PMC *csub = pmc->vtable->meth_hash_bucket_ptr[idx].value; branched = ((csub_fun) PMC_struct_val(csub))(interp, args); Or: branched = pmc->vtable->csubs[idx](interp, args); for functions kept directly in the vtable structure. =head2 Summary =head3 Proposal Pros =over 4 =item * The unification of the different call schemes of C functions and the usage of it as an opcode replacement would vastly simplify Parrot's internals. =item * The unification of calling schemes generalizes the concept of an opcode and unifies opcodes with other API functions inside the interpreter. =item * Moving the argument setup into the runloop vastly reduces the permutation count of current opcodes. The B<64> incarnations of C boil down to just one, which actually is the called function itself. 
=item * Having just one calling scheme is much more friendly to
further optimizations like creating JITted variants.

=item * The mentioned GC problem due to a secondary runloop can be
avoided by allowing all (PMC) opcodes to branch to another Sub.

=item * The unification of vtable functions and methods will simplify
the implementation of interfaces.

=item * Less overhead for method calls (NCI is much more general).

=back

=head3 Cons

=over 4

=item * Calling a B<CSub> as an opcode needs an extra loop for
argument setup.

=item * Calling a B<CSub> from Parrot C code is slightly less
efficient than calling a plain C function.

=item * It has to be implemented.

=back

=head1 TODO

Most of the todos aren't directly related to this proposal but need
fixing/speccing anyway.

=over 4

=item * Specify method calls to B<CSub>s

=item * Static (class) methods vs functions

=item * Namespace issues, e.g.

  open P0, 'f'

vs

  P0 = "open"('f')

=item * Call signatures for multi-subs and -methods

=back

=head1 IMPLEMENTATION STEPS

=head2 Proof of concept

The example program C<np2> (see also F) implements the opcode dispatch
and a limited set of argument setup for integer and some other
signatures. It additionally uses a C<short> (16-bit) type for opcodes
and is prepared for loading long integer constants from the constant
table.

=head2 Adjust code generation chain

=over 4

=item * Annotate opcodes with C or C

=item * Call C from other runcores

=back

=head2 Interfaces (needs more specs still)

=over 4

=item * Split the vtable into vtable and CSub parts

=item * Patch the PMC and opcode compilers to create CSubs

=back

=head2 Cleanup and removal of old code

=head1 FOOTNOTES

=over 4

=item [1] JIT

The term C<JIT> in Parrot is technically misleading when it comes to
the sense 'just in time (compilation)'. Parrot predereferences code
just in time, when the instruction is first executed. This is actually
a just-in-time recompilation to a faster opcode, especially when a
polymorphic inline cache (PIC) caches e.g. a hash lookup.
Compilation to machine code is still done per file (with the C<-j>
switch) or now (very limited) per subroutine (with C<-Sj> or C<-Cj>).
Nevertheless I'll use C<JIT> here in the sense of compiling to machine
code.

=item [2] JIT opcode coverage

It takes about 0.2 - 0.5 hrs to properly implement and test just one
JITted opcode for one architecture (if all goes well and you don't
have to resort to debugging). See e.g. the svn commit log from Feb
16th, r11554 - r11562, where I've implemented about 20 ops for the PPC
architecture. Actually, already existing templates have vastly
simplified that job.

Please open F in C<$EDITOR> and locate the line:

  Parrot_jit_fn_info_t op_jit

All entries with C<1> in the C fields are not implemented inside the
JIT runtime and have to call a function inside F.

=item [3] PIC/JIT

Below is the disassembly of the C<'sub'> function from:

  .sub main
      ...
      $I0 = 'sub'(i, j)
      ...
  .end
  .sub 'sub'
      .param int a
      .param int b
      $I0 = a - b
      .return ($I0)
  .end

created by:

  $ gdb parrot
  (gdb) b parrot_build_asm
  (gdb) r -Cj sub.pir
  (gdb) fin
  (gdb) x /13i $1->arena.start

  0x84b7d58:  push   %ebp
  0x84b7d59:  mov    %esp,%ebp
  0x84b7d5b:  mov    0x10(%ebp),%eax
  0x84b7d5e:  mov    0x4(%eax),%edx
  0x84b7d61:  mov    0x8(%eax),%ecx
  0x84b7d64:  sub    %ecx,%edx
  0x84b7d66:  mov    0x10(%ebp),%eax
  0x84b7d69:  mov    (%eax),%eax
  0x84b7d6b:  mov    %edx,(%eax)
  0x84b7d6d:  xor    %eax,%eax
  0x84b7d6f:  mov    %ebp,%esp
  0x84b7d71:  pop    %ebp
  0x84b7d72:  ret

The second fetching of the C<args> pointer isn't needed, but besides
that it's the same code as is created for a B<CSub>. As PIC/JIT uses
the same calling conventions too, this creates a wide area of future
improvement possibilities. A loadable opcode library can e.g. have the
actual opcode bodies implemented as PIR subroutines that get compiled
to native assembler code. We'd just need more types (char, float,
vector, ...) in the PASM to be able to express the necessary
transitions to hardware assembler code.

=back

=head1 SEE ALSO

Please make sure to have a look at the generated C files F, F, et al.
This of course needs a built Parrot. F and F are two more generated
files, for JIT and EXEC. The demo code is in F.

=cut

vim: expandtab sw=2 tw=70: