[PYTHON] Regarding Pyston 0.3

Introduction

Good evening, nice to meet you. (゜∀゜)o彡° pyston! pyston!!

Pyston overview

Pyston is a Python 2.7-compatible implementation being developed by Dropbox. It aims to speed up Python through JIT compilation with LLVM; its main rival is PyPy. It seems to support only x86_64 right now, and Ubuntu 14.04 is recommended if you want to try it.

See here for details on how to build it: http://qiita.com/Masahito/items/edd028ebc17c9e6b22b0

Pyston released v0.2 in September 2014 and is now working on v0.3.

In 0.3, the focus is on improving performance on realistic benchmarks.

The Pyston repository contains representative benchmarks. Once Pyston is built, each benchmark can be run with make run_TESTNAME.

minibenchmarks


allgroup.py  fannkuch.py      go.py      interp2.py  nbody_med.py  raytrace.py
chaos.py     fannkuch_med.py  interp.py  nbody.py    nq.py         spectral_norm.py

microbenchmarks


attribute_lookup.py  fib2.py            listcomp_bench.py  repatching.py          vecf_add.py
attrs.py             function_calls.py  nested.py          simple_sum.py          vecf_dot.py
closures.py          gcj_2014_2_b.py    polymorphism.py    sort.py
empty_loop.py        gcj_2014_3_b.py    prime_summing.cpp  thread_contention.py
fib.py               iteration.py       prime_summing.py   thread_uncontended.py
fib.pyc              lcg.py             pydigits.py        unwinding.py

Highlights of Pyston

I looked into the features of the JIT compiler while reading the Pyston README.

https://github.com/dropbox/pyston

I think the highlight of Pyston is JIT compilation using LLVM. Among systems that use LLVM as a JIT compiler, probably the most famous is the FTL JIT of JavaScriptCore.

This explanation of the FTL JIT is a good reference: http://blog.llvm.org/2014/07/ftl-webkits-llvm-based-jit.html

JSC performs four-tier JIT compilation:

  1. low-level interpreter (LLInt)
  2. baseline JIT
  3. DFG JIT
  4. FTL JIT (the only tier that uses LLVM)

Pyston also performs four-tier JIT compilation. From Pyston 0.3 it seems to use pypa's parser, with the AST playing the role that bytecode plays in JSC.

  1. LLVM-IR interpreter (EffortLevel::INTERPRETED)
  2. Baseline LLVM compilation (EffortLevel::MINIMAL)
  3. Improved LLVM compilation (EffortLevel::MODERATE)
  4. Full LLVM optimization + compilation (EffortLevel::MAXIMAL)

JIT compilation with LLVM is performed in tiers 2, 3, and 4. In tier 2, code that collects type information at runtime is embedded, without running LLVM's optimizations. In tier 4, type speculation is performed based on the types collected at runtime, LLVM's optimizations are applied, and fast code is generated.

Tier 4 targets loops that execute 10,000 or more iterations, or functions that are called 10,000 or more times.
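The tier-promotion idea above can be sketched in Python. This is a toy model: the 10,000 threshold for the top tier comes from the text, while the lower thresholds and the promotion-by-call-count mechanism are invented for illustration.

```python
# Toy sketch of tiered recompilation: a function starts in the
# interpreter tier and is promoted as its call count crosses
# thresholds. Only the 10000 threshold is from the article; the
# lower thresholds are made up.

TIERS = ["INTERPRETED", "MINIMAL", "MODERATE", "MAXIMAL"]
THRESHOLDS = [25, 500, 10000]  # promotion points into tiers 2, 3, 4

class JittedFunction:
    def __init__(self, name):
        self.name = name
        self.calls = 0
        self.tier = 0  # index into TIERS

    def call(self):
        self.calls += 1
        # "recompile" at a higher tier each time a threshold is crossed
        while self.tier < 3 and self.calls >= THRESHOLDS[self.tier]:
            self.tier += 1

f = JittedFunction("fib")
for _ in range(10000):
    f.call()
print(TIERS[f.tier])  # MAXIMAL
```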

In the future, the third tier is planned to be removed, and it seems they want to replace the LLVM-IR interpreter in the first tier with their own implementation.

The patchpoints seem to use LLVM's intrinsics; are the stackmaps an original implementation?

Inlining

It seems that Python methods are inlined as appropriate during JIT compilation.

Also, the basic operations frequently needed at runtime (boxing/unboxing) and the collections (list/dict/tuple/xrange) seem to be compiled to LLVM bitcode when Pyston itself is built, and then inlined at the bitcode level during JIT compilation.

This part is somewhat distinctive: the inliner appears to be Pyston's own pass, created by modifying LLVM's.

See codegen/opt/inliner, and runtime/inline (this is the collection compiled to bitcode).

inline cache

Inline caches are managed via RuntimeIC(void* addr, int num_slots, int slot_size); the bookkeeping itself lives in ICInfo.

Calls go through call() on classes that inherit from RuntimeIC, and the call itself is a template:

:lang:src/runtime/ics.cpp


template <class... Args> uint64_t call_int(Args... args) {
  return reinterpret_cast<uint64_t (*)(Args...)>(this->addr)(args...);
}

template <class... Args> bool call_bool(Args... args) {
  return reinterpret_cast<bool (*)(Args...)>(this->addr)(args...);
}

template <class... Args> void* call_ptr(Args... args) {
  return reinterpret_cast<void* (*)(Args...)>(this->addr)(args...);
}

template <class... Args> double call_double(Args... args) {
  return reinterpret_cast<double (*)(Args...)>(this->addr)(args...);
}

For details, see src/runtime/ics and src/asm_writing/icinfo.
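The idea behind an inline cache can be sketched in Python. This is only a conceptual model of what the slots do, not Pyston's actual machine-code patching: a call site caches the lookup result per class and reuses it as long as the class matches.

```python
# Conceptual model of an inline cache: a call site with a fixed
# number of slots caches (class -> resolved attribute) pairs, so the
# slow generic lookup runs only on a miss. Pyston's real ICs patch
# machine code via RuntimeIC/ICInfo; this is just an illustration.

class InlineCache:
    def __init__(self, num_slots=2):
        self.num_slots = num_slots
        self.slots = {}               # cls -> resolved attribute
        self.hits = self.misses = 0

    def lookup(self, obj, name):
        cls = type(obj)
        if cls in self.slots:         # fast path: cached slot
            self.hits += 1
            return self.slots[cls]
        self.misses += 1              # slow path: generic lookup
        value = getattr(cls, name)
        if len(self.slots) < self.num_slots:
            self.slots[cls] = value   # fill a free slot
        return value

class A:
    def greet(self):
        return "A"

ic = InlineCache()
a = A()
for _ in range(5):
    ic.lookup(a, "greet")(a)
print(ic.hits, ic.misses)  # 4 1
```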

hidden class

Is this equivalent to V8's hidden classes?

For some reason it inherits from ConservativeGCObject, and many of its methods are for the GC. A mystery.

Does Python need hidden classes to absorb differences in attribute sets? I'm not sure about the Python semantics here; is variation in attributes actually a problem?

For details, see src/runtime/objmodel.
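For reference, here is a sketch of the V8-style hidden-class idea in Python. How closely Pyston's implementation matches this is a guess based only on the V8 analogy above; the names here are invented.

```python
# Sketch of hidden classes ("maps" in V8): objects that gain the
# same attributes in the same order share one layout object, so an
# attribute access can be a fixed offset instead of a dict lookup.

class HiddenClass:
    def __init__(self, attrs=()):
        self.attrs = attrs        # attribute names, in insertion order
        self.transitions = {}     # name -> successor HiddenClass

    def with_attr(self, name):
        if name not in self.transitions:
            self.transitions[name] = HiddenClass(self.attrs + (name,))
        return self.transitions[name]

    def offset(self, name):
        return self.attrs.index(name)

ROOT = HiddenClass()

class Obj:
    def __init__(self):
        self.hcls = ROOT
        self.storage = []         # flat storage indexed by offset

    def set(self, name, value):
        if name in self.hcls.attrs:
            self.storage[self.hcls.offset(name)] = value
        else:                     # new attribute: transition the class
            self.hcls = self.hcls.with_attr(name)
            self.storage.append(value)

p, q = Obj(), Obj()
p.set("x", 1); p.set("y", 2)
q.set("x", 3); q.set("y", 4)
print(p.hcls is q.hcls)  # True: same shape, shared hidden class
```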

Type feedback

In tiers 2 and 3, the JIT embeds code that collects type information at runtime.

Basically, at runtime the cls field of each Box is read and recorded into a recorder; the asm for this is emitted during tier 2-3 JIT compilation. A type that has been recorded 100 or more times at JIT compilation time seems to be adopted as the type prediction.

The predicted type becomes the CompilerType, and at JIT compilation the code is specialized from the dynamic type to that CompilerType. At that point, it seems to aggressively try to unbox from the BoxedClass to the CompilerType.

Speculation is in src/analysis/type_analysis, the recorder is in src/codegen/type_recording, and the runtime recording is in src/runtime/objmodel.
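The recording-and-threshold scheme can be sketched in Python. The 100-record threshold comes from the text; everything else (the recorder shape, picking the most common class) is a simplification and not Pyston's actual type_recording code.

```python
# Sketch of type feedback: lower tiers record the class observed at
# a site, and recompilation adopts a type as the speculation target
# only if it was seen at least 100 times (threshold from the text).

from collections import Counter

SPECULATION_THRESHOLD = 100

class TypeRecorder:
    def __init__(self):
        self.counts = Counter()

    def record(self, value):
        # roughly: read the cls field of the box and count it
        self.counts[type(value)] += 1

    def speculated_type(self):
        if not self.counts:
            return None
        cls, n = self.counts.most_common(1)[0]
        return cls if n >= SPECULATION_THRESHOLD else None

rec = TypeRecorder()
for i in range(150):
    rec.record(i)          # 150 ints observed
rec.record("oops")         # one stray str does not change the prediction
print(rec.speculated_type())  # <class 'int'>
```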

Object representation

All instances handled by Pyston seem to be boxed. Every box therefore starts with a cls field; BoxedInt, for example, stores an int64_t value after it.

The list of boxed classes probably looks like this:

:lang:


BoxedClass* object_cls, *type_cls, *bool_cls, *int_cls, *long_cls, *float_cls, *str_cls, *function_cls,
  *none_cls, *instancemethod_cls, *list_cls, *slice_cls, *module_cls, *dict_cls, *tuple_cls, *file_cls,
  *member_cls, *method_cls, *closure_cls, *generator_cls, *complex_cls, *basestring_cls, *unicode_cls,
  *staticmethod_cls, *classmethod_cls;

It seems that only boxed objects can be stored in the various collections (list, dict) and in args, and that the code extracting the actual value from them is specialized as much as possible.

Therefore, based on the type feedback results, the runtime types of a function's args are inferred, boxed objects are eliminated where possible, and unboxing is inserted as appropriate.

For type speculation see src/analysis; for Box see src/runtime/classobj and its derived classes.
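The guard-then-unbox pattern can be sketched like this. The cls-field-first layout and BoxedInt's int64_t payload come from the text; the guard/fallback shape and all names here are illustrative, not Pyston's actual code.

```python
# Sketch of boxed representation plus a specialized path: every
# value carries its class first; specialized code guards on the
# speculated class, works on unboxed payloads, and falls back to
# the generic path on a mismatch.

class BoxedClass:
    def __init__(self, name):
        self.name = name

int_cls = BoxedClass("int")
float_cls = BoxedClass("float")

class Box:
    def __init__(self, cls, payload):
        self.cls = cls            # every box starts with its class
        self.payload = payload    # e.g. BoxedInt's int64_t value

def generic_add(a, b):
    # generic path: works on any boxes (toy version)
    return Box(a.cls, a.payload + b.payload)

def specialized_add_int(a, b):
    # guard on the speculated class, then add the raw payloads
    if a.cls is int_cls and b.cls is int_cls:
        return Box(int_cls, a.payload + b.payload)
    return generic_add(a, b)      # "deopt" to the generic path

r = specialized_add_int(Box(int_cls, 40), Box(int_cls, 2))
print(r.cls.name, r.payload)  # int 42
```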

Optimize

LLVM's optimizations are used to generate fast code.

:lang:src/codegen/irgen.cpp


  doCompile()
    CompiledFunction()
    if (ENABLE_SPECULATION && effort >= EffortLevel::MODERATE)
      doTypeAnalysis()
        BasicBlockTypePropagator::propagate()
    optimizeIR() /* sets up LLVM optimizations via the LLVM PassManager */
      makeFPInliner() /* Pyston's own inlining pass */

      EscapeAnalysis() /* escape analysis, Pyston's own implementation */
      createPystonAAPass() /* updates AA results using the escape-analysis results */

      createMallocsNonNullPass() /* seems to remove (malloced != NULL) checks */

      createConstClassesPass()
      createDeadAllocsPass() /* removes allocs that do not escape */
The main control is in src/codegen/irgen.cpp, the speculation machinery is in analysis, and Pyston's own LLVM optimization passes are in codegen/opt.

I wondered whether EscapeAnalysis would replace allocs with stack allocation, but it seems to only feed NoEscape references back to LLVM's ModRef analysis as NoModRef.

Does LLVM's ScalarReplAggregates then look at the NoModRef results and replace the allocations with allocas?

DeadAllocsPass analyzes load/store references using the AA results and removes unnecessary loads. After that, LLVM's DCE may remove the remaining alloca-equivalent instructions.

According to the blog post http://blog.pyston.org/2014/11/06/frame-introspection-in-pyston/, local variables seem to be assigned to the stack.
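As a rough idea of what removing non-escaping allocations looks like, here is a toy pass over an invented three-address "IR". The IR format and analysis are made up for illustration; Pyston's DeadAllocsPass works on real LLVM IR with real alias-analysis results.

```python
# Toy dead-allocation elimination: an allocation whose result is
# only stored to / loaded from locally (never returned or passed to
# a call) does not escape, so its loads can be forwarded and the
# alloc and stores dropped.

def eliminate_dead_allocs(ir):
    # ir: list of (op, dest, args) tuples
    escaping = set()
    for op, dest, args in ir:
        if op in ("ret", "call"):       # values that leave the function
            escaping.update(args)
    out, stored = [], {}
    for op, dest, args in ir:
        if op == "alloc" and dest not in escaping:
            continue                    # drop the allocation itself
        if op == "store" and args[0] not in escaping:
            stored[args[0]] = args[1]   # remember the value, drop the store
            continue
        if op == "load" and args[0] not in escaping:
            out.append(("copy", dest, [stored[args[0]]]))
            continue                    # forward the stored value
        out.append((op, dest, args))
    return out

ir = [
    ("alloc", "t", []),
    ("store", None, ["t", "x"]),
    ("load", "y", ["t"]),
    ("ret", None, ["y"]),
]
print(eliminate_dead_allocs(ir))
# [('copy', 'y', ['x']), ('ret', None, ['y'])]
```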

C API native extension

From v0.2, Pyston seems to support C API extensions. Sample code is in test/test_extension; see src/capi.

:lang:test/basic_test.c


#include <Python.h>
#include <assert.h>

/* reconstructed so the excerpt is self-contained: the object saved
   by store() and handed back by load() */
static PyObject *stored = NULL;

static PyObject *
test_store(PyObject *self, PyObject *args)
{
    PyObject *obj;
    if (!PyArg_ParseTuple(args, "O", &obj))
        return NULL;

    Py_XDECREF(stored);
    Py_INCREF(obj);
    stored = obj;
    Py_RETURN_NONE;
}

static PyObject *
test_load(PyObject *self, PyObject *args)
{
    if (!PyArg_ParseTuple(args, ""))
        return NULL;

    assert(stored);
    Py_INCREF(stored);
    return stored;
}

static PyMethodDef TestMethods[] = {
    {"store",  test_store, METH_VARARGS, "Store."},
    {"load",  test_load, METH_VARARGS, "Load."},
    {NULL, NULL, 0, NULL}        /* Sentinel */
};

PyMODINIT_FUNC
initbasic_test(void)
{
    PyObject *m;

    m = Py_InitModule("basic_test", TestMethods);
    if (m == NULL)
        return;
}

Optimization point of python processing system

This IBM presentation on a Python JIT compiler covers the material in detail, so please refer to it. I tried to map out which parts of Python are slow.

http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/slides/vee18-ishizaki-presentation.pdf

  1. hash lookup when accessing a field /* handled by the ICs? the field-reference speedup is unclear */
  2. instance-of checks for a class /* all basic classes are boxed */
  3. dictionary search when calling hasattr /* I don't understand the attr handling */
  4. exception checks without splitting BBs /* are there conventions for Python exceptions? */
  5. specializing on runtime type information /* type feedback and type speculation */
  6. speculatively inlining builtin functions /* bitcode inlining */
  7. reference counting without branches /* ??? */
  8. mapping to stack-allocated variables /* escape analysis and deadalloc */

Summary

This didn't get very far, so it's just a memo. It would be great if it gives you a feel for Pyston compared with other Python implementations and with the JavaScript JIT engines (V8, FTL JIT, etc.) that Pyston may be drawing on.

It seems that Pyston plans to incorporate the PyPy benchmarks while comparing against PyPy and CPython. Compared to today's Pyston, PyPy is far faster, and its mechanism is fundamentally different. I have high hopes for Pyston going forward.
