Wrapping up my GSoC


Open Source Development 101 and Developing LLVM and Polly 101– this is how I would describe my first GSoC. It not only opened an avenue to pursue my interest in systems research, but also got me connected with an open source developer community with similar interests and immense knowledge. Right time to mention Dr Tobias Grosser, Dr Michael Kruse, Dr Ramakrishna Upadrasta, my mentors! I cannot thank them enough.
Among a lot of the other firsts on this project were using git to develop in a quasi-production scenario, writing platform-independent code, proactively diving deep into inscrutable code-bases to understand the root-cause of an apparently implausible bug, remote collaboration, succinct and contextual communication, asking the right questions etc. It was also the first time I worked on LLVM, Polly and Julia together, on the very active upstream branches of three projects simultaneously. I did find doing this at the very beginning a little overwhelming, and frankly, sometimes discouraging because I wasn't making quick visible progress. On the brighter side, I realized I was not only getting acquainted quickly with the different coding standards of projects, but also was benefitting from my practice of adhering to certain conventions in my college and school projects though I was frequently asked by my then peers to be flexible.

Overview of Efforts

To briefly touch upon what I did, here’s the list of commits / discussions with the issues they addressed:
  1. Enabled Julia to build and link to Polly-ACC
    1. Introduce USE_POLLY_ACC julia/pull/21903
      1. Previous version:- Introduce USE_POLLY_ACC julia/pull/21736 : remote had to be deleted since the local branch was rebased.
      2. First attempt:- Initialize all targets that the LLVM build is configured for julia/pull/21142
    2. USE_POLLY_ACC : Remove dependency on CUDA header files and link libjulia to libGPURuntime julia/pull/22036
    3. [OPEN] Ensure libGPURuntime.so is a part of the portable tarball for USE_POLLY_ACC julia/pull/22365
  2. Fixed Polly-ACC to work in new and unforeseen environments, including  when,
    1. Invoked through ORCJIT
    2. Invoked through Julia
    3. Sent IR with debug metadata
        1. Previous version https://reviews.llvm.org/D35630 : had failed Polly’s AOSP build bot.
    4. CUDA SDK is installed at a nonstandard location
  3. Enhanced Polly’s capabilities
    1. Introduce TARGET_HYBRID
    2. Use run-time information with polly to better optimize code.
      1. Private discussions with Dr. Andreas Simbürger on Polly JIT and proposed an approach to implement recompilation of “hot functions”
  4. Logged performance improvements and diagnosed failures with optimizing PolyBench.jl
    1. Attempt at getting polly.jl Julia module to work.
  5. Logged bugs with Polly discovered over the course of the project
    1. [OPEN] Bug 33324 - allocateMemoryForDeviceCUDA received request for 0 bytes
    2. [RESOLVED] Bug 34102 - cudaFree and cudaManagedAlloc in GPUJIT.c fail Julia build
    3. [RESOLVED] Bug 32735 - DeLICMTests fails when LLVM_LINK_LLVM_DYLIB=ON and LLVM_BUILD_LLVM_DYLIB=ON
    4. [RESOLVED] Bug 32877 - regression test on cuda-managed-memory-simple.ll fail
  6. Keeping Julia up-to-date with upstream LLVM
    1. [OPEN] Fix for llvm6 julia/pull/23350
    2. [Closed] [ WIP ] Changes for LLVM 5 release julia/pull/22971
    3. Fix compilation on LLVM svnjulia/pull/22371
    4. [Discussion] Build fails while running Julia on usr/lib/julia/inference.ji : "Program aborted due to an unhandles Error:"/julia/issues/23238
    5. [Discussion] Changes for new BinaryFormat module in llvm-svn julia/issues/22362
    6. [Discussion] Build failed with llvm-svn julia/issues/22069
Click here to look at all my contributions to Julia on GitHub,  and here for contributions to Polly and LLVM.

Timeline

Initialize the NVPTX backend for PPCGCodeGeneration

The NVPTX backend had to be initialised for the PPCGCodeGeneration pass to generate the PTX kernel string. This was the first of the  issues I had to address, and one which took the longest time.
The first approach was to use the InitAllTargets* functions to initialize the backend within Julia (julia/pull/21142, Enabling other backends for Julia). But this had to be avoided due to concerns from the community that the JIT would slow down.
Another effort was made to initialize the backend within Polly (Initializing NVPTX backend within Polly), when it was configured to build its GPU code-generation facilities as well. This required that Polly be linked to the NVPTX backend libraries such that calls to backend’s routines can be resolved at linking stage. Linking the backend to Polly caused opt and clang to fail on startup since the code to initialize the NVPTX backend’s cl::opt command-line options ran twice since all the backends are linked to opt and clang. But, not linking the backend’s libraries to Polly caused bugpoint’s build to fail at link time since bugpoint isn’t already linked to the backend.
A lot of time was spent in figuring out a way to provide the NVPTX library to Polly, while not reducing the number of possible build configurations and limiting the changes within Polly. This proved as an opportunity to understand the fundamentals of static and dynamic linking.
The solution was to link the backends just to bugpoint (Linking NVPTX backend for Polly-ACC) and not linking the NVPTX backend to Polly’s loadable module (https://reviews.llvm.org/D32442).

Enabling Julia to build and link to Polly-ACC’s components

Introduce USE_POLLY_ACC was the first PR to bring Polly-ACC’s capabilities to Julia. Polly’s GPURuntime facility, used to handle GPU memory allocations and kernel launches,  was statically linked into Julia by including <polly/Support/LinkGPURuntime.h> in Julia’s source, which in turn included <polly/External/GPURuntime/GPUJIT.c> that defined the runtime API. This hack was required since the API is called at runtime by code compiled from IR generated by Polly-ACC, which inserts calls to the API into the IR, and not from within Julia. This required specifying the location of the CUDA SDK as well. An alternative was to link libGPURuntime.so/dll with GNU ld’s --no-as-needed attribute (prevents libraries from being pruned even if the executable doesn’t invoke any function in them) and provide the runtime path to the library using --rpath. This solution depended on a platform’s linker to provide GNU ld’s --no-as-needed and --rpath functionality and libGPURuntime.so/dll’s location written into the Julia’s shared library, which wasn’t considered a clean solution.
The second PR, USE_POLLY_ACC : Remove dependency on CUDA…, linked Julia to the libGPURuntime.so/dll since <polly/Support/LinkGPURuntime.h> wasn’t including GPUJIT.c and therefore dropped the dependency on the CUDA SDK that GPURuntime had imposed. Since Julia had to link to GPURuntime dynamically, it had to be located in libjulia’s RPATHs. libGPURuntime wasn’t included as a part of LLVM’s public interface and therefore had to be installed into $JULIA_SRC/usr/lib manually if Julia was using an out-of-tree LLVM build, which wasn’t a problem when GPURuntime was included statically into Julia.

Benchmarking PolyBench.jl

Mathias’s Resisinger’s Julia port of the polybench benchmark, PolyBench.jl, was used to gauge the speedups brought by PPCGCodeGeneration on Julia. This was a continuous efforts since changes in Polly, LLVM and Julia could affect the performance of kernels in PolyBench.jl. Here’s the latest performance comparison of kernels in PolyBench.jl,

Click here to visit the spreadsheet and select the "Comparison" tab to view the entire chart, run-times and program output that diagnoses failures. Note that speedup polly_hybrid follows the best among polly_hybrid and polly_cpu most of the time.
Polly’s TARGET_CPU, TARGET_GPU and TARGET_HYBRID (with option “--check-bounds=no”)  were compared against vanilla Julia (with options “-O3 --check-bounds=no”). Because of this, those kernels in which Polly failed to detect a SCoP or was unable to apply an optimization actually ran a little slower.
Only 9 kernels, including gemm, were transformed by PPCGCodeGeneration and only seidel-2d sees a huge performance improvement of 4x. It was a surprise that gemm had only improved by 50%, giving 1.5x speedup, given that Polly applied special optimizations meant just for matrix multiplication. The small performance gap might be because of ppcg generating a schedule to call the GPU to work on small chunks of data, even when the GPU has higher capacity, and multiple host-device data transfers. I have gathered information on the causes of failures of  PPCGCodeGeneration and IslScheduleOptimiser and intend to address them in the future, one by one.
As a consequence of running these benchmarks, various bugs within Polly surfaced which set the tone for the majority part of my GSoC. Some of them were due to invoking Polly through ORCJIT, whereas Polly was originally designed to be used clang, where Polly is invoked on only one LLVM Module due to all functions available at compile time.

Some bugs revealed during benchmarking

  1. Bug 33324 : revealed an issue with the schedule generated by ppcg, that caused the generated code to request for “0 bytes” of memory. This caused the gramschmidt kernel to crash.
  2. Illegal access results in segfault and crashes PolyBench.jl : Was encountered when the ludcmp kernel was being optimized by PPCGCodeGeneration. One of the functions within Polly are called through a pointer to an object set to NULL. The peculiarity of this case is that this happened only when polly was invoked through Julia, but doesn’t crash opt when running on the IR that fed to Polly within Julia.

Preventing debug metadata from leaking into LLVM Module meant for the GPU from the source Module

This issue was discovered when PolyBench.jl was being benchmarked on “julia-debug”, the debug build of Julia. All LLVM-IR instructions were annotated with debug metadata and LLVM Module had debug intrinsics.polybench kernels started failing when they had debug metadata, when running PolyBench.jl with.
This, I would say, was the most daunting part of my GSoC because neither did I know where to look for nor how to look for the bug. I was overwhelmed by the possible variety of causes.
Here, I need to thank my mentor Dr. Tobias Grosser for having encouraged me to solve the problem on my own. I took this as an opportunity to dig deep into Polly’s and LLVM’s codebases, acquainting myself with many concepts in LLVM and Polly. The time spent helped me gain more confidence in my understanding and become more of a self-starter.
My first approach(Stripping invalid debug info...) was to remove all debug metadata and debug intrinsics  that went into the IR module meant for the GPU code (called GPUModule from now on). The debug intrinsics also had to removed because they were passed as arguments to the PTX kernel, which was later fixed by separate patch.
It was finally resolved when a fellow developer, Siddharth Bhat, pointed out that Polly sometimes copies LLVM-IR instructions instead of generating them from their polyhedral representation. Instructions annotated with debug location metadata were being copied into the GPUModule, which in-turn pulled all related debug metadata into the GPUModule. The solution was to chip off the metadata before copying it into the GPUModule (Remove debug metadata from copied inst...). Also, this had to be done only for instructions meant for the GPUModule since certain cases require the debug location on the instruction (e.g. inlining a CPU function with debug metadata).

Ensure that the expected kernel is launched by assigning unique names to PTX kernels using their source context

The PTX kernels generated by PPCGCodeGeneration  were called “kernel_$ID” where ID would indicate the kernel’s position in a depth first search of the ISL abstract syntax tree generated by ppcg. This uniquely identifies a kernel within a SCoP given to ppcg and launches the correct kernel when a program has only one SCoP.
Many kernels in PolyBench.jl have more than one SCoP, e.g. 2mm and 3mm, which require names that uniquely identify them across SCoPs. Unique identification is required due to GPU runtime that maintains a cache of PTX kernels associated to their names alone. This means that when there are 2 different SCoPs that generate kernels, the first kernel of both SCoPs would be called “kernel_0” which would end up calling 1st SCoP’s kernel when the 2nd SCoP launches it’s “kernel_0” kernel.
When I benchmarked PolyBench.jl, this issue was more pronounced as kernels are generated across functions. This struck me when I noticed that there were just 2 entries in the NVIDIA Profiler’s kernel-profiling report for PolyBench.jl, “kernel_0” and “kernel_1” were the only kernels profiled. The maximum, average  and minimum runtimes of each kernel weren’t close to each other.

I then introduced Prefix the name of the calling host function in the name of callee GPU kernel and [PPCGCodeGen] Differentiate kernels based on their parent Scop to uniquely identify the kernels based on the SCoP and its parent function they were synthesized from.

Introduced a “hybrid” target

PPCGCodeGeneration may not be able to offload all SCoPs to the GPU. In these cases, the SCoP can be passed to IslScheduleOptimizer to attempt optimize it for the CPU, if PPCGCodeGeneration failed without code generating the SCoP. This can happen if ppcg fails to generate the schedule tree. Once PPCGCodeGeneration starts codegenerating the AST, the information about the SCoP is outdated, even if it fails to generate the PTX kernel or estimates the offload to be costly.
D34054 Introduce a hybrid target... sequenced the IslScheduleOptimizer right after PPCGCodeGeneration in the pass pipeline to optimize SCoPs whose corresponding LLVM Regions hadn’t been touched by PPCGCodeGeneration. This can be enabled by passing “-polly-target=hybrid” in opt’s command line or prefixed with “-mllvm” in clang’s command line.

In the future, I intend to modify PPCGCodeGeneration such that a SCoP’s Region is modified by only after all checks pass. This’d help many kernels in PolyBench.jl which currently fail PPCGCodeGeneration in the middle of codegenerating the SCoP.

Explore using run-time information to help Polly better optimize

One of the advantages of JIT-ed languages is that the values of all the parameters to a function call are available right before the function’s compiled. This opens up new opportunities to optimize the code, e.g. generate specialised code for a particular set of arguments if the function’s called on the same set of arguments many times. This strategy is called “hot recompilation”.
I had to view this strategy in the context of polyhedral compilation. One of the ways the PPCGCodeGeneration or IslScheduleOptimizer can generate better code is when they have hints about the parameters to the SCoP, e.g. the iteration space. These can be supplied as inequalities on the loop bounds using the llvm.expect or llvm.assume intrinsics, where the second is equivalent of assuming the given condition will always be true. These inequalities are represented as hyperplanes that prune the iteration space that make it easier for isl or ppcg to select the best transformation.
The function “assume” (defined in FunctionWrappers.jl) uses Base.llvmcall to insert LLVM-IR into Julia code. My experiments with using the assume function suggested that Polly was able produce more opimized code. A good use case of this would be for Neural Networks: since the dimensions of activations and weight matrices are fixed between 2 layers, these conditions can be inserted into a wrapper function for a particular layer that in turn calls the generic function and while forcing the generic function to be inlined. E.g,
function gemm_layer_1(A, B, C)
  assume( A.dim[0] = 6 && A.dim[1] == 5 && B.dim[0] == 5 .... );
  @inline gemm(A,B,C)
end

I then had discussion with Dr. Andreas Simburger, the lead developer of PolyJIT, a version of Polly that implements hot recompilation of functions. I proposed a more generic framework, to extract SCoPs into separate functions such that a modified ORCJIT Tranformation can be used to intercept calls to these SCoPs, where the SCoP parameters are essentially the function parameters. This eliminates the task of inferring the SCoP parameters.  I intend to push forward in this direction after my GSoC.

Keeping Julia up-to-date with upstream LLVM and other challenges

If I were to reflect on the challenges, one was in working with upstream versions of 3 projects with build regressions as soon as a change in a dependee project breaks a dependant. This kept happening with Julia whenever the LLVM repo was updated. The Julia community preferred to work with released versions than with the trunk, which meant it took time for a PR that updated Julia for LLVM’s trunk to be reviewed and pulled in. It was more difficult when I had to depend on the community for fixes when the changes were beyond my understanding. During the last stages of my GSoC which coincided with the release of v5.0, Julia needed a lot of fixes. I raised this with the community and attempted a fix on my own, but thankfully the community stepped in after a while and pushed a PR that carried changes far more subtle than I thought. Nevertheless, I had figured a way to fix the issues with some understanding, and gained knowledge as I moved through wilderness of LLVM documentation and with help from the community.
Another challenge I faced was in negotiating the compile time, around half an hour, and unexpected OOM crashes. This was because USE_POLLY_ACC was supported only in an in-tree LLVM configuration.  The out-of-tree configuration skips installing the libraries and headers and building a tarball / zip of libraries. So even the smallest change required me to wait for half an hour. The wait-time increased when I did not have access to a high-performance server or when I needed a debug build.

Status of objectives as of the end of GSoC

Completed

  1. Adapted Julia and Polly to build with components and dependences of Polly-ACC
  2. “Introduction of @polly-acc macro” was replaced by “Introduced a ‘hybrid’ target” after discussions with my mentor.

To be continued after GSoC

  1. Getting more Julia generated kernels recognized and optimized
  2. Using run-time parameters to better optimize generated code

Looking back

Personally, I would say, the project "success" has rung in many intangible benefits -- the foothold it has helped me gain as a part of the Polly community, the experience in applying my understanding, new knowledge, getting accustomed to working in a remote environment, appreciating the cultural and demographic differences. I must also mention those “aha” moments I had when I figured out why something was not working.

To the future

I'm passionate about compilers and want to push forward in this direction. It will be a great opportunity if I could collaborate with Dr. Andres Simburger in bringing his JIT to upstream Polly, and improve the runtime of kernels in PolyBench. Through this, I hope to learn the subtleties and working of LLVM IR and optimization passes.

Comments

Popular posts from this blog

Enabling Julia to target GPGPUs using Polly | Google Summer of Code 2017