
Befikir T. Bogale

Graduate Research Assistant at Global Computing Lab

I am a Ph.D. student in Computer Science at the University of Tennessee, Knoxville, advised by Dr. Michela Taufer. I received my Bachelor's in Computer Science from UTK in Spring 2024. My research focuses on developing tools for performance analysis in High Performance Computing (HPC) environments.

Experience

University of Tennessee

Oct 2022 - Present

Knoxville, TN

Graduate Research Assistant

Aug 2024 - Present

Responsibilities:
  • Building an LLVM pass plugin to expose compiler remark information to annotation profiling tools like Caliper
  • Conducting performance analysis to evaluate the impact of different compilers and optimization levels on application performance
Undergraduate Research Assistant

Oct 2022 - May 2024

Responsibilities:
  • Developed containerized images using Singularity/Apptainer to enhance portability and reproducibility of HPC applications
  • Researched and mitigated sources of non-determinism in scientific HPC applications to improve reliability and accuracy
  • Implemented a checkpointing framework for neural networks leveraging deduplication to efficiently store epoch history
  • Collaborated with researchers at Lawrence Livermore National Laboratory and Argonne National Laboratory on HPC projects.

Lawrence Livermore National Laboratory

May 2024 - Aug 2025

Livermore, CA

Defense Science and Technology Internship

May 2025 - Aug 2025

Responsibilities:
  • Developed a lightweight, general approach for exposing compiler optimization provenance at runtime that integrates with existing profiling infrastructure such as Caliper and supports programmatic analysis with Thicket
  • Validated this approach on the RAJA Performance Suite across optimization levels, demonstrating how fusing compiler and runtime perspectives enables evidence-guided optimization for performance portability
Graduate Computing Scholar Internship

May 2024 - Aug 2024

Responsibilities:
  • Developed a cluster-based methodology to characterize the performance of portable HPC applications across diverse architectures
  • Conducted a performance study of CPUs and GPUs with different memory types using the RAJA Performance Suite, in collaboration with other members of the Thicket team

Los Alamos National Laboratory

June 2023 - Aug 2023

Los Alamos, NM

Parallel Computing Intern

June 2023 - Aug 2023

Responsibilities:
  • Parallelized X-ray transport simulations to improve computational efficiency and scalability
  • Leveraged Kokkos for portability across multiple architectures, utilizing vectorization and thread team policies
  • Optimized performance, achieving over 13× speedup in parallelized code compared to the serial implementation

Education

University of Tennessee
2024 - Present
PhD in Computer Science (High-Performance Computing Concentration)

University of Tennessee
2020 - 2024
BSc in Computer Science

Research Projects

Correlating Compiler Optimizations with Runtime Performance
Developer and Researcher, Jan 2024 - Present

This project develops a methodology to connect compiler optimization decisions with runtime performance for performance-portability libraries like RAJA. It integrates compiler optimization data directly into runtime profiles so developers can see how specific optimizations affect execution. We demonstrate the approach using kernels from the RAJA Performance Suite.
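A minimal sketch of the remark-ingestion side of this idea, assuming the code is compiled with clang's -fsave-optimization-record flag (which writes one .opt.yaml remark stream per translation unit). The file path and the per-function summary below are illustrative only, not the project's actual tooling:

  import sys
  from collections import Counter

  import yaml  # pip install pyyaml


  def _remark(loader, suffix, node):
      # LLVM tags each remark document as !Passed, !Missed, or !Analysis;
      # keep that tag as a "Kind" field on a plain dict.
      entry = loader.construct_mapping(node, deep=True)
      entry["Kind"] = suffix
      return entry


  class RemarkLoader(yaml.SafeLoader):
      """SafeLoader that accepts LLVM's remark tags instead of rejecting them."""


  RemarkLoader.add_multi_constructor("!", _remark)


  def summarize(opt_yaml_path):
      # Count remarks per (function, pass, kind) in one .opt.yaml file.
      counts = Counter()
      with open(opt_yaml_path) as f:
          for remark in yaml.load_all(f, Loader=RemarkLoader):
              if remark:
                  counts[(remark.get("Function", "?"),
                          remark.get("Pass", "?"),
                          remark["Kind"])] += 1
      return counts


  if __name__ == "__main__":
      for (func, opt_pass, kind), n in summarize(sys.argv[1]).most_common(15):
          print(f"{n:6d}  {kind:10s}  {opt_pass:24s}  {func}")

In the full methodology these counts are not inspected in isolation: the remarks are attached to the corresponding Caliper regions so they can be queried next to runtime metrics.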

Thicket
Developer and Researcher, Jan 2024 - Present

Thicket is a Python-based toolkit for analyzing ensemble performance data. It is built on top of Hatchet and inherits the benefits that Hatchet provides.
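A rough sketch of what that ensemble workflow looks like, assuming a handful of Caliper .cali profiles are on hand (the file names below are placeholders):

  import thicket as th

  # Read several Caliper profiles (e.g., one per compiler or optimization level)
  # into a single Thicket object; the file names are placeholders.
  tk = th.Thicket.from_caliperreader([
      "rajaperf_O2.cali",
      "rajaperf_O3.cali",
  ])

  # Per-node, per-profile metrics live in one DataFrame indexed by call-tree
  # node, and per-run metadata (compiler, flags, machine, ...) sits alongside.
  print(tk.dataframe.head())
  print(tk.metadata.head())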

Hatchet
Developer and Researcher, Jan 2024 - Present

Hatchet is a Python library that enables users to analyze performance data generated by different HPC profilers. Its main advantage over other tools is that it ingests data from different profilers into a common representation, so the same analysis code can be applied to performance data from different sources.
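A small usage sketch, assuming a Caliper-produced profile is available (the path is a placeholder; readers for other profilers follow the same pattern):

  import hatchet as ht

  # Load one profile into a GraphFrame: a call graph plus a pandas DataFrame
  # holding one row of metrics per graph node.
  gf = ht.GraphFrame.from_caliper("profile.json")

  print(gf.tree())            # render the call tree annotated with a metric
  print(gf.dataframe.head())  # metric columns depend on the source profiler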

ANACIN-X
Developer and Researcher, Oct 2022 - May 2024

ANACIN-X is a suite of tools designed for trace-based analysis of non-deterministic behavior in MPI applications, helping developers and scientists identify root sources of non-determinism. It features a framework for characterizing non-determinism through graph similarity, consisting of execution trace collection, event graph construction, kernel analysis, and distance visualization. Additionally, it includes use cases focused on communication patterns to enhance understanding and reproducibility in HPC applications.
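The graph-similarity idea can be illustrated with a toy example: build a small labeled event graph per run and compare runs with a Weisfeiler-Lehman graph hash. This is only a conceptual sketch using networkx, not ANACIN-X's actual trace-collection or kernel-analysis pipeline:

  import networkx as nx

  def event_graph(recv_order):
      # Toy event graph: a chain of receive events labeled by sender rank.
      g = nx.DiGraph()
      for i, sender in enumerate(recv_order):
          g.add_node(i, label=str(sender))
          if i > 0:
              g.add_edge(i - 1, i)
      return g

  # Two runs of a nondeterministic receive loop: message arrival order differs.
  run_a = event_graph([0, 1, 2, 3])
  run_b = event_graph([0, 2, 1, 3])

  # Equal hashes mean the labeled graphs are almost certainly isomorphic;
  # different hashes flag the runs as structurally divergent.
  h_a = nx.weisfeiler_lehman_graph_hash(run_a, node_attr="label")
  h_b = nx.weisfeiler_lehman_graph_hash(run_b, node_attr="label")
  print("runs match" if h_a == h_b else "runs diverge")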

Publications

Maintaining performant code in a world of fast-evolving computer architectures and programming models poses a significant challenge to scientists. Typically, benchmark codes are used to model some aspects of a large application code's performance, and are easier to build and run. Such benchmarks can help assess the effects of code or algorithm changes, system updates, and new hardware. However, most performance benchmarks are not written using a wide range of GPU programming models. The RAJA Performance Suite (RAJAPerf) provides a comprehensive set of computational kernels implemented in a variety of programming models. We integrated the performance measurement and analysis tools Caliper and Thicket into RAJAPerf to facilitate performance comparison across kernel implementations and architectures. This paper describes RAJAPerf, the performance metrics that can be collected, and experimental analysis with case studies.

Towards Affordable Reproducibility Using Scalable Capture and Comparison of Intermediate Multi-Run Results

Ensuring reproducibility in high-performance computing (HPC) applications is a significant challenge, particularly when nondeterministic execution can lead to untrustworthy results. Traditional methods that compare final results from multiple runs often fall short: they reveal sources of discrepancy only a posteriori and require substantial resources, making them impractical. This paper introduces a method that addresses this issue through scalable capture and comparison of intermediate multi-run results. By capitalizing on intermediate checkpoints and hash-based techniques with user-defined error bounds, our method identifies divergences early in the execution paths. We employ Merkle trees over checkpoint data to reduce the I/O overhead associated with loading historical data. Our evaluations on the nondeterministic HACC cosmology simulation show that our method effectively captures differences above a predefined error bound and significantly reduces I/O overhead. Our solution provides a robust and scalable method for improving reproducibility, ensuring that scientific applications on HPC systems yield trustworthy and reliable results.
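A minimal sketch of the core comparison idea: quantize checkpoint values to a user-defined error bound, hash fixed-size chunks into Merkle leaves, and compare roots first so matching runs cost almost nothing. The chunk size, quantization rule, and flat leaf scan on mismatch are illustrative simplifications, not the paper's exact scheme (which descends the tree level by level):

  import hashlib
  import numpy as np

  CHUNK = 1024  # values per leaf; illustrative choice

  def leaf_hashes(data, error_bound):
      # Quantize to the error bound so values within tolerance hash identically.
      q = np.round(np.asarray(data, dtype=np.float64) / error_bound).astype(np.int64)
      return [hashlib.sha256(q[i:i + CHUNK].tobytes()).digest()
              for i in range(0, len(q), CHUNK)]

  def merkle_root(leaves):
      # Pairwise-hash leaves upward until a single root remains.
      level = list(leaves)
      while len(level) > 1:
          level = [hashlib.sha256(level[i] + (level[i + 1] if i + 1 < len(level) else b"")).digest()
                   for i in range(0, len(level), 2)]
      return level[0]

  def diverging_chunks(a, b, error_bound=1e-6):
      # Compare two runs' checkpoints; report which chunks differ beyond the bound.
      la, lb = leaf_hashes(a, error_bound), leaf_hashes(b, error_bound)
      if merkle_root(la) == merkle_root(lb):
          return []  # roots match: no divergence above the bound
      return [i for i, (x, y) in enumerate(zip(la, lb)) if x != y]

  rng = np.random.default_rng(0)
  ckpt_a = rng.random(10_000)
  ckpt_b = ckpt_a.copy()
  ckpt_b[4321] += 1e-3                    # inject a divergence into one chunk
  print(diverging_chunks(ckpt_a, ckpt_b))  # -> [4]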

Professional Services

Served as the Lead Student Volunteer for workshops at SC25

Served as a Student Volunteer at SC24. In this role, I helped ensure the sessions of the conference ran smoothly. Additionally, I performed other miscellaneous tasks, such as keeping track of the number of attendees in the sessions at which I was working.

Posters

An Approach for Correlating Compiler Optimizations with Runtime Performance

This work builds a framework to understand how compiler optimizations influence the performance of performance-portability libraries such as RAJA. By combining compiler optimization remarks with runtime profiles, it creates a unified view that links compiler decisions to their execution impact. A case study on the RAJA Performance Suite demonstrates how this approach reveals optimization requirements and performance drivers across architectures.

Cluster-Based Methodology for Characterizing the Performance of Portable Applications

This work focuses on performance portability and proposes a methodological approach to assessing and explaining how different kernels behave across various hardware architectures using the RAJA Performance Suite (RAJAPerf). Our methodology leverages metrics from the Intel top-down pipeline and clustering techniques to sort the kernels based on performance characteristics. We assess the methodology on 54 RAJAPerf’s computational kernels on Intel Xeon and NVIDIA V100 platforms. Our results confirm the effectiveness of our methodology in automatically characterizing performance differentials and speedups, particularly in memory-bound kernels.
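A compact sketch of the clustering step, assuming a table of per-kernel metrics has already been collected; the kernel names are real RAJAPerf kernels, but the metric values and column names here are made up for illustration:

  import pandas as pd
  from sklearn.cluster import KMeans
  from sklearn.preprocessing import StandardScaler

  # Hypothetical per-kernel metrics; real inputs would come from hardware
  # counters (e.g., top-down categories) gathered while running RAJAPerf.
  metrics = pd.DataFrame(
      {
          "kernel": ["DAXPY", "MULADDSUB", "LTIMES", "PRESSURE"],
          "memory_bound": [0.72, 0.65, 0.31, 0.58],
          "core_bound":   [0.08, 0.12, 0.40, 0.15],
          "gpu_speedup":  [1.1,  1.2,  3.4,  1.3],
      }
  ).set_index("kernel")

  # Standardize so no single metric dominates the distance, then cluster.
  X = StandardScaler().fit_transform(metrics)
  labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
  print(metrics.assign(cluster=labels))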

Achievements, Honors, and Scholarships

Participated in the Graduate Student track of the ACM Student Research Competition, presenting my poster “Cluster-Based Methodology for Characterizing the Performance of Portable Applications”

Graduate Fellowship

Awarded the Graduate Fellowship at the University of Tennessee, Knoxville