My Google Summer of Code 2025 Selection with CERN
Posted on Sat 28 June 2025 in Blog
I'm excited to share that I've been selected for Google Summer of Code 2025 with CERN-HSF, working on "Using ROOT in the field of genome sequencing."
What is this blog about?
This blog will document my Google Summer of Code experience with CERN-HSF, focusing on applying the ROOT framework to genomic data analysis.
Why CERN-HSF?
I've been interested in CERN since learning about the Higgs boson discovery in 2012. The technology behind particle physics research always fascinated me, so when GSoC 2025 applications opened, CERN-HSF was my top choice.
The HSF (HEP Software Foundation) operates as an umbrella organization for high-energy physics software projects. What attracts me to this field is how particle physicists tackle fundamental questions like "What are the fundamental blocks that make up our Universe?" and "What is the nature of dark matter and dark energy?" using cutting-edge computational tools.
Project Discovery
I spent time exploring the HSF project ideas published on February 11 and found two that matched my interests:
Project 1: TMVA SOFIE - HLS4ML Integration
This focused on machine learning inference optimization within ROOT's TMVA toolkit. The goal was integrating hls4ml with SOFIE to enable efficient ML model inference, converting models from Keras, PyTorch, and ONNX formats into optimized C++ code.
Project 2: Using ROOT in Genome Sequencing
This involved applying ROOT's data processing capabilities to genomic data storage and analysis - essentially using particle physics tools for biological research.
Following the Selection Process
CERN-HSF has a structured two-phase selection process due to the high number of applicants:
Phase 1 (Feb 27 - March 24): Pre-selection and Evaluation Tests
Following the guidelines, I waited until February 27 to contact mentors. I sent emails to the mentors of both projects, attaching my CV and explaining my motivation for each choice.
Both project mentors responded with evaluation tests to demonstrate the skills needed for their respective projects.
For TMVA SOFIE - HLS4ML Integration, the test evaluated:
- C++ and Python programming skills
- Understanding of machine learning frameworks (PyTorch, ONNX)
- Knowledge of hls4ml architecture and high-level synthesis concepts
- Familiarity with ROOT's TMVA system
- Understanding of model optimization and inference techniques
For Using ROOT in Genome Sequencing, the test covered:
- C++ and Python programming proficiency
- Understanding of the ROOT framework and its data structures
- Knowledge of genomic data formats (SAM/BAM files)
- Familiarity with bioinformatics tools and workflows
- Understanding of data compression and storage optimization
Both tests were challenging and required demonstrating practical coding skills alongside theoretical knowledge. The mentors emphasized that the tests were private, solutions should be personal, and response time was part of the evaluation.
Phase 2 (March 24 - April 8): Proposal Development
By April 1, both mentors had emailed me to say that I had passed the evaluation tests for every project I had applied to. This began the second phase, in which I discussed project ideas, timelines, and objectives with the mentors.
The mentors gave me constant feedback during this period and helped me develop detailed proposals for both projects. The application deadline was April 8, and I submitted 3 proposals.
The Selection Results
On May 8, Google announced the accepted student projects. Before the announcement, I had received an email from one of the organization admins saying that I had been selected for two projects and needed to choose one.
It was good to know that I had been selected for both projects. This meant that both my proposals and evaluation test performances were strong enough for acceptance through the competitive two-phase process.
Making the Decision
Choosing between the two projects was difficult since both offered unique learning opportunities. I ultimately chose "Using ROOT in the field of genome sequencing" because:
- Interdisciplinary nature: It combines computational physics with bioinformatics
- Novel application: ROOT isn't commonly used for genomic data analysis
- Practical relevance: Genomic datasets are growing rapidly and need efficient storage solutions
- Technical innovation: Working with ROOT's new RNTuple format alongside traditional bioinformatics tools
The most important factor, though, was that my previous GSoC project was also in bioinformatics, so it made sense to build on that experience. Seeing Google DeepMind's work in bioinformatics made the field even more interesting to me; I expected them to release something in genomics, and they did launch AlphaGenome this June. I have also joined Princeton University's Compiler Research Team, where I am working with some of the finest developers.
Building on Previous Experience
This isn't my first GSoC participation. My previous experience taught me valuable lessons about:
- Managing project scope and setting realistic milestones
- Maintaining regular communication with mentors and the community
- Documenting progress systematically (which is mandatory for CERN-HSF students)
- Writing clean, maintainable code for open source projects
This experience helped me navigate the structured selection process and better evaluate which project would be the best fit.
Project Overview
The selected project focuses on extending GeneROOT capabilities through:
- Reproducing Previous Results: Validating existing comparisons against ROOT master
- Compression Analysis: Comparing ROOT's compression strategies with Samtools for BAM/RAM conversions
- RNTuple Implementation: Exploring ROOT's RNTuple format for efficient genomic data storage
- File Splitting Techniques: Investigating different ROOT file splitting approaches
- Performance Benchmarking: Producing comprehensive comparison reports
ROOT typically achieves 10-50% smaller file sizes and multiple times faster read throughput compared to traditional formats. Applied to genomic data, this could significantly improve storage efficiency and analysis speed for research institutions.
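To make the compression-analysis goal a bit more concrete, here is a minimal sketch of the kind of size comparison involved. It is not the project's benchmark and does not use ROOT or Samtools: it builds synthetic, SAM-like text records and compresses them with two codecs from the Python standard library. zlib is a reasonable stand-in because BAM's BGZF blocks are DEFLATE-based, and ROOT offers both zlib and LZMA among its compression algorithms. The function names `make_sam_like_records` and `compression_report` are hypothetical, and the random data is only an assumption for illustration, so the ratios say nothing about real genomic files.

```python
import lzma
import random
import zlib


def make_sam_like_records(n_records: int, read_length: int = 50, seed: int = 0) -> bytes:
    """Build synthetic, tab-separated alignment records (illustrative data only)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_records):
        seq = "".join(rng.choice("ACGT") for _ in range(read_length))
        qual = "".join(rng.choice("FGHI") for _ in range(read_length))
        pos = rng.randint(1, 1_000_000)
        # Columns loosely follow the SAM layout:
        # QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
        fields = [f"read{i}", "0", "chr1", str(pos), "60",
                  f"{read_length}M", "*", "0", "0", seq, qual]
        lines.append("\t".join(fields))
    return ("\n".join(lines) + "\n").encode("ascii")


def compression_report(payload: bytes) -> dict:
    """Return the compressed size and compression ratio for each codec."""
    codecs = {
        "zlib": lambda data: zlib.compress(data, 6),
        "lzma": lambda data: lzma.compress(data),
    }
    report = {}
    for name, compress in codecs.items():
        size = len(compress(payload))
        report[name] = {"bytes": size, "ratio": len(payload) / size}
    return report


if __name__ == "__main__":
    data = make_sam_like_records(2000)
    print(f"raw: {len(data)} bytes")
    for name, stats in compression_report(data).items():
        print(f"{name}: {stats['bytes']} bytes, ratio {stats['ratio']:.2f}")
```

A real comparison would swap the synthetic payload for actual alignment files and would also measure read and write time, since a smaller file is only useful if it stays fast to query.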
Working with My Mentors
I'll be working with experienced mentors:
- Martin Vasilev from University of Plovdiv
- Vassil Vassilev from Princeton University
- Fons Rademakers from CERN
These developers have extensive experience with ROOT development and particle physics data analysis, bringing valuable expertise to the genomics application domain.
Understanding the HSF Community
Working with CERN-HSF means joining a community that has been participating in GSoC since 2011, with the program expanding under the HEP Software Foundation umbrella since 2017. The organization maintains high standards and requires student blogs (like this one) as part of the evaluation process.
What's Coming Next
I'm currently working on the project and have completed most of the work on the RNTuple converter.
In upcoming blog posts, I'll document:
- Technical implementation details
- Performance comparisons and benchmarks
- Challenges in interdisciplinary software development
- Progress updates and lessons learned
Following CERN-HSF tradition, I'll maintain regular blog posts throughout the project to share my experience with the community.
Thanks for reading! I'm looking forward to contributing to this exciting intersection of particle physics and genomics research.