Cross-facility Federated Learning

The Cross-Facility Federated Learning (xFFL) framework aims to bridge the compute divide between frontier AI research and the resources most institutions can access. It allows data scientists and domain experts to efficiently exploit multiple independent data centres for extreme-scale deep learning tasks.

The Cross-facility Federated Learning framework

Within a decade, frontier AI research has moved from the researcher’s workstation to thousands of high-end, hardware-accelerated compute nodes, and this rapid evolution shows no signs of slowing down in the foreseeable future. Obtaining and efficiently exploiting computing resources at that scale is a daunting challenge for universities and small and medium-sized enterprises (SMEs).

Cross-Facility Federated Learning is an experimental distributed computing methodology designed to explore three main research directions:

  1. Using High-Performance Computers (HPCs) as an enabling platform to train and fine-tune foundation AI models;
  2. Exploiting the aggregated computing power of multiple HPCs through a Workflow Management System (WMS) to solve today’s large-scale problems (e.g., training trillion-parameter models);
  3. Adopting Federated Learning (FL) as a technique for virtually pooling different datasets while keeping each one private to its owner(s).

The xFFL workflow builds on a previous small-scale experiment, which demonstrated the federated training of a VGG16 model across the CINECA MARCONI100 HPC facility (Bologna, Italy) and the HPC4AI OpenStack cloud (Torino, Italy). Thanks to the flexibility offered by StreamFlow, this approach is general enough to be reused at a much larger scale in the xFFL experiments.

To our knowledge, xFFL is the first experimental attempt to train a foundation model through a cross-facility approach on geographically distant HPCs. To learn more about the methodology and its development, visit the official xFFL web page or explore the GitHub repository.

StreamFlow application

The first iteration of the xFFL experiment trained the 7-billion-parameter LLaMA-2 model, cooperatively exploiting two EuroHPC supercomputers: CINECA Leonardo (Bologna, Italy) and IT4I Karolina (Ostrava, Czech Republic). Since the two infrastructures are located in countries with different languages, the experiment aims to train a single bilingual model using two distinct language corpora.

The FL protocol is described by a cyclic workflow expressed through the Common Workflow Language (CWL). The xFFL workflow execution is managed by a StreamFlow instance running on a VM hosted on the CINECA ADA cloud (Bologna, Italy).

StreamFlow submits SLURM jobs, together with their input data (the aggregated LLaMA model), to the different supercomputers. It then gathers the jobs’ results (a set of LLaMA models, each trained on a different supercomputer) for the aggregation step, which is performed locally on the ADA cloud VM. Training and validation datasets remain stored on their owners’ infrastructures and never move.
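
The data flow of a single xFFL round can be sketched in a few lines of Python. This is only an illustrative sketch: in the real deployment the loop is expressed as a CWL workflow and driven by StreamFlow, and submit_training_job below is a local placeholder standing in for the remote SLURM execution.

    # Sketch of one xFFL round. submit_training_job is a placeholder for the
    # remote SLURM job that StreamFlow launches on each facility; only model
    # weights move across sites, the training data never leaves its owner.
    from typing import Dict, List
    import torch

    FACILITIES = ["leonardo", "karolina"]  # each trains on its own private corpus

    def submit_training_job(facility: str,
                            weights: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        # Placeholder: perturb the weights to stand in for one round of
        # local fine-tuning on the facility's data.
        return {k: v + 0.01 * torch.randn_like(v) for k, v in weights.items()}

    def aggregate(models: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
        # Aggregation runs locally on the ADA cloud VM: average each parameter
        # across the models returned by the facilities.
        return {name: torch.stack([m[name] for m in models]).mean(dim=0)
                for name in models[0]}

    # One federated round: offload, gather, aggregate.
    global_model = {"w": torch.zeros(4, 4), "b": torch.zeros(4)}
    local_models = [submit_training_job(f, global_model) for f in FACILITIES]
    global_model = aggregate(local_models)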

In our experiments, queueing times were negligible for up to 32 nodes but grew to over 7 minutes for 128 nodes. Execution times, however, were orders of magnitude higher, ranging from 774 hours on two nodes down to 14 hours on 128 nodes. Other sources of overhead, i.e., data transfer and model aggregation times, did not exceed a few hundred seconds, making them negligible compared to the tens of hours each training round needs.
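
A back-of-the-envelope calculation on the reported figures (not part of the published analysis) makes the scaling behaviour concrete:

    # Strong-scaling check from the reported timings.
    nodes_small, nodes_large = 2, 128
    hours_small, hours_large = 774.0, 14.0

    speedup = hours_small / hours_large                  # ~55x from 2 to 128 nodes
    efficiency = speedup / (nodes_large / nodes_small)   # ~0.86 parallel efficiency
    queueing_overhead = (7 / 60) / hours_large           # 7 min queueing vs 14 h training

    print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}, "
          f"queueing overhead {queueing_overhead:.2%}")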

Therefore, even with excellent strong scaling, training times dwarf queueing times and transfer overheads even for large amounts of computing power. This result makes it perfectly feasible to split different training rounds into separate job submissions without significantly impacting overall performance.

I. Colonnelli, R. Birke, G. Malenza, G. Mittone, A. Mulone, J. Galjaard, L. Y. Chen, S. Bassini, G. Scipione, J. Martinovič, V. Vondrák, and M. Aldinucci, “Cross-Facility Federated Learning,” in Procedia Computer Science, vol. 240, pp. 3–12, 2024. doi:10.1016/j.procs.2024.07.003

Federated Learning meets HPC and Cloud

The Federated Learning (FL) approach is a paradigmatic example of modern AI applications. FL tackles the problem of collaboratively training a Machine Learning model across distributed data silos, where data cannot leave the owner’s infrastructure for privacy and secrecy reasons. Modelling an FL workflow is challenging because it requires both a federation of independent infrastructures and iterative execution patterns.

Existing FL architectures

The typical runtime architecture of FL frameworks (e.g., Intel OpenFL and Flower) follows a master/worker pattern. Each worker is deployed onto a different silo, where it trains a private copy of a Deep Neural Network (DNN). At the end of each training round, each worker sends its model to the master, which computes an aggregated model using a configurable algorithm and broadcasts it back to the workers for the next round.
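
As a concrete illustration of the aggregation step, the sketch below implements Federated Averaging (FedAvg), the algorithm used in the experiments described further down, assuming PyTorch-style state dictionaries and per-worker sample counts; in the real frameworks this logic is exposed as a configurable plug-in.

    # Federated Averaging (FedAvg): the master computes a weighted average of
    # the workers' parameters, weighting each worker by its number of samples.
    from typing import Dict, List
    import torch

    def fedavg(states: List[Dict[str, torch.Tensor]],
               num_samples: List[int]) -> Dict[str, torch.Tensor]:
        total = sum(num_samples)
        weights = [n / total for n in num_samples]
        return {name: sum(w * s[name] for w, s in zip(weights, states))
                for name in states[0]}

    # Example: two workers holding differently sized private datasets.
    worker_states = [
        {"linear.weight": torch.ones(2, 2), "linear.bias": torch.zeros(2)},
        {"linear.weight": torch.zeros(2, 2), "linear.bias": torch.ones(2)},
    ]
    aggregated = fedavg(worker_states, num_samples=[60_000, 73_000])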

Some recent FL frameworks drop the constraint of a single centralized aggregator, either relying on a tree-based infrastructure or implementing a fully decentralized peer-to-peer aggregation protocol. However, the communication topology is always an intrinsic characteristic of the framework implementation.

In research scenarios, data providers are usually independent entities with heterogeneous data treatment protocols, storage infrastructures, and access policies. Therefore, cross-silo FL pipelines are perfect candidates to be modeled as hybrid workflows.

StreamFlow application

StreamFlow has been used to execute a cross-cluster FL pipeline, where two independent HPC clusters train a model on two different private datasets and a Cloud VM acts as a centralized aggregator.

As a first step, a Common Workflow Language (CWL) description of an FL pipeline has been designed. The pipeline trains a VGG16 DNN over two datasets: the standard MNIST, residing on the CINECA MARCONI100 HPC facility (2×16-core IBM POWER9 AC922, 256 GB RAM, and 4 NVIDIA V100 GPUs per node), and a grayscale version of SVHN, residing on the EPITO bare-metal partition of the HPC4AI facility at Università di Torino (80-core Arm Neoverse-N1, 512 GB RAM, and 2 NVIDIA A100 GPUs per node). Note that, up to version v1.2, CWL does not support iterative constructs. However, this pipeline is the first real-world iterative CWL workflow, relying on the recently proposed Loop extension. The code is available on GitHub.
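
For illustration, the sketch below approximates the local training step each silo executes in one round, here on MNIST resized and replicated to three channels so it fits VGG16’s expected input. The actual step is packaged as a CWL tool in the GitHub repository, so treat this as an assumption-laden stand-in rather than the published code.

    # Approximate per-silo training step: one local epoch of VGG16 on MNIST.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, models, transforms

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # MNIST is 28x28 grayscale; VGG16 expects 3-channel, larger images.
    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.Grayscale(num_output_channels=3),
        transforms.ToTensor(),
    ])
    train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
    loader = DataLoader(train_set, batch_size=64, shuffle=True)

    model = models.vgg16(weights=None, num_classes=10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for images, labels in loader:          # one local epoch
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # The resulting weights are shipped back to the aggregator.
    torch.save(model.state_dict(), "local_model.pt")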

Two different FL configurations have been tested: 100 rounds of 1 epoch each and 50 rounds of 2 epochs each, using the well-known Federated Averaging (FedAvg) algorithm. Note that the typical master/worker architecture of FL frameworks requires direct bidirectional communication between the aggregator and each worker node. This is not compatible with the typical network configuration of an HPC facility, where worker nodes cannot open outbound connections. Therefore, StreamFlow is a key enabling technology for cross-cluster FL.

To compare performance with a baseline, the pipeline has also been tested on a pure cloud execution environment, replacing the two clusters with two VMs (8 cores, 32 GB RAM, and 1 NVIDIA T4 GPU each) running on the cloud partition of the HPC4AI facility. The performance obtained with the StreamFlow execution of the pipeline has been compared with an equivalent training workload managed by the Intel OpenFL framework. The collected results are comparable in terms of both accuracy and time-to-solution, showing how general-purpose hybrid workflows are ready to provide adequate performance in the FL field.

I. Colonnelli, B. Casella, G. Mittone, Y. Arfat, B. Cantalupo, R. Esposito, A. R. Martinelli, D. Medić, and M. Aldinucci, “Federated Learning meets HPC and cloud,” in Astrophysics and Space Science Proceedings, vol. 60, pp. 193–199, 2023. doi:10.1007/978-3-031-34167-0_39

AI-assisted COVID-19 diagnosis with the CLAIRE universal pipeline

At the start of the pandemic, several studies outlined the effectiveness of radiology imaging for AI-assisted COVID-19 diagnosis through chest X-Ray and, mainly, Computed Tomography (CT), given the pulmonary involvement in subjects affected by the infection. Even though X-Ray represents a cheaper and more practical solution for large-scale screening, its low resolution led AI models to show lower accuracy than those obtained with CT data.

Several research groups worldwide began to develop deep-learning models for the diagnosis of COVID-19, mainly in the form of deep Convolutional Neural Networks (CNNs) performing lung disease analysis on CT scan images. As soon as we started analyzing all the proposed solutions, it became evident that selecting the most promising ones was impossible, due to the use of different and non-comparable architectures, pipelines, and datasets. So, we started working on defining a reproducible workflow capable of automating the comparison of state-of-the-art deep learning models to diagnose COVID-19.

The CLAIRE task force on COVID-19

When the pandemic broke out, among the initiatives aimed at improving knowledge of the virus, containing its diffusion, and limiting its effects, the Confederation of Laboratories for Artificial Intelligence Research in Europe (CLAIRE) task force on AI & COVID-19 supported the creation of a novel European group to study the diagnosis of COVID-19 pneumonia assisted by Artificial Intelligence (AI). The group includes fifteen researchers in complementary disciplines (Radiomics, AI, and HPC), led by Prof. Marco Aldinucci, full professor at the Computer Science Department of the University of Torino.

The CLAIRE-COVID19 universal pipeline

This collaboration gave birth to the CLAIRE-COVID19 universal pipeline, designed to compare different training algorithms, to define a baseline for such techniques, and to allow the community to quantitatively measure AI’s progress in the diagnosis of COVID-19 and similar diseases.

The universal pipeline comprises two initial steps: Image Preprocessing and Segmentation. The first applies standard techniques for cleaning and generating variants of the training images, while the second uses a DNN-based encoder (e.g., UNet) to isolate the region of interest from the background information (e.g., lungs from other tissues). The final stages are likewise typical pipeline components, collecting Performance metrics and Explainability measures.

The core steps, Pre-training and Classification, are DNN-based. Pre-training aims to generate a first set of weights for the subsequent fine-tuning step, either using an unsupervised technique (e.g., an auto-encoder) or running a supervised training on a different dataset (e.g., ImageNet). The Classification step then labels each image with a class corresponding to a kind of lesion typical of the disease.
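
As a hedged illustration of this pre-training-then-classification handoff, the sketch below starts the classification step from torchvision’s ImageNet weights for DenseNet-121 (one of the classifiers evaluated below) and replaces only the classification head; the class count and optimizer settings are illustrative assumptions, not the pipeline’s actual configuration.

    # Supervised pre-training variant: initialize from ImageNet weights and
    # swap the head for one sized to the disease classes before fine-tuning.
    import torch
    from torch import nn
    from torchvision import models

    NUM_CLASSES = 3  # illustrative; the real pipeline uses the dataset's lesion classes

    model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
    model.classifier = nn.Linear(model.classifier.in_features, NUM_CLASSES)

    # Fine-tune the whole network at a small learning rate (one common choice);
    # alternatively the backbone can be frozen and only the new head trained.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)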

Each step can be implemented using different DNNs, generating different variants of the pipeline. For each stage, we selected the best DNNs experimented with in the literature and paired them with a systematic exploration of the hyper-parameter space, allowing a deeper search for the best model. Moreover, to obtain more consistent results, we applied 5-fold cross-validation to each variant of the training process.

StreamFlow application

To set up experiments on the pipeline, we chose the most significant publicly available dataset related to COVID-19’s pathology course, i.e., BIMCV-COVID19+, with more than 120k images from 1,300 patients. After the pre-processing and segmentation phases and a filtering process to remove spurious images, a single training epoch takes on average 5 minutes on the resulting dataset.

Running 50 epochs with 5-fold cross-validation on each network configuration translates into a sequential time of about 52 hours per experiment. A high level of parallelism is needed to run the analysis at scale.

Thanks to StreamFlow and its seamless integration with HPC workload managers, we were able to run the whole spectrum of training configurations for a single DNN (in particular, a DenseNet-121 model) in parallel on the CINECA MARCONI 100 HPC facility. In detail, we explored 12 combinations of hyperparameters with 5-fold cross-validation for a total of 60 experiments. All the experiments ran in parallel on the MARCONI 100 Slurm queue, requesting an NVIDIA Tesla V100 device for each of them.
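
The resulting set of runs is just the Cartesian product of the hyperparameter grid and the cross-validation folds. The sketch below enumerates it with placeholder values (the actual grid lives in the pipeline configuration); StreamFlow then submits one Slurm job per entry.

    # Enumerate the 12 x 5 = 60 experiments as (hyperparameters, fold) pairs.
    # Hyperparameter values are placeholders, not the published grid.
    from itertools import product

    grid = {
        "lr": [1e-3, 1e-4, 1e-5],        # illustrative values
        "batch_size": [16, 32],
        "optimizer": ["adam", "sgd"],
    }
    folds = range(5)                      # 5-fold cross-validation

    experiments = [
        {"fold": fold, **dict(zip(grid, values))}
        for values in product(*grid.values())
        for fold in folds
    ]
    assert len(experiments) == 60         # 12 hyperparameter combinations x 5 folds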

To further speed up the training, we introduced an early stopping criterion, terminating the training process after ten epochs without improvements in the validation accuracy. With this setting, the whole batch of experiments terminated after ~80 minutes, with a 33.5x speedup compared to a fully sequential run on a single NVIDIA Tesla V100.
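
The early-stopping rule is simple to state in code; the sketch below assumes generic train_one_epoch and evaluate callables rather than the pipeline’s actual training script.

    # Early stopping: abort training after `patience` epochs without an
    # improvement in validation accuracy (patience=10 in the experiments).
    def train_with_early_stopping(train_one_epoch, evaluate,
                                  max_epochs=50, patience=10):
        best_acc, epochs_without_improvement = 0.0, 0
        for epoch in range(max_epochs):
            train_one_epoch()
            acc = evaluate()                      # validation accuracy in [0, 1]
            if acc > best_acc:
                best_acc, epochs_without_improvement = acc, 0
            else:
                epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                             # stop: no recent improvement
        return best_acc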

I. Colonnelli, B. Cantalupo, R. Esposito, M. Pennisi, C. Spampinato, and M. Aldinucci, “HPC Application Cloudification: The StreamFlow Toolkit,” in 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2021), 2021. doi:10.4230/OASIcs.PARMA-DITAM.2021.5