Deep Learning

Federated Learning meets HPC and Cloud

The Federated Learning (FL) approach is a paradigmatic example of modern AI applications. FL tackles the problem of collaboratively training a Machine Learning model using distributed data silos, where data cannot leave the owner’s infrastructure to ensure privacy and secrecy. Modelling a FL workflow is challenging because it requires federating infrastructures and iterative execution patterns.

Existing FL architectures

The typical runtime architecture of FL frameworks (e.g., Intel OpenFL and Flower) is a master/worker. Each worker is deployed onto a different silo, where it trains a private copy of a Deep Neural Network (DNN). At the end of each training round, each worker sends its model to the master, which computes an aggregated model using a configurable algorithm and broadcasts it back to workers for the next round.

Some recent FL frameworks drop the constraint of a single centralized aggregator, either relying on a tree-based infrastructure or implementing a fully decentralized peer-to-peer aggregation protocol. However, the communication topology is always an intrinsic characteristic of the framework implementation.

In research scenarios, data providers are usually independent entities with heterogeneous data treatment protocols, storage infrastructures, and access policies. Therefore, cross-silo FL pipelines are perfect candidates to be modeled as hybrid workflows.

StreamFlow application

StreamFlow has been used to execute a cross-cluster FL pipeline, where two independent HPC clusters train a model on two different private datasets and a Cloud VM acts as a centralized aggregator.

As a first step, a Common Workflow Language (CWL) description of a FL pipeline has been designed. The pipeline trains a VGG16 DNN over two datasets: a standard MNIST residing on the CINECA MARCONI100 HPC facility (2×16-core IBM POWER9 AC922, 256 GB RAM, and 4 NVIDIA V100 GPUs per node), and a grayscaled version of SVHN residing in the EPITO bare metal partition of the HPC4AI facility at Università di Torino (80-core Arm Neoverse-N1, 512GB RAM, and 2 NVIDIA A100 GPU per node). Note that, up to version v1.2, CWL does not support iterative constructs. However, this pipeline is the first real case iterative CWL workflow, relying on the recently proposed Loop extension. The code is available on GitHub.

Two different FL configurations have been tested: 100 rounds of 1 epoch each and 50 rounds of 2 epochs each, using the well known Federated Averaging (FedAvg) algorithm. Note that the typical master/worker architecture of FL frameworks requires direct bidirectional communications between the aggregator and each worker node. This is not compatible with the typical network configuration of an HPC facility, where worker nodes cannot open outbound connections. Therefore, StreamFlow is a key enabling technology for cross-cluster FL.

To compare performances with a baseline, the pipeline has also been tested on a pure cloud execution environment, replacing the two clusters with two VMs (8 cores, 32 GB RAM, and 1 NVIDIA T5 GPU each) running on the cloud partition of the HPC4AI facility. The performance obtained with the StreamFlow execution of the pipeline has been compared with an equivalent training workload managed by the Intel OpenFL framework. Collected results are comparable in terms of both accuracy and time-to-solution, showing how general-purpose hybrid workflows are ready to provide adequate performance in the FL field.

I. Colonnelli, B. Casella, G. Mittone, Y. Arfat, B. Cantalupo, R. Esposito, A. R. Martinelli, D. Medić, and M. Aldinucci, “Federated learning meets HPC and cloud,” in Astrophysics and space science proceedings, Catania, Italy, 2022.

Deep Learning

AI-assisted COVID-19 diagnosis with the CLAIRE universal pipeline

At the start of the pandemic, several studies outlined the effectiveness of radiology imaging for AI-assisted COVID-19 diagnosis through chest X-Ray and mainly Computed Tomography(CT), given the pulmonary involvement in subjects affected by the infection. Even if X-Ray represents a cheaper and most effective solution for large-scale screening, its low resolution led AI models to show lower accuracy than those obtained with CT data.

Several research groups worldwide began to develop deep-learning models for the diagnosis of COVID-19, mainly in the form of deep Convolutional Neural Networks (CNN), applying lung disease analysis from CT scans images. As soon as we started analyzing all the proposed solutions, it was evident that it was impossible to select the most promising ones, due to the use of different and non-comparable architectures, pipelines and datasets. So, we started working on defining a reproducible workflow capable of automating the comparison of state-of-the-art deep learning models to diagnose COVID-19.

The CLAIRE task force on COVID-19

When the pandemic broke out, among the initiatives aimed at improving the knowledge of the virus, containing its diffusion, and limiting its effects, the Confederation of Laboratories for Artificial Intelligence Research in Europe (CLAIRE) task force on AI & COVID-19 supported the set up of a novel European group to study the diagnosis of COVID-19 pneumonia assisted by Artificial Intelligence (AI). The group includes fifteen researchers in complementary disciplines (Radiomics, AI, and HPC), led by Prof. Marco Aldinucci, full professor at the University of Torino Computer Science Dept.

The CLAIRE-COVID19 universal pipeline

Such collaboration gave birth to the CLAIRE-COVID19 universal pipeline, designed to compare different training algorithms to define a baseline for such techniques and to allow the community to quantitatively measure AI’s progress in the diagnosis of COVID-19 and similar diseases.

The universal pipeline comprises two initial steps: Image Preprocessing and Segmentation. The first applies standard techniques for cleaning and generating variants of training images, while the second uses a DNN-based encoder (e.g., UNet) to isolate a region of interest from the background information (e.g., lungs from other tissues). The final stages are also typical pipeline components implementing Performance metrics and Explainability measures collection.

The core steps are DNN-based. They are Pre-training and Classification. Pre-training aims to generate a first set of weights for the next fine-tuning step, using either an unsupervised technique (e.g., an auto-encoder) or running a supervised training on a different dataset (e.g., ImageNet). The classification step then labels each image with a class identified with a kind of lesion typical of the disease.

Each step can be implemented using different DNNs, generating different variants of the pipeline. We selected the best DNNs that have been experimented in literature for each stage, together with a systematic exploration of the hyper-parameter space, allowing a deeper search for the best model. Moreover, to obtain more consistent results, we applied 5-fold cross-validation to each training process variant.

StreamFlow application

To set up experiments on the pipeline, we chose the most significant dataset publicly available related to COVID-19’s pathology course, i.e., BIMCV-COVID19+, with more than 120k images from 1300 patients. After pre-processing and segmentation phases and a filtering process to remove spurious images, a single training epoch takes on average 5 mins on the resulting dataset.

Running 50 epochs and 5-folds cross-validation on each network configuration translates in a sequential time of about 52 hours for each experiment. A high level of parallelism is needed to run the analysis at scale.

Thanks to StreamFlow and its seamless integration with HPC workload managers, we were able to run the whole spectrum of training configurations for a single DNN (in particular, a DenseNet-121 model) in parallel on the CINECA MARCONI 100 HPC facility. In detail, we explored 12 combinations of hyperparameters with 5-fold cross-validation for a total of 60 experiments. All the experiments ran in parallel on the MARCONI 100 Slurm queue, requesting an NVIDIA Tesla V100 device for each of them.

To further speed up the training, we introduced an early stopping criterion, terminating the training process after ten epochs without improvements in the validation accuracy. With this setting, the whole terminated after ~80 minutes, with a 33.5x speedup compared to a fully sequential run on a single NVIDIA Tesla V100.

I. Colonnelli, B. Cantalupo, R. Esposito, M. Pennisi, C. Spampinato and M. Aldinucci,
“HPC Application Cloudification: The StreamFlow Toolkit,” in 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2021), doi: 10.4230/OASIcs.PARMA-DITAM.2021.5.