Deep Learning

Federated Learning meets HPC and Cloud

The Federated Learning (FL) approach is a paradigmatic example of modern AI applications. FL tackles the problem of collaboratively training a Machine Learning model using distributed data silos, where data cannot leave the owner’s infrastructure to ensure privacy and secrecy. Modelling an FL workflow is challenging because it requires both federated infrastructures and iterative execution patterns.

Existing FL architectures

The typical runtime architecture of FL frameworks (e.g., Intel OpenFL and Flower) is master/worker. Each worker is deployed onto a different silo, where it trains a private copy of a Deep Neural Network (DNN). At the end of each training round, each worker sends its model to the master, which computes an aggregated model using a configurable algorithm and broadcasts it back to the workers for the next round.

Some recent FL frameworks drop the constraint of a single centralized aggregator, either relying on a tree-based infrastructure or implementing a fully decentralized peer-to-peer aggregation protocol. However, the communication topology is always an intrinsic characteristic of the framework implementation.

In research scenarios, data providers are usually independent entities with heterogeneous data treatment protocols, storage infrastructures, and access policies. Therefore, cross-silo FL pipelines are perfect candidates to be modeled as hybrid workflows.

StreamFlow application

StreamFlow has been used to execute a cross-cluster FL pipeline, where two independent HPC clusters train a model on two different private datasets and a Cloud VM acts as a centralized aggregator.

As a first step, a Common Workflow Language (CWL) description of an FL pipeline has been designed. The pipeline trains a VGG16 DNN over two datasets: a standard MNIST residing on the CINECA MARCONI100 HPC facility (2×16-core IBM POWER9 AC922, 256 GB RAM, and 4 NVIDIA V100 GPUs per node), and a grayscaled version of SVHN residing in the EPITO bare metal partition of the HPC4AI facility at Università di Torino (80-core Arm Neoverse-N1, 512 GB RAM, and 2 NVIDIA A100 GPUs per node). Note that, up to version v1.2, CWL does not support iterative constructs. However, this pipeline is the first real-world iterative CWL workflow, relying on the recently proposed Loop extension. The code is available on GitHub.

Two different FL configurations have been tested: 100 rounds of 1 epoch each and 50 rounds of 2 epochs each, using the well-known Federated Averaging (FedAvg) algorithm. Note that the typical master/worker architecture of FL frameworks requires direct bidirectional communication between the aggregator and each worker node. This is not compatible with the typical network configuration of an HPC facility, where worker nodes cannot open outbound connections. Therefore, StreamFlow is a key enabling technology for cross-cluster FL.
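The aggregation step of FedAvg can be illustrated with a minimal sketch. It assumes each worker returns its model as a flat list of parameters together with the number of local training samples; the function name fedavg and the sample counts in the usage note are hypothetical and are not taken from the pipeline's code.

```python
from typing import List, Sequence


def fedavg(models: Sequence[Sequence[float]], num_samples: Sequence[int]) -> List[float]:
    """Average flat parameter vectors, weighting each by its local sample count."""
    if not models or len(models) != len(num_samples):
        raise ValueError("need one sample count per model")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    size = len(models[0])
    aggregated = [0.0] * size
    for params, n in zip(models, num_samples):
        if len(params) != size:
            raise ValueError("all models must have the same number of parameters")
        for i, value in enumerate(params):
            # Each worker contributes proportionally to the data it trained on.
            aggregated[i] += value * n / total
    return aggregated
```

For instance, fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]) weights the second worker three times as much as the first. In a full round, the aggregated parameters would be sent back to every worker before the next round starts.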

To compare performance with a baseline, the pipeline has also been tested on a pure cloud execution environment, replacing the two clusters with two VMs (8 cores, 32 GB RAM, and 1 NVIDIA T4 GPU each) running on the cloud partition of the HPC4AI facility. The performance obtained with the StreamFlow execution of the pipeline has been compared with an equivalent training workload managed by the Intel OpenFL framework. Collected results are comparable in terms of both accuracy and time-to-solution, showing that general-purpose hybrid workflows are ready to provide adequate performance in the FL field.

I. Colonnelli, B. Casella, G. Mittone, Y. Arfat, B. Cantalupo, R. Esposito, A. R. Martinelli, D. Medić, and M. Aldinucci, “Federated learning meets HPC and cloud,” in Astrophysics and space science proceedings, Catania, Italy, 2022.


Cell subpopulation discovery on Cloud-HPC

The single-cell RNA sequencing (scRNA-seq) analysis technique is essential to assess fundamental biological properties of cell populations and biological systems at unprecedented resolution. Identifying subpopulations of cells in scRNA-seq experiments is one of the most frequently performed analyses of single-cell data. Subpopulation discovery commonly relies on clustering algorithms, e.g., those provided by the Seurat R package (see also this post).

The rCASC library

The rCASC library is specifically designed to provide an integrated analysis environment for cell subpopulation discovery, providing high flexibility and enabling computation reproducibility. In detail, rCASC supports three different analysis steps: raw data preprocessing, subpopulation discovery via clustering, and cluster-specific gene signature detection.

These three analysis steps require different computational resources and computing models. They are therefore suitable to be described as a hybrid workflow running on a heterogeneous computing environment, composed of CPU and GPU architectures, multi-core machines and multi-server deployments. In particular, clustering is the most computationally demanding activity.

StreamFlow application

The StreamFlow framework has been leveraged to execute three different clustering algorithms (SIMLR, Griph, and tSne) running on up to 8 virtual machines (8 cores, 32 GB RAM each) allocated on the HPC4AI Cloud facility at Università di Torino. In particular, the goal was to measure the speedup achievable using StreamFlow with respect to a single multi-core server.

A single-step workflow running a subpopulation clustering analysis has been implemented in the Common Workflow Language (CWL) format, as shown in the figure above. Implementing it as a workflow allows using the CWL scatter feature to distribute independent portions of the workload across multiple locations for concurrent execution. In this case, the scatter is executed on the index_array input field. The code is available on GitHub.
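The scatter pattern described above can be sketched in Python as a map over the elements of index_array, where each element triggers an independent run of the same step. This is only an analogy, not CWL and not StreamFlow code: the helper scatter, the placeholder step cluster_partition, and the use of a local thread pool in place of remote locations are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def scatter(step: Callable[[T], R], index_array: Sequence[T], max_workers: int = 8) -> List[R]:
    """Run `step` once per element of `index_array`; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(step, index_array))


def cluster_partition(index: int) -> str:
    # Hypothetical placeholder for one independent clustering run.
    return f"partition-{index}-done"


results = scatter(cluster_partition, [0, 1, 2, 3])
```

Because the runs do not depend on each other, the workflow manager is free to place each one on a different machine, which is where the speedup over a single server comes from.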

All three clustering algorithms showed a significant speedup as the number of compute nodes increased. In particular, on 8 nodes Griph and SIMLR obtained good speedups of 4x and 4.7x, respectively. The tSne algorithm also obtained a significant speedup of 2.5x, supporting the general usefulness of hybrid workflows in subpopulation clustering analyses.
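To put these figures in perspective, parallel efficiency (speedup divided by the number of nodes) can be computed from the reported values. The sketch below uses only the speedups and node count quoted above; the function name parallel_efficiency is chosen here for illustration.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of the ideal linear speedup actually achieved."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# Speedups reported on 8 nodes.
reported = {"Griph": 4.0, "SIMLR": 4.7, "tSne": 2.5}
efficiency = {name: parallel_efficiency(s, 8) for name, s in reported.items()}
```

With these inputs, Griph reaches half of the ideal linear speedup, SIMLR slightly more, and tSne roughly a third.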

S. G. Contaldo, L. Alessandri, I. Colonnelli, M. Beccuti, and M. Aldinucci, “Bringing cell subpopulation discovery on a cloud-HPC using rCASC and StreamFlow,” in Single cell transcriptomics: methods and protocols, R. A. Calogero and V. Benes, Eds., New York, NY: Springer US, 2023, p. 337–345. doi: 10.1007/978-1-0716-2756-3_17.

Deep Learning

AI-assisted COVID-19 diagnosis with the CLAIRE universal pipeline

At the start of the pandemic, several studies outlined the effectiveness of radiology imaging for AI-assisted COVID-19 diagnosis through chest X-Ray and, mainly, Computed Tomography (CT), given the pulmonary involvement in subjects affected by the infection. Even though X-Ray represents a cheaper and more effective solution for large-scale screening, its low resolution led AI models to show lower accuracy than those trained on CT data.

Several research groups worldwide began to develop deep-learning models for the diagnosis of COVID-19, mainly in the form of deep Convolutional Neural Networks (CNN), applying lung disease analysis to CT scan images. As soon as we started analyzing all the proposed solutions, it became evident that it was impossible to select the most promising ones, due to the use of different and non-comparable architectures, pipelines and datasets. So, we started working on defining a reproducible workflow capable of automating the comparison of state-of-the-art deep learning models to diagnose COVID-19.

The CLAIRE task force on COVID-19

When the pandemic broke out, among the initiatives aimed at improving the knowledge of the virus, containing its diffusion, and limiting its effects, the Confederation of Laboratories for Artificial Intelligence Research in Europe (CLAIRE) task force on AI & COVID-19 supported the setup of a novel European group to study the diagnosis of COVID-19 pneumonia assisted by Artificial Intelligence (AI). The group includes fifteen researchers in complementary disciplines (Radiomics, AI, and HPC), led by Prof. Marco Aldinucci, full professor at the University of Torino Computer Science Dept.

The CLAIRE-COVID19 universal pipeline

Such collaboration gave birth to the CLAIRE-COVID19 universal pipeline, designed to compare different training algorithms to define a baseline for such techniques and to allow the community to quantitatively measure AI’s progress in the diagnosis of COVID-19 and similar diseases.

The universal pipeline comprises two initial steps: Image Preprocessing and Segmentation. The first applies standard techniques for cleaning and generating variants of training images, while the second uses a DNN-based encoder (e.g., UNet) to isolate a region of interest from the background information (e.g., lungs from other tissues). The final stages are also typical pipeline components implementing Performance metrics and Explainability measures collection.

The core steps are DNN-based. They are Pre-training and Classification. Pre-training aims to generate a first set of weights for the next fine-tuning step, using either an unsupervised technique (e.g., an auto-encoder) or running a supervised training on a different dataset (e.g., ImageNet). The classification step then labels each image with a class identified with a kind of lesion typical of the disease.

Each step can be implemented using different DNNs, generating different variants of the pipeline. We selected the best DNNs that have been tested in the literature for each stage, together with a systematic exploration of the hyper-parameter space, allowing a deeper search for the best model. Moreover, to obtain more consistent results, we applied 5-fold cross-validation to each training process variant.
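As a reminder of how k-fold cross-validation partitions a dataset, here is a minimal sketch that yields the training and validation indices for each fold. It is not the pipeline's code: the function name k_fold_indices is hypothetical, and the split is contiguous and unshuffled for simplicity, whereas a real experiment would typically shuffle the samples first.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size
```

Each sample appears in exactly one validation fold, so every training variant is evaluated once on each portion of the data.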

StreamFlow application

To set up experiments on the pipeline, we chose the most significant dataset publicly available related to COVID-19’s pathology course, i.e., BIMCV-COVID19+, with more than 120k images from 1300 patients. After the pre-processing and segmentation phases and a filtering process to remove spurious images, a single training epoch takes on average 5 minutes on the resulting dataset.

Running 50 epochs and 5-fold cross-validation on each network configuration translates into a sequential time of about 52 hours for each experiment. A high level of parallelism is needed to run the analysis at scale.

Thanks to StreamFlow and its seamless integration with HPC workload managers, we were able to run the whole spectrum of training configurations for a single DNN (in particular, a DenseNet-121 model) in parallel on the CINECA MARCONI100 HPC facility. In detail, we explored 12 combinations of hyperparameters with 5-fold cross-validation for a total of 60 experiments. All the experiments ran in parallel on the MARCONI100 Slurm queue, requesting an NVIDIA Tesla V100 device for each of them.
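The count of 60 experiments follows from crossing every hyperparameter combination with every fold. The sketch below enumerates such a set of independent jobs; the hyperparameter names and values are hypothetical, chosen only so that the grid has 12 combinations, and do not reflect the settings actually explored.

```python
from itertools import product
from typing import Dict, Iterable, List, Sequence

# Hypothetical grid: 3 x 2 x 2 = 12 combinations.
grid: Dict[str, Sequence[float]] = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [16, 32],
    "weight_decay": [0.0, 1e-4],
}


def enumerate_experiments(grid: Dict[str, Sequence[float]], folds: Iterable[int]) -> List[dict]:
    """Return one job description per (hyperparameter combination, fold) pair."""
    names = list(grid)
    fold_ids = list(folds)
    experiments = []
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        for fold in fold_ids:
            experiments.append({**config, "fold": fold})
    return experiments


experiments = enumerate_experiments(grid, range(5))
```

Since no job depends on another, each entry can be submitted as a separate task requesting one GPU.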

To further speed up the training, we introduced an early stopping criterion, terminating the training process after ten epochs without improvements in the validation accuracy. With this setting, the whole set of experiments terminated after ~80 minutes, with a 33.5x speedup compared to a fully sequential run on a single NVIDIA Tesla V100.
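The early stopping rule can be sketched as a counter of consecutive epochs without improvement in validation accuracy. This is an illustrative sketch under stated assumptions, not the pipeline's implementation: the function name epochs_run is hypothetical, and an improvement is taken to mean any strictly higher accuracy than the best seen so far.

```python
from typing import Iterable


def epochs_run(val_accuracies: Iterable[float], patience: int = 10) -> int:
    """Return how many epochs are executed before early stopping triggers."""
    best = float("-inf")
    since_improvement = 0
    epochs = 0
    for accuracy in val_accuracies:
        epochs += 1
        if accuracy > best:
            # New best validation accuracy: reset the patience counter.
            best = accuracy
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                break
    return epochs
```

For a run whose validation accuracy peaks at the third epoch and never improves again, training stops after 13 epochs instead of the full 50.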

I. Colonnelli, B. Cantalupo, R. Esposito, M. Pennisi, C. Spampinato and M. Aldinucci, “HPC Application Cloudification: The StreamFlow Toolkit,” in 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2021), doi: 10.4230/OASIcs.PARMA-DITAM.2021.5.


Single-cell RNA sequencing

The idea behind novel single-cell RNA sequencing (scRNA-seq) pipelines is to isolate single cells through microfluidic approaches and generate sequencing libraries in which the transcripts are tagged to track their cell of origin. Modern scRNA-seq platforms are capable of analysing from 500 to 20,000 cells in each run. Then, combined with massive high-throughput sequencing producing billions of reads, scRNA-seq allows the assessment of fundamental biological properties of cell populations and biological systems at unprecedented resolution.

Single-cell pipeline

A typical pipeline for single-cell transcriptomic data analysis can be broadly divided into two main parts: the creation of the count matrix, performed according to the adopted single-cell experimental technology and the sequencing approach used, and its statistical analysis, usually using ad-hoc software developed in Python or R.

Considering a typical 10x Genomics experiment followed by Illumina NovaSeq sequencing, the first part of the pipeline is performed using a tool called CellRanger. In particular, this part of the analysis consists of two steps: the creation of the fastq files (the raw sequences of the four bases) from the flowcell provided in output by the sequencer, and the alignment of the reads against the reference genome, in order to find how many reads have been captured for each gene.

The Seurat R package is then used to load data into the R environment and to perform some preliminary operations (such as outlier filtering, normalisation and dimensionality reduction) and clustering, identifying marker genes for each cluster by comparing the expression profile of the cells inside the cluster with all the other cells (see also this post).

Finally, the SingleR package is used to identify the type of each cell (such as Blood Cell, Bone Cell, and Stem Cell) in an unbiased way, leveraging reference transcriptomic datasets of pure cell types to infer the identity of every single cell independently.

The first two steps of the pipeline, related to the creation of the count matrix, have much higher requirements in terms of computing power. Typically, a significant speedup can be observed up to 32 cores and 128 GB of memory. Conversely, R packages are not able to fully exploit such a high level of parallelism, resulting in a waste of HPC resources.

StreamFlow application

In this context, StreamFlow has been leveraged to execute the workflow on top of a hybrid cloud-HPC environment without modifying the original codebase. In particular, CellRanger computations have been performed on the C3S HPC facility at Università di Torino, while the remaining steps have been offloaded to a Kubernetes instance running on top of the GARR cloud infrastructure.

The total execution time of the workflow on top of such hybrid infrastructure is comparable with a full-HPC execution, demonstrating how the StreamFlow approach can be beneficial to obtain a more efficient resource allocation without significant performance drops.

I. Colonnelli, B. Cantalupo, I. Merelli and M. Aldinucci, “StreamFlow: cross-breeding cloud with HPC,” in IEEE Transactions on Emerging Topics in Computing, vol. 9, iss. 4, p. 1723-1737, 2021. doi: 10.1109/TETC.2020.3019202.