Research Data Management

Click on each webinar for its recording and materials.

Workflow hacks for large datasets in HPC (2025-May-20)

In this webinar, I briefly highlight some of the previously covered tools for working with large datasets: (1) Lossy 3D data compression can reduce the size of 3D datasets by up to 100X with no visible artifacts, making it ideal for storage and archival. (2) In-situ visualization enables interactive rendering of large in-memory arrays without the need to store them to disk. (3) Distributed storage of large datasets helps manage vast amounts of data across multiple locations. (4) DAR is a modern, high-performance alternative to TAR that offers indexing, differential archives, and faster extraction.

Speaker: Alex Razoumov
PDF slides

Introduction to Globus (2025-Apr-08)

Globus is a data management service that enables seamless file transfers between endpoints—computers connected to the Globus file transfer network. It has become the standard in research and high-performance computing for efficiently moving large datasets, especially when transferring data to or from institutions that already provide a Globus endpoint.

In this session, we provide an overview of Globus, including a demonstration of how to use this service via the Globus web interface and command line (CLI), as well as how to automate certain Globus tasks.

Speakers: Ben Lai and Michael Tang
PDF slides
Globus Command Line Interface (CLI) documentation

Distributed file storage with git-annex (2024-Nov-26)

git-annex is a file synchronization tool designed to simplify the management of large (typically data-oriented) files under version control. Unlike Git, git-annex does not track file contents but rather facilitates the organization of data across multiple locations, both online and offline, enabling the creation of multiple copies for backup and redundancy, ensuring data safety and organization. In the past, we have taught webinars on tools built upon git-annex, such as DataLad. In these tools the core functionality is typically provided by git-annex, so we believe it is crucial to understand how to effectively organize data using git-annex itself, without the distraction of additional features. Personally, I have been utilizing git-annex for several years to manage my extensive collection of archived files across multiple drives stored on a shelf. git-annex provides built-in redundancy, ensuring that each individual repository or drive is aware of the location of all files on other drives, eliminating the need to power them on just to find a file. git-annex also offers online capabilities, allowing file synchronization across multiple filesystems and clusters to help you manage your research data.

Speaker: Alex Razoumov
Online notes

DataFrames on steroids with Polars (2024-May-14)

Polars is a modern open-source and very fast DataFrame framework for Python, Rust, JS, R, and Ruby. In this webinar, I will demo Polars for Python and show how much faster it is compared to pandas while remaining just as convenient.

Speaker: Marie-Hélène Burle
Online slides

Lossy data compression (2024-Apr-23)

You might be familiar with gzip / bzip2 / zip tools that can compress all types of files without losing data. With typical 3D research datasets, these tools reduce your file sizes by ~30-50% -- in some cases more, depending on the nature of your data. Popular scientific data formats such as NetCDF and HDF5 also support built-in lossless compression most commonly implemented via zlib or szip libraries. On the other hand, we have all used lossy compression for audio, video and images. Lossy compression can be applied to multidimensional scientific datasets as well, with far better compression ratio than with lossless compression, as you really are disposing of some of the less important bits. In general, with 3D scalar fields you can expect a compression ratio of approximately 20:1 or even 30:1, without any visible degradation. This is especially fantastic for archiving the results of multidimensional simulations, as you can store your data in much less space than its original footprint. In this webinar we cover two different approaches to lossy 3D data compression. We focus on file (rather than in-memory) compression, with long-term data storage in mind.

Speaker: Alex Razoumov
PDF slides

Version control for data science and machine learning with DVC (2023-Dec-12)

Data version control (DVC) is an open-source tool that brings all the versioning and collaboration capabilities you use on your code with Git to your data and machine learning workflow. If you use datasets in your work, it makes it easy to track their evolution. If you are in the field of machine learning, it additionally allows you to track your models, manage your pipelines from parameters to metrics, collaborate on your experiments, and integrate with the continuous integration tool for machine learning projects CML. This webinar shows how to get started with DVC, first in the simple case where you just want to put your data under version control, then in the more complex situation where you want to manage your machine learning workflow in a more organized and reproducible fashion.

Speaker: Marie-Hélène Burle
Online slides

Managing large hierarchical datasets with PyTables (2023-May-23)

PyTables is a free and open-source Python library for managing large hierarchical datasets. It is built on top of NumPy and the HDF5 scientific dataset library and it focuses both on performance and interactive analysis of very large datasets. For large data streams (think multi-dimensional arrays or billions of records), it outperforms databases in terms of speed, memory usage, and I/O bandwidth. That said, PyTables is not a replacement for traditional relational databases because it does not support broad relationships between dataset variables. PyTables can even be used to organize a workflow with many (thousands to millions) of small files, as you can create a PyTables database of nodes that can be used like regular opened files in Python. This lets you store a large number of arbitrary files in a PyTables database with on-the-fly compression, making it very efficient for handling huge amounts of data.

Speaker: Alex Razoumov
Online notes

Distributed datasets with DataLad (2023-Mar-28)

This webinar provides a more beginner-oriented tutorial to version control of large data files with DataLad. We start with a textbook introduction to DalaLad showing its main features on top of Git and git-annex. Next we demonstrate several simple but useful workflows. Please note that not everything fit into the 50-min presentation, but the notes below contain everything.

two users on a shared cluster filesystem working with the same dataset stored in `/project`,

one user, one dataset spread over multiple drives, with data redundancy,

publishing a dataset on GitHub with annexed files in a special private remote,

publishing a dataset on GitHub with publicly-accessible annexed files on the Alliance's Nextcloud, and

managing multiple Git repositories under one dataset.

Speaker: Alex Razoumov
Online notes

How to create and access MySQL and PostgreSQL databases on DRI systems (2023-Feb-28)

Speaker: Gemma Hoad
PDF slides

Data management with DataLad (2023-Feb-14)

This talk is a brief introduction to version controlling data and data processing workflows. Three illustrative use cases -- taken from neuroimaging, geophysics, and workflows for analyzing housing data respectively -- are used to provide an introduction to the main concepts of git-based file management, collaboration, and analysis.

Speaker: Ian Percel
PDF slides

Hiding large numbers of files in container overlays (2023-Jan-17)

Many unoptimized HPC cluster workflows result in writing large numbers of files to distributed filesystems which can create significant problems for the performance of these shared filesystems. One of the ways to alleviate this is to organize write operations inside a persistent overlay directory attached to an immutable read-only container with your scientific software. These output files will be stored separately from the base container image, and to the host filesystem an overlay appears as a single large file. In this presentation, we demo running parallel OpenFOAM simulations where all output goes into overlay images, and the total number of files on the host filesystem is reduced from several million to several dozen or less. The same approach can be used in post-processing and visualization, where you can read simulation data from multiple overlays both in serial and in parallel. In this webinar we walk you through all stages of creating and using overlays. We assume no prior knowledge of the container technology.

Speaker: Alex Razoumov
PDF slides

Linking databases to code repositories with Throughput (2021-Mar-03)

Automating your backups in Linux and MacOS (2021-Feb-17)

In this presentation I cover two fantastic multi-platform, open-source backup tools (`dar` and `borg`) that I've been using for many years. I combine them both into a single bash function that keeps multiple copies of your data, switch between two methods for redundancy, with a simple option for an off-site backup on a remote Linux server, and provide a simple mechanism for restoring your data. Both tools support incremental backup, compression, encryption, and -- equally important -- write to a sensible number of archive files that you can easily move around, e.g., to switch to a new backup drive, or to use a low-capacity USB drive for an incremental backup of a much larger filesystem.

Speaker: Alex Razoumov
PDF slides
Bash script with all function definitions

Working with multidimensional datasets in xarray (2020-Sep-30)

Speaker: Alex Razoumov
PDF slides

File access control approaches and best practices (2019-Oct-30)

Speaker: Sergiy Stepanenko
PDF slides

Managing many files with Disk ARchiver (DAR) (2019-May-01)

Large parallel filesystems found on HPC clusters -- such as /home, /scratch and /project -- have one weak spot: they were not designed for storing large numbers of small files. Due to this limitation, we always advise our users to reduce the number of files stored in their directories, either by instrumenting their code to write fewer larger files, or by using an archive tool such as the classic Unix utility `tar` to pack their files into archives. There is a little-known, but incredibly useful open-source tool called `dar` that was developed as a faster, modern replacement to `tar`. DAR stands for `disk archive` and supports file indexing, differential and incremental backups, Linux file Access Control Lists (ACL), compression, symmetric and public key encryption, remote archives, and has many other nice features. In this webinar we go through several use cases for `dar` both on Compute Canada clusters and on your own laptop with a bash shell. We show you how to manage directories with many files, how to backup and restore your data, and other workflows.

Speaker: Alex Razoumov
ZIP file with slides and bash functions

RDM Tools, Platforms, and Best Practices for Canadian Researchers (2019-Mar-20)

Speaker: Alex Garnett and Adam McKenzie
PDF slides

Collapse all webinars in this section