Research Data Management

Table of Contents: “DataFrames on steroids with Polars” • “Lossy data compression” • “Version control for data science and machine learning with DVC” • “Managing large hierarchical datasets with PyTables” • “Distributed datasets with DataLad” • “How to create and access MySQL and PostgreSQL databases on DRI systems” • “Data management with DataLad” • “Hiding large numbers of files in container overlays” • “Linking databases to code repositories with Throughput” • “Automating your backups in Linux and MacOS” • “Working with multidimensional datasets in xarray” • “File access control approaches and best practices” • “Managing many files with Disk ARchiver (DAR)” • “Research Data Management Tools, Platforms, and Best Practices for Canadian Researchers”

“DataFrames on steroids with Polars”

Webinar (2024-May-14) with Marie-Hélène Burle

Polars is a modern open-source and very fast DataFrame framework for Python, Rust, JS, R, and Ruby. In this webinar, I will demo Polars for Python and show how much faster it is compared to pandas while remaining just as convenient.

Online slides

“Lossy data compression”

Webinar (2024-Apr-23) with Alex Razoumov

You might be familiar with the gzip / bzip2 / zip tools that can compress all types of files without losing data. With typical 3D research datasets, these tools reduce your file sizes by ~30-50% – in some cases more, depending on the nature of your data. Popular scientific data formats such as NetCDF and HDF5 also support built-in lossless compression, most commonly implemented via the zlib or szip libraries. On the other hand, we have all used lossy compression for audio, video and images. Lossy compression can be applied to multidimensional scientific datasets as well, with far better compression ratios than lossless compression, as you really are disposing of some of the less important bits. In general, with 3D scalar fields you can expect a compression ratio of approximately 20:1 or even 30:1 without any visible degradation. This is especially useful for archiving the results of multidimensional simulations, as you can store your data in much less space than its original footprint. In this webinar we cover two different approaches to lossy 3D data compression. We focus on file (rather than in-memory) compression, with long-term data storage in mind.
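
To make the idea of "disposing of the less important bits" concrete, here is a minimal sketch that uses only the Python standard library. It is not one of the two approaches covered in the webinar: it simply zeroes the low mantissa bits of each float32 value (a basic form of precision truncation) and then applies ordinary zlib compression. The synthetic field, the grid size, and the choice of 10 retained mantissa bits are all assumptions made for illustration.

```python
import math
import struct
import zlib


def make_field(n):
    """Return a smooth synthetic 3D scalar field as a flat list of floats."""
    values = []
    for i in range(n):
        for j in range(n):
            for k in range(n):
                x, y, z = i / n, j / n, k / n
                wave = (math.sin(2 * math.pi * x)
                        * math.cos(2 * math.pi * y)
                        * math.sin(math.pi * z))
                values.append(1.5 + wave)  # offset keeps all values in [0.5, 2.5]
    return values


def truncate_mantissa(values, keep_bits):
    """Pack values as float32 and zero the low (23 - keep_bits) mantissa bits."""
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    raw = struct.pack(f"<{len(values)}f", *values)
    as_ints = struct.unpack(f"<{len(values)}I", raw)
    return struct.pack(f"<{len(as_ints)}I", *(u & mask for u in as_ints))


field = make_field(24)
full_precision = struct.pack(f"<{len(field)}f", *field)
reduced_precision = truncate_mantissa(field, keep_bits=10)

# Same lossless compressor in both cases; only the input precision differs.
lossless_size = len(zlib.compress(full_precision, 9))
lossy_size = len(zlib.compress(reduced_precision, 9))

# Measure how much accuracy was given up.
reference = struct.unpack(f"<{len(field)}f", full_precision)
restored = struct.unpack(f"<{len(field)}f", reduced_precision)
max_rel_error = max(abs(a - b) / abs(b) for a, b in zip(restored, reference))

print(f"raw size:        {len(full_precision)} bytes")
print(f"lossless (zlib): {lossless_size} bytes")
print(f"truncated+zlib:  {lossy_size} bytes")
print(f"max relative error: {max_rel_error:.2e}")
```

Keeping 10 mantissa bits bounds the relative error below 2^-10 (about 0.1%) for every value. Dedicated lossy compressors typically achieve much better ratios than this toy scheme for a given error tolerance.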

PDF slides

“Version control for data science and machine learning with DVC”

Webinar (2023-Dec-12) with Marie-Hélène Burle

Data Version Control (DVC) is an open-source tool that brings the versioning and collaboration capabilities you use on your code with Git to your data and machine learning workflow. If you use datasets in your work, it makes it easy to track their evolution. If you work in machine learning, it additionally allows you to track your models, manage your pipelines from parameters to metrics, collaborate on your experiments, and integrate with CML, the continuous integration tool for machine learning projects. This webinar shows how to get started with DVC, first in the simple case where you just want to put your data under version control, then in the more complex situation where you want to manage your machine learning workflow in a more organized and reproducible fashion.
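
The core idea behind putting data under version control alongside Git can be sketched in a few lines of standard-library Python. This is a toy illustration of content-addressed storage, not DVC's actual implementation, file format, or API: the `track` and `checkout` helpers, the `.pointer.json` suffix, and the `.cache` directory are all hypothetical names. The large file lives in a cache keyed by its content hash, and only a small pointer file would be committed to Git.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a hash-addressed cache and write a small pointer file."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": digest}))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> None:
    """Restore the data file version recorded in the pointer file."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["sha256"][:2] / meta["sha256"][2:]
    shutil.copy2(cached, pointer.with_name(meta["path"]))


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    cache = workdir / ".cache"
    data = workdir / "measurements.csv"

    # Version 1 of the dataset: the pointer text is what Git would store.
    data.write_text("t,value\n0,1.0\n")
    pointer = track(data, cache)
    pointer_v1 = pointer.read_text()

    # Version 2: the data changes, so the pointer changes too.
    data.write_text("t,value\n0,1.0\n1,1.7\n")
    track(data, cache)

    # Going back in history means restoring the old pointer, then the data.
    pointer.write_text(pointer_v1)
    checkout(pointer, cache)
    restored = data.read_text()

print(restored)
```

Because the pointer file is tiny and plain text, the Git history stays small while every version of the data remains recoverable from the cache.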

Online slides

“Managing large hierarchical datasets with PyTables”

Webinar (2023-May-23) with Alex Razoumov

PyTables is a free and open-source Python library for managing large hierarchical datasets. It is built on top of NumPy and the HDF5 scientific dataset library, and it focuses on both performance and interactive analysis of very large datasets. For large data streams (think multi-dimensional arrays or billions of records), it outperforms databases in terms of speed, memory usage, and I/O bandwidth. That said, PyTables is not a replacement for traditional relational databases because it does not support broad relationships between dataset variables. PyTables can even be used to organize a workflow with many (thousands to millions of) small files, as you can create a PyTables database of nodes that can be used like regular opened files in Python. This lets you store a large number of arbitrary files in a PyTables database with on-the-fly compression, making it very efficient for handling huge amounts of data.
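
The following minimal sketch shows both uses described above: a compressed array stored in a hierarchy, and a small file stored as a node that behaves like an open Python file. It assumes the `tables` (PyTables) and `numpy` packages are installed. The group name `run_001`, the node names, and the parameter text are made up for illustration, and the compression settings are just one reasonable choice.

```python
import os
import tempfile

import numpy as np
import tables
from tables.nodes import filenode

# A stand-in for simulation output: a 64^3 array of doubles.
density = np.arange(64 ** 3, dtype=np.float64).reshape(64, 64, 64)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "simulation.h5")
    compression = tables.Filters(complevel=5, complib="zlib")

    with tables.open_file(path, mode="w", filters=compression) as h5:
        # Groups play the role of directories in the hierarchy.
        h5.create_group("/", "run_001", title="First simulation run")
        # A chunked, compressed array stored under that group.
        h5.create_carray("/run_001", "density", obj=density)
        # A file node: written to like a regular binary file.
        fnode = filenode.new_node(h5, where="/run_001", name="parameters_txt")
        fnode.write(b"viscosity = 0.01\nsteps = 1000\n")
        fnode.close()

    with tables.open_file(path, mode="r") as h5:
        # Slicing reads only the requested part of the array from disk.
        corner = h5.root.run_001.density[:2, :2, :2]
        # Reopen the file node and read it back like a file.
        fnode = filenode.open_node(h5.root.run_001.parameters_txt, "r")
        parameters = fnode.read().decode()
        fnode.close()

print(corner.shape)
print(parameters)
```

On the host filesystem all of this is a single HDF5 file, no matter how many arrays and file nodes it contains.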

Online notes

“Distributed datasets with DataLad”

Webinar (2023-Mar-28) with Alex Razoumov

This webinar provides a more beginner-oriented tutorial on version control of large data files with DataLad. We start with a textbook introduction to DataLad, showing its main features on top of Git and git-annex. Not everything fit into the 50-minute presentation, but the notes below contain everything. After the introduction, we demonstrate several simple but useful workflows:

two users on a shared cluster filesystem working with the same dataset stored in /project,
one user, one dataset spread over multiple drives, with data redundancy,
publishing a dataset on GitHub with annexed files in a special private remote,
publishing a dataset on GitHub with publicly-accessible annexed files on the Alliance’s Nextcloud, and
managing multiple Git repositories under one dataset.

Online notes

“How to create and access MySQL and PostgreSQL databases on DRI systems”

Webinar (2023-Feb-28) with Gemma Hoad

PDF slides

“Data management with DataLad”

Webinar (2023-Feb-14) with Ian Percel

This talk is a brief introduction to version controlling data and data processing workflows. Three illustrative use cases – taken from neuroimaging, geophysics, and workflows for analyzing housing data respectively – are used to provide an introduction to the main concepts of git-based file management, collaboration, and analysis.

PDF slides

“Hiding large numbers of files in container overlays”

Webinar (2023-Jan-17) with Alex Razoumov

Many unoptimized HPC cluster workflows write large numbers of files to distributed filesystems, which can significantly degrade the performance of these shared filesystems. One way to alleviate this is to organize write operations inside a persistent overlay directory attached to an immutable read-only container with your scientific software. The output files are stored separately from the base container image, and to the host filesystem an overlay appears as a single large file. In this presentation, we demo running parallel OpenFOAM simulations where all output goes into overlay images, and the total number of files on the host filesystem is reduced from several million to several dozen or fewer. The same approach can be used in post-processing and visualization, where you can read simulation data from multiple overlays both in serial and in parallel. In this webinar we walk you through all stages of creating and using overlays. We assume no prior knowledge of container technology.