Webinar (2024-Nov-26) with Alex Razoumov
git-annex is a file synchronization tool designed to simplify the management of large (typically data-oriented) files under version control. Unlike Git, git-annex does not track file contents but rather facilitates the organization of data across multiple locations, both online and offline, enabling the creation of multiple copies for backup and redundancy, ensuring data safety and organization. In the past, we have taught webinars on tools built upon git-annex, such as DataLad. In these tools the core functionality is typically provided by git-annex, so we believe it is crucial to understand how to effectively organize data using git-annex itself, without the distraction of additional features. Personally, I have been utilizing git-annex for several years to manage my extensive collection of archived files across multiple drives stored on a shelf. git-annex provides built-in redundancy, ensuring that each individual repository or drive is aware of the location of all files on other drives, eliminating the need to power them on just to find a file. git-annex also offers online capabilities, allowing file synchronization across multiple filesystems and clusters to help you manage your research data.
Webinar (2024-May-14) with Marie-Hélène Burle
Polars is a modern open-source and very fast DataFrame framework for Python, Rust, JS, R, and Ruby. In this webinar, I will demo Polars for Python and show how much faster it is compared to pandas while remaining just as convenient.
Webinar (2024-Apr-23) with Alex Razoumov
You might be familiar with gzip / bzip2 / zip tools that can compress all types of files without losing data. With typical 3D research datasets, these tools reduce your file sizes by ~30-50% – in some cases more, depending on the nature of your data. Popular scientific data formats such as NetCDF and HDF5 also support built-in lossless compression, most commonly implemented via the zlib or szip libraries. On the other hand, we have all used lossy compression for audio, video and images. Lossy compression can be applied to multidimensional scientific datasets as well, with a far better compression ratio than lossless compression, as you really are disposing of some of the less important bits. In general, with 3D scalar fields you can expect a compression ratio of approximately 20:1 or even 30:1, without any visible degradation. This is especially useful for archiving the results of multidimensional simulations, as you can store your data in much less space than its original footprint. In this webinar we cover two different approaches to lossy 3D data compression. We focus on file (rather than in-memory) compression, with long-term data storage in mind.
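To make the lossless/lossy distinction concrete, here is a minimal sketch using only the Python standard library. It is a toy illustration, not either of the two approaches covered in the webinar: the synthetic 1D signal, the choice of 256 quantization levels, and all variable names are assumptions made for the example.

```python
import math
import zlib
from array import array

# Synthetic smooth 1D "field" standing in for real simulation output (assumption).
n = 100_000
values = [math.sin(0.001 * i) + 0.5 * math.sin(0.0137 * i) for i in range(n)]

# Lossless: store as 32-bit floats and compress with zlib.
raw = array("f", values).tobytes()
lossless = zlib.compress(raw, 9)
roundtrip_ok = zlib.decompress(lossless) == raw  # every bit is recovered

# Lossy: quantize each value to one of 256 levels (8 bits) before compressing.
lo, hi = min(values), max(values)
step = (hi - lo) / 255
quantized = bytes(round((v - lo) / step) for v in values)
lossy = zlib.compress(quantized, 9)

# Reconstruct and measure the error introduced by quantization.
restored = [lo + q * step for q in zlib.decompress(lossy)]
max_error = max(abs(a - b) for a, b in zip(values, restored))

print(f"raw float32:      {len(raw)} bytes")
print(f"lossless (zlib):  {len(lossless)} bytes")
print(f"lossy (8-bit):    {len(lossy)} bytes")
print(f"max abs error:    {max_error:.5f} (quantization step {step:.5f})")
```

The lossy variant trades a bounded error (at most half a quantization step) for a much smaller file. Dedicated scientific compressors use far more sophisticated transforms, but the trade-off is the same in spirit.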
Webinar (2023-Dec-12) with Marie-Hélène Burle
Data version control (DVC) is an open-source tool that brings all the versioning and collaboration capabilities you use on your code with Git to your data and machine learning workflow. If you use datasets in your work, it makes it easy to track their evolution. If you are in the field of machine learning, it additionally allows you to track your models, manage your pipelines from parameters to metrics, collaborate on your experiments, and integrate with CML, the continuous integration tool for machine learning projects. This webinar shows how to get started with DVC, first in the simple case where you just want to put your data under version control, then in the more complex situation where you want to manage your machine learning workflow in a more organized and reproducible fashion.
Webinar (2023-May-23) with Alex Razoumov
PyTables is a free and open-source Python library for managing large hierarchical datasets. It is built on top of NumPy and the HDF5 scientific dataset library and it focuses both on performance and interactive analysis of very large datasets. For large data streams (think multi-dimensional arrays or billions of records), it outperforms databases in terms of speed, memory usage, and I/O bandwidth. That said, PyTables is not a replacement for traditional relational databases because it does not support broad relationships between dataset variables. PyTables can even be used to organize a workflow with many (thousands to millions of) small files, as you can create a PyTables database of nodes that can be used like regular open files in Python. This lets you store a large number of arbitrary files in a PyTables database with on-the-fly compression, making it very efficient for handling huge amounts of data.
Webinar (2023-Mar-28) with Alex Razoumov
This webinar provides a more beginner-oriented tutorial on version control of large data files with DataLad. We start with a textbook introduction to DataLad showing its main features on top of Git and git-annex. Next we demonstrate several simple but useful workflows. Please note that not everything fit into the 50-min presentation, but the notes below contain everything.
/project
Webinar (2023-Feb-28) with Gemma Hoad
Webinar (2023-Feb-14) with Ian Percel
This talk is a brief introduction to version controlling data and data processing workflows. Three illustrative use cases – taken from neuroimaging, geophysics, and workflows for analyzing housing data respectively – are used to provide an introduction to the main concepts of git-based file management, collaboration, and analysis.
Webinar (2023-Jan-17) by Alex Razoumov
Many unoptimized HPC cluster workflows result in writing large numbers of files to distributed filesystems, which can create significant problems for the performance of these shared filesystems. One of the ways to alleviate this is to organize write operations inside a persistent overlay directory attached to an immutable read-only container with your scientific software. These output files will be stored separately from the base container image, and to the host filesystem an overlay appears as a single large file. In this presentation, we demo running parallel OpenFOAM simulations where all output goes into overlay images, and the total number of files on the host filesystem is reduced from several million to several dozen or fewer. The same approach can be used in post-processing and visualization, where you can read simulation data from multiple overlays both in serial and in parallel. In this webinar we walk you through all stages of creating and using overlays. We assume no prior knowledge of container technology.
Webinar (2021-Mar-03) by Simon Goring
Webinar (2021-Feb-17) by Alex Razoumov
In this presentation I cover two fantastic multi-platform, open-source backup tools (dar and borg) that I’ve been using for many years. I combine them both into a single bash function that keeps multiple copies of your data, switches between the two methods for redundancy, offers a simple option for an off-site backup on a remote Linux server, and provides a simple mechanism for restoring your data. Both tools support incremental backup, compression, encryption, and – equally important – write to a sensible number of archive files that you can easily move around, e.g., to switch to a new backup drive, or to use a low-capacity USB drive for an incremental backup of a much larger filesystem.
Webinar (2020-Sep-30) by Alex Razoumov
Webinar (2019-Oct-30) by Sergiy Stepanenko
Webinar (2019-May-01) by Alex Razoumov
Large parallel filesystems found on HPC clusters – such as /home, /scratch and /project – have one weak spot: they were not designed for storing large numbers of small files. Due to this limitation, we always advise our users to reduce the number of files stored in their directories, either by instrumenting their code to write fewer larger files, or by using an archive tool such as the classic Unix utility tar to pack their files into archives. There is a little-known but incredibly useful open-source tool called dar that was developed as a faster, modern replacement for tar. DAR stands for disk archive and supports file indexing, differential and incremental backups, Linux file Access Control Lists (ACL), compression, symmetric and public key encryption, remote archives, and has many other nice features. In this webinar we go through several use cases for dar both on Compute Canada clusters and on your own laptop with a bash shell. We show you how to manage directories with many files, how to back up and restore your data, and other workflows.
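The idea of replacing many small files with a single archive can be sketched in a few lines. This example uses Python's standard tarfile module (that is, the tar format, not dar itself) and works in a temporary directory. The directory name, file names and file count are hypothetical, chosen only for illustration.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create a hypothetical output directory holding 1000 small files.
    src = Path(tmp) / "results"
    src.mkdir()
    for i in range(1000):
        (src / f"frame_{i:04d}.txt").write_text(f"step {i}\n")

    # Pack the whole directory into one gzip-compressed tar archive.
    archive = Path(tmp) / "results.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="results")

    # Read the archive back: count the packed files and pull one out
    # without unpacking everything to disk.
    with tarfile.open(archive, "r:gz") as tar:
        n_packed = sum(1 for m in tar.getmembers() if m.isfile())
        first = tar.extractfile("results/frame_0000.txt").read().decode()

    print(f"{n_packed} files stored in one archive of {archive.stat().st_size} bytes")
    print(f"first file contains: {first!r}")
```

On a shared filesystem, the archive counts as one file instead of a thousand. A compressed tar archive has no index, so reaching a single member means scanning through the archive, which is one motivation for the file indexing mentioned above.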
Webinar (2019-Mar-20) by Alex Garnett and Adam McKenzie