Usage¶
To understand how h5preserve works, you need to remember the following concepts:

- dumper: a function which converts your object to a representation ready to be written to an HDF5 file. It has an associated class, version and class label.
- loader: a function which converts a representation of an HDF5 object (group, dataset etc.) to an instance of a specified class. It has an associated version and class label.
- registry: a collection of dumpers and loaders, providing a common namespace. h5preserve comes with a few registries which convert common Python types.
- registry collection: a collection of registries. It deals with choosing the correct registry, dumper and loader to use, including version locking.
So a complete example based on the Quickstart example is:

import numpy as np
from h5preserve import (
    open as h5open, Registry, new_registry_list, DatasetContainer
)

registry = Registry("experiment")

class Experiment:
    def __init__(self, data, time_started):
        self.data = data
        self.time_started = time_started

@registry.dumper(Experiment, "Experiment", version=1)
def _exp_dump(experiment):
    return DatasetContainer(
        data=experiment.data,
        attrs={
            "time started": experiment.time_started
        }
    )

@registry.loader("Experiment", version=1)
def _exp_load(dataset):
    return Experiment(
        data=dataset["data"],
        time_started=dataset["attrs"]["time started"]
    )

my_cool_experiment = Experiment(np.array([1, 2, 3, 4, 5]), 10)

with h5open("my_data_file.hdf5", new_registry_list(registry), mode='w') as f:
    f["cool experiment"] = my_cool_experiment

with h5open("my_data_file.hdf5", new_registry_list(registry), mode='r') as f:
    my_cool_experiment_loaded = f["cool experiment"]
    print(
        my_cool_experiment_loaded.time_started ==
        my_cool_experiment.time_started
    )
Whilst for this simple case it’s probably overkill to use h5preserve, h5preserve deals quite easily with changing requirements, such as adding additional properties to Experiment via versioning, splitting Experiment into multiple classes via recursively converting Python objects, or even more complex requirements via only reading and converting when needed, or dumping subsets of a class before dumping the whole class.
The rest of this guide provides information about how to deal with specific topics (versioning, advanced loading and dumping), but these topics are not required to use h5preserve.
How Versioning Works¶
Valid versions for dumpers are either integers or None. Valid versions for loaders are integers, None, any or all.
Which loader is used depends on the version the data was dumped with. Dumpers are chosen in the following order:

1. If a version of a dumper is locked, use that one
2. None, if available
3. The latest version of the dumper available
Using None should not be done lightly, as it requires that the dumper and loader never change in any way: there is no way of overriding which loader h5preserve uses when None is available. It may be better to give the dumper an integer version and use a loader with a version of all, which can be modified at the Python level without requiring modification of the existing file.
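The dumper selection order above can be sketched in plain Python. This is an illustrative stand-in, not h5preserve’s internal code; the function name and signature are hypothetical:

```python
def pick_dumper_version(available, locked=None):
    """Choose a dumper version following the order described above:
    a locked version wins, then the unversioned dumper (None), then
    the latest integer version.

    `available` is the set of registered dumper versions (ints and/or
    None); `locked` is a version previously fixed via version locking.
    """
    versions = set(available)
    if locked is not None:
        if locked not in versions:
            raise KeyError(f"locked version {locked!r} is not registered")
        return locked
    if None in versions:
        return None
    # no lock and no unversioned dumper: use the newest version
    return max(v for v in versions if v is not None)
```

With dumpers registered for versions 1 and 2, pick_dumper_version({1, 2}) picks 2, while locking via pick_dumper_version({1, 2}, locked=1) forces version 1.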
A versioning example¶
Imagine a class like Experiment above; you have some data, and some metadata (to keep the example simple, we’re only going to have one piece of metadata, and no data):

class ModelOutput:
    def __init__(self, a):
        self.a = a

a represents some input parameter to our model. We also write the associated dumper and loader:
@registry.dumper(ModelOutput, "ModelOutput", version=1)
def _exp_dump(modeloutput):
    return DatasetContainer(
        attrs={
            "a": modeloutput.a
        }
    )

@registry.loader("ModelOutput", version=1)
def _exp_load(dataset):
    return ModelOutput(
        a=dataset["attrs"]["a"]
    )
However, later on we realise we should have used b instead of a. This could be because we want to use radians instead of degrees, because b is more meaningful in the model, or for some other reason which motivates a change to the class. We change our class:

class ModelOutput:
    def __init__(self, b):
        self.b = b
and create a new dumper and loader for version 2 of this class:
@registry.dumper(ModelOutput, "ModelOutput", version=2)
def _exp_dump(modeloutput):
    return DatasetContainer(
        attrs={
            "b": modeloutput.b
        }
    )

@registry.loader("ModelOutput", version=2)
def _exp_load(dataset):
    return ModelOutput(
        b=dataset["attrs"]["b"]
    )
But then, how do we load our old data? Let’s assume that \(b = 2a\). So we’d write a loader for version 1 which converts a to b:

@registry.loader("ModelOutput", version=1)
def _exp_load(dataset):
    return ModelOutput(
        b=2 * dataset["attrs"]["a"]
    )
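To see the migration in action without touching HDF5, the two loaders can be mimicked with plain dictionaries standing in for the stored datasets (a sketch only; real loaders receive h5preserve’s dataset objects, and the version is tracked by the registry collection):

```python
class ModelOutput:
    def __init__(self, b):
        self.b = b

def load_v1(dataset):
    # version 1 files stored `a`; convert on load using b = 2 * a
    return ModelOutput(b=2 * dataset["attrs"]["a"])

def load_v2(dataset):
    return ModelOutput(b=dataset["attrs"]["b"])

loaders = {1: load_v1, 2: load_v2}

def load(dataset):
    # dispatch on the version the data was dumped with
    return loaders[dataset["version"]](dataset)

old_file = {"version": 1, "attrs": {"a": 3}}
new_file = {"version": 2, "attrs": {"b": 6}}
```

Both files load to an equivalent ModelOutput: load(old_file).b and load(new_file).b are both 6.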
What about a dumper? We could write one as well, but it may be that we added additional metadata rather than just changing its representation; in that case the version 1 format cannot store all of our metadata, so we cannot write a dumper for version 1.
One thing h5preserve cannot do is check that your code is forward or backward compatible between different versions; that has to be managed by the user (there is some work on providing tools to help with automated testing of loaders and dumpers, but that will still require having something to test against).
Locking Dumper Version¶
It is possible to force which dumper version is going to be used, via RegistryContainer.lock_version(). Given that Experiment is a class you want to dump version 1 of, and registries is an instance of RegistryContainer which contains a Registry that can dump Experiment, an example is:
registries = new_registry_list(registry)
registries.lock_version(Experiment, 1)
Controlling how Classes are Dumped¶
h5preserve will recursively dump arguments passed to GroupContainer or DatasetContainer (as well as any variations on those classes), as long as the arguments are either supported by h5py for writing (e.g. numpy arrays) or covered by a dumper. Hence, dumpers should only need to worry about the name under which each attribute of the class is saved, and whether each attribute should be saved as a group/dataset attribute or as a group/dataset (currently there is no support for loaders/dumpers that only write group/dataset attributes without creating a new group/dataset).
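The recursion can be pictured with a toy dumper that walks an object graph, writing anything h5py could store directly and recursing wherever a dumper is registered. The helper and names here are hypothetical, not part of h5preserve’s API:

```python
def toy_dump(obj, dumpers):
    """Recursively convert `obj` into a nested dict standing in for an
    HDF5 group hierarchy. `dumpers` maps a type to a function returning
    that object's named children, playing the role of registered dumpers."""
    dumper = dumpers.get(type(obj))
    if dumper is not None:
        return {name: toy_dump(child, dumpers)
                for name, child in dumper(obj).items()}
    return obj  # assume h5py could write this value directly

class Run:
    def __init__(self, data, label):
        self.data = data
        self.label = label

class Study:
    def __init__(self, runs):
        self.runs = runs

dumpers = {
    Run: lambda r: {"data": r.data, "label": r.label},
    Study: lambda s: {f"run{i}": r for i, r in enumerate(s.runs)},
}
```

Here toy_dump(Study([Run([1, 2], "a")]), dumpers) yields {"run0": {"data": [1, 2], "label": "a"}}: the Study dumper only names its children, and the Run dumper is applied to each of them recursively.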
Using DatasetContainer and GroupContainer¶
The Quickstart example above used DatasetContainer; DatasetContainer takes keyword arguments which are passed on to h5py.Group.create_dataset(), as well as an attrs keyword argument which is used to set attributes on the associated HDF5 dataset.
GroupContainer behaves similarly to DatasetContainer; it also takes keyword arguments, plus an attrs keyword argument. However, here the keyword names are used as the names of the subgroups or datasets created from the keyword arguments. Modifying the Quickstart example to use a group instead of a dataset is simple; we just change the dumper as shown below:
@registry.dumper(Experiment, "Experiment", version=1)
def _exp_dump(experiment):
    return GroupContainer(
        experiment_data=experiment.data,
        attrs={
            "time started": experiment.time_started
        }
    )
The start time is now written to an attribute on the HDF5 group, and experiment.data is written to either a dataset or a group, depending on its type. If, as above, it is a numpy array, it will be written as a dataset (but it will not have "time started" as an attribute).
Loading from a group is the same as loading from a dataset:
@registry.loader("Experiment", version=1)
def _exp_load(group):
    return Experiment(
        data=group["experiment_data"],
        time_started=group["attrs"]["time started"]
    )
Using On-Demand Loading¶
The purpose of on-demand loading is to deal with cases where recursively loading a group takes up too much memory. On-demand loading requires modifications to the class which contains the objects which are to be loaded on-demand. The modifications are:

1. Wrapping attributes and other objects which should be loaded on-demand with the wrap_on_demand() function when they are set, and unwrapping the objects when needed.
2. Adding cls._h5preserve_update() as a callback function to be called when the class is dumped. This callback must wrap any of the above objects which are to be loaded on-demand with wrap_on_demand(), as above.
wrap_on_demand()
returns an instance of OnDemandWrapper
, which can be called
with no arguments to return the original object (similar to a weakref).
An example of the necessary code for a class which subclasses collections.abc.MutableMapping and which stores its members in _mapping is:

def __getitem__(self, key):
    value = self._mapping[key]
    if isinstance(value, OnDemandWrapper):
        value = value()
        self._mapping[key] = value  # acting as a cache; this can be skipped if desired
    return value

def __setitem__(self, key, val):
    self._mapping[key] = wrap_on_demand(self, key, val)

def _h5preserve_update(self):
    for key, val in self.items():
        self._mapping[key] = wrap_on_demand(self, key, val)
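The behaviour of the wrapper itself can be demonstrated with a minimal stand-in for OnDemandWrapper (a hypothetical class; the real one is created by wrap_on_demand() and performs the HDF5 read when called):

```python
class LazyWrapper:
    """Calling the wrapper with no arguments performs the deferred
    load and returns the original object, as OnDemandWrapper does."""
    def __init__(self, loader):
        self._loader = loader

    def __call__(self):
        return self._loader()

class LazyMapping:
    """Minimal mapping following the __getitem__ pattern above:
    unwrap on first access and cache the loaded value."""
    def __init__(self):
        self._mapping = {}

    def __setitem__(self, key, value):
        self._mapping[key] = value

    def __getitem__(self, key):
        value = self._mapping[key]
        if isinstance(value, LazyWrapper):
            value = value()             # deferred load happens here
            self._mapping[key] = value  # cache, so it only happens once
        return value
```

A wrapped value is only materialised on first access; subsequent lookups hit the cache rather than reloading.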
If a group/dataset takes up too much memory but on-demand loading is not set up, a workaround is to open the file via h5py, or to use the h5py_file or h5py_group attribute to access the underlying h5py.Group. Using this group you can then access a subset of the groups that would be loaded, and pass it to H5PreserveGroup to use your loaders.
Using Delayed Dumping¶
Delayed dumping is similar to on-demand loading, but it needs fewer changes to the containing class. Assigning an instance of DelayedContainer to the necessary location in the class is sufficient to prepare h5preserve for delayed dumping of the object. When the data is ready to be dumped, calling write_container() dumps the data to the file as if it had been dumped when the containing class was dumped. In a class where an attribute is to be dumped later, the following is sufficient:
class ContainerClass:
    def __init__(self, data=None):
        if data is None:
            data = DelayedContainer()
        self._data = data

    @property
    def data(self):
        return self._data

    @data.setter
    def data(self, data):
        if isinstance(self._data, DelayedContainer):
            self._data.write_container(data)
            self._data = data
        else:
            raise RuntimeError("Cannot change data")
Built-in Loaders, Dumpers and Registries¶
h5preserve comes with a number of predefined loader/dumper pairs for built-in Python types. The defaults for new_registry_list() automatically include these registries. If you do not wish to use the predefined registries, you should instead instantiate RegistryContainer manually.
Each of the supported built-in Python types is encoded as a dataset in the HDF5 file, and all of the associated registries are included by default.
Manually Creating the Registry Container¶
To create the registry container manually, replace all calls to new_registry_list() with RegistryContainer. This allows you to select which built-in registries (if any) you wish to use. For example, if you only want to convert None to h5py.Empty, you would do:
from h5preserve import Registry, RegistryContainer
from h5preserve.additional_registries import none_python_registry
registry = Registry("my cool registry")
registries = RegistryContainer(registry, none_python_registry)
You could then pass registries to h5preserve.open, lock a dumper to a specific version, or do anything else you’d do after calling new_registry_list().