ITK/HDF5: Difference between revisions

Revision as of 19:05, 19 April 2011

HDF5 file format and library

HDF5 is both a file format and a library dedicated to reading and writing files in that format.

According to Wikipedia, "HDF5 includes only two major types of object:

Datasets, which are multidimensional arrays of a homogeneous type
Groups, which are container structures which can hold datasets and other groups

This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file are even accessed using the POSIX-like syntax /path/to/resource. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups and attributes. In addition to these advances in the file format, HDF5 includes an improved type system, and dataspace objects which represent selections over dataset regions. The API is also object-oriented with respect to datasets, groups, attributes, types, dataspaces and property lists. Because it uses B-trees to index table objects, HDF5 works well for Time series data such as stock price series, network monitoring data, and 3D meteorological data. The bulk of the data goes into straightforward arrays (the table objects) that can be accessed much more quickly than the rows of a SQL database, but B-Tree access is available for non-array data. The HDF5 data storage mechanism can be simpler and faster than an SQL Star schema."

The library is available under a BSD-like license.

Use cases

ImageIO

(From Proposals:HDF5_ImageIO)

Chunking (streaming)
Multi-Resolution
Multi-Channel images
Large datasets (size > 4 GB)
Single experiment images of size 1024 x 1024 x 75 (XYZ), 2 channels, 1000 time-points
8-bit and 16-bit
Images stored as 2D PNGs with filenames giving location
Need to support optimized reading (image streaming) of a sub-volume
E.g.: box filtering using a kernel of size 5x5x1x1x3
Cyclic buffer optimization in the ITK reader that keeps overlapping data and only reads new data
Multi-resolution images for hierarchical registration of multiple experimental sets
Compression is not as important in the short term but will be needed in the long term
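
To make the sub-volume requirement concrete, here is a minimal sketch of the region a streaming reader would have to fetch for the box-filter example: the requested output region is padded by the kernel radius along each axis and clamped to the image bounds. The axis order (x, y, z, channel, time), the function name `padded_read_region`, and the way the 5x5x1x1x3 kernel is mapped onto those axes are assumptions made for illustration; none of them come from the proposal.

```python
def padded_read_region(out_index, out_size, kernel_size, image_size):
    """Return (index, size) of the input region needed to filter an output region.

    All arguments are per-axis sequences of the same length. Kernel sizes are
    assumed to be odd, so the radius along an axis is (k - 1) // 2.
    """
    if not (len(out_index) == len(out_size) == len(kernel_size) == len(image_size)):
        raise ValueError("all arguments must have the same dimension")
    index, size = [], []
    for start, extent, k, limit in zip(out_index, out_size, kernel_size, image_size):
        radius = (k - 1) // 2
        lo = max(start - radius, 0)                # clamp at the lower image bound
        hi = min(start + extent + radius, limit)   # clamp at the upper image bound
        index.append(lo)
        size.append(hi - lo)
    return tuple(index), tuple(size)


# Assumed axis order: x, y, z, channel, time.
IMAGE_SIZE = (1024, 1024, 75, 2, 1000)
KERNEL = (5, 5, 1, 1, 3)

# One full xy plane at z = 10, channel 0, time point 500.
region = padded_read_region((0, 0, 10, 0, 500), (1024, 1024, 1, 1, 1), KERNEL, IMAGE_SIZE)
```

With these assumptions the reader only needs three time points of a single plane, not the whole experiment, which holds 1024 x 1024 x 75 x 2 x 1000 = 157,286,400,000 samples. A cyclic buffer would go one step further and keep the part of this region that overlaps with the previous request.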

TransformIO

Protocol

Typing

With HDF5, everything is either a group or a dataset.

ITK must be able to save many different types -- how do we store the actual ITK type in the HDF5 file?

kent williams: This is handled in TransformIO by saving the ITK type name in the HDF5 file. This parallels the other Transform readers. There's a lookup mechanism in the itkTransformIOFactory to handle instantiation by class name.

(Attributes may be an option for that.)

kent williams: That is true, but once I figured out DataSets, that became my hammer for every nail. There may be some efficiency issues with using datasets where attributes would do, but in the case of ITK it shouldn't be an issue.

How do we store the template parameters -- do we even need to store them? Glehmann 16:06, 18 April 2011 (EDT)

kent williams: In general, no. We can recover the native scalar type of datasets, and for most things that's enough to decide what ITK object to instantiate, based on context. Where that isn't the case, we could store attributes to disambiguate what class of object should be created.
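
A rough sketch of that lookup, assuming the reader has already recovered the scalar kind and byte width of the dataset. The table `SCALAR_TYPES`, the function `choose_image_type`, and the attribute name `ITKType` are hypothetical, and the C++ spellings assume the usual widths (for example a 4-byte `int`). An explicit attribute, when present, wins over the deduced type.

```python
# Hypothetical mapping from (kind, size in bytes) to a C++ scalar type.
SCALAR_TYPES = {
    ("uint", 1): "unsigned char",
    ("uint", 2): "unsigned short",
    ("int", 2): "short",
    ("int", 4): "int",
    ("float", 4): "float",
    ("float", 8): "double",
}


def choose_image_type(kind, nbytes, dimension, attributes=None):
    """Pick the image type to instantiate for a dataset.

    `attributes` stands in for the attributes attached to the dataset; the
    'ITKType' key is an assumed name used only to disambiguate.
    """
    attributes = attributes or {}
    if "ITKType" in attributes:
        return attributes["ITKType"]
    try:
        scalar = SCALAR_TYPES[(kind, nbytes)]
    except KeyError:
        raise ValueError("no ITK scalar type for %s of %d bytes" % (kind, nbytes))
    return "itk::Image<%s, %d>" % (scalar, dimension)


deduced = choose_image_type("uint", 2, 3)
explicit = choose_image_type("uint", 2, 3, {"ITKType": "itk::LabelMap"})
```

Here `deduced` comes purely from the dataset's scalar type and rank, while `explicit` shows the attribute being used where context alone is not enough.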

Atomic objects

Atomic objects are unbreakable basic types. They are (generally?) stored as datasets in the HDF5 file.

Index

Kent Williams: This is stored as a 1D dataset of Dimension elements.

Size

Point

Matrix

Vector

Composite objects

Composite objects are stored as groups in the HDF5 file and are made of one or more atomic or composite objects. Each object is named in the same way it is named in the ITK classes, without the leading "m_".
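
The naming rule can be sketched as follows. The helper names `hdf5_member_name` and `member_path` are made up for this example; only the rule itself (drop the leading "m_") comes from the text above.

```python
import posixpath


def hdf5_member_name(cpp_member):
    """Strip a leading 'm_' so that 'm_Spacing' is stored as 'Spacing'."""
    prefix = "m_"
    if cpp_member.startswith(prefix):
        return cpp_member[len(prefix):]
    return cpp_member


def member_path(group_path, cpp_member):
    """Path of a member inside the group that stores its composite object."""
    return posixpath.join(group_path, hdf5_member_name(cpp_member))


origin_path = member_path("/ITK", "m_Origin")
```

Names that do not start with "m_" are passed through unchanged, so the function can be applied to names that have already been converted.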

Version

We may need something simpler to store the version as an attribute. Glehmann 16:06, 18 April 2011 (EDT)

ImageRegion

This is the storage of the class ImageRegion.

Member	Type
Index	Index
Size	Size

TODO: where do we store the Dimension of the ImageRegion?

TODO: Is it good enough to assume that it can be deduced from the dimension of the Index and of the Size?

TODO: What to do when the Index and the Size dimensions mismatch?
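
One possible answer to these TODOs, offered as a sketch and not as a decision: store no separate Dimension, deduce it from the length of Index and Size, and treat a mismatch as a read error. The function name `region_dimension` and the choice of raising an exception are assumptions.

```python
def region_dimension(index, size):
    """Deduce the dimension of an ImageRegion from its Index and Size.

    `index` and `size` stand for the two 1D datasets read from the group.
    A mismatch is reported as an error instead of being silently repaired.
    """
    if len(index) != len(size):
        raise ValueError(
            "Index has %d elements but Size has %d" % (len(index), len(size))
        )
    return len(index)


dimension = region_dimension((0, 0, 0), (1024, 1024, 75))
```

Failing early keeps a corrupted or hand-edited file from producing a region whose dimension depends on which member happened to be read first.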

Histogram

TODO

ImageBase

This is the storage of the class ImageBase.

Member	Type	Comment
Region	ImageRegion	This is the largest possible region, with the name shortened, because the different regions in itk::Image don't really make sense in file storage.
Spacing	Vector
Origin	Point
Direction	Matrix

Image

This is the storage of the class Image. Image inherits the members of ImageBase and adds its own members.

Member	Type	Comment
Pixels	TODO	which type should be used? a dataset directly? an atomic type?

This is not a strict requirement, but images should be saved in chunks to allow them to be efficiently streamed (both read and write) and compressed.

I think the chunk size should be one along every dimension except x and y. Which chunk size to choose along x and y is tricky, and may depend on the use case -- should we choose a size?
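
The rule above can be written down as a small sketch. The 256 x 256 default is an arbitrary placeholder, not a recommendation, and the helper names are invented here. Sizes are given in ITK order (x first); since HDF5 stores the last dimension fastest, the sketch assumes the shape is reversed before it is handed to HDF5.

```python
def chunk_shape(image_size, xy_chunk=(256, 256)):
    """Chunk shape in ITK order: xy_chunk along x and y, one elsewhere."""
    if len(image_size) < 2:
        raise ValueError("at least two dimensions are required")
    cx = min(xy_chunk[0], image_size[0])  # never larger than the image itself
    cy = min(xy_chunk[1], image_size[1])
    return (cx, cy) + (1,) * (len(image_size) - 2)


def chunk_count(image_size, chunks):
    """Number of chunks needed to cover the image."""
    count = 1
    for extent, chunk in zip(image_size, chunks):
        count *= -(-extent // chunk)  # ceiling division
    return count


def to_hdf5_order(shape):
    """Reverse ITK order (x, y, z, ...) into HDF5 order (..., z, y, x)."""
    return tuple(reversed(shape))


image_size = (1024, 1024, 75, 2, 1000)
chunks = chunk_shape(image_size)
n_chunks = chunk_count(image_size, chunks)
```

For the experiment size listed under the ImageIO use cases, this choice yields 2,400,000 chunks, which is one way to see why the x and y chunk size matters.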

LabelObjectLine

This is the storage of the class LabelObjectLine.

Member	Type	Comment
Index	Index
Length	unsigned integer	how do we describe this type?

LabelObject

This is the storage of the class LabelObject.

Member	Type	Comment
Label	integer	how do we describe this type?
Lines	TODO	which type should be used? a group directly? a composite type?

LabelMap

This is the storage of the class LabelMap. LabelMap inherits the members of ImageBase and adds its own members.

Member	Type	Comment
LabelObjects	TODO	which type should be used? a group directly? a composite type?

Base path

By default, the object of interest is stored in /ITK, so /ITK can be either an atomic object (HDF5 dataset) or a composite object (HDF5 group). Of course, it is possible to access the objects through another path or a longer one. Some classes in ITK may not provide a way to change the path of the object of interest (for example HDF5ImageIO).
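
A minimal sketch of how a reader might resolve the path of the object of interest. Only the default /ITK comes from the text above; the function name `resolve_object_path` and the idea of resolving relative paths against the base are assumptions.

```python
import posixpath

DEFAULT_BASE_PATH = "/ITK"


def resolve_object_path(path=None, base=DEFAULT_BASE_PATH):
    """Return the absolute HDF5 path of the object of interest.

    No path: use the base path. Absolute path: use it as given.
    Relative path: resolve it against the base (an assumption of this sketch).
    """
    if not path:
        return base
    if path.startswith("/"):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join(base, path))


default_path = resolve_object_path()
member = resolve_object_path("Region/Index")
elsewhere = resolve_object_path("/experiment1/image")
```

A class that cannot change the path, such as the HDF5ImageIO mentioned above, would simply always use the first case.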

Managing versions

How should this be done? The version certainly has to be stored somewhere. Should it be:

at the base of the file? in an /ITKVersion group for example?
in each object, as an attribute? This would make it easy to copy an object from one file to another. I like this method a lot. Glehmann 16:06, 18 April 2011 (EDT)
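
The argument for the second option can be illustrated with a toy model in which a file is a plain dictionary from paths to objects and every object carries its own attributes. This is not HDF5 code; the attribute name `ITKVersion`, the helper names, and the version value "1" are placeholders.

```python
import copy

VERSION_ATTRIBUTE = "ITKVersion"  # hypothetical attribute name


def make_object(members, version):
    """Toy stand-in for a stored object with a per-object version attribute."""
    return {"attrs": {VERSION_ATTRIBUTE: version}, "members": dict(members)}


def copy_object(src_file, src_path, dst_file, dst_path):
    """Copy one object, attributes included, from one toy file to another."""
    dst_file[dst_path] = copy.deepcopy(src_file[src_path])


def object_version(toy_file, path):
    return toy_file[path]["attrs"][VERSION_ATTRIBUTE]


old_file = {"/ITK": make_object({"Index": [0, 0, 0], "Size": [1024, 1024, 75]}, "1")}
new_file = {}
copy_object(old_file, "/ITK", new_file, "/ITK/Region")
copied_version = object_version(new_file, "/ITK/Region")
```

Because the version travels with the object, the copy stays readable even though the destination file has no file-level version entry. With a single /ITKVersion group the copy would need an extra step to carry that information along.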
