[Insight-developers] Testing Data

Bill Lorensen bill.lorensen at gmail.com
Mon Feb 7 13:25:25 EST 2011


I think the Kitware Midas approach is good for large data.

The current discussion is about small test data and regression baselines.

Bill

On Mon, Feb 7, 2011 at 2:07 PM, Daniel Blezek <Blezek.Daniel at mayo.edu> wrote:
> Hi Bill,
>
>  While I cast my vote to put test data into the repo directly (which we
> have not done with SimpleITK yet), I appreciate that ITK could use more
> varied and large datasets.  Test writers tend to focus on a subset of images
> that already exist.  Many of these are PNG images, which do not contain
> unusual spacing or orientations.
>
>  I have a lot of data which I could contribute to ITK, from many
> modalities, spacing and orientations, but there is not a good place to put
> 300 DICOM CT images (150M) or a 256^3 rotational angiography image.
>
>  Striking a balance between developer efficiency and large-scale testing
> data will be difficult.
>
> -dan
>
> On 2/7/11 11:50 AM, "Bill Lorensen" <bill.lorensen at gmail.com> wrote:
>
>> Gaëtan,
>>
>> Thanks for such a detailed analysis of the problems we face with Testing/Data.
>>
>> My preference is well known. I prefer that we make the Testing/Data
>> part of the main repo. In addition to the complexity that you
>> mentioned, your argument for including input data and baselines in the
>> gerrit patches is especially compelling.
>>
>> I hope other developers will speak up so that we can resolve this
>> issue. We have been discussing it since last summer.
>>
>> Bill
>>
>> 2011/2/7 Gaëtan Lehmann <gaetan.lehmann at jouy.inra.fr>:
>>>
>>> Hi,
>>>
>>> As asked by Terry, here are my thoughts on the testing data management.
>>> This issue has been discussed several times here, and some parts may not
>>> seem new; this is because they have been copy/pasted from some previous
>>> mails.
>>>
>>> i. git submodule is bad for this task
>>>
>>> The ITK development process has become more efficient in ITK v4, especially
>>> with the usage of git and gerrit, but also significantly more complicated.
>>> I'm afraid this complexity may prevent some new developers from joining the
>>> development effort.
>>> The Testing/Data submodule is the worst example to date.
>>>
>>> i.a. The workload for the developer is very significantly higher than what
>>> it was, or what it could be.
>>> Here are a few examples to highlight the differences with other technical
>>> solutions:
>>>
>>> * with cvs (ITK up to version 3.20), adding a new test with some test
>>> images was:
>>>
>>>   cvs add Testing/Code/...
>>>   cvs add Testing/Data/...
>>>   cvs ci
>>>
>>> * with git alone, it would be:
>>>
>>>   git add Testing/Data/...
>>>   git add Testing/Code/...
>>>   git commit
>>>   git push
>>>
>>> * with git submodule, it is:
>>>
>>>   cd Testing/Data
>>>   git add ...
>>>   git commit
>>>   git push
>>>   cd -
>>>   git add Testing/Data
>>>   git add Testing/Code/...
>>>   git commit
>>>   # copy/paste the range from the error message of the previous commit
>>>   git config "hooks.Testing/Data.update" 085e657..9dc1292
>>>   git commit
>>>   git push
>>>
>>> That is 11 commands instead of 3: about 3.7 times as many as in the cvs
>>> case, an increase of roughly 267%.
>>>
>>>
>>> i.b. git submodule is not contribution friendly
>>>
>>> Because write access is required to push to ITKData, the contributors
>>> who don't have this write access will find it very difficult to submit new
>>> tests to gerrit. The contributors can still publish their test images
>>> elsewhere (where?) but then
>>> * the review in gerrit becomes harder, because the reviewer has to get the
>>> testing data by hand,
>>> * the workload for the committer to ITK main repository is increased: he has
>>> to commit the images by himself in the ITKData repository and modify the
>>> submitted patch to point to the right version of the ITKData submodule.
>>> Also, if a patch is rejected in gerrit but the data have been already
>>> committed in ITKData, the useless data will stay forever in ITKData
>>> repository.
>>>
>>>
>>> i.c. git submodule is error prone
>>>
>>> As shown several times already in real life examples, it is very easy to get
>>> the wrong ITKData version when merging several patches which have modified
>>> the required ITKData submodule version.
>>> This should be fixed now, by using an extra git hook. This hook still adds
>>> some maintenance complexity though.
>>>
>>>
>>> i.d. git submodule makes the history harder to read
>>>
>>> Because the history of the main repository and the submodule are not tightly
>>> coupled, it is hard to know why a test image was added or which image was
>>> added or modified to fix or add a new test.
>>>
>>>
>>> So, to summarize, I understand that git submodule may have been tempting to
>>> manage ITK's testing data, but real life usage has shown that git submodule
>>> is not well suited for this task. I'll personally be glad when we'll move
>>> away from this solution.
>>>
>>>
>>>
>>> ii. Testing/Data is not that big
>>>
>>>  ITKData repository takes 74 MB.
>>>  ITK repository takes 154 MB.
>>>  ITK build directory takes 1.3 GB (8.4 GB if we don't take care to remove
>>> the temporary data after running the tests).
>>>  ITK build with wrapping takes 5.3 GB.
>>>
>>> and it could be smaller. The files, without the .git directory, use 36 MB,
>>> and could be reduced to 22 MB by removing the few files bigger than 512 KB.
>>> This is the result of 10 years of development. Continuing at that pace
>>> seems quite reasonable.
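
The 512 KB figure above is easy to check on a checkout. A minimal sketch, assuming a POSIX shell and a find that accepts the k size suffix (GNU and BSD find both do); the function name large_data_files is made up for illustration and is not an existing ITK tool:

```shell
# List files larger than 512 KB under a data directory, largest first.
# The first column is the disk usage in KB as reported by du.
# Anything inside a .git directory is skipped.
large_data_files() {
  find "${1:-Testing/Data}" -type f ! -path '*/.git/*' -size +512k \
    -exec du -k {} + | sort -rn
}
```

Run from the top of the source tree as `large_data_files`, or pass another directory as the first argument.
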
>>>
>>>
>>>
>>> iii. ITK needs to be able to store large data files
>>>
>>> Kitware's solution seems fine for this task, even if it seems to have
>>> several potential problems at this time:
>>>
>>> * Added complexity to manage the testing data (but this can be enhanced, see
>>> below)
>>> * No ability to commit offline as promised by the switch to git
>>> * Will cause problems when running the tests offline
>>> * Would incite the developers to submit bigger testing data than needed,
>>> which may, in the long term, lead to significant network traffic and
>>> storage usage, and probably to a longer testing time.
>>>
>>>
>>>
>>> iv. How to store the testing data
>>>
>>> iv.a. Using two solutions
>>>
>>> My preference still goes to committing the testing data with the tests in the
>>> main ITK repository. Having code, tests and testing data stored in the same
>>> place and in the same commit, as a transactional set, seems logical. What is
>>> the sense of a test without its data?
>>> A hook is already in place to limit the size of the files in this
>>> repository. While using two methods for this task may not seem optimal, this
>>> would
>>>
>>> * Keep the workload quite low for the developers.
>>> * Incite the developers to use small baselines.
>>> * Make the review of new or fixed tests and their data easier in gerrit,
>>> by allowing the submission to include their testing data.
>>> * Make it easy to run only a subset of the tests if an internet connection
>>> is available.
>>>
>>> The large data would be stored online using Kitware's solution.
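
For readers who have not seen a size-limiting hook, here is a rough sketch of the kind of check involved. It is not ITK's actual hook: the 512 KB default, the MAX_BYTES variable and the function name check_file_sizes are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical limit; 512 KB is borrowed from the figure quoted in section ii.
MAX_BYTES=${MAX_BYTES:-524288}

# Report every regular file argument larger than MAX_BYTES on stderr.
# Returns 1 if at least one file is over the limit, 0 otherwise.
check_file_sizes() {
  rc=0
  for f in "$@"; do
    [ -f "$f" ] || continue
    size=$(wc -c < "$f" | tr -d ' ')
    if [ "$size" -gt "$MAX_BYTES" ]; then
      echo "too large: $f ($size bytes, limit $MAX_BYTES)" >&2
      rc=1
    fi
  done
  return $rc
}
```

A client-side pre-commit hook could call this on the staged files and refuse the commit on a non-zero status. A server-side hook would have to inspect the pushed objects instead, which this sketch does not attempt.
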
>>>
>>>
>>> iv.b. Using Kitware's solution only
>>>
>>> A single solution means less to learn for the developer. Not all developers
>>> will need to upload large testing data, though.
>>> The good point: after git submodules, it is very likely that this solution
>>> would be more convenient than the current one!
>>> See also iii. for more details.
>>>
>>>
>>>
>>> v. Enhancing the developer experience
>>>
>>> If it is decided to use Kitware's solution, I would like to see these
>>> goals reached:
>>>
>>> * Don't require any new user registration on a new website
>>>
>>> Gerrit already requires registration to be able to submit a change. This
>>> account should be enough.
>>>
>>> * Keep all data management in git subcommands and aliases.
>>>
>>> We have already added several aliases to make gerrit usage easier, and it
>>> works very well.
>>> The same should be done for the data management to keep all the development
>>> management in git. This is closely related to git anyway, because the .md5
>>> files will have to be committed in git.
>>>
>>> * Use very few command lines - ideally, not many more than what a developer
>>> would have to use with a git-only solution.
>>>
>>> For example, it can be:
>>>
>>>   git adddata Testing/Data/...
>>>   git add Testing/Code/...
>>>   git commit
>>>   git push    # or gerrit-push
>>>
>>> The first command, git adddata, would
>>>  - convert the files into md5 hashes,
>>>  - git add the .md5 files produced by the previous step,
>>>  - and upload the files on a remote host.
>>>
>>> Uploading should be possible even for ordinary contributors, like it is
>>> now for gerrit, not only for the ITK developers with write access to the
>>> main repository.
>>> On the user side, the extra steps which may be required for the data
>>> management (for example, moving the data from a temporary location to the
>>> final one) should be transparent and not require a user action.
>>>
>>> * Retaining the ability to test ITK and commit offline would be nice. This
>>> would require
>>>  - a tool to get all the needed testing data at once without having to build
>>> anything
>>>  - the ability to put the testing data in a cache if it cannot be uploaded
>>> immediately, and trigger the upload once connected.
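
The "get all the needed testing data at once" tool could be sketched as follows, under the same assumptions as the proposal above: each data file is represented in the source tree by a .md5 file holding its digest. The names fetchdata, DATA_STORE and DATA_CACHE are hypothetical, and the remote host is again simulated by a local directory.

```shell
#!/bin/sh
# Stand-ins (assumptions): DATA_STORE plays the remote host, DATA_CACHE the local cache.
DATA_STORE=${DATA_STORE:-/path/to/remote-store}
DATA_CACHE=${DATA_CACHE:-"$HOME/.itk-data-cache"}

# Materialize every file referenced by a *.md5 file under the given directory,
# without building anything. Returns 1 if some data could not be found.
fetchdata() {
  missing=0
  mkdir -p "$DATA_CACHE" || return 1
  todo=$(mktemp) || return 1
  find "${1:-Testing/Data}" -type f -name '*.md5' > "$todo"
  while IFS= read -r ref; do
    digest=$(cat "$ref")
    target=${ref%.md5}
    # "Download" into the cache only when the store is reachable and the
    # digest is not cached yet.
    if [ ! -f "$DATA_CACHE/$digest" ] && [ -f "$DATA_STORE/$digest" ]; then
      cp -- "$DATA_STORE/$digest" "$DATA_CACHE/$digest"
    fi
    if [ -f "$DATA_CACHE/$digest" ]; then
      cp -- "$DATA_CACHE/$digest" "$target"
    else
      echo "missing data for $ref" >&2
      missing=1
    fi
  done < "$todo"
  rm -f "$todo"
  return $missing
}
```

Once the cache is populated, a later run succeeds without the store being reachable, which covers the offline testing case. The deferred upload for offline commits would need a similar queue in the other direction and is not shown here.
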
>>>
>>> * Incite the developers to reuse the existing testing data when possible
>>> instead of uploading a new large data set. Not sure how to do that; any
>>> idea is welcome.
>>>
>>> Then the points listed in iii. would be mostly gone.
>>>
>>>
>>> Regards,
>>>
>>> Gaëtan
>>>
>>>
>>>
>>> PS: I've noted during my trip to the namic week and the itk v4 meeting that
>>> I'm still far from getting the subtleties of the English language (I still
>>> don't understand how "simple" may upset anyone in the name SimpleITK, for
>>> example). If you feel offended by anything in that mail, please don't be,
>>> there is no such intention on my side.
>>>
>>>
>>> --
>>> Gaëtan Lehmann
>>> Biologie du Développement et de la Reproduction
>>> INRA de Jouy-en-Josas (France)
>>> tel: +33 1 34 65 29 66    fax: 01 34 65 29 09
>>> http://voxel.jouy.inra.fr  http://www.itk.org
>>> http://www.mandriva.org  http://www.bepo.fr
>>>
>>>
>>> _______________________________________________
>>> Powered by www.kitware.com
>>>
>>> Visit other Kitware open-source projects at
>>> http://www.kitware.com/opensource/opensource.html
>>>
>>> Kitware offers ITK Training Courses, for more information visit:
>>> http://kitware.com/products/protraining.html
>>>
>>> Please keep messages on-topic and check the ITK FAQ at:
>>> http://www.itk.org/Wiki/ITK_FAQ
>>>
>>> Follow this link to subscribe/unsubscribe:
>>> http://www.itk.org/mailman/listinfo/insight-developers
>>>
>>>
>
> --
> Daniel Blezek, PhD
> Medical Imaging Informatics Innovation Center
>
> P 127 or (77) 8 8886
> T 507 538 8886
> E blezek.daniel at mayo.edu
>
> Mayo Clinic
> 200 First St. S.W.
> Harwick SL-44
> Rochester, MN 55905
> mayoclinic.org
> "It is more complicated than you think." -- RFC 1925
>
>

