[Insight-developers] Testing Data

Johnson, Hans J hans-johnson at uiowa.edu
Mon Feb 7 14:01:30 EST 2011


I agree completely that we need to make the management of testing data
much easier than it currently is.  The current situation is far too
difficult and error prone.

I do, however, think it is a very bad idea to store testing data in git,
especially baseline images.  Cloning the git repository means that every
baseline image that has ever been used is downloaded for every source
code build of ITK, and it is not that uncommon to need to update
baseline images.

For example, I have a proposal to change the testing data so that it has
non-identity direction and spacing so that it actually exercises the code
for the expected behavior of real data.  I estimate that this change will
double the size of a git checkout.

I'd also like to remind developers that we have a large user community
(BRAINS/Slicer) that checks out the full git repository of ITK each night
on many different tested platforms.  The download of ITK from git is
already quite a time sink; if the data got much larger, it would become
even more cumbersome.

Hans



On 2/7/11 12:29 PM, "Matthew McCormick (thewtex)" <matt at mmmccormick.com>
wrote:

>I think Gaëtan makes many good observations.  A solution should keep
>the advantages of git:
>
>- It is easy to fork and contribute.
>- Local caching and network independence.
>- Complete backups everywhere.
>
>If the options were as they were a few months ago, I too would be in
>favor of merging the smaller Testing/Data into mainline.  But I think
>that is no longer necessary.  I see that Brad King has been working on
>an improved solution.
>http://itk.org/gitweb?p=ITK.git;a=commit;h=b27d76a437a48aea1dbe1d7e4fafda81c4186c7d
>
>This seems to solve the network dependence, makes it easy to point the
>data source to a different location, including local sources, and
>simplifies things.
>
>A remaining issue seems to be pushing proposals to Gerrit that
>reviewers can easily check out.  Maybe there could be a data server
>that could be pushed to using Gerrit authentication for proposals.
>Scripts could be made to push Gerrit proposals to the main Midas
>repository when ready to merge.  A reviewer would set his data source
>CMake configuration to the Gerrit server?
>
>Matt
>
>On Mon, Feb 7, 2011 at 11:50 AM, Bill Lorensen <bill.lorensen at gmail.com>
>wrote:
>> Gaëtan,
>>
>> Thanks for such a detailed analysis of the problems we face with
>> Testing/Data.
>>
>> My preference is well known. I prefer that we make the Testing/Data
>> part of the main repo. In addition to the complexity that you
>> mentioned, your argument for including input data and baselines in the
>> gerrit patches is especially compelling.
>>
>> I hope other developers will speak up so that we can resolve this
>> issue. We have been discussing it since last summer.
>>
>> Bill
>>
>> 2011/2/7 Gaëtan Lehmann <gaetan.lehmann at jouy.inra.fr>:
>>>
>>> Hi,
>>>
>>> As asked by Terry, here are my thoughts on the testing data management.
>>> This issue has been discussed several times here, and some parts may not
>>> seem new; this is because they have been copy/pasted from some previous
>>> mails.
>>>
>>> i. git submodule is bad for this task
>>>
>>> The ITK development process has become more efficient in ITK v4,
>>> especially with the usage of git and gerrit, but also significantly more
>>> complicated. I'm afraid this complexity may prevent some new developers
>>> from joining the development effort.
>>> The Testing/Data submodule is the worst example to date.
>>>
>>> i.a. The workload for the developer is very significantly higher than
>>> what it was, or what it could be.
>>> Here are a few examples to highlight the differences with other
>>> technical solutions:
>>>
>>> * with cvs (ITK up to version 3.20), adding a new test with some test
>>> images was:
>>>
>>>   cvs add Testing/Code/...
>>>   cvs add Testing/Data/...
>>>   cvs ci
>>>
>>> * with git alone, it would be:
>>>
>>>   git add Testing/Data/...
>>>   git add Testing/Code/...
>>>   git commit
>>>   git push
>>>
>>> * with git submodule, it is:
>>>
>>>   cd Testing/Data
>>>   git add ...
>>>   git commit
>>>   git push
>>>   cd -
>>>   git add Testing/Data
>>>   git add Testing/Code/...
>>>   git commit
>>>   git config "hooks.Testing/Data.update" 085e657..9dc1292 # copy/paste from the error message of the previous commit
>>>   git commit
>>>   git push
>>>
>>> That is 11 command lines against 3 in the cvs case, about 3.7 times as
>>> many (a 267% increase).
>>>
>>>
>>> i.b. git submodule is not contribution friendly
>>>
>>> Because write access is required to push to ITKData, contributors who
>>> don't have this write access will find it very difficult to submit new
>>> tests to gerrit. The contributors can still publish their test images
>>> elsewhere (where?) but then
>>> * the review in gerrit becomes harder, because the reviewer has to get
>>> the testing data by hand,
>>> * the workload for the committer to the ITK main repository is
>>> increased: he has to commit the images himself in the ITKData repository
>>> and modify the submitted patch to point to the right version of the
>>> ITKData submodule.
>>> Also, if a patch is rejected in gerrit but the data have already been
>>> committed in ITKData, the useless data will stay forever in the ITKData
>>> repository.
>>>
>>>
>>> i.c. git submodule is error prone
>>>
>>> As shown several times already in real life examples, it is very easy
>>> to get the wrong ITKData version when merging several patches which have
>>> modified the required ITKData submodule version.
>>> This should be fixed now, by using an extra git hook. This hook still
>>> adds some maintenance complexity though.
>>>
>>>
>>> i.d. git submodule makes the history harder to read
>>>
>>> Because the history of the main repository and the submodule are not
>>> tightly coupled, it is hard to know why a test image was added, or which
>>> image was added or modified to fix a test or add a new one.
>>>
>>>
>>> So, to summarize, I understand that git submodule may have been
>>> tempting to manage ITK's testing data, but real life usage has shown
>>> that git submodule is not well suited for this task. I'll personally be
>>> glad when we move away from this solution.
>>>
>>>
>>>
>>> ii. Testing/Data is not that big
>>>
>>>  ITKData repository takes 74 MB.
>>>  ITK repository takes 154 MB.
>>>  ITK build directory takes 1.3 GB, or 8.4 GB if we don't take care to
>>> remove the temporary data after running the tests.
>>>  ITK build with wrapping takes 5.3 GB.
>>>
>>> and it could be smaller. The files, without the .git directory, use 36
>>> MB, and could be reduced to 22 MB by removing the few files bigger than
>>> 512 KB.
>>> This is the result of 10 years of development. Continuing at that pace
>>> seems quite reasonable.
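
As an illustration of how figures like the ones quoted above could be
re-checked on a checkout, here is a minimal Python sketch. It sums the
size of a working tree without its .git directory and lists the files
above a threshold. The function name tree_size and its 512 KB default are
assumptions made for this example; nothing like it is known to exist in
the ITK tooling.

```python
import os


def tree_size(root, limit=512 * 1024):
    """Return (total_bytes, bytes_in_big_files, big_files) for a tree.

    The .git directory is skipped, and a file counts as "big" when it is
    strictly larger than `limit` bytes. Hypothetical helper, not ITK code.
    """
    total = 0
    big_total = 0
    big_files = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # ignore symlinks and special files
            size = os.path.getsize(path)
            total += size
            if size > limit:
                big_total += size
                big_files.append((path, size))
    return total, big_total, big_files
```

Calling tree_size("Testing/Data") would give the total size, and
subtracting the second value from the first gives the size that would
remain after removing the big files.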
>>>
>>>
>>>
>>> iii. ITK needs to be able to store large data files
>>>
>>> Kitware's solution seems fine for this task, even if it seems to have
>>> several potential problems at this time:
>>>
>>> * Added complexity to manage the testing data (but this can be
>>> improved, see below)
>>> * No ability to commit offline, as promised by the switch to git
>>> * Problems running the tests offline
>>> * It would encourage developers to submit bigger testing data than
>>> needed, which may, in the long term, lead to significant network traffic
>>> and storage usage, and probably to a longer testing time.
>>>
>>>
>>>
>>> iv. How to store the testing data
>>>
>>> iv.a. Using two solutions
>>>
>>> My preference is still to commit the testing data with the tests in
>>> the main ITK repository. Having code, tests and testing data stored in
>>> the same place and in the same commit, as a transactional set, seems
>>> logical. What is the sense of a test without its data?
>>> A hook is already in place to limit the size of the files in this
>>> repository. While using two methods for this task may not seem optimal,
>>> this would
>>>
>>> * Keep the workload quite low for the developers.
>>> * Encourage the developers to use small baselines.
>>> * Make it easier to review new or fixed tests and their data in gerrit,
>>> by allowing the submission to include the testing data.
>>> * Make it easy to choose to run only a subset of the tests if an
>>> internet connection is available.
>>>
>>> The large data would be stored online using Kitware's solution.
>>>
>>>
>>> iv.b. Using Kitware's solution only
>>>
>>> A single solution means less to learn for the developer. Not all
>>> developers will have to upload large testing data, though.
>>> The good point: after git submodules, it is very likely that this
>>> solution would be more convenient than the current one!
>>> See also iii. for more details.
>>>
>>>
>>>
>>> v. Enhancing the developer experience
>>>
>>> If it is decided to use Kitware's solution, I would like to see these
>>> goals reached:
>>>
>>> * Don't require any new user registration on a new website
>>>
>>> Gerrit already requires registration to be able to submit a change.
>>> This account should be enough.
>>>
>>> * Keep all data management in git subcommands and aliases.
>>>
>>> We have already added several aliases to make gerrit usage easier, and
>>> it works very well.
>>> The same should be done for the data management to keep all the
>>> development management in git. This is closely related to git anyway,
>>> because the .md5 files will have to be committed in git.
>>>
>>> * Use very few command lines: ideally, not many more than what a
>>> developer would have to use with a git-only solution.
>>>
>>> For example, it can be:
>>>
>>>   git adddata Testing/Data/...
>>>   git add Testing/Code/...
>>>   git commit
>>>   git push    # or gerrit-push
>>>
>>> The first command, git adddata, would
>>>  - convert the files into md5 hashes,
>>>  - git add the .md5 files produced by the previous step,
>>>  - and upload the files to a remote host.
>>>
>>> Uploading should be possible even for ordinary contributors, as it is
>>> now for gerrit, not only for the ITK developers with write access to
>>> the main repository.
>>> On the user side, the extra steps which may be required for the data
>>> management (for example, moving the data from a temporary location to
>>> the final one) should be transparent and should not require any user
>>> action.
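
The three steps listed for the proposed git adddata command could be
sketched roughly as follows in Python. This is only an illustration under
stated assumptions: git adddata does not exist, the names md5_of, add_data
and outbox are invented for this example, and the upload to a remote host
is replaced by a copy into a local "outbox" directory keyed by hash, which
a real tool would push to the server later.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_data(data_file, outbox):
    """Create "<data_file>.md5" and stage the data for a later upload.

    Hypothetical helper: `outbox` is a local directory standing in for
    the remote host. Returns (path of the .md5 file, path of staged copy).
    """
    data_file = Path(data_file)
    outbox = Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)

    file_hash = md5_of(data_file)

    # Step 1: write the hash next to the data file, as "<name>.md5".
    link = data_file.with_name(data_file.name + ".md5")
    link.write_text(file_hash + "\n")

    # Step 2 is left to the caller: run `git add` on the returned link.

    # Step 3 (stand-in for the upload): keep one copy per distinct hash.
    staged = outbox / file_hash
    if not staged.exists():
        shutil.copy2(data_file, staged)
    return link, staged
```

Keying the staged copy by its hash means that two identical images are
stored only once, and the outbox can be emptied whenever a connection to
the data server is available.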
>>>
>>> * Retaining the ability to test ITK and commit offline would be nice.
>>> This would require
>>>  - a tool to get all the needed testing data at once without having to
>>> build anything
>>>  - the ability to put the testing data in a cache if it cannot be
>>> uploaded immediately, and trigger the upload once connected.
>>>
>>> * Encourage the developers to reuse the existing testing data when
>>> possible instead of uploading a new large data set. Not sure how to do
>>> that; any ideas are welcome.
>>>
>>> Then the points listed in iii. would be mostly gone.
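
For the offline goal quoted above (getting all the needed testing data at
once without building anything), a minimal Python sketch could scan the
source tree for .md5 files and fill a local cache. It assumes that each
.md5 file contains only the hex hash and that cached data files are named
after their hash. The names missing_data and prefetch, and the fetch
callable that returns the bytes for a given hash, are hypothetical.

```python
import hashlib
from pathlib import Path


def missing_data(source_tree, cache_dir):
    """Map each hash that is absent from cache_dir to its .md5 files."""
    cache_dir = Path(cache_dir)
    missing = {}
    for link in sorted(Path(source_tree).rglob("*.md5")):
        file_hash = link.read_text().strip()
        if not (cache_dir / file_hash).is_file():
            missing.setdefault(file_hash, []).append(link)
    return missing


def prefetch(source_tree, cache_dir, fetch):
    """Download every missing file into cache_dir, verifying its MD5.

    `fetch` is any callable taking a hash and returning bytes; how it
    reaches the data server is outside the scope of this sketch.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for file_hash in missing_data(source_tree, cache_dir):
        payload = fetch(file_hash)
        if hashlib.md5(payload).hexdigest() != file_hash:
            raise ValueError("hash mismatch for " + file_hash)
        (cache_dir / file_hash).write_bytes(payload)
        fetched.append(file_hash)
    return fetched
```

Once prefetch has run while connected, missing_data returns an empty
dictionary and the tests can resolve all their data from the cache
without network access.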
>>>
>>>
>>> Regards,
>>>
>>> Gaëtan
>>>
>>>
>>>
>>> PS: I've noted during my trip to the namic week and the itk v4 meeting
>>> that I'm still far from getting the subtleties of the English language
>>> (I still don't understand how "simple" may upset anyone in the name
>>> SimpleITK, for example). If you feel offended by anything in that mail,
>>> please don't be, there is no such intention on my side.
>>>
>>>
>>> --
>>> Gaëtan Lehmann
>>> Biologie du Développement et de la Reproduction
>>> INRA de Jouy-en-Josas (France)
>>> tel: +33 1 34 65 29 66    fax: 01 34 65 29 09
>>> http://voxel.jouy.inra.fr  http://www.itk.org
>>> http://www.mandriva.org  http://www.bepo.fr
>>>
>>>
>>> _______________________________________________
>>> Powered by www.kitware.com
>>>
>>> Visit other Kitware open-source projects at
>>> http://www.kitware.com/opensource/opensource.html
>>>
>>> Kitware offers ITK Training Courses, for more information visit:
>>> http://kitware.com/products/protraining.html
>>>
>>> Please keep messages on-topic and check the ITK FAQ at:
>>> http://www.itk.org/Wiki/ITK_FAQ
>>>
>>> Follow this link to subscribe/unsubscribe:
>>> http://www.itk.org/mailman/listinfo/insight-developers
>>>
>>>



