[Insight-developers] Testing Data

Bill Lorensen bill.lorensen at gmail.com
Mon Feb 7 12:50:58 EST 2011


Gaëtan,

Thanks for such a detailed analysis of the problems we face with Testing/Data.

My preference is well known. I prefer that we make Testing/Data
part of the main repo. In addition to the complexity that you
mentioned, your argument for including input data and baselines in the
gerrit patches is especially compelling.

I hope other developers will speak up so that we can resolve this
issue. We have been discussing it since last summer.

Bill

2011/2/7 Gaëtan Lehmann <gaetan.lehmann at jouy.inra.fr>:
>
> Hi,
>
> As asked by Terry, here are my thoughts on the testing data management.
> This issue has been discussed several times here, and some parts may not
> seem new because they have been copied from previous mails.
>
> i. git submodule is bad for this task
>
> The ITK development process has become more efficient in ITK v4, especially
> with the usage of git and gerrit, but also significantly more complicated.
> I'm afraid this complexity may prevent some new developers from joining the
> development effort.
> The Testing/Data submodule is the worst example to date.
>
> i.a. The workload for the developer is very significantly higher than what
> it was, or what it could be.
> Here are a few examples to highlight the differences with other technical
> solutions:
>
> * with cvs (ITK up to version 3.20), adding a new test with some test
> images was:
>
>   cvs add Testing/Code/...
>   cvs add Testing/Data/...
>   cvs ci
>
> * with git alone, it would be:
>
>   git add Testing/Data/...
>   git add Testing/Code/...
>   git commit
>   git push
>
> * with git submodule, it is:
>
>   cd Testing/Data
>   git add ...
>   git commit
>   git push
>   cd -
>   git add Testing/Data
>   git add Testing/Code/...
>   git commit
>   # copy/paste from the error message of the previous commit:
>   git config "hooks.Testing/Data.update" 085e657..9dc1292
>   git commit
>   git push
>
> That is 11 commands instead of 3, about 3.7 times as many as in the cvs
> case (a 267% increase).
>
>
> i.b. git submodule is not contribution friendly
>
> Because write access is required to push to ITKData, contributors who
> don't have this write access will find it very difficult to submit new
> tests to gerrit. The contributors can still publish their test images
> elsewhere (where?) but then
> * the review in gerrit becomes harder, because the reviewer has to get the
> testing data by hand,
> * the workload for the committer to the ITK main repository is increased:
> the committer has to commit the images to the ITKData repository by hand
> and modify the submitted patch to point to the right version of the
> ITKData submodule.
> Also, if a patch is rejected in gerrit but the data have already been
> committed to ITKData, the useless data will stay forever in the ITKData
> repository.
>
>
> i.c. git submodule is error prone
>
> As shown several times already in real life examples, it is very easy to
> get the wrong ITKData version when merging several patches which have
> modified the required ITKData submodule version.
> This should be fixed now, by using an extra git hook. This hook still adds
> some maintenance complexity though.
>
>
> i.d. git submodule makes the history harder to read
>
> Because the histories of the main repository and the submodule are not
> tightly coupled, it is hard to know why a test image was added, or which
> image was added or modified to fix a test or add a new one.
>
>
> So, to summarize, I understand that git submodule may have been tempting to
> manage ITK's testing data, but real life usage has shown that git submodule
> is not well suited for this task. I'll personally be glad when we move
> away from this solution.
>
>
>
> ii. Testing/Data is not that big
>
>  ITKData repository takes 74 MB.
>  ITK repository takes 154 MB.
>  ITK build directory takes 1.3 GB, or 8.4 GB if we don't take care to
> remove the temporary data after running the tests.
>  ITK build with wrapping takes 5.3 GB.
>
> and it could be smaller. The files, without the .git directory, use 36 MB,
> and could be reduced to 22 MB by removing the few files bigger than 512 KB.
> This is the result of 10 years of development. Continuing at that pace
> seems quite reasonable.
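
A minimal sketch, in Python, of how the figures above could be re-measured on a checkout: it walks a data tree, skips the .git directory, and reports the total size alongside the size that would remain if the files bigger than 512 KB were removed. The names size_summary and LIMIT are illustrative assumptions, not part of any ITK tooling.

```python
import os
from pathlib import Path

LIMIT = 512 * 1024  # the 512 KB threshold mentioned above, in bytes


def size_summary(root: Path, limit: int = LIMIT) -> tuple[int, int]:
    """Return (total bytes, bytes in files no bigger than limit), skipping .git."""
    total = 0
    small = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            # lstat does not follow symlinks, so a broken link cannot raise here.
            size = (Path(dirpath) / name).lstat().st_size
            total += size
            if size <= limit:
                small += size
    return total, small
```

For example, size_summary(Path("Testing/Data")) would return the total size of the data files and the size left after dropping the large ones, both in bytes.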
>
>
>
> iii. ITK needs to be able to store large data files
>
> Kitware's solution seems fine for this task, even if it seems to have
> several potential problems at this time:
>
> * Added complexity to manage the testing data (but this can be improved,
> see below)
> * No ability to commit offline, as promised by the switch to git
> * Will make it difficult to run the tests offline
> * Would encourage the developers to submit bigger testing data than needed,
> which may, in the long term, lead to significant network traffic and
> storage usage, and probably to a longer testing time.
>
>
>
> iv. How to store the testing data
>
> iv.a. Using two solutions
>
> My preference is still to commit the testing data with the tests in the
> main ITK repository. Having code, tests and testing data stored in the same
> place and in the same commit, as a transactional set, seems logical. What is
> the point of a test without its data?
> A hook is already in place to limit the size of the files in this
> repository. While using two methods for this task may not seem optimal, this
> would
>
> * Keep the workload quite low for the developers.
> * Encourage the developers to use small baselines.
> * Make it easier to review new or fixed tests and their data in gerrit,
> by allowing the submission to include the testing data.
> * Make it easy to run only a subset of the tests, depending on whether an
> internet connection is available.
>
> The large data would be stored online using Kitware's solution.
>
>
> iv.b. Using Kitware's solution only
>
> A single solution means less to learn for the developer. Not all
> developers will need to upload large testing data though.
> The good point: after git submodules, it is very likely that this solution
> would be more convenient than the current one!
> See also iii. for more details.
>
>
>
> v. Enhancing the developer experience
>
> If it is decided to use Kitware's solution, I would like to see these
> goals reached:
>
> * Don't require any new user registration on a new website
>
> Gerrit already requires registration to be able to submit a change. This
> account should be enough.
>
> * Keep all data management in git subcommands and aliases.
>
> We have already added several aliases to make gerrit usage easier, and it
> works very well.
> The same should be done for the data management to keep all the development
> management in git. This is closely related to git anyway, because the .md5
> files will have to be committed in git.
>
> * Use very few commands: ideally, not many more than a developer would
> have to use with a git-only solution.
>
> For example, it can be:
>
>   git adddata Testing/Data/...
>   git add Testing/Code/...
>   git commit
>   git push    # or gerrit-push
>
> The first command, git adddata, would
>  - compute the md5 hash of each file,
>  - git add the .md5 files produced by the previous step,
>  - and upload the files to a remote host.
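
The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name add_data is hypothetical, the upload is replaced by a copy into a local cache directory (no actual upload mechanism is specified in this thread), and the git add step is left as a comment rather than executed.

```python
import hashlib
import shutil
from pathlib import Path


def add_data(data_file: Path, cache_dir: Path) -> Path:
    """Write <file>.md5 next to data_file, cache the content, return the .md5 path."""
    # Step 1: compute the md5 hash of the file content.
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    md5_file = data_file.with_name(data_file.name + ".md5")
    md5_file.write_text(digest + "\n")

    # Step 2 (not executed here): the .md5 file would be staged with "git add".

    # Step 3 (stand-in for the upload): keep the content under its hash in a
    # local cache, from which it could be uploaded once a connection exists.
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cache_dir / digest)
    return md5_file
```

Storing the content under its hash means two identical files share one cache entry, and a deferred upload only needs the cache directory, not the working tree.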
>
> Uploading should be possible even for ordinary contributors, as it is
> now for gerrit, not only for the ITK developers with write access to the
> main repository.
> On the user side, any extra steps required for the data management (for
> example, moving the data from a temporary location to the final one)
> should be transparent and require no user action.
>
> * Retaining the ability to test ITK and commit offline would be nice. This
> would require
>  - a tool to get all the needed testing data at once without having to build
> anything
>  - the ability to put the testing data in a cache if it cannot be uploaded
> immediately, and trigger the upload once connected.
>
> * Encourage the developers to reuse the existing testing data when possible
> instead of uploading a new large data set. Not sure how to do that; any
> ideas are welcome.
>
> Then the points listed in iii. would be mostly gone.
>
>
> Regards,
>
> Gaëtan
>
>
>
> PS: I've noted during my trip to the namic week and the itk v4 meeting that
> I'm still far from grasping the subtleties of the English language (I still
> don't understand how "simple" may upset anyone in the name SimpleITK, for
> example). If you feel offended by anything in this mail, please don't be;
> there is no such intention on my side.
>
>
> --
> Gaëtan Lehmann
> Biologie du Développement et de la Reproduction
> INRA de Jouy-en-Josas (France)
> tel: +33 1 34 65 29 66    fax: 01 34 65 29 09
> http://voxel.jouy.inra.fr  http://www.itk.org
> http://www.mandriva.org  http://www.bepo.fr