[Insight-developers] New workflow to add test data

Mon Jun 20 14:37:43 EDT 2011

On 06/20/2011 01:31 PM, Cory Quammen wrote:
> My biggest trouble was thinking I had all the scripts for doing this
> up to date.
[snip]
> shouldn't be a problem as people base branches on more recent commits
> to master.

Okay, thanks.

> The content link and real data object aren't interchangeable if you
> want to edit the real data object after generating the content link.
> Editing the data object may be a relatively rare thing to do compared
> to totally regenerating it and copying it into the tree, but I think
> it will happen.

The approach centers around copying test baselines and inputs from
outside into the source tree.  It is not intended for incremental
in-place editing of a data file as if it were a source file.  If a
test output is regenerated, one needs to copy it back into the source
tree anyway.

What kind of files are you editing?  If a file is editable text, one
might argue that it is a source file and should not be treated like
data with this approach in the first place.

> If the process continues to remove the data object files, then it
> would be good to add a warning to the wiki page stating that the real
> data object will disappear after running CMake, so keeping a copy of
> the data object outside the source tree is advisable.

I added a note in the new instructions page in the extra-info column
on the right of that step:

  http://www.itk.org/Wiki/ITK/Git/Develop/Data#Run_CMake

It links to the detailed discussion at the bottom of the page, which
says where the data go.

>>> It would be nice if the test data could be left in place
>>
>> I originally had this goal in mind.  However, I later realized that it
>> is both hard to do and conceptually incorrect:

One more point:

- The real data object must go into a .ExternalData_${algo}_${hash} file.
  If the original file were to stay around then we would need to make a
  copy.  Since the approach is meant to work equally well with large data
  files we should keep as few copies as possible.  Right now we make no
  copies and just move/rename the original file a couple of times.
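
To make the move/rename point concrete, here is a minimal sketch in
Python, not the actual CMake implementation.  The function name
make_content_link is hypothetical.  Only the
.ExternalData_${algo}_${hash} naming and the ".md5" link suffix come
from this thread; keeping the renamed object next to the link and
storing just the hash in the link file are assumptions.

```python
import hashlib
import os


def make_content_link(path, algo="md5"):
    """Replace a data file with a content link, moving (not copying) the data."""
    digest = hashlib.new(algo)
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so large data files are never fully in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    value = digest.hexdigest()

    directory, name = os.path.split(path)
    # Assumed layout: the real object is renamed in place, next to the link.
    obj = os.path.join(directory, ".ExternalData_%s_%s" % (algo.upper(), value))
    os.replace(path, obj)  # a rename, so no second copy of the data exists

    # Assumed link format: a small text file holding only the hash.
    link = path + "." + algo
    with open(link, "w") as f:
        f.write(value + "\n")
    return link, obj
```

After the call the original path no longer exists, which is why a copy
kept outside the source tree is advisable.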

>>  so this requires a separate "git add" for every content link rather
>>  than "git add ." in the directory.
> 
> That is acceptable to me.

The DATA{} syntax supports image series, as documented in the
ExternalData module, so some developers may add a large number of files
at once.  I think it is too likely that a real file would leak into a
commit in this case.

> Not to get too off topic, but it seems like there may also be a
> problem if a developer moves source code files via "mv" into the tree.
> Is that the case?

Yes, but data files are much more likely to be moved, especially in the
case of test baselines that are originally written to the build tree.

> I was thinking that you could prevent the commit of a data file if
> there were a content link corresponding to the data file name (e.g., I
> accidentally commit "test.png" but its content link "test.png.md5" is
> also in the commit). If there is no .md5 file for a given image file,
> then it should be up to Gerrit reviews to determine whether that file
> gets in, as you say.

Great idea, thanks!  I'll look at adding that because it is useful
independent of the above discussion.
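
One possible shape for that check, sketched in Python under the
assumption that the hook is handed the list of paths in the commit.
The function name leaked_data_files and the link_suffixes parameter are
hypothetical and not part of any existing hook.

```python
def leaked_data_files(staged_paths, link_suffixes=(".md5",)):
    """Return staged files whose content link is staged alongside them."""
    staged = set(staged_paths)
    # A path is suspicious when "<path><suffix>" is also in the commit,
    # e.g. "test.png" together with "test.png.md5".
    return sorted(
        path
        for path in staged
        if any(path + suffix in staged for suffix in link_suffixes)
    )
```

A hook could reject the commit whenever this list is non-empty.  Files
with no matching content link are not flagged, so they would still be
left to Gerrit review.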

-Brad