I haven't read all the specs,
but one feature I think could be needed is some frame-by-frame text
that we could use to store data in a standard format:
perhaps XYZ coordinates of some points,
to be used at playback time
(driving lighting, a motion-controlled rig, or anything else),
as well as sound-space information
and subtitles.
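
To make the idea a bit more concrete, here is a rough sketch in Python of what one such per-frame text record might look like. Everything in it is an assumption for illustration only: the JSON encoding, the `FrameText` name, and all field names (`points`, `sound`, `subtitle`) are made up here and are not taken from any spec.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class FrameText:
    """Hypothetical per-frame text record (all names are illustrative)."""
    frame: int                                   # frame number this record belongs to
    points: dict = field(default_factory=dict)   # point name -> [x, y, z]
    sound: dict = field(default_factory=dict)    # free-form sound space information
    subtitle: str = ""                           # subtitle text shown on this frame


def encode(record: FrameText) -> str:
    """Serialize one record to a single line of JSON text."""
    return json.dumps(asdict(record), sort_keys=True)


def decode(text: str) -> FrameText:
    """Rebuild a record from the text stored with a frame."""
    return FrameText(**json.loads(text))


if __name__ == "__main__":
    record = FrameText(
        frame=42,
        points={"key_light": [1.0, 2.5, -0.5]},
        sound={"azimuth_deg": 30.0},
        subtitle="Hello",
    )
    line = encode(record)
    print(line)
    print(decode(line) == record)
```

A playback tool could read the text attached to each frame, decode it, and pass the point coordinates on to a lighting or motion-control rig. JSON is just one possible choice; any agreed-upon text format would do.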