From: "Nathan Shipley" <AE-List@media-motion.tv>
MIME-Version: 1.0
In-Reply-To: <list-6467848@media-motion.tv>
References: <list-6467848@media-motion.tv>
Date: Wed, 20 Jun 2018 11:42:55 -0700
Message-ID: <CAMtaTCza+B6j+sURXGkgiFu1rM=p281aeZ0Mf7j7fBM=QdqT9w@mail.gmail.com>
Subject: Re: [AE] nVidias Slo mo demo
To: After Effects Mail List <AE-List@media-motion.tv>
Content-Type: multipart/alternative; boundary="000000000000f359d9056f172bec"

--000000000000f359d9056f172bec
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

It sounds like the Nvidia tech is generating those mattes automatically and
using them to blend optical flow retimed versions of the footage, but that
it's helped by a trained neural net.  Here's the summary of the white paper
<https://arxiv.org/pdf/1712.00080.pdf> they released; I bolded some key
parts:

Given two consecutive frames, video interpolation aims at generating
intermediate frame(s) to form both spatially and temporally coherent video
sequences. While most existing methods focus on single-frame interpolation,
we propose an end-to-end convolutional neural network for variable-length
multi-frame video interpolation, where the motion interpretation and
occlusion reasoning are jointly modeled. *We start by computing
bi-directional optical flow between the input images* using a U-Net
architecture. These flows are then linearly combined at each time step to
approximate the intermediate bi-directional optical flows. These
approximate flows, however, only work well in locally smooth regions
and *produce
artifacts around motion boundaries*. To address this shortcoming, we employ
another U-Net to *refine the approximated flow and also predict soft
visibility maps*. *Finally, the two input images are warped and linearly
fused to form each intermediate frame. By applying the visibility maps to
the warped images before fusion, we exclude the contribution of occluded
pixels to the interpolated intermediate frame to avoid artifacts*. Since
none of our learned network parameters are time-dependent, our approach is
able to produce as many intermediate frames as needed. We use 1,132 video
clips with 240-fps, containing 300K individual video frames, to train our
network. Experimental results on several datasets, predicting different
numbers of interpolated frames, demonstrate that our approach performs
consistently better than existing methods.

So, yeah, David - it sounds like they've improved the video analysis part
to deal with objects moving separately with machine learning.  Which
current optical-flow based tech doesn't do, to the best of my knowledge.

Looks quite cool!  It'd be nice to see some samples of how it does on
footage that isn't already slow-motion.  There are some still frames at the
end of the above-linked white paper, but no motion.

- Nathan

On Wed, Jun 20, 2018 at 11:31 AM, David Baud <AE-List@media-motion.tv>
wrote:

> My understanding is that in order to get good result with any of the
> optical flow solutions, the system needs to be able to define the contour
> of your moving =E2=80=9Cobjects=E2=80=9D (i.e a person, a ball, a car, et=
c=E2=80=A6) in your frame.
> Better the system is capable of recognizing these objects, better results
> you will get. Twixtor Pro version will let you =E2=80=9Chelp=E2=80=9D the=
 system to define
> these contour by providing a mask for your object. As we know it can be
> time consuming to do rotoscoping. Where I think these systems can improve
> is in the automatic recognition of the moving objects in your frame, i.e
> recognizing "the person walking" and "the picket fence" as two different
> objects. I am not familiar with the proposed technology by nVidia, but
> maybe they improved the analysis of a video and the system is capable of
> calculating automatically the displacement of all objects in a frame
> separately?
>
> Maybe Peter with RE:Vision will chime in this discussion and correct me i=
f
> I am wrong =F0=9F=98=89=E2=80=A6 and maybe gives us a better understandin=
g of the optical
> flow technology in general=E2=80=A6 without revealing his secret sauce fo=
r Twixtor!
>
> David Baud
> Colorist & Finishing Editor
> david at kosmos-productions.com
>
> On Jun 20, 2018, at 11:59 , Jim Curtis <AE-List@media-motion.tv> wrote:
>
> Optical Flow and Twixtor have limitations.  Try a slo-mo of a person
> walking next to a picket fence, and see how wacky the pickets become with
> any method besides frame-blending.  There have been occasions where I=E2=
=80=99ve
> stitched together the different methods with masking and editing, as ther=
e
> seems not to be a Silver Bullet so far.  If this is it, I=E2=80=99m inter=
ested!
> Thanks for the head=E2=80=99s up.
>
>
>

--000000000000f359d9056f172bec
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">It sounds like the Nvidia tech is generating those mattes =
automatically and using them to blend optical flow retimed versions of the =
footage, but that it&#39;s helped by a trained neural net.=C2=A0 Here&#39;s=
 the summary of the <a href=3D"https://arxiv.org/pdf/1712.00080.pdf">white =
paper</a> they released; I bolded some key parts:<div><br></div><blockquote=
 style=3D"margin:0 0 0 40px;border:none;padding:0px"><div>Given two consecu=
tive frames, video interpolation aims at generating intermediate frame(s) t=
o form both spatially and temporally coherent video sequences. While most e=
xisting methods focus on single-frame interpolation, we propose an end-to-e=
nd convolutional neural network for variable-length multi-frame video inter=
polation, where the motion interpretation and occlusion reasoning are joint=
ly modeled. <b>We start by computing bi-directional optical flow between th=
e input images</b> using a U-Net architecture. These flows are then linearl=
y combined at each time step to approximate the intermediate bi-directional=
 optical flows. These approximate flows, however, only work well in locally=
 smooth regions and <b>produce artifacts around motion boundaries</b>. To a=
ddress this shortcoming, we employ another U-Net to <b>refine the approxima=
ted flow and also predict soft visibility maps</b>. <b>Finally, the two inp=
ut images are warped and linearly fused to form each intermediate frame. By=
 applying the visibility maps to the warped images before fusion, we exclud=
e the contribution of occluded pixels to the interpolated intermediate fram=
e to avoid artifacts</b>. Since none of our learned network parameters are =
time-dependent, our approach is able to produce as many intermediate frames=
 as needed. We use 1,132 video clips with 240-fps, containing 300K individu=
al video frames, to train our network. Experimental results on several data=
sets, predicting different numbers of interpolated frames, demonstrate that=
 our approach performs consistently better than existing methods.</div><div=
><br></div></blockquote>So, yeah, David - it sounds like they&#39;ve improv=
ed the video analysis part to deal with objects moving separately with mach=
ine learning.=C2=A0 Which current optical-flow based tech doesn&#39;t do, t=
o the best of my knowledge.<div><br></div><div>Looks quite cool!=C2=A0 It&#=
39;d be nice to see some samples of how it does on footage that isn&#39;t a=
lready slow-motion.=C2=A0 There are some still frames at the end of the abo=
ve-linked white paper, but no motion.</div><div><br></div><div>- Nathan</di=
v></div><div class=3D"gmail_extra"><br><div class=3D"gmail_quote">On Wed, J=
un 20, 2018 at 11:31 AM, David Baud <span dir=3D"ltr">&lt;<a href=3D"mailto=
:AE-List@media-motion.tv" target=3D"_blank">AE-List@media-motion.tv</a>&gt;=
</span> wrote:<br><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .=
8ex;border-left:1px #ccc solid;padding-left:1ex"><div style=3D"word-wrap:br=
eak-word">My understanding is that in order to get good result with any of =
the optical flow solutions, the system needs to be able to define the conto=
ur of your moving =E2=80=9Cobjects=E2=80=9D (i.e a person, a ball, a car, e=
tc=E2=80=A6) in your frame. Better the system is capable of recognizing the=
se objects, better results you will get. Twixtor Pro version will let you =
=E2=80=9Chelp=E2=80=9D the system to define these contour by providing a ma=
sk for your object. As we know it can be time consuming to do rotoscoping. =
Where I think these systems can improve is in the automatic recognition of =
the moving objects in your frame, i.e recognizing &quot;the person walking&=
quot; and &quot;the picket fence&quot; as two different objects. I am not f=
amiliar with the proposed technology by nVidia, but maybe they improved the=
 analysis of a video and the system is capable of calculating automatically=
 the displacement of all objects in a frame separately?<div><br></div><div>=
Maybe Peter with RE:Vision will chime in this discussion and correct me if =
I am wrong =F0=9F=98=89=E2=80=A6 and maybe gives us a better understanding =
of the optical flow technology in general=E2=80=A6 without revealing his se=
cret sauce for Twixtor!<br>
<br><div>
<div style=3D"color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-=
indent:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wra=
p:break-word"><span class=3D"m_-7146124434611830375Apple-style-span" style=
=3D"border-collapse:separate;color:rgb(0,0,0);font-variant-ligatures:normal=
;font-variant-caps:normal;font-variant-east-asian:normal;letter-spacing:nor=
mal;line-height:normal;text-align:-webkit-auto;text-indent:0px;text-transfo=
rm:none;white-space:normal;word-spacing:0px;border-spacing:0px"><div style=
=3D"word-wrap:break-word"><div style=3D"word-wrap:break-word"><span class=
=3D"m_-7146124434611830375Apple-style-span" style=3D"border-collapse:separa=
te;color:rgb(0,0,0);font-variant-ligatures:normal;font-variant-caps:normal;=
font-variant-east-asian:normal;letter-spacing:normal;line-height:normal;tex=
t-align:-webkit-auto;text-indent:0px;text-transform:none;white-space:normal=
;word-spacing:0px;border-spacing:0px"><div style=3D"word-wrap:break-word"><=
span class=3D"m_-7146124434611830375Apple-style-span" style=3D"border-colla=
pse:separate;color:rgb(0,0,0);font-variant-ligatures:normal;font-variant-ca=
ps:normal;font-variant-east-asian:normal;letter-spacing:normal;line-height:=
normal;text-align:-webkit-auto;text-indent:0px;text-transform:none;white-sp=
ace:normal;word-spacing:0px;border-spacing:0px"><div style=3D"word-wrap:bre=
ak-word"><span class=3D"m_-7146124434611830375Apple-style-span" style=3D"bo=
rder-collapse:separate;color:rgb(0,0,0);font-variant-ligatures:normal;font-=
variant-caps:normal;font-variant-east-asian:normal;letter-spacing:normal;li=
ne-height:normal;text-align:-webkit-auto;text-indent:0px;text-transform:non=
e;white-space:normal;word-spacing:0px;border-spacing:0px"><div style=3D"wor=
d-wrap:break-word"><span class=3D"m_-7146124434611830375Apple-style-span" s=
tyle=3D"border-collapse:separate;font-variant-ligatures:normal;font-variant=
-caps:normal;font-variant-east-asian:normal;letter-spacing:normal;line-heig=
ht:normal;text-indent:0px;text-transform:none;white-space:normal;word-spaci=
ng:0px;border-spacing:0px"><div style=3D"word-wrap:break-word"><span class=
=3D"m_-7146124434611830375Apple-style-span" style=3D"border-collapse:separa=
te;font-variant-ligatures:normal;font-variant-caps:normal;font-variant-east=
-asian:normal;letter-spacing:normal;line-height:normal;text-indent:0px;text=
-transform:none;white-space:normal;word-spacing:0px;border-spacing:0px"><sp=
an class=3D"m_-7146124434611830375Apple-style-span" style=3D"border-collaps=
e:separate;border-spacing:0px;font-variant-ligatures:normal;font-variant-ca=
ps:normal;font-variant-east-asian:normal;letter-spacing:normal;line-height:=
normal;text-indent:0px;text-transform:none;white-space:normal;word-spacing:=
0px"><span class=3D"m_-7146124434611830375Apple-style-span" style=3D"border=
-collapse:separate;border-spacing:0px;font-variant-ligatures:normal;font-va=
riant-caps:normal;font-variant-east-asian:normal;letter-spacing:normal;line=
-height:normal;text-indent:0px;text-transform:none;white-space:normal;word-=
spacing:0px"><div style=3D"font-weight:normal;font-size:14px;color:rgb(0,0,=
0);margin:0px">David Baud</div><div style=3D"color:rgb(0,0,0);margin:0px"><=
span style=3D"font-size:12px">Colorist &amp; Finishing Editor</span></div><=
/span></span></span></div></span></div></span></div></span></div></span></d=
iv><span style=3D"font-size:12px">david at <a href=3D"http://kosmos-product=
ions.com" target=3D"_blank">kosmos-productions.com</a></span></div></span><=
/div>
</div><span class=3D"">

<br><div><blockquote type=3D"cite"><div>On Jun 20, 2018, at 11:59 , Jim Cur=
tis &lt;<a href=3D"mailto:AE-List@media-motion.tv" target=3D"_blank">AE-Lis=
t@media-motion.tv</a>&gt; wrote:</div><br class=3D"m_-7146124434611830375Ap=
ple-interchange-newline"><div><div style=3D"word-wrap:break-word">Optical F=
low and Twixtor have limitations.=C2=A0 Try a slo-mo of a person walking ne=
xt to a picket fence, and see how wacky the pickets become with any method =
besides frame-blending.=C2=A0 There have been occasions where I=E2=80=99ve =
stitched together the different methods with masking and editing, as there =
seems not to be a Silver Bullet so far.=C2=A0 If this is it, I=E2=80=99m in=
terested!=C2=A0 Thanks for the head=E2=80=99s up.</div></div></blockquote><=
/div><br></span></div></div></blockquote></div><br></div>

--000000000000f359d9056f172bec--