Estimating Sceneflow using Deep Convolutional Neural Networks

This is an overview of the work done for my thesis. A detailed report will be available soon.


Analyzing the movement of objects in 3-dimensional space is an important problem in computer vision. Its applications include motion estimation, 3D shape acquisition, object detection, pose estimation, and many more. With the availability of cheap RGB-D cameras, depth information can also be leveraged to attain higher accuracy on these problems. There has been considerable progress in estimating sceneflow recently.

To put it simply, estimating sceneflow is like predicting optical flow, except that we have additional information about the problem at hand. That is, we use the depth information of the scene to predict more accurate optical flow, together with the depth change values of the scene.

Core Work

The problem of predicting optical flow has been a center of attention for quite some time now. Prof. Brox et al. (link) trained a ConvNet on a synthetic dataset and then used it to predict flow in realistic scenarios.

Since realistic labeled datasets are scarce for sceneflow estimation, we decided to take a more unsupervised approach to this problem, while still leveraging the power of labeled datasets. Hence we extended the approach proposed by Prof. Brox et al. and added unsupervised losses to the FlowNet Simple architecture to build a semi-supervised FlowNet. We used the sceneflow synthetic dataset for training our models.


Three major unsupervised losses were tried during this work.

  • Photoconsistency Loss
  • Forward Backward Loss
  • FlowNet as a GAN

I recommend watching this short video to understand how the losses work. We warp our images with the optical flow values predicted by our network. Here's what a warped image from the driving scene of the mentioned synthetic dataset looks like for images I1 and I2.
The images above are consecutive frames from the scene, while the image below is the resulting warped image. The black area around the image represents occlusions.

Photoconsistency Loss

Let's say we have two images I1 (RGBD) and I2 (RGBD) taken at times t1 and t2 respectively. When training our network, we stack these images together as a single 8-channel input. The network then provides us with the predicted flow values in the x and y directions for these images. We can then reconstruct the first image by warping the second image with the predicted flow values (see the detailed report for this process). Once we have the warped image, we want it to be as close as possible to the first image, so we minimize the difference between the two. More formally, the PC loss can be defined as follows.

We warp I2 with the predicted flow values in the x (u) and y (v) directions and take the difference between the resulting warped image and I1.
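The warping and comparison described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names, the bilinear sampling, the L1 difference, and the averaging over in-frame pixels only are all choices made for this example. Pixels whose source falls outside the frame are set to zero, which corresponds to the black border mentioned earlier.

```python
import numpy as np


def bilinear_sample(image, x_src, y_src):
    """Sample `image` (H, W, C) at float coordinates; zero outside the frame."""
    h, w = image.shape[:2]
    valid = (x_src >= 0) & (x_src <= w - 1) & (y_src >= 0) & (y_src <= h - 1)
    # Top-left neighbour of every sample, clipped so that +1 stays in range.
    x0 = np.clip(np.floor(x_src).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y_src).astype(int), 0, h - 2)
    wx = np.clip(x_src - x0, 0.0, 1.0)[..., None]
    wy = np.clip(y_src - y0, 0.0, 1.0)[..., None]
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x0 + 1]
    bottom = (1 - wx) * image[y0 + 1, x0] + wx * image[y0 + 1, x0 + 1]
    out = (1 - wy) * top + wy * bottom
    return out * valid[..., None], valid


def warp_with_flow(i2, flow):
    """Warp I2 toward I1: warped(x, y) = I2(x + u, y + v)."""
    h, w = i2.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    return bilinear_sample(i2, xs + flow[..., 0], ys + flow[..., 1])


def photoconsistency_loss(i1, i2, flow):
    """Mean L1 difference between warped I2 and I1 (assumed form of the loss)."""
    warped, valid = warp_with_flow(i2, flow)
    per_pixel = np.abs(warped - i1).sum(axis=-1)
    # Assumption: only pixels with an in-frame source contribute.
    return float((per_pixel * valid).sum() / max(valid.sum(), 1))


# Toy check: I2 is I1 shifted one pixel to the right, so the true flow is u = 1.
rng = np.random.default_rng(0)
i1 = rng.random((6, 8, 4))            # 4 channels, as for an RGB-D image
i2 = np.roll(i1, shift=1, axis=1)
true_flow = np.zeros((6, 8, 2))
true_flow[..., 0] = 1.0
loss_true = photoconsistency_loss(i1, i2, true_flow)
loss_zero = photoconsistency_loss(i1, i2, np.zeros((6, 8, 2)))
```

With the correct flow the warped image matches I1 on every in-frame pixel, so `loss_true` vanishes, while a zero flow leaves a clearly positive loss.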


Forward Backward Loss

Instead of only feeding the images to the network in the forward flow direction, we also feed them in the backward flow direction. The idea behind the forward backward loss is to ensure that the flow values are accurate enough that, if we move from I1 to I2 and then from I2 back to I1, we end up at the same place where we started.

To understand this loss, take a look at the figure given below. A forward flow takes pixel P11 in the first image to P23 in the second image. After performing the backward flow, we see that instead of ending up back at P11 we end up at P31, and this discrepancy is the loss that we need to minimize.

More formally, we can define forward backward loss as follows.

Here 'f' represents the forward flow and 'b' represents the backward flow. We warp the backward flow with the forward flow, so that the backward flow is read at position P23 of the second image rather than at the coordinates of P11.
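The round-trip check can be sketched in NumPy as follows. This is an illustrative sketch under assumptions, not the thesis code: the residual is taken as f(p) + b(p + f(p)), its Euclidean norm is averaged over pixels that land inside the frame, and the function names are hypothetical.

```python
import numpy as np


def sample_bilinear(field, x_src, y_src):
    """Sample `field` (H, W, C) at float coordinates; zero outside the frame."""
    h, w = field.shape[:2]
    valid = (x_src >= 0) & (x_src <= w - 1) & (y_src >= 0) & (y_src <= h - 1)
    x0 = np.clip(np.floor(x_src).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y_src).astype(int), 0, h - 2)
    wx = np.clip(x_src - x0, 0.0, 1.0)[..., None]
    wy = np.clip(y_src - y0, 0.0, 1.0)[..., None]
    top = (1 - wx) * field[y0, x0] + wx * field[y0, x0 + 1]
    bottom = (1 - wx) * field[y0 + 1, x0] + wx * field[y0 + 1, x0 + 1]
    out = (1 - wy) * top + wy * bottom
    return out * valid[..., None], valid


def forward_backward_loss(flow_f, flow_b):
    """Mean norm of f(p) + b(p + f(p)) over pixels that stay in the frame."""
    h, w = flow_f.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Read the backward flow where the forward flow lands (P23 in the text).
    b_warped, valid = sample_bilinear(
        flow_b, xs + flow_f[..., 0], ys + flow_f[..., 1]
    )
    # The residual is zero when the round trip returns to the starting pixel.
    residual = flow_f + b_warped
    per_pixel = np.sqrt((residual ** 2).sum(axis=-1))
    return float((per_pixel * valid).sum() / max(valid.sum(), 1))


# Toy check: a forward flow of 2 pixels to the right everywhere.
flow_f = np.zeros((5, 7, 2))
flow_f[..., 0] = 2.0
consistent_b = -flow_f                 # returns exactly to the start
drifting_b = np.zeros((5, 7, 2))
drifting_b[..., 0] = -1.0              # comes back only halfway
loss_consistent = forward_backward_loss(flow_f, consistent_b)
loss_drifting = forward_backward_loss(flow_f, drifting_b)
```

A backward flow that exactly undoes the forward flow gives a zero loss, while the drifting one leaves a one-pixel residual at every in-frame pixel.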

FlowNet as a GAN

GANs are useful for learning the distribution of the data using two separate networks. A GAN has the following architecture.

Source: https://deeplearning4j.org/generative-adversarial-network

Here we train two separate networks. The Generator's responsibility is to learn the underlying distribution of the data and ensure that its resulting images are very close to the real images we want to generate. The Discriminator's job is to identify whether the image being fed into it was taken from the real dataset or is a fake image produced by the generator. In short, the generator has to fool the discriminator by generating realistic images.

We introduced a discriminator in front of the FlowNet, together with an additional adversarial loss. The adversarial loss is defined as follows.
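Assuming the standard minimax formulation from the original GAN paper (Goodfellow et al., 2014), with D the discriminator, G the generator, x a real sample and z the generator input, the objective can be written as below. In the setup described here the FlowNet would play the role of G; how the discriminator inputs were built is not specified in this overview.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```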

Source: GAN Paper

Since we already have labeled data, the idea behind the final GAN loss was taken from this paper, so that our network combines adversarial and supervised losses. Hence, our final loss becomes:

Here Lambda_adv is the weight assigned to the adversarial loss and EPE is the endpoint error loss, which also served as the supervised loss in our case.
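A minimal NumPy sketch of how such a combined loss could be computed is given below. It assumes the form EPE + Lambda_adv * L_adv, takes the generator term of the original GAN objective, log(1 - D(G(.))), as L_adv, and uses a placeholder weight of 0.01; none of these specifics are confirmed by this overview, and the function names are hypothetical.

```python
import numpy as np


def endpoint_error(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    return float(np.sqrt(((flow_pred - flow_gt) ** 2).sum(axis=-1)).mean())


def generator_adversarial_loss(d_fake, eps=1e-8):
    """Generator term log(1 - D(G(.))), averaged over samples.

    `d_fake` holds discriminator probabilities for generated flows;
    the loss gets smaller as the discriminator is fooled (d_fake -> 1).
    """
    return float(np.mean(np.log(1.0 - d_fake + eps)))


def total_loss(flow_pred, flow_gt, d_fake, lambda_adv=0.01):
    """Assumed combination: EPE + lambda_adv * adversarial term."""
    epe = endpoint_error(flow_pred, flow_gt)
    return epe + lambda_adv * generator_adversarial_loss(d_fake)


# Toy check: every predicted vector is off by (3, 4), so the EPE is 5.
flow_gt = np.zeros((4, 4, 2))
flow_pred = np.zeros((4, 4, 2))
flow_pred[..., 0] = 3.0
flow_pred[..., 1] = 4.0
epe = endpoint_error(flow_pred, flow_gt)
loss_fooled = total_loss(flow_pred, flow_gt, d_fake=np.full(8, 0.9))
loss_caught = total_loss(flow_pred, flow_gt, d_fake=np.full(8, 0.1))
```

With this form, the supervised EPE term dominates for small Lambda_adv, and the adversarial term only nudges the generator toward flows the discriminator accepts as real.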


Results will be posted soon.