DECONVOLUTION METHOD FOR VIEW INTERPOLATION USING MULTIPLE IMAGES OF CIRCULAR CAMERA ARRAY

Akira Kubota 1*, Kazuya Kodama 2, Yoshinori Hatori 1

1 Interdisciplinary Graduate School of Science and Technology, Tokyo Institute of Technology, Japan
2 Research Organization of Information and Systems, National Institute of Informatics, Japan

* [email protected]

ABSTRACT
This paper deals with a view interpolation problem using multiple images captured with a circular camera array. A novel deconvolution method for reconstructing a virtual image at the center of the camera array is presented and discussed within a framework of image restoration in the frequency domain. The reconstruction filter does not depend on the scene structure; therefore, the presented method requires no depth estimation. Simulation results using synthetic images show that increasing the number of cameras improves the quality of the virtual image.
Fig. 1: Setting of the circular camera array and the virtual camera (five cameras, numbered 0 to 4, in this figure; the virtual camera is at the center).
1. INTRODUCTION

Image interpolation or restoration is a traditional problem that has been discussed extensively in image processing. In contrast, view interpolation, the problem of generating a virtual view from reference images captured at different positions, has hardly been discussed in image processing, although many computer-vision-based approaches exist. This is because view interpolation becomes an ill-posed problem in most cases when no geometric information about the scene, such as a depth map, is available. However, the goal of view interpolation is not to recover the scene geometry but to generate the virtual view; hence it is meaningful to discuss the possibility of a signal processing approach that can generate the virtual view while bypassing the estimation of scene information.

In this paper, we present a novel deconvolution (inverse filtering) approach to view interpolation that does not require estimating the scene information. In the view interpolation problem considered here, we use reference images captured with a circular camera array in which all cameras are located on a circle and directed toward the scene (see Fig. 1), and we generate a virtual image at the center of the circle. By modeling linear formulations for both the reference images and the desired virtual image, we derive an inverse filter that is independent of the scene. Our approach differs from Plenoptic sampling [1], which derived an ideal low-pass filter and the maximum camera spacing allowed for non-aliased view interpolation based on sampling theory. This paper also investigates the effect of the number of images (cameras) on the quality of the generated virtual image, leading to the important observation that increasing the number of cameras makes the inverse filter more stable even though the distance between the virtual and reference cameras remains constant (i.e., the radius of the circle).

2. VIEW INTERPOLATION BY DECONVOLUTION

The problem setting of the view interpolation in this paper is shown in Fig. 1. We capture N reference images with N cameras arranged on a circle centered at the origin O of the XYZ coordinate system. All the cameras are set on the XY plane with their optical axes parallel to the Z axis (which represents the depth from the camera array). Each camera position is represented by the polar coordinates (R, θ_n), and the image captured with that camera by i_n(x), where R is the circle radius, x denotes the pixel coordinate (x, y)^T, and θ_n = 2πn/N for n = 0, 1, ..., N - 1. We assume that all the cameras, including the virtual camera, have the same focal length f. The goal is to generate a virtual image i_v(x) at the center of the circle (i.e., at the origin O). In this section, we present a novel approach that consists of two steps: generating multiple candidate images for the virtual image by a plane-sweeping method, and reconstructing the virtual view from them using inverse filtering.
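As a small illustration of this setting, the sketch below (a minimal Python snippet with a function name of our own choosing; it only encodes the geometry described above) computes the camera centers on the circle:

```python
import numpy as np

def camera_centers(N, R):
    """Centers of the N reference cameras: on a circle of radius R in the XY
    plane (Z = 0), at angles theta_n = 2*pi*n/N; the virtual camera sits at
    the origin O, and all optical axes point along the Z axis."""
    theta = 2.0 * np.pi * np.arange(N) / N
    return np.stack([R * np.cos(theta), R * np.sin(theta), np.zeros(N)], axis=1)

# Example: the 5-camera arrangement of Fig. 1 with the radius used in Sec. 3.
print(camera_centers(5, R=3.0))
```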
2.1. Plane-sweeping view generation

In the first step of our approach, we generate candidate images for the desired virtual image by light field rendering (LFR) [2] based on an assumed plane Z = L (called the focal plane). Letting the candidate image be g(x; L), it is generated as the average of the corresponding pixel values of the reference images i_n(x) under the assumed depth as follows:
g(x; L) = \frac{1}{N} \sum_{n=0}^{N-1} i_n(x - t_n / L),     (1)

where t_n = (Rf cos θ_n, Rf sin θ_n)^T. Note that t_n / L is the disparity between the virtual image and the n-th reference image. By sweeping the focal plane along the Z axis and sampling its depth uniformly in the inverse depth l = 1/L, we generate multiple candidate images and treat them as a 3D signal g(x, l).
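As a concrete sketch of eq. (1) (our assumptions: the reference images are NumPy arrays of identical size, the disparities t_n / L are expressed in pixel units and rounded to the nearest pixel, image borders wrap around, and the function name is ours, not the paper's):

```python
import numpy as np

def candidate_image(ref_images, R, f, L):
    """Plane-sweep candidate g(x; L) of eq. (1): the average of the reference
    images i_n after shifting each one by its disparity t_n / L, where
    t_n = R*f*(cos theta_n, sin theta_n)^T and theta_n = 2*pi*n/N."""
    N = len(ref_images)
    theta = 2.0 * np.pi * np.arange(N) / N
    t = R * f * np.stack([np.cos(theta), np.sin(theta)], axis=1)  # t_n, shape (N, 2)
    shifts = np.rint(t / L).astype(int)                           # nearest-pixel disparities
    g = np.zeros_like(ref_images[0], dtype=np.float64)
    for n, img in enumerate(ref_images):
        # i_n(x - t_n/L): shift the content of i_n by +t_n/L (rows = y, columns = x)
        g += np.roll(img, shift=(shifts[n, 1], shifts[n, 0]), axis=(0, 1))
    return g / N
```

Sweeping L over a set of focal-plane depths and stacking the results along l = 1/L yields the 3D signal g(x, l) used below.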
2.2. Signal formation model

Assuming that the scene consists of Lambertian surfaces and that occlusion is negligible, we can derive the following linear, spatially invariant formation model of the 3D signal g(x, l) using a 3D convolution (see Appendix):

g(x, l) = h_N(x, l) * s(x, l),     (2)
where h_N(x, l) is the 3D point spread function (PSF) for N cameras, s(x, l) represents the 3D intensity distribution of the scene texture (i.e., the scene information itself), and * denotes the 3D convolution operation. The 3D PSF depends only on the arrangement of the camera array and can be represented by

h_N(x, l) = \begin{cases} \frac{1}{N} \sum_{n=0}^{N-1} \delta(x - l\, t_n), & -\bar{l} \le l \le \bar{l} \\ 0, & \text{otherwise,} \end{cases}     (3)

where δ is the Dirac delta function and -\bar{l} \le l \le \bar{l} is the support of the PSF in the variable l. The scene information s(x, l) is defined using the desired virtual image as
s(x, l) = i_v(x) \cdot \delta(d(x) - 1/l),     (4)
where d(x) is the depth map as observed from the origin. The scene information is thus a 3D signal consisting of the 2D textures visible from the origin; both the scene information and the depth map are unknown. The desired image i_v(x) is the projection of the scene information onto the virtual view position; therefore it can be formulated as the integral of the scene information:

i_v(x) = \int s(x, l) \, dl.     (5)
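For concreteness, the 3D PSF of eq. (3) can be sketched on a discrete grid as follows (our assumptions: the l axis is sampled on [-l_bar, l_bar], the spatial origin is placed at the image center, each Dirac delta is approximated by a unit impulse at the nearest pixel, and out-of-range impulses wrap around):

```python
import numpy as np

def psf_3d(N, R, f, l_values, height, width):
    """Discrete 3D PSF h_N(x, l) of eq. (3): for every sample l in l_values,
    impulses of weight 1/N at the positions x = l * t_n, n = 0, ..., N-1."""
    theta = 2.0 * np.pi * np.arange(N) / N
    t = R * f * np.stack([np.cos(theta), np.sin(theta)], axis=1)  # t_n
    cy, cx = height // 2, width // 2                              # spatial origin
    h = np.zeros((len(l_values), height, width))
    for k, l in enumerate(l_values):
        for n in range(N):
            ix = int(np.rint(cx + l * t[n, 0])) % width           # column (x component)
            iy = int(np.rint(cy + l * t[n, 1])) % height          # row (y component)
            h[k, iy, ix] += 1.0 / N
    return h
```

The 3D FFT of this array approximates H_N(u, w) in eq. (6) below; its w = 0 slice, scaled by the l sampling step, approximates the closed form of eq. (9).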
2.3. Deconvolution method

We wish to derive an inverse filter that generates the desired image i_v(x) modeled by eq. (5) from the 3D signal g(x, l) modeled by eq. (2). Taking the 3D Fourier transform of eq. (2) with respect to x and l, we obtain the simple form

G(u, w) = H_N(u, w) S(u, w),     (6)
where G(u, w), H_N(u, w), and S(u, w) are the 3D Fourier transforms of g(x, l), h_N(x, l), and s(x, l), respectively, and u = (u, v)^T and w denote the frequencies with respect to x = (x, y)^T and l. Using the Fourier projection theorem, the 2D Fourier transform of eq. (5) with respect to x is expressed as

I_v(u) = S(u, 0),     (7)
where I_v(u) is the 2D Fourier transform of i_v(x). Eliminating the unknown S(u, 0) from eqs. (6) and (7) yields the following inverse-filtering (deconvolution) method:

I_v(u) = \begin{cases} 0, & |H_N(u, 0)| \le \epsilon \\ \dfrac{G(u, 0)}{H_N(u, 0)}, & \text{otherwise,} \end{cases}     (8)

where ε is a constant threshold and H_N(u, 0) is calculated as

H_N(u, 0) = \frac{1}{N} \sum_{n=0}^{N-1} \frac{\sin\{2\pi \bar{l}\, (u^T t_n)\}}{\pi\, (u^T t_n)}.     (9)

Since H_N(u, 0) has zeros whose locations depend on the number of cameras, the thresholding in eq. (8) is needed to suppress the associated artifacts; the same idea is used in traditional image restoration methods. It should be noted that, using the presented deconvolution method, I_v(u) can be generated from G(u, 0) without estimating the scene information S(u, w).
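Eqs. (8) and (9) can be sketched as follows (our assumptions: g is the candidate stack sampled on an l axis with step dl and support [-l_bar, l_bar], x and t_n are both expressed in pixel units so that the frequency grid in cycles per pixel is consistent, and the function names are ours):

```python
import numpy as np

def h_slice(N, R, f, l_bar, height, width):
    """Closed-form H_N(u, 0) of eq. (9) on the discrete frequency grid."""
    uu, vv = np.meshgrid(np.fft.fftfreq(width), np.fft.fftfreq(height))
    H = np.zeros((height, width))
    for th in 2.0 * np.pi * np.arange(N) / N:
        ut = R * f * (uu * np.cos(th) + vv * np.sin(th))       # u^T t_n
        denom = np.where(np.abs(ut) < 1e-12, 1.0, np.pi * ut)  # avoid 0/0 at u^T t_n = 0
        H += np.where(np.abs(ut) < 1e-12,
                      2.0 * l_bar,                             # limit of the term as u^T t_n -> 0
                      np.sin(2.0 * np.pi * l_bar * ut) / denom)
    return H / N

def reconstruct_virtual_view(g, N, R, f, l_bar, dl, eps=0.064):
    """Thresholded inverse filter of eq. (8); g has shape (num_l, height, width)."""
    num_l, height, width = g.shape
    G0 = np.fft.fft2(g.sum(axis=0) * dl)            # G(u, 0): 2D FFT of the integral of g over l
    H0 = h_slice(N, R, f, l_bar, height, width)
    safe_H0 = np.where(np.abs(H0) <= eps, 1.0, H0)  # keep the division well defined
    Iv = np.where(np.abs(H0) <= eps, 0.0, G0 / safe_H0)
    return np.real(np.fft.ifft2(Iv))                # i_v(x)
```

Suitable values of l_bar and dl depend on how the focal-plane samples are placed along l, which is not spelled out above, so they are left open in this sketch.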
The next section evaluates the presented method and discusses the effect of the number of cameras on the quality of the virtual image.

3. SIMULATION RESULTS AND DISCUSSIONS

The proposed deconvolution method was tested using synthetic reference images (320x240 pixels) of a CG scene. The scene structure and the camera array position used in this simulation are illustrated, as a view from the Y axis, in Fig. 2. The simulation conditions were set as follows: depth range of the scene, 75-100; radius of the camera-array circle, R = 3; horizontal field of view, 30 degrees; and focal length, f = 1.

Examples of the candidate virtual images generated by LFR in the first step of our method are shown in Fig. 3. Figure 3(a) shows the candidate images generated with the focal plane at different depths when 4 cameras were used. In each image, the regions not on the focal plane suffer from ghosting artifacts due to pixel mis-correspondence and contain 4 different textures blended together.
Fig. 2: Scene and camera setting in the simulation (view from the Y axis; 30-degree field of view, 320x240 pixels, scene depth 75-100, circle radius 3).

Fig. 3: Ghosting artifacts when the focal plane depth and the number of cameras are changed. (a) Effect of the focal plane depth for the case of 4 cameras. (b) Effect of the number of cameras (N = 1, 3, 12; the focal plane is set at the surface of the sphere).

The effect of the number of cameras can be observed in Fig. 3(b), where the focal plane was set at the depth of the sphere surface in the scene. It can be seen that the background texture is degraded more and more as the number of cameras increases, while the texture on the sphere surface remains sharp. The image generated for N = 1 does not appear blurry, but it is a globally shifted version of the reference image and differs from the desired virtual image.

Figure 4 shows the virtual images finally generated by the presented deconvolution method for various numbers of cameras; the ground-truth image is also shown for comparison. The size of the l dimension was 64, and ε was set to 0.064. These results indicate that increasing the number of cameras improves the quality of the virtual image, which is also supported by the PSNRs measured in Fig. 5. The reason is that H_N(u, 0) has fewer zeros as the number of cameras increases, which can be shown using eq. (9); as a result, a more stable reconstruction is possible.

The virtual images generated for N = 1, 2, 3, and 6 are degraded and contain scratch-like patterns. In addition, the results for N = 1 and N = 2 are almost the same in quality, and the results for N = 3 and N = 6 are also similar. This can be explained theoretically by the facts that H_N(u, 0) has more zeros for N = 1, 2, 3, and 6 than in the other cases and that the relationships H_1(u, 0) = H_2(u, 0) and H_3(u, 0) = H_6(u, 0) hold (both facts follow from eq. (9)). This also indicates that it is more efficient to use an odd number of cameras for generating a virtual image of better quality. When more than 9 cameras were used, the virtual images were generated with sufficient quality, without visible artifacts even at occluded boundaries, although the signal formation model is incorrect at occluded boundaries.
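The relationships used above can be checked directly from eq. (9). The short sketch below (grid extent and l_bar are arbitrary illustrative values) verifies numerically that H_1(u, 0) = H_2(u, 0) and H_3(u, 0) = H_6(u, 0); these identities hold because each term of eq. (9) is even in u^T t_n, so cameras at opposite angles contribute identical terms.

```python
import numpy as np

def h_slice(N, R=3.0, f=1.0, l_bar=0.5, extent=4.0, size=201):
    """H_N(u, 0) of eq. (9) evaluated on a square frequency grid."""
    freqs = np.linspace(-extent, extent, size)
    uu, vv = np.meshgrid(freqs, freqs)
    H = np.zeros_like(uu)
    for th in 2.0 * np.pi * np.arange(N) / N:
        ut = R * f * (uu * np.cos(th) + vv * np.sin(th))       # u^T t_n
        denom = np.where(np.abs(ut) < 1e-12, 1.0, np.pi * ut)
        H += np.where(np.abs(ut) < 1e-12, 2.0 * l_bar,
                      np.sin(2.0 * np.pi * l_bar * ut) / denom)
    return H / N

# Cameras at theta and theta + pi give identical even terms, hence:
assert np.allclose(h_slice(1), h_slice(2))   # H_1(u, 0) = H_2(u, 0)
assert np.allclose(h_slice(3), h_slice(6))   # H_3(u, 0) = H_6(u, 0)
```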
4. CONCLUSION

We have presented a novel view interpolation method based on deconvolution for generating the virtual image at the center of a camera array arranged on a circle. By introducing linear, spatially invariant models for the reference images and the virtual image, we derived an inverse filter that can generate the virtual image without estimating the scene geometry. It was shown theoretically and experimentally that increasing the number of cameras improves the image quality. As future work, to apply the presented method to real reference images, we need to incorporate regularization techniques and other optimizations to improve the quality. One application of our method would be a teleconferencing system that enables eye contact: using a number of images captured by cameras arranged around the monitor, we would generate the virtual image at the center of the monitor.

Appendix: Derivation of the signal formation model

First, consider the case where the scene has a single texture s(x; L_0) at depth L_0, and assume that the texture lies on a Lambertian surface. The virtual image generated by LFR based on the depth L_0 can be simply expressed as

g(x; L_0) = i_n(x - t_n / L_0), for all n.     (10)
The virtual image g(x; L_0) can also be expressed by s(x; L_0) itself; hence the following equation holds:

i_n(x - t_n / L_0) = s(x; L_0), for all n.     (11)
Fig. 4: Simulation results: the ground truth and the virtual images finally generated for N = 1, 2, 3, 4, 5, 6, 9, and 13.

Fig. 5: Measured PSNR [dB] of the red, green, and blue channels of the generated virtual images as a function of the number of cameras (1 to 14).

When the virtual image is generated based on an arbitrary depth L, it is given from eqs. (1) and (11) as
g(x; L) = \frac{1}{N} \sum_{n=0}^{N-1} s\big(x - (1/L - 1/L_0)\, t_n;\, L_0\big).     (12)
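For clarity, the step from eqs. (1) and (11) to eq. (12) can be written out as a short worked derivation. Eq. (11) states that i_n(y) = s(y + t_n / L_0; L_0) for any y, so each term of eq. (1) becomes

i_n(x - t_n / L) = s\big(x - t_n / L + t_n / L_0;\, L_0\big) = s\big(x - (1/L - 1/L_0)\, t_n;\, L_0\big),

and averaging over n gives eq. (12).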
Next, a general scene can be regarded as an integration of the textures s(x; L_0) over L_0 if the effect of occlusion is ignored. By transforming the images on both sides of eq. (12) into 3D signals using l = 1/L and l_0 = 1/L_0, and by integrating the equation over l_0, we obtain

g(x, l) = \int \frac{1}{N} \sum_{n=0}^{N-1} s(x - (l - l_0)\, t_n,\, l_0) \, dl_0.     (13)
As shown below, the 3D convolution of the 3D PSF h_N(x, l) with the scene information s(x, l) equals the right-hand side of eq. (13); therefore, eq. (2) is derived:

h_N(x, l) * s(x, l) = \iint s(x_0, l_0)\, h_N(x - x_0, l - l_0)\, dx_0\, dl_0
                    = \iint s(x_0, l_0)\, \frac{1}{N} \sum_{n=0}^{N-1} \delta\big(x - x_0 - (l - l_0)\, t_n\big)\, dx_0\, dl_0
                    = \int \frac{1}{N} \sum_{n=0}^{N-1} s(x - (l - l_0)\, t_n,\, l_0)\, dl_0.     (14)
5. REFERENCES

[1] J.-X. Chai, X. Tong, S.-C. Chan, and H.-Y. Shum, "Plenoptic sampling," Proc. of SIGGRAPH 2000, pp. 307-318, 2000.

[2] M. Levoy and P. Hanrahan, "Light field rendering," Proc. of SIGGRAPH '96, pp. 31-42, 1996.