3D IMAGE SEGMENTATION IMPLEMENTATION ON FPGA USING THE EM/MPM ALGORITHM

Yan Sun and Lauren A. Christopher
Department of Electrical and Computer Engineering, Indiana University-Purdue University Indianapolis, Indianapolis, IN 46202 USA

ABSTRACT

In this paper, 3D image segmentation is implemented on Field Programmable Gate Array (FPGA) hardware. Segmentation is performed using the EM/MPM Bayesian algorithm. This algorithm segments 3D images using neighboring pixels in a Markov Random Field (MRF) clique. We have implemented this iterative algorithm on a Xilinx FPGA and have achieved greater than 100 times improvement over standard desktop computing hardware. Key to achieving this speed are a pipelined, parallel data path structure and a novel memory interface that solves the external memory bandwidth problem.

Index Terms—3D image segmentation, FPGA, EM/MPM

1. INTRODUCTION

Due to its significant advantages in visualization, 3D imaging is becoming more and more common in many areas of our lives. In the medical field, because of the complexity and diversity of human organs as well as the unpredictable location of lesions, it is difficult to obtain accurate and complete tissue segmentation from 2D images. Fast 3D image processing therefore allows doctors to view 3D rendered tissues and organs for diagnosis, treatment planning, and even surgical assistance in the operating room. In all cases, the first step toward good visualization is image segmentation. Several 3D image segmentation algorithms have been published recently. We focus here on the Bayesian algorithm that combines Expectation Maximization with the Maximization of Posterior Marginals (EM/MPM), which has proven effective in many segmentation tasks [1, 2, 3]. This algorithm classifies every pixel in an image by assigning a cost to the number of misclassified pixels and iteratively finding the best probabilistic solution that fits the data. The high pixel count of 3D images results in gigabytes of data to process. Standard computing architectures are not well suited to this task because of fixed memory bandwidth and large instruction set overhead. FPGAs and ASICs have distinct advantages when it comes to a specific task with large data sets.


Hardware implementations can have significant parallelism, and one can design dedicated high-speed electronic circuits. FPGAs can execute digital signal processing algorithms and can implement complex, high-speed logic. We have implemented 3D image segmentation in a Xilinx FPGA using two major innovations. The first is optimizing the 3D data path speed, taking advantage of the iterative nature of the algorithm. The second is improving the efficiency of the memory transfers by reducing repetitive data accesses.

To cope with the memory bandwidth problem, a variety of methods have been proposed. Li [4] presented a "brick" caching scheme for 3D medical imaging aimed at speeding up processing on an FPGA; it implied that parallel memory access and brick pre-fetching are possible, but some ideas were left for future study. A parallel processor array for filtered backprojection was developed in [5] to speed up processing. Goddard et al. [6] performed high-speed cone-beam reconstruction based on an embedded-systems approach, and Coric et al. [7] implemented parallel-beam backprojection on an FPGA platform for medical imaging. Accelerated volume rendering and tomographic reconstruction were demonstrated by Cabral [8] using texture mapping hardware. Mueller [9] reported fast and accurate three-dimensional reconstruction from cone-beam projection data using algebraic methods in his PhD dissertation. Much work has been done to port medical algorithms to hardware or FPGAs; however, the external memory bandwidth limitation still restricts the speed of the algorithms referenced above.

In this paper, we propose an efficient on-chip system for the key segmentation algorithm (EM/MPM) using a parallel, pipelined structure for the data path and a "pingpong" internal memory architecture. For the specific EM/MPM task, we have eliminated the external memory interface from the critical path. Section 2 provides a brief introduction to EM/MPM, Section 3 describes the implementation in detail, Section 4 presents the results, and Section 5 concludes with the processing speed improvement obtained from the parallel structure on the FPGA platform.

2. 3D EM/MPM

2.1 Introduction

The two parts of the algorithm are EM and MPM.


Expectation Maximization (EM) is employed for hyperparameter estimation of the means and variances of the Gaussian statistical models of the segmentation classes. MPM assigns the segmentation class per pixel using a Markov Random Field and iterative optimization. MPM uses the Gaussian parameters estimated by the EM calculations. The EM/MPM optimization executes seven MPM iterations per EM iteration; MPM is therefore our main target for parallelism and improved processing speed. MPM uses a 6-pixel neighborhood to drive the segmentation. EM uses the class persistence from the seven MPM iterations to estimate new means and variances. After tens of EM/MPM iterations, the result of the algorithm typically converges to the optimal (probabilistic) segmentation.

2.2 Expectation-Maximization

EM has two phases: the expectation step and the maximization step. The EM algorithm estimates the Gaussian segmentation class means and variances, represented by the vector $\theta = (\mu_1, \sigma_1^2, \ldots, \mu_N, \sigma_N^2)$. The segmentation is done in the expectation step, iterating to find the best log-likelihood of the probability that a particular pixel belongs to one of the classes. The EM function is the standard Bayesian form, and after iterations it converges to the optimum solution:

$$p_{X|Y}(x \mid y, \theta) \propto f_{Y|X}(y \mid x, \theta)\, p_X(x),$$

where the left-hand side is the posterior, $f_{Y|X}(y \mid x, \theta)$ is the (Gaussian) conditional distribution of the observed data, and $p_X(x)$ is the prior distribution.
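As an illustration only (not taken from the paper), the following Python sketch shows one standard EM-style update of the class means and variances from per-pixel class probabilities; the array names and the use of NumPy are assumptions of this sketch.

```python
import numpy as np

def em_update(y, class_prob):
    """One EM-style hyperparameter update (illustrative sketch).

    y          : observed pixel values, shape (num_pixels,)
    class_prob : estimated probability that each pixel belongs to each class,
                 shape (num_pixels, num_classes); in EM/MPM these probabilities
                 come from the class persistence of the MPM iterations.
    Returns the updated per-class means and variances.
    """
    # Soft pixel count per class (small constant avoids division by zero).
    weights = class_prob.sum(axis=0) + 1e-12

    # Weighted mean of the data for each class.
    means = (class_prob * y[:, None]).sum(axis=0) / weights

    # Weighted variance of the data around each class mean.
    variances = (class_prob * (y[:, None] - means) ** 2).sum(axis=0) / weights
    return means, variances
```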

2.3. 3D Maximization of Posterior Marginals

At each pixel, the MPM optimization uses the Gaussian distribution of each class and the class probabilities of the neighborhood pixels. As in the literature [1, 2, 3], the 3D pixel neighborhood is defined by the clique function $t(x_s, x_r)$, where $x_s$ is the center pixel label to be assigned and $x_r$, $r \in N_s$, are the nearest 6 pixel neighbors: up, down, left, right, next slice, previous slice. The neighborhood prior probability is defined below as the probability of the segmentation choice of $x_s$, given the segmentation choices of its neighbors:

$$p(x_s = k \mid x_r, r \in N_s) = \frac{1}{Z} \exp\Big(-\beta \sum_{r \in N_s} t(k, x_r) - \gamma_k\Big),$$

where $t(k, x_r) = 1$ if $k \neq x_r$ and 0 otherwise, and

$Z$ = normalizing value
$\beta$ = weighting factor for the amount of spatial interaction
$\gamma_k$ = cost factor for class $k$, used for atlas weighting.

The Gaussian likelihood and the prior probability are combined in an equation to be optimized by the FPGA hardware computational block. This is calculated in the log domain, eliminating constants and exponentials, where the current segmentation estimate $\hat{x}_s$ is computed by:

$$\hat{x}_s = \arg\min_k \left[ \log\!\big(\sqrt{2\pi}\,\sigma_k\big) + \frac{(y_s - \mu_k)^2}{2\sigma_k^2} + \beta \sum_{r \in N_s} t(k, x_r) + \gamma_k \right].$$
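To make the log-domain optimization concrete, here is a small Python sketch of the per-pixel class decision over the 6-neighborhood. It is a minimal illustration under our own naming assumptions, not the hardware implementation, and it omits the randomization factor discussed below.

```python
import math

def mpm_pixel_decision(y_s, neighbor_labels, means, variances, beta, gamma):
    """Choose the class label minimizing the log-domain MPM cost for one pixel.

    y_s             : observed value of the center pixel
    neighbor_labels : current labels of the 6 neighbors (up, down, left, right,
                      previous slice, next slice)
    means, variances: per-class Gaussian parameters from the EM step
    beta            : spatial interaction weight
    gamma           : per-class cost factors (atlas weighting)
    """
    best_k, best_cost = None, float("inf")
    for k, (mu, var) in enumerate(zip(means, variances)):
        # Negative log of the Gaussian likelihood (constant terms dropped).
        data_term = math.log(math.sqrt(2.0 * math.pi * var)) + (y_s - mu) ** 2 / (2.0 * var)
        # Prior term: count neighbors that disagree with class k.
        prior_term = beta * sum(1 for x_r in neighbor_labels if x_r != k) + gamma[k]
        cost = data_term + prior_term
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k
```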


The maximization also contains a randomization factor, similar to simulated annealing, which allows the algorithm to avoid local minimum solutions. MPM and EM have a strong interrelationship, but the MPM iterations account for the majority of the computational processing; therefore the dedicated hardware is targeted at this part of the algorithm.

3. IMPLEMENTATION PROCESS

In this section, we first introduce our "pingpong" structure for the RAMs. Then the implementation of EM/MPM on the FPGA is demonstrated in detail. Finally, some of the implementation results are shown. MPM iterates seven times before producing intermediate results; thus, for our implementation, seven iterations defines the size of the internal memory. If more iterations are desired, we can increase the size of the internal memory. Software results have shown that seven MPM iterations is an excellent choice for a large variety of data.

3.1. "Pingpong" structure for RAMs

Figure 1 shows the "pingpong" structure used in this hardware implementation. Frequent external RAM data access would result in low system processing speed, so we use two internal RAMs, RAM A and RAM B. Initially, the original data are read in and stored in RAM A as the source, processed once, and then saved in RAM B. RAM B is then treated as the source data after the first iteration. The same procedure is executed repeatedly until the seventh iteration is reached; this defines the latency. This technique significantly reduces the data exchange between internal and external memory by keeping intermediate results on-chip. When the final iteration result is obtained, it is written to the external memory. After the latency period of seven iterations, the data transfer to external memory is continuous at one slice update per iteration. The external memory data transfer matches the internal processing of data, eliminating the memory bandwidth limitation.
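The following Python sketch illustrates the ping-pong buffering idea described above. It is a behavioral model only, not the RTL; the function names are assumptions of this sketch.

```python
def mpm_pass(source_slices):
    """Placeholder for one MPM pass over the on-chip slices."""
    # In hardware this is the parallel computational block; here we simply
    # return the input so that the sketch stays self-contained.
    return list(source_slices)

def pingpong_iterations(original_slices, num_iterations=7):
    """Alternate between two on-chip buffers for the MPM iterations.

    Only the final result leaves the chip, which is what removes the external
    memory traffic for the intermediate iterations.
    """
    ram_a = list(original_slices)   # RAM A initially holds the source data
    ram_b = [None] * len(ram_a)     # RAM B receives the first results

    source, dest = ram_a, ram_b
    for _ in range(num_iterations):
        dest[:] = mpm_pass(source)   # process the source buffer into the destination
        source, dest = dest, source  # swap roles: the destination becomes the new source

    return source                    # after the loop, 'source' holds the latest results
```

After seven passes the latest results sit in one of the two buffers and can be streamed to external memory, matching the latency behavior described above.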

[Fig. 1 block diagram: RAM A and RAM B connected through the computational block]
Fig. 1. "Pingpong" structure in the RAMs.

3.2 Detailed "Pingpong" implementation

In our design, we optimized both the handling of slice relevance and the volume data exchange between internal and external memory. Slice relevance refers to the 3D neighborhood: while the central pixel is being processed, the algorithm uses four pixels in the same slice and one pixel each in the previous and next slices. This property is common to 3D image segmentation (and other 3D algorithms) and makes it difficult to parallelize. In addition, the iterative nature of the algorithm means that a pixel that has been processed for n iterations is needed again for iteration n+1. A parallel structure called the "tiered structure" is proposed here to deal with the slice relevance and the iterative requirements. We number the slices S1 through Sn along the time (or spatial) horizon. First, 8 slices are loaded from external memory into internal memory RAM A and delivered to the computational blocks as the set of original data, S1.0 through S8.0. After one iteration of processing, the first result slice is transferred to RAM B as S1.1, and the remaining slices are processed accordingly and stored in RAM B. For the next partial iteration, RAM B is used as the source memory. The most important point is that we do not read new slice data from external memory until S1 has been processed for seven iterations and is ready to be sent out to the external memory.

The remaining slices have only been processed partially (since new slice data is needed). The slice relevance must therefore be maintained: because the front and back slices that have been processed m times are needed to process the middle slice for the (m+1)-th time, Sn.(m+1) can be computed only if S(n-1).m and S(n+1).m are available. Here S denotes the slice number: S2.0 is the original data of slice 2, and S2.2 is slice 2 after it has been processed two times. Obviously, each slice reaches Sn.7 when it has been fully processed and is ready to be sent to the external memory. Table I shows how the data exchange between RAM A and RAM B works during these seven passes, and Table II shows the results when the first seven iterations are completed.
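As a small illustration of this dependency rule (our own sketch, not taken from the paper), the check below decides whether a slice may be advanced by one more iteration given the iteration counts of its neighbors.

```python
def can_advance(m, front_count, back_count):
    """Return True if a slice at iteration m may be processed once more.

    m           : how many times the slice itself has been processed so far
    front_count : iteration count of the previous slice, or None if that slice
                  is not currently on chip (e.g. not yet loaded)
    back_count  : iteration count of the next slice, or None likewise
    The rule from the paper: moving a slice from iteration m to m+1 requires
    both neighbors to have been processed m times. How volume-boundary slices
    are treated is an assumption left open in this sketch.
    """
    front_ok = front_count is not None and front_count >= m
    back_ok = back_count is not None and back_count >= m
    return front_ok and back_ok
```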

TABLE I
DATA TRANSFORMATION RECORDS DURING SEVEN PROCESSES
[Columns: RAM A and RAM B contents at the very beginning; after the first pass (RAM A to RAM B); after the second pass (RAM B to RAM A); after the third pass (RAM A to RAM B); after the fourth pass (RAM B to RAM A).]

TABLE II
DATA TRANSFORMATION RESULTS AFTER THE FIRST PROCESS
[Columns: RAM A and RAM B contents after the first pass.]

Here T denotes the MPM iteration number used in the tables. When S1 has been processed seven times (S1.7), it is ready to go to external memory. Afterwards, this internal memory space is released, and another slice, S9.0, is read in and sent to the computational block together with the neighboring slices already on chip to develop slice 8.1; this new data takes up the space previously occupied by S1 in RAM B (see Table III for details). The tiered structure has two advantages: (1) although every slice is processed 7 times, it only needs to be read from external memory once; (2) for a slice-dependent algorithm of this kind, only 8 slices of data need to be kept and copied on chip during processing.

TABLE III
DATA TRANSFORMATION RESULTS
[Columns: RAM A and RAM B contents after the fifth pass (RAM A to RAM B); after the sixth pass (RAM B to RAM A); after the seventh pass (RAM A to RAM B).]
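The streaming behavior just described, one finished slice out and one new slice in per pass, can be sketched as follows. This is an illustrative model under our own simplifying assumptions (every resident slice advances each pass; the neighbor-dependency rule and boundary handling are not modeled), not the hardware schedule of the tables.

```python
def stream_slices(total_slices, window=8, target_iters=7):
    """Model the steady-state slice traffic of the tiered structure.

    Keeps at most `window` slices on chip. Whenever the oldest on-chip slice
    has reached `target_iters` iterations it is written out and its slot is
    reused for the next slice from external memory, so each slice is read
    exactly once and written exactly once.
    """
    on_chip = []          # list of [slice_index, iteration_count]
    next_to_load = 0
    emitted = []

    while len(emitted) < total_slices:
        # Fill free slots from external memory (at most `window` slices resident).
        while next_to_load < total_slices and len(on_chip) < window:
            on_chip.append([next_to_load, 0])
            next_to_load += 1

        # One pass: advance every resident slice by one iteration
        # (the real schedule also enforces the neighbor-dependency rule).
        for entry in on_chip:
            if entry[1] < target_iters:
                entry[1] += 1

        # Emit the oldest slice once it has completed all its iterations.
        if on_chip and on_chip[0][1] == target_iters:
            emitted.append(on_chip.pop(0)[0])

    return emitted

# Example: a 128-slice volume is read once and emitted in order.
assert stream_slices(128) == list(range(128))
```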

3.3 Parallel structure for pixels across rows and columns

There are 8 parallel computational modules performing the maximization described in Section 2.3, so 8 pixels are processed in parallel during one clock cycle.


These 8 pixels and their neighbors are simultaneously presented as inputs to the eight computational modules. To accomplish this, RAM A is organized as RAM group A, and the parallel addressing modes are as follows:

RAM A.1 --- stores slice 1 and any other slice whose remainder is 1 after dividing the slice number by 7.
RAM A.1.1 --- stores the pixels of slice 1 whose column numbers have remainder 1 after division by 8.
...
RAM A.1.7 --- stores the pixels of slice 1 whose column numbers have remainder 7 after division by 8.
RAM A.1.0 --- stores the pixels of slice 1 whose column numbers are exactly divisible by 8.

The elements of RAM group A (the current 8 slices) are arranged in this way from RAM A.1 to RAM A.8. This RAM arrangement delivers 8 pixels and their neighbors in parallel within two RAM clock cycles.
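A minimal sketch of the column-banking idea is given below (our illustration; the exact bank widths and port arrangement of the hardware are not specified here). Spreading consecutive columns across 8 banks is what allows 8 adjacent pixels to be fetched in the same cycle.

```python
def column_bank(col):
    """Column banking: consecutive columns land in different banks (mod 8)."""
    return col % 8

def distribute_slice(slice_pixels):
    """Split one image slice into 8 column banks, mirroring RAM A.x.0-A.x.7.

    slice_pixels : 2D list indexed as [row][col].
    Returns a list of 8 dicts, banks[b][(row, col)] = pixel value.
    """
    banks = [dict() for _ in range(8)]
    for r, row in enumerate(slice_pixels):
        for c, value in enumerate(row):
            banks[column_bank(c)][(r, c)] = value
    return banks

def fetch_eight(banks, row, start_col):
    """Fetch 8 consecutive pixels of a row; each comes from a different bank,
    so in hardware all 8 reads can be issued in the same clock cycle."""
    return [banks[column_bank(c)][(row, c)] for c in range(start_col, start_col + 8)]
```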

4. RESULTS

We implemented the parallel, memory-optimized MPM on a Xilinx Virtex-5 XC5VLX330T FPGA, using a first test case of 128 x 128 pixel images. Synthesis for this test case showed utilization of 41% of the slice registers, 38% of the DSP48Es, 29% of the Block RAM/FIFO, 33% of the bonded IOBs, and 39% of the slice LUTs. We simulated the test case and achieved 2 ms latency, with each subsequent slice available after 0.3 ms (the 20 ns clock period times the pixel count). Fig. 2 compares the Xilinx result with PC and Linux parallel computing resources. Fig. 3 shows the first-iteration segmentation results of the PC and the Xilinx hardware for one slice of a liver ultrasound 3D medical volume (128x128x128).

Fig. 2. Segmentation speed comparison: time to complete (min) versus number of processors.

Fig. 3. Segmentation result comparison (left: PC, right: Xilinx hardware).

5. CONCLUSION

This paper has discussed the implementation of the MPM part of the segmentation in a Xilinx FPGA to optimize processing speed for large 3D medical images. We have shown the improvement obtained using parallel computation and memory bandwidth reduction techniques. The experimental results show that we have achieved greater than 100 times improvement over standard desktop computing hardware. In future work, we will implement the EM algorithm on the FPGA as well.


REFERENCES


[1] L. A. Christopher, E. J. Delp, C. R. Meyer, and P. L. Carson, "3-D Bayesian ultrasound breast image segmentation using the EM-MPM algorithm," Proceedings of the IEEE Symposium on Biomedical Imaging, 2002.
[2] L. A. Christopher, E. J. Delp, C. R. Meyer, and P. L. Carson, "New approaches in 3D ultrasound segmentation," Proceedings of the SPIE and IS&T Electronic Imaging and Technology Conference, 2003.
[3] M. L. Comer and E. J. Delp, "The EM-MPM algorithm for segmentation of textured images: Analysis and further experimental results," IEEE Transactions on Image Processing, vol. 9, no. 10, 2000.
[4] J. C. Li, R. Shekhar, and C. Papachristou, "A 'brick' caching scheme for 3D medical imaging," Biomedical Imaging: Nano to Macro, pp. 563-566, April 15-18, 2004.
[5] T. Schmitt, D. Fimmel, M. Kortke, et al., "Parallel processor arrays for filtered backprojection," Computer Aided Systems Theory - EUROCAST'99, pp. 127-141, Springer, 2000.
[6] I. Goddard and M. Trepanier, "High-speed cone-beam reconstruction: an embedded systems approach," Proc. SPIE Medical Imaging, vol. 4681, pp. 483-491, 2002.
[7] S. Coric, M. Leeser, E. Miller, et al., "Parallel-beam backprojection: an FPGA implementation optimized for medical imaging," Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays, pp. 217-226, 2002.
[8] B. Cabral, N. Cam, and J. Foran, "Accelerated volume rendering and tomographic reconstruction using texture mapping hardware," Proceedings of the 1994 Symposium on Volume Visualization, pp. 91-98, 1994.
[9] K. Mueller, "Fast and accurate three-dimensional reconstruction from cone-beam projection data using algebraic methods," PhD dissertation, The Ohio State University, 1998.

