Newsletter

Reducing Doppler Filtering Processing in STAP Implementations





TechOnline

The number of operations required to compute Space Time Adaptive Processing (STAP) solutions is one of the main factors requiring the use of high-performance clusters for radar array processing. We need to introduce several heuristic optimizations to reduce cluster price and physical size, hence reducing the number of processors and the size of the cluster.

All pulsed radar STAP approaches require Doppler filtering during the computation. Depending on approach, the Doppler filtering can be performed either with pulse compression or done separately. We consider the case where these two computations are separate. In beamspace post-Doppler and beamspace pre-Doppler approaches, the Doppler filtering process is performed after reduction in beamspace, leaving a small number of elements on which to perform the operation. If we partition the data cube along the pulse-range dimension, and have a sufficient number of processors, each processor is responsible for an entire (or a part of) pulse-range plane. More importantly, each processor has data that spans all the staggers per individual time sample. Using this setup and an important property of the Fast-Fourier Transform (FFT), we can introduce substantial savings into the process. Over the entire data cube the number of floating point operations (FLOPS) for that phase is reduced by up to 40%.

In this article, we consider the pulsed Doppler processing, where the data is collected by several elements of the radar array. In particular, we examined Doppler filtering in beamspace pre- and post-Doppler approaches. The input data to a radar processor is a three-dimensional data cube of dimensions L, M, and N. The dimension L is the range gate dimension, M represents the number of pulses, and N represents the number of elements. For the sake of calculations, we assign these dimensions some realistic values—before the Doppler filtering phase, let L=500, M=20, and N=50. Each data point in the data cube is a complex point representing real and imaginary (I and Q) components of one range bin.

The physical machine that would be considered for the task of data processing is a multiprocessor machine consisting mostly of Digital Signal Processors (DSPs). The reason for using DSPs, besides their FFT processing capabilities, is that the programmer can manage the DSP's cache. This important property enables us to introduce a heuristic optimization.

We use a property of the decimation-in-time FFT circuit—when the FFT over some vector is computed, the odd and even points are grouped and separate intermediate results are computed and then joined to obtain the result. The key observation is that intermediate results are independent of each other.

In order to use the reduction of the operations in Doppler filtering, the data must be partitioned in the following fashion. Each processor receives the data from all pulses, from a particular beam, and at some time sample. This data encloses all the staggers, meaning that the data has not been broken up into individual staggers; notion of a stagger is explained in the Heuristic Optimization section of this article. Each processor can then break up its data into individual staggers. Notice that the stagger data overlaps from one stagger to the next in such a way that the first even point of a first stagger is a first odd point of the next stagger, hence pointing to the FFT property. It is important for the data to be small enough to fit into the cache, otherwise, the savings will not have a significant impact.

The basic savings from one stagger to another give us the following percent savings:

However, it is usually the case that there are more then two staggers; hence, the savings can be extended over the consecutive staggers.


Fast-Fourier Property
The FFT is a process that takes an input data in time domain and transfers it into the frequency domain. Data needs to be in the frequency domain before Doppler processing takes place.

You can evaluate the FFT using a parallel Cooley-Tukey FFT algorithm. The decimation-in-time algorithm decomposes x[n] into successively smaller subsequences. The algorithm we would use is Radix-2 Decimation-in-Time and can be represented by:

The equivalent parallel FFT circuit allows us to compute K/2 independent operations up to log K-1 times, where the depth of the circuit is log (K). The independent operations are performed in two groups—the first group includes only the even points and the second group only the odd points of the time-domain vector. Figure 1 shows an implementation of the 16-point FFT circuit.

Figure 1:  Flow graph of 16-point FFT based on decimation-in-time


Heuristic Optimization
Doppler processing is implemented by applying a discrete Fourier transform of length K across M pulses of the preprocessed data for a given range cell and channel. K is the length of the stagger and the number of staggers is equal to the number of possible shifts of K across M. In our case, K=16, M=20; hence, we have five staggers. Therefore, the stagger is simply a different way of looking at the data. At this point the data is still in the form of the data cube and it has not been broken up into individual staggers. The data further requires frequency-domain weighting.

Each stagger is a small data cube of dimensions: 16 pulses, over 50 beams, and 500 range gates long. Therefore, in the data cube before the Doppler filtering, the first stagger starts at pulse 1 and ends at pulse 16, the second stagger starts at pulse 2 and ends at pulse 17, and so on.

In order for us to exploit odd/even point FFT and DSP properties, we have to map data to the processors in a specific way. Each processor has to receive data from all staggers at the particular range gate and beam. This means that the data has to be partitioned in the [pulse x range gate] plane.

Each processor receives one or more of these vectors. Since we have 50 x 500 vectors in the [beam x range gate] plane, we need about 25,000 processors, if each would only receive a single vector. However, assume that we have less then 25,000 processors and that each processor will receive just enough vectors so that the vectors would fit in the cache of each processor. There are two important observations: the data sent to each processor fits into the cache and, at this point, the data cube was not broken up into individual staggers, hence the data to transfer remains small.

Figure 2:  Stagger data overlap

Data from the first stagger overlaps with the second stagger by exactly K-1 points. Consider staggers numbers one and two. First point of the overlap between the two vectors is point one of stagger one, which is point zero of the second stagger (Figure 2). The same pattern is repeated up to point K of stagger one, which is point K-1 of stagger two. Therefore, the odd points of stagger one are the even points of stagger two. Hence, when we compute the FFT of stagger one, we can save the intermediate results and reuse them for the computation of stagger two as even-point results (Figure 3).

Figure 3:  Odd/even point savings in consecutive staggers

For simplicity, let us assume that the multiplication, addition, and subtraction take the same amount of time. With this assumption, we have 5K log K floating-point operations (FLOPS) per FFT. For the case of a 16-point FFT, the total number of floating point operations is 320 and the floating-point operations in the gray box of Figure 1 (8-point FFT) requires 120 FLOPS. Additional savings from any stagger to a consecutive stagger extend by using the intermediate results of points 2 and 10, which translate to points 1 and 9, to the next stagger, points 4 and 12 (3, 11), and 6 and 14 (4, 13). This gives us the 170 FLOPS, the minimum number of operations required to compute the next stagger (Figure 4). Computation of stagger two costs us only (5K log K - 5(K/2) log (K/2) - 5*3(K/8) log (K/8)), or (320-120-30)=170 FLOPS, or 53% of the required number of operations. Of course, the savings will vary with the size of K and the number of staggers, since more or less intermediate results could be saved.

Figure 4:  Extended savings over consecutive staggers

Another significant block of savings comes three staggers down, where intermediate results from the 4-point FFT can be used. Therefore, the last two staggers require only 150 operations (Figure 4).

To compute five 16-point staggers without savings would cost 1600 floating-point operations. Using parallel FFT techniques, we need only 960 floating-point operations. This is a 40% improvement, for the entire data cube, since we need to perform 24 instead of 40 Mflops.


Conclusions
DSP processing gives the programmer a valuable benefit, the capability to control content of cache, certain memory locations can be locked/unlocked and forced to remain in or be flushed out of cache. Using DSP and odd/even point FFT properties, you can save the intermediate results and reuse them to reduce the number of operations. The costs of remapping the data were not taken into consideration, however. If possible, this optimization could bring substantial improvement in performance, hence possibly reducing the size of the cluster or speeding up the process.

This work was supported by Northrop Grumman of Bethpage, New York.


 






 Featured Jobs
ROHM Electronics seeking Automotive Design Application Engineer in Novi, MI

Shure Incorporated seeking Project Manager in Niles, IL

Agilent Technologies seeking NPI Project Manager in Shanghai, CN

Agilent Technologies seeking Manufacturing Technician in Chandler, AR

Videon Central seeking Software Engineer in State College, PA

More jobs on EETimesCareers
 Sponsor
 CAREER CENTER
Ready to take that job and shove it?
SEARCH JOBS:

 SPONSOR

 RECENT JOB POSTINGS
For more great jobs, career related news, features and services, please visit EETimes' Career Center.