The number of operations required to compute Space Time Adaptive
Processing (STAP)
solutions is one of the main factors requiring
the use of high-performance clusters for radar array processing. We
need to introduce several heuristic optimizations to reduce cluster
price and physical size, hence reducing the number of processors
and the size of the cluster.
All pulsed radar STAP approaches require Doppler filtering
during the computation. Depending on approach, the Doppler
filtering can be performed either with pulse compression or done
separately. We consider the case where these two computations are
separate. In beamspace post-Doppler and beamspace pre-Doppler
approaches, the Doppler filtering process is performed after
reduction in beamspace, leaving a small number of elements on which
to perform the operation.
If we partition the data cube along the
pulse-range dimension, and have a sufficient number of processors,
each processor is responsible for an entire (or a part of)
pulse-range plane. More importantly, each processor has data that
spans all the staggers per individual time sample. Using this setup
and an important property of the Fast-Fourier Transform (FFT), we
can introduce substantial savings into the process. Over the entire
data cube the number of floating point operations (FLOPS) for that
phase is reduced by up to 40%.
In this article, we consider the pulsed Doppler processing,
where the data is collected by several elements of the radar array.
In particular, we examined Doppler filtering in beamspace pre- and
post-Doppler approaches. The input data to a radar processor is a
three-dimensional data cube of dimensions L, M, and N. The
dimension L is the range gate dimension, M represents the number of
pulses, and N represents the number of elements. For the sake of
calculations, we assign these dimensions some realistic
valuesbefore the Doppler filtering phase, let L=500, M=20,
and N=50.
Each data point in the data cube is a complex
point representing real and imaginary (I and Q) components of one
range bin.
The physical machine that would be considered for the task of
data processing is a multiprocessor machine consisting mostly of
Digital Signal Processors (DSPs). The reason for using DSPs,
besides their FFT processing capabilities, is that the programmer
can manage the DSP's cache.
This important property enables us to introduce a
heuristic optimization.
We use a property of the decimation-in-time FFT
circuitwhen the FFT over some vector is computed, the odd and
even points are grouped and separate intermediate results are
computed and then joined to obtain the result. The key observation
is that intermediate results are independent of each other.
In order to use the reduction of the operations in Doppler
filtering, the data must be partitioned in the following fashion.
Each processor receives the data from all pulses, from a particular
beam, and at some time sample. This data encloses all the staggers,
meaning that the data has not been broken up into individual
staggers; notion of a stagger is explained in the Heuristic
Optimization section of this article. Each processor can then break
up its data into individual staggers. Notice that the stagger data
overlaps from one stagger to the next in such a way that the first
even point of a first stagger is a first odd point of the next
stagger, hence pointing to the FFT property. It is important for
the data to be small enough to fit into the cache, otherwise, the
savings will not have a significant impact.
The basic savings from one stagger to another give us the
following percent savings:
However, it is usually the case that there are more then two
staggers; hence, the savings can be extended over the consecutive
staggers.
Fast-Fourier Property
The FFT is a process that takes an input data in time domain and
transfers it into the frequency domain. Data needs to be in the
frequency domain before Doppler processing takes place.
You can evaluate the FFT using a parallel Cooley-Tukey FFT
algorithm. The decimation-in-time algorithm decomposes x[n] into
successively smaller subsequences. The algorithm we would use is
Radix-2 Decimation-in-Time and can be represented by:
The equivalent parallel FFT circuit allows us to compute K/2
independent operations up to log K-1 times, where the depth of the
circuit is log (K). The independent operations are performed in two
groupsthe first group includes only the even points and the
second group only the odd points of the time-domain vector.
Figure 1 shows an implementation of the 16-point FFT
circuit.
Figure 1: Flow graph of 16-point FFT based on
decimation-in-time
Heuristic Optimization
Doppler processing is implemented by applying a discrete Fourier
transform of length K across M pulses of the preprocessed data for
a given range cell and channel. K is the length of the stagger and
the number of staggers is equal to the number of possible shifts of
K across M. In our case, K=16, M=20; hence, we have five
staggers.
Therefore, the stagger is simply a different way
of looking at the data. At this point the data is still in the form
of the data cube and it has not been broken up into individual
staggers. The data further requires frequency-domain weighting.
Each stagger is a small data cube of dimensions: 16 pulses, over
50 beams, and 500 range gates long. Therefore, in the data cube
before the Doppler filtering, the first stagger starts at pulse 1
and ends at pulse 16, the second stagger starts at pulse 2 and ends
at pulse 17, and so on.
In order for us to exploit odd/even point FFT and DSP
properties, we have to map data to the processors in a specific
way. Each processor has to receive data from all staggers at the
particular range gate and beam. This means that the data has to be
partitioned in the [pulse x range gate] plane.
Each processor receives one or more of these vectors. Since we
have 50 x 500 vectors in the [beam x range gate] plane, we need
about 25,000 processors, if each would only receive a single
vector. However, assume that we have less then 25,000 processors
and that each processor will receive just enough vectors so that
the vectors would fit in the cache of each processor. There are two
important observations: the data sent to each processor fits into
the cache and, at this point, the data cube was not broken up into
individual staggers, hence the data to transfer remains small.
Figure 2: Stagger data overlap
Data from the first stagger overlaps with the second stagger by
exactly K-1 points. Consider staggers numbers one and two. First
point of the overlap between the two vectors is point one of
stagger one, which is point zero of the second stagger (Figure
2). The same pattern is repeated up to point K of stagger one,
which is point K-1 of stagger two. Therefore, the odd points of
stagger one are the even points of stagger two. Hence, when we
compute the FFT of stagger one, we can save the intermediate
results and reuse them for the computation of stagger two as
even-point results (Figure 3).
Figure 3: Odd/even point savings in consecutive
staggers
For simplicity, let us assume that the multiplication, addition,
and subtraction take the same amount of time. With this assumption,
we have 5K log K floating-point operations (FLOPS) per FFT. For the
case of a 16-point FFT, the total number of floating point
operations is 320 and the floating-point operations in the gray box
of Figure 1 (8-point FFT) requires 120 FLOPS. Additional
savings from any stagger to a consecutive stagger extend by using
the intermediate results of points 2 and 10, which translate to
points 1 and 9, to the next stagger, points 4 and 12 (3, 11), and 6
and 14 (4, 13). This gives us the 170 FLOPS, the minimum number of
operations required to compute the next stagger (Figure 4).
Computation of stagger two costs us only (5K log K - 5(K/2) log
(K/2) - 5*3(K/8) log (K/8)), or (320-120-30)=170 FLOPS, or 53% of
the required number of operations. Of course, the savings will vary
with the size of K and the number of staggers, since more or less
intermediate results could be saved.
Figure 4: Extended savings over consecutive
staggers
Another significant block of savings comes three staggers down,
where intermediate results from the 4-point FFT can be used.
Therefore, the last two staggers require only 150 operations
(Figure 4).
To compute five 16-point staggers without savings would cost
1600 floating-point operations. Using parallel FFT techniques, we
need only 960 floating-point operations. This is a 40% improvement,
for the entire data cube, since we need to perform 24 instead of 40
Mflops.
Conclusions
DSP processing gives the programmer a valuable benefit, the
capability to control content of cache, certain memory locations
can be locked/unlocked and forced to remain in or be flushed out of
cache.
Using DSP and odd/even point FFT properties, you
can save the intermediate results and reuse them to reduce the
number of operations. The costs of remapping the data were not
taken into consideration, however. If possible, this optimization
could bring substantial improvement in performance, hence possibly
reducing the size of the cluster or speeding up the process.
This work was supported by Northrop Grumman
of Bethpage, New York.