 |
| |
Mark Arnold
is currently at the University of Manchester Institute
of Science and Technology (UMIST) on leave of absence
from the University of Wyoming. He is the author or
co-author of more than a dozen papers on Logarithmic
Number Systems. He is also the author of Verilog
Digital Computer Design: Algorithms into Hardware
(Prentice Hall, 1999), and winner of the best paper
award at the 1997 International Verilog Conference. He
holds three patents.
|
|
 |
The Logarithmic Number System (LNS) is an
alternative to fixed- and floating-point arithmetic that
converts numbers into a logarithmic format once, and keeps them
in this format during computation. Using LNS reduces
multiplication, division, and square-root implementation to
trivial circuits while improving power and area compared to
conventional techniques.
Introduction
This article discusses three forms of LNS: unsigned (ULNS),
signed (SLNS), and dual redundant (DRLNS). With careful design,
ULNS and DRLNS are useful for special applications, while you
can use SLNS as a replacement for floating-point arithmetic in
many applications with little design effort. LNS addition and
subtraction require table lookups. Most applications suitable
for LNS need less precision than the IEEE-754 floating-point
standard provides. Thus, table-based LNS is economical for such
word sizes, making LNS a natural choice for the limited tables
available in FPGAs. Direct table lookup is useful only for very
limited precision. Interpolation can achieve higher LNS
precision with limited table size at the cost of a small
fixed-point multiplier.
This article describes the mathematics of such an
approximation as well as the corresponding Verilog description
targeted for Xilinx FPGAs. The dual-ported Block RAM offered by
the Xilinx Virtex series is well suited to this task since the
slope for interpolation need not be stored in a separate table,
but rather can be computed from the two outputs of the block
RAM.
Representing numbers in LNS
, instead of the more common fixed- and
floating-point formats, allows designers to implement common
numeric applications, such as FIR filters
, graphics
, neural networks
, and the FFT
, more economically. Compared to floating
point, LNS has similar dynamic range but slightly better
precision
for the same word size since, unlike floating
point, the relative error of LNS is constant.
Compared to fixed point, where the relative
precision varies dramatically as a function of the value the
number represents, you can encode similar average relative
precision and dynamic range in LNS using far fewer bits. This
means that you can achieve a comparable signal-to-noise ratio
with LNS with fewer bits.
Another advantage is that high bits in LNS
change less frequently than those in a comparable fixed-point
system, resulting in power savings.
Also, like floating point but unlike fixed
point, LNS does not require the designer to scale each
computation, thereby reducing time-to-market by eliminating the
need for difficult floating- to fixed-point conversion.


Over one hundred papers have been written on LNS, and
several successful hardware and software implementations have
been reported. Some of the recent innovations are proprietary,
for example, a major semiconductor manufacturer recently
fabricated LNS chips running at 120 MHz capable of the
logarithmic equivalent of 1 GFLOP IEEE-754 (32-bit word, 23-bit
precision) floating-point operations.
Also, the European Union is funding the
development of a prototype VLSI microprocessor that uses LNS.
LNS has found commercial application in music
synthesizers developed by Yamaha
and in aircraft controls used by Boeing.
LNS has even been exploited by a
special-purpose workstation for the animation of children's
television series.
However, the vast majority of the most useful
LNS techniques are in the public domain"having their
inspiration in the centuries-old mathematics behind the once
ubiquitous engineer's slide rule.
Unsigned, Signed, and Redundant LNS
We need to distinguish between a real value and its
representation in a circuit or computer. In this article,
lower-case variables will be positive real values, and
upper-case variables will be their logarithmic representation.
Thus, with an Unsigned Logarithmic Number System (ULNS) in
which we may assume that every value is positive, we have X =
logb(x), where b is the base of the logarithm.
Typically, for compatibility with conventional binary, b = 2;
however, other choices impact range and precision. To output
the value being represented, we turn this around: x =
bX. In practice, X is truncated to some number of
bits, resulting in a round-off error, similar to but slightly
better than the error observed in floating-point arithmetic.
There is a small problem that must be addressed if x is
negative, as the logarithm of a negative argument is not a real
number. There are two ways to cope with the representation of
negative numbers:
- Let X = logbx and use a separate bit,
XS = sign(x), to record the sign.
- Keep separate representations, XP and
XN, for positive and negative real numbers.
The former is known as the Sign/Logarithm Number System
(SLNS)
, and the latter is known as the
Dual-Redundant Logarithmic Number System (DRLNS).
SLNS has several advantages: it is more
compact than DRLNS and it is completely analogous to
floating-point arithmetic, making it easier for first-time
users to port an application to LNS. On the other hand, as the
paper will point out later, DRLNS operates faster and, in some
respects, is easier to implement than SLNS.
Multiplication and Division
LNS transforms multiplication and division into fixed-point
adders. The simplicity of these operations is one of the major
motivations for adopting LNS. For algorithms that consist of a
high proportion of multiplication and division, LNS offers a
significant advantage, since a fixed-point adder is a much
simpler and smaller circuit than a fixed-point multiplier. The
switching activity that occurs during fixed-point addition is
much less than what occurs during fixed-point multiplication
(since the latter consists of multiple additions)"thus
resulting in a power savings.
Given the ULNS representations, X = logb(x) and Y
= logb(y), it is easy and economical to compute the
ULNS representation of the product, logb(x * y) = X
+ Y. The output produces no additional round-off errors"it
simply propagates the errors present in X and Y. In the
unlikely case these inputs were exact, the LNS product will be
exact.
Signed LNS multiplication is almost as simple, requiring
trivial logic to compute the sign of the product.
Given the SLNS representation
(XS,X) for x and the SLNS representation
(YS,Y) for y, you get the SLNS representation
(PS,P) for the product (p = x * y) as the exclusive
OR of the signs (PS = XS
YS) and the sum of the logarithms (P = X + Y).
Again, if the inputs are exact then the result will be
exact.
Multiplication of two DRLNS values is a difficult operation,
not only in terms of hardware complexity, but also in terms of
error propagation.
Multiplication of two DRLNS values causes an
error growth that exceeds that of SLNS (or floating-point).
Thus, general DRLNS is seldom implemented. On the other hand,
mixed multiplication (DRLNS times SLNS) is simple and exact.
Given the DRLNS representation (XP, XN)
and the SLNS representation (YS, Y), you can form
the representation of the DRLNS representation (PP,
PN) of the product (p = x * y) by two additions that
may be carried out in parallel:
This requires two fixed-point adders and two multiplexers.
The multiplexers reverse the roles of
PP and PN when multiplying by a negative
value of y. Such mixed SLNS/DRLNS multiplications are useful in
a variety of common applications, such as the FFT.
Reciprocals, Roots, and Powers
In ULNS or SLNS, integer power or roots of x correspond to
fixed-point multiplication or division of X. For example,
logb(x3) = 3X. Similarly,
logb(1/x) = logb(x-1) = -X.
With the almost universal implementation choice that the
logarithm X be represented in binary, square and square root
are even simpler. Squaring of x is transformed into simply a
left shift of X and square root of x is simply a right shift of
X. The prominence of these operations in graphics means LNS is
well suited to such demanding applications.
Unsigned Addition
LNS addition is more challenging than operations like LNS
multiplication, division, or square root. To start, let us
consider only ULNS addition. This is sufficient for some
applications, such as Hidden-Markov speech recognition.
Direct Table Lookup
For ULNS addition, rather than directly computing x + y in the
most obvious way (taking the anti-logarithms of X and Y, adding
them, and converting back to logarithmic form), we can compute
y*(1+z), where z = x/y. As we saw in the Multiplication and
Division section, multiplication and division are trivial in
LNS. This leaves just the (1 + z) term with which to deal. You
can pre-compute and store the equivalent LNS function,
sb(Z) = logb(1 + bZ), in a
table (ROM or RAM) when the word size is small (say under 12
bits). This is possible because sb(Z) is a function
of a single variable, Z = logb(z) = X-Y, rather than
a function of the two variables X and Y.
Interpolation
For larger word sizes, the direct ROM approach is not
economical so you can use techniques such as interpolation.
There are many variations of interpolation
for approximating sb(Z) for the implementation of
LNS addition. Henkel
considers a linear Chebychev method, Lewis
considers both linear-Taylor
and quadratic-Lagrange
methods. Coleman et al consider a
quadratic-Taylor-error-correcting method.
All of the aforementioned implementations
partition the range of Z so that the distance between tabulated
points is non-uniform, a technique useful to achieve precision
similar to the IEEE-754 floating-point standard. The non-uniform spacing reduces table size
for large Z, where the function is relatively flat.
This article will consider how to implement a more moderate
precision LNS (20-bit word, 12-bit precision) using
linear-Lagrange interpolation with the simplifying assumption
that the spacing between tabulated points is uniform. With this
assumption, the bus (Z) can be split into a high part
(ZH) and a low part (ZL) where Z =
ZH + ZL.
Referring to Figure 1, the function sb(Z)
is approximated as sb(ZH) +
(sb(ZH + D) -
sb(ZH))/D)*ZL, where D is the spacing between tabulated points.
Because of this uniform spacing, D =
2-N is a constant power-of-two near zero that
determines the maximum value that ZL ever obtains,
in other words, D > ZL
> 0. N, which determines the size of D, is the number of fractional bits in
ZH.
Figure 1: The function sb is tabulated at
every multiple of D.
Thus,(ZH + D) and
ZH are multiples of D at
which the correct values, sb(ZH + D) and sb(ZH), are
known. To approximate sb(Z), linear-Lagrange
interpolation uses a straight line whose slope is
(sb(ZH + D) -
sb(ZH))/D.
This is multiplied by ZL, which is the distance from
the tabulated point to the Z under consideration. The
approximation is the sum of this fixed-point product and the
tabulated sb(ZH).
By choosing a larger N (corresponding to a smaller D), the approximation becomes more
accurate at the cost of increased memory requirements. The
number of valid bits produced
from this approximation is 2N + 5. Different
LNS implementers have used different accuracy criteria. Perhaps
the most sensible criterion is faithful rounding
, where the result should be rounded either to
the nearest representable point or to the next-nearest point.
Faithful rounding requires one extra bit of accuracy from the
approximation, thus the usable precision (F) available from the
linear-Lagrange approximation is F = 2N + 4, which makes N =
(F - 4)/2
.
When Z > F, sb(Z)
Z which means the hardware can simply use Z
without a lookup in such cases, which is equivalent to saying
the hardware
may assume Z < F in the other cases when
table lookups are required. Combining this with the fact that
it is only necessary to tabulate the function for positive Z
{we can use the identity sb(-Z) =
sb(Z)-Z} we have reduced the range of the address
down to 0 < Z < F. Thus, sb(Z)
table lookups occur only for 0 < Z < F, and
there need to be
log2(F)
additional address bits to the left of the radix
point of Z (beyond N bits on the right of radix point of Z)
required to access the table. The total number of address bits
is
log2(F)
+
(F-4)/2
. For the case we are considering (F = 12), this
is
log2(12)
+
8/2
= 8.
FPGA Interpolator
Figure 2 shows a block diagram for a linear-Lagrange
interpolator that implements the approximation
sb(ZH) + (sb(ZH +
D)- sb(ZH))/
D)*ZL. This design uses a
dual-port memory to obtain sb(ZH + D) and sb(ZH)
simultaneously from the same table. Such memories are available
on some FPGAs, such as the Xilinx Virtex-300.
(Lewis
uses a device similar to a dual-ported
memory, known as interleaved ROM, to achieve the simultaneous
access of adjacent tabulated points in his interpolator.) The
contents of the dual-port memory are the suitably rounded
values of sb at all multiples of D within the range. One of the addresses
input to the memory must be incremented by
D. The two outputs of the memory are subtracted to
obtain the slope of the interpolation line. Aside from the
dual-port memory, the largest-cost component in the
interpolator is a fixed-point multiplier. One input to this
multiplier is the low part of the Z bus, ZL, which
is J = F-N = F/2 + 2 bits wide, or J = 8 bits here. The other
input is the J + 1 most significant bits of the slope. The
product is added to the tabulated output
{sb(ZH)} from the memory to produce the
sb(z) output with guard bits. LNS addition can use
these guard bits to achieve faithful rounding.
Figure 2: The bus Z is split into ZH and
ZL. The dual-port memory is addressed simultaneously
by ZH+ D and
ZH producing sb(ZH+ D) and sb(ZH). These
values are subtracted to form the slope, which is multiplied by
ZL. The approximated sb(Z) output of this
linear-Lagrange interpolator is the product added to
sb(ZH).
It may seem ironic to introduce a multiplier into a system
whose most obvious advantage is that LNS replaces multipliers
with fixed-point adders, yet it is economic to do so. In the
case of F = 12 considered here, interpolation reduces table
size from 65,536 words down to only 256 words, or in other
words a 256-fold reduction. In general, this reduction would be
a factor of 2F/2 - 2.
Although some consideration





has been given to performing the
interpolation multiplication using logarithmic arithmetic (at
the cost of including logb(ZL) and
anti-log ROMs), this is perhaps not the most cost-effective
thing to do. The size of the interpolation multiplier in Figure 2 can be smaller than that required for the
equivalent of the floating- or fixed-point system. The
fixed-point multiplier previously used to implement LNS
interpolation has an area proportional to ALNS = (F/2 + 2)(F/2
+ 3) = 0.25 F2 + 2.5 F + 6.
A floating-point multiplier with precision equivalent to
such an LNS unit has an area proportional to AFP
> (F + 1)2. The F+1 term results from the hidden
bit. The issue of equivalency of LNS to fixed-point is somewhat
less clear. Paliouras and Stouraitis
propose an Average Relative Representational
Error (ARRE) model, in which a fixed-point system has a word
size one bit shorter than the equivalent b = 2 LNS word size.
Using this model, a fixed-point multiplier requires all bits of
the word to be input; it thus has an area proportional to
AFX = (F +
log2(F)
- 1)2. The multiplier cost savings
asymptotically approaches:
In the specific case of F = 12, ALNS = 72,
AFP = 169, and AFX = 225. This makes
AFX/ALNS = 2.4 and
AFP/ALNS = 3.2. Of course, the total cost
of each system depends on other factors aside from multiplier
size, but this appears to indicate in this case that the
inclusion of a small interpolation multiplier is
affordable.
Verilog Code
The attached Verilog code
implements the interpolator shown in
Figure 2.
Here blockRAM is an instantiation of an 18 x 256
Virtex-dual-port RAM. There are two guard bits provided in
output from the RAM. The "{ ... ,prod[7:0]}" indicates
concatenation of the low-order eight bits of the product for
later use in rounding. The test of (z >19'hc000) implements
the approximation sb(z)
z for z > F. There are ten extra guard bits
output.
Pipelined LNS
You can reduce the latency of multiple-term pipelined
summation
by rearrangement: x1 +
x2 + x3 + x4 = x1 +
x2 + x3(1 + x4/x3)
= x1 + x2(1 + x3/x2
(1 + x4/x3)) = x1 (1 +
x2/x1 (1 + x3/x2(1
+ x4/x3))). Implementation of such
formulae in LNS requires a special interpolator
that accepts both positive and negative
Z.
Unsigned Subtraction
ULNS subtraction is similar to addition, except the function
for differences, db(Z) = log(1-bZ), is
harder to interpolate

because it has a singularity near zero. There
are several ways to cope with this problem:
- Use a larger and power-of-two-partitioned memory for
db



- Use co-transformations that convert subtraction to easier
computations





- Use DRLNS to reduce and postpone subtraction

- Use a large but slow DRAM to hold db
.
You can reduce subtraction cost by noting the accuracy can
be lowered when approximating db(Z) for Z near zero
due to almost complete cancellation that occurs when
subtracting nearly equal values.


Signed and Redundant Addition/Subtraction
SLNS addition
uses one sb computation if
XS is the same as Ys, and otherwise uses a
db computation. Thus, you need to implement
db either as a co-transformation, a larger table, or
a combination of both. The sign of the result will be the sign
of the larger of the operands. Subtraction occurs by simply
complementing YS prior to addition.
DRLNS addition
uses two parallel sb computations:
one to add XP to YP and the other to add
XN to YN. Subtraction occurs by
interchanging YP with YN prior to
addition.
Mixed SLNS/DRLNS addition uses a single sb
computation. If YS = 0, the sb operates
on XP, otherwise the sb operates on
XN. Like pure SLNS, mixed subtraction occurs by
complementing Ys prior to addition.
Conclusions
This article discussed three forms of LNS: unsigned (ULNS),
signed (SLNS), and redundant (DRLNS). Multiplication, division,
and square root are trivial in ULNS and SLNS, offering power
and area savings compared to fixed- and floating-point. All
forms of LNS require the easily interpolated function
sb, but SLNS needs the more difficult db
function. Although you can get these functions by direct lookup
for limited-precision systems, interpolation is useful to
obtain higher precision. This article also described a simple
linear-Lagrange interpolator for sb that implements
a 20-bit LNS with F = 12 bits of precision. This interpolator
capitalizes on the dual-port memory available in Xilinx Virtex
FPGAs to obtain the slope by fetching adjacent tabulated
points.