As engineers look for new ways to scale the performance of programmable Network Processors (NPUs), they must take a fresh look at the architecture of the individual processing elements they employ in their designs. Traditionally, architects have turned to derivatives of general-purpose control flow processors and organized them into arrays that behave in a data flow manner. While this can work for low bandwidths, hardware inefficiency and complexity become an issue when scaling this approach to 10 Gbps and above. At these bandwidths, high power dissipation and complex programming models can make cost-effective, timely system-level design difficult.

This paper introduces an approach that uses a synchronous data flow architecture for the individual processing elements, greatly simplifying the process of organizing them into arrays that behave in a data flow manner. The simplicity of this approach yields high hardware efficiency, which in turn means smaller die size and lower power dissipation.