High Performance, Low Cost…

The Parallax Propeller P2

It's now more than 10 years since the introduction of the original Parallax Propeller chip. Information about this chip (now referred to as P1) can be found here.

Work on a second-generation Propeller, the “P2”, has been ongoing at Parallax for a considerable time. Its development has been very interesting to observe, with considerable input and involvement from members of the Parallax forums. A design approach that actively seeks feedback from a user community is very unusual. It has a great upside in capturing and harnessing bright ideas, but it also carries some risk that the project becomes bogged down by continual requests for new features and improvements.

A very significant hurdle in 2014, which became known as “P2 Hot” because of that design's excessive power dissipation, initially seemed as though it might be a “show-stopper”. Fortunately this appears not to be the case. Recently, in advance of actual silicon, the P2’s designer, Chip Gracey, has been releasing versions of the P2 that run on various FPGA platforms to emulate the actual chip. These iterative releases are proving to be a critical step in the testing and debugging cycle.

From my perspective, both the P1 and the XMOS processors, which are used extensively in the instruments described on this website, already have more than enough horsepower to handle most designs. There are, however, occasions where more I/O pins, more memory and faster execution would be very desirable, if not essential.

This page is devoted to a very preliminary exploration of some of the exciting new features of the P2, with a view to seeing how they might be useful in future projects. Given the other projects described on this website, it will come as no surprise that my approach is to interact with an FPGA-based P2 via a LabVIEW™ vi. This combination allows the P2 to be quickly evaluated and compared with the existing P1 and XMOS platforms. This is definitely a work in progress, and information will be added as time permits.

P2 Features

Preliminary documentation for the P2 is available here. Depending on the user’s FPGA platform, various versions of the P2 are available with up to 16 cogs. Each cog has 512 longs (32 bits each) of COG RAM as well as 512 longs of look-up table (LUT) RAM. In addition, up to 1 MB of HUB RAM can be addressed, although the amount actually available to the user depends on the particular FPGA platform. Programs can be executed from all three of these areas. The P2 can also access 64 I/O pins, twice the number available on the P1.
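As a quick sanity check on these figures, the short Python sketch below converts them into bytes and address bits. It uses only the numbers quoted above (512 longs of COG RAM and 512 longs of LUT RAM per cog, up to 1 MB of HUB RAM); the constant and function names are illustrative, not taken from any Parallax tool.

```python
# Sketch: P2 memory arithmetic using the figures quoted in the text.
LONG_BYTES = 4                # one "long" is 32 bits
COG_RAM_LONGS = 512           # per cog
LUT_RAM_LONGS = 512           # per cog
HUB_RAM_MAX_BYTES = 1 << 20   # up to 1 MB of addressable HUB RAM

def per_cog_bytes() -> int:
    """Bytes of COG RAM plus LUT RAM available to a single cog."""
    return (COG_RAM_LONGS + LUT_RAM_LONGS) * LONG_BYTES

def hub_address_bits(size_bytes: int = HUB_RAM_MAX_BYTES) -> int:
    """Number of byte-address bits needed to span size_bytes of HUB RAM."""
    return (size_bytes - 1).bit_length()

for cogs in (8, 16):
    print(f"{cogs} cogs: {cogs * per_cog_bytes()} bytes of COG + LUT RAM")
print("address bits for 1 MB of HUB RAM:", hub_address_bits())
```

The same helper shows that the 256 kB HUB RAM of a smaller FPGA build needs 18 address bits rather than 20.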

The P2 has an extensive instruction set, including a fast CORDIC solver for computing trigonometric functions, square roots and logarithms (among others). In the current FPGA emulation the P2 runs at 80 MHz, and all math and logic instructions execute in 2 clock cycles, or 25 ns.
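To illustrate what a CORDIC solver does, here is a minimal floating-point Python model of CORDIC in rotation mode, computing sine and cosine with nothing more than shifts (multiplications by 2^-i), additions and a small table of arctangents. This is a sketch of the general technique only; the P2's hardware CORDIC is a pipelined fixed-point engine, and its internal details are not modelled here.

```python
import math

def cordic_sin_cos(theta: float, iterations: int = 24) -> tuple[float, float]:
    """Return (sin(theta), cos(theta)) using CORDIC in rotation mode.

    Basic CORDIC converges for |theta| up to about 1.74 rad, so this
    sketch only accepts |theta| <= pi/2.
    """
    if abs(theta) > math.pi / 2:
        raise ValueError("theta must lie in [-pi/2, pi/2]")
    # Micro-rotation angles atan(2^-i) and the accumulated gain correction K.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    k = 1.0
    for i in range(iterations):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    # Start from the vector (K, 0); z holds the angle still to be rotated.
    x, y, z = k, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0              # rotate towards z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x

print(cordic_sin_cos(math.pi / 6))   # approximately (0.5, 0.866)
```

Each iteration adds roughly one bit of precision, which is why a hardware CORDIC can be pipelined into a fixed number of stages.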

From an instrumentation point of view, perhaps the two most interesting and powerful features of the P2 are its “streamer” and its “smart pins”. The streamer allows extremely rapid transfers of data from HUB or LUT RAM to pins/DACs, and vice versa. Once configured, the streamer can move up to 32 bits every clock cycle (12.5 ns at 80 MHz), giving an effective data transfer rate of 320 megabytes/second! This capability would make the P2 ideal for doing the heavy lifting in instruments such as logic analyzers and custom waveform generators.
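The 320 megabytes/second figure follows directly from the clock rate and transfer width. A small Python helper makes the arithmetic explicit; the function name is illustrative, and the 80 MHz clock is the FPGA emulation figure quoted above.

```python
def streamer_rate_bytes_per_s(clock_hz: float, bits_per_clock: int = 32) -> float:
    """Peak streamer throughput when bits_per_clock bits move on every clock."""
    return clock_hz * bits_per_clock / 8

rate = streamer_rate_bytes_per_s(80e6)
print(f"{rate / 1e6:.0f} MB/s, one transfer every {1e9 / 80e6:.1f} ns")
# 320 MB/s, one transfer every 12.5 ns
print(f"streaming 256 kB of HUB RAM takes {256 * 1024 / rate * 1e6:.1f} us")
```

Narrower transfers scale linearly: 8 bits per clock at 80 MHz would give 80 MB/s.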

A P2 can also access up to 64 smart pins, the exact number again depending on the FPGA platform in use. As described in the P2 documentation, “each I/O pin has a smart pin circuit which when enabled, performs some autonomous function on the pin”. Smart pins free the P2’s cogs from micro-managing many I/O operations by independently providing high-bandwidth functions. The user has access to a rich set of smart pin operating modes to generate custom waveforms, to monitor, count and time input pulses and levels, and to generate 8-bit DAC waveforms.

In the work described below, a Terasic DE2-115 FPGA development board runs the P2 emulation. On this board, which has a Cyclone IV FPGA with approximately 115,000 logic elements (LEs), the P2 configuration supports 8 cogs, 6 smart pins and 256 kB of HUB RAM.

P2 Explorer LabVIEW™ Front Panel

The screen capture below shows the current state of a vi being developed to interact with the P2 FPGA emulation (at the time of writing this was version 16a). Controls at the top left of the screen exercise the Hub RAM, with buttons to clear it, to fill it (with linear and quadratic functions) and to upload its contents to LabVIEW™. Datasets uploaded from the P2’s Hub RAM are displayed in three different ways: in a graph window, in tabular form and in a logic analyzer format (see the lower right area of the screen image).
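On the host side, an uploaded block of Hub RAM arrives as a stream of bytes that must be regrouped into 32-bit longs before it can be graphed or tabulated. The Python sketch below shows one way to do this. It assumes, as a hypothetical framing, that the bytes arrive in hub address order with no header, and that each long is little-endian; the actual upload format used by the vi is not described here.

```python
import struct

def bytes_to_longs(buf: bytes) -> list[int]:
    """Decode a raw Hub RAM upload into unsigned 32-bit longs.

    Assumes (hypothetically) a headerless, little-endian byte stream.
    """
    if len(buf) % 4:
        raise ValueError("buffer length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(buf) // 4}I", buf))

# Example: a made-up upload of the "linear" fill, long[n] = n.
upload = struct.pack("<4I", 0, 1, 2, 3)
print(bytes_to_longs(upload))   # [0, 1, 2, 3]
```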

Below the Hub RAM buttons are two additional buttons that control streamer actions: the left button captures the input pins into Hub RAM (32 bits at a time), while the right one outputs Hub RAM to the pins. The logic analyzer window allows data from consecutive longs in Hub RAM to be inspected graphically. Currently just the lower half of each Hub long (i.e. D15..0) is displayed in this window, but it is a trivial matter to switch this so that the upper word can also be visualized. As is usual with LabVIEW™ graph windows, scrollbars are provided so that the user can easily navigate the data space.
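A logic analyzer view is essentially a transpose of the captured data: each long is one time sample, and each bit position becomes one trace. The Python sketch below extracts per-bit traces for a chosen bit range. The function name and sample values are hypothetical; switching from D15..0 to D31..16 is just a change of the bit range.

```python
def bit_traces(longs: list[int], lo: int = 0, hi: int = 15) -> dict[int, list[int]]:
    """Split consecutive longs into one 0/1 trace per bit, for bits lo..hi."""
    return {bit: [(value >> bit) & 1 for value in longs] for bit in range(lo, hi + 1)}

samples = [0x0000_0001, 0x0001_0002, 0x0000_0003]   # three consecutive captures
lower = bit_traces(samples)            # D15..D0, as currently displayed
upper = bit_traces(samples, 16, 31)    # D31..D16, the upper word
print(lower[0], lower[1], upper[16])   # [1, 0, 1] [0, 1, 1] [0, 1, 0]
```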

With the FPGA emulation, full 32-bit data captures at an 80 MHz logic analyzer sampling rate are easily achieved by coding just a few streamer instructions, and the real silicon should run faster (perhaps 160 MHz?). Currently, buffers of 65536 bytes (16384 longs) are moved between the P2 emulation’s HUB RAM and the host PC via a USB-to-serial bridge operating at 921,600 baud, but both the buffer length and the baud rate can easily be increased.
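The buffer size and baud rate quoted above determine both how much time one capture spans and how long it takes to reach the PC. The Python sketch below works through both numbers. It assumes one long is captured per 80 MHz clock and, as a labelled assumption, standard 8N1 serial framing (10 bits on the wire per data byte); the real bridge settings may differ.

```python
def capture_window_s(longs: int, sample_hz: float) -> float:
    """Time span covered by one buffer when one long is captured per sample clock."""
    return longs / sample_hz

def transfer_time_s(n_bytes: int, baud: int, bits_per_byte: int = 10) -> float:
    """Serial transfer time; 10 bits per byte assumes 8N1 framing (start + 8 data + stop)."""
    return n_bytes * bits_per_byte / baud

print(f"capture window: {capture_window_s(16384, 80e6) * 1e6:.1f} us")   # 204.8 us
print(f"upload time:    {transfer_time_s(65536, 921_600):.3f} s")        # 0.711 s
```

The imbalance is striking: roughly 205 us of capture takes about 0.7 s to upload, which is why raising the baud rate is an obvious next step.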

Some CORDIC functionality has also been implemented. The user can enter a 32-bit decimal value into a text field and then operate on it with a nominated function. In the screen shown below, function 1 (a square root) has been applied to the value 1234567890, returning the result 35136 (the integer part of √1234567890 ≈ 35136.4).
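To check results like this on the host, a bit-by-bit integer square root is a handy reference. Note that this is a swapped-in technique (the classic digit-by-digit method), not the CORDIC algorithm the P2 uses; it is shown only because it produces the same truncated integer result that can be compared against the value returned by the P2.

```python
import math

def isqrt_bitwise(n: int) -> int:
    """Floor square root of a non-negative integer, computed two bits at a time."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # Find the largest power of four that does not exceed n.
    bit = 1
    while bit <= n:
        bit <<= 2
    bit >>= 2
    root = 0
    while bit:
        if n >= root + bit:
            n -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root

print(isqrt_bitwise(1234567890))   # 35136
assert isqrt_bitwise(1234567890) == math.isqrt(1234567890)
```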

A CORDIC square root can also be applied to the contents of Hub RAM by pressing another button. In the screen capture shown below, the fill Hub RAM button was first used to fill memory with the quadratic function n²; one Sqrt(Hub) operation then returns the linear function n as shown, while a second Sqrt(Hub) operation returns a square root function, √n (not shown). The data on screen (red trace) and the logic analyzer view allow this behaviour to be verified very easily.
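The expected contents of Hub RAM after each step can be predicted on the host. The Python sketch below models the quadratic fill followed by two square-root passes. It assumes the 16384-long buffer mentioned earlier and models each Sqrt(Hub) pass as a truncating integer square root, consistent with the 35136 result above; `math.isqrt` stands in for the P2's CORDIC here.

```python
import math

N_LONGS = 16384   # assumed buffer length, in longs

def quadratic_fill(n_longs: int = N_LONGS) -> list[int]:
    """The "quadratic" fill: long[n] = n squared."""
    return [n * n for n in range(n_longs)]

def sqrt_pass(values: list[int]) -> list[int]:
    """One Sqrt(Hub) operation, modelled as a truncating integer square root."""
    return [math.isqrt(v) for v in values]

hub = quadratic_fill()
assert max(hub) < 2**32          # every entry still fits in a 32-bit long
once = sqrt_pass(hub)            # first pass recovers the linear function n
twice = sqrt_pass(once)          # second pass gives floor(sqrt(n))
print(once[:6], twice[:10])      # [0, 1, 2, 3, 4, 5] [0, 1, 1, 1, 2, 2, 2, 2, 2, 3]
```

The staircase shape of the second pass comes from truncation; on screen it should still look like a smooth square-root curve at this buffer length.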

In the upper right portion of the vi are some controls to exercise the P2 emulation’s smart pins. A mode selector gives the user a choice of smart pin mode; however, only three smart pin modes are currently exercised here.

In “transition output” mode, the user enters period_cycles, which sets the (equal) high and low periods of the resulting waveform (here 80 cycles = 1 us), together with a repeat count that sets how many transitions are generated. For example, setting repeats to 8 generates a total of 4 clock pulses on smart pin P0, taking a total of 8 us.
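A simple per-clock model reproduces the numbers in this example. The Python sketch below is a behavioural model only, not the smart pin's register-level configuration; it assumes the pin starts low and toggles once per repeat, each level lasting period_cycles clocks at the 80 MHz emulation clock.

```python
CLOCK_HZ = 80e6   # FPGA emulation clock

def transition_output(period_cycles: int, repeats: int, start_level: int = 0) -> list[int]:
    """Per-clock pin levels for a pin that toggles `repeats` times, once every period_cycles clocks."""
    levels, level = [], start_level
    for _ in range(repeats):
        level ^= 1                              # one transition per repeat
        levels.extend([level] * period_cycles)  # hold the new level for period_cycles clocks
    return levels

def rising_edges(levels: list[int]) -> int:
    """Count 0->1 transitions, assuming the pin was low before the first sample."""
    return sum(1 for prev, cur in zip([0] + levels, levels) if prev == 0 and cur == 1)

wave = transition_output(period_cycles=80, repeats=8)
print(f"{rising_edges(wave)} pulses in {len(wave) / CLOCK_HZ * 1e6:.1f} us")   # 4 pulses in 8.0 us
```

Each pulse needs two transitions, which is why 8 repeats yield 4 pulses.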

In the second test, smart pins P0 and P1 are physically connected. This time P0 is configured to operate in “pulse/cycle” mode. Again the user specifies a full cycle period (here period_cycles = 80 = 1 us), as well as the number of clock cycles for which the output should remain high in each full cycle (here high_cycles = 1 = 12.5 ns). As before, the length of the resulting pulse train is controlled by the repeat count.
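The timing of a pulse/cycle waveform follows from period_cycles, high_cycles and the clock rate. The Python sketch below computes it, under the assumption (not stated explicitly above) that each repeat produces one full cycle, i.e. one pulse.

```python
CLOCK_HZ = 80e6   # FPGA emulation clock

def pulse_cycle_timing(period_cycles: int, high_cycles: int, repeats: int) -> dict[str, float]:
    """Timing of a pulse/cycle output, assuming one full cycle per repeat."""
    if not 0 < high_cycles < period_cycles:
        raise ValueError("need 0 < high_cycles < period_cycles")
    t_clk = 1 / CLOCK_HZ
    return {
        "period_s": period_cycles * t_clk,          # full cycle length
        "high_s": high_cycles * t_clk,              # pulse width
        "duty": high_cycles / period_cycles,        # fraction of time high
        "train_s": repeats * period_cycles * t_clk, # length of the whole pulse train
    }

print(pulse_cycle_timing(period_cycles=80, high_cycles=1, repeats=8))
```

With the settings quoted above, the pulses are 12.5 ns wide at a duty cycle of 1/80.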

Smart pin P1 is configured to count rising edges over an interval of CountA_Cycles. Configured in this way (and with P0 physically connected to P1), an RDPIN operation on P1 should yield a count identical to the number of pulses generated on P0, provided, of course, that CountA_Cycles is sufficiently large. As a convenient check, the .spin2 code I’ve developed uses the resulting P1 count value to illuminate LEDs (on PORTB) on the DE2-115 board.
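The caveat about CountA_Cycles being large enough is easy to see in a simulation. The Python sketch below generates the P0 pulse train clock by clock and counts rising edges within a window, mimicking what P1 should report. The helper names are hypothetical, the pin is assumed low before the first clock, and the 8-bit binary pattern is just one illustrative way a count might be shown on LEDs; the actual PORTB mapping is not described here.

```python
def pulse_train(period_cycles: int, high_cycles: int, pulses: int) -> list[int]:
    """Per-clock levels on P0: each cycle is high for high_cycles clocks, then low."""
    one_cycle = [1] * high_cycles + [0] * (period_cycles - high_cycles)
    return one_cycle * pulses

def count_rising_edges(levels: list[int], window_cycles: int) -> int:
    """Count 0->1 transitions within the first window_cycles clocks (pin assumed low before)."""
    seen = levels[:window_cycles]
    return sum(1 for prev, cur in zip([0] + seen, seen) if prev == 0 and cur == 1)

p0 = pulse_train(period_cycles=80, high_cycles=1, pulses=8)
for window in (200, 10_000):   # too short a window, then one covering the whole train
    count = count_rising_edges(p0, window)
    print(window, count, format(count, "08b"))   # hypothetical 8-LED binary display
```

A 200-clock window catches only the first 3 pulses, while a long enough window reports all 8.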

When running each of the above tests, the waveforms generated have been confirmed using an HP 54620A logic analyser.

Finally, as mentioned earlier, the P2 has an additional look-up table (LUT) RAM space. Pressing the button at the bottom centre of the screen computes sine and cosine waveforms (1024 bytes each, 512 longs in total) and stores them in the LUT, while a second button uploads the LUT to LabVIEW™ and displays the results. A few final thoughts about the P2 FPGA emulation follow.
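The table sizes work out neatly: 1024 bytes is 256 longs per waveform, so the two tables together exactly fill the 512-long LUT. The Python sketch below builds comparable tables on the host for comparison with an upload. It assumes one full cycle per table and full-scale signed 32-bit scaling, neither of which is specified above, and it uses Python's `math` module rather than the P2's CORDIC.

```python
import math

ENTRIES = 256              # 1024 bytes per table / 4 bytes per long
AMPLITUDE = 2**31 - 1      # assumed scaling: full-scale signed 32-bit

def table(fn) -> list[int]:
    """One full cycle of fn sampled at ENTRIES points, scaled to signed 32-bit integers."""
    return [round(AMPLITUDE * fn(2 * math.pi * i / ENTRIES)) for i in range(ENTRIES)]

sine, cosine = table(math.sin), table(math.cos)
lut = sine + cosine        # 512 longs: exactly one cog's LUT RAM
print(len(lut), len(lut) * 4, "bytes")   # 512 2048 bytes
```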

Initial Impressions and Further Work

The experiments using the LabVIEW™ vi described here barely scratch the surface of what is possible with the P2. My initial impression is that it could become a powerful platform for instrument development. The streamer and smart pin modes in particular offer very high performance, while the code needed to use these features remains extremely compact.

At this stage, documentation for the P2 is somewhat scant, but as more code examples become available and further information is gleaned, more features will be added to the vi just described. In some respects the vast array of options and features available on the P2 may seem overwhelming, but having a way to quickly test code and see the results visually makes it easy to overcome this feeling and realise just how powerful the eventual silicon is likely to be.

Of immediate interest for further investigation are the many other smart pin modes, including DAC outputs, USB and synchronous/asynchronous serial data transmission, along with many other features not touched on here.