Sunday, December 23, 2012

ZPUino - The "was", "is" and "will be"

Ok, guys, I'm not a big fan of copy/paste, but here's a copy of what I just posted to the ZPU mailing list:

http://mail.zylin.co...ber/001913.html

Hi guys,

Since it's almost Christmas, it's perhaps time to get you all updated about ZPUino: what has been done and accomplished so far, what is being done right now, and
what the future holds.

The ZPUino project started back in 2010 and published its first alpha release in December of the same year. The objective of the project was to implement an Arduino
(Wiring) compatible platform, but running on a ZPU core with devices similar to those present on Arduino AVR devices. The project developed in several phases,
with several hardware versions for each phase. It started with a simple SoC using the traditional ZPU core and some basic devices like UART and SPI. A
software bootloader/programmer was also implemented, using the standard serial port and a variant (very variant) of the HDLC protocol for communication with
programmer devices. ZPUino was designed to bootstrap its "sketches" from an external SPI flash, and the logic for programming those flash devices was split between
the host programmer (which is now known to run on major operating systems, like Microsoft Windows, Linux and MacOS) and the device programmer.
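The post does not say how the ZPUino protocol differs from standard HDLC. Purely as an illustration of the general framing idea, here is a minimal Python sketch that assumes the conventional asynchronous-HDLC values (flag byte 0x7E, escape byte 0x7D, XOR with 0x20). The function names are hypothetical, and the real bootloader protocol may differ, for example in its checksum and command bytes.

```python
FLAG = 0x7E      # frame delimiter (conventional HDLC value, assumed here)
ESCAPE = 0x7D    # escape byte (assumed)
XOR_MASK = 0x20  # value XORed into an escaped byte (assumed)


def encode_frame(payload: bytes) -> bytes:
    """Wrap payload between flag bytes, escaping any flag/escape byte inside."""
    out = bytearray([FLAG])
    for byte in payload:
        if byte in (FLAG, ESCAPE):
            out.append(ESCAPE)
            out.append(byte ^ XOR_MASK)
        else:
            out.append(byte)
    out.append(FLAG)
    return bytes(out)


def decode_frame(frame: bytes) -> bytes:
    """Reverse encode_frame; raises ValueError on a malformed frame."""
    if len(frame) < 2 or frame[0] != FLAG or frame[-1] != FLAG:
        raise ValueError("missing frame delimiters")
    out = bytearray()
    escaped = False
    for byte in frame[1:-1]:
        if escaped:
            out.append(byte ^ XOR_MASK)
            escaped = False
        elif byte == ESCAPE:
            escaped = True
        elif byte == FLAG:
            raise ValueError("unescaped flag inside frame")
        else:
            out.append(byte)
    if escaped:
        raise ValueError("frame ends in the middle of an escape sequence")
    return bytes(out)
```

Because the flag byte can never appear unescaped inside a frame, a receiver on a plain serial line can always resynchronise on the next flag.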

Everything was set up to allow almost seamless migration of Arduino code into ZPUino code.

During this first phase the Arduino IDE/Wiring library was adapted to support ZPUino, and a new compiler mode was implemented, since the IDE did not support
multiple platforms back then (as of now it does, but I still keep the "make" approach I designed at the time).

The second phase focused on hardware design. A new core was implemented (ZPUino Premium), which had a full 3-stage pipeline and was able to execute most basic
instructions in one clock cycle.
Some new core devices were also added, like Audio (sigma-delta) and complex PWM-able timers. The main IO interface is wishbone compliant, so any wishbone
compliant device should work with the design (I've tested a few, like OpenCores I2C, and they work like a charm). A few design variants were written, like memory
mapped VGA, DMA VGA (such as the ZX Spectrum version), audio synthesis, and many more. But only internal RAM (BRAM) was supported.

There was one singular variant of this design, which actually implemented a new instruction (which I called FMUL16) that could perform a 16.16 fixed point
multiplication and speed up some operations. This variant was used in the SoundPuddle project.
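For readers unfamiliar with the format: a 16.16 fixed-point number is a 32-bit integer whose low 16 bits hold the fraction. The Python sketch below models what such a multiply computes. The helper names are hypothetical, and the rounding (truncation) and overflow (wrap to 32 bits) behaviour are assumptions, since the post does not describe how FMUL16 handles them.

```python
FRAC_BITS = 16
MASK32 = 0xFFFFFFFF


def to_fixed(x: float) -> int:
    """Convert a float to a signed 16.16 fixed-point integer."""
    return int(round(x * (1 << FRAC_BITS)))


def to_float(v: int) -> float:
    """Convert a signed 16.16 fixed-point integer back to a float."""
    return v / (1 << FRAC_BITS)


def to_signed32(v: int) -> int:
    """Interpret the low 32 bits of v as a two's-complement value."""
    v &= MASK32
    return v - (1 << 32) if v & 0x80000000 else v


def fmul16(a: int, b: int) -> int:
    """16.16 x 16.16 -> 16.16: take the full product, drop 16 fraction bits,
    keep the low 32 bits (truncation and wrap-around are assumptions)."""
    return to_signed32((a * b) >> FRAC_BITS)
```

Without a dedicated instruction, a 32-bit core has to build the wide product out of several narrower multiplies and shifts, which is where the speed-up comes from.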

Let me now tell you about the SoundPuddle project.

Back in April this year (2012), I was contacted by John English from Colorado, US, asking if ZPUino could do real time signal analysis for a project he wanted
to show at Apogaea 2012.
After some initial analysis I said it was feasible, and so we moved to implement the thing on ZPUino on an S3E500 board (Papilio One), from Gadget Factory. It
was indeed feasible, and it was a huge success. It was improved and shown at the Burning Man festival the same year. Feedback was awesome.

For some low-level details on this one:

A 1024-point FFT was implemented in software, with its inputs coming from an external ADC. The FFT code was written entirely in assembly (a whopping 177 bytes!),
using the FMUL16 instruction. This was fast enough for what the project needed (actually, it ended up being too fast, and we had to add some delays). The real
constraint here was the amount of memory available on the device. The system ran with around 40KB. Tough, but possible.
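The original routine was hand-written ZPU assembly and is not reproduced here. To show roughly what a fixed-point FFT of this kind does, here is a small Python sketch of an in-place radix-2 FFT on 16.16 values. The function names are hypothetical, `fmul` stands in for FMUL16, and the twiddle factors are computed with `math` for brevity, whereas an embedded version would more likely use a precomputed table.

```python
import math

FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # 1.0 in 16.16 fixed point


def fmul(a: int, b: int) -> int:
    """16.16 fixed-point multiply (software stand-in for FMUL16)."""
    return (a * b) >> FRAC_BITS


def fft_fixed(re: list, im: list) -> None:
    """In-place radix-2 decimation-in-time FFT on 16.16 fixed-point lists."""
    n = len(re)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    # Bit-reversal permutation of the input.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            re[i], re[j] = re[j], re[i]
            im[i], im[j] = im[j], im[i]
    # Butterfly stages: sub-transform size doubles each pass.
    size = 2
    while size <= n:
        half = size // 2
        for k in range(half):
            angle = -2.0 * math.pi * k / size
            wr = int(round(math.cos(angle) * ONE))
            wi = int(round(math.sin(angle) * ONE))
            for start in range(0, n, size):
                a = start + k
                b = a + half
                tr = fmul(re[b], wr) - fmul(im[b], wi)
                ti = fmul(re[b], wi) + fmul(im[b], wr)
                re[b] = re[a] - tr
                im[b] = im[a] - ti
                re[a] += tr
                im[a] += ti
        size *= 2
```

For a 1024-point transform the same code runs ten butterfly stages over lists of 1024 samples; Python integers do not overflow, so a 32-bit implementation would additionally need to scale the data between stages.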

Intro video for Kickstarter is here: http://kck.st/MAu7oQ

At almost the same time, Jack Gassett (from Gadget Factory) started the Retrocade Synth project:
http://www.kickstart...to-rule-them-al . This now uses the Extreme core, as described below.

Both projects were successfully funded, and are now shipping to their supporters.

Back to the design:

The core, due to its pipelined design, required fast memory, since it needed to simultaneously read the instruction stream, read stack values and write back
the stack. And we were very limited on block RAM, so it was time to move to another design.

ZPUino Extreme was then born.

ZPUino Extreme took another approach - it used block RAM for the stack (which was fixed, 4KB or 8KB), and used external memory for the program area and data. In
order to do so, we designed memory interfaces (SRAM, SDRAM and DDR-SDRAM), all working in wishbone pipelined mode, and added a simple, direct-mapped instruction
cache. This allowed us to run larger codebases, and access more memory than usual. This is still the fastest core if you need large code/data, and can live with
the limited, non-switchable stack. For most single-task applications, this is indeed the core you need.
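For those who have not met a direct-mapped cache before, the idea is that each memory address can live in exactly one cache line, selected by some of its address bits. The Python sketch below is a behavioural model only: the class name, line count and line size are made up for illustration and say nothing about the actual ZPUino Extreme parameters or its Wishbone timing.

```python
class DirectMappedCache:
    """Read-only direct-mapped cache in front of a slow memory (a list of words)."""

    def __init__(self, memory, num_lines=64, words_per_line=8):
        self.memory = memory
        self.num_lines = num_lines
        self.words_per_line = words_per_line
        self.tags = [None] * num_lines  # None marks an invalid line
        self.lines = [[0] * words_per_line for _ in range(num_lines)]
        self.hits = 0
        self.misses = 0

    def read(self, word_addr):
        # Split the address into offset within the line, line index and tag.
        offset = word_addr % self.words_per_line
        line_addr = word_addr // self.words_per_line
        index = line_addr % self.num_lines
        tag = line_addr // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
        else:
            # Miss: refill the whole line from the backing memory.
            self.misses += 1
            base = line_addr * self.words_per_line
            self.lines[index] = list(self.memory[base:base + self.words_per_line])
            self.tags[index] = tag
        return self.lines[index][offset]
```

Two addresses that share an index but differ in tag evict each other, which is the price paid for a lookup that needs only one tag comparison.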

But for complex designs this was still not enough. The fixed, limited stack prevented us from running more complex applications. At first a simple
write-back-stack, read-new-stack approach was tried, but it was somewhat complex, and very slow.

So, ZCoreV3 was born :)

Yes, I decided to change the name for the core. I was running out of acronyms :P - now, seriously, I thought a lot about the naming of ZPUino cores, and they
wouldn't cope with further development improvements, so I went radical.

First of all, ZCoreV3 is not yet in production, although it's considered (by me) stable. Its stability will be proven during the next months, although I'm feeling
confident. A few improvements are also being thought of, so it might take a while before a first stable version is available to you all.

So, what's so different about ZCoreV3? Well, something simple, but something very complex: the stack is no longer fixed.

Although this might look like a simple thing, it's indeed the most complex thing I did in hardware!!!

ZCoreV3 shares the same pipeline and instruction cache as ZPUino Extreme, and adds a data cache, direct-mapped, one-way associative, dual-ported, write-back,
which can in "hit" scenarios attain a 1-clock read delay and a 0-clock write delay. Only one of the ports is writeable, though. Conflicts (r/w) are handled by
the cache itself, so the core does not need to address that. The core is also slightly different, featuring not only TOS caching, but also NOS caching (but
TOS is always written back for stack push operations). Further improvements are to identify "hot" cache lines (those being accessed as stack) and perform
write-through for some memory accesses (or eventually convert it to a two-way associative cache).
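"Write-back" here means a write only updates the cache line and marks it dirty; main memory is updated later, when the line is evicted. The Python sketch below models just that policy for a direct-mapped cache. The class name and sizes are hypothetical, write-allocate on a miss is an assumption, and the dual ports, conflict handling and clock-level timing described above are not modelled.

```python
class WriteBackCache:
    """Direct-mapped, write-back, write-allocate cache over a list of words."""

    def __init__(self, memory, num_lines=4, words_per_line=4):
        self.memory = memory
        self.num_lines = num_lines
        self.words_per_line = words_per_line
        self.tags = [None] * num_lines  # None marks an invalid line
        self.dirty = [False] * num_lines
        self.lines = [[0] * words_per_line for _ in range(num_lines)]
        self.writebacks = 0

    def _evict(self, index):
        # Flush the resident line to memory only if it was modified.
        if self.tags[index] is not None and self.dirty[index]:
            line_addr = self.tags[index] * self.num_lines + index
            base = line_addr * self.words_per_line
            self.memory[base:base + self.words_per_line] = self.lines[index]
            self.writebacks += 1
        self.dirty[index] = False

    def _select(self, word_addr):
        """Make the line holding word_addr resident; return (index, offset)."""
        offset = word_addr % self.words_per_line
        line_addr = word_addr // self.words_per_line
        index = line_addr % self.num_lines
        tag = line_addr // self.num_lines
        if self.tags[index] != tag:
            self._evict(index)
            base = line_addr * self.words_per_line
            self.lines[index] = list(self.memory[base:base + self.words_per_line])
            self.tags[index] = tag
        return index, offset

    def read(self, word_addr):
        index, offset = self._select(word_addr)
        return self.lines[index][offset]

    def write(self, word_addr, value):
        index, offset = self._select(word_addr)
        self.lines[index][offset] = value
        self.dirty[index] = True  # memory is updated only on eviction
```

A write-through policy, mentioned above as a possible improvement for some accesses, would instead update memory on every write and never need the dirty bit.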

So, since the ZCoreV3 design is able to address a lot of memory, with not many restrictions on its use (if any), we can probably put it to some real work....

... and it now runs Linux (MMU-less version)!

There are still some things needing implementation on the Linux side (and uClibc), and a few stability issues, but things now look very promising.

I'm uploading a small video of it running on a Gadget Factory Papilio Pro board (S6LX9), with 8MB SDRAM and a real SD card. You can see it here:


A few things are still to be addressed. Some stability issues need fixing (all of those are software, possibly related to the kernel stack switch), some functions
(memcpy, memset, string functions) need some optimizations (i.e., assembler versions; memcpy already has one), the SPI controller is limited to 8-bit, which
makes it very slow (as you can see from the video, it takes some time to exec. the first application), and some more, which I'll address. First, make it run
stable, then optimize.

I'm hoping to get this to run on the S3ESK soon, at the same speed (96MHz), so you guys can also help (I know some of you have this board at home).

Plans for the future: oh, well, first, get Linux and other operating systems running stable, get DMA to work properly with the dcache, some new VGA
adaptors, what else....

Let's hope 2013 is a good year for ZPU and ZPUino.

A few thank-yous:

- To all ZPU and ZPUino users, we're doing this for you, thank you!
- My family, for their support (although they don't know what I'm doing! :P )
- Jack Gassett, and Gadget Factory, for their support with hardware and ideas! Thanks Jack!
- John English, the SoundPuddle Engineer, for the real-world use of ZPUino and a lot more!
- All those who helped with ZPUino, they are so many I won't risk forgetting anyone, so you're all included!
- All ZPU fans!

As always, any doubts, questions, opinions and so on are very, very welcome!

And have a merry Christmas!

Alvie

PS: I'm not explaining something here - it's a challenge to your intellect and HDL knowledge :P I'll just say "data cache", hopefully someone will question how
it is possible. lol!


And merry Christmas to you all :)
Alvie

1 comment:

  1. This looks very good! Even though I haven't had time to use the ZPU stuff, I follow your work. I think the co-development of hardware, compiler and software is an immensely interesting field.
