mupuf.org // we are octopi

FPGA: Why So Few Open Source Drivers for Open Hardware?

Field-Programmable Gate Arrays (FPGA) have been an interest of mine for well over a decade now. Being able to generate complex signals in the tens of MHz range with nanosecond accuracy, dealing with fast data streams, and doing all of this at a fraction of the power consumption of fast CPUs, they really have a lot of potential for fun. However, their prohibitive cost, proprietary toolchains (some running only on Windows), and insanely long bitstream generation times made them look more like a curiosity than a practical solution to me. Finally, writing Verilog / VHDL directly felt like the equivalent of writing an OS in assembly, and thus felt more like torture than fun for the young C/C++ developer that I was. Little did I know that 10+ years later, I would find HW development to be the most amazing thing ever!

The first thing that changed is that I got involved in reverse engineering NVIDIA GPUs’ power management in order to write an open source driver, writing in a reverse-engineered assembly to implement automatic power management for this driver, creating my own smart wireless modems which detect the PHY parameters of incoming transmissions on the fly (modulation, center frequency) by using software-defined radio, having fun with Arduinos and single-board computers, and designing my own custom PCBs.

The second thing that changed is that Moore’s law has ground to a halt, leading to a world that is more architecture-centric than fab-oriented. This reduced the advantage ASICs had over FPGAs by creating a software ecosystem that is geared more towards parallelism than high-frequency single-thread performance.

Fi­nally, FP­GAs along with their com­mu­nity have got­ten a whole lot more at­trac­tive! From the FP­GAs them­selves to their tool­chains, let’s re­view what changed, and then ask our­selves why this has not trans­lated to up­stream Linux dri­vers for FPGA-based open source de­signs.

Even hob­by­ists can make use­ful HW de­signs

Programmable logic elements have gone through multiple ages throughout their life. Since their humble beginning, they have always excelled at low-volume designs by spreading the cost of creating a new ASIC onto as many customers as possible. This has enabled start-ups and hobbyists to create their own niche and get into the market without breaking the bank.

Nowa­days, FP­GAs are all based around Lookup Ta­bles (LUT) rather than a set of logic gates as they can re-cre­ate any logic func­tion and can also serve as flip-flops (mem­ory unit). Let’s have a quick look at what changed through­out the “stack” that makes de­sign­ing FPGA-based HW de­signs so ap­proach­able even to hob­by­ists.
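To make the lookup-table idea concrete, here is a minimal Python sketch. It is not taken from any toolchain, and the `make_lut` helper is a hypothetical name of mine. It models a k-input LUT as a truth table stored in 2^k bits: changing the stored bits changes the logic function without changing the "hardware".

```python
from itertools import product


def make_lut(init: int, k: int):
    """Return a function modelling a k-input LUT.

    `init` holds the 2**k truth-table bits: bit i is the output for the
    input combination whose binary encoding is i (input 0 is the LSB).
    """
    if not 0 <= init < (1 << (1 << k)):
        raise ValueError("init does not fit in a 2**k-bit truth table")

    def lut(*inputs: int) -> int:
        if len(inputs) != k:
            raise ValueError(f"expected {k} inputs, got {len(inputs)}")
        # Build the truth-table index from the input bits, then look it up.
        index = sum((bit & 1) << pos for pos, bit in enumerate(inputs))
        return (init >> index) & 1

    return lut


# The same 2-input LUT, configured three different ways.
and2 = make_lut(0b1000, 2)  # only inputs (1, 1) give 1
or2 = make_lut(0b1110, 2)   # everything except (0, 0) gives 1
xor2 = make_lut(0b0110, 2)  # (1, 0) and (0, 1) give 1

for a, b in product((0, 1), repeat=2):
    print(a, b, and2(a, b), or2(a, b), xor2(a, b))
```

Real FPGAs typically use wider LUTs and pair them with dedicated registers, but the principle of "logic as a small programmable memory" is the same.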

Price per LUT

Historically, FPGAs have compared negatively to ASICs due to their increased latency (limiting the maximum frequency of the design) and lower power efficiency. However, just like CPUs and GPUs, one can compensate for these limitations by making a wider/parallel design operating at a lower frequency. Wider designs however require more logic elements / LUTs.
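As a back-of-the-envelope illustration of this trade-off, the sketch below computes how many parallel lanes a slower design needs to match a faster one. The clock frequencies are made-up numbers chosen for the arithmetic, not measurements of any real part, and `lanes_needed` is a hypothetical helper.

```python
import math


def lanes_needed(target_ops_per_s: float, clock_hz: float,
                 ops_per_lane_per_cycle: int = 1) -> int:
    """Number of parallel lanes needed to reach a target throughput."""
    return math.ceil(target_ops_per_s / (clock_hz * ops_per_lane_per_cycle))


# Hypothetical numbers: a single-lane design clocked at 1 GHz...
reference_throughput = 1e9 * 1
# ...matched by a design clocked at only 125 MHz.
fpga_lanes = lanes_needed(reference_throughput, 125e6)
print(fpga_lanes)
```

Each extra lane costs LUTs, which is why the price per LUT matters so much.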

Fortunately, the price per LUT has fallen dramatically since the introduction of FPGAs, to the point that all but the biggest designs now fit in affordable parts. Since then, the focus has shifted to providing hard IPs (fixed functions) instead. This enables a $37 part (XC7A12T) to fit over 3 Linux-worthy RISC-V processors running at 180 MHz, with 80 kB of block RAM available for caches, FIFOs, or anything else. By raising the budget to the $100 mark, the specs improve dramatically with an FPGA capable of running 40 Linux-worthy RISC-V CPUs and over 500 kB of block RAM available for caches!

And just in case this would not be enough for you, you could consider the Alveo line-up, such as the Alveo U250 which has 1.3M LUTs, a peak throughput in INT8 operations of 33 TOPS, and 64 GB of DDR4 memory (77 GB/s bandwidth). For memory-bandwidth-hungry designs, the Alveo U280 brings 8 GB of HBM2 memory to the table (460 GB/s bandwidth) and 32 GB of DDR4 memory (38 GB/s of bandwidth), at the expense of having “only” 24.5 INT8 TOPS and 1M LUTs. Both models can be found for ~$3000 on eBay, used. What a bargain :D !

Tool­chains

Pro­pri­etary tool­chains

Linux is now well supported by the major players of the industry. Xilinx’s support came first (2005), while Altera joined the club in 2009. Both toolchains are however the definition of bloated, weighing multiple GB (~6 GB for Altera, while Xilinx is at a whopping 27 GB)!

Open source toolchains for a few FPGAs

Project IceStorm created a fully-functional, fully open-source toolchain for Lattice’s ice40 FPGAs. The regular structure of these FPGAs made the reverse engineering and the writing of the toolchain easier. Since then, the more complex Lattice ECP5 FPGA got full support, and support for Xilinx’s 7-series is under way. All these projects are now working under the SymbiFlow umbrella, which aims to become the GCC of FPGAs.

Lan­guages:

Mi­gen / LiteX

VHDL/Verilog are error-prone and do not lend themselves to complex parametrization. This reduces the re-usability of modules. In contrast, the Python language excels at meta-programming, and Migen provides a way to generate Verilog from relatively simple Python constructs.
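To give a feel for why a general-purpose language helps with parametrization, here is a deliberately naive sketch that generates Verilog from Python. It does not use Migen's API: Migen works on a structured description of the design rather than on strings, and `generate_blinker` is a hypothetical function written only for this illustration.

```python
from typing import Sequence


def generate_blinker(name: str, counter_bits: int, taps: Sequence[int]) -> str:
    """Emit Verilog for a free-running counter driving one LED per tap bit."""
    for tap in taps:
        if not 0 <= tap < counter_bits:
            raise ValueError(f"tap {tap} is outside the {counter_bits}-bit counter")

    # The port list grows with the number of taps requested by the caller.
    ports = ["input wire clk"]
    ports += [f"output wire led{i}" for i in range(len(taps))]

    lines = [f"module {name} ("]
    lines.append(",\n".join(f"    {port}" for port in ports))
    lines.append(");")
    lines.append(f"    reg [{counter_bits - 1}:0] counter = 0;")
    lines.append("    always @(posedge clk) counter <= counter + 1;")
    lines += [f"    assign led{i} = counter[{tap}];" for i, tap in enumerate(taps)]
    lines.append("endmodule")
    return "\n".join(lines)


# Two LEDs blinking at different rates, derived from one 26-bit counter.
print(generate_blinker("blinker", counter_bits=26, taps=[25, 24]))
```

The number of outputs, the counter width, and the validation of the parameters are all ordinary Python here, which is the kind of flexibility that is awkward to get from Verilog parameters alone.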

On top of Migen, LiteX provides easy-to-use and space-efficient modules to create your own System on Chip (SoC) in less than an hour! It already has support for 16+ popular boards, generates Verilog, builds, and loads the bitstream for you. Documentation is quite sparse, however, so I would suggest you read the LiteX for Hardware Engineers guide if you want to learn more.

High-level Syn­the­sis (HLS)

For com­plex al­go­rithms, Mi­gen/VHDL/Ver­ilog are not the most ef­fi­cient lan­guages as they are too low-level and are akin to writ­ing im­age recog­ni­tion ap­pli­ca­tions in as­sem­bly.

Instead, high-level synthesis enables writing an untimed model of the design in C, and converting it into an efficient Verilog/VHDL module. This makes it easy to validate the model, and to target multiple FPGA vendors with the same code without an expensive rewrite of the module. Moreover, changes in the algorithm or latency requirements will not require an expensive rewrite and re-validation. Sounds amazing to me!

The bad part is that most C/C++-compatible HLS tools are proprietary or seem to be academic toy projects. I hope I am wrong though, so I’ll need to look more into them as the prospects are just too good to pass up! Let me know in the comments which projects are your favourite!

Hard IPs (Fixed func­tions)

Initially, FPGAs were only made of a ton of gates / LUTs, and designs would be fully implemented using them. However, some functions could be better implemented as a fast and efficient fixed function: block memory, Serializer/Deserializer (parallel to serial and vice versa, often called SERDES), PLLs (clock generators), memory controllers, PCIe, …

These fixed-func­tion blocks are called Hard IPs, while the part im­ple­mented us­ing the pro­gram­ma­ble part of the FPGA is by ex­ten­sion called a soft IP. Hard IPs used to be re­served to higher-end parts, but they are nowa­days found on most FP­GAs, save the cheap­est and small­est ones which are de­signed for low-power and self-re­liance.

For example, the $100 part mentioned earlier includes multiple SERDES that are sufficient to achieve HDMI 1.4 compliance, a 4-lane PCIe 2.0 block, and a DDR3 memory controller. This makes it sufficient for implementing display controllers with multiple outputs and inputs, as seen on the NeTV2 open hardware board.

Hard IPs can also be the basis of proprietary soft IPs. For instance, Xilinx sells HDMI 1.4/2.0 receiver IPs that use the SERDES hard IPs to reach the 18 Gb/s bandwidth needed for HDMI compliance.

Soft-CPUs

One might wonder why use an FPGA to implement a CPU. Indeed, physical CPUs which are dirt-cheap and better-performing could simply be installed alongside the FPGA! So, why waste LUTs on a CPU? This article addresses it better than I could, but the gist of it is that soft CPUs complement fixed logic really well for the less latency-sensitive parts of a design, and provide a lot of value. The drawback is that additional firmware is needed for the SoC, but that is no different from having external CPUs.

There have been quite a few open source toy soft-CPUs for FPGAs, and some proprietary vendor-provided ones. The problem has been that their toolchain was often out of tree, and/or Linux couldn’t run on them. This really changed with the introduction of RISC-V, which is pretty efficient, is supported in mainline Linux and GCC, and can fit comfortably in even the smallest FPGAs from Altera and Xilinx. What’s there not to love?

Open de­sign / open hard­ware boards

So, all of these nice improvements in FPGAs and their community are great, but they wouldn’t be as attractive if not for all the cheap and relatively-open boards (if not fully OSHW-compliant) with innovative designs that use them:

  • Fomu ($50): an ice40-based FPGA board that fits in your USB port and is sufficient to play with RISC-V and a couple of IOs using a fully open-source toolchain!
  • IceBreaker ($69): a more traditional ice40-based board that is oriented towards IOs and low cost, with a fully open-source toolchain.
  • ULX3S ($115-200): the ultimate ECP5-based board? It can be used as a complete handheld or static game console (including wireless controllers) with over-the-air updates, a USB/wireless display controller, or an Arduino-compatible home-automation gateway including surveillance cameras. All of that with a fully open-source toolchain.
  • NeTV2: Video-oriented platform with 2 HDMI inputs and 2 HDMI outputs which can run as a standalone device with USB and Ethernet connectivity, or as an accelerator using the PCIe 2.0 4x connector. The most expensive board has enough gates to get into serious computing power which could be used to create a slow GPU, with a pretty decent display controller! Being based on Xilinx’s Artix-7, its open-source toolchain is not yet complete, but by the time you are done implementing your design, I am sure the toolchain will be ready!

Ultimately, these boards provide a good platform for any sort of project, further reducing the cost of entry in the hobby / market, and providing ready-made designs to be incorporated in your projects. All seems pretty good on the hardware side, so why don’t we have a huge community around a board that would provide the flexibility of Arduinos but with a Raspberry-Pi-like feature set?

Open source hard­ware blocks ex­ist

We have seen that neither board availability, toolchains, languages, speed, nor price are keeping even hobbyists from getting into hardware design. So, there must be open blocks that could be incorporated in designs, right?

The answer is a resounding YES! The first project I would like to talk about is LiteX, which is an HDL with batteries included (like Python). Here is a trimmed-down version of the different blocks it provides:

  • LiteX
    • Soft CPUs: black­par­rot, cv32e40p, lm32, mi­crowatt, min­erva, mor1kx, pi­corv32, rocket, serv, and vexriscv
    • In­put/Out­puts: GPIO, I2C, SPI, I2S, UART, JTAG, PWM, XADC, …
    • Wishbone bus: Enables MMIO access to the different IPs for the soft-CPUs, or through different buses (PCIe, USB, Ethernet, …)
    • Clock do­mains, ECC, ran­dom num­ber gen­er­a­tion, …
  • LiteDRAM: An SDRAM controller soft IP, or wrapper for DDR/LPDDR/DDR2/DDR3/DDR4 hard IPs of Xilinx or DDR3 for the ECP5.
  • LiteEth: A 10/100/1000 Ethernet soft IP which also allows you to access the Wishbone bus through it!
  • LiteP­CIe: Wrap­per for the PCIe Gen2 x4 hard IPs of Xil­inx and In­tel
  • Lite­SATA / LiteS­D­Card: Soft IP to ac­cess SATA dri­ves / SD Cards, pro­vid­ing ex­ten­sive stor­age ca­pa­bil­i­ties to your soft CPU.
  • Lite­V­ideo: HDMI in­put/out­put soft IPs, with DMA, triple buffer­ing, and color space con­ver­sion.

Using LiteX, one may create a complete System on Chip in a matter of hours. Adding a block is as simple as adding two lines of code to the SoC: one line to instantiate the block (like one would instantiate an object), and one to expose it through the Wishbone bus. And if this isn’t enough, check out the new Open WiFi project, or the OpenCores project which seems to have pretty much everything one could hope for.

So… where are the dri­vers for open source blocks?

We have seen that rel­a­tively-open boards with ca­pa­ble FP­GAs and use­ful IOs are af­ford­able even to hob­by­ists. We have also seen that cre­at­ing SoCs can be done in a mat­ter of hours, so why don’t we have dri­vers for all of them?

I mean, we have an FPGA subsystem that is focused on loading bitstreams at boot, or even supporting on-the-fly FPGA reconfiguration. We have support for most hard IPs, but only when accessed through the integrated ARM processor of some FPGAs. So, why don’t we have drivers for soft IPs? Could it be that their developers do not want to upstream drivers for them because the interface and the base address of the blocks are subject to change? It certainly looks like it!

But what if we could create an interface that would allow listing these blocks, the current version of their interface, and their base address? This would basically be akin to the Device Tree, but without the need to ship the netlist of the SoC you created to every single user. This would enable the creation of a generic upstream driver for all the versions of a soft IP and all the boards using it, and thus make open source soft IPs more usable.
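To illustrate the idea, here is a rough Python sketch of what enumerating such a table could look like from the software side. Everything about the layout is an assumption made up for this example (the `DIPT` magic value, the field widths, the block type IDs, and the addresses); it is not the actual LiteDIP format.

```python
import struct
from typing import List, NamedTuple

# Hypothetical little-endian layout, invented for this sketch.
MAGIC = b"DIPT"
HEADER = struct.Struct("<4sI")  # magic, number of entries
ENTRY = struct.Struct("<HHI")   # block type, interface version, base address


class Block(NamedTuple):
    block_type: int
    version: int
    base: int


def parse_table(blob: bytes) -> List[Block]:
    """Decode the discovery table into a list of blocks."""
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a discovery table")
    return [
        Block(*ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size))
        for i in range(count)
    ]


# Hypothetical block type IDs and base addresses.
UART, GPIO = 1, 2
table = (
    HEADER.pack(MAGIC, 2)
    + ENTRY.pack(UART, 1, 0x8000_0000)
    + ENTRY.pack(GPIO, 3, 0x8000_1000)
)

for block in parse_table(table):
    print(f"type={block.block_type} v{block.version} @ {block.base:#010x}")
```

A generic driver could walk such a list at probe time and bind a sub-driver to each (type, version) pair it knows about, instead of hard-coding addresses for every SoC variant.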

Removing the fear of ABI instability in open cores is at the core of my new project, LiteDIP. To demonstrate its effectiveness, I would like to expose all the hardware available on the NeTV2 (HDMI IN/OUT, 10/100 Ethernet, SD Card reader, fan, temperature, voltages), and the ULX3S (HDMI IN/OUT, WiFi, Bluetooth, SD Card reader, LEDs, GPIOs, ADC, buttons, audio, FM/AM radio, …) using the same driver. Users could pick and choose modules, configure them to their liking, and no driver changes would be necessary. It sounds ambitious, but also seems like a worthy challenge! Not only do I get to enjoy a new hobby, but it would bring together software and hardware developers, enabling the creation of modern-ish computers or accelerators using one-size-fits-all open development boards.

Am I the only one ex­cited by the prospect? Stay tuned for up­dates on the pro­ject!

2020-06-12 edit: Fixed mul­ti­ple ty­pos spot­ted by For­est Cross­man, the con­fu­sion be­tween kb and kB spot­ted by Mic, added a link to the Linux-wor­thy VexRiscv CPU, re­moved the con­fu­sion spot­ted by TD-Linux be­tween HLS and Scala-based HDLs, link to the open-source hard­ware de­f­i­n­i­tion and do not la­bel all boards as be­ing fully open as sug­gested by the feed­back from in­am­ber­clad and abe­tusk.
