DeepSeek launched a free, open-source large language model in late December, claiming it was developed in just two months at a cost of under $6 million.
No.
A GPU is good for graphics. That’s what it is designed and built for. It just so happens to be good at neural network workloads because of parallelism.
An FPGA is fully programmable to do whatever you want, and can be reprogrammed on the fly. That is pretty much perfect for reducing costs if you have a platform that does things like audio processing, then video processing, or deep learning, especially in cloud environments. Instead of spinning up a bunch of expensive single-purpose instances, you can just spin up one FPGA instance type and reprogram it on the fly to best suit the work at hand when the code starts up. Simple.
AMD bought Xilinx in 2019 when they were still a fledgling company because they realized the benefit of this. They are now selling mass amounts of these chips to data centers everywhere. It’s also what the XDNA coprocessors on all the newer Ryzen chips are built on, so home users have access to an FPGA chip right there. It’s efficient, cheaper to make than a GPU, and can perform better on lots of non-graphic tasks than GPUs without all the massive power and cooling needs. Nvidia has nothing on the roadmap to even compete, and they’re about to find out what a stupid mistake that is.
I remember Xilinx from way back in the 90s when I was taking my EE degree, so they were hardly a fledgling in 2019.
Not disputing your overall point, just that detail because it stood out for me since Xilinx is a name I remember well, mostly because it’s unusual.
They were kind of pioneering the space, but about to collapse. AMD did good by scooping them up.
FPGAs have been a thing for ages.
If I remember it correctly (I learned this stuff 3 decades ago) they were basically an improvement on logic circuits without clocks (think stuff like NAND and XOR gates - digital signals just go in and the result comes out on the other side with no delay beyond that caused by analog elements such as parasitic inductances and capacitances, so without waiting for a clock transition).
The thing is, back then clocking of digital circuits really took off (because it’s WAY simpler to have things done one stage at a time, with a clock synchronizing when results are read from one stage and sent to the next; different gates have different delays, so making sure results are only read after the slowest path is done is complicated). As a result, all CPU and GPU architectures nowadays are based on having a clock, with clock transitions dictating things like when each step of processing a CPU/GPU instruction starts.
Circuits without clocks have the potential to be way faster than circuits with clocks if you can manage the problem of different digital elements having different delays in producing results. I think what we’re seeing here is a revival of circuits without clocks (or at least of blocks of logic executed between clock transitions that are much longer and more complex than the processing of a single GPU instruction).
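To put rough numbers on the "slowest path" point, here is a minimal Python sketch. The gate delays and the three paths are made up purely for illustration, and it ignores real-world details such as register setup and hold times. It only shows that in a clocked design the longest chain of gates sets the clock period, so the faster paths sit idle until the next clock edge.

```python
# Hypothetical gate delays in nanoseconds; the numbers are illustrative only.
GATE_DELAY_NS = {"NAND": 0.10, "XOR": 0.25, "NOT": 0.05}

# Each (made-up) path is the chain of gates a signal crosses between two registers.
PATHS = {
    "fast": ["NOT", "NAND"],
    "medium": ["NAND", "NAND", "NOT"],
    "slow": ["XOR", "XOR", "NAND"],
}


def path_delay(gates):
    """Sum the propagation delays along one combinational path."""
    return sum(GATE_DELAY_NS[g] for g in gates)


delays = {name: path_delay(gates) for name, gates in PATHS.items()}

# A clocked design has to wait for the slowest path before latching results.
critical_path = max(delays, key=delays.get)
min_clock_period_ns = delays[critical_path]
max_clock_mhz = 1000.0 / min_clock_period_ns

# Time each path spends finished but waiting for the clock edge.
slack_ns = {name: min_clock_period_ns - d for name, d in delays.items()}

for name, d in delays.items():
    print(f"{name}: delay {d:.2f} ns, idle {slack_ns[name]:.2f} ns per cycle")
print(f"critical path: {critical_path}, max clock ~{max_clock_mhz:.0f} MHz")
```

The idle time per cycle is the part that a clockless (or more coarsely clocked) design could in principle reclaim, at the price of having to manage those differing delays some other way.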
Yes, but I’m not sure what your argument is here.
Least resistance to an outcome (in this case whatever you program it to do) is faster.
Applied to waterfall flows, FPGAs make absolute sense for neural networks as they operate now.
I’m confused about your argument against this and why a GPU would be better. The benchmarks are out in the world, go look them up.
I’m not making an argument against it, just clarifying where it sits as a technology.
As I see it, it’s like electric cars: a technology that came out first but was overtaken by something else in the early days of its domain (the first cars were electric and the internal combustion engine came later), and which now has a chance to be successful again because many other things have changed in the meantime and we’re a lot closer to the limits of the tech that did get widely adopted back then.
It actually makes a lot of sense to improve the speed of what programming can do by making it also capable of working outside the step-by-step instruction-execution straitjacket that is the CPU/GPU clock.
Is XDNA actually an FPGA? My understanding was that it’s an ASIC implementation of the Xilinx NPU IP. You can’t arbitrarily modify it.
Yep
Huh. Everything I’m reading seems to imply it’s more like a DSP ASIC than an FPGA (even down to the fact that it’s a VLIW processor) but maybe that’s wrong.
I’m curious what kind of work you do that’s led you to this conclusion about FPGAs. I’m guessing you specifically use FPGAs for this task in your work? I’d love to hear about what kinds of ops you specifically find speedups in. I can imagine many exist, as otherwise there wouldn’t be a need for features like tensor cores and transformer acceleration on the latest NVIDIA GPUs (since obviously these features must exploit some inefficiency in GPGPU architectures, up to limits in memory bandwidth of course), but also I wonder how much benefit you can get since in practice a lot of features end up limited by memory bandwidth, and unless you have a gigantic FPGA I imagine this is going to be an issue there as well.
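To make the memory bandwidth concern concrete, here is a rough roofline-style sketch in Python. The peak throughput, bandwidth, and arithmetic intensity figures are hypothetical placeholders, not measurements of any real GPU or FPGA. The only point is that attainable throughput is capped by whichever is lower: the compute ceiling, or the bandwidth multiplied by how many FLOPs the workload does per byte moved.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: the lower of the compute ceiling and the memory ceiling."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Hypothetical device figures, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 10_000.0
BANDWIDTH_GB_S = 500.0

# Below this intensity (FLOPs per byte) a workload is memory-bound on this device.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GB_S

# Hypothetical workloads with assumed arithmetic intensities.
workloads = {"low-intensity op": 0.25, "high-intensity op": 50.0}

for name, intensity in workloads.items():
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
    kind = "memory-bound" if intensity < ridge_point else "compute-bound"
    print(f"{name}: at most {bound:.0f} GFLOP/s ({kind})")
```

Under these assumed numbers, the low-intensity op cannot exceed 125 GFLOP/s no matter how much compute the chip has, which is why a faster compute fabric alone may not help if the memory system stays the same.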
I haven’t seriously touched FPGAs in a while, but I work in ML research (namely CV) and I don’t know anyone on the research side bothering with FPGAs. Even dedicated accelerators are still mostly niche products because in practice, the software suite needed to run them takes a lot more time to configure. For us on the academic side, you’re usually looking at experiments that take a day or a few to run at most. If you’re now spending an extra day or two writing RTL instead of just slapping together a few lines of Python that implicitly call CUDA kernels, you’re not really benefiting from the potential speedup of FPGAs. On the other hand, I know accelerators are handy for production environments (and in general they’re more popular for inference than training).
I suspect it’s much easier to find someone who can write quality CUDA or PTX than someone who can write quality RTL, especially with CS being much more popular than ECE nowadays. At a minimum, the whole FPGA skillset seems much less common among my peers. Maybe it’ll be more crucial in the future (which will definitely be interesting!) but it’s not something I’ve seen yet.
Looking forward to hearing your perspective!