The prerequisite for FP8 quantization and knowledge distillation is that you must first have a larger model trained at higher precision, which in turn demands more GPU memory and compute. You then use this larger model to guide the training of smaller models through distillation, and further cut their computational cost through quantization.
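To make the distillation part concrete, here is a minimal sketch of a typical distillation loss in PyTorch. The temperature, the alpha blend weight, and the teacher/student variable names are illustrative assumptions, not a specific library's API.

```python
# Minimal knowledge-distillation loss sketch (PyTorch).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend a soft loss against the teacher's outputs with the usual hard-label loss."""
    # Soft targets: the teacher's softened probability distribution guides the student.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Training loop fragment: the large teacher only runs inference (no gradients),
# while the small student is the one being trained.
# teacher.eval()
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits, labels)
```

The point is that the teacher still has to exist and fit somewhere at inference precision, which is exactly the memory requirement discussed next.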
You can work around the memory requirement by writing your own code to parallelize the memory-bottlenecked parts of the computation across multiple smaller GPUs instead of a single colossal computing unit, if for whatever reason you cannot buy the colossal one. However, not every computationally intensive operation can be parallelized this way; it depends on the model architecture. NVIDIA also uses high-bandwidth memory and direct memory access between CPU and GPU in their enterprise compute units, so a single big unit still offers better performance, and often better cost effectiveness, than a distributed setup.
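A toy sketch of that idea: a single large linear layer whose weights do not fit on one GPU is split column-wise across two devices, each GPU computes its slice, and the results are concatenated. Device names and tensor sizes are illustrative assumptions; a real setup would also have to worry about the cross-device transfer cost, which is part of why this does not work for every operation.

```python
# Column-wise sharding of one large matmul across two GPUs (toy example).
import torch

def sharded_linear(x, weight_shards):
    """x: (batch, in_features); weight_shards: list of (in_features, out_slice) tensors,
    each already placed on its own GPU."""
    partial_outputs = []
    for w in weight_shards:
        # Move the activations to the shard's device and do the partial matmul there.
        partial_outputs.append(x.to(w.device) @ w)
    # Gather the column slices back onto one device to form the full output.
    return torch.cat([p.to(weight_shards[0].device) for p in partial_outputs], dim=-1)

# Usage (assumes at least two GPUs are visible):
# w0 = torch.randn(4096, 2048, device="cuda:0")
# w1 = torch.randn(4096, 2048, device="cuda:1")
# x = torch.randn(8, 4096)
# y = sharded_linear(x, [w0, w1])  # shape (8, 4096), computed across both GPUs
```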
Modding existing GPU hardware with larger memory chips is also not unheard of, but applicability varies per card model; so far I have only heard of this being done on the Ampere generation. Unofficial mods may introduce additional issues and would also require patching the drivers or the AI software stack to resolve compatibility problems.
Bear in mind that these are workarounds. If the majority of users aren't using them, the overall ecosystem, whether closed source or open source, will not bother maintaining compatibility for them and will instead develop around whatever the most cost-effective hardware available to them happens to be. The additional burden falls on the smaller group of users to maintain compatibility themselves, with no guarantee that similar workarounds will still be possible on future hardware or software.
You may be able to avoid training a larger teacher model yourself if you can somehow gain access to one, for example through the ChatGPT API. You would have to decouple parts of the training process so you aren't hard-bottlenecked by API rate limits during training, which means finding a way to extract data from ChatGPT that can then be used for offline training in parallel across hundreds of GPUs.
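A sketch of that decoupling: instead of calling the teacher API inside the training loop, responses are collected ahead of time into a JSONL file that the offline training jobs can read in parallel. Here `query_teacher` is a hypothetical stand-in for whatever API client you use, and the rate limit and file name are illustrative assumptions.

```python
# Collect teacher responses offline, paced to stay under an API rate limit.
import json
import time

def collect_teacher_data(prompts, query_teacher, out_path="teacher_data.jsonl",
                         requests_per_minute=60):
    """Query the teacher API at a fixed pace and append (prompt, response) pairs to disk."""
    delay = 60.0 / requests_per_minute  # simple pacing against the rate limit
    with open(out_path, "a") as f:
        for prompt in prompts:
            response = query_teacher(prompt)  # hypothetical API call
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
            f.flush()  # make progress durable so collection can resume if interrupted
            time.sleep(delay)

# The resulting JSONL file becomes a static distillation dataset: the GPU cluster
# trains against it offline, so training throughput is never gated by API limits.
```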
For the layman, what this means is that at the forefront of AI, hardware performance is still a deciding factor. If you do not have a concrete idea of exactly where you are going, you cannot take an optimized shortcut to that destination; optimization should come last. However, there cannot be too many market leaders because the cost is enormous, so there is usually one leader at the top, like TSMC, followed by many smaller players that do not need EUV or DUV lithography for their products or services.
Not everyone sees this or agrees with it, but this AI revolution isn't propped up by consumer or industrial demand and dictated by market forces alone. In some ways the AI race between global superpowers mimics the space race during the Cold War, or the nuclear arms race even further back. In that case, funding and demand for performant hardware aren't driven solely by consumers or enterprises. For NVDA, of course, this also raises the possibility of competitors born out of necessity, even in America, though at this point NVIDIA still holds a very large moat in its software ecosystem.