In this video I will show you how to install ONNX Runtime with GPU support and run inference with a generative model. We will use Phi-3-mini-4k quantized to int4.
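As a preview of the inference step, here is a minimal sketch. It assumes the GPU build of the ONNX Runtime generate() API package is installed (`pip install onnxruntime-genai-cuda`, alongside `onnxruntime-gpu`) and that an int4 ONNX export of Phi-3-mini-4k (for example from the `microsoft/Phi-3-mini-4k-instruct-onnx` repository) has already been downloaded locally; the exact `onnxruntime_genai` calls and the folder name are assumptions and may differ between library versions:

```python
def generate(model_dir: str, prompt: str, max_length: int = 256) -> str:
    """Greedy generation with the onnxruntime-genai generate() API (sketch).

    model_dir is assumed to point at a local int4 ONNX export of
    Phi-3-mini-4k; adjust the path to wherever you downloaded the model.
    """
    # Deferred import so this sketch can be read/loaded without the package.
    import onnxruntime_genai as og

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)

    # Phi-3 chat template (assumed; check the model card for the exact format).
    full_prompt = f"<|user|>\n{prompt} <|end|>\n<|assistant|>"

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(full_prompt))

    # Token-by-token loop; stops at end-of-sequence or max_length.
    while not generator.is_done():
        generator.generate_next_token()

    return tokenizer.decode(generator.get_sequence(0))
```

A call might look like `generate("./cuda-int4-rtn-block-32", "What is ONNX Runtime?")`, where the directory name is just an example of how the int4 CUDA export is laid out on disk.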
After that, we will convert the original Phi-3-mini-128k into an int4 quantized version with ONNX Runtime.
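The conversion step can be done with the model builder that ships with the onnxruntime-genai package. A sketch of the command, assuming the package is installed and you have access to the Hugging Face weights; the flags and output paths shown here are examples and may vary between versions:

```shell
# Download microsoft/Phi-3-mini-128k-instruct and export it as an
# int4-quantized ONNX model targeting the CUDA execution provider.
# -p selects the precision, -e the execution provider, -c a cache directory.
python -m onnxruntime_genai.models.builder \
    -m microsoft/Phi-3-mini-128k-instruct \
    -o ./phi3-mini-128k-int4 \
    -p int4 \
    -e cuda \
    -c ./hf_cache
```

The resulting output directory can then be passed to the same generate() API used for inference above.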