How to properly save and deploy quantized LLM with RAG system to Hugging Face hub?

I built a retrieval-augmented generation (RAG) system using a 4-bit quantized OpenLLaMA 13B model loaded directly from the Hugging Face Hub, without any fine-tuning.

I’m facing two main issues:

  1. After executing torch.save(model.state_dict(), 'filepath') to save the model locally, the resulting file contains only adapter-style weights. I am unable to load it back or push it to the Hugging Face Hub.

  2. I want to host this on Hugging Face so I can use their Inference API. How do I properly configure everything?

Here is my setup code for model quantization:

import torch
import transformers

# 4-bit NF4 quantization with double quantization and bfloat16 compute
quant_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_name = 'openlm-research/open_llama_13b'  # OpenLLaMA 13B repo on the Hub
auth_token = '*'  # Hugging Face access token (redacted)

llm_config = transformers.AutoConfig.from_pretrained(
    model_name,
    use_auth_token=auth_token
)

llm_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    config=llm_config,
    quantization_config=quant_config,
    device_map='auto',
    use_auth_token=auth_token
)
llm_model.eval()

Any assistance would be appreciated!

Totally understandable — torch.save can be tricky with quantized models. A bare state_dict dump does not include the model config, and if the model is wrapped with PEFT/LoRA adapters it may capture only the adapter weights rather than the full model, which is why you end up with an "adapter model" you can't reload. Use model.save_pretrained() (together with tokenizer.save_pretrained()) instead: it writes the config and weights in the layout the Hub expects, and you can then upload with push_to_hub(). Note that serializing 4-bit bitsandbytes weights requires reasonably recent transformers and bitsandbytes versions; if yours doesn't support it, push the original full-precision model files to the Hub and apply the quantization config at load time instead. Also be aware that the hosted serverless Inference API may not run bitsandbytes-quantized checkpoints, so for your second issue you may need a dedicated Inference Endpoint or to host the un-quantized model.
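A minimal sketch of that save-and-push flow, assuming `llm_model` and a matching tokenizer are already loaded as in your setup code. The function name `save_and_push`, the local directory `llm-4bit`, and any repo id you pass in are placeholders, not an official API:

```python
def save_and_push(model, tokenizer, repo_id, local_dir="llm-4bit"):
    """Save the full model (config + weights) locally, then upload to the Hub.

    Unlike torch.save(model.state_dict(), ...), save_pretrained() writes
    config.json alongside the weights, so from_pretrained() can reload it,
    and push_to_hub() uploads in the layout the Hub expects.
    """
    # Write config + weights + tokenizer files to a local folder
    model.save_pretrained(local_dir)
    tokenizer.save_pretrained(local_dir)

    # Upload to your Hub repo (requires `huggingface-cli login` or a token)
    model.push_to_hub(repo_id)
    tokenizer.push_to_hub(repo_id)
```

If save_pretrained() raises an error about 4-bit weights not being serializable, upgrade transformers/bitsandbytes, or push the un-quantized base model and pass your BitsAndBytesConfig at load time instead.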