I built a retrieval-augmented generation (RAG) system using a 4-bit quantized OpenLLaMA 13B model loaded directly from Hugging Face, without any fine-tuning.
I’m facing two main issues:
- After executing torch.save(model.state_dict(), 'filepath') to save the model locally, it saves as an adapter model, and I am unable to load it back or push it to Hugging Face.
- I want to host this on Hugging Face and use their Inference API. How do I properly configure everything?
Here is my setup code for model quantization:
import torch
import transformers

model_name = 'openlm-research/open_llama_13b'  # OpenLLaMA 13B

# 4-bit NF4 quantization with double quantization, bfloat16 compute
quant_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

auth_token = '*'

llm_config = transformers.AutoConfig.from_pretrained(
    model_name,
    use_auth_token=auth_token,
)

llm_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    config=llm_config,
    quantization_config=quant_config,
    device_map='auto',
    use_auth_token=auth_token,
)
llm_model.eval()
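To illustrate the first issue concretely, here is a minimal sketch using a tiny stand-in nn.Linear module (not the actual 13B model, and the filename is just a placeholder): torch.save on a state_dict stores only the raw tensors, so nothing in the resulting file describes how to rebuild the model.

```python
import torch
import torch.nn as nn

# Tiny stand-in module (the real model is the quantized OpenLLaMA 13B).
model = nn.Linear(4, 2)

# torch.save on a state_dict stores only the raw tensors...
torch.save(model.state_dict(), 'stand_in.pt')

# ...so reloading requires rebuilding the architecture by hand first:
reloaded = nn.Linear(4, 2)
reloaded.load_state_dict(torch.load('stand_in.pt'))

# The weights round-trip, but the file carries no config/architecture
# information that the Hugging Face Hub could use to reconstruct the model.
print(torch.equal(model.weight, reloaded.weight))
```

My understanding is that model.save_pretrained() would also write config.json alongside the weights, which is what the Hub expects, but I'm unsure how that interacts with the 4-bit quantization above.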
Any assistance would be appreciated!