System requirements?
What are the system requirements to run this model, and how can I find them?
From reading the config, this is a float16 model. The Model Memory Estimator (https://huggingface.co/spaces/hf-accelerate/model-memory-usage) gives the following specs for WizardCoder 30B (as an LLM):
| dtype | Largest Layer or Residual Group | Total Size | Training using Adam |
|---|---|---|---|
| float32 | 2.59 GB | 125.48 GB | 501.92 GB |
| float16/bfloat16 | 1.3 GB | 62.74 GB | 250.96 GB |
| int8 | 664.02 MB | 31.37 GB | 125.48 GB |
| int4 | 332.01 MB | 15.68 GB | 62.74 GB |
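For context, those totals follow a simple rule of thumb: weights take parameter count × bytes per parameter, and the estimator's "Training using Adam" column is roughly 4× that (weights + gradients + two Adam moments). A quick back-of-envelope check (the ~33.7B parameter count is inferred from the float16 row, not an official figure for this checkpoint):

```python
# Back-of-envelope check of the table above. Inference memory is roughly
# parameter_count x bytes_per_parameter; Adam training is ~4x that.
# N_PARAMS is an assumption inferred from the float16 row.
N_PARAMS = 33.7e9
GIB = 2**30

for dtype, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1), ("int4", 0.5)]:
    inference_gib = N_PARAMS * bytes_per_param / GIB
    print(f"{dtype:>8}: ~{inference_gib:6.1f} GiB inference, "
          f"~{inference_gib * 4:6.1f} GiB Adam training")
```

This reproduces the table's Total Size column to within rounding (125.5 / 62.8 / 31.4 / 15.7 GiB).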
So if you pull this down at float16, you'll need roughly 63 GB of RAM to run it. I would love to quantize this to int8 so it could fit on a 4090 or an A6000, but I don't know how to do that right now.
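One common approach for the int8 question is 8-bit loading with bitsandbytes through transformers. A minimal sketch, with the repo id as a placeholder for whichever WizardCoder checkpoint you're using; note that the table's ~31 GB int8 footprint still exceeds a single 24 GB 4090 (an A6000's 48 GB would fit), so `device_map="auto"` may spill some layers to CPU:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/wizardcoder-30b"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # let accelerate place layers across GPU and CPU RAM
)

prompt = "Write a Python function that checks whether a string is a palindrome."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=128)[0]))
```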
I am able to run it on an M1 Max with 64GB. Not super fast, but it works:
```
llama_print_timings:      sample time =  1804.24 ms /   729 runs   (    2.47 ms per token,   404.05 tokens per second)
llama_print_timings: prompt eval time =  3652.04 ms /   144 tokens (   25.36 ms per token,    39.43 tokens per second)
llama_print_timings:        eval time = 94289.78 ms /   728 runs   (  129.52 ms per token,     7.72 tokens per second)
llama_print_timings:       total time = 100932.23 ms
Output generated in 101.16 seconds (7.20 tokens/s, 728 tokens, context 144, seed 1690939106)
Llama.generate: prefix-match hit
llama_print_timings:        load time =  3652.09 ms
llama_print_timings:      sample time =  2548.89 ms /  1024 runs   (    2.49 ms per token,   401.74 tokens per second)
llama_print_timings: prompt eval time = 13158.02 ms /   751 tokens (   17.52 ms per token,    57.08 tokens per second)
llama_print_timings:        eval time = 141916.85 ms /  1023 runs   (  138.73 ms per token,     7.21 tokens per second)
llama_print_timings:       total time = 159473.00 ms
Output generated in 159.71 seconds (6.41 tokens/s, 1024 tokens, context 886, seed 1686911609)
Llama.generate: prefix-match hit
llama_print_timings:        load time =  3652.09 ms
llama_print_timings:      sample time =   694.30 ms /   276 runs   (    2.52 ms per token,   397.52 tokens per second)
llama_print_timings: prompt eval time = 19746.02 ms /  1023 tokens (   19.30 ms per token,    51.81 tokens per second)
llama_print_timings:        eval time = 43975.35 ms /   275 runs   (  159.91 ms per token,     6.25 tokens per second)
llama_print_timings:       total time = 64842.96 ms
Output generated in 65.07 seconds (4.23 tokens/s, 275 tokens, context 1909, seed 828516400)
```
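For reference, those `llama_print_timings` lines are the standard output of llama-cpp-python. A minimal sketch of this kind of run; the model path and quantization level are placeholders, and at the time of these logs the file format was GGML (newer builds use GGUF):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./wizardcoder-30b.ggmlv3.q4_0.bin",  # placeholder path and quant
    n_ctx=2048,        # context window; the runs above go up to ~1909 tokens
    n_gpu_layers=-1,   # offload all layers to Metal on Apple Silicon builds
)

out = llm("Write a function that reverses a string.", max_tokens=1024)
print(out["choices"][0]["text"])
```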
