---
base_model: Qwen/Qwen2-0.5B
pipeline_tag: text-generation
---

ANE-compatible stateful CoreML models. Maximum context length of 512.
Multifunction models that process 1 or 64 tokens.

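The two function variants are typically used for prompt prefill (64 tokens per call) and autoregressive decoding (1 token per call). A minimal sketch of how a prompt could be chunked to fit the 64-token function; the helper names here are illustrative, not this repository's API:

```python
# Sketch: split a prompt into fixed 64-token pieces, padding the last one,
# so each piece can be fed to a function that accepts exactly 64 tokens.
CHUNK = 64

def chunk_prompt(token_ids, pad_id=0):
    """Return (padded_chunk, n_valid_tokens) pairs covering the prompt."""
    chunks = []
    for start in range(0, len(token_ids), CHUNK):
        piece = token_ids[start:start + CHUNK]
        valid = len(piece)                       # real (non-padding) tokens
        piece = piece + [pad_id] * (CHUNK - valid)
        chunks.append((piece, valid))
    return chunks

# A 100-token prompt becomes one full chunk and one padded chunk.
chunks = chunk_prompt(list(range(100)))
```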
6-bit quantized models apply a grouped-per-output-channel LUT with group size 4.
For example, if the weights have shape (32, 64), the LUT has shape (8, 1, 36). ANE does not support per-input-channel grouping, and smaller group sizes are considerably slower, while larger group sizes are barely faster.

After LUT dequantization, a per-output-channel scaling is applied (it would have shape (32, 1) for the same example weights).
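The dequantization described above can be sketched in NumPy. Shapes follow the text's example (weights of shape (32, 64), group size 4 along the output-channel axis, giving 8 groups), but the LUT layout here is simplified to `(groups, entries)` with `2**6 = 64` entries per table for 6-bit indices; the exact CoreML storage layout may differ:

```python
import numpy as np

OUT, IN, GROUP, BITS = 32, 64, 4, 6
G = OUT // GROUP                          # 8 groups of output channels
rng = np.random.default_rng(0)

indices = rng.integers(0, 2**BITS, size=(OUT, IN))  # 6-bit code per weight
lut = rng.standard_normal((G, 2**BITS))             # one table per group
scale = rng.standard_normal((OUT, 1))               # per-output-channel scale

# Output channel o looks up values in the table of its group, o // GROUP.
group_of_row = np.arange(OUT) // GROUP              # shape (32,)
dequant = lut[group_of_row[:, None], indices]       # (32, 64) table lookup
weights = dequant * scale                           # per-output-channel scaling
```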

Quantization is not applied to the first and last layers, nor to the embeddings (the head weights are shared with the input embeddings).

Current issues:
- Input embeddings are duplicated: once for the input and once for the prediction head. Since ANE supports a maximum tensor dimension of `16_384`, the weights have to be split, which causes CoreML to duplicate them. It should be possible to remove the input embeddings and read the weights directly from the `weights.bin` file.
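The `16_384` limit above means a vocabulary-sized embedding matrix must be split along the vocabulary axis before it can run on the ANE. A toy illustration of that split (the sizes here are made up; real vocabularies and hidden sizes differ):

```python
import numpy as np

MAX_DIM = 16_384
vocab, dim = 40_000, 64                   # toy sizes, not the real model's
emb = np.zeros((vocab, dim), dtype=np.float16)

# Split the vocab axis into pieces no larger than MAX_DIM each.
splits = [emb[i:i + MAX_DIM] for i in range(0, vocab, MAX_DIM)]
```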

This model requires iOS 18 or macOS 15 to run, and the coremltools beta if running from Python (`pip install coremltools==8.0b2`).

An example of how to use the models can be found in `coreml_example.py` and can be run with the following command: `python src/coreml_example.py --model-path ./nbs/Qwen-2-1.5B-6Bits-MF.mlmodelc -p "Write a joke in a poem of Harry Potter" --max-tokens 200 --min_p 0.2 --temp 1.5`
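The `--temp` and `--min_p` flags correspond to temperature scaling and min-p filtering of the logits. A sketch of min-p sampling as it is commonly defined (tokens whose probability falls below `min_p` times the top probability are dropped before sampling); the example script's actual implementation may differ:

```python
import numpy as np

def sample_min_p(logits, min_p=0.2, temp=1.5, rng=None):
    """Temperature-scale logits, drop tokens below min_p * top prob, sample."""
    rng = rng if rng is not None else np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temp
    probs = np.exp(scaled - scaled.max())         # stable softmax
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()           # the min-p filter
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                          # renormalize survivors
    return int(rng.choice(len(probs), p=probs))

# The third token is far below min_p * max prob, so it is never sampled.
tok = sample_min_p([4.0, 3.9, -10.0], min_p=0.2, temp=1.5)
```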