Update README.md
README.md CHANGED
@@ -11,8 +11,9 @@ tags:
 This model provides a few variants of
 [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) that are ready for
 deployment on Android using the
-[LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert)
-[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference)
+[LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert),
+[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference) and
+[LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM).

 ## Use the models

@@ -28,6 +29,16 @@ on Colab could be much worse than on a local device.*

 ### Android

+#### Edge Gallery App
+
+* Download or build the [app](https://github.com/google-ai-edge/gallery?tab=readme-ov-file#-get-started-in-minutes) from GitHub.
+
+* Install the [app](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) from Google Play.
+
+* Follow the instructions in the app.
+
+#### LLM Inference API
+
 * Download and install
 [the apk](https://github.com/google-ai-edge/mediapipe-samples/releases/latest/download/llm_inference-debug.apk).
 * Follow the instructions in the app.

@@ -45,31 +56,37 @@ Note that all benchmark stats are from a Samsung S24 Ultra with

 <table border="1">
 <tr>
-<th></th>
 <th>Backend</th>
+<th>Quantization</th>
+<th>Context Length</th>
 <th>Prefill (tokens/sec)</th>
 <th>Decode (tokens/sec)</th>
 <th>Time-to-first-token (sec)</th>
-<th>Memory (RSS in MB)</th>
 <th>Model size (MB)</th>
+<th>Peak RSS Memory (MB)</th>
+<th>GPU Memory (MB)</th>
 </tr>
 <tr>
-<td>
-<td>
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
+<td><p style="text-align: right">CPU</p></td>
+<td><p style="text-align: right">dynamic_int8</p></td>
+<td><p style="text-align: right">4096</p></td>
+<td><p style="text-align: right">166.50 tk/s</p></td>
+<td><p style="text-align: right">26.35 tk/s</p></td>
+<td><p style="text-align: right">6.41 s</p></td>
+<td><p style="text-align: right">1831.43 MB</p></td>
+<td><p style="text-align: right">2221 MB</p></td>
+<td><p style="text-align: right">N/A</p></td>
 </tr>
 <tr>
-<td>
-<td>
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
+<td><p style="text-align: right">GPU</p></td>
+<td><p style="text-align: right">dynamic_int8</p></td>
+<td><p style="text-align: right">4096</p></td>
+<td><p style="text-align: right">927.54 tk/s</p></td>
+<td><p style="text-align: right">26.98 tk/s</p></td>
+<td><p style="text-align: right">5.46 s</p></td>
+<td><p style="text-align: right">1831.43 MB</p></td>
+<td><p style="text-align: right">2096 MB</p></td>
+<td><p style="text-align: right">1659 MB</p></td>
 </tr>

 </table>

@@ -80,4 +97,5 @@ Note that all benchmark stats are from a Samsung S24 Ultra with
 * The inference on CPU is accelerated via the LiteRT
 [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads
 * Benchmark is done assuming XNNPACK cache is enabled
+* Benchmark is run with cache enabled and initialized. During the first run, the time to first token may differ.
 * dynamic_int8: quantized model with int8 weights and float activations.
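The hunks above route Android users to the MediaPipe LLM Inference API in addition to the prebuilt demo APK. As rough orientation, here is a minimal Kotlin sketch of loading one of these bundles through that API; the Gradle coordinate, file path, and parameter values are illustrative assumptions rather than part of the README, so check the LLM Inference documentation for the exact setup.

```kotlin
// Assumed dependency: com.google.mediapipe:tasks-genai (version per the MediaPipe docs).
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal sketch: run one of the DeepSeek-R1-Distill-Qwen-1.5B bundles with the
// MediaPipe LLM Inference API. The model path is a hypothetical location where
// the .task file was pushed (e.g. via `adb push`).
fun generateOnce(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/deepseek_r1_distill_qwen_1_5b.task") // hypothetical file name
        .setMaxTokens(1024) // illustrative input+output token budget
        .build()

    val llm = LlmInference.createFromOptions(context, options)
    return try {
        llm.generateResponse(prompt)
    } finally {
        llm.close() // release the interpreter and any accelerator memory
    }
}
```

For multi-turn or streaming use the same API exposes session and asynchronous variants; the single-shot call above is only meant to show where the downloaded bundle plugs in.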
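The new benchmark table compares the same dynamic_int8 bundle on the CPU (XNNPACK) and GPU backends at a 4096-token context length. If the tasks-genai version in use exposes a preferred-backend option, switching between the two looks roughly like the sketch below; this is an assumption about the API surface, and on older releases the backend is determined by how the .task bundle was converted.

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch only: choose the accelerator compared in the benchmark table.
// setPreferredBackend is assumed to exist in the tasks-genai version in use;
// older releases pick the backend from the converted bundle instead.
fun buildOptions(modelPath: String, useGpu: Boolean): LlmInference.LlmInferenceOptions =
    LlmInference.LlmInferenceOptions.builder()
        .setModelPath(modelPath)
        .setMaxTokens(4096) // matches the context length used in the table
        .setPreferredBackend(
            if (useGpu) LlmInference.Backend.GPU // GPU row: faster prefill, extra GPU memory
            else LlmInference.Backend.CPU        // CPU row: XNNPACK delegate with 4 threads
        )
        .build()
```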