Mitko Vasilev
mitkox
AI & ML interests
Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it.
Recent Activity
posted an update about 1 month ago
posted an update about 2 months ago
posted an update about 1 month ago
Post
3093
I run 20 AI coding agents locally on my desktop workstation at 400+ tokens/sec with MiniMax-M2. It's a Sonnet drop-in replacement in Cursor, Claude Code, Droid, Kilo, and Cline, peaking at 11k tok/sec input and 433 tok/s output, and it can generate 1B+ tok/m. All with a 196k context window. I've been running it with this config for 6 days now.
Today's peak performance held stable at 490.2 tokens/sec across 48 concurrent clients on MiniMax M2.
Z8 Fury G5, Xeon 3455, 4x A6K. Aibrix 0.5.0, vLLM 0.11.2.
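All of these agents speak the OpenAI chat-completions protocol, so wiring them up is mostly a matter of pointing them at the local gateway. A minimal sketch, assuming a vLLM/AIBrix endpoint on localhost:8000 and this served model id (both assumptions, not the exact config above):

```python
# Hedged sketch: any OpenAI-compatible coding agent can target the local
# vLLM/AIBrix gateway instead of a cloud API key. URL and model id are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2",  # assumed served model name
    messages=[{"role": "user", "content": "Refactor this function to be async."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```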
posted an update about 2 months ago
Post
4148
I just threw Qwen3-0.6B in BF16 into an on device AI drag race on AMD Strix Halo with vLLM:
564 tokens/sec on short 100-token sprints
96 tokens/sec on 8K-token marathons
TL;DR You don't just run AI on AMD. You negotiate with it.
The hardware absolutely delivers. Spoiler alert: there is exactly ONE configuration where vLLM + ROCm + Triton + PyTorch + drivers + the Ubuntu kernel all work at the same time. Finding it required the patience of a saint.
Consumer AMD for AI inference is the ultimate "budget warrior" play: insane performance-per-euro, but you need hardcore technical skills that would make a senior sysadmin nod in quiet respect.
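For flavor, here is a minimal sketch of what such a drag race can look like with vLLM's offline API; the prompts, the 8K-as-generated-tokens interpretation, and the settings are my assumptions, not the exact harness behind the numbers above:

```python
# Hedged sketch of a tokens/sec "drag race" using vLLM's offline API.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", dtype="bfloat16")

def race(prompt: str, max_tokens: int) -> float:
    t0 = time.perf_counter()
    out = llm.generate([prompt], SamplingParams(max_tokens=max_tokens, ignore_eos=True))[0]
    return len(out.outputs[0].token_ids) / (time.perf_counter() - t0)

print(f"sprint (100 new tokens):  {race('Summarize attention in one line.', 100):.0f} tok/s")
print(f"marathon (8K new tokens): {race('Write an essay on GPU memory hierarchies.', 8192):.0f} tok/s")
```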
posted an update about 2 months ago
Post
373
I have just vibe coded a feature for ODA on-device AI with MiniMax M2, running locally on my Z8 Fury - and holy silicon, this thing SLAPS!
TL;DR the nerd stuff
Specialized in coding and agentic work
60 tokens/sec
Ryzen AI is getting some serious ROCm 7.0.2 brain implants
One extra script to rule them all and bind them to my GPU
Vibe coding feature implementation that actually worked on the first try. I know, I'm scared too
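The "one extra script" above isn't published here, so this is only a hedged sketch of the idea: pin the runtime to a specific ROCm GPU via the environment before launching it. The server command and model path are placeholders, not the actual setup.

```python
# Hedged sketch, not the author's script: select the ROCm GPU, then launch the
# inference server with that environment. Command and model path are placeholders.
import os
import subprocess

env = os.environ.copy()
env["HIP_VISIBLE_DEVICES"] = "0"  # ROCm's analogue of CUDA_VISIBLE_DEVICES

cmd = ["llama-server", "-m", "/models/minimax-m2.gguf", "--port", "8080"]  # hypothetical
subprocess.run(cmd, env=env, check=True)
```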
posted an update 2 months ago
Post
1852
I'm just reading that the Ryzen AI 395 is supposed to be 30% slower than DGX Spark at LLM inferencing… and capped at 96GB of GPU RAM… good thing I didn't RTFM upfront, so I made the AMD faster with 128GB of unified RAM 🫡
Z2 mini G1a can run Qwen3 Coder 30B BF16 at 26.8 tok/sec in ~60GB GPU RAM
posted
an
update
2 months ago
Post
2785
Say hello to my little friends! I just unboxed this trio of HP Z2 G1a!
Three is always better than one!
3x AMD Ryzen AI Max+ Pro 395
384GB RAM
24TB of RAID storage
Ubuntu 24.04
ROCm 7.0.2
llama cpp, vLLM and Aibrix
Small, cheap GPUs are about to become the Raspberry Pi of edge AI inference. Sprinkle some kubectl fairy dust on top, and suddenly it's a high-availability, self-healing, cloud-native, enterprise-grade AI cluster camping in a closet.
Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it.
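From a client's seat, the closet cluster can look as simple as the sketch below: round-robin over the three boxes' OpenAI-compatible endpoints. The hostnames and model name are made up; in the actual Kubernetes setup a Service would do this job instead.

```python
# Hedged sketch: naive round-robin across three local inference boxes.
import itertools
from openai import OpenAI

ENDPOINTS = ["http://z2-a:8000/v1", "http://z2-b:8000/v1", "http://z2-c:8000/v1"]  # assumed hosts
clients = itertools.cycle([OpenAI(base_url=u, api_key="none") for u in ENDPOINTS])

def ask(prompt: str) -> str:
    client = next(clients)                 # poor man's load balancing across the trio
    resp = client.chat.completions.create(
        model="qwen3-coder-30b",           # assumed served model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

print(ask("Which node am I on?"))
```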
posted an update 2 months ago
Post
2810
I see all Chinese labs are turning TL;DR into TL;DRGB
Problem: 1M text tokens == 1M opportunities for your GPU to file workers' comp
Solution: don’t feed the model War & Peace—feed it the movie poster.
This is Glyph, Zai’s new visual-text compression voodoo:
• 10k words → 3 PNGs ≈ 3k visual tokens
• Compression ratio: 4.3×
• Throughput: 40-60 tok/s, i.e. your context window now finishes before my coffee does
So I did the only reasonable thing: asked GLM-4.6 to port Glyph for Qwen3-VL-8B-Thinking.
Translation: I made one model compress a novel into a comic strip, then made another model read the comic strip and still ace QA.
It’s basically passing notes in class, except the note is a 1920×1080 meme and the teacher is a transformer.
We've gone from "Attention is All You Need" to "Attention is Too Expensive, Just Use Your Eyes." Remember kids: in 2025 literacy is optional, but JPEG is forever.
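For the curious, here is a rough sketch of the render-then-read loop I mean (my own reconstruction, not the Glyph codebase or my port): rasterize a chunk of text into a PNG page and hand it to the local Qwen3-VL endpoint as an image. The endpoint, input file, page size, and font handling are all assumptions.

```python
# Hedged sketch of visual-text compression: text -> PNG page -> VLM question.
import base64, io, textwrap
from PIL import Image, ImageDraw, ImageFont
from openai import OpenAI

def text_to_png_b64(text: str, size=(1920, 1080)) -> str:
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    wrapped = "\n".join(textwrap.wrap(text, width=180))
    draw.multiline_text((20, 20), wrapped, fill="black", font=ImageFont.load_default())
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # assumed local endpoint
page = text_to_png_b64(open("chapter1.txt").read()[:6000])            # hypothetical input file
resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Thinking",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Answer from the page: who is the narrator?"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{page}"}},
    ]}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```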
posted an update 3 months ago
Post
5659
I’ve built my blocker for AI-generated content. It’s a local AI running on my laptop with a browser extension that classifies and scrubs synthetic content from my eyeballs. I’m too old for this synthetic noise.
TL;DR I’m going full John Connor on the AI content apocalypse
Think of it as an on device AI ad-blocker, but for:
Em-dash overdose. Seriously, why is everything suddenly revolutionary—disruptive—life-changing?
AI influencers’ auto-generated posts and images, auto-posted, all hands-free.
Fake news, fake images, fake people... puff.
Surprisingly, it works. I suppose it will block some human-generated content. However, I would rather read a 2007 Myspace blog than another “10 Growth Hacks Powered By ChatGPT” post.
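The blocker itself isn't published in this post; as a hedged illustration of the local-classifier half (the browser extension side, which just POSTs page text to it, is omitted), assuming a local OpenAI-compatible endpoint and a made-up served model name:

```python
# Hedged sketch of the local classifier: a real filter would want something
# calibrated rather than a YES/NO prompt, but this shows the shape of the idea.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # assumed endpoint

def looks_synthetic(text: str) -> bool:
    resp = client.chat.completions.create(
        model="local-classifier",  # assumed served model name
        messages=[
            {"role": "system",
             "content": "Answer with exactly one word, YES or NO: does the "
                        "following text read like auto-generated AI slop?"},
            {"role": "user", "content": text[:4000]},
        ],
        max_tokens=3,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

if __name__ == "__main__":
    print(looks_synthetic("Unlock 10 revolutionary growth hacks, powered by AI!"))
```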
posted an update 4 months ago
Post
394
Hermes4 70B synthetic dataset generation on my desktop Z8 GPU rig:
307 tok/sec
1.1M tok/hour
The bottleneck for generating massive, high-quality reinforcement learning datasets is never the GPU compute; it's always the model's willingness to actually answer the darn question.
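A rough sketch of what such a generation loop can look like, with the refusal problem handled by a crude string filter. The endpoint, served model name, and refusal markers are assumptions, not the pipeline that produced the numbers above.

```python
# Hedged sketch: stream prompts through a local endpoint and keep only
# answers that aren't refusals, appending accepted pairs to a JSONL file.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # assumed endpoint
REFUSALS = ("I'm sorry", "I can't", "I cannot", "As an AI")           # assumed markers

def generate(prompts, path="dataset.jsonl"):
    with open(path, "a") as f:
        for p in prompts:
            resp = client.chat.completions.create(
                model="Hermes-4-70B",  # assumed served model name
                messages=[{"role": "user", "content": p}],
                max_tokens=1024,
            )
            answer = resp.choices[0].message.content
            if answer.lstrip().startswith(REFUSALS):
                continue  # the real bottleneck, per the post: the model declining to answer
            f.write(json.dumps({"prompt": p, "response": answer}) + "\n")

generate(["Derive the closed form of the geometric series."])
```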
posted an update 5 months ago
Post
3463
I run Claude Code with Qwen3 Coder Flash locally on my MacBook Air. It works offline: zero cloud, zero internet, zero EU AI Act anxiety. No limits, and all tokens are on the house.
It's not great, not terrible: adequate performance for an on-device AI agent chewing through code on a 1.24 kg laptop. I wrote an interpreter to broker peace between Claude Code and my local AI runtime.
Make sure you own your AI. AI in the cloud is not aligned with you; it’s aligned with the company that owns it.
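The interpreter itself isn't published in this post, so the following is only a minimal sketch of the idea: accept Anthropic-style /v1/messages requests from Claude Code and forward them to a local OpenAI-compatible server. The endpoint URL, model name, and the non-streaming/no-tools simplification are all assumptions, not the author's implementation. Claude Code can typically be pointed at such a proxy via the ANTHROPIC_BASE_URL environment variable.

```python
# Hedged sketch of an Anthropic-to-OpenAI translation proxy.
# Run with: uvicorn proxy:app --port 4000  (hypothetical deployment)
from fastapi import FastAPI, Request
import httpx, uuid

LOCAL_BASE = "http://localhost:8080/v1"   # assumed llama.cpp / local runtime endpoint
LOCAL_MODEL = "qwen3-coder-flash"         # assumed served model name

app = FastAPI()

def _flatten(content):
    # Anthropic message content may be a plain string or a list of content blocks.
    if isinstance(content, str):
        return content
    return "".join(b.get("text", "") for b in content if b.get("type") == "text")

@app.post("/v1/messages")
async def messages(request: Request):
    body = await request.json()
    oai_messages = []
    if body.get("system"):
        oai_messages.append({"role": "system", "content": _flatten(body["system"])})
    for m in body.get("messages", []):
        oai_messages.append({"role": m["role"], "content": _flatten(m["content"])})
    async with httpx.AsyncClient(timeout=600) as client:
        r = await client.post(f"{LOCAL_BASE}/chat/completions", json={
            "model": LOCAL_MODEL,
            "messages": oai_messages,
            "max_tokens": body.get("max_tokens", 1024),
        })
    text = r.json()["choices"][0]["message"]["content"]
    # Wrap the reply back into the Anthropic Messages response shape.
    return {
        "id": f"msg_{uuid.uuid4().hex}",
        "type": "message",
        "role": "assistant",
        "model": body.get("model", LOCAL_MODEL),
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
        "usage": {"input_tokens": 0, "output_tokens": 0},
    }
```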
posted an update 5 months ago
Post
3458
XBai o4 claims to beat Claude Opus 4 and o3-mini, and they provide verifiable proof. My skepticism circuits overloaded, but my local AI FOMO module screamed louder.
I've thrown this 33B monoblock LLM onto a single GPU and used Roo Code for some… let’s call it “vibe testing”. It’s terrifyingly competent. As an architect, it’s the best open-weight model I’ve touched this side of 2025.
posted an update 5 months ago
Post
2591
We’ve reached a point where on device AI coding that is free, offline, and capable isn’t just a theoretical possibility; it’s sitting on my lap, barely warming my thighs.
My local MacBook Air setup includes Qwen3 Coder Flash with a 1M context and Cline in the VS Code IDE. No internet, no cloud, no ID verification: this is the forbidden tech.
Current stats:
All agentic tools work great: local, sandboxed, and MCP
OK model output precision
17 tokens/sec. Not great, not terrible
65K tokens context, the model can do 1M, but let’s be real, my MacBook Air would probably achieve fusion before hitting that smoothly
Standard backend and cache off for the test
All inference and function calling happen locally, offline, untethered. The cloud didn’t even get a memo.
posted an update 5 months ago
Post
2118
I run Qwen3-Coder 480B locally on my Z8, with a 1-million token context window. It’s the equivalent of parallel-parking a Nimitz-class carrier in a kiddie pool. Thanks to whatever dark pact the llama.cpp, CUDA, and kernel folks signed, hybrid inferencing + VRAM↔RAM offload let me stream the model’s synapses across Xeon, RAM, and four lonely A6000s without summoning either the OOM killer or a small house fire.
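As a hedged illustration of the hybrid-offload idea (not the exact llama.cpp invocation behind the setup above), llama-cpp-python exposes the relevant knobs: offload only the layers that fit into VRAM, split them across the four GPUs, and leave the remainder in system RAM. The model path, layer count, split ratios, and context size below are assumptions.

```python
# Hedged sketch of partial GPU offload with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/qwen3-coder-480b-q4_k_m.gguf",  # hypothetical quantized file
    n_gpu_layers=40,                         # offload only what VRAM allows; rest stays in RAM
    tensor_split=[0.25, 0.25, 0.25, 0.25],   # spread the offloaded weights over 4 GPUs
    n_ctx=131072,                            # large context; 1M needs far more memory
)
print(llm("Write a haiku about KV cache.", max_tokens=64)["choices"][0]["text"])
```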
posted an update 5 months ago
Post
285
I needed a distraction-free AI dev system. Omarchy-AI is an opinionated, minimalist, purpose-built OS layer for AI engineers. Arch Linux, stripped, hardened, and injected with pure, uncut AI developer ergonomics.
TL;DR Omarchy-AI. Vertical AF. One Job, Done Stupidly Well. Only cares about AI engineering.
It’s built on top of Arch Linux & Omarchy and further optimized for:
- Offline, on-the-go AI development. Yes, even on your gaming laptop, or freshly minted DIGITS
- Seamless shift to GPU server backends, because your PC shouldn’t train a 1T Kimi K2 model
- Pre-baked RAG pipelines, agentic workflows, and model fine-tuning
- Actual productivity to spend hours hacking local AI agents, not debugging uv conflicts.
How It Works (The Geeky Bits)
It’s One Curl Command to Rule Them All: it turns a vanilla Arch install into a batteries-included local AI dev beast with Hyprland, CUDA, llama.cpp, gcc, and every CLI tool you pretend to know about.
Hyprland: Because Your GPU Deserves Glam Shots. Picked it for the nihilists, tweakers, and keyboard cowboys. It’s an independent Wayland compositor that works great and has zero questions like “how do I get this pretty?”
Make sure you own your AI. AI in the cloud is not aligned with you; it’s aligned with the company that owns it.
posted an update 11 months ago
Post
3772
llama.cpp is 26.8% faster than ollama.
I have upgraded both, and using the same settings, I am running the same DeepSeek R1 Distill 1.5B on the same hardware. It's an Apples to Apples comparison.
Total duration:
llama.cpp 6.85 sec <- 26.8% faster
ollama 8.69 sec
Breakdown by phase:
Model loading
llama.cpp 241 ms <- 2x faster
ollama 553 ms
Prompt processing
llama.cpp 416.04 tokens/s with an eval time of 45.67 ms <- 10x faster
ollama 42.17 tokens/s with an eval time of 498 ms
Token generation
llama.cpp 137.79 tokens/s with an eval time of 6.62 sec <- 13% faster
ollama 122.07 tokens/s with an eval time of 7.64 sec
llama.cpp is LLM inference in C/C++; ollama adds abstraction layers and marketing.
Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it.
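For anyone who wants to reproduce the shape of this comparison, here is a rough sketch (not the original harness): both llama-server and ollama expose OpenAI-compatible endpoints, so the same request can be timed against each. The ports are the defaults; the model tags are assumptions.

```python
# Hedged sketch: time the same prompt against llama.cpp's server and ollama.
import time
from openai import OpenAI

TARGETS = {
    "llama.cpp": ("http://localhost:8080/v1", "deepseek-r1-distill-1.5b"),  # assumed tag
    "ollama":    ("http://localhost:11434/v1", "deepseek-r1:1.5b"),         # assumed tag
}
PROMPT = "Explain speculative decoding in two sentences."

for name, (base, model) in TARGETS.items():
    client = OpenAI(base_url=base, api_key="none")
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=256,
        temperature=0,
    )
    dt = time.perf_counter() - t0
    toks = resp.usage.completion_tokens
    print(f"{name}: {dt:.2f}s total, {toks} tokens, {toks / dt:.1f} tok/s")
```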
posted an update 11 months ago
Post
720
Stargate to the west of me
DeepSeek to the east
Here I am
Stuck in the middle with the EU
It will likely be only a matter of time before export controls land on frontier research and models on both sides, leaving us in a vacuum.
Decentralized training infrastructure and on device inferencing are the future.
posted an update 11 months ago
Post
571
On-device AI reasoning (ODA-R) using speculative decoding, with DeepSeek-R1-Distill-Qwen-1.5B as the draft model and DeepSeek-R1-Distill-Qwen-32B as the target. DSPy compiler for reasoning prompts in math, engineering, code...
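For readers new to the trick, here is a toy, self-contained sketch of a greedy speculative-decoding loop: the draft proposes a few tokens cheaply, the target verifies them, and at the first disagreement the target contributes its own token. The two "models" below are stubs, not the DeepSeek weights, and a real setup would plug in actual forward passes (and, typically, probabilistic rather than greedy acceptance).

```python
# Toy sketch of greedy speculative decoding with stub "models".
from typing import Callable, List

def speculative_decode(
    draft_next: Callable[[List[str]], str],   # small model: greedy next token
    target_next: Callable[[List[str]], str],  # big model: greedy next token
    prompt: List[str],
    k: int = 4,
    max_new: int = 16,
) -> List[str]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Draft proposes k tokens cheaply, conditioning on its own guesses.
        proposal: List[str] = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2. Target verifies the proposal position by position (greedy match).
        accepted = 0
        for i in range(k):
            if target_next(out + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        out += proposal[:accepted]
        # 3. The target contributes one token after the accepted prefix,
        #    so at least one token is produced per round.
        out.append(target_next(out))
    return out

# Stub "models": both follow a repeating pattern, but the target diverges every
# ninth token, so some proposals are only partially accepted.
draft = lambda ctx: "abcabcabc"[len(ctx) % 9]
target = lambda ctx: "abcabdabc"[len(ctx) % 9]
print("".join(speculative_decode(draft, target, list("ab"), k=4, max_new=10)))
```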