Hunyuan-Mamba hybrid out 2025-02 - Where is the Mistral-xLSTM hybrid (DeepSeek moment)?
Title says it all.
I'm sick and tired of hearing about xLSTM's superiority without practical proof, and of the new (sub)symbolic-AI hot air around 33 million in funding for FIVE YEARS, while everybody else has had tool use and reasoning for months.
@gue22 I have successfully fine-tuned the model on a dataset to teach the ChatML format, and the training loss and mean token accuracy look good, but I still have a lot of difficulties with inference because of the issues mentioned below:
https://huggingface.co/datasets/John6666/forum1/blob/main/xlstm_1.md (thanks @John6666 for this)
The difficulty is making the experiment reproducible with the latest PyTorch and Transformers. Once I have something reproducible, I'll publish the weights.
@John6666 https://huggingface.co/ethicalabs/xLSTM-7b-Instruct
A fine-tuning recipe will follow.
Does anyone want to try fine-tuning this initial model for tool usage?
The PEFT adapter has been fine-tuned on only 25k samples: https://colab.research.google.com/gist/mrs83/342d23c8bcceae22384c96d960aa62ac/xlstm-7b-instruct-text-generation-test.ipynb
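For a quick local generation test, something along these lines should work; note that the base checkpoint (NX-AI/xLSTM-7b) and the assumption that the repo hosts a PEFT adapter rather than merged weights are my guesses, so adjust accordingly:

```python
# Minimal sketch: load the base xLSTM-7b plus the instruct adapter and run a
# ChatML-style generation test. Base checkpoint and adapter layout are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "NX-AI/xLSTM-7b"                   # assumed base checkpoint
ADAPTER_ID = "ethicalabs/xLSTM-7b-Instruct"  # adapter repo from this thread

tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)
model = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_ID)
model.eval()

messages = [{"role": "user", "content": "List three uses of an xLSTM model."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

If the repo actually hosts merged weights instead, loading ADAPTER_ID directly with AutoModelForCausalLM and skipping PeftModel should be enough.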
@gue22 and, long story short, no funds here. I am GPU-poor and literally using 50 EUR of rented GPU compute to run all of these experiments. Frustrating? Yes. But who cares, someone has to do the work, right? A suggestion for VCs: full-remote async teams are efficient and cost less than 33M.
Hey Massimo, sign of life after 7 months!
Actually I've given up on Sepp & Co.'s JKU LLM ventures. (I know him and his team from there and from what used to be the bioinformatics institute.) Dunno why xLSTM never took off.
Your GH profile says you are in Berlin!? You can easily reach me if you get in touch within a reasonable timeframe.
Cheers
G.
Yes @gue22, I am in Germany right now. Let's connect, as I am also exploring Federated Learning on post-transformer models. The budget is limited, but that could be an advantage, forcing us to optimise before upscaling.
Sorry if I don't look into any of your specifics, but I won't lift another finger until at least one of Sepp's team is consistently on board.
I think nobody ever replied here, and from what I see in their GitHub issues, this makes no sense either.
Now that we are two I could go bang on their door at JKU next week or so.
Cheers
G.
No problem. I think the original creators have done great work that deserves more recognition from the international community.
I'm just a software engineer who experiments with machine learning, and I believe fine-tuning xLSTM specifically for tool usage could be a game-changer.
My small experiment was just SFT, avoiding DPO or reasoning/thinking training, as the goal isn't another generalist model but an experiment in building a specialised and reliable tool.
If the open source community leverages platforms like HF to push these alternative architectures, we can build a truly competitive and sustainable ecosystem that might not need gigafactories at all.
I've cross-linked this to Elliott Zhen's unanswered question on GitHub.
More after this beautiful afternoon autumn weather.
Cheers
G.
Awesome. I hope the open-source community can give more attention to this little gem. Let's wait for feedback.
I used LSTMs for text generation years ago, around the time the first GPT model was released, but I was never able to produce something usable for conversation due to my limited knowledge at the time.
The early GenAI years were wild; many people started to learn TensorFlow (and later PyTorch) just for fun.
After the hype, many of us outside academic circles went back to work to pay the bills, also because fine-tuning on a local or cloud machine was no longer feasible, especially without large financial means.
Until recent years... So, let's keep our fingers crossed 🤞
Notebook for Chat Template SFT: https://colab.research.google.com/gist/mrs83/644930e57e8cc8a36a80c911292d5d3a/xlstm_finetuning_colab_cudo.ipynb
For the next possible SFT iterations, assistant_only_loss=True must be set, together with a conversational or tool-calling dataset format. Ref: https://huggingface.co/docs/trl/dataset_formats#conversational
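Roughly what that next iteration could look like with TRL (dataset, LoRA values and hyperparameters are placeholders; it needs a recent TRL and a chat template that marks assistant turns with {% generation %} blocks):

```python
# Sketch of a conversational SFT run with loss restricted to assistant turns.
# Dataset and hyperparameters are placeholders, not the recipe used in this thread.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Any dataset in the conversational ("messages") format works here.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:25000]")

config = SFTConfig(
    output_dir="xlstm-7b-instruct-sft",
    assistant_only_loss=True,        # compute loss only on assistant tokens
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
)

trainer = SFTTrainer(
    model="NX-AI/xLSTM-7b",          # assumed base checkpoint
    args=config,
    train_dataset=dataset,
    peft_config=LoraConfig(
        r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM"
    ),
)
trainer.train()
```

The same setup should extend to tool usage by swapping in a dataset that carries a tools field, as described in the TRL dataset-formats doc linked above.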
I'm not enough of an expert to dare grade xLSTM. I hoped it would take off, but got frustrated with the desert here and no visible usage. Dunno what all his PhDs do.
Two years ago I assembled a Xeon 3435X with 256 GB RAM and 2 x 20 GB RTX 4000 Ada cards and bought a 36 GB MacBook Pro M3 Max to run local models. I have a patch to run xLSTM on the two GPUs. I almost bought an RTX Pro 6000 Blackwell Max-Q last week, but the payment got stuck at Klarna and I reconsidered.
After a second career as a tech writer at Dynatrace from age 60 to 65, I studied bioinformatics at Sepp's institute from 2016 until Covid put an end to it in 2020.
They always claimed xLSTM was superior to transformers and even Mamba. I recently saw Mamba 2 used in a Chinese model; I don't remember which one off the top of my head.
I will try to get some insight at the institute ASAP.
Cheers
G.
Sepp's take on his ideas was always that the software must stay in Europe, but I never understood the point of that with open source.
Everybody else (except Bengio) sold themselves for billions and fancy HW (which they could share with contributors).
If there were more life and practicality in the project, I would have loved to see quantization, but in its current state, given its appearance in early 2025 and the competition, this thread is my maximum investment.
What kind of HW would you need for your tweaks? Am I terribly wrong if I guess an RTX Pro 5000 (Blackwell, 48 GB) would do for the model's size and your tasks?
Edit: Its 7B parameters are too large for one of my 20 GB RTX 4000 Ada cards, but they may fit into the 32 GB of an RTX 5090? (I just wondered what you were using in your Colab, and I see an A100.)
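A rough back-of-envelope for the weights alone (activations, the recurrent state and any fine-tuning overhead come on top, which is why a 20 GB card gets tight even when the raw number looks like it fits):

```python
# Weight-memory estimate for a 7B-parameter model at different precisions.
# Weights only; activations, recurrent state and optimizer memory are ignored.
PARAMS = 7e9
for name, bytes_per_param in [("fp32", 4), ("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1024**3
    print(f"{name}: ~{gb:.1f} GB of weights")
# bf16 comes out at roughly 13 GB of weights, so 32 GB (RTX 5090) or
# 48 GB (RTX Pro 5000) leaves headroom, and the A100 in the Colab is comfortable.
```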
Edit: My own AI focus has changed from genomics/med (too heavily regulated, ...) to coding, and I doubt whether current-state xLSTM would be any good compared to other recent developments.
Cheers
G.
xLSTM-7b was chosen months ago for a benchmark on federated/distributed SFT with Flower Labs, but at that time it was hard to make the experiment reproducible. For reference: https://arxiv.org/abs/2506.02961
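The general pattern for that kind of setup, sketched very roughly (this is not the exact setup of the paper; train_one_round is a placeholder for the local SFT loop), is to let each client train only the PEFT adapter and ship just those weights through Flower:

```python
# Sketch of federated SFT where only the PEFT adapter weights travel over the
# network. Placeholder logic; not the configuration used in the referenced paper.
from collections import OrderedDict
import torch
import flwr as fl
from peft import get_peft_model_state_dict, set_peft_model_state_dict


class AdapterClient(fl.client.NumPyClient):
    def __init__(self, model, train_one_round):
        self.model = model                       # PEFT-wrapped xLSTM-7b
        self.train_one_round = train_one_round   # runs one local SFT round, returns sample count

    def get_parameters(self, config):
        sd = get_peft_model_state_dict(self.model)
        return [v.cpu().numpy() for v in sd.values()]

    def set_parameters(self, parameters):
        keys = get_peft_model_state_dict(self.model).keys()
        sd = OrderedDict(zip(keys, (torch.tensor(p) for p in parameters)))
        set_peft_model_state_dict(self.model, sd)

    def fit(self, parameters, config):
        self.set_parameters(parameters)
        num_examples = self.train_one_round(self.model)
        return self.get_parameters(config), num_examples, {}

    def evaluate(self, parameters, config):
        self.set_parameters(parameters)
        return 0.0, 1, {}  # placeholder; plug in a real held-out evaluation


# fl.client.start_numpy_client(server_address="127.0.0.1:8080",
#                              client=AdapterClient(model, train_one_round))
```

Keeping the payload to adapter weights keeps per-round communication in the tens-to-hundreds of megabytes instead of the full multi-gigabyte 7B checkpoint.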
