I am working on a project to create a new model using the NatureLM dataset and the Qwen3-Omni model.
When it was released, I tested the ESP NatureLM-audio model, which is based on Llama 3.1. While Llama 3.1 is only a bit over a year old, it is “ancient” compared to what is available today. There is now Llama 4, plus many new multimodal models that have been trained on audio data.
I decided to try to create a new model based on one of the latest and greatest models available today, and selected Qwen3-Omni. It is only a few days old, has very high benchmark scores, is trained on audio, and has a better license than the Meta models. It is also much larger (30B vs 8B parameters) than the Llama 3.1 model that was used for the “original” ESP model.
I created and tested a LoRA, which worked, and then trained a full model using 1% of the 17 TB compressed NatureLM dataset. I am now tweaking things based on what I have learned so far, such as batch sizes and other parameters. I estimate that creating a full model from the full dataset will take 2-3 weeks.
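For anyone who wants to try a similar small-scale run first, here is a minimal sketch of one way to pick a repeatable ~1% subset of a sharded dataset. This is an illustration only, not necessarily what my repo does: the shard names, the `in_subset` helper, and the `seed` string are all made up for the example. The idea is to hash each shard name and keep it if the hash lands in the lowest 1% of the range, so the same shards are chosen on every run without storing a list.

```python
import hashlib

def in_subset(shard_name: str, fraction: float = 0.01, seed: str = "v1") -> bool:
    """Return True if this shard belongs to the deterministic subset.

    Hypothetical helper: hashes the shard name (plus a seed string) and
    keeps the shard when the hash falls below `fraction` of the 64-bit range.
    """
    digest = hashlib.sha256(f"{seed}:{shard_name}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    threshold = int(fraction * 2**64)            # cutoff for the chosen fraction
    return bucket < threshold

# Made-up shard names, standing in for whatever the real dataset layout is.
shards = [f"shard-{i:06d}.tar" for i in range(10_000)]
subset = [s for s in shards if in_subset(s)]
print(len(subset), "of", len(shards), "shards selected")
```

Changing `seed` gives a different subset of about the same size, and raising `fraction` later (say to 0.05) keeps every shard that was already selected, which is handy when scaling up in steps.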
I have a “source” code repo available, but be warned, it is kind of messy at the moment…
Any and all suggestions welcome, especially if you see I’m going down the wrong path!
I should also note that I have been a Linux/Unix sysadmin for 30 years and know a bit about AI, but little about actual interspecies communication…
Happy hacking,
-Jeff