How to build a dog behavior app with NatureLM-audio

Hi ESP team, I tested NatureLM-audio using the sample voices in the demo and it honestly made me even more curious.

I’m a consumer product builder, and I’m exploring how to build something that helps everyday people understand nature a bit better, starting with a very personal use case: my dog. I’m not trying to claim “translation” at all, more like learning patterns and context over time (play vs stress vs attention vs discomfort) from short recordings plus simple labels.

I’ve been reading through your Hugging Face releases and I’m trying to learn the right, grounded way to approach this. What would you recommend as a realistic beginner path to build a small app like this using your open models?
For example, should I start with an esp-aves2 encoder for embeddings + clustering/retrieval and only use NatureLM-audio later for analysis, or is there a better workflow you’d suggest?

If there’s a doc, repo, or community thread you’d point newcomers to, I’d really appreciate it. Happy to learn and contribute back if I build something useful.


Hi svs,

Apologies for the late response! It’s amazing that you’re exploring our models and repos for your own scientific questions and hypotheses!

Yes, absolutely, you should play around, test, and build with these models, with the caveat that they are first-generation models and more performance and speed gains will come over time!

We don’t have a recommended way to try these models out, but here are a few pointers:

  1. The embedding models are very good at classifying species and detecting the presence or absence of species, and they also handle some other tasks such as identifying individuals and call types. Some models are trained only on biological sounds (“bio” suffix in the name), some on general audio, and some on a mix (“all” in the name). An audio encoder gives you audio embeddings that you could then use for fine-tuning, vector-based search (audio similarity), or even mapping sounds between species conditioned on behavior (e.g. learn a linear map between a “dog’s angry bark” → “a bird’s alarm call”). However, whatever you do might require additional components, such as linear classifiers or predictors, that you train on your own data.

  2. With NatureLM you additionally (along with the fine-tuning capability) get some zero-shot capabilities and the ability to prompt with text. That’s an advantage when you want a model to do a number of tasks right out of the box without having to fine-tune. You could also do something more advanced and use the language model part of the model (Llama 3.1) to learn on a chain of multiple audio clips, which would allow you to predict outputs over several related clips (just thinking out loud).
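
To make point 1 a bit more concrete, here is a minimal sketch of vector-based search and of fitting a linear map between paired embeddings. It assumes the embeddings have already been extracted as fixed-length vectors; random vectors stand in for them, and no ESP model API is shown. The names `DIM`, `library`, `nearest`, `dog`, and `bird` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # hypothetical embedding size; real encoders output larger vectors

# Stand-in embeddings (assumption): in practice these come from an audio encoder.
library = rng.normal(size=(50, DIM))
library_labels = ["play" if i % 2 == 0 else "stress" for i in range(50)]


def normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def nearest(query, bank, k=3):
    """Return indices and cosine similarities of the k most similar bank rows."""
    sims = normalize(bank) @ normalize(query)
    top = np.argsort(-sims)[:k]
    return top, sims[top]


# Audio similarity: a slightly perturbed copy of clip 7 should retrieve clip 7.
query = library[7] + 0.01 * rng.normal(size=DIM)
idx, sims = nearest(query, library)
print([(int(i), library_labels[i]) for i in idx])

# Linear map: given paired embeddings (same behavior, two species),
# solve dog @ W ≈ bird in the least-squares sense.
dog = rng.normal(size=(200, DIM))
true_W = rng.normal(size=(DIM, DIM))  # synthetic ground truth for the demo
bird = dog @ true_W
W, *_ = np.linalg.lstsq(dog, bird, rcond=None)
print(np.allclose(W, true_W, atol=1e-6))
```

On real data the pairs would be noisy and the fit only approximate, so a held-out set of pairs would be needed to judge whether the map means anything.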


Hi gagannarula,

Thank you so much for the thoughtful response, this was incredibly helpful and gave me a much clearer direction.

I really liked your point about starting with embeddings and building structure before relying on higher-level reasoning. That helped me reframe how I was thinking about this.

Based on your guidance, I’m now thinking of starting with a very simple, consumer-facing use case inspired by something like “Shazam for animals.” The idea is not translation, but rather:

  • Record a short animal sound

  • Use embeddings to identify similarity against known patterns

  • Return a lightweight interpretation such as species (if possible) and a probable behavioral context (e.g., alert, contact, stress)

  • Let users correct or label the result so the system improves over time
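
As a rough sketch of the loop above, assuming each recording has already been turned into an embedding vector: a tiny in-memory index returns the majority label among the nearest stored clips, and a user correction is simply stored as a new labelled example. `SoundIndex`, the label names, and the random stand-in embeddings are all hypothetical.

```python
from collections import Counter

import numpy as np


class SoundIndex:
    """Tiny in-memory nearest-neighbour index over labelled embeddings."""

    def __init__(self):
        self.vectors = []
        self.labels = []

    def add(self, embedding, label):
        """Store a unit-normalized embedding with its (user-confirmed) label."""
        v = np.asarray(embedding, dtype=float)
        self.vectors.append(v / np.linalg.norm(v))
        self.labels.append(label)

    def predict(self, embedding, k=3):
        """Return the majority label among the k nearest clips and its vote share."""
        v = np.asarray(embedding, dtype=float)
        sims = np.stack(self.vectors) @ (v / np.linalg.norm(v))
        top = np.argsort(-sims)[:k]
        votes = Counter(self.labels[i] for i in top)
        label, count = votes.most_common(1)[0]
        return label, count / len(top)


# Stand-in data (assumption): two behavior clusters around random centers.
rng = np.random.default_rng(1)
centers = {"alert": rng.normal(size=8), "stress": rng.normal(size=8)}
index = SoundIndex()
for label, center in centers.items():
    for _ in range(5):
        index.add(center + 0.05 * rng.normal(size=8), label)

# A new recording: guess first, then let the user correct the guess.
clip = centers["alert"] + 0.05 * rng.normal(size=8)
guess, vote_share = index.predict(clip)
print(guess, vote_share)

index.add(clip, "contact")  # the user says it was actually a contact call
print(index.predict(clip, k=1))
```

The vote share is only a crude confidence signal; with few labelled clips it says more about how sparse the index is than about the sound.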

From there, I’m considering layering NatureLM on top as an explanation layer, mainly to:

  • Turn structured outputs into more natural, human-readable insights

  • Potentially reason over sequences of sounds (e.g., repeated patterns over time)
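
One way this could look, as a sketch only: collect the structured detections from a session, drop low-confidence ones, and render a short text summary that a language model could later be asked to explain. The `Detection` fields, the threshold, and the summary wording are all made up here, and how such text would actually be passed to NatureLM-audio is not shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    """One classified clip (hypothetical structure)."""
    minute: int        # minutes since the session started
    context: str       # e.g. "alert", "contact", "stress"
    confidence: float  # score from the upstream classifier, 0..1


def summarize(detections, min_confidence=0.6):
    """Count confident detections per context and render them as plain text."""
    kept = [d for d in detections if d.confidence >= min_confidence]
    counts = Counter(d.context for d in kept)
    lines = [f"- {context}: {n} clip(s)" for context, n in counts.most_common()]
    return "Detections in this session:\n" + "\n".join(lines)


session = [
    Detection(0, "alert", 0.9),
    Detection(2, "alert", 0.8),
    Detection(5, "stress", 0.4),  # below the threshold, so it is dropped
    Detection(9, "contact", 0.7),
]
summary = summarize(session)
print(summary)
```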

I’m planning to start small, likely focusing on a single domain (dogs or a limited set of common outdoor sounds) and building a small labeled dataset to experiment with embeddings, clustering, and retrieval before introducing any fine-tuning.

One thing I found especially interesting in your note was the idea of mapping sounds across species conditioned on behavior. That feels like a really powerful direction long-term, even if I start with a much simpler single-species version.

If you have any suggestions on:

  • good starter datasets for domestic animals (especially dogs), or

  • best practices for structuring labels (behavior vs call-type vs context),

I’d love to learn from anything you’d recommend.

Really appreciate the openness of your team and the work you’re doing here, it’s exciting to explore.

Best,
Vikas