[Invitation to open collaboration] Practice what you learn in the course and help animal researchers! 🐵

I would like to invite you to an open collaboration on studying animal vocalizations. Why is this important?

If we can show that animals have a language, if we can start getting a glimpse into how they communicate and what they say, that could constitute a pivotal moment for how humanity approaches nature and how we treat animals.

But arriving at this goal will not happen overnight. The path there leads through investigating ways of working with animal vocalizations, and we are only getting started.

My proposition is this: if you would like to help in this endeavor, or if you are looking for an interesting project to apply your learnings from the course, please consider joining me in open_collaboration_on_audio_classification.

The link above leads to a starter notebook where I walk you through the first dataset we will work on. It is a compilation of 7285 macaque coo calls from 8 individuals. Many believe that being able to identify the speaker is a necessary prerequisite for language. Can you train models that identify which call originated from which individual?

In the notebook, I walk you through all the steps necessary to load the data and train a simple CNN model. With just 16 seconds of fine-tuning the pretrained CNN, we get to an error rate of 6.1%! Can you improve on this result? Can you share with others interesting ways of working with the data, or some insights into the dataset?

We are only getting started on this work - there is so much that can be done. There is immense value in publicly available code that researchers and students in the field could refer to. Also, I have never attempted an open collaboration like this, so I would really appreciate it if you would join forces with me on this.

The dataset itself is interesting - it could potentially be the MNIST of audio research. You can train on it to great success using any modern GPU. If it proves too easy, I’ll look for the Imagenette and ImageWoof audio equivalents :slight_smile: For now, I feel there is still a lot one could explore here.

It would be an honor if you joined me on this journey, so that we can learn together and maybe help bring a positive change to the world.

I realize decoding animal communication might seem like a very challenging goal. But there are a lot of reasons to be optimistic about our chances. If you would like to learn more about how we plan to tackle this, please check out The Earth Species website. I also invite you to listen to an NPR Invisibilia episode, Two Heartbeats a Minute, that aired a couple of days ago.


@radek, I would love to work on this! I have never worked on classification tasks other than on actual images, and using audio is something I would love to try - as Jeremy mentioned in lecture 1 as well! I’ll go through the starter notebook and see if I can use some of the insights I have gained, and will gain, from the course and fastbook!

Will definitely check them out! :slight_smile:


Hmm, maybe my dogs will tell me to stop troubling them :D
This is super cool. Will surely try this out - thanks for setting up the starter notebook. The data collection process for this must have been interesting.


Hey @radek, sounds interesting. I have done a couple of computer vision tasks previously, but I don’t have much experience with audio analysis.
I would love to contribute and apply whatever we gain from the course.
Will definitely go through the starter notebook - thanks for sharing the project :smiley:


I remember listening to this podcast:

They talked about a project to monitor the population of elephants in the rainforests of Malawi in East Africa, where you essentially cannot see them because the forest is so dense.
So they went there, put microphones all over the forest recording 24/7 for 6 months, and then picked them up.
Having so much audio on their hands, they used deep learning to sift through the recordings and identify and isolate the moments where you could hear an elephant pass by - and also (sadly) gunshots.
It also helped them monitor the elephants’ movements through the rainforest geographically.
And they were trying (maybe they have achieved it since) to identify the different elephant individuals from their sound.

I don’t know if their dataset is openly available, but I would love to contribute to such a project because this subject got to me and I really hoped I could help.
I don’t know whether reaching out to them and offering our help is a done thing, but that would be awesome.


What a nice project :slight_smile: . I will start working on it.


Sounds fascinating! Thx for sharing the episode.

I recently learned about the false color spectrogram technique.


You grab recordings over periods of months and, through the visualization, you can pick up when certain animals arrive and leave (assuming they produce enough vocalizations - not sure how often elephants make calls). Extremely fascinating that this can be done, and it could be useful if you do end up working on this project. I also saw a system running a resnet50 on real-time streaming data to identify whether there are whales nearby, to warn ships and prevent collisions. This is very similar to what we are trying to do in this repository and leverages techniques we will learn about in the course :slightly_smiling_face:

People in the field seem extremely open, so by all means you could track them down and ask - no harm done :slight_smile: Even if they were not able to share their data with you, I am sure they would love hearing from you and learning that you found their work inspiring :slight_smile:


Okay, I’ll try to find their contact; in the meantime I’ll start practicing on your notebook. I haven’t had any experience with DL applied to audio.

But I think you’re absolutely right - DL could enable us to connect better with the animal world. And maybe one day the plant world.

I have a question about the “understanding” part of the project. Since we mostly know nothing about their language, what are the DL or RL techniques that can classify a dataset without a set of classes provided in advance?
That could apply to identifying the different individuals of an elephant population without knowing how many of them there are.


I will outline just one possible scenario which I think could be quite interesting. Ideally, you would want to have some labeled data (it could be for another herd of elephants). If you trained a classifier on this data, identifying elephants in that herd based on their calls, you could then run the classifier on the other dataset (where the number of elephants is not known). You could extract descriptors from the model (this is the output of the CNN part of the network) and try running a clustering algorithm on them, such as DBSCAN. Once you have the descriptors, there are a lot of things you could do (dimensionality reduction techniques such as UMAP so you can plot the data, running simpler clustering algorithms with variable numbers of clusters and evaluating the results, etc.).
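To make the clustering step concrete, here is a minimal sketch of the idea using scikit-learn, with synthetic stand-ins for the CNN descriptors (the real embeddings would come out of your trained model; all names and numbers here are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Stand-ins for CNN descriptors: two 'individuals', 50 calls each,
# embedded in a 32-dimensional space (synthetic data, not real calls)
emb_a = rng.normal(loc=0.0, scale=0.3, size=(50, 32))
emb_b = rng.normal(loc=5.0, scale=0.3, size=(50, 32))
embeddings = np.vstack([emb_a, emb_b])

# DBSCAN finds the number of clusters on its own; -1 marks noise points
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(embeddings)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # how many 'individuals' the clustering recovers
```

With real embeddings the separation will be far less clean, so `eps` and `min_samples` would need tuning - comparing settings with something like the silhouette score can help.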

This is off the top of my head - maybe there has been some interesting research into elephant vocalizations, or into a similar problem, that could provide pointers.

Actually, now that I think of it, you could simulate a similar scenario with the macaque dataset. You could train a classifier on, say, recordings of 4 macaques, and then run it to extract the embeddings of the remaining four. You could then see whether they cluster nicely into 4 clusters (the remaining number of macaques) or whether the results are all over the place. This would be a slightly more involved project, but I feel it can definitely be done and would be very interesting :slight_smile:

We can already do translation between human languages in an unsupervised way - this was extremely fascinating to me when I learned about it. There is a lot of good information on this in the NPR podcast. We also have some further details on our website and a technical roadmap on github listing some of the challenges we anticipate :blush:


Nice to hear from people with the same fascinations. In the coming weeks I will be working on a system to detect and identify migration of birds during the night with audio. In the long run I would like to develop a system for real time monitoring of bird migration. I think I can learn a lot from you all. Looking forward.


This is super interesting! We do a little bit of audio classification for work. Sharing some newer and hopefully helpful resources:

The following is YAMNet, a network trained to classify 521 different events. The classes are a subset of Google’s full AudioSet dataset. The model is in TF 1.*, so it is not directly applicable for fastai, but in my experience the network works really well as a feature extractor or first-stage classifier:

There is also VGGish from the same team. This might be more applicable since it is directly trained to generate embeddings that you can then use for some downstream task:

Does anyone have any good pointers for converting tensorflow models to pytorch weights? Most of my work with the nets above has been in TF, but I would love to convert them.


@radek would you please post the value for URLs.MACAQUES? It has not yet merged into fastai2. Thanks.

[EDIT] Never mind. Here it is.
MACAQUES = 'https://storage.googleapis.com/ml-animal-sounds-datasets/macaques.zip'
path = untar_data(MACAQUES)


@Pomo you can also install the dev version to get the latest updates (see the v2 FAQ for how to do so)


Okay, all of this is super super interesting,
I’ll try to dig into this in the forthcoming week.


The amount of positive energy in this thread is amazing! :blush: I think we will be able to do some really cool things if you hang around these parts.

What would be really helpful right now is cloning the repository and running the introductory notebook. If you could please walk through the notebook, and if anything is unclear give me a shout - I'm more than happy to answer any of your questions.

Once you are done with this, there is a template I created for training your models. It contains all of the necessary functionality up to the point where you get a DataLoaders object. You can use this as a starting point to train any CNN model (as you can see in the notebook, dls.show_batch() outputs a set of images - we converted our audio to images for training!).
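For anyone curious what the audio-to-image conversion amounts to under the hood, here is a minimal numpy sketch of a magnitude spectrogram (the notebook itself may use a different transform, such as a mel spectrogram - this is just the basic idea):

```python
import numpy as np

def spectrogram(signal, n_fft=512, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform.

    Returns an array of shape (n_fft // 2 + 1, n_frames) that can be
    rendered and saved as an image."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    stft = np.fft.rfft(np.stack(frames), axis=1)
    return np.abs(stft).T

# A toy 'call': one second of a 440 Hz tone at 24414 Hz (the macaque sample rate)
rate = 24414
t = np.arange(rate) / rate
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (frequency bins, time frames)
```

Stacking the frequency bins over time yields a 2D array that can be displayed as an image and fed to an ordinary image CNN.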

You could copy the template, rename it, and then train your model there. It would be interesting to see what results you get. Once you are done, save the notebook, create a git commit, and open a pull request against my repository. There is a howto from github you can find here.
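If the git part is new to you, the core of the workflow looks roughly like this (the branch and file names below are just examples - substitute your own):

```shell
# Create a branch for your experiment
git checkout -b my-macaque-experiment

# Stage and commit your renamed copy of the template notebook
git add my_experiment.ipynb
git commit -m "Train a CNN on the macaque coo calls"

# Push the branch to your fork of the repository,
# then open the pull request in the GitHub web UI
git push origin my-macaque-experiment
```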

Using git is ridiculously unintuitive - it is a command line tool created by Linus Torvalds, the original creator of Linux. But there are a lot of nice tutorials out there, and it is one of the most valuable tools in any professional's toolbox.

As you can see, what we are doing is essentially using the workflow of a distributed team. We have one central place to ideate (this thread), we each go do our own piece, and we use a technological solution (git, which is used nearly everywhere) to combine our results.

Once you are done with this, the next step might be trying to change the get_x function to process the sound in a different way. Or, for those looking for a challenge (we will learn how to do this in a couple of lectures, so maybe park the idea for now), you could build a model directly on sound represented as a sequence of numbers! The concepts useful here, which we will be learning about, are 1d convolutions and RNNs (and maybe dilated convolutions).
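To give a flavor of what a 1d convolution does to raw audio, here is a tiny numpy sketch - a hand-picked two-tap kernel rather than a learned one, but the sliding-window operation is the same one a 1d CNN layer would learn:

```python
import numpy as np

# One second of a 440 Hz tone at 24414 Hz (the macaque recordings' sample rate)
rate = 24414
t = np.arange(rate) / rate
waveform = np.sin(2 * np.pi * 440 * t)

# A two-tap difference kernel: a crude high-pass filter / edge detector.
# A 1d CNN would learn many such kernels instead of hand-picking one.
kernel = np.array([1.0, -1.0])
feature_map = np.convolve(waveform, kernel, mode="valid")

# 'valid' convolution shortens the signal by len(kernel) - 1 samples
print(waveform.shape, feature_map.shape)
```

A real raw-audio model stacks many such learned filters, often with strides or dilations so the receptive field grows quickly enough to cover meaningful stretches of sound.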

Anything you could contribute to the repo would be of great help. There is always some friction to overcome with these things, and if we get into the flow of sharing our work like this, it could be very valuable. If you need any help figuring out the git piece, or with anything else, please feel free to ask in this thread :slight_smile: The same probably goes for the zoom calls - even if I am not there, I suspect everyone will still be very happy to extend a helping hand. Also, if you would like to chat about anything regarding this, let me know and we can set aside some time to talk on zoom :slight_smile:

I have never attempted this, but it seems the ecosystem is maturing; I'm still not sure how viable it is. Here are some really nice resources on this:

Cool! If you need help at any point or have any ideas or thoughts, please post in this thread :slight_smile: I’m sure quite a lot of people would be eager to learn how it’s going and to extend a helping hand if we can :slight_smile:


All of this should be in fastai master now - if someone else encounters this problem, please update your fastai2 installation, and of course give a shout here if you continue to see the issue :slight_smile:


@radek Thank you. This is really interesting and you’ve made it super accessible. I feel like I’ve learnt a bunch already. Looking forward to experimenting.


@radek, would you recommend a way to play a sound programmatically in Jupyter from a numpy array? I mean without the lovely clickable UI, and one that works.

sounddevice seems to work until you set the sample rate to 24414. Then it consistently crashes the Python kernel.

simpleaudio works until you change the sample rate to 24414. Then it helpfully complains that “weird sample rates are not supported.” It appears that 24414 is “weird”.


(Ubuntu 16.04 LTS)

I am not really sure, to be honest. This little widget for playing audio works quite well and comes standard with Jupyter Notebook. Is there a specific reason you would not want to use it? For Jupyter Notebook this seems like it might be the easiest way to go.

BTW maybe someone else will know how to go about this?
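One stdlib-only workaround, if the objection is mainly to the clickable UI: write the array out as a WAV file (the WAV header accepts arbitrary rates, 24414 included) and then play the file programmatically, e.g. with IPython.display.Audio and autoplay in Jupyter, or with any external player. A minimal sketch with a synthetic tone standing in for the real data:

```python
import math
import struct
import wave

RATE = 24414  # the dataset's unusual sample rate; WAV headers store it fine
FREQ = 440.0

# One second of a 440 Hz sine, scaled to 16-bit signed PCM at half amplitude
samples = (int(32767 * 0.5 * math.sin(2 * math.pi * FREQ * n / RATE))
           for n in range(RATE))

with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)   # mono
    f.setsampwidth(2)   # 16-bit samples
    f.setframerate(RATE)
    f.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# In Jupyter, this then plays with no manual clicking:
# from IPython.display import Audio, display
# display(Audio(filename="tone.wav", autoplay=True))
```

This sidesteps the sample-rate quirks of sounddevice and simpleaudio entirely, since nothing in the Python process has to open the audio device at 24414 Hz.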

Thanks for asking.

  1. I have a musician’s ear and would like to make a utility for comparing sounds without having to do manual clicks.
  2. One of the joys of coding has been to be able to implement one’s own design choices. Here I’m forced into an inferior solution in order to work around someone else’s faulty software.

Anyway, simpleaudio’s source code shows that 24000 passes as a non-“weird” sample rate. It’s close enough to the actual rate of 24414 to acceptably reproduce the original sound. However, at 24000, simpleaudio fails with yet another indecipherable error: “Error setting parameters. – CODE: -22 – MSG: Invalid argument”.

sounddevice also accepts 24000 and even generates the sound. Once. It crashes the second time.

You might detect a hint of sarcasm. I acknowledge it and apologize. It’s just that I had a long career in an era when you could expect the basic functions of an OS or programming system to work correctly. It seems that in the open source world, one must reduce expectations, and use workarounds and compromises to keep moving forward. I am still getting used to this mindset. At the same time, to write and test an audio interface that does not barf or crash is not rocket science. It’s Software Development 101.

After wasting more than two hours on playing a sound from Jupyter, now I see that resistance is futile. I accept your terms and conditions. It will be the standard little widget.