Help wanted: ML audio model to help digital contact tracing

If you don’t know what “digital contact tracing” is, see this comic strip.

A key aspect of digital contact tracing is estimating transmission probability between two people using their smartphones. The upcoming Google-Apple system plans to use Bluetooth signal strength (RSSI) as a proxy for how close people are, and combine that with duration, to get a measure of “contact”. (RSSI is a pretty bad proxy for distance, but that’s a story for another post.) Singapore’s “Trace Together” app is similar, as is the UK’s planned system announced a couple of days ago. NOVID measures distance slightly differently, using acoustic time-of-flight: it measures how long a little sound takes to travel between phones, and multiplies that by the speed of sound.
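(Just to make the time-of-flight arithmetic concrete, here’s a toy calculation; the numbers are made up for illustration and have nothing to do with NOVID’s actual implementation.)

```python
# Toy illustration of acoustic time-of-flight ranging (made-up numbers,
# not NOVID's actual code or calibration).
speed_of_sound_m_per_s = 343.0   # speed of sound in air at roughly 20 °C
one_way_delay_s = 0.0058         # hypothetical measured travel time of the chirp
distance_m = speed_of_sound_m_per_s * one_way_delay_s
print(f"{distance_m:.2f} m")     # -> 1.99 m
```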

All this is just background.

Anyway, it’s very important for these systems to measure “contact” well. If they miss a lot of true contacts, people get sick and die, and/or we need to compensate with more aggressive society-wide lockdowns etc. Conversely, if they send out too many false alarms, people will stop complying with endless quarantine requests that never end in them getting sick. Either they’ll figure that they must have already caught it and are now immune, or they’ll lose faith in the systems, or they may simply run out of ways to find substitutes for their jobs and other obligations.

So anything that would make the systems estimate transmission-risk-per-time more accurately is a huge win.

So, here’s one thing we can do.

  • There seems to be widespread consensus—from CDC, WHO, epidemiologists, etc.—that the major transmission vector is when one person creates respiratory droplets and another person breathes them in.

  • There also seems to be universal consensus that people emit dramatically more respiratory droplets when they are vocalizing (talking, coughing, singing, laughing, sneezing, etc.) than when they are just quietly breathing. I especially like this little video. This is also consistent with the fact that superspreader events generally involve loud talking and singing.

So, back to contact tracing: the phones estimate proximity and duration, and declare “contact” if the proximity and duration cross some threshold condition (typically 30 minutes and 2 meters). Right now, that threshold is exactly the same whether no one is vocalizing (talking, coughing, singing, etc.) or everyone is screaming.

If the phones could detect the presence or absence of vocalization, we could have different thresholds in the two cases, which would make the systems work better—fewer misses, fewer false alarms, or both.
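To illustrate, here’s a hypothetical sketch of what a vocalization-aware contact rule could look like. The function, the 2-meter cutoff, and the 10-vs-30-minute thresholds are all made-up placeholders, not anything from the Google-Apple spec or any real app:

```python
# Hypothetical sketch of a vocalization-aware contact rule.
# All thresholds are made-up placeholders for illustration only.
def is_contact(distance_m: float, duration_min: float, vocalizing: bool) -> bool:
    """Declare 'contact' if people were close for long enough, using a
    shorter duration threshold when vocalization was detected."""
    max_distance_m = 2.0
    duration_threshold_min = 10.0 if vocalizing else 30.0
    return distance_m <= max_distance_m and duration_min >= duration_threshold_min

print(is_contact(1.5, 15, vocalizing=True))   # True: vocalizing, so 15 min is enough
print(is_contact(1.5, 15, vocalizing=False))  # False: quiet, so 15 min is below threshold
```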

We could even aim higher than presence/absence of vocalization: We could try to quantify respiratory droplet emission based on volume and type of vocalization, we could try to guess whether people are wearing masks based on how their voices sound, etc. But I’m concerned about “letting the perfect be the enemy of the good.” Even the most bare-bones functionality of classifying the presence/absence of vocalization would already make a huge difference in practice, I think. And I don’t think it would pose any significant problems for either phone battery or privacy, although I’m not an expert on either of those.

Anyway, we need two things:

  1. We need people in charge of building digital contact-tracing systems to agree that this is worth doing.

  2. We need a nice smartphone-compatible open-source codebase that detects (at minimum) the presence/absence of vocalization, for those people to put into their systems.
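For concreteness, here’s roughly the interface I have in mind for (2), with a deliberately naive energy-threshold stand-in where the real detector would go. A shipped version would use a trained model running on-device (something like TensorFlow Lite, say); this is just to pin down the minimal input/output contract:

```python
# Naive stand-in for a vocalization detector, to pin down the interface.
# A real version would be a trained classifier, not an RMS-energy threshold.
import numpy as np

def vocalization_present(audio: np.ndarray, sample_rate: int,
                         frame_ms: int = 30, rms_threshold: float = 0.02) -> bool:
    """audio: mono samples scaled to [-1, 1]. Returns True if any ~30 ms
    frame is louder than a (made-up) RMS threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(audio) // frame_len
    if n_frames == 0:
        return False
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return bool((rms > rms_threshold).any())
```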

I’ve been trying to push on (1) for a few weeks now, but I don’t have great contacts and I’m not sure I’m getting anywhere. I’m happy to get help on this part.

…But the main reason I’m posting here is (2). I haven’t done anything in that direction and don’t have time to. But it seems like a reasonably straightforward ML project.

By the way, time is of the essence; even if the people in charge agree that this is worth doing in principle, it won’t happen unless vocalization-detection software is already tested and ready to go.

I’m not an expert on any of this, although I’m lightly involved in a contact-tracing group (I’m a physicist, and specifically involved in a sub-group studying the physics of Bluetooth-RSSI proximity).

I’m very open to ideas and happy to discuss more. Thanks in advance to anyone who wants to help!!! And please understand that, if you do, there is a chance that this will all wind up being a waste of time. High risk high impact, I think. :slight_smile:

The iterative aspect of machine learning is important because as models are exposed to new data, they are able to independently adapt.

Are you talking about this?

If I understand you correctly, my answer is “no”…

I am proposing:

STEP 1: We find or create a dataset of audio clips, each with a label that says “In this clip, somebody is / isn’t talking, singing, coughing, etc.”

STEP 2: We train an ML classifier on that dataset. This process may involve several iterations, but eventually we will be happy with the model, and then we save the weights and move on to the next step. (A rough sketch of what this could look like is below, after Step 3.)

STEP 3: We deploy this model to people’s phones, to classify samples of audio picked up by the smartphone’s microphone.
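Just to make Steps 1–2 concrete, here’s a minimal sketch of training a presence/absence-of-vocalization classifier. Everything in it is an assumption for illustration: the folder layout (`clips/vocal`, `clips/quiet`), the log-mel features, and the logistic-regression model. A real on-phone deployment would presumably use a small neural network exported to an on-device runtime instead, but the overall shape of Steps 1–2 would be the same:

```python
# Minimal sketch of Steps 1-2: labeled clips -> features -> trained classifier.
# Folder layout, features, and model choice are illustrative assumptions.
import glob
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def clip_features(path: str, sr: int = 16000) -> np.ndarray:
    """Load a clip and summarize it as a time-averaged log-mel spectrum (64-dim)."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64)
    return librosa.power_to_db(mel).mean(axis=1)

# Hypothetical dataset layout: clips/vocal/*.wav and clips/quiet/*.wav
vocal_paths = glob.glob("clips/vocal/*.wav")
quiet_paths = glob.glob("clips/quiet/*.wav")
paths = vocal_paths + quiet_paths
labels = np.array([1] * len(vocal_paths) + [0] * len(quiet_paths))

X = np.stack([clip_features(p) for p in paths])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))  # Step 2's "are we happy?" check
```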

If I understand you correctly, you are proposing to keep updating the model during step 3, after deployment (i.e., online learning). I don’t think that’s necessary. Even if it would help, it would be very difficult to do, for various reasons: privacy, lack of ground truth, and the fact that it makes the software more complicated, less predictable, and harder to test.

Sorry if I’m misunderstanding :slight_smile: