Questions on recent study using NLP to classify heart-disease symptoms

This might only be relevant to #NLP or #ethics folks, or those using AI for #healthcare. If folks can shed more light on this study, please jump in!

I came across an MIT Media Lab tweet sharing the article “AI Study Finds Key Women’s Heart Attack Symptoms Not So Different From Men”.

This seems big, and counter to what I had read in many places, so I dug around a bit. But I could not find any details on the type of people whose cases were considered, or on the misclassification risks if this is applied to auto-diagnose heart disease for all people. (The article hints at such future applications: “The method turned up distinct clusters of symptoms that could help sort through those confusing patient stories, the researchers say”.)

I couldn’t find the paper, but found these slides:

To summarise from it:
Goal: study angina symptoms based on how patients describe them, and determine whether gender-based differences exist.
Finding: no gender-based difference found in descriptions of key symptoms such as chest pain and breathing difficulty.
Conclusion: labelling women’s heart attack descriptions/experiences as “atypical” is inaccurate and likely causes more harm than good.
Method used: record (with consent) doctor-patient conversations for 637 cases referred for angiography; extract words and meaning from the transcripts using NLP; then use topic modelling to create “symptom clusters” from the extracted words across the two genders considered; finally, map angiogram results to the clusters and genders.

I had a few questions/concerns:

  1. If the NLP model is being trained using patient-doctor chat alone, how does it capture the factors that influence the natural language used in that context?
  2. None of the articles/slides mentions who these 637 people were. Was there parity between the genders in terms of language proficiency, cultural background, etc., that led to the use of the same words for similar experiences? This seems important if the method gets used across populations!
  3. The symptom clusters are created using generative topic models - I don’t understand those well enough, but are there known biases in the process beyond dataset bias?
  4. How do you get to publish such far-reaching conclusions without providing details about the “failure” scenarios?! (This reminds me of another recent “AI can identify heart failure with 100% accuracy” paper.)

Please add/correct if you’re an expert in NLP or cardiac diseases! :slight_smile: Thanks!


I’m a physician, but I’m no longer practicing, and I am neither a cardiologist nor an ML expert, so take this with a grain of salt. From my perspective, this study should be ignored: its conclusions are irresponsible and not supported by the data.

You can’t prove the null hypothesis, i.e. the hypothesis that there is no difference. As physicians, we are trained to recognize the male pattern of angina/MI, and not the female pattern, due to systemic sexism. As a result, women are typically under-diagnosed for coronary artery disease. If the above conclusions were to be believed, this would make the problem worse and set back healthcare for women by several decades.


Thank you for responding!

When you say the data doesn’t prove the null hypothesis, do you mean that the similarity in reporting of symptoms is not sufficient to claim that there is no difference in the symptoms or pattern of angina? If yes, I agree. The study definitely conflates “language used” with “symptoms experienced”, and that can derail timely evaluation and care for CAD in women.

I suppose there are 2 issues here, both equally important:

  1. Differences between men and women in symptoms experienced.
  2. Differences in reporting styles/language used across different population groups.

The study trains ML on #2 and reports the result as an absence of #1.

Please correct me if I’m mistaken. Thanks!

In this case, the null hypothesis is that there is no difference between men’s and women’s interviews in this study. That is virtually impossible to prove with any statistical test or predictive model. It is like saying, “My model doesn’t find a difference between dogs and cats, so scientists should now assume they are the same species”.

There is also the very real possibility that their NLP predictive model wasn’t good enough, or didn’t have enough data to train on. You could also postulate that the interviews were insufficient to detect a difference. Or that there was selection bias: the women whose first contact with the healthcare system is dropping dead from an MI were not included in this study. There are so many things not to like about the authors’ conclusions…
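The “not enough data” point can be made concrete with a quick simulation (invented numbers, no connection to this study’s data): even when a real group difference exists, an underpowered comparison of this rough size will usually report no significant difference.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Invented setup: a REAL difference of 0.3 standard deviations between
# two groups, with ~40 subjects per group (deliberately underpowered).
n_per_group, true_shift, trials = 40, 0.3, 2000
se = math.sqrt(2 / n_per_group)  # std. error of the mean difference (sigma = 1)

misses = 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_shift, 1.0, n_per_group)
    if abs(b.mean() - a.mean()) < 1.96 * se:  # z-test fails to reject the null
        misses += 1  # a real difference exists, yet the study "finds nothing"

miss_rate = misses / trials
print(f"fraction of simulated studies concluding 'no difference': {miss_rate:.0%}")
```

Here most simulated “studies” miss a genuine effect, which is exactly why a null result from one modest dataset cannot support the claim that no difference exists.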

If they had found a positive result, that is, a strong difference between men’s and women’s interviews detected by their NLP algorithm, that would have been an interesting finding from which conclusions could be drawn. But nothing can be concluded from this study.
