This might only be relevant to #NLP or #ethics folks, or those using AI for #healthcare. If folks can shed more light on this study, please jump in!
I came across an MIT Media Lab tweet sharing this article: "AI Study Finds Key Women’s Heart Attack Symptoms Not So Different From Men".
This seems big, and counter to what I had read in many places, so I dug around a bit. But I could not find any details on the type of people whose cases were considered, or on the misclassification risks if this were applied to auto-diagnose heart disease for all people. (The article hints at such future applications: “The method turned up distinct clusters of symptoms that could help sort through those confusing patient stories, the researchers say”.)
I couldn’t find the paper, but found these slides: https://svcardiologia.org/es/images/documents/esc2019/HERMES_study_ESC19.pdf
To summarise from it:
Goal: study angina symptoms based on how patients describe them, and determine whether gender-based differences exist.
Finding: no gender-based difference was found in how key symptoms such as chest pain and breathing difficulty are described.
Conclusion: calling women’s heart attack descriptions/experiences “atypical” is inaccurate and likely causes more harm than good.
Mechanism used: record (with consent) doctor-patient conversations for 637 cases referred for angiography; extract words and meaning from the transcripts using NLP; then use topic modelling to create “symptom clusters” from the extracted words across the two genders considered; finally, map angiogram results to the clusters and genders. (A rough sketch of how I read this pipeline is below.)
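For the NLP folks, here is how I understand that pipeline, as a very rough sketch. The slides don't say which tools were actually used, so the scikit-learn LDA, the variable names (`transcripts`, `genders`), and all parameters below are my assumptions, not the authors' code:

```python
# Rough sketch of the pipeline described in the slides, NOT the authors' code.
# Tooling (scikit-learn LDA), variable names, and parameters are my assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical inputs: one transcript string per patient, plus recorded gender.
transcripts = ["crushing pain in my chest and down the arm",
               "felt short of breath, tightness, a bit dizzy"]  # 637 in the study
genders = np.array(["F", "M"])                                  # aligned with transcripts

# 1. Turn transcripts into word counts (the "extract words" step).
vectorizer = CountVectorizer(stop_words="english", max_features=2000)
counts = vectorizer.fit_transform(transcripts)

# 2. Fit a generative topic model to get "symptom clusters".
lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-patient cluster weights

# 3. Inspect the top words in each cluster.
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = vocab[topic.argsort()[-8:]]
    print(f"cluster {k}: {', '.join(top_words)}")

# 4. Compare average cluster weights by gender (the study then maps
#    these against angiogram results).
for g in ("F", "M"):
    print(g, doc_topics[genders == g].mean(axis=0).round(3))
```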
I had a few questions/concerns:
- If the NLP model is trained on patient-doctor conversations alone, how does it capture the factors that influence the natural language used in that context?
- None of the articles/slides say anything about who these 637 people were. Was there parity between the genders in terms of command of the language, cultural background, etc., that led to the same words being used for similar experiences? This seems important when this gets used across populations!
- The symptom clusters are created using generative topic models. I don’t understand these well enough, but are there known biases in the process outside of dataset bias? (See the toy sketch after this list.)
- How do you get to publish such far-reaching conclusions without providing details around the “failure” scenarios?! (Reminds me of another recent “AI can identify heart failure with 100% accuracy” paper.)
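On the topic-model question above: even setting dataset bias aside, my (possibly naive) understanding is that the clusters such models produce depend on modelling choices like the number of topics and the random initialisation. A toy illustration of that, again assuming an LDA-style model (not necessarily what the study used):

```python
# Toy illustration (my assumption, not the study's setup): the same data can
# yield different cluster summaries under different seeds / topic counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["chest pain radiating to the arm",
        "short of breath and dizzy",
        "tight chest, pain when walking",
        "breathless, sweating, nauseous"] * 50  # small synthetic corpus

counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Same data, different seeds and topic counts -> different "symptom clusters".
for seed in (0, 1):
    for k in (3, 5):
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        weights = lda.fit_transform(counts).mean(axis=0).round(2)
        print(f"seed={seed}, topics={k}: average topic weights {weights}")
```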
Please add/correct if you’re an expert in NLP or cardiac diseases! Thanks!