Anitha Kannan
August 11, 2021

2020 ML Research Internships at Curai


At Curai, research has been a top priority from the very beginning: we value longer-term innovation, pushing the state of the art, and sharing advances with the community to validate our findings and receive feedback. Our research internship program gives students the opportunity to explore some of the research questions we are tackling toward scaling access to healthcare, gain valuable research experience, and, most of the time, come away with a publication-worthy piece of work. Our 2020 research internship program was special and is likely to be the norm going forward: we hosted three research interns remotely throughout the year.

We had three creative research interns: Anirudh Joshi and Ali Mottaghi from Stanford, and Jai Chintagunta from Virginia Tech. They made substantial research progress in medical conversation summarization and in active learning for long-tailed multilabel distributions.

While you can read their work in the papers linked in each section below, what follows in this blog post is an edited short version of the same, in their own words. It has taken us a bit longer to put this blog together precisely because we prioritized getting the work published in peer-reviewed venues.

If you are interested in NLP and clinical decision making research and want to positively impact quality healthcare access for all, please apply for a research internship or any of our other open positions. You can also read about the amazing work of our engineering interns in this other blog post.

Confidentiality note: In accordance with Curai’s privacy policies, all the illustrative examples included in this post do NOT correspond to real patients. They are either synthetic or fully anonymized.

Automated Summarization of Medical Dialogue

Anirudh Joshi, Stanford University | Mentors: Namit Katariya, Anitha Kannan

You can find out more in our paper presented at EMNLP-2020.

At Curai our mission is to provide affordable quality healthcare to everyone. To realize this mission, it is important to scale key aspects of the healthcare workflow to lower costs and increase access. Last Spring I had the opportunity to work with Anitha and Namit to leverage advances in natural language processing to help scale dialogue summarization.

Summarizing patient history is a core component of Curai doctors’ responsibility. This summary can be useful both to the patient, to ensure accurate capture of all the information they conveyed, and to other Curai doctors to whom this conversation might be handed off. This process takes 5–10 minutes per patient and reduces the time health coaches can spend with patients. As Curai’s patient population grows, this task must scale with the company in order to provide the best patient experience.

Our key insights while working on this problem were that, when working with sparse medical data, it is critical to 1) structure the data in a form conducive to machine learning and 2) design model architectures with the right inductive biases for the task.

Before we get into the science, let’s enjoy some of Dr. Summarize’s creations!

Figure 1. Example of Dr. Summarize output on a conversation related to COVID-19.

Designing the data

From the dialogue summarization literature, it is evident that one of the biggest challenges to accurate summarization is remembering context over long dialogues. We tackled this problem by taking advantage of the local structure of our patient history-taking data. Each “snippet” is the set of dialogue turns between a doctor and a patient between consecutive questions from the doctor. We treat a snippet as the unit of dialogue to summarize and obtain ground-truth summaries for these snippets from human doctors. Each snippet is independent, so generating a summary for one snippet does not require the others.
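As a concrete illustration, here is a minimal sketch of this snippet segmentation. The turn format (a list of speaker/text pairs) is a hypothetical stand-in for our actual data format:

```python
def split_into_snippets(turns):
    """Each snippet starts at a doctor turn and runs until the next doctor question.

    `turns` is a list of (speaker, text) pairs; speakers are "DR" or "PT".
    """
    snippets, current = [], []
    for speaker, text in turns:
        # A new doctor question closes the previous snippet.
        if speaker == "DR" and current:
            snippets.append(current)
            current = []
        current.append((speaker, text))
    if current:
        snippets.append(current)
    return snippets

dialogue = [
    ("DR", "Do you have a fever?"),
    ("PT", "Yes, since yesterday."),
    ("DR", "Any cough?"),
    ("PT", "No cough."),
]
# Two doctor questions produce two independent snippets to summarize.
snippets = split_into_snippets(dialogue)
```

Because each snippet is summarized independently, training examples stay short regardless of how long the overall conversation runs.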

While generating the summaries, we asked the doctors to include a special token [NO] if the medical condition mentioned was negated by the patient.

Figure 2. Three training examples and their ground truth labels.

Designing the model

Dialogue summarization can be formulated as an abstractive task (the model generates summaries), an extractive task (the model extracts exact phrases from the text), or a hybrid task (a mix of the two). We formulate medical dialogue summarization as a hybrid task: it is important to retain the integrity of patient conversations through exact phrases; however, for fluency, it is important to allow the model to generate text. Given this formulation, we chose to extend the Pointer Generator model architecture that has proven to be state of the art for hybrid summarization tasks.

The Pointer Generator model produces probability distributions over the original text (copy-mode) and over the generative decoder (generate-mode). This allows it to determine, for each position in the summary, whether to copy from the input or generate a word token. While this inductive bias is perfect for our task, the vanilla model (2M-BASE) has important limitations for medical dialogue: 1) it biases toward generation during inference, and 2) since it is pre-trained on news corpora, it biases toward sequential text copying rather than alternating between generation and copying. For our task, we want the model to favor copying, to maintain the integrity of the patient response, and to alternate between copying and generation.
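The copy/generate mixture at the heart of the pointer generator can be sketched as follows. The vocabulary, attention weights, and p_gen value here are toy numbers, not outputs of the actual model:

```python
def final_distribution(p_gen, vocab_dist, attn_dist, source_tokens, vocab):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on copies of w)."""
    # Generate-mode contribution over the fixed vocabulary.
    final = {w: p_gen * p for w, p in zip(vocab, vocab_dist)}
    # Copy-mode contribution: attention over the source tokens.
    for tok, a in zip(source_tokens, attn_dist):
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * a
    return final

vocab = ["fever", "cough", "reports", "patient"]
vocab_dist = [0.1, 0.1, 0.5, 0.3]        # generate-mode distribution
source_tokens = ["fever", "since", "yesterday"]
attn_dist = [0.7, 0.2, 0.1]              # copy-mode attention over the source
dist = final_distribution(0.4, vocab_dist, attn_dist, source_tokens, vocab)
# "fever" gets mass from both modes: 0.4 * 0.1 + 0.6 * 0.7 = 0.46
```

Lowering p_gen shifts mass toward the copied source tokens, which is exactly the bias we want for patient-reported details.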

To build those properties into the pointer-generator network, we penalize the probability of generation in our loss function so the model learns to rely more on copying. We also encourage the model to focus on medical concepts and negations more than ordinary words, both by influencing the attention distribution during copying and through the loss function.
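A minimal sketch of the copy-biasing idea, assuming a simple penalty (a weight lambda times the average generation probability) added to the negative log-likelihood; the exact loss in the paper may differ:

```python
import math

def copy_biased_loss(target_probs, p_gens, lam=0.5):
    """Average NLL of the target tokens plus a penalty on the average p_gen."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    gen_penalty = sum(p_gens) / len(p_gens)  # high when the model generates a lot
    return nll + lam * gen_penalty

# Two decodings with identical token likelihoods, but the second copied more.
loss_generates = copy_biased_loss([0.8, 0.9], p_gens=[0.9, 0.8])
loss_copies = copy_biased_loss([0.8, 0.9], p_gens=[0.2, 0.1])
# The loss is lower when the model relies on copying, as intended.
```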

We find that our extension to the pointer-generator network improves summaries both on automated metrics like ROUGE F1 and in evaluations by human doctors. Two doctors compared the outputs of the vanilla baseline with our extension (Dr. Summarize), and they preferred Dr. Summarize on twice as many examples as the baseline. For a more detailed analysis of the results, check out our publication in EMNLP Findings linked at the top!

Dr. Summarize illustrates that automated medical dialogue summarization can be achieved with current advances in NLP and can be translated to help scale healthcare.

I had an incredible time as an ML researcher at Curai, as it strikes the perfect balance between research focus and tying the research back to improving the patient experience in the product. The promise of machine learning in healthcare has always been about scaling and industrializing parts of the workflow, and Curai’s product and research are directly aligned with that. A huge thank you to the health coaches who enable high-quality AI work to happen through data creation and evaluation. An extra special thanks to Anitha, who has been the best machine learning research mentor I’ve had!

Medical symptom recognition from patient text: An active learning approach for long-tailed multilabel distributions

Ali Mottaghi, Stanford | Mentor: Anitha Kannan

You can find out more in our paper presented at NeurIPS 2020 ML4H.

In alignment with the standards for medical history taking, Curai asks its patients for a concise statement describing the reason for their visit: the reason for encounter (RFE), sometimes also called the chief complaint (CC). Extracting medical findings (symptoms) from the RFE is an important step in history taking and in asking follow-up questions. During my internship at Curai, I worked on developing a more data-efficient machine learning model for this task. Figure 3 shows an example of the medical symptoms extracted from an RFE.

Figure 3: An example RFE and corresponding set of medical symptoms

There is no large, publicly available dataset for medical symptom recognition from patient text, so at Curai we curate our own. However, labeling RFEs is time-consuming and expensive. During this internship, we developed a new method that achieves the best performance and coverage in medical symptom recognition among the methods we compared, while using the fewest labeled examples.

As shown in Figure 4, RFEs and their extracted symptoms follow a long-tailed distribution: a few symptoms are common, many are rare, and multiple symptoms tend to co-occur. To model this kind of data, we developed a new active learning method that iteratively selects a set of unlabeled RFEs covering a wide range of medical symptoms, so that by labeling them we can achieve the highest accuracy with the fewest labels.

Figure 4: Long-tailed data distribution. Also shown are example RFEs from the dataset; note the patient language and the co-occurrence of symptoms.

Our active learning procedure starts with a large pool of unlabeled RFEs from encounters on the Curai Health platform. At the beginning, only a small fraction of RFEs are labeled; our goal is to iteratively select and label the most informative samples. As shown in Figure 5 below, in each iteration we train our neural network model on the current labeled set and extract features (latent representations) from both the labeled and unlabeled RFEs.

Figure 5: Overview of our active learning approach

We then use the affinity propagation algorithm [ref: Frey et al. 2007] to reveal the underlying structure of the latent space. Affinity propagation is an exemplar-based clustering method built on message passing, and it does not require the number of clusters to be specified beforehand. Our algorithm, Active Long-Tailed Learning (ALTL), selects a diverse set of representations based on their distance both to the cluster centroids and to already-labeled data points. We want to select points that explore new regions of the latent space (introducing previously unseen labels) as well as points close to the cluster centroids (capturing data density in the latent space). The selected data points are labeled by an expert and added to the labeled set.
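The selection criterion can be sketched with a toy acquisition score: reward distance from the labeled set (exploration) and closeness to a cluster exemplar (density). The 2-D points, exemplars, and weighting below are illustrative stand-ins, not our actual latent space or scoring function:

```python
def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def acquisition_scores(unlabeled, exemplars, labeled, alpha=1.0):
    """High score = close to a cluster exemplar AND far from labeled points."""
    scores = []
    for u in unlabeled:
        d_exemplar = min(euclid(u, e) for e in exemplars)  # density term
        d_labeled = min(euclid(u, l) for l in labeled)     # exploration term
        scores.append(alpha * d_labeled - d_exemplar)
    return scores

exemplars = [(0.0, 0.0), (5.0, 5.0)]  # e.g. exemplars found by affinity propagation
labeled = [(0.1, 0.1)]                # the region around (0, 0) is already covered
unlabeled = [(0.2, 0.0), (5.1, 4.9)]
scores = acquisition_scores(unlabeled, exemplars, labeled)
# The point near the unexplored cluster at (5, 5) scores highest.
best = unlabeled[scores.index(max(scores))]
```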

To evaluate our model, we gathered 1232 RFEs from the Curai Health platform and had medical experts label them with the corresponding symptoms. We consider a universe of the 20 most frequently occurring medical symptoms in patient RFEs; Figure 4 shows the distribution of the data. Given the size of the available dataset, we use a pretrained InferSent model [ref: Conneau et al. 2017] to encode the patient RFE text. The text embeddings are fed into an MLP, and the outputs before the last fully connected layer are used as the main features in our method.
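Since several symptoms can appear in one RFE, the classification head is multilabel: one sigmoid output per symptom, each thresholded independently. The sketch below uses toy weights and a tiny symptom list in place of the trained MLP and InferSent embeddings:

```python
import math

SYMPTOMS = ["fever", "cough", "headache"]  # toy stand-in for the 20-symptom universe

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_symptoms(embedding, weights, biases, threshold=0.5):
    """One independent logit per symptom, so several labels can fire at once."""
    preds = []
    for label, w, b in zip(SYMPTOMS, weights, biases):
        logit = sum(e * wi for e, wi in zip(embedding, w)) + b
        if sigmoid(logit) >= threshold:
            preds.append(label)
    return preds

embedding = [1.0, 0.5]  # stand-in for an InferSent sentence vector
weights = [[2.0, 0.0], [1.5, 1.0], [-3.0, 0.0]]
biases = [0.0, -1.0, 0.0]
# "fever" and "cough" logits are positive; "headache" stays below threshold.
predicted = predict_symptoms(embedding, weights, biases)
```

Using independent sigmoids rather than a softmax is what lets co-occurring symptoms, common in the long tail of RFEs, be predicted together.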

We set up our experiments to start with a small set of 10 labeled RFEs; at each iteration, we select a new batch (again of size 10) of unlabeled RFEs with the ALTL acquisition function. The new data points are then labeled by the oracle and added to the labeled training set, and the model is retrained on this new training set. This process continues for multiple iterations until the labeling budget is exhausted. We compare the performance of our method against the following baselines:

  1. Random baseline, as the name suggests, randomly selects new data points from the unlabeled pool following a uniform distribution.
  2. Fully supervised baseline is trained on the completely labeled training set and serves as a reference.
  3. Core-set [ref: Sener et al. 2017] is a representation-based method that solves the K-Center problem to select points in the latent space.
  4. Max-entropy [ref: Settles et al. 2009] is the best-known uncertainty-based method in active learning, using the entropy of the model’s output as a measure of uncertainty.
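The iterative experiment setup above can be sketched as a loop; the oracle and acquisition function here are trivial stand-ins, and only the label budget is tracked (retraining is elided):

```python
def run_active_learning(pool, oracle, acquire, seed_size=10, batch_size=10, rounds=3):
    """Seed with a few labels, then repeatedly score, label a batch, and retrain."""
    labeled = {x: oracle(x) for x in pool[:seed_size]}
    unlabeled = pool[seed_size:]
    for _ in range(rounds):
        # Model retraining on `labeled` would happen here.
        scored = sorted(unlabeled, key=acquire, reverse=True)
        batch, unlabeled = scored[:batch_size], scored[batch_size:]
        labeled.update({x: oracle(x) for x in batch})  # oracle = human expert
    return labeled

pool = list(range(100))
labeled = run_active_learning(pool, oracle=lambda x: x % 2, acquire=lambda x: x)
# 10 seed labels + 3 rounds of 10 = 40 labeled examples spent from the budget
```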

Figure 6: Performance (LRAP and F1-score) of our method compared to the Core-set, Max-entropy, Random, and fully supervised baselines.

Figure 6 shows the performance of our model compared to the baselines; our approach outperforms all of them. Since we start with only 10 labeled RFEs, many labels in the universe of symptoms have no associated data samples, and our method is able to identify previously unseen, unlabeled data points for exploration. At the same time, because the data has a long-tailed distribution, our method avoids oversampling outliers along the boundaries of the data.

We also qualitatively analyzed our model based on the points it selects in the latent space. Figure 4 in the paper shows that at first the model explores new clusters (labels), and later it balances exploration and exploitation to avoid outliers in the data. As our experiments showed, this addresses the long-tailed distribution of the dataset and outperforms the compared baselines on the symptom recognition task.

Medically aware GPT-3 as data generator for medical conversation summarization

Jai Chintagunta, Virginia Tech | Mentor: Namit Katariya

You can find out more in our paper that won the best paper award in 2020 ACL Workshop on medical conversations and also got accepted as a full paper at MLHC-2021.

Hi, I’m Jai. Before Curai, I studied mathematics at Virginia Tech. I joined Curai’s machine learning team to work specifically on the problem of medical dialogue summarization.

As Anirudh described previously, summarizing patient history is a crucial task in the workflow of Curai providers. With the Dr. Summarize work from Anirudh’s internship, we had a foundation for approaching medical dialogue summarization. However, we lacked a large labeled dataset, which is crucial for summarization tasks. In the medical world, obtaining labeled examples is both costly and challenging. At Curai, we have started on the journey of generating high-quality labeled summarization examples, obtaining a set of 6,000 labeled examples vetted by clinical professionals. Although this is a good start, it is nowhere near enough labeled data to train a summarization model, especially in the medical domain, where errors are deleterious. To put this into perspective, the main summarization datasets used in research (CNN/Daily Mail, XSum) have upward of 200k labeled examples.

With so few labels, we hypothesized that a large-scale language model trained on a vast amount of data could offer good few-shot performance. GPT-3 certainly fits the bill, so we tried it on our task, priming it with 21 labeled examples. GPT-3 off-the-shelf (referred to as Vanilla GPT-3 from now on) produces coherent summaries; however, it does not do as well at retaining relevant medical information from the conversation. For example, in Conversation 1 in Figure 7, Vanilla GPT-3 fails to capture the name of the medication and the dosage information.

Figure 7: Qualitative comparison of summaries outputted by Vanilla GPT-3 and GPT-3-ENS

We realized we needed to incorporate medical domain knowledge into Vanilla GPT-3. Our approach, GPT-3-ENS, illustrated in the first box of Figure 8, relies on ensembling. We prime 10 different instances of Vanilla GPT-3 with 21 unique training examples each (210 human-annotated examples in total) and have each produce a summary, yielding 10 candidate summaries. From these, we pick the summary that has the most medical concepts in common with the original dialogue, with the help of a medical entity recognizer. This approach yields more medically correct summaries than Vanilla GPT-3: in Conversation 1 in Figure 7, GPT-3-ENS now captures the relevant dosage and medication information that Vanilla GPT-3 missed.
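The ensemble-selection step can be sketched as follows. The concept extractor here is a toy keyword matcher standing in for the real medical entity recognizer, and the dialogue and candidates are synthetic:

```python
# Toy stand-in for a medical entity recognizer's vocabulary.
MEDICAL_TERMS = {"ibuprofen", "400mg", "headache", "nausea"}

def extract_concepts(text):
    """Return the medical concepts mentioned in the text (toy keyword match)."""
    return {tok.strip(".,").lower() for tok in text.split()} & MEDICAL_TERMS

def pick_best_summary(dialogue, candidates):
    """Keep the candidate sharing the most medical concepts with the dialogue."""
    source = extract_concepts(dialogue)
    return max(candidates, key=lambda s: len(extract_concepts(s) & source))

dialogue = "Patient reports headache and took ibuprofen 400mg with some nausea."
candidates = [
    "Patient has a headache.",                                  # 1 shared concept
    "Patient has headache and nausea, took ibuprofen 400mg.",   # 4 shared concepts
]
best = pick_best_summary(dialogue, candidates)
```

Selecting by shared medical concepts is what pushes the ensemble toward summaries that preserve medications and dosages rather than merely fluent ones.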

Now we have a way of producing medically correct summaries using relatively few manual annotations. However, we can’t send every doctor-patient chat to OpenAI’s servers due to privacy concerns, and the approach is not very amenable to corrections or supplementation from doctors. We therefore propose using GPT-3-ENS to synthesize labels on which we train a downstream summarization model (Figure 8).

Figure 8: The first box introduces GPT-3-ENS: infusing medical knowledge to identify the most medically correct summary from GPT-3. GPT-3-ENS-synthesized and human-labeled data are combined to learn a summarization model that performs better than models trained on either source alone.

Using this approach, we are able to train a model from 210 human-labeled examples that is on par in performance with a model trained on 6400 human-labeled examples. Furthermore, our experiments showed that the best model comes from training on a mix of GPT-3-ENS-synthesized and human-labeled examples.

Figure 9: Doctor evaluations in regards to coherency and coverage of generated summaries

In Figure 9:

  • Human is a model trained on 6400 human-labeled examples
  • GCF… is a model trained on 6400 GPT-3-ENS-synthesized examples (which themselves required 210 human-labeled examples)
  • Human + GCF… is a model trained on 3200 GPT-3-ENS-synthesized examples and 3200 human-labeled examples

