GIST scientists advance voice pathology detection via adversarial continual learning

Dina Katabi, principal investigator for AI and health at the MIT Jameel Clinic, and Hong Kook Kim, professor at Gwangju Institute of Science and Technology (GIST), co-led a project to develop an adversarial task adaptive pretraining approach that improves the performance of a voice pathology detection (VPD) model by more than 15% compared to the ResNet50 model. The development enables early and accurate diagnosis of voice-related disorders.

Kim states, “Our partnership with MIT has been instrumental in this success, facilitating ongoing exploration of contrastive learning. The collaboration is more than a mere partnership; it’s a fusion of minds and technologies that strive to reshape not only medical applications but various domains requiring intelligent, adaptive solutions.”


Voice pathology detection (VPD) is a non-invasive procedure that detects voice problems caused by conditions such as cancer and cysts by identifying abnormal vibrations in the vocal cords. But self-supervised pretrained models can overfit when they are fine-tuned for VPD, which lowers their performance on new data. Recently, researchers from Gwangju Institute of Science and Technology (GIST) and Massachusetts Institute of Technology (MIT) have developed an adversarial task adaptive pretraining approach, which enhances the performance of a VPD model by over 15% compared to the state-of-the-art ResNet50 model.

Voice pathology refers to a problem arising from abnormal conditions, such as dysphonia, paralysis, cysts, and even cancer, that cause abnormal vibrations in the vocal cords (or vocal folds). In this context, voice pathology detection (VPD) has received much attention as a non-invasive way to automatically detect voice problems. It consists of two processing modules: a feature extraction module to characterize normal voices and a detection module to identify abnormal ones. Machine learning methods like support vector machines (SVMs) and convolutional neural networks (CNNs) have been successfully used as pathological voice detection modules to achieve good VPD performance. In addition, a self-supervised pretrained model can learn generic and rich speech feature representations, instead of explicit speech features, which further improves its VPD abilities. However, fine-tuning these models for VPD can lead to overfitting because of a domain shift from conversational speech to the VPD task. As a result, the pretrained model becomes too focused on the training data and does not generalize well to new data.
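The two-module structure described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the synthetic "recordings" (a sine tone plus noise, with heavier noise standing in for irregular vocal fold vibration), the two hand-picked features (energy and zero-crossing rate), and the nearest-centroid classifier, which is a deliberately simple substitute for the SVM or CNN detectors mentioned in the article. It is not the researchers' pipeline.

```python
import math
import random


def extract_features(signal):
    """Feature extraction module: summarize a waveform as (energy, zero-crossing rate)."""
    n = len(signal)
    energy = sum(s * s for s in signal) / n
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))
    return (energy, crossings / (n - 1))


def fit_centroids(features, labels):
    """Detection module, training step: store the mean feature vector of each class."""
    centroids = {}
    for label in set(labels):
        rows = [f for f, l in zip(features, labels) if l == label]
        centroids[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return centroids


def detect(centroids, feature):
    """Detection module, inference step: return the label of the closest centroid."""
    return min(centroids, key=lambda label: math.dist(centroids[label], feature))


def synth_voice(noise_level, rng, n=800, freq=0.05):
    """Hypothetical stand-in for a recording: a sine tone plus Gaussian noise."""
    return [math.sin(2 * math.pi * freq * t) + rng.gauss(0, noise_level) for t in range(n)]


# Build a small labeled set of synthetic recordings (assumed data, not real voices).
rng = random.Random(0)
recordings = [(synth_voice(0.05, rng), "normal") for _ in range(20)]
recordings += [(synth_voice(0.8, rng), "pathological") for _ in range(20)]

features = [extract_features(signal) for signal, _ in recordings]
labels = [label for _, label in recordings]
centroids = fit_centroids(features, labels)

# Classify two unseen synthetic recordings.
print(detect(centroids, extract_features(synth_voice(0.05, rng))))
print(detect(centroids, extract_features(synth_voice(0.8, rng))))
```

A self-supervised pretrained model would replace `extract_features` with learned representations, which is where the overfitting risk discussed above comes in.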

To mitigate this problem, a team of researchers from Gwangju Institute of Science and Technology (GIST) in South Korea, led by Prof. Hong Kook Kim, has proposed a groundbreaking contrastive learning method that combines wav2vec 2.0, a self-supervised pretrained model for speech signals, with a novel approach called adversarial task adaptive pretraining (A-TAPT). In this approach, they incorporated adversarial regularization during the continual learning process.
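This article does not spell out the A-TAPT objective, so the following Python sketch only illustrates the general idea of adversarial regularization, not the team's method. It assumes a much simpler setting: a logistic regression classifier on synthetic two-dimensional points, trained with an extra loss term on inputs perturbed in the direction that increases the loss (the fast gradient sign style of adversarial training). The names `eps` (perturbation size) and `lam` (weight of the adversarial term) and all data are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def perturb(w, b, x, y, eps):
    """Move x by eps per coordinate in the direction that increases the loss."""
    p = predict_proba(w, b, x)
    # For logistic regression, d(cross-entropy)/dx = (p - y) * w.
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]


def train(data, eps=0.1, lam=0.5, lr=0.1, epochs=50):
    """SGD on: loss(x, y) + lam * loss(perturbed x, y)."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = perturb(w, b, x, y, eps)
            grad_w, grad_b = [0.0] * len(w), 0.0
            # Clean term has weight 1; the adversarial regularizer has weight lam.
            for weight, sample in ((1.0, x), (lam, x_adv)):
                err = weight * (predict_proba(w, b, sample) - y)
                grad_w = [g + err * si for g, si in zip(grad_w, sample)]
                grad_b += err
            w = [wi - lr * g for wi, g in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b


# Synthetic, well-separated two-class data (assumed for illustration only).
rng = random.Random(0)
data = [([rng.gauss(-1, 0.3), rng.gauss(-1, 0.3)], 0) for _ in range(20)]
data += [([rng.gauss(1, 0.3), rng.gauss(1, 0.3)], 1) for _ in range(20)]
rng.shuffle(data)

w, b = train(data)
correct = sum((predict_proba(w, b, x) > 0.5) == (y == 1) for x, y in data)
print(f"training accuracy: {correct / len(data):.2f}")
```

The regularizer penalizes the model when a small input change flips or weakens its prediction, which is one common way to discourage a model from fitting too closely to its training examples.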