Risk Stratification of Indeterminate Thyroid Nodules Using Ultrasound and Machine Learning Algorithms

Matti Lauren Gild; Mico Chan; Jay Gajera; Brett Lurie; Ziba Gandomkar; Roderick J. Clifton-Bligh


Clin Endocrinol. 2022;96(4):646-652. 

Abstract and Introduction


Background: Indeterminate thyroid nodules (Bethesda III) are challenging to characterize without diagnostic surgery. Auxiliary strategies including molecular analysis, machine learning models, and ultrasound grading with the Thyroid Imaging, Reporting and Data System (TI-RADS) can help to triage these nodules, but further refinement is needed to prevent unnecessary surgeries and increase positive predictive values.

Design: Retrospective review of 88 patients with Bethesda III nodules who had diagnostic surgery with final pathological diagnosis.

Measurements: Each nodule was retrospectively scored using TI-RADS. Two deep learning models were tested: one was previously developed and trained on another data set consisting mainly of determinate cases, then validated on our data set; the other was trained and tested on our own data set of indeterminate cases.

Results: The mean TI-RADS score was 3 for benign and 4 for malignant nodules (p = .0022). Radiological high risk (TI-RADS 4, 5) and low risk (TI-RADS 2, 3) categories were established. The PPV for the high radiological risk category in those with >10 mm nodules was 85% (CI: 70%–93%). The NPV for low radiological risk in patients >60 years (mean age) was 100% (CI: 83%–100%). The area under the curve (AUC) value of our novel classifier was 0.75 (CI: 0.62–0.84) and differed significantly from chance level (p < .00001).

Conclusions: Novel radiomic and radiologic strategies can be employed to assist with preoperative diagnosis of indeterminate thyroid nodules.


Fine needle aspiration biopsy (FNAB) has remained the standard diagnostic investigation for characterisation of thyroid nodules. Following FNAB of suspicious nodules, classification through the Bethesda system stratifies malignancy risk using a standardized framework for reporting pathology results. A proportion (~20%) of these nodules will be indeterminate and require further investigation, including those classified as atypia of undetermined significance/follicular lesion of undetermined significance (AUS/FLUS), which are graded as Bethesda III.[1] For these nodules, the risk of malignancy is usually quoted at 5%–15%, but some studies report a risk as high as 25%.[2] Recommendations are usually to repeat FNAB in 3 months or perform a hemithyroidectomy for confirmatory diagnosis. This uncertainty can lead to patient anxiety, particularly when the majority are benign (>75%).[1] Classification of nodules without final histopathology is challenging, and these diagnostic surgeries are expensive, invasive, and carry risk. The burden of this predicament is increasing, as the incidence of incidental thyroid nodules is rising with the more widespread use of ultrasonography. This in turn leads to increasing diagnostic, treatment, and surveillance costs for a condition with low disease prevalence.[3]

Several groups have attempted to improve risk stratification of thyroid nodules with ultrasound scoring, as ultrasound is the primary imaging modality used to evaluate them. The American Thyroid Association (ATA) guidelines utilize radiological features (microcalcification, a taller-than-wide shape, solid composition, and hypoechogenicity) to characterize malignancy risk and attempt to minimize invasive treatments for characteristically benign nodules.[1] The Thyroid Imaging, Reporting and Data System (TI-RADS) is another, more recent noninvasive risk stratification protocol.[4] Each nodule is allocated a series of points depending on its radiological features, and the sum of the points determines the malignancy risk category and whether FNAB is warranted. Studies have confirmed that employing TI-RADS leads to fewer nodules being biopsied and improves the accuracy of nodule management.[5]
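The point-summing logic described above can be sketched as follows. This is a minimal illustration only: the text does not state which TI-RADS variant or thresholds were applied, so the cut-offs below assume the commonly published ACR TI-RADS point bands, and the per-feature points in the example are hypothetical inputs rather than study data. The high/low grouping mirrors the radiological risk categories reported in the abstract (TI-RADS 4 and 5 versus TI-RADS 2 and 3).

```python
def tirads_level(total_points: int) -> int:
    """Map a summed point total to a TI-RADS level.

    Assumption: ACR TI-RADS bands (0-1 -> 1, 2 -> 2, 3 -> 3, 4-6 -> 4, >=7 -> 5).
    Treating a total of 1 point as level 1 is also an assumption of this sketch.
    """
    if total_points < 0:
        raise ValueError("total_points must be non-negative")
    if total_points <= 1:
        return 1
    if total_points == 2:
        return 2
    if total_points == 3:
        return 3
    if total_points <= 6:
        return 4
    return 5


def radiological_risk(level: int) -> str:
    """Binary grouping reported in the abstract: 4-5 high, 2-3 low."""
    if level in (4, 5):
        return "high"
    if level in (2, 3):
        return "low"
    # TI-RADS 1 is not assigned to either category in the text.
    return "ungrouped"


# Hypothetical nodule: points per feature category (illustrative values only).
example_points = {
    "composition": 2,     # e.g. solid
    "echogenicity": 2,    # e.g. hypoechoic
    "shape": 3,           # e.g. taller-than-wide
    "margin": 0,          # e.g. smooth
    "echogenic_foci": 3,  # e.g. punctate echogenic foci
}

level = tirads_level(sum(example_points.values()))
print(level, radiological_risk(level))
```

Keeping the level mapping separate from the high/low grouping makes it easy to swap in a different TI-RADS variant without touching the study-specific risk categories.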

Companion use of molecular analysis to assist in diagnostics has increased over the last decade. Molecular analysis of cytological material obtained by FNAB was pioneered with the Afirma Gene Expression Classifier (GEC)[6] and ThyroSeq,[7] which have been utilized to evaluate the preoperative risk of differentiated thyroid cancer (DTC) in indeterminate nodules. Initial results using Afirma showed a high sensitivity (90%) and a high negative predictive value (NPV), suggesting utility in ruling out malignancy in these nodules.[8] In practice, this is reflected in audits that confirm a reduction in the rate of surgical intervention of up to 66.4%.[9] Initially, the positive predictive value (PPV) was reported at around 37% in Bethesda III nodules, making it challenging to rule in malignancy. A suspicious Afirma result therefore caused patient anxiety that was unnecessary for the majority, as the final pathology is benign in >50%.[8,10] ThyroSeq reports a sensitivity of 94.1% and a higher PPV of 65.9%[11] and has also been shown to reduce surgical interventions.[12] Molecular testing is costly (>US$2000), often not subsidized, and not easily accessible in less resourced countries. It is also still evolving, particularly for ruling in malignancy, despite refinement with the Afirma Gene Sequencing Classifier (GSC) and version 3 of ThyroSeq.[13] We propose that using the validated TI-RADS score to restratify risk is a simple, cost-neutral first-line approach. In a cohort of lesions deemed suspicious on molecular analysis, returning to the TI-RADS grade may assist with triage and surgical timing.
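The reason a highly sensitive test can still have a modest PPV in a low-prevalence group follows from Bayes' rule, and can be illustrated with a short sketch. The sensitivity (90%) and the malignancy prevalence range (15% to 25%) are taken from the text above; the specificity value is a hypothetical placeholder chosen only for illustration, since the text does not report one, and the function name is our own.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")

    # Expected fractions of the tested population in each cell of the 2x2 table.
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)

    if true_pos + false_pos == 0.0 or true_neg + false_neg == 0.0:
        raise ValueError("predictive value undefined for these inputs")

    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


ASSUMED_SPECIFICITY = 0.50  # hypothetical value, not reported in the text

for prevalence in (0.15, 0.25):
    ppv, npv = predictive_values(0.90, ASSUMED_SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

Under these assumptions the NPV stays high while the PPV remains well below 50%, which is the pattern the paragraph describes: such a test is better suited to ruling out malignancy than to ruling it in.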

The use of machine learning algorithms for the diagnosis of thyroid nodules is another strategy to predict final diagnosis. Lim et al.[14] proposed one of the earliest computer-aided diagnosis systems for the differentiation of malignant from benign thyroid nodules on ultrasound (USS). Eight expert-determined ultrasound parameters were utilized as inputs for the model, and the area under the receiver-operating characteristic curve (AUC) was 0.95. Similarly, Wu et al. fed ultrasound features and patient background characteristics to a radial basis function (RBF) neural network (NN) and achieved an AUC of 0.91.[15] Instead of expert-determined ultrasound parameters, computer-extracted morphological or textural features have also been used in computer-aided diagnosis systems for evaluating thyroid nodules. In recent years, deep learning has become the go-to framework for many visual perception tasks in radiology and pathology, overtaking classical machine learning.[16] In a deep learning framework, features are learned directly by the computer and the extraction of hand-crafted features is not required. Recent work[17–21] has focused on comparing the diagnostic performance of deep-learning-based tools with that of radiologists in thyroid nodule differentiation. Although these studies achieved promising results, they did not include large numbers of the most challenging cohort, the Bethesda III images. We considered whether deep learning models could play a role in better diagnostic strategies for this group.
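For readers less familiar with the AUC metric quoted above, the following sketch computes it from its rank-based definition: the probability that a randomly chosen malignant case receives a higher model score than a randomly chosen benign case, with ties counted as one half. The function name, labels, and scores are illustrative toy values, not data from any of the cited studies.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0 (benign) or 1 (malignant)
    scores: iterable of model outputs, higher meaning more suspicious
    """
    labels = list(labels)
    scores = list(scores)
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example: four cases with made-up scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
print(roc_auc(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of cases, which is acceptable for small cohorts; library implementations use sorting for larger data sets.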

The primary aim of this study was to determine the ability of TI-RADS to predict malignancy in indeterminate nodules. The secondary aim was to determine whether a deep learning model based on a thyroid nodule USS data set could be used to assist in the diagnosis of our Bethesda III cohort. Here, we demonstrate how the malignancy risk in a cohort of 88 Bethesda III nodules can be effectively classified utilising TI-RADS and describe the performance of a purpose-built machine learning algorithm based on a large data set of mostly nonindeterminate cases.