Sources of data and participants
The database used to develop the predictive algorithm was the International Center for Nutritional Status Assessment (ICANS, University of Milan, Milan, Italy) database, which contains data from an ongoing large-scale open-cohort nutrition study. Per the study protocol, all patients undergo a complete nutritional assessment at baseline, lifestyle interventions (and, where indicated, pharmacological interventions) are prescribed, and follow-up examinations are scheduled. A more limited set of parameters is routinely collected at follow-up to assess changes in weight, body composition, and laboratory tests. The development of the algorithm included all prediabetic patients enrolled from 2009 to the beginning of 2019. The complete database contains 18,973 baseline observations and a total of 45,148 follow-up observations. In this study, we included a total of 59 variables from the database.
Patients included in this study were self-referred patients seeking a weight-loss program, mainly residing in Milan or nearby cities, and newly or recently diagnosed with prediabetes. Eligibility criteria were: age 18 years or older; not pregnant or breastfeeding; no conditions significantly limiting movement or physical activity; no severe cardiovascular, neurological, endocrine, or psychiatric disease; and prescription of lifestyle interventions only. The lifestyle intervention consisted of a low-calorie omnivorous diet with a Mediterranean pattern, with macronutrient and micronutrient levels set according to the Italian Recommended Daily Intake (5). Physical activity recommendations were also provided according to the WHO physical activity guidelines (6).
This study followed the principles established by the Declaration of Helsinki, and written informed consent was obtained from each subject. The Ethics Committee of the University of Milan (n. 6/2019) approved the study procedures.
Outcome and predictors
The outcome was normalization of blood glucose (binary, fasting blood glucose <100 mg/dL) within 1 year of starting the lifestyle intervention.
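As a minimal sketch, the binary outcome could be derived from the follow-up records roughly as follows; the data frame and column names (followup, patient_id, months_since_baseline, fasting_glucose_mg_dl) are hypothetical placeholders, not the actual ICANS variable names.

```r
library(dplyr)

# One row per follow-up visit; a patient is labeled "yes" if fasting
# glucose fell below 100 mg/dL at any visit within 12 months of baseline.
# All names below are illustrative assumptions.
outcome_df <- followup %>%
  filter(months_since_baseline <= 12) %>%
  group_by(patient_id) %>%
  summarise(normalized = any(fasting_glucose_mg_dl < 100, na.rm = TRUE)) %>%
  mutate(outcome = factor(if_else(normalized, "yes", "no"),
                          levels = c("yes", "no")))
```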
A total of 59 predictor variables were used in the analysis.
Demographic data: age, gender, education, occupation, marital status
Anthropometric measurements: height, weight, arm length, arm circumference, wrist circumference, waist circumference, biceps skinfold thickness, triceps skinfold thickness, subscapular skinfold thickness, upper-arm skinfold thickness, arm muscle area, arm fat area, body density, fat mass, lean mass
Bioimpedance analysis: intracellular water, extracellular water
Abdominal ultrasound examination: sternal subcutaneous adipose tissue, sternal visceral adipose tissue, abdominal subcutaneous adipose tissue, abdominal visceral adipose tissue
Indirect calorimetry: oxygen consumption, carbon dioxide production, respiratory quotient, resting energy expenditure
Medical history: family history, menstruation, pregnancy, dietary status, dietary history, physical activity, smoking, medications, clinical symptoms, weight history
Vital signs: heart rate, systolic blood pressure, diastolic blood pressure
Blood and urine tests: white blood cell count, red blood cell count, hemoglobin, mean corpuscular volume, glucose, total cholesterol, HDL cholesterol, LDL cholesterol, triglycerides, glutamate pyruvate transaminase, glutamate oxaloacetate transaminase, gamma-glutamyl transferase, thyroid stimulating hormone, creatinine, uric acid, urea
Statistical and machine learning analysis techniques
All patients eligible at the time of the study were included; the sample size was therefore determined by the available data, and no a priori sample-size calculation was performed.
For algorithms requiring complete data, k-nearest neighbor imputation (Gower’s distance, number of neighbors = 5) was used to impute missing data during the preprocessing phase.
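In tidymodels (the framework named in the Software paragraph below), this step can be expressed as a recipe; recipes::step_impute_knn() uses Gower's distance and five neighbors by default, matching the settings above. The names train_data and outcome are placeholders.

```r
library(recipes)

# Recipe with kNN imputation of all predictors
# (Gower's distance, 5 neighbors: the package defaults).
impute_rec <- recipe(outcome ~ ., data = train_data) %>%
  step_impute_knn(all_predictors(), neighbors = 5)
```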
Predictive performance was assessed by optimizing the correct classification rate (CCR) and the area under the receiver operating characteristic curve (AUROC). Between accuracy and discriminative ability, accuracy (i.e., maximizing the CCR) was chosen as the metric most relevant to clinical practice.
We compared several statistical and machine learning models using 10-fold cross-validation resampling. For models with tuning parameters, grids of candidate parameter combinations were evaluated with the same 10-fold cross-validation.
Before model selection, model-specific preprocessing steps were defined to ensure the best predictive ability for each model. All preprocessing steps were repeated within each cross-validation fold, so that the uncertainty arising from data-dependent (non-deterministic) preprocessing was propagated into the resampling estimates.
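A minimal sketch of this resampling setup with tidymodels is shown below; wrapping the recipe and a model specification in a workflow ensures every preprocessing step is re-estimated inside each fold. Here model_spec stands for any of the parsnip specifications listed in the next section, and the seed and grid size are illustrative assumptions.

```r
library(tidymodels)

set.seed(2019)                        # arbitrary seed for reproducibility
folds <- vfold_cv(train_data, v = 10) # 10-fold cross-validation

# accuracy = correct classification rate (CCR); roc_auc = AUROC
cls_metrics <- metric_set(accuracy, roc_auc)

wf <- workflow() %>%
  add_recipe(impute_rec) %>%  # preprocessing re-run inside each fold
  add_model(model_spec)       # a parsnip model with tune() parameters

tuned <- tune_grid(wf, resamples = folds, grid = 20, metrics = cls_metrics)
show_best(tuned, metric = "accuracy")
```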
Principal component analysis (PCA) was employed as an optional preprocessing step to reduce the dimensionality of the dataset. In these cases, PCA transformed the predictors into a smaller set of components designed to capture the maximum amount of information in the original variables. A potential advantage of this approach, beyond dimensionality reduction, is that the resulting components are uncorrelated, which mitigates the problem of correlated predictors in the dataset.
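Under the same assumptions as above, the optional PCA step could be added to a recipe as follows; the 90% variance threshold is illustrative, as the number of components retained is not reported here.

```r
library(recipes)

# Normalize predictors, then replace them with the principal components
# capturing (here, as an assumed threshold) 90% of total variance;
# the resulting component scores are uncorrelated by construction.
pca_rec <- recipe(outcome ~ ., data = train_data) %>%
  step_impute_knn(all_predictors(), neighbors = 5) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = 0.90)
```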
The following models were evaluated:
Logistic regression
Linear discriminant analysis
Quadratic discriminant analysis
Naive Bayes, tuned for kernel smoothness and Laplace correction
K-nearest neighbors, tuned for the number of neighbors, the distance-weighting function, and the order of the Minkowski distance
Ridge regression and LASSO, tuned for the amount of regularization and the proportion of the LASSO penalty
Decision tree (CART), tuned for the cost-complexity parameter, the maximum tree depth, and the minimum number of data points in a node required for a further split
Bagged trees, tuned for the same parameters as the CART model (cost-complexity, maximum tree depth, and minimum node size)
Random forest, tuned for the number of randomly selected predictors, the number of trees, and the minimum node size (see the specification sketch after this list)
Boosted trees, tuned for tree depth, number of trees, learning rate, number of randomly selected predictors, minimum node size, minimum loss reduction, proportion of observations sampled, and number of iterations before early stopping
Linear support vector machine, tuned for cost and margin
Single-layer neural network, tuned for the number of hidden units, the amount of regularization, and the number of epochs
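For illustration, two of the candidate models above could be specified in parsnip as follows, with tune() marking the parameters searched over the cross-validation grids. The engine choices ("ranger", "glmnet") are assumptions; the algorithm-specific packages actually used are listed in the Appendix.

```r
library(parsnip)
library(tune)

# Random forest: mtry = number of randomly selected predictors,
# trees = number of trees, min_n = minimum node size.
rf_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Ridge/LASSO: penalty = amount of regularization,
# mixture = proportion of the LASSO (L1) penalty.
glmnet_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")
```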
Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated for the best model as follows: sensitivity = TP/(TP + FN); specificity = TN/(TN + FP); PPV = TP/(TP + FP); NPV = TN/(TN + FN), where TP = true positive, TN = true negative, FP = false positive, and FN = false negative.
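With yardstick (part of tidymodels), these quantities follow directly from the out-of-fold predictions; preds, outcome, and .pred_class below are the assumed prediction data frame and its truth/estimate columns.

```r
library(yardstick)

# By default, yardstick treats the class corresponding to the first
# factor level as the event (positive) class.
conf_mat(preds, truth = outcome, estimate = .pred_class)  # 2x2 table
sens(preds, truth = outcome, estimate = .pred_class)      # TP / (TP + FN)
spec(preds, truth = outcome, estimate = .pred_class)      # TN / (TN + FP)
ppv(preds, truth = outcome, estimate = .pred_class)       # TP / (TP + FP)
npv(preds, truth = outcome, estimate = .pred_class)       # TN / (TN + FN)
```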
All statistical analyses were performed using R 4.1.1 (7). Model preprocessing, tuning, resampling, and fitting were performed with the tidymodels framework for R (see Appendix for algorithm-specific packages).