This study diverges from the conventional framework of a randomized controlled trial (RCT) typically employed in clinical investigations. Instead, it centers on an exploratory analysis and statistical appraisal of anonymized data acquired through random, voluntary participation. Consequently, adherence to the Standards for Reporting Implementation Studies (StaRI) checklist was ensured, as detailed in Supplementary File 1. Data collection adhered rigorously to the pertinent guidelines and regulations outlined in the “Ethics Approval and Consent to Participate” Section within the Declarations. Our data collection primarily aimed at capturing demographic and lifestyle information pertinent to depressive manifestations during the COVID-19 pandemic for subsequent statistical scrutiny. The analytical approach adopted in this study prioritizes the exploration of data-centric insights over clinical delineations. To this end, depression assessments relied on self-reported measures and validated non-clinical scores indicative of depression diagnosis thresholds. In the ensuing analysis, we conducted statistical examinations of the collected dataset, revealing correlations between certain lifestyle factors and depressive conditions.
Prior to commencing experimentation, all protocols were subjected to scrutiny and approval by the internal ethical board of the Simula Metropolitan Center for Digital Engineering (SimulaMet), followed by endorsement from the Norwegian ethical entity, the Regional Ethical Committee (REK) (accessible at https://rekportalen.no/#home/REK), under reference number 614685. Stringent adherence to the stipulations of the General Data Protection Regulation (GDPR) was observed throughout the study duration.
Study characteristics
Internet-based surveys are widely employed for data acquisition, enabling survey questions to reach designated participants via the World Wide Web (WWW)21,22,23,24,25,26. They utilize diverse channels such as email, website integrations, and social media platforms to provide respondents access to online surveys. Table 1 outlines the six primary characteristics that encapsulate our online survey framework.
Table 1 The attributes of our online survey.
Questionnaire preparation for online public survey
As noted above, online surveys deliver a set of questions to a targeted sample via the web, using channels such as email, website integrations, and social media to reach respondents. An effective survey design requires a judicious combination of open-ended and closed-ended questions to elicit comprehensive responses while maintaining survey efficiency. The formulation of the questions is pivotal in eliciting pertinent information from the selected respondents: by incorporating diverse question types, such as multiple-choice, dichotomous, matrix-form, or Likert-scale items, survey designers can tailor the questionnaire to the specific objectives of the study. Alignment between the research objectives and the survey questions is paramount to obtaining valid and meaningful data, and meticulous attention to the structural coherence of the questions is needed to capture essential details accurately. A clear understanding of the overarching purpose of the online survey facilitates the construction of well-organized and targeted questions, thereby enhancing the efficacy and relevance of the data obtained.
The survey questionnaire comprised open-ended and multiple-choice inquiries, formulated using Google Forms and disseminated through various social media platforms (e.g., Facebook, LinkedIn, WhatsApp) and electronic mail. Prior to commencing data collection, an online workshop was convened involving e-Health researchers (n = 8), e-Health professors (n = 5), experts in health policy, survey methodology, and data security (n = 4), healthcare professionals (e.g., physicians, nurses) (n = 4), a representative sample of participants (n = 20), and specialists in computer science and statistics (n = 3). This workshop facilitated the development of a comprehensive study roadmap and the formulation of an online questionnaire tailored to survey random participants. Questions within the questionnaire were stratified into three distinct categories based on their method of response: text box-based (for textual input), check box-based (enabling binary selection), and selection box-based (permitting the selection of multiple, non-mutually exclusive elements). Subsequently, the collected data were categorized into numeric and categorical groups, with all data uniformly converted into the float64 format to facilitate statistical processing. A detailed description of the survey questions is provided in Supplementary File 2.
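As an illustration of this conversion step, the following is a minimal sketch, assuming a hypothetical CSV export of the form responses (the file name and columns are placeholders, not the study's actual data), of how responses can be split into numeric and categorical groups and uniformly cast to float64 with pandas:

```python
import pandas as pd

# Hypothetical CSV export of the Google Forms responses
df = pd.read_csv("responses.csv")

# Separate the columns into numeric and categorical groups
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Encode categorical answers as integer codes, then cast every column to float64
for col in categorical_cols:
    df[col] = df[col].astype("category").cat.codes

df = df.astype("float64")
print(df.dtypes.unique())   # -> [dtype('float64')]
```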
Data collection
Over a span of three months, data were collected from 1,834 respondents between June 2021 and August 2021. The survey ensured anonymity by omitting any personally identifiable information such as names, emails, or official identifiers like Aadhaar ID, Passport Number, Voter ID, PAN ID, or similar government and private identifiers. Participation in the survey was voluntary and contingent upon electronically distributed informed consent, duly signed by the participants. The data collection process involved several steps. Initially, participant data were gathered and stored in Google Drive. Subsequently, the collected data were downloaded in comma-separated values (CSV) format. Following this, all data were transferred to secure storage for subsequent analysis. Data cleaning procedures were then implemented, primarily focusing on outlier identification through box-plot analysis. Entries falling outside the predetermined sample parameters, such as the targeted age range (≥ 18 and < 65), were removed, along with any duplicated entries. Further data refinement included standardizing units across the dataset for consistency.
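The cleaning steps described above could be sketched as follows; this is an illustrative reconstruction, where the file name, column names, and the 1.5·IQR rule mirroring the box-plot criterion are assumptions rather than the authors' exact procedure:

```python
import pandas as pd

df = pd.read_csv("responses.csv")            # hypothetical CSV export

# Keep only the targeted age range and drop duplicated entries
df = df[(df["age"] >= 18) & (df["age"] < 65)].drop_duplicates()

# Box-plot style outlier screening via the interquartile range (IQR)
def within_whiskers(series, k=1.5):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

for col in ["height_cm", "weight_kg"]:       # assumed numeric columns
    df = df[within_whiskers(df[col])]
```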
Body mass index (BMI) calculations were performed based on participants’ height and weight measurements, which were subsequently categorized into four distinct levels of body composition: underweight (0), normal weight (1), overweight (2), and obese (3). Participants were also stratified into three age groups: 0 (age ≥ 18 and age ≤ 40), 1 (age ≥ 41 and age ≤ 64), and 2 (age ≥ 65). To streamline the dataset, redundant features such as height, weight, city, age, and BMI were manually reviewed and removed, limiting the feature count to 40. Detailed information regarding the encoding of categorical variables in the dataset is provided in Supplementary File 3. The final dataset comprised 1,767 respondents, with 62% being male (mean µ = 45.32, standard deviation σ = ± 21.05) and 38% female (µ = 55.93, σ = ± 20.05) following data cleaning and correction processes. The dataset encompassed both numerical and categorical variables, necessitating the encoding of categorical columns into integer labels to facilitate efficient data processing. All survey questions were mandatory, resulting in no missing data instances.
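A sketch of this derived-feature construction follows; the BMI cut-offs (WHO-style 18.5/25/30) and the column and file names are assumptions made for illustration, since the exact thresholds are not stated above:

```python
import pandas as pd

df = pd.read_csv("responses_clean.csv")      # hypothetical cleaned export

# BMI = weight (kg) / height (m)^2, binned into the four levels used above
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2
df["bmi_level"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 30, float("inf")],
                         labels=[0, 1, 2, 3], right=False).astype("int64")

# Age groups: 0 (18-40), 1 (41-64), 2 (65+)
df["age_group"] = pd.cut(df["age"], bins=[18, 41, 65, float("inf")],
                         labels=[0, 1, 2], right=False).astype("int64")

# Drop the now-redundant raw features
df = df.drop(columns=["height_cm", "weight_kg", "bmi", "age", "city"])
```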
To assess the normality of the dataset, the Shapiro-Wilk test was employed, where a p-value below 0.05 indicated non-normality, leading to the rejection of the null hypothesis. Subsequently, our analysis revealed that the collected survey data did not adhere to a Gaussian distribution. Measures were implemented to uphold dataset security, ensure the anonymity of respondents, and safeguard against unauthorized data manipulation.
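For illustration, a column-wise Shapiro-Wilk check of this kind can be run with SciPy (α = 0.05; the file name is a placeholder):

```python
import pandas as pd
from scipy.stats import shapiro

df = pd.read_csv("survey_clean.csv")         # hypothetical cleaned dataset
alpha = 0.05

for col in df.columns:
    stat, p = shapiro(df[col])
    if p < alpha:                            # reject the null hypothesis of normality
        print(f"{col}: W = {stat:.3f}, p = {p:.4f} -> not normally distributed")
```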
Statistical analysis and data visualization
Correlation analysis was employed to ascertain the presence and strength of relationships between various features within the dataset. The correlation coefficient (r)27 served as a metric for quantifying the degree of association between features, ranging from − 1 to + 1. A value of r = +1 indicates a perfect positive linear relationship, where each increase in one variable corresponds to a proportional increase in the other; r = −1 signifies a perfect negative correlation, where an increase in one variable corresponds to a proportional decrease in the other; and r = 0 denotes no linear relationship between the variables. Given the non-normal distribution of the dataset, Spearman’s rank correlation coefficient was used to calculate r, and feature pairs with |r| ≥ 0.85 were flagged as strongly correlated so that redundant features could be eliminated. The Spearman’s rank correlation coefficient can be calculated as28:
$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
(1)
where \(r_s\) = Spearman’s rank correlation coefficient, \(d_i\) = the difference between the two ranks of each observation, and \(n\) = the number of observations.
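For illustration, redundant-feature screening with Spearman’s rank correlation could be implemented as follows (the 0.85 threshold matches the text; the data file is a placeholder):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("survey_clean.csv")                  # hypothetical cleaned dataset

# Spearman rank correlation matrix (suitable for non-normal data)
corr = df.corr(method="spearman").abs()

# For every pair with |r| >= 0.85, mark one of the two features for removal
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] >= 0.85).any()]
df_reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)
```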
To find the dependency between the dependent variable and the independent variables (or features), we used the ANOVA (Analysis of Variance) statistical test with the following hypothesis29:
if p < 0.05, the categorical variable has a significant influence on the numerical variable, and
if p ≥ 0.05, the categorical variable has no significant influence on the numerical variable.
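A one-way ANOVA check of this kind can be sketched with SciPy as below; the column names are hypothetical examples, not the study's variable names:

```python
import pandas as pd
from scipy.stats import f_oneway

df = pd.read_csv("survey_clean.csv")                  # hypothetical cleaned dataset

def anova_pvalue(frame, categorical, numerical):
    """One-way ANOVA of a numerical variable grouped by a categorical one."""
    groups = [g[numerical].values for _, g in frame.groupby(categorical)]
    return f_oneway(*groups).pvalue

p = anova_pvalue(df, categorical="sleep_quality", numerical="screen_time")  # assumed columns
print("significant influence" if p < 0.05 else "no significant influence", f"(p = {p:.4f})")
```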
In our analysis, we utilized Python data visualization libraries, including Matplotlib and Seaborn, to visually represent the dataset. Outlier analysis was conducted using box and whisker plots to assess locality, spread, and skewness. Additionally, various plotting methods such as histograms, bar charts, and density plots were used to examine the distribution of the data. Moreover, bar plots with cross-tabulations were employed to visualize dependencies between two features.
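A brief example of these plot types follows (Matplotlib/Seaborn; the column and file names are placeholders):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("survey_clean.csv")                  # hypothetical cleaned dataset

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.boxplot(y=df["screen_time"], ax=axes[0])                              # locality, spread, skewness
sns.histplot(df["screen_time"], kde=True, ax=axes[1])                     # distribution with density
pd.crosstab(df["age_group"], df["sleep_quality"]).plot.bar(ax=axes[2])    # cross-tabulated dependency
plt.tight_layout()
plt.show()
```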
Feature ranking and data labeling with clustering method
Three standard methods were employed in our analysis: SelectKBest utilizing the Chi-Square (χ2) statistic (chi2), principal component analysis (PCA), and ExtraTreeClassifier. These methods were used to assess the fitness score of the features, aiding in their ranking. As our dataset was unlabeled, we applied the standard K-Means clustering algorithm to label the dataset, owing to its ease of implementation and rapid convergence. Clustering is a widely employed unsupervised learning technique aimed at uncovering hidden patterns or relationships between data points based on shared attributes, and it is particularly valuable for drawing insights from large datasets. K-Means clustering was selected for its simplicity and efficiency, especially in scenarios involving large variable sets, as it converges more quickly than hierarchical clustering and yields tighter clusters. The optimal “K” value was determined through Silhouette scoring and the Elbow method. Notably, K-Means clustering is sensitive to scaling and follows the expectation-maximization approach: data points are assigned to clusters so as to minimize the sum of squared distances between the data points and the cluster centroids. The resultant clustering introduced an additional predictor column, yielding a total of 41 features (40 independent and one dependent). The problem of finding the optimal number of clusters in K-Means can be formulated as30,31,32:
$$\underset{S}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{x \in S_i} \left\lVert x - \mu_i \right\rVert^2$$
(2)
where \(S = \{S_1, \dots, S_k\}\) is the partition of the observations into \(k\) clusters, \(x\) = an observation data point, and \(\mu_i\) = the mean of the points in \(S_i\).
The pseudo code for determining the best “K” value and labeling the data is stated as follows:
Pseudo code: “K”-value determination based on clustering for data labeling
Step-1: Define input parameters – data, max_clusters = 10, scaling ∈ {True, False}, visualization ∈ {True, False}, and metric = 'euclidean'
Step-2: Define lists – n_clusters_list, silhouette_list
Step-3: if (scaling == True) Then
    scaled_data = convert_to_min_max(data)
else
    scaled_data = data
Step-4: For n_c = 2 to max_clusters do
    kmeans_model = KMeans(n_clusters = n_c).fit(scaled_data)
    labels = find_labels(kmeans_model)
    n_clusters_list.append(n_c)
    silhouette_list.append(silhouette_score(scaled_data, labels, metric = metric))
End
Step-5: Cross-verify the “K” value with the Elbow method
Step-6: Find the best parameters based on the defined lists
Step-7: Perform data labeling with the best model
Step-8: Visualize the best clustering against the number of clusters (n_c) and the Silhouette score
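A runnable counterpart of this pseudo code, sketched with scikit-learn (the data file is a placeholder and the fixed random_state is an assumption), is:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("survey_clean.csv")                  # hypothetical cleaned dataset
X = MinMaxScaler().fit_transform(df)                  # K-Means is sensitive to scaling

silhouettes, inertias = {}, {}
for n_c in range(2, 11):                              # candidate "K" values
    model = KMeans(n_clusters=n_c, n_init=10, random_state=42).fit(X)
    silhouettes[n_c] = silhouette_score(X, model.labels_, metric="euclidean")
    inertias[n_c] = model.inertia_                    # used for the Elbow cross-check

best_k = max(silhouettes, key=silhouettes.get)        # highest Silhouette score
df["cluster"] = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)
```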
Data classification
In this study, a pipeline methodology was employed, integrating Principal Component Analysis (PCA) with conventional machine learning algorithms for classification purposes. PCA aims to reduce the dimensionality of datasets that have numerous correlated variables while striving to retain maximal variance within the dataset. Within this pipeline framework, PCA played a crucial role in identifying the optimal feature set to achieve superior mean classification accuracy, F1-score, precision, recall, and Matthews Correlation Coefficient (MCC) value. The calculation of performance metrics is outlined in accordance with prior literature33,34:
$$\text{Accuracy (A)} = \frac{TP + TN}{TP + FP + FN + TN}, \quad 0 \le A \le 1$$
(3)
$$\text{Precision (P)} = \frac{TP}{TP + FP}$$
(4)
$$\text{Recall (R), Sensitivity (S), or True Positive Rate} = \frac{TP}{TP + FN}$$
(5)
$$\text{Specificity} = 1 - \text{False Positive Rate} = \frac{TN}{TN + FP}$$
(6)
$$\text{F1 score (F1)} = \frac{2 \cdot P \cdot R}{P + R}, \quad 0 \le F1 \le 1$$
(7)
$$\text{Matthews correlation coefficient (MCC)} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \quad -1 \le MCC \le +1.$$
(8)
Where TP: True Positive, TN: True Negative, FP: False Positive, and FN: False Negative.
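These metrics do not need to be computed by hand; an equivalent illustration with scikit-learn (placeholder labels, macro averaging for the multi-class case) is:

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

y_true = [0, 1, 1, 0, 2, 0, 2, 1]      # placeholder ground-truth labels
y_pred = [0, 1, 0, 0, 2, 1, 2, 1]      # placeholder predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1-score :", f1_score(y_true, y_pred, average="macro"))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
```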
The following standard machine learning algorithms were used for classification with 5-fold cross-validation, as they represent the most common families of algorithms and perform well on limited datasets with many features:
a) Support Vector Classifier (SVC): A support vector machine35 is a supervised linear machine learning algorithm most commonly used to solve classification problems, in which context it is also known as support vector classification. The SVC algorithm finds the best line or decision boundary; this best boundary or region is called a hyperplane. The algorithm identifies the points from the two classes that lie closest to the decision boundary; these points are called support vectors. The distance between these vectors and the hyperplane is called the margin, and the goal of SVC is to maximize this margin. The hyperplane with the largest margin is called the optimal hyperplane. A kernel function in SVC is a method for processing data that takes data as input and transforms it into the desired form; it returns the inner product between two points in the standard feature dimension. Usually, the training dataset is transformed so that a non-linear decision surface becomes a linear equation in a higher-dimensional space. SVC uses the following kernels – Linear, Gaussian, Gaussian Radial Basis Function (RBF), Sigmoid, and Polynomial. The mathematical model behind the SVC (the dual form of the soft-margin problem, with regularization parameter \(\lambda\)) can be defined as:
$$\text{maximize } f(c_1, c_2, \dots, c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i \left( x_i \cdot x_j \right) y_j c_j$$
subject to
$$\sum_{i=1}^{n} c_i y_i = 0, \;\text{and}\; 0 \le c_i \le \frac{1}{2n\lambda} \quad \forall i.$$
(9)
b) DecisionTreeClassifier: A decision tree36 is a supervised machine learning algorithm that uses a set of rules, similar to how humans make decisions. Decision trees capture knowledge in the form of a tree, which can also be rewritten as a discrete set of rules for better understanding. The intuition behind decision trees is that the algorithm poses yes/no questions about the data and keeps partitioning the dataset until all data points belonging to each category are isolated. Both the Gini index and entropy are measures of node impurity for classification: the Gini index has a maximum impurity of 0.5 and a maximum purity of 0, while entropy has a maximum impurity of 1 and a maximum purity of 0. Multi-class nodes are impure, while single-class nodes are pure. Entropy is more expensive to compute because of the logarithm in its equation.
c) Naïve Bayes: The naïve Bayes37 algorithm is a supervised learning algorithm based on Bayes’ theorem for solving classification problems. Naïve Bayesian classifiers are among the simplest and most effective classification algorithms, helping to build fast machine-learning models that can make quick predictions. It is a probabilistic classifier, which means it makes predictions based on the probability of an object belonging to a class, and it assumes that a particular feature occurs independently of the other features. The mathematical expression of naïve Bayes is:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \quad \text{where } P(B) = \sum_{A} P(B \mid A)\, P(A)$$
(10)
where \(P(A \mid B)\) = posterior, \(P(B \mid A)\) = likelihood, \(P(A)\) = prior, and \(P(B)\) = normalizing constant.
d) K-Nearest Neighbor (KNN): K-Nearest Neighbors38 is one of the simplest machine-learning algorithms based on supervised learning techniques. It stores all available data and classifies new data points based on similarity: a new case is assumed to be similar to the available cases and is placed in the category it most closely resembles, so that incoming data can easily be assigned to the appropriate class. KNN is a nonparametric algorithm, meaning it makes no assumptions about the underlying data. It is called a lazy learner because it does not immediately learn from the training set; instead, it stores the dataset and performs its computations at classification time. The mathematics behind KNN is:
$$P(y = j \mid X = x) = \frac{1}{K} \sum_{i \in \mathcal{A}} I\left( y^{(i)} = j \right)$$
where \(\mathcal{A}\) denotes the set of the K nearest neighbours of \(x\) and \(I(\cdot)\) is the indicator function.
(11)
We split the dataset into training, validation, and test sets with a ratio of 60:20:20 (and a random state of 42). We used the grid-search technique to find the best parameters for each classifier. Finally, we used validation and learning curves for model verification.
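A sketch of such a pipeline is shown below with the SVC as an example; the other classifiers (DecisionTreeClassifier, GaussianNB, KNeighborsClassifier) can be swapped into the "clf" step. The file name, parameter grid, and PCA component counts are illustrative assumptions, not the study's exact settings:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

df = pd.read_csv("survey_labeled.csv")                      # hypothetical labeled dataset
X, y = df.drop(columns=["cluster"]), df["cluster"]

# 60:20:20 train/validation/test split (random state 42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
                                                  random_state=42, stratify=y_train)

pipe = Pipeline([("scale", MinMaxScaler()), ("pca", PCA()), ("clf", SVC())])
param_grid = {"pca__n_components": [10, 20, 30],
              "clf__kernel": ["linear", "rbf"],
              "clf__C": [0.1, 1, 10]}

# 5-fold cross-validated grid search on the training portion
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy").fit(X_train, y_train)
print("best parameters:", search.best_params_)
print("validation accuracy:", search.score(X_val, y_val))
print("test accuracy:", search.score(X_test, y_test))
```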
Classification explanation
In this study, we used the LIME algorithm39 to generate explanations that are easily understandable to humans, enabling users to gain insights into the model’s decision-making process. These explanations aim to help understand why a model made a particular prediction for a specific instance or observation. LIME works by approximating the behavior of a black-box model in the local vicinity of the instance being explained.
LIME offers instance-level explanations, focusing on individual predictions rather than overall model behavior. It is versatile across various machine learning models, including deep neural networks and ensemble methods, ensuring interpretability regardless of model complexity. By training interpretable surrogate models on perturbed data, LIME approximates the behavior of the black-box model locally, and it generates saliency-map style visualizations of feature importance in which brighter colors signify higher feature influence. This approach enhances model interpretability, which is crucial for understanding complex predictions. The LIME color codes used in this context are described below –
Orange color: In the feature importance visualization, orange is commonly used to highlight features that have a significant impact on the model’s prediction for the instance being explained. These features are considered the most influential in determining the output of the black box model.
Red color: Red may indicate significant importance but slightly less than orange. Features highlighted in red still hold considerable influence on the prediction for a specific class but may not be as dominant as those depicted in orange.
Blue color: Conversely, blue is used to denote features that have a minimal or negligible effect on the model’s prediction. These features are deemed less important in influencing the output of the black-box model and may have little impact on the final prediction.
Green color: LIME uses green to symbolize the original prediction made by the black-box model for the instance under consideration. This serves as the baseline for comparison with the explanations provided by LIME.
Yellow color: In saliency maps, yellow is commonly used to denote regions of high saliency, indicating areas of the input space that have the greatest influence on the model’s prediction. These regions are considered crucial in shaping the model’s decision for the instance under consideration.
Purple color: Purple is used to denote regions of low saliency, indicating areas of the input space that have minimal influence on the model’s prediction. These regions are considered less important in determining the output of the black-box model and may have little impact on the final prediction.
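A minimal sketch of how such instance-level explanations are typically produced with the lime package is shown below. The synthetic data and the RandomForestClassifier are stand-ins for the survey feature matrix and the fitted study classifiers, and the feature and class names are invented for the example:

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the 40-feature survey matrix
X, y = make_classification(n_samples=500, n_features=40, n_informative=10, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
model = RandomForestClassifier(random_state=42).fit(X, y)

explainer = LimeTabularExplainer(
    training_data=X,                       # background data used for perturbation
    feature_names=feature_names,
    class_names=["cluster_0", "cluster_1"],
    discretize_continuous=True,
)

# Explain a single prediction (instance-level explanation)
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=10)
print(exp.as_list())           # (feature condition, local weight) pairs
# exp.show_in_notebook()       # colour-coded visualization in a Jupyter environment
```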
Ontology modeling
We developed an ontology model using the Protégé (v. 5.x) open-source software to encapsulate insights derived from the survey dataset. Visualization of the ontology was achieved through the OWLViz tool within Protégé. In this object-oriented representation, owl:Thing serves as the overarching parent class, with arrows denoting hierarchical (IS-A) relationships between the concepts40,41. The ontology encompasses various elements, including classes, objects, properties, relationships, and axioms. Properties are categorized into two types: ObjectProperties and DataProperties, each with defined domain scopes, restriction rules, filters, and types such as Some (existential), Only (universal), Min (minimum cardinality), Exact (exact cardinality), and Max (maximum cardinality). A detailed explanation of our designed ontology is provided in Textbox 1. Our OWL ontology adheres to distinct knowledge representation phases, including abstraction for rule mapping, abduction for hypothesis generation, deduction for operator-reductor rules, and induction for generalization. The object-oriented class structure of the OWL ontology is presented in Supplementary File 4, formatted in Turtle (“TTL”) for improved readability. The following general steps for semantic ontology design and development were followed in this study:
Domain identification, to define the scope of the ontology and model the associated concepts (classes) and their associations.
Knowledge and requirement gathering on the domain, including relevant literature, expert opinions, and existing ontologies; this information is used to identify the concepts and relationships that need to be modeled in the ontology.
Definition of the ontology structure, with the classes, properties (object and data), axioms, and relationships that will be used to represent the concepts and their interrelationships.
Ontology development with well-established editors or software tools; this involves creating the classes, attributes, and relationships defined in the previous step and adding instances to the ontology to illustrate the concepts.
Structural consistency checking of the ontology.
Validation of the ontology with a real or simulated dataset of individuals (objects) to ensure that it accurately represents the domain and can be used for its intended purpose; this may involve using the ontology to perform tasks such as classification, retrieval, or inference.
Refinement and updating of the ontology with new knowledge, to keep it current and useful.
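To make the workflow concrete, here is a small, purely illustrative Python sketch (using the rdflib package) of how a few survey-related classes, properties, and an individual could be expressed and serialized to Turtle; the names are invented for the example and do not reproduce the ontology in Supplementary File 4:

```python
from rdflib import Graph, Literal, Namespace, OWL, RDF, RDFS
from rdflib.namespace import XSD

g = Graph()
EX = Namespace("http://example.org/depression-survey#")   # placeholder namespace
g.bind("ex", EX)
g.bind("owl", OWL)

# Classes (IS-A hierarchy under owl:Thing)
for cls in (EX.Participant, EX.LifestyleFactor, EX.DepressionIndicator):
    g.add((cls, RDF.type, OWL.Class))
    g.add((cls, RDFS.subClassOf, OWL.Thing))

# ObjectProperty linking participants to lifestyle factors
g.add((EX.hasLifestyleFactor, RDF.type, OWL.ObjectProperty))
g.add((EX.hasLifestyleFactor, RDFS.domain, EX.Participant))
g.add((EX.hasLifestyleFactor, RDFS.range, EX.LifestyleFactor))

# DataProperty for the encoded BMI level
g.add((EX.hasBMILevel, RDF.type, OWL.DatatypeProperty))
g.add((EX.hasBMILevel, RDFS.domain, EX.Participant))
g.add((EX.hasBMILevel, RDFS.range, XSD.integer))

# An individual (instance) used to validate the model
g.add((EX.participant_001, RDF.type, EX.Participant))
g.add((EX.participant_001, EX.hasBMILevel, Literal(1, datatype=XSD.integer)))

print(g.serialize(format="turtle"))                       # TTL output
```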
Textbox 1 The ontology expression.