Integrating multi-platform gene expression data and machine learning assisted biomarker discovery in colorectal cancer
*Corresponding author: E-mail address: salehw@kfupm.edu.sa (S. Alwahaishi)
Abstract
Public repositories host a wealth of gene expression datasets, most of which come from microarray platforms, while more recent studies increasingly use high-throughput RNA sequencing (RNA-Seq) for its better specificity and sensitivity. This study proposes an approach that combines diverse gene expression data from multiple colorectal cancer (CRC) datasets generated using high-throughput sequencing and microarray technologies. The data integration increases the statistical power and the biological relevance of our findings.
We employed least absolute shrinkage and selection operator (LASSO) regression for feature selection on the combined dataset to reduce the dimension of the data and retain only robust gene signatures associated with colorectal cancer. The chosen features were subjected to functional enrichment analysis and served as input to multiple classifiers. We then applied five machine learning and two deep learning models to identify the most effective genes present across all seven classification algorithms. Metrics such as F1 score, accuracy, sensitivity, and specificity were used to assess the models’ performance. The models were evaluated on an external dataset obtained from The Cancer Genome Atlas (TCGA) database.
Random forest and one-dimensional convolutional neural networks (1D-CNNs) were the most effective models, achieving the highest accuracies. Every model also exceeded 90% accuracy when tested on the TCGA dataset. Finally, we identified carbonic anhydrase 7 (CA7), ATP binding cassette subfamily A member 8 (ABCA8), somatostatin (SST), myomesin 1 (MYOM1), CC motif chemokine ligand 23 (CCL23), procollagen C-endopeptidase enhancer 2 (PCOLCE2), and CXC motif chemokine ligand 10 (CXCL10) as potential prognostic biomarkers of CRC. This study presents a data integration and machine learning approach for finding biomarkers in CRC. The identified gene panel shows promise as a diagnostic tool and needs further validation in clinical settings.
Keywords
Bioinformatics
Colorectal cancer
Gene expression
Machine learning
Predictive biomarker
1. Introduction
Colorectal cancer (CRC) is the third most commonly diagnosed cancer worldwide and the second leading cause of cancer-related mortality, making it a major public health concern (Sung et al., 2021). It is among the most prevalent cancers in Saudi Arabia, where its incidence and mortality rates in 2020 were 14.6% and 1.48% among all cancers, respectively (Alessa and Khan 2024). Early and alternative detection methods and risk stratification are therefore important, since histopathological images cannot resolve the molecular nature of the disease.
The development of clinically useful molecular markers for diagnosis, prognosis, and prediction has been made possible by the molecular pathology of cancer, which has also enhanced our understanding of carcinogenesis (Sarhadi and Armengol 2022). More recently, cancer classification and biomarker identification based on gene expression data have made it possible to differentiate healthy from diseased samples (Jopek et al., 2024). RNA-Seq gene expression data have previously been used to classify cancer accurately, with a significant impact on disease diagnosis and prognosis (Tran et al., 2021). Even though high-throughput transcriptomics enables the screening of diagnostic markers and new prognostic biomarkers, integrating data across different platforms remains a challenge (Bottomly et al., 2011; Sîrbu et al., 2012). An integrative approach has to combine different microarray platforms (Leek et al., 2010), which come mainly from two manufacturers, Affymetrix (Gohlmann and Talloen, 2009) and Illumina (Inc), with RNA-Seq data from high-throughput sequencers, primarily Illumina. The resulting batch effects should be addressed with correction tools such as ComBat (Johnson et al., 2007). These technical inconsistencies may hide the true biological information and pose significant challenges to classifying different stages. Thus, proper standardization, batch adjustment, and careful feature selection are needed to adopt this integrative approach.
Machine learning (ML) tools have assisted in discovering biomarkers and predicting a range of diseases. For example, the least absolute shrinkage and selection operator (LASSO), a regularized regression model, has been used to identify biomarkers in various cancers (Fujii et al., 2021). Additionally, support vector machines (SVM) have been used in radiomics-based models for the diagnosis of lung adenocarcinoma (Cai et al., 2021); gradient boosting machines (GBM) and random forest (RF) have been used in microbial studies (Bakir-Gungor et al., 2022; Toth et al., 2019); and artificial neural network (ANN) analysis has been applied to immune-related signatures (Chen et al., 2021).
The objective of this study is to combine microarray and high-throughput sequencing datasets. Samples were subjected to preprocessing, including strict quality control settings that differed between the microarray and NGS platforms. The integration retained genes that had a common annotation across all series. Following batch effect correction and batch merging, the combined datasets were normalized before feature selection and machine learning. Our multi-platform integration of CRC expression datasets, combined with feature selection and machine learning models, is intended to reveal biomarkers of potential interest.
2. Methods
2.1 Inclusion criteria
We integrated transcriptomic datasets from RNA-Seq and microarray technologies. The search term “Colorectal cancer” was used to retrieve relevant studies from the Gene Expression Omnibus (GEO) database. Datasets were selected using the inclusion criteria listed below:
a) The datasets must originate from Homo sapiens.
b) The datasets must specifically exclude cell line-based investigations and only include total RNA extracted from colorectal tissue and nearby healthy regions.
c) The two distinct dataset groups must adhere to a standard sample-collecting procedure.
d) Samples having mutations, induced gene expression, therapeutic interventions, medication treatments, or gene knockdowns should not be included in the datasets.
Following the selection criteria above, we chose five microarray datasets: GSE25070, GSE37182, GSE41328, GSE74602, and GSE8671. GSE25070, GSE37182, and GSE74602 were generated on Illumina expression BeadChip platforms (HumanRef-8 v3.0, HumanHT-12 V3.0, and HumanRef-8 v2.0, respectively), while GSE41328 and GSE8671 used the Affymetrix Human Genome U133 Plus 2.0 Array. Two RNA-Seq datasets, GSE142279 and GSE50760, were also selected. The individual datasets and their sample numbers are listed in Table 1.
| Experimental type | Dataset ID | Platform | CRC | Normal | Total number |
|---|---|---|---|---|---|
| Microarray | GSE25070 | GPL6883 Illumina HumanRef-8 v3.0 expression beadchip | 26 | 26 | 52 |
| Microarray | GSE37182 | GPL6947 Illumina HumanHT-12 V3.0 expression beadchip | 84 | 88 | 172 |
| Microarray | GSE41328 | GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 10 | 10 | 20 |
| Microarray | GSE74602 | GPL6104 Illumina humanRef-8 v2.0 expression beadchip | 30 | 30 | 60 |
| Microarray | GSE8671 | GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 32 | 32 | 64 |
| RNA-Seq | GSE142279 | GPL20795 HiSeq X Ten (Homo sapiens) | 20 | 20 | 40 |
| RNA-Seq | GSE50760 | GPL11154 Illumina HiSeq 2000 (Homo sapiens) | 18 | 18 | 36 |
2.2 Data processing
We used the GEOquery R package (Davis and Meltzer 2007) from Bioconductor to download the microarray datasets, which span different technologies and platforms, from the GEO repository. Data processing matters because it turns the raw data matrix into a clean dataset, which improves the quality of the information. Standard transcriptomic preprocessing steps include imputation of missing data, filtering, transformation, and sample-based normalization. The accuracy, precision, and robustness of analyses can be greatly increased by using the right tools, techniques, and datasets. We used the arrayQualityMetrics R package, which applies six quality tests, to eliminate low-quality samples and outliers from the microarray datasets. Duplicate samples were removed from the RNA-seq datasets. The robust multi-array average (RMA) algorithm was used to normalize the data, which were highly variable because of the different platforms in our study (Bolstad et al., 2003); it adjusts the background, normalizes the microarray data by quantiles, and summarizes the data. The rma function from the R packages “affy” (Gautier et al., 2004) and “oligo” (Carvalho et al., 2013) was used to pre-process data from Affymetrix microarrays, while data from Illumina platforms were processed with the “lumi” package (Du et al., 2008). The annotate R package (Gentleman et al., 2013) was used for gene annotation and conversion of Entrez IDs to official gene symbols. For RNA-seq data, the raw count matrix was obtained from its source repository, and the biomaRt R package was used to annotate the Ensembl IDs and convert them into gene symbols (Smedley et al., 2015; Howe et al., 2021). The cqn package corrected GC content bias and normalized the data (Hansen et al., 2012), while the NOISeq package was used to compute the gene expression values (Tarazona et al., 2015).
2.3 Gene expression integration
Probes or transcripts that share the same gene identifier can be summarized by their mean or median, which is essential for consistent data analysis. A log2 transformation was applied to datasets that had not already been log2-scaled. Transcripts mapping to the same gene were pooled by taking their average within each series separately.
The depth of each microarray platform was examined, with datasets grouped consistently by platform ID. Images from microarray scanners are digitized at 16-bit depth with known saturation limits, whereas RNA-seq-derived expression values can cover a wider numeric range. We therefore applied an unsupervised, monotonic linear rescaling to match the dynamic ranges of the datasets. This rescaling changes only the unit of measurement, so it preserves rank information and fold-change ordering, and it was done before quantile normalization and ComBat batch effect correction. The approach aligns intensity levels across platforms and provides point estimates for microarray intensities aligned to RNA-seq distributions. The dynamic-range harmonization (bit-depth-aware scaling) was applied according to the maximum value of each series and rescales each dataset to a common [0,1] range. It does not change the ordering of values or the ranks within a dataset, and it does not use outcome labels. A similar approach has been used before when integrating heterogeneous gene expression data (Castillo et al., 2017; Galvez et al., 2019). The resulting bit depths by platform were: 16-bit for the Human Genome U133 Plus 2.0 Array (GSE41328 and GSE8671), 16-bit for the Illumina BeadChip platforms (GSE25070, GSE37182, and GSE74602), 20-bit for the HiSeq 2000 (GSE50760), and 24-bit for the HiSeq X Ten (GSE142279). Each platform was thus assigned a specific bit depth for standardizing its sample data. Data integrity was maintained by selecting only genes that appeared in all datasets. After filtering to the common genes, batch effect correction was performed with the ComBat method from the sva R package to obtain a uniform distribution of samples across all batches. Inter-array normalization of expression values across samples used the normalizeBetweenArrays function from the limma R package to make samples comparable and avoid classification errors in the later machine learning steps.
Quantile normalization applies the same empirical distribution to all samples, so that every sample is on the same scale as the others. The integrated dataset was formed by merging n samples from N classes with p common genes.
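The two rescaling ideas in this subsection can be illustrated with a minimal sketch. It assumes each dataset is divided by the largest value representable at its nominal bit depth, which is one plausible reading of the bit-depth-aware scaling described above, and it implements a plain quantile normalization rather than the limma routine. The function names and the input values are hypothetical.

```python
from statistics import mean


def rescale_by_bit_depth(values, bit_depth):
    """Linearly map values onto [0, 1] using the nominal range of the platform.

    bit_depth is an assumed per-platform setting (e.g. 16 for scanner images).
    The map is monotonic, so ranks within the dataset are unchanged.
    """
    top = 2 ** bit_depth - 1
    return [min(max(v, 0), top) / top for v in values]


def quantile_normalize(samples):
    """Force every sample (a list of per-gene values) onto one shared distribution.

    The shared distribution is the mean of the sorted values at each rank.
    Ties are broken by gene order, which is a simplification.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [mean(s[r] for s in sorted_samples) for r in range(n_genes)]
    normalized = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical raw values for three genes measured on two platforms.
array_sample = rescale_by_bit_depth([120, 30000, 65535], bit_depth=16)
rnaseq_sample = rescale_by_bit_depth([15, 900000, 1048575], bit_depth=20)
harmonized = quantile_normalize([array_sample, rnaseq_sample])
```

After `quantile_normalize`, both samples contain the same set of values and differ only in which gene holds which value, which is the sense in which every sample shares one empirical distribution.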
2.4 Feature selection
Feature selection is a crucial step for genomic datasets: it streamlines classification by eliminating superfluous features. By identifying the most important characteristics of a high-dimensional dataset, it reduces the workload of the classifier and can thereby increase classification accuracy. We used LASSO regression to reduce the number of genes (features) carried into further analysis. The LASSO model was fitted to the training dataset with 10-fold cross-validation (CV) using the R package glmnet version 4.1-3. The L1 penalty allows LASSO to shrink some of the estimated coefficients exactly to zero, so features with non-zero regression coefficients are automatically selected as informative (Tibshirani 1996). The data were randomly divided into ten sets; nine were used for training, and the remaining set was used for testing. First, the cv.glmnet function was used with 10-fold CV to compute the regularization parameter lambda (Ahmed et al., 2022). The final model was then constructed using glmnet, with the lambda value corresponding to the maximum area under the curve (AUC; lambda.min) selected for regularization. The expression data of genes with non-zero coefficients were used to create the final LASSO model.
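To show why the L1 penalty sets some coefficients exactly to zero, the following is a minimal sketch of LASSO fitted by coordinate descent. It is not the glmnet implementation: it assumes a squared-error (Gaussian) loss with centered predictors and no intercept instead of the binomial model, takes lambda as given instead of choosing it by cross-validation, and uses a small hypothetical dataset.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows with centered columns; y is a centered response.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return beta


# Hypothetical data: the response depends on the first feature only.
X = [[-1.5, 0.5, 1.0], [-0.5, -1.0, -1.0], [0.5, 1.0, -1.0], [1.5, -0.5, 1.0]]
y = [2.0 * row[0] for row in X]
beta = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

In this toy case only the first feature keeps a non-zero coefficient, and that coefficient is shrunk below its true value of 2, which is the trade-off the regularization parameter controls.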
2.5 GO-KEGG pathway enrichment analysis
Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) pathway enrichment analyses were performed to investigate the biological pathways and functions of the genes obtained in the previous step. Functional analysis was conducted with the Database for Annotation, Visualization, and Integrated Discovery (DAVID); KEGG pathways and GO terms with an adjusted P < 0.05 were considered significantly enriched (Kanehisa and Goto 2000).
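The statistical idea behind over-representation analysis can be sketched with a plain hypergeometric tail probability. This is an illustration only: DAVID reports a more conservative, modified Fisher exact statistic (the EASE score) and uses its own background, and the counts below are hypothetical rather than taken from this study.

```python
from math import comb


def enrichment_p_value(n_background, n_term, n_selected, n_hits):
    """Probability of seeing at least n_hits term genes among n_selected genes
    drawn without replacement from n_background genes, n_term of which carry the term."""
    upper = min(n_term, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_selected - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total


# Hypothetical example: 4 of 42 selected genes carry a term that annotates
# 100 genes in a background of 7880.
p_value = enrichment_p_value(n_background=7880, n_term=100, n_selected=42, n_hits=4)
```

A multiple-testing adjustment would still be applied across all tested terms before comparing against the 0.05 cut-off.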
2.6 Classification through traditional machine learning classifiers
We employed five classification models to evaluate the performance of the LASSO-selected features, namely random forest (RF) (Rigatti 2017), support vector machines (SVM) (Suthaharan and Suthaharan 2016), eXtreme Gradient Boosting (XGBoost) (Chen et al., 2015), naive Bayes (NB) (Webb et al., 2010), and k-nearest neighbors (KNN) (Guo et al.,). We applied these ML models in the R environment with the caret (Kuhn 2011), xgboost (Chen et al., 2015), pROC (Robin et al., 2021), e1071 (Meyer et al., 2019), and randomForest (RColorBrewer and Liaw, 2018) packages. To ensure reproducibility, we fixed the random seed to 123. The integrated dataset was divided by a stratified 80:20 split into training and test sets, with class proportions preserved in each. The test set was held out and used only once, for final evaluation. A 10-fold cross-validation (K-fold CV, K = 10) was applied to the training set to tune the hyperparameters of each classifier. Hyperparameters were chosen based on the mean AUROC (area under the receiver operating characteristic curve) from the inner 10-fold CV, and the 1-SE rule was used to select the simplest configuration within one standard error of the best one. Specifically, the tuned parameters were: Laplace smoothing (σ) and usekernel (Gaussian Naïve Bayes) in Naïve Bayes; the booster, objective, nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, and subsample in XGBoost; the number of trees in RF; k in k-NN; and γ in SVM. The objective was to select the best values for each procedure to improve model performance on unseen data. The classification models were developed using the selected genes and the training dataset and subsequently evaluated on the test dataset. We made pairwise comparisons of model performance on the test dataset using DeLong’s test for AUROC and McNemar’s test for paired accuracy.
F1 performance was calculated at the fixed decision thresholds derived from the training data, with bootstrap 95% confidence intervals (B = 2000). Holm adjustment was applied when more than one pairwise comparison was made.
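The stratified split and the bootstrap interval for F1 can be sketched as follows. This is a standard-library illustration, not the caret workflow used in the study: the helper names are hypothetical, the percentile method is assumed for the confidence interval, and the predictions are made up. Only the seed (123), the 80:20 ratio, B = 2000, and the class sizes (203 tumor, 213 normal) come from the text.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=123):
    """Return (train_idx, test_idx) with class proportions preserved in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def f1_score(y_true, y_pred, positive="Tumor"):
    """F1 for the positive class; returns 0.0 when there are no positives at all."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, seed=123):
    """Percentile 95% interval for F1 from resampling test samples with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot) - 1]


labels = ["Tumor"] * 203 + ["Normal"] * 213
train_idx, test_idx = stratified_split(labels)

# Made-up predictions on the held-out part: one tumor sample is misclassified.
y_true = [labels[i] for i in test_idx]
y_pred = list(y_true)
y_pred[y_true.index("Tumor")] = "Normal"
ci_low, ci_high = bootstrap_f1_ci(y_true, y_pred)
```

With these class sizes the sketch holds out 84 samples and trains on 332.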
2.7 Deep learning models
We used two deep learning models, (i) a 1-dimensional convolutional neural network (1D-CNN) and (ii) a multilayer perceptron (MLP), to train on our integrated data and extract important features. The features obtained from LASSO regression were used as input for classification by both models. The deep neural network models were trained using the Adam optimizer. The learning rate was selected via a small grid using a training-only internal validation split. To minimize overfitting and improve generalization, L2 weight decay, dropout, and early stopping on validation loss were employed. The binary classification uses a softmax activation function in its output layer. The test dataset was left untouched until the final evaluation. We loaded the reticulate library (v1.41) and configured R to use a Python installation. TensorFlow (v2.16) (Developers 2022) was used in combination with the Keras (v2.1.5) R package (Arnold 2017) to design the neural networks.
2.8 1-Dimensional convolutional neural network
A convolutional neural network (CNN) is a type of deep learning model that automatically learns spatial hierarchies of features through convolutional layers. A standard CNN architecture consists of convolutional and pooling layers, as well as regularization mechanisms such as batch normalization and dropout, to enhance generalization. First, the gene expression data were normalized and then reshaped into a three-dimensional tensor, ready to be processed sequentially by the 1D-CNN model. The network consists of two convolutional layers with 64 and 128 filters, respectively. Both layers use ReLU activation and a kernel size of 3. The resulting tensor is then flattened and passed through a dense layer (64 units with the ReLU activation function) (LeCun et al., 2015). The model was trained for up to 100 epochs with a learning rate of 0.0005, with early stopping to prevent overfitting. Multiple evaluation metrics, including accuracy, sensitivity, specificity, F1-score, and area under the receiver operating characteristic curve (AUC), were used to assess the model’s performance.
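What a single convolutional layer with a kernel size of 3 and ReLU activation computes can be sketched in a few lines. This is a conceptual illustration and not the Keras model: it assumes one input channel, stride 1, and no padding, none of which is stated in the text, and the kernel weights are hypothetical.

```python
def conv1d_relu(signal, kernels, biases=None):
    """Slide each kernel over the signal (stride 1, no padding) and apply ReLU.

    As in deep learning libraries, the kernel is not flipped, so this is a
    cross-correlation. Returns one feature map per kernel.
    """
    biases = biases or [0.0] * len(kernels)
    feature_maps = []
    for kernel, bias in zip(kernels, biases):
        width = len(kernel)
        fmap = []
        for start in range(len(signal) - width + 1):
            window = signal[start:start + width]
            activation = sum(w * x for w, x in zip(kernel, window)) + bias
            fmap.append(max(0.0, activation))
        feature_maps.append(fmap)
    return feature_maps


# Hypothetical expression vector and two hand-written kernels of size 3.
expression = [1.0, 2.0, 3.0, 4.0, 5.0]
maps = conv1d_relu(expression, [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]])
```

Under these assumptions an input of 42 LASSO-selected genes would give feature maps of length 40 after the first layer, since each kernel of size 3 fits into the vector in 42 - 3 + 1 positions.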
2.9 Multilayer perceptron
The MLP deep neural network contains three dense layers with 128, 64, and 32 hidden units that use ReLU activation functions. The model was trained at 100 epochs with 128 batches per epoch and 20% validation data (Popescu et al., 2009). For the 1D-CNN, feature importance was determined from the absolute values of the weights of the first convolutional layer: each feature receives an importance score equal to the sum of its absolute weights across all filters. The features were then ranked from the highest to the lowest score. The 25 leading features were displayed in bar plots and stored for further analysis.
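The weight-based importance scoring described above can be sketched as follows. How first-layer weights are attributed to individual genes depends on the input layout, which the text does not specify, so this sketch simply assumes a table with one row of weights per feature and one column per filter. The gene names and weights are placeholders, not study results.

```python
def rank_features_by_abs_weight(feature_names, weights, top_n=25):
    """Score each feature by the sum of its absolute weights across filters
    and return the top_n (name, score) pairs, highest score first."""
    scores = {
        name: sum(abs(w) for w in row)
        for name, row in zip(feature_names, weights)
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Placeholder features with two filters each (assumed layout: weights[feature][filter]).
names = ["GENE_A", "GENE_B", "GENE_C"]
weights = [[0.5, -0.25], [0.25, 0.25], [-1.0, 0.5]]
ranking = rank_features_by_abs_weight(names, weights)
```

Because the score uses absolute values, a strongly negative weight counts as much as a strongly positive one.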
2.10 Model validation using TCGA GDC dataset
We validated the models using transcriptomic RNA-Seq data of CRC samples from The Cancer Genome Atlas (TCGA) Genomic Data Commons (GDC) portal (https://portal.gdc.cancer.gov/). The TCGAbiolinks R package (Colaprico et al., 2016) was used to download the data, and the Ensembl IDs were mapped to HGNC gene symbols with the biomaRt package (Durinck et al., 2009). Validation on TCGA was performed on the original, unaltered data without class balancing. Sample types were identified using short letter codes and labeled as Tumor or Normal. We applied a variance stabilizing transformation (VST) (Kelmansky et al., 2013) with DESeq2 (Love et al., 2014) to normalize the count data.
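Labeling samples as Tumor or Normal can be sketched with the TCGA sample-type conventions. In TCGA barcodes the fourth field starts with a two-digit sample-type code (01 to 09 for tumor, 10 to 19 for normal), and the matching short letter codes include TP, TR, and TM for tumor types and NT for solid tissue normal. The helper names and the example barcodes are hypothetical, and the study itself worked from the short letter codes in R.

```python
# Short letter codes mapped to the two classes used in this study.
SHORT_CODE_TO_LABEL = {"TP": "Tumor", "TR": "Tumor", "TM": "Tumor", "NT": "Normal"}


def label_from_short_code(code):
    """Map a TCGA short letter code to Tumor/Normal; anything else is 'Other'."""
    return SHORT_CODE_TO_LABEL.get(code, "Other")


def label_from_barcode(barcode):
    """Derive the class from the two-digit sample-type code in a TCGA barcode."""
    sample_field = barcode.split("-")[3]  # e.g. "01A": sample type plus vial
    code = int(sample_field[:2])
    if 1 <= code <= 9:
        return "Tumor"
    if 10 <= code <= 19:
        return "Normal"
    return "Other"


# Made-up barcodes in the standard layout.
example_labels = [
    label_from_barcode("TCGA-XX-0000-01A-01R-0000-07"),
    label_from_barcode("TCGA-XX-0000-11A-01R-0000-07"),
]
```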
2.11 Common features across all models
To analyze intersections of the top 20 features extracted by the five ML and two deep learning models, we used UpSet analysis (Lex et al., 2014), because intersections of this many sets cannot be visualized with classic Euler or Venn diagrams. UpSet can analyze 2^12, i.e., 4096, intersections.
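The set logic that an UpSet plot visualizes can be sketched with plain Python sets. The model names match the study, but the gene lists are placeholders and the helper names are hypothetical. The sketch computes the genes shared by all models and, for each combination of models, the genes found in exactly that combination.

```python
def common_features(top_features_by_model):
    """Genes that appear in the top list of every model."""
    sets = [set(genes) for genes in top_features_by_model.values()]
    return set.intersection(*sets)


def exclusive_intersections(top_features_by_model):
    """Map each combination of models to the genes found in exactly those models."""
    model_names = sorted(top_features_by_model)
    all_genes = set().union(*top_features_by_model.values())
    result = {}
    for gene in all_genes:
        members = tuple(m for m in model_names if gene in top_features_by_model[m])
        result.setdefault(members, set()).add(gene)
    return result


# Placeholder top-feature lists for three of the models.
top_features = {
    "RF": ["G1", "G2", "G3"],
    "SVM": ["G2", "G3", "G4"],
    "MLP": ["G3", "G2", "G5"],
}
shared = common_features(top_features)
by_combination = exclusive_intersections(top_features)
```

Each key of `by_combination` corresponds to one column of dots in an UpSet matrix, and the size of its gene set corresponds to the height of the bar above it.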
2.12 Survival analysis
The Gene Expression Profiling Interactive Analysis 2 (GEPIA2) web server was used to assess the prognostic significance of the genes retained after the final stage of data filtering (Tang et al., 2019). The platform’s default settings were used to examine the prognostic significance of the genes in the TCGA colon adenocarcinoma (COAD) dataset. Features that were retrieved by the artificial intelligence (AI) models and had prognostic value (log-rank test P-value < 0.05) for either disease-free survival (DFS) or overall survival (OS) were chosen as COAD prognostic genes. Additionally, GEPIA2 was used to compare expression levels between tumor and normal tissue samples and to assess the dysregulation of the COAD prognostic genes.
3. Results
Gene expression datasets from multiple technologies and platforms were merged, as stated in the Methods section. This integration increased the number of samples available for feature selection and classification, boosting the robustness and statistical reliability of the results. Moreover, drawing information from numerous sources ensures that findings are not driven by a particular technique. The incorporation of RNA-Seq data enlarged the dynamic range of gene expression, increasing the precision of the analysis, while the microarray datasets contributed most of the overall sample size.
Data preparation consisted of a series of pre-processing steps. First, a base-2 logarithmic transformation was applied to each batch to put gene expression levels on a common scale (Fig. 1a). Second, 16-bit depth homogenization was applied to datasets GSE50760 and GSE142279 to make bit depth uniform (Fig. 1b). The genes shared by all samples across the series (batches) were then identified and selected. At this step, batch effect correction is necessary because samples from different datasets introduce systematic variation. The ComBat method from the sva package removes batch effects and ensures a uniform distribution of samples across all batches (Fig. 1c). The final step is inter-array normalization, which applies an identical empirical distribution to each sample through quantile normalization (Fig. 1d). The maximum and minimum expression values across all features and samples in our dataset were 10.07 and 2.25, respectively.

Fig. 1. Steps involved in gene expression integration. (a) Logarithmic transformation, (b) 16-bit depth homogenization, (c) batch effect correction with ComBat from the sva package, (d) between-array normalization with the normalizeBetweenArrays function from the limma package.
3.1 Feature selection
The curse of dimensionality was addressed through gene selection, since the expression data contain only n samples but thousands of p genes. We reduced the dimension by removing non-essential genes and retaining the essential ones with LASSO regression as the feature selection method. Before feature selection, our integrated normalized dataset contained 7880 features (genes) and 416 samples, with 213 Normal samples and 203 Tumor samples. LASSO regression screening identified a total of 42 genes characteristic of CRC (Fig. 2a). The best lambda for our model was 0.005, and the corresponding mean squared error (MSE) was 0.234 (Fig. 2b) (Table S1).

Fig. 2. LASSO model fitting and selection of the optimal regularization parameter. (a) Coefficient profiles plotted as a function of the logarithm of the regularization parameter (log λ). Each curve represents the trajectory of a predictor’s coefficient across varying λ values. The first vertical dashed line corresponds to the λ value that minimizes the cross-validation error, while the second represents the largest λ within one standard error of the minimum, offering a more regularized and parsimonious model. (b) Ten-fold cross-validation curve showing binomial deviance (y-axis) versus log(λ) (x-axis). Red dots represent the mean cross-validation error, with error bars denoting ±1 standard error. The two vertical dashed lines correspond to the same λ values as in (a), guiding model sparsity and selection.
3.2 GO and KEGG enrichment analysis of core genes
The biological functions and associated pathways of the genes identified via LASSO regression were determined by GO and KEGG enrichment analyses. The GO analysis was divided into three categories: biological processes (BP), cellular components (CC), and molecular functions (MF). The top enriched terms from each category are summarized in Table 2. In the BP category, core genes were primarily associated with chemotaxis, lipid transport, and general transport processes. For CC, the most significant term was related to the secretion of proteins into the extracellular environment. In the MF category, the top enriched terms included heparin binding, cytokine activity, and monooxygenase activity.
| Category | GO term / KEGG pathway | P-value | Gene count |
|---|---|---|---|
| GO biological processes (BP) | Chemotaxis | 2.9E-2 | 3 |
| GO biological processes (BP) | Lipid transport | 2.3E-32 | 3 |
| GO biological processes (BP) | Transport | 2.7E-17 | 10 |
| GO biological processes (BP) | Inflammatory response | 1.4E-12 | 3 |
| GO biological processes (BP) | Osteogenesis | 1.4E-9 | 2 |
| CC | Secreted | 2.4E-1 | 12 |
| MF | Heparin-binding | 6.2E-2 | 4 |
| MF | Cytokine | 2.2E-1 | 4 |
| MF | Monooxygenase | 2.3E-1 | 3 |
| MF | Translocase | 2.3E-1 | 3 |
| MF | Serine protease | 3.4E-1 | 3 |
| KEGG pathway disease terms | Folate transport and metabolism | 1.8E-1 | 3 |
| KEGG pathway disease terms | ABC transporters | 1.9E-1 | 3 |
| KEGG pathway disease terms | Viral protein interaction with cytokine and cytokine receptor | 5.7E-1 | 3 |
| KEGG pathway disease terms | Cytokine-cytokine receptor interaction | 6.1E-1 | 4 |
Along with the GO terms, a total of four significant KEGG pathways (p-value < 0.05) were identified. The most notable pathways were cytokine-cytokine receptor interaction, viral protein interaction with cytokine and cytokine receptor, ABC transporters, and folate transport and metabolism.
3.3 Classification results through traditional machine learning models
The features identified by the LASSO model were used as input to several traditional ML models to assess, based on classification accuracy, how well these genes perform when new samples are presented. We used five classifiers: RF, SVM, XGBoost, Naïve Bayes, and kNN. The optimal hyperparameters were: mtry = 1 for RF; kernel = radial, cost = 0.50, and sigma = 0.03 (with 135 support vectors) for SVM; nrounds = 500, max_depth = 3, eta = 0.05, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1, and subsample = 0.5 for XGBoost; laplace = 0, usekernel = TRUE, and adjust = 1 for Naïve Bayes; and k = 11 for kNN. The ML models were evaluated on accuracy, precision, recall, and AUC. Random Forest and XGBoost had the highest diagnostic accuracy (98.78%), followed by kNN (97.56%), SVM (96.56%), and Naïve Bayes (95.36%). Table 3 displays the validation (10-fold cross-validation on the training dataset) and test results obtained with these five algorithms.
| Classifier | Accuracy (%) | Sensitivity (%) | Specificity (%) | F1 Score (%) | AUC (%) |
|---|---|---|---|---|---|
| RF | 98.78 | 97.50 | 100.0 | 98.73 | 99.90 |
| SVM | 96.56 | 97.60 | 97.50 | 97.62 | 97.56 |
| XGBoost | 98.78 | 97.50 | 100.0 | 98.73 | 98.87 |
| Naïve Bayes | 95.36 | 97.50 | 97.62 | 97.50 | 99.76 |
| kNN | 97.56 | 97.50 | 100.0 | 98.69 | 98.87 |
| 1D-CNN | 98.81 | 97.67 | 100.0 | 98.79 | 99.21 |
| MLP | 97.62 | 97.67 | 97.56 | 97.67 | 99.32 |
The AUC summarizes the performance of a binary classification model. The area under the receiver operating characteristic curve (ROC-AUC) exceeded 90% for all models, indicating excellent discrimination (Fig. 3). Each model extracts important features (genes) and ranks them by importance, as shown graphically in Fig. 4. For SVM and k-NN, we used recursive feature elimination (RFE) to select important features. RFE is a wrapper-based feature selection method that begins by considering all features. It repeatedly fits the model, ranks the features according to their importance, and prunes away the least significant ones. This process continues until a preset number of features remains or until model performance no longer improves. The RF model used the Mean Decrease in Gini (Gini importance) method, which measures how much a feature contributes to reducing impurity (Gini index) across all decision trees in the forest. XGBoost used gain-based feature importance, which measures how much a feature improves the purity of splits in the decision trees. Naïve Bayes used the mean difference in conditional probabilities, which relies on the assumption of conditional independence: each feature contributes independently to classification according to how different its distributions (mean values) are across classes.
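The impurity reduction behind Gini importance can be illustrated for a single split. This is a minimal sketch with hypothetical expression values and a hypothetical threshold. A random forest accumulates such decreases over every split that uses a given feature and averages them across trees.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(values, labels, threshold):
    """Impurity of the parent node minus the size-weighted impurity of the two
    child nodes produced by splitting at `threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Hypothetical expression of one gene in four samples.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["Normal", "Normal", "Tumor", "Tumor"]
perfect_split = gini_decrease(values, labels, threshold=2.5)
weaker_split = gini_decrease(values, labels, threshold=1.5)
```

A threshold that separates the classes cleanly removes all impurity, while a poorly placed threshold removes less, so features that allow clean splits accumulate higher importance.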

Fig. 3. Receiver operating characteristic (ROC) curves of five classifiers (SVM, XGBoost, Naïve Bayes, k-NN, and RF), evaluated using 10-fold cross-validation. The AUC values for all models exceed 0.97, indicating excellent discriminatory power. Random Forest achieved the highest AUC (0.999), followed closely by XGBoost (0.989), Naïve Bayes (0.988), k-NN (0.987), and SVM (0.978).

Fig. 4. Important features extracted from the different models. (a) Top features extracted by the random forest model using the Gini importance method. (b) Top features extracted by the SVM model using recursive feature elimination. (c) Top features extracted by the Naïve Bayes model using mean differences in conditional probabilities. (d) Top features extracted by the XGBoost model using gain-based feature importance. (e) Top features extracted by the kNN model using recursive feature elimination. (f) Top features extracted by the 1D-CNN using the absolute weights of the first convolutional layer. The importance score for each gene (feature) was computed as the sum of the absolute values of its weights across all convolutional filters. (g) The most significant genes extracted by the MLP.
The RF, XGBoost, and Naïve Bayes models each ranked all 42 genes as important features, whereas the SVM model retained 15 genes and the k-NN model retained 33 genes.
Comparisons across the five classifiers showed small pairwise differences, none of which was statistically significant after Holm correction (all adjusted DeLong p≥0.99 and all adjusted McNemar p≥0.99). The largest absolute AUROC difference was approximately 0.010, and the largest absolute accuracy and F1 differences were approximately 0.024. Point estimates suggested that Naïve Bayes and random forest had slightly higher AUROCs than the SVM, and that XGBoost and k-NN had slightly lower AUROCs than the SVM, but these differences were not statistically significant (Table S2).
3.4 Classification results through deep learning models
We also employed two deep learning neural network models, an MLP and a 1D-CNN, to classify the gene expression data. When trained on the LASSO-selected features, the 1D-CNN and MLP models achieved accuracies of 98.81% and 97.62%, respectively (Fig. 5). The 1D-CNN achieved a 98.79% F1 score, 97.67% sensitivity, 100% specificity, and 99.21% AUC. Similarly, the MLP achieved a 97.67% F1 score, 97.67% sensitivity, 97.56% specificity, and 99.32% AUC (Table 3).
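How accuracy, sensitivity, specificity, and F1 follow from a confusion matrix can be shown with a short sketch. The counts below are hypothetical and are not reported in the study. They were chosen only to show how a single misclassified tumor sample in a test set of this approximate size moves each metric.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the four reported metrics from confusion-matrix counts.

    Tumor is treated as the positive class. Each ratio falls back to 0.0
    when its denominator is zero.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),       # recall on tumor samples
        "specificity": ratio(tn, tn + fp),       # recall on normal samples
        "f1": ratio(2 * tp, 2 * tp + fp + fn),   # harmonic mean of precision and recall
    }


# Hypothetical test set: 43 tumor samples (one missed) and 41 normal samples.
metrics = classification_metrics(tp=42, fp=0, tn=41, fn=1)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
```

Because accuracy is a weighted average of sensitivity and specificity, it always lies between the two, which is a quick consistency check for any row of a results table.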

- Training and validation performance of (a) 1D-CNN and (b) MLP models. Both models show decreasing training loss and stable validation loss, indicating effective learning and convergence. Accuracy curves reveal strong generalization. The use of LASSO regularization promotes model sparsity and may enhance feature selection without compromising performance.
3.5 Validation of models
The diagnostic validity of the seven models (five traditional machine learning and two deep learning) was assessed using an external dataset retrieved from the TCGA database. The TCGA colon adenocarcinoma (COAD) dataset contains 481 tumor samples and 43 normal samples. The DESeq2 variance stabilizing transformation was applied to the raw counts to normalize the gene expression data. All models achieved an accuracy greater than 90%, supporting their reliability (Table S2).
3.6 UpSet analysis to identify overlapping genes
We performed UpSet analysis to visualize the genes commonly selected by the five ML models and two deep learning models. The top features were selected from each model. The UpSet analysis revealed seven genes, carbonic anhydrase 7 (CA7), ATP binding cassette subfamily A member 8 (ABCA8), somatostatin (SST), myomesin 1 (MYOM1), CC motif chemokine ligand 23 (CCL23), procollagen C-endopeptidase enhancer 2 (PCOLCE2), and CXC motif chemokine ligand 10 (CXCL10), as overlapping features present across all models (Fig. 6).
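The core of this overlap step is a set intersection across the per-model feature lists. A minimal sketch is given below, with placeholder model names and gene labels; the function name is hypothetical and the toy sets do not reflect the actual selections.

```python
def shared_features(selected):
    """Return the genes chosen by every model.

    selected maps a model name to an iterable of gene symbols.
    """
    gene_sets = [set(genes) for genes in selected.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


# Placeholder selections for three hypothetical models.
selected = {
    "model_1": {"GENE_A", "GENE_B", "GENE_C"},
    "model_2": {"GENE_A", "GENE_C", "GENE_D"},
    "model_3": {"GENE_C", "GENE_A", "GENE_E"},
}
common = shared_features(selected)
```

An UpSet plot additionally reports the size of every other combination of models, whereas this sketch returns only the intersection shared by all of them.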

- UpSet plot illustrates the overlap of selected features among five classification algorithms. The plot shows the intersection sizes of features selected by different classifiers. Vertical bars represent the number of shared features across different combinations of algorithms, while the matrix below indicates which algorithms contribute to each intersection. The horizontal bars on the left indicate the total number of features selected by each algorithm.
3.7 Prognostic analysis of overlapping features
We compared the expression levels of these hub genes in the COAD dataset using the GEPIA2 database, which included 275 tumor and 349 normal tissues. Expression was significantly downregulated in tumors for PCOLCE2, CA7, ABCA8, SST, MYOM1, and CCL23 (Figs. 7a-f) and significantly upregulated for CXCL10 (Fig. 7g).

- Box plots of hub gene expression levels for PCOLCE2, CA7, ABCA8, SST, MYOM1, CCL23, and CXCL10 (a-g) in tumor samples (red) and control samples (grey). Expression in tumor samples differs significantly from that in normal samples. *P < 0.05. (h) Prognostic analysis of PCOLCE2 in colon adenocarcinoma using the GEPIA2 portal for overall survival. The red line denotes samples with high gene expression, whereas the blue line denotes samples with low gene expression. Statistical significance was assessed with the log-rank test (P = 0.0085). The X-axis shows time and the Y-axis shows the percentage of surviving patients.

- The schematic framework of the proposed workflow.
We also performed survival analysis to examine the relationship between the significant features and the progression of colon adenocarcinoma. The key genes were analyzed for overall survival and disease-free survival to identify those with prognostic relevance. PCOLCE2 showed a significant association with disease-free survival, with a log-rank p-value of 0.0085 (Fig. 7h). This suggests that PCOLCE2 expression levels are significantly associated with disease recurrence in colon adenocarcinoma patients. The hazard ratio (HR) of 1.9 indicates that patients with high PCOLCE2 expression have nearly double the risk of disease progression compared to those with low expression. The survival curve for the high-expression group remains consistently below that of the low-expression group, further supporting the observation that elevated PCOLCE2 expression is associated with worse prognosis and a higher likelihood of disease recurrence.
4. Discussion
CRC remains one of the leading causes of cancer-related deaths worldwide. It is a highly malignant disease that shows extensive variability among cases (Bray et al., 2018). Early detection of CRC remains a major clinical obstacle despite significant improvements in diagnostic platforms. Each subtype of CRC develops through distinct genetic and molecular alterations, which underscores the importance of identifying the critical genes and pathways involved in its development. Such molecular insights are crucial for accurate patient classification and the choice of successful treatment approaches. By identifying target genes linked to colorectal cancer, machine learning and deep learning techniques have shown promise in the discovery of novel biomarkers. However, the selection of optimal biomarkers typically depends on their predictive accuracy and stability across datasets (Kourou et al., 2015). Koppad et al. used machine learning models to identify, with high accuracy, 34 genes that can be used as a diagnostic panel for CRC (Koppad et al., 2022). Similarly, another recent study used various ML models to find a collection of possible biomarkers for CRC metastasis (Ahmadieh-Yazdi et al., 2023).
Most previous studies use individual datasets and apply ML models to them separately for biomarker screening (Hossain et al., 2021; Koppad et al., 2022; Vaziri-Moghadam and Foroughmand-Araabi 2024). Although this method helps to identify biomarkers through different bioinformatic analyses, combining gene expression levels from different platforms to identify CRC biomarkers is a distinct approach. Our study integrates five microarray datasets and two RNA-Seq datasets (Galvez et al., 2020). This enhances the robustness and statistical reliability of our findings and increases the number of samples available to our machine learning algorithms. The integration began with the elimination of duplicate data and the removal of anomalous values. The gene expression matrix was built by combining data summarization with normalization and background correction. Gene expression integration relied on logarithmic transformation, 16-bit depth homogenization, and full case selection between batches prior to batch effect correction and inter-array normalization.
We used the LASSO-regularized regression model for feature extraction because it performs well in gene selection and extracts the most important variables. LASSO offers a sparse solution for both variable selection and shrinkage, with the ability to handle multicollinearity (Ranstam and Cook 2018). Using the LASSO model, we identified 38 genes from our integrated dataset for the classification tasks; these are listed in Table S1. LASSO retains only essential features and discards the non-essential ones. The corresponding proteins show enrichment in the extracellular region. They function predominantly through heparin-binding and cytokine activities within extracellular matrix environments. These proteins are involved in biological activities such as chemotaxis, as well as the transport of diverse types of molecules like metabolites, lipids, and proteins. They show the highest level of enrichment in biological pathways that regulate the movement of various molecules, as well as in the interactions between cytokines and their receptors. Cancer metastasis has been linked to changes in the expression levels and functions of transport proteins. These changes, which affect crucial cellular processes such as drug resistance, migration, apoptosis, and proliferation, contribute to the spread of cancer cells. Therefore, understanding these modifications is crucial in developing effective cancer treatments (Huang et al., 2004).
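For reference, the LASSO selection step described at the start of the preceding paragraph can be written as a penalized likelihood problem. The formulation below is a standard one for a binary outcome (tumor versus normal) and is given as an assumption about the general form, not as the authors' exact specification; the symbols $n$ (samples), $p$ (genes), $x_i$ (expression vector), $y_i \in \{0, 1\}$ (class label), and $\lambda$ (penalty strength) are introduced here for illustration.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\, \beta}
\left\{
  -\frac{1}{n} \sum_{i=1}^{n}
  \left[ y_i \left( \beta_0 + x_i^{\top} \beta \right)
       - \log\left( 1 + e^{\beta_0 + x_i^{\top} \beta} \right) \right]
  + \lambda \sum_{j=1}^{p} \left| \beta_j \right|
\right\}
```

The $\ell_1$ penalty drives many coefficients $\beta_j$ exactly to zero, so the genes with non-zero coefficients form the selected feature set; a larger $\lambda$ yields a smaller set.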
The role of ANO7 and SLC38A4 gene expression in the development of CRC and its metastatic process has been well documented. The expression of these genes was significantly increased when compared with healthy tissues (Khosroshahi et al., 2024). The cytokine-cytokine network interaction controls the immune response mechanism, and the tumorigenesis process is highly regulated by the pathway associated with cytokines and their receptors (Uddin et al., 2019). This has been confirmed by a recent study in which elevated levels of interleukin-6 (IL-6) together with its receptor (IL-6R) in colon cancer tissues were linked to disease advancement and poor patient outcomes (Li et al., 2018). A potentially significant role of transforming growth factor-beta (TGF-β), tumor necrosis factor (TNF), and interleukin-10 (IL-10) proteins was seen in CRC development and its progression to other parts of the body (Mirlekar 2022).
The features extracted through LASSO-regularized regression were provided as input data to various classification algorithms. The machine learning classifiers demonstrated different performance levels, with Random Forest achieving the highest accuracy, while among the deep learning neural networks the 1D-CNN achieved the highest accuracy. Given comparable performance statistics across all ML models, we highlight the compact biomarker panel and select the classifier based mostly on interpretability and calibration. Seven genes, CA7, ABCA8, SST, MYOM1, CCL23, PCOLCE2, and CXCL10, appeared as common top features in every model. Survival analysis showed that PCOLCE2 has a significant association with disease-free survival, and its high expression is linked to a worse prognosis and a higher chance of the disease recurring. PCOLCE2, which exists in the extracellular matrix, serves as an important predictive biomarker for various malignancies. Although PCOLCE2’s precise mode of action in CRC is still unknown, it is a crucial gene in the development of thyroid (Luo et al., 2022), prostate (Tan et al., 2023), colorectal (Chen et al., 2019), and head and neck squamous cell carcinoma (Tian et al., 2018). PCOLCE2 overexpression is linked to the development of metastases in ovarian tissue following chemotherapy (Pietilä et al., 2021). It has also been linked with gastric cancer progression and poor survival outcomes in patients (Zhang et al., 2023). The carbonic anhydrase 7 (CA7) gene is expressed in colon tissues and is primarily involved in acid-base homeostasis (Bootorabi et al., 2010). A high expression of CA7 increases the risk of rectal cancer, along with rectal adenocarcinoma and colorectal cancer (Yang et al., 2015; Hua et al., 2017; Zhang et al., 2020). The ABCA8 protein is a human ATP-binding cassette (ABC) transporter with a length of 1,621 amino acids (Albrecht and Viturro 2007).
The decrease in ABCA8 protein expression in CRC tissues is directly linked to tumor stage, lymph node involvement, and tumor cell differentiation status (Yang et al., 2024). Low expression of ABCA8 protein is also observed in various other types of cancer, including tongue squamous cell carcinoma, ovarian cancer, prostate cancer, and hepatocellular carcinoma (Ye et al., 2008; Demidenko et al., 2015; Liu et al., 2015; Cui et al., 2020). The somatostatin (SST) protein is essential for maintaining the dormant state of cancer stem cells (CSCs). Several studies have demonstrated the downregulation of SST expression in CRC tissues compared to control samples (Swatek and Chibowski 2000; Geltz et al., 2024). A combination of high glucagon and gastrin levels with low levels of SST protein leads to a poor prognosis of CRC (Sereti et al., 2002). CCL23 or macrophage inflammatory protein-3 (MIP-3) is primarily responsible for the secretion of eosinophils, neutrophils, and monocytes (Poposki et al., 2011). A recent study illustrated a high level of IFNG, CXCL9, and CCL23 proteins in the plasma of CRC and rectal cancer patients (Miyoshi et al., 2014; Urbiola-Salvador et al., 2023). Interferon-induced protein CXC chemokine ligand 10 (CXCL10) functions to inhibit the growth of tumor cells and their metastasis in various types of cancer. However, a higher CXCL10 expression leads to poor survival and high recurrence rates. In patients with CRC, specific clinicopathological features were found to be correlated with significantly low levels of CXCL10 in their plasma samples (Galamb et al., 2008; Jiang et al., 2010; Hamilton et al., 2014; Bai et al., 2016). The increased CXCL10 levels observed in the tumor context are most reasonably interpreted as reflecting an interferon-stimulated inflamed microenvironment; they do not necessarily indicate effective anti-tumor immunity (e.g., due to immune evasion, checkpoint signaling, or insufficient T-cell infiltration).
This approach to gene expression integration has identified a panel of seven genes that play a key role in promoting the growth and progression of CRC. These genes can be used as biomarkers for the classification and diagnosis of new, uncharacterized samples. If data from CRC cell lines is available, they can also serve as a unique biomarker signature.
5. Conclusions
This study aimed to identify potential biomarkers for CRC by combining data from RNA-Seq and microarray platforms. The NCBI-GEO database was extensively searched to gather CRC data from both platforms. LASSO was used for feature selection to identify the most informative biomarkers while excluding the non-informative features. The chosen genes were used to train two deep learning models (MLP, 1D-CNN) and five machine learning classifiers (SVM, RF, k-NN, Naïve Bayes, XGBoost). The models demonstrated remarkable accuracy, particularly Random Forest (98.78%) and 1D-CNN (98.81%), and were validated on an external dataset from the TCGA database. The study supports the potential of seven key genes that had previously been connected to colorectal cancer (CRC) as accurate biomarkers for the detection of the disease.
Acknowledgment
The authors would like to acknowledge the support provided by King Fahd University of Petroleum & Minerals (KFUPM) and the IRC for Finance & Digital Economy for funding this work through project No INFE2502.
CRediT authorship contribution statement
Saleh Alwahaishi: Data collection, formal analysis and validation, supervision, writing, reviewing and editing. Haseeb Nisar: Investigation, formal analysis, validation, and writing-review.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.
Declaration of generative AI and AI-assisted technologies in the writing process
The authors confirm that there was no use of artificial intelligence (AI)-assisted technology for assisting in the writing or editing of the manuscript, and no images were manipulated using AI.
Supplementary data
Supplementary material to this article can be found online at https://dx.doi.org/10.25259/JKSUS_1217_2025.
References
- Using a machine learning approach for screening metastatic biomarkers in colorectal cancer and predictive modeling with experimental validation. Sci Rep. 2023;13:19426. https://doi.org/10.1038/s41598-023-46633-8
- A systems biology and LASSO-based approach to decipher the transcriptome–interactome signature for predicting non-small cell lung cancer. Biology. 2022;11:1752. https://doi.org/10.3390/biology11121752
- The ABCA subfamily—gene and protein structures, functions and associated hereditary diseases. Pflugers Arch - Eur J Physiol. 2007;453:581-589. https://doi.org/10.1007/s00424-006-0047-8
- Epidemiology of colorectal cancer in Saudi Arabia: A review. Cureus. 2024;16:e64564. https://doi.org/10.7759/cureus.64564
- kerasR: R interface to the Keras deep learning library. JOSS. 2017;2:296. https://doi.org/10.21105/joss.00296
- CXCL10/CXCR3 overexpression as a biomarker of poor prognosis in patients with stage II colorectal cancer. Mol Clin Oncol. 2016;4:23-30. https://doi.org/10.3892/mco.2015.665
- Inflammatory bowel disease biomarkers of human gut microbiota selected via different feature selection methods. PeerJ. 2022;10:e13205. https://doi.org/10.7717/peerj.13205
- A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185-193. https://doi.org/10.1093/bioinformatics/19.2.185
- Analysis of a shortened form of human carbonic anhydrase VII expressed in vitro compared to the full-length enzyme. Biochimie. 2010;92:1072-1080. https://doi.org/10.1016/j.biochi.2010.05.008
- Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS One. 2011;6:e17820. https://doi.org/10.1371/journal.pone.0017820
- Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA. Cancer J Clin. 2018;68:394-424. https://doi.org/10.3322/caac.21492
- Package ‘randomforest’. University of California, Berkeley: Berkeley, CA, USA. 2018;81:1-29. https://www.stat.berkeley.edu/∼breiman/RandomForests/
- A radiomics study to predict invasive pulmonary adenocarcinoma appearing as pure ground-glass nodules. Clin Radiol. 2021;76:143-151. https://doi.org/10.1016/j.crad.2020.10.005
- A framework for oligonucleotide microarray preprocessing. Bioinformatics. 2010;26(19):2363-2367.
- Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling. BMC Bioinformatics. 2017;18:506. https://doi.org/10.1186/s12859-017-1925-0
- Identification of biomarkers associated with diagnosis and prognosis of colorectal cancer patients based on integrated bioinformatics analysis. Gene. 2019;692:119-125. https://doi.org/10.1016/j.gene.2019.01.001
- Xgboost: A scalable tree boosting system. KDD ‘16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016:785-794. https://doi.org/10.1145/2939672.2939785
- Artificial neural network analysis-based immune-related signatures of primary non-response to infliximab in patients with ulcerative colitis. Front Immunol. 2021;12:742080. https://doi.org/10.3389/fimmu.2021.742080
- TCGAbiolinks: An r/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016;44:e71. https://doi.org/10.1093/nar/gkv1507
- ABCA8 is regulated by miR-374b-5p and inhibits proliferation and metastasis of hepatocellular carcinoma through the ERK/ZEB1 pathway. J Exp Clin Cancer Res. 2020;39:90. https://doi.org/10.1186/s13046-020-01591-1
- GEOquery: A bridge between the gene expression omnibus (GEO) and BioConductor. Bioinformatics. 2007;23:1846-1847. https://doi.org/10.1093/bioinformatics/btm254
- Frequent down-regulation of ABC transporter genes in prostate cancer. BMC Cancer. 2015;15:683. https://doi.org/10.1186/s12885-015-1689-8
- Developers, T., 2022. TensorFlow. Zenodo.
- lumi: A pipeline for processing Illumina microarray. Bioinformatics. 2008;24:1547-1548. https://doi.org/10.1093/bioinformatics/btn224
- Mapping identifiers for the integration of genomic datasets with the r/Bioconductor package biomaRt. Nat Protoc. 2009;4:1184-1191. https://doi.org/10.1038/nprot.2009.97
- Discriminative feature of cells characterizes cell populations of interest by a small subset of genes. PLoS Comput Biol. 2021;17:e1009579. https://doi.org/10.1371/journal.pcbi.1009579
- Diagnostic mRNA expression patterns of inflamed, benign, and malignant colorectal biopsy specimen and their correlation with peripheral blood results. Cancer Epidemiol Biomarkers Prev. 2008;17:2835-2845. https://doi.org/10.1158/1055-9965.EPI-08-0231
- Towards improving skin cancer diagnosis by integrating microarray and RNA-Seq datasets. IEEE J Biomed Health Inform. 2020;24:2119-2130. https://doi.org/10.1109/JBHI.2019.2953978
- affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics.. 2004;20:307-315. https://doi.org/10.1093/bioinformatics/btg405
- Differentially expressed somatostatin (SST) and Its receptors (SST1-5) in sporadic colorectal cancer and normal colorectal mucosa. Cancers (Basel). 2024;16:3584. https://doi.org/10.3390/cancers16213584
- Gentleman, R., B. P. Maintainer, D. B. I. Imports Biobase, et al., 2013. Package ‘annotate’.
- Gohlmann, H., Talloen, W., 2009. Gene expression studies using Affymetrix microarrays, Chapman and Hall/CRC.
- KNN model-based approach in classification. In: Lecture notes in computer science, On the move to meaningful internet systems 2003: CoopIS, DOA, and ODBASE. Berlin, Heidelberg: Springer Berlin Heidelberg; p. 986-996. https://doi.org/10.1007/978-3-540-39964-3_62
- Identification of prognostic inflammatory factors in colorectal liver metastases. BMC Cancer. 2014;14:542. https://doi.org/10.1186/1471-2407-14-542
- Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012;13:204-216. https://doi.org/10.1093/biostatistics/kxr054
- Machine learning and network-based models to identify genetic risk factors to the progression and survival of colorectal cancer. Comput Biol Med. 2021;135:104539. https://doi.org/10.1016/j.compbiomed.2021.104539
- Ensembl 2021. Nucleic Acids Res. 2021;49:D884-D891. https://doi.org/10.1093/nar/gkaa942
- Abnormal expression of mRNA, microRNA alteration and aberrant DNA methylation patterns in rectal adenocarcinoma. PLoS One. 2017;12:e0174461. https://doi.org/10.1371/journal.pone.0174461
- Membrane transporters and channels: Role of the transportome in cancer chemosensitivity and chemoresistance. Cancer Res. 2004;64:4294-4301. https://doi.org/10.1158/0008-5472.CAN-03-3884
- Inc, I. a., Illumina: Illumina Gene Expression arrays.
- CXCL10 expression and prognostic significance in stage II and III colorectal cancer. Mol Biol Rep. 2010;37:3029-3036. https://doi.org/10.1007/s11033-009-9873-z
- Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118-127. https://doi.org/10.1093/biostatistics/kxj037
- Deep learning-based, multiclass approach to cancer classification on liquid biopsy data. IEEE J Transl Eng Health Med. 2024;12:306-313. https://doi.org/10.1109/jtehm.2024.3360865
- KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27-30. https://doi.org/10.1093/nar/28.1.27
- A new variance stabilizing transformation for gene expression data analysis. Stat Appl Genet Mol Biol. 2013;12:653-666. https://doi.org/10.1515/sagmb-2012-0030
- Determining expression changes of ANO7 and SLC38A4 membrane transporters in colorectal cancer. Heliyon. 2024;10:e34464. https://doi.org/10.1016/j.heliyon.2024.e34464
- Machine learning-based identification of colon cancer candidate diagnostics genes. Biology (Basel). 2022;11:365. https://doi.org/10.3390/biology11030365
- Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2014;13:8-17. https://doi.org/10.1016/j.csbj.2014.11.005
- Building Predictive Models in R Using the caret Package. Journal of Statistical Software. 2008;28(5):1-26. https://doi.org/10.18637/jss.v028.i05.
- Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11:733-739. https://doi.org/10.1038/nrg2825
- UpSet: Visualization of intersecting sets. IEEE Trans Vis Comput Graph. 2014;20:1983-1992. https://doi.org/10.1109/TVCG.2014.2346248
- Targeting interleukin-6 (IL-6) sensitizes Anti-PD-L1 treatment in a colorectal cancer preclinical model. Med Sci Monit. 2018;24:5501-5508. https://doi.org/10.12659/MSM.907439
- Discovery of microarray-identified genes associated with ovarian cancer progression. Int J Oncol. 2015;46:2467-2478. https://doi.org/10.3892/ijo.2015.2971
- Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. https://doi.org/10.1186/s13059-014-0550-8
- Identification of a four-gene signature for determining the prognosis of papillary thyroid carcinoma by integrated bioinformatics analysis. Int J Gen Med. 2022;15:1147-1160. https://doi.org/10.2147/IJGM.S346058
- Tumor promoting roles of IL-10, TGF-β, IL-4, and IL-35: Its implications in cancer immunotherapy. SAGE Open Med. 2022;10:20503121211069012. https://doi.org/10.1177/20503121211069012
- Expression profiles of 507 proteins from a biotin label-based antibody array in human colorectal cancer. Oncol Rep. 2014;31:1277-1281. https://doi.org/10.3892/or.2013.2935
- Co-evolution of matrisome and adaptive adhesion dynamics drives ovarian cancer chemoresistance. Nat Commun. 2021;12:3904. https://doi.org/10.1038/s41467-021-24009-8
- Multilayer perceptron and neural networks. WSEAS Trans on Circuits Syst. 2009;8:579-588.
- Increased expression of the chemokine CCL23 in eosinophilic chronic rhinosinusitis with nasal polyps. J Allergy Clin Immunol. 2011;128:73-81. https://doi.org/10.1016/j.jaci.2011.03.017
- Random forest. J Insur Med. 2017;47:31-39. https://doi.org/10.17849/insm-47-01-31-39.1
- Robin, X., N. Turck, A. Hainard, et al., 2021. Package ‘pROC’.
- Molecular biomarkers in cancer. Biomolecules. 2022;12:1021. https://doi.org/10.3390/biom12081021
- Immunoelectron study of somatostatin, gastrin and glucagon in human colorectal adenocarcinomas and liver metastases. Anticancer Res. 2002;22:2117-2123.
- RNA-Seq vs dual- and single-channel microarray data: Sensitivity analysis for differential expression and clustering. PLoS One. 2012;7:e50986. https://doi.org/10.1371/journal.pone.0050986
- The BioMart community portal: An innovative alternative to large, centralized data repositories. Nucleic Acids Res. 2015;43:W589-W598. https://doi.org/10.1093/nar/gkv350
- Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA. Cancer J Clin. 2021;71:209-249. https://doi.org/10.3322/caac.21660
- Support vector machine. In: Integrated series in information systems, Machine learning models and algorithms for big data classification; p. 207-235. https://doi.org/10.1007/978-1-4899-7641-3_9
- Endocrine cells in colorectal carcinomas. Immunohistochemical study. Pol J Pathol. 2000;51:127-136.
- Comprehensive analysis of scRNA-Seq and bulk RNA-Seq reveals dynamic changes in the tumor immune microenvironment of bladder cancer and establishes a prognostic model. J Transl Med. 2023;21:223. https://doi.org/10.1186/s12967-023-04056-z
- GEPIA2: An enhanced web server for large-scale expression profiling and interactive analysis. Nucleic Acids Res. 2019;47:W556-W560. https://doi.org/10.1093/nar/gkz430
- Data quality aware analysis of differential expression in RNA-seq with NOISeq r/Bioc package. Nucleic Acids Res. 2015;43:e140. https://doi.org/10.1093/nar/gkv711
- A six-mRNA prognostic model to predict survival in head and neck squamous cell carcinoma. Cancer Manag Res. 2018;11:131-142. https://doi.org/10.2147/CMAR.S185875
- Regression Shrinkage and Selection Via the Lasso. J Royal Stat Soc Ser B: Stat Methodology. 1996;58:267-288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
- Random forest-based modelling to detect biomarkers for prostate cancer progression. Clin Epigenetics. 2019;11:148. https://doi.org/10.1186/s13148-019-0736-8
- Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med. 2021;13:152. https://doi.org/10.1186/s13073-021-00968-x
- Identification of transcriptional markers and microRNA–mRNA regulatory networks in colon cancer by integrative analysis of mRNA and microRNA expression profiles in colon tumor stroma. Cells. 2019;8:1054. https://doi.org/10.3390/cells8091054
- Plasma protein changes reflect colorectal cancer development and associated inflammation. Front Oncol. 2023;13:1158261. https://doi.org/10.3389/fonc.2023.1158261
- Integrating machine learning and bioinformatics approaches for identifying novel diagnostic gene biomarkers in colorectal cancer. Sci Rep. 2024;14:24786. https://doi.org/10.1038/s41598-024-75438-6
- Naïve Bayes. In: Encyclopedia of machine learning. Boston, MA: Springer US; p. 713-714. https://doi.org/10.1007/978-0-387-30164-8_576
- Prognostic value of carbonic anhydrase VII expression in colorectal carcinoma. BMC Cancer. 2015;15:209. https://doi.org/10.1186/s12885-015-1216-y
- Tumour suppressor ABCA8 inhibits malignant progression of colorectal cancer via Wnt/β-catenin pathway. Dig. Liver Dis. 2024;56:880-893. https://doi.org/10.1016/j.dld.2023.10.026
- Transcriptomic dissection of tongue squamous cell carcinoma. BMC Genomics. 2008;9:69. https://doi.org/10.1186/1471-2164-9-69
- Five EMT‐related genes signature predicts overall survival and immune environment in microsatellite instability‐high gastric cancer. Cancer Med. 2023;12:2075-2088. https://doi.org/10.1002/cam4.4975
- Metabolic reprogramming‐associated genes predict overall survival for rectal cancer. J Cellular Molecular Medi. 2020;24:5842-5849. https://doi.org/10.1111/jcmm.15254
