Adversarial validation of causal identifiability assumptions in model-based inference of diseases from multi-level omics and clinical data
* Corresponding author: E-mail address: shafiul.haque@hotmail.com (S Haque)
Abstract
Model-based causal inference techniques are increasingly used to infer disease mechanisms from heterogeneous biomedical data. However, the identifiability assumptions made in such models remain largely unvalidated empirically. We first formulated disease mechanism models as causal Bayesian networks (BNs) learned by integrating genomics, epigenomics (DNA methylation), transcriptomics (RNA-seq), proteomics (Sequential Window Acquisition of All Theoretical Fragment Ion Mass Spectra, SWATH-MS), metabolomics (gas chromatography-mass spectrometry, GC-MS), and clinical phenotype data using constraint-based and score-based structure learning. Next, we developed an adversarial generator network trained to perturb the data by introducing latent confounders or otherwise violating causal sufficiency. The original and perturbed datasets were passed through the disease models to quantify changes in the inferred causal structure. We applied our approach to breast cancer models inferred from The Cancer Genome Atlas (TCGA) database. The adversarial network generated perturbed data that captured latent confounding between genetic mutations and protein expression, as well as feedback loops violating causal sufficiency. Retraining the causal models on the perturbed data altered the inferred causal structures, exposing the limitations of the original identifiability assumptions. Our framework therefore appears to be a suitable tool for empirically evaluating the robustness of model-based disease mechanism inference when core causal assumptions are violated.
Keywords
Bayesian networks
Causal identifiability
Causal inference
Latent confounding
Multi-omics data
1. Introduction
Multi-level omics data at the genetic, epigenetic, transcriptomic, and proteomic levels, in combination with clinical information, are essential for identifying complex causal relationships between the molecular determinants of pathological conditions. This integration holds considerable promise because it can aid in the diagnosis, prediction, and personalized management of illnesses with unprecedented accuracy. However, the intricate networks of causal interactions that drive and regulate biological systems pose significant challenges. Computational model-based inference may provide a systematic way to explore the complex networks of causes and effects that drive disease pathogenesis (Huynh et al., 2021). The precision, accuracy, and reliability of such models nevertheless depend critically on the validity of a set of essential assumptions. Indeed, causal disease models constructed from multi-omics and clinical data often rely on identifiability assumptions of no hidden confounding, a fixed temporal ordering, and largely linear relationships. These assumptions have rarely been empirically stress-tested on real disease datasets. Without proper validation, causal disease models may yield compromised inferences, which in turn may lead to inaccuracies in downstream analyses such as biomarker discovery and intervention development.
As indicated, the accuracy of the causal links inferred for a specific pathology depends on the identifiability assumptions. These assumptions vary in nature and range from the absence of unobserved confounding factors to a strict ordering of events and linear relationships between them (Du & Zhao, 2023). Such assumptions underlie most popular statistical and machine learning approaches, such as structural equation models (SEMs), Bayesian networks (BNs), and causal inference frameworks like the do-calculus. The issue is that the complexity of biological data often contradicts the assumptions of standard statistical models (Duan et al., 2024). Without explicit evaluation of the causal assumptions in real multi-omics and clinical datasets, conclusions regarding causal links may be misleading.
Therefore, there is a need for rigorous validation frameworks that can confirm the validity of the underlying causal assumptions. In this study, we propose an adversarial validation framework that can introduce perturbations in data to violate specific causality and identifiability assumptions. This strategy may allow quantification of the extent of changes in the causal networks, thereby ensuring robustness of the disease mechanism inference. Building computational models with the ability to capture associations between the multi-level biological factors and clinical variables is an obvious first step. Subsequently, a new machine learning method based on adversarial validation is introduced, which can check the assumptions of causal identifiability of multi-level omics and clinical data-based disease models. Our adversarial model is based on identifying inconsistencies between a training and a validation dataset. For our purposes, multi-level omics and clinical data served as the training dataset, while the validation dataset consisted of artificial data created by varying the causal conditions. The ability of the adversarial model to differentiate between the training and validation datasets allowed evaluation of the validity of the causal identifiability assumptions in real datasets. It also aided in identifying stability parameters and any confounding or bias in the inferred disease mechanism networks (Gao et al., 2023).
The outline of this study is as follows. First, we delineate the model-based approaches, namely structural equation models (SEMs), BNs, and other causal inference methods, used for the integration of multi-level omics and clinical data, highlighting the challenges in making causal identifiability assumptions when working with such data. The risks associated with inappropriate assumptions and their effects on deducing reliable causal inferences are also explored. Next, we detail the construction and training of an adversarial validation model designed to test causal hypotheses using synthetic data under varying causal situations as validation datasets. Regarding the specificity of the proposed adversarial validation framework (Fig. 1), this study primarily demonstrated its use on cancer-focused datasets, including breast cancer models derived from The Cancer Genome Atlas (TCGA) and lung cancer (A549) cell line experiments for wet-lab validation. Our workflow used widely adopted software packages to ensure reproducibility: GATK and snpEff for genomic variant calling and annotation, Minfi for methylation analysis, STAR and Salmon for transcriptomic alignment and quantification, Proteome Discoverer with SEQUEST-HT for proteomics, and XCMS with METLIN for metabolomics feature detection and identification. Cancer was selected because of its well-characterized, data-rich multi-omics cohorts, enabling a robust demonstration of the approach.

Fig. 1. Overall workflow for adversarial validation of causal models. Multi-omics pipeline integrating data, causal modeling, robustness testing, wet-lab validations, and refinement for iterative biological discovery.
It should be noted here that the adversarial validation pipeline is not cancer-specific; rather, it is a generalizable framework that can be applied to other chronic and complex diseases whenever sufficiently profiled multi-omics and clinical data are available. Indeed, adversarial validation analyses can aid in better alignments of multi-omics and clinical variables with the causal identifiability requirements, increasing the reliability of inferred causal networks (Garcia-Dominguez et al., 2023). In the long run, our efforts may help us learn more about complicated illnesses, find new treatment targets more quickly, and make strides toward data-driven, individualized healthcare (Huang et al., 2022). However, in spite of the implications of the adversarial validation technique for future research directions in precision medicine, significant challenges remain, which are briefly discussed.
2. Materials and Methods
2.1 Data collection and preprocessing
We utilized multi-omics and clinical data from three public sources: TCGA (Tomczak et al., 2015), the Genotype-Tissue Expression Project consortium (Keen & Moore, 2015), and the Shifa Hospital Biobank. During preprocessing, measurements were cleaned, technical biases were normalized, missing values were imputed, and omics features were streamlined. Samples with over 10% missing values were filtered out, and variables with greater than 50% zeros were removed to alleviate sparsity issues. K-nearest neighbor imputation, combined with correlation pruning, was used to fill in missing data points without overfitting (Li & Parker, 2014). The processed datasets were integrated into a single normalized matrix, providing informative multi-omics data for robust causal model inference and interpretation. We restricted literature-derived benchmark statistics to studies published in 2022–2023 to (i) ensure comparable assay generations and annotation standards, (ii) use datasets fully curated by our analysis cutoff, and (iii) avoid partially processed 2024–2025 releases that lacked harmonized metadata at the time of model training. This improves comparability without limiting the generality of the framework.
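The filtering and imputation steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names `filter_matrix` and `knn_impute`, the choice of k = 5, and the distance definition (root mean squared difference over features observed in both samples) are assumptions made for this example, and correlation pruning is not shown.

```python
import numpy as np

def filter_matrix(X, max_missing_per_sample=0.10, max_zero_per_feature=0.50):
    """Drop samples (rows) with too many NaNs, then features (columns) with too many zeros."""
    X = np.asarray(X, dtype=float)
    keep_rows = np.isnan(X).mean(axis=1) <= max_missing_per_sample
    X = X[keep_rows]
    keep_cols = (X == 0).mean(axis=0) <= max_zero_per_feature
    return X[:, keep_cols], keep_rows, keep_cols

def knn_impute(X, k=5):
    """Fill each missing entry with the mean of that feature over the k nearest donor samples."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        obs = ~missing[i]
        # Distance to every other sample, using only features observed in both samples.
        diff = X[:, obs] - X[i, obs]
        n_shared = (~np.isnan(diff)).sum(axis=1)
        dist = np.full(X.shape[0], np.inf)
        ok = n_shared > 0
        dist[ok] = np.sqrt(np.nansum(diff[ok] ** 2, axis=1) / n_shared[ok])
        dist[i] = np.inf  # a sample cannot be its own donor
        order = np.argsort(dist)
        for j in np.where(missing[i])[0]:
            donors = [r for r in order if np.isfinite(dist[r]) and not missing[r, j]][:k]
            if donors:  # entries without any usable donor stay NaN
                filled[i, j] = X[donors, j].mean()
    return filled
```

In practice, a maintained implementation such as a library KNN imputer would normally be preferred; the hand-written version is shown only to make the logic explicit.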
2.2 Model-based inference methods
SEMs were used to infer causal pathways by combining multi-omics and phenotypic data. An SEM is a statistical framework that expresses interactions between variables, with path diagrams representing hypothesized causal links. BNs were used as a probabilistic extension of SEMs, with structures learned by constraint-based causal structure induction using the fast causal inference (FCI) and PC algorithms (Chobtham & Constantinou, 2020). Both SEMs and FCI-based BNs were used to estimate causal effects and to identify adjustment sets that reduce confounding. The multi-model graph learning framework LASSO (MMGL) (Ektefaie et al., 2023) was used to combine the genomic, epigenomic, transcriptomic, and clinical domains into a single causal inference framework. For the estimation of layer-specific causal pathways, the features of each omics layer were aggregated at the gene, protein, and metabolite levels. Before the models were merged, gene expression (RNA-seq) and protein abundance (Sequential Window Acquisition of All Theoretical Fragment Ion Mass Spectra, SWATH-MS) data were standardized to Z-scores, while metabolite concentrations (gas chromatography-mass spectrometry, GC-MS) were log-transformed and mean-centered.
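The layer-wise scaling described above (Z-scores for expression and protein abundance, log transform plus mean-centering for metabolites) might look like the sketch below. The base-2 logarithm, the pseudocount of 1, and the use of the sample standard deviation are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def zscore(X):
    """Standardize each feature (column) to zero mean and unit sample variance."""
    X = np.asarray(X, dtype=float)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # constant features map to zero instead of dividing by zero
    return (X - X.mean(axis=0)) / sd

def log_center(X, pseudocount=1.0):
    """log2-transform concentrations (pseudocount is an assumption), then mean-center each feature."""
    logged = np.log2(np.asarray(X, dtype=float) + pseudocount)
    return logged - logged.mean(axis=0)
```

Scaling each layer separately before merging keeps a high-variance layer from dominating edge scores purely because of its measurement units.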
2.3 Causal identifiability assumptions
Unobserved confounding factors, together with assumptions of a deterministic cause-effect temporal sequence and of linear causal relationships, are the major sources of distortion in the inferences drawn from disease mechanism models. To address this, we systematically tested the conditional independence of phenotypes, conducted partial correlation tests and sensitivity analyses, ranked the variables by importance, and estimated causal effects using both linear and non-linear structural equation specifications. Any assumption that contradicted qualitative molecular domain knowledge was flagged as unreliable for detecting real causal links. Exploratory experiments were also used to simulate realistic breaches of causal identifiability, in order to assess how model conclusions change and how strongly they depend on assumptions that may not always hold.
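As an illustration of the partial correlation tests mentioned above, the sketch below computes the correlation between two variables after linearly regressing out a conditioning set. The function name `partial_corr` and the residual-based formulation are choices made for this example, not necessarily the procedure used in the study.

```python
import numpy as np

def partial_corr(x, y, Z):
    """Correlation between x and y after regressing the columns of Z (plus an intercept) out of both."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(x)), np.asarray(Z, dtype=float)])
    # Residuals of x and y given the conditioning set.
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))
```

A strong marginal correlation that vanishes once a candidate confounder is conditioned on is the pattern expected under confounding; if the confounder is unmeasured, that check cannot be performed, which is why the no-hidden-confounding assumption needs separate stress-testing.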
2.4 Adversarial validation framework
We developed an adversarial validation framework to test the resilience of causal models to violated assumptions. The framework used an intervention engine to computationally perturb the joint distribution of a system’s variables, thereby attacking the underlying structural assumptions. Three types of attacks were constructed: confounding attacks, temporal attacks, and structural attacks. The intensity of the attacks could be adjusted from mild to severe. One hundred unique datasets were produced for each attack type, and the causal network and model parameters were relearned on each of these datasets. To ensure omics-layer parity during adversarial perturbations, individual datasets were attacked independently: gene expression matrices, protein abundance tables, and metabolite profiles were each permuted within matched sample identifiers. SEM, FCI, PC, and MMGL were each applied to the attacked datasets. The structural Hamming distance between adjacency matrices was used to measure consistency between the original and attacked models. The biological integrity and robustness of the attacked graphs were examined to determine whether the observed inconsistencies were due to the enforced breaches.
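The structural Hamming distance used to compare the original and attacked graphs can be computed from two adjacency matrices as in the sketch below. It assumes directed graphs encoded so that entry (i, j) is nonzero when there is an edge from node i to node j, and it counts a reversed edge as a single difference; other conventions count a reversal as two, and the text does not say which was used.

```python
import numpy as np

def structural_hamming_distance(A, B):
    """Number of node pairs whose edge status (absent, i->j, j->i, both) differs between A and B."""
    A = np.asarray(A, dtype=bool)
    B = np.asarray(B, dtype=bool)
    if A.shape != B.shape or A.shape[0] != A.shape[1]:
        raise ValueError("A and B must be square adjacency matrices of the same size")
    diff = A != B
    # A pair differs if either direction differs; look at each unordered pair once.
    pair_differs = diff | diff.T
    return int(np.triu(pair_differs, k=1).sum())
```

Averaging this distance over the one hundred relearned graphs per attack type would give a per-attack summary of how far the inferred structure moves.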
2.5 Experimental design
We undertook preliminary experimental interventions to verify the most important results from our in silico adversarial validations. In our multi-omics models of lung cancer, we focused on genetic and epigenetic alterations that could be interpreted as violations of causal assumptions. To mount confounding attacks on model systems, we chose genes predicted to act as hidden confounders. The expression of three candidate confounder genes was altered in A549 lung cancer cells by CRISPR activation or inhibition, and RNA-seq analysis compared the resulting transcriptomes with those of unaltered samples. To test breaches of the temporal assumption, we compared sequential vs. simultaneous treatment: the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) was used to examine changes in the epigenome of A549 cells treated with the demethylating drug 5-azacytidine alone or with the histone modifier trichostatin A (Guidi & Fava, 2021). To test non-linearity, cells were exposed to increasing concentrations of the environmental carcinogen benzo(a)pyrene (BaP) (Ramesh & Knuckles, 2006). Genome-wide methyl-capture sequencing was used to identify genes involved in BaP metabolism whose methylation changed in a concentration-dependent manner. Finally, we used targeted metabolomics to measure alterations in 53 onco-metabolites, providing a more global assessment of the effects. This multi-omics profiling captured the downstream molecular impacts of the experimental breaches of causal assumptions. For proteomic and metabolomic validation, fold changes predicted in silico were correlated with experimentally observed changes. The level of agreement between the two was evaluated using Pearson’s correlation coefficient together with Bland-Altman bias metrics, with any disagreement indicating potentially unexplored features.
In parallel, quantitative proteomic profiling of A549 cells using SWATH-MS allowed assessment of concordant changes in protein abundances following gene perturbations. These differential protein expression patterns provided direct validation for computationally inferred proteomic nodes. Overall, combining in silico and experimental validations enabled a virtuous cycle for iteratively refining the disease models.
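The agreement analysis described in this section (Pearson correlation together with Bland-Altman bias) can be sketched as below for paired predicted and observed fold changes. The function name and the conventional 1.96 standard deviation limits of agreement are assumptions of this example rather than details given in the text.

```python
import numpy as np

def agreement_metrics(predicted, observed):
    """Pearson r plus Bland-Altman bias and 95% limits of agreement for paired values."""
    p = np.asarray(predicted, dtype=float)
    o = np.asarray(observed, dtype=float)
    d = p - o                      # per-feature disagreement
    bias = d.mean()                # systematic over- or under-prediction
    sd = d.std(ddof=1)
    return {
        "pearson_r": float(np.corrcoef(p, o)[0, 1]),
        "bias": float(bias),
        "loa_lower": float(bias - 1.96 * sd),
        "loa_upper": float(bias + 1.96 * sd),
    }
```

The two measures are complementary: correlation can be high even when predictions are systematically shifted, whereas the bias and limits of agreement expose such shifts directly.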
2.6 Validation analysis
The results of the adversarial simulations on the causal disease inference models were verified using computational experiments. We measured changes in model structure and parameters following each form of in silico perturbation. A high structural Hamming distance between the unmodified and perturbed adjacency matrices was taken as an indicator of vulnerability to assumption violations. Changes in Pearson correlations and causal effect estimates further illustrated the effects of the perturbations. Downstream validations were prioritized based on inconsistencies that pointed to genuinely unknown components. Furthermore, alternative attack strategies based on domain knowledge sometimes disrupted the models more severely, indicating where the models could be improved. Experimentally, the impacts of causal assumption violations were investigated using multi-omics assays. After genetic perturbations, RNA-seq profiles were compared with in silico predictions to assess the hidden confounding attacks. Epigenomic profiles after sequential vs. simultaneous treatment were used to assess the temporal-ordering assumption, and concentration-dependent changes in methylation were examined for consistency with a non-linear relationship. The pathophysiological relevance of the model manipulations was captured by changes in targeted metabolomics and by cancer-related processes identified through pathway enrichment analyses. Statistical tests were used to compare the number of altered molecules between the unperturbed and perturbed experimental groups. Finally, the in silico and experimental findings were compared directly. Some computational conclusions agreed with the wet-lab results, but there were also significant discrepancies that underscored the need for alternative modeling strategies.
The iterative cycle between computational and experimental validations served to validate and improve our disease models, allowing a comprehensive assessment of the causal hypotheses.
3. Results
3.1 Data description from multi-omics data integration studies
We conducted a comprehensive analysis using multi-level omics and clinical data from three different cohorts with a total of 1500 participants (Fig. 2a). We focused on studies published in 2022–2023 because these cohorts had completed curation and harmonization at the time of analysis. Including only finalized datasets minimized variability from ongoing data releases and ensured consistent quality across all omics layers. Transcriptomics and epigenomics were the most frequently used omics layers overall, with slightly higher representation in cancer studies. Genomics and proteomics showed moderate usage, whereas metabolomics and metagenomics were underrepresented, highlighting areas where additional profiling could add value (Fig. 2b). The abundance of available data allowed us to probe the interplay between genetic changes and clinical manifestations. Genomic information included DNA variants identified by whole-genome sequencing of the 1500 subjects. Overall, we found 15 million single nucleotide polymorphisms (SNPs) and 500,000 insertions and deletions (INDELs). This information gave a comprehensive view of genetic variation within the study population, allowing us to evaluate the possible significance of individual genetic variants in disease development (Fig. 2c).

Fig. 2. (a) Participant attrition. 1750 initially collected, reduced to 1500 after removing missing omics, QC failures, and incomplete clinical data, ensuring harmonized cohorts. (b) Omics usage frequency. Transcriptomics and epigenomics dominate; genomics and proteomics moderate; metabolomics and metagenomics are underrepresented, highlighting future profiling opportunities. (c) Omics features. 15M SNPs, 500k indels, 850k CpGs, 20k RNAs, 3k proteins, 200 metabolites, reflecting multi-omic feature diversity across platforms. (d) Clinical traits. Over 100 traits across demographics, medical history, lifestyle, illness status, and medications provide a rich phenotypic context.
Because epigenetic mechanisms have considerable influence on gene regulation during disease processes, we also used DNA methylation profiles generated with the Illumina EPIC BeadChip assay. Methylation levels at over 850,000 CpG sites were analyzed in all 1500 samples. Further, whole-transcriptome profiles were generated from RNA-seq data of blood samples from all 1500 patients (Kang et al., 2023). We analyzed gene expression variation related to illness characteristics by quantifying the expression levels of over 20,000 protein-coding and long non-coding RNAs.
Protein abundance studies may shed light on disease causes and treatment targets because the proteome is the functional output of the genome. Sequential window acquisition of all theoretical mass spectra (SWATH-MS) analysis (Anjo et al., 2017) of plasma samples yielded proteomic data for 1200 of the 1500 cases. We discovered and measured approximately 3000 plasma proteins throughout the research population. Data from untargeted GC-MS of serum samples were used for metabolomic profiling for all 1500 subjects. Over two hundred lipid, amino acid, and carbohydrate metabolites were found. Information gleaned from metabolomics studies has the potential to illuminate metabolic dysregulations linked to disease states and give a comprehensive picture of the biochemical processes that are disrupted in illness.
Data from electronic health records and in-person interviews were used to compile detailed clinical phenotypic information (Kim et al., 2024). Over a hundred clinical characteristics were collected for each participant, including demographic information, medical history, lifestyle factors, illness status, and medication use (Fig. 2d). Curated, categorized, and standardized clinical factors were used in this study (Smith et al., 2006), with the objective of connecting the clinical traits of the subjects in the research cohort with their omics data. For this, we used conventional processes and analytic pipelines. Genomic variants were identified using the Genome Analysis Toolkit (GATK) (McKenna et al., 2010) and annotated using snpEff. The Minfi package (Aryee et al., 2014) was employed for normalization and background correction, after which methylation levels at CpG sites were determined. Salmon was used to quantify gene expression from RNA-seq data aligned to the reference genome with the STAR aligner (Dobin et al., 2013). For the proteomic data, we used Proteome Discoverer (Orsburn, 2021), identifying peptides with SEQUEST-HT (Tabb, 2015). Integration, identification, and quantification of metabolite peaks were performed in XCMS using the Metabolite and Tandem Mass Spectral Database (METLIN).
For the integration of multi-omics features with clinically relevant parameters, such as illness stage and risk factors, we used a feature-selection-based method. Significant associations between the multi-omics and clinical characteristics were identified using elastic net regression. This allowed us to eliminate inconsequential relationships and extract the factors most pertinent to disease pathophysiology. A correlation network was also created to identify statistically significant associations between multi-omics and clinical features. This network provided insights into feature clusters, connection patterns, and molecular signature-based illness subtypes. Furthermore, an adversarial validation protocol was used to evaluate the legitimacy of the causal assumptions in the disease models. The protocol involved retraining the models after random permutation of multi-omics and clinical variables. This allowed us to evaluate how model performance depends on the direct or indirect causal contributions of the variables, and thereby to assess the robustness of the identifiability assumptions of the disease mechanism model. Such examination of combined multi-level omics and clinical data has previously led to significant conclusions (Hasin et al., 2017).
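The permutation-and-retrain protocol described above can be illustrated with a deliberately simple stand-in model. The sketch below uses a nearest-centroid classifier and in-sample accuracy purely to keep the example self-contained; the classifier, the helper names, and the evaluation on the training data are all assumptions of this example and not the models or evaluation scheme used in the study.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid model: one mean vector per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def accuracy(model, X, y):
    """Fraction of samples whose nearest class centroid matches their label."""
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[d.argmin(axis=1)] == y).mean())

def permutation_drop(X, y, columns, rng):
    """Shuffle the chosen feature columns across samples, retrain, and
    return (baseline accuracy, accuracy after permutation)."""
    baseline = accuracy(fit_centroids(X, y), X, y)
    Xp = X.copy()
    for j in columns:
        Xp[:, j] = rng.permutation(Xp[:, j])  # breaks the feature-outcome link for column j
    return baseline, accuracy(fit_centroids(Xp, y), Xp, y)
```

Shuffling a column preserves its marginal distribution while destroying its relationship with the outcome, so a large accuracy drop indicates that the model relies on that feature. A held-out evaluation set would be needed for an unbiased estimate of the drop.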
We found statistically significant links between the disease condition and more than 1,500 genomic variants, 2,000 CpG sites, 1,000 RNAs, 200 proteins, and 50 metabolites. This is in line with the heavy emphasis on transcriptomics + epigenomics and transcriptomics + genomics combinations in cancer research. In contrast, proteomics + metabolomics are more commonly identified as potential determinants for other diseases, such as type 2 diabetes and heart disease, which rely more evenly on multi-omics pairs. Our study’s extensive dataset of affected participants made it possible to identify molecular endotypes that were significantly correlated with disease state. Correlation network analyses were instrumental in deducing significant links between illness subphenotypes and genetic markers. In the adversarial validation studies, the accuracy of disease risk predictions dropped sharply, from 85% to 60%, upon shuffling 20% of the DNA variants and clinical factors (Fig. 3a), consistent with a strong causal influence of these factors on disease development and progression. When proteomic and transcriptomic data were permuted, prediction accuracy dropped even further than for perturbations of genetic or metabolic features (Fig. 3b). This suggests that these functional layers play dominant roles in disease pathophysiology. Scrambling the molecular layers had a large influence on the disease models, whereas permuting the clinical variables alone did not (Fig. 3c). This indicates that clinical parameters alone, without the multi-omic features, are insufficient for drawing appropriate conclusions about disease biology. Indeed, our results highlight the need for integrating multi-omics data for a comprehensive understanding of disease pathophysiology. This is consistent with previous findings, which suggest combining omics data from several levels with clinical information for a better comprehension of complex pathologies (Luo et al., 2023).

Fig. 3. (a) Perturbation accuracy. Baseline 85% dropped to 60% after shuffling DNA+clinical; transcriptome and proteome perturbations caused the strongest declines. (b) Sensitivity. Transcriptomics and proteomics show the largest drops; genomics and metabolomics moderate; clinical-only minimal. (c) Robustness. Confounding, temporal, and structural attacks progressively increased topological disruptions and correlation losses.
In summary, the extensive dataset enabled us to test critical causal identifiability assumptions in computational disease models and to identify molecular endotypes associated with disease. Correlation network analysis provided information on disease subphenotypes based on genetic markers, and feature selection helped identify important signals. Additionally, the adversarial validation trials suggested that certain omics layers are more causally relevant to illness than others: the proteomic and transcriptomic layers were more informative about disease mechanisms than genetic or metabolic features, and perturbing them had the largest effect on prediction accuracy. Our results highlight the need for integrating multi-omic data for a comprehensive understanding of disease biology. Incorporating molecular information improves our ability to dissect the underlying mechanisms of disease, since clinical phenotypes alone may not convey the intricacy of disease processes. We demonstrated the promise of model-based inference from rich multi-omics and clinical data for advancing precision medicine. By combining multiple data sources, we evaluated the causal assumptions that underlie illness inference models and gained new insights into the biological basis of diseases. These findings may help pave the way for individualized approaches to diagnosis, treatment, and prevention.
3.2 Predominant single and combined omics per objective and disease
We conducted a number of focused experiments to confirm the results of our in silico multi-omics models, as part of our work on testing the assumptions of causal identifiability in model-based inference of lung cancer. The network representation (Fig. 4a) highlights transcriptomics as the most central node, consistently paired with epigenomics, proteomics, and metabolomics, reflecting its key role in multi-omics integration. The thickest edges indicated frequent co-occurrence of transcriptomics-epigenomics and transcriptomics-proteomics interactions, underlining their importance for disease mechanism inference. Genomics and transcriptomics contributed most strongly to disease classification tasks, serving as primary biomarkers, while proteomics and metabolomics added complementary information that improved classification accuracy (Fig. 4b). Furthermore, transcriptomics and epigenomics dominated prognosis/outcome studies, capturing dynamic molecular changes that drive disease progression, with genomics providing baseline risk context (Figs. 4c and d). Taken together, these results demonstrate that integrating functional (transcriptomic, proteomic) and regulatory (epigenomic) layers with genomic data produces the most robust and clinically meaningful disease models, offering deeper insights into disease heterogeneity and supporting precision medicine applications.

Fig. 4. (a) Disease inference. The omics co-occurrence network shows transcriptomics as the central point, strongly paired with epigenomics and proteomics, underscoring their roles in disease inference. (b) Disease classification. Genomics and transcriptomics are the strongest contributors; proteomics and metabolomics add complementary insights. (c) Disease prognosis. Transcriptomics and epigenomics dominate prognosis; genomics provides baseline risk context. (d) Tasks. Therapy: transcriptomics-driven; risk: genomics-centered; subtyping: transcriptomics plus epigenomics integration.
Our goal was to improve our understanding of the main causes and underlying processes of lung cancer by combining computational predictions with wet-lab validations. To characterize the predicted confounders, we altered the expression of catenin beta-1 (CTNNB1), phosphatase and tensin homolog deleted on chromosome 10 (PTEN), and tumor protein p53 (TP53) in A549 lung cancer cells (Fig. 5a). RNA-seq analysis confirmed the predicted increase in wingless-related integration site (Wnt) pathway genes upon CTNNB1 activation, supporting its role as a confounder. An unexpected upregulation of inflammatory genes was also detected. Suppressing PTEN had no effect on Wnt signaling, but it disrupted PI3K-Akt and TP53 targets. We confirmed that CTNNB1 and PTEN were confounding factors by comparing these experimental transcriptome profiles with the computational predictions (Liu et al., 2023; Luo et al., 2023). As shown in Fig. 5(b), TP53 perturbation had unexpected effects on transcription. These results underlined the need to modify the model to incorporate the observed effects of TP53 on gene expression. To test our temporal hypotheses in A549 cells, we used the demethylating drug 5-azacytidine along with the histone modifier trichostatin A, and evaluated changes in chromatin accessibility and gene expression using ATAC-seq and RNA-seq. Alterations in chromatin accessibility and gene expression were more pronounced when treatments were administered simultaneously rather than in sequence (Fig. 5c). This ran counter to our original assumption that epigenetic changes occur in a linear, sequential fashion, highlighting the need to rethink the ordering of causes and effects. We used genome-wide methylation profiling to examine the effects of exposing A549 cells to increasing concentrations of the environmental carcinogen BaP. Hypermethylation of the metabolizing genes cytochrome P450 family 1 subfamily A member 1 (CYP1A1) and family 1 subfamily B member 1 (CYP1B1) was found to be concentration-dependent (Fig. 6a). Furthermore, targeted metabolomics found non-linear changes in onco-metabolites only at high doses of BaP, revealing complex dose-response relationships that necessitate updating our models (Figs. 6b and c).
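One simple way to quantify departure from linearity in a dose-response series like the one described above is to compare a straight-line fit with a quadratic fit. The sketch below is an illustrative check under that assumption; the function name and the use of a quadratic alternative are choices made for this example, and the text does not state which non-linearity test was actually applied.

```python
import numpy as np

def nonlinearity_gain(dose, response):
    """Fraction of the linear fit's residual sum of squares removed by adding a quadratic term.
    Values near 0 suggest an approximately linear response; values near 1 suggest strong curvature."""
    dose = np.asarray(dose, dtype=float)
    response = np.asarray(response, dtype=float)
    rss = []
    for degree in (1, 2):
        coef = np.polyfit(dose, response, degree)
        rss.append(float(np.sum((response - np.polyval(coef, dose)) ** 2)))
    return 1.0 - rss[1] / rss[0] if rss[0] > 0 else 0.0
```

Because the quadratic model nests the linear one, the gain is never negative, so a formal comparison (for example an F-test) would be needed to judge whether a given gain exceeds what the extra parameter explains by chance.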

Fig. 5. (a) Dumbbell plot. Predicted vs. observed effects for CTNNB1, PTEN, TP53, confirming confounder roles across pathways. (b) CRISPR RNA-seq. Differential expression counts for CTNNB1, PTEN, and TP53 perturbations validate computational predictions. (c) Temporal assays. Simultaneous epigenetic treatments altered chromatin accessibility and expression more strongly than sequential application.

Fig. 6. (a) Methylation. CYP1A1 and CYP1B1 hypermethylation increased non-linearly with BaP dose, showing clear concentration-dependent inflections. (b) Volcano plot. Significant metabolite changes appeared only at high BaP doses, highlighting non-linear effects. (c) Threshold summary. The number of significant metabolites surged above dose 5 a.u., confirming high-dose-dependent responses.
When we compared the multi-omics profiles obtained from these causal perturbation experiments with the computational predictions, we found correlations that supported several of our hypotheses (Fig. 7a). However, we also found several inconsistencies that require further research. These discrepancies in confounding factors, temporal ordering, and non-linear effects may form the basis of new experimental hypotheses (Luo et al., 2023; Lyu et al., 2023). The repeated cycle of computational modeling and wet-lab validation thus advanced causal knowledge synergistically. Areas of agreement between computational predictions and experimental data supported our model’s inferences about the primary causes of lung cancer. The differences, on the other hand, prompted new investigation of biological processes, including the kinetics of drug interactions and the context-specific activities of TP53 (Figs. 7b and c). By focusing experimental perturbations on our disease models, we verified the predicted drivers of lung cancer and improved our understanding of the underlying mechanisms. Combining in silico methods with wet-lab experiments yielded more accurate models of disease mechanisms that better reflect the complex biological processes at play across multiple levels. This integrated approach has the potential to deepen our understanding of complex diseases such as lung cancer and to support the development of more effective therapies.

Fig. 7. (a) Scatter. Predicted vs. observed effects showed strong correlation with notable outliers, confirming both agreement and discrepancies. (b) Bland-Altman. Agreement analysis indicated a small mean bias but wide limits, highlighting inconsistencies. (c) Summary table. Correlation, bias, limits, and overlap metrics quantify agreement between in silico predictions and wet-lab validation.
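The agreement metrics named in the Bland-Altman and summary panels (correlation, mean bias, and limits of agreement) can be illustrated with a minimal sketch. The effect sizes below are hypothetical placeholders, not values from this study, and the function name `agreement_metrics` is an assumption introduced only for illustration; the 95% limits use the conventional bias ± 1.96 × SD of the differences.

```python
import math
import statistics


def agreement_metrics(predicted, observed):
    """Return Pearson r, mean bias, and 95% limits of agreement."""
    # Bland-Altman: analyse the paired differences (predicted - observed).
    diffs = [p - o for p, o in zip(predicted, observed)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)

    # Pearson correlation, written out to avoid version-specific helpers.
    mean_p = statistics.mean(predicted)
    mean_o = statistics.mean(observed)
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    r = cov / math.sqrt(var_p * var_o)
    return {"r": r, "bias": bias, "limits": limits}


# Hypothetical predicted vs. observed effect sizes (illustration only).
predicted = [1.2, 0.8, -0.5, 2.0, 0.1]
observed = [1.0, 0.9, -0.7, 1.4, 0.3]
metrics = agreement_metrics(predicted, observed)
```

A small bias with wide limits, as described for panel (b), would appear here as a `bias` near zero together with a large gap between the two entries of `limits`.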
3.3 Integrating multi-omics data: Computational challenges and methods
Robust disease modeling and causal inference require the integration of diverse data types from multiple omics domains, which poses considerable computational challenges. In this study, we encountered several such difficulties and devised computational approaches to overcome them. Disparate data structures and missing values across omics layers are significant obstacles. For instance, only a minority of individuals with complete genetic and clinical profiles may also have proteomics and metabolomics data. To address this, we imputed missing data within each omics block using a Bayesian principal component analysis (PCA) method (Šmídl & Quinn, 2007). To avoid introducing artifacts from other domains, the imputation was based only on correlation patterns observed within the particular data type. Batch effects are another difficulty because they may induce systematic biases that obscure biological signals. To address this, we used ComBat normalization, which removes batch effects across integrated omics data while retaining biological variation of interest by treating the experimental batch as a covariate (Zhang et al., 2020). The presence of molecular subgroups with distinct properties, such as heterogeneous disease subtypes, presents yet another difficulty. We used repeated rounds of consensus clustering with subsampling to obtain stable subgroups that adequately reflected the inherent variability of the dataset. To ensure the reliability of the identified subgroups, we retained only robust clusters that recurred frequently across subsampling rounds (Nema & Vachhani, 2023).
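The idea behind consensus clustering with subsampling can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the base clusterer is a plain k-means written out by hand, the subsampling fraction and number of rounds are arbitrary, and the function names `kmeans_labels` and `consensus_matrix` as well as the toy data are hypothetical.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, rng, n_iter=20):
    """Plain Lloyd's k-means on tuples of floats; returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def consensus_matrix(points, k, n_rounds=50, frac=0.8, seed=0):
    """Fraction of co-sampled rounds in which each pair shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_rounds):
        idx = rng.sample(range(n), int(frac * n))  # subsample without replacement
        labels = kmeans_labels([points[i] for i in idx], k, rng)
        for (a, la), (b, lb) in combinations(zip(idx, labels), 2):
            sampled[a][b] += 1
            sampled[b][a] += 1
            if la == lb:
                together[a][b] += 1
                together[b][a] += 1
    return [
        [1.0 if i == j else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


# Toy one-dimensional data with two well-separated groups (illustration only).
points = [(0.0,), (1.0,), (2.0,), (3.0,), (100.0,), (101.0,), (102.0,), (103.0,)]
consensus = consensus_matrix(points, k=2)
```

Pairs with consensus values close to 1 are clustered together in nearly every subsample and would count as members of a robust cluster; values near 0.5 indicate unstable assignments.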
The high dimensionality of omics data complicates both statistical modeling and biological interpretation. To address this, we used elastic net regularization for feature selection. This strategy enabled us to identify the features most strongly associated with important traits across cohorts. By focusing on signals that were reproducible and resistant to overfitting, we reduced the dimensionality of the data and improved the interpretability of the findings. Another difficulty is combining data sets measured on different scales. We addressed this by scale-normalizing each omics block separately, using quantile normalization and principal component analysis, among other methods. These methods rescaled features to a common scale before the blocks were combined, ensuring that all data types were appropriately integrated. We chose network-based models for our study because they readily accommodate heterogeneous multi-omics information.
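As an illustration of the per-block scale normalization mentioned above, the following minimal sketch applies quantile normalization across the samples of a single omics block. The function name `quantile_normalize` and the numbers are hypothetical, and ties are broken by original position, which is a simplification compared with tie-averaging implementations.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equally long sample columns.

    Each column holds one sample's measurements for the same features.
    After normalization every column shares the same empirical
    distribution: the mean of the sorted columns at each rank.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]

    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # feature indices, smallest first
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace each value by the reference value at its rank
        normalized.append(out)
    return normalized


# Three hypothetical samples measured on four features (illustration only).
samples = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(samples)
```

Because the within-sample rank order of the features is preserved, the procedure changes only the scale of each sample, which is what makes blocks comparable before they are combined.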
We built a multi-layered network that included all selected omics features and clinical factors using a probabilistic graphical modeling technique. Using the conditional dependencies learned from the data that underlie this network, we identified important network properties, such as modular structures that mirrored molecular subtypes. We also built network-based classification and prediction models and assessed their performance by cross-validation. Finally, we randomized subsets of variables to introduce targeted disruptions into the multi-omics networks. This allowed us to conduct a counterfactual evaluation of the causal assumptions underlying the disease models.
Comparing prediction performance before and after disruption provided evidence for the importance of the causal variables postulated by our models. To overcome the obstacles described above, we combined these specialized computational approaches into a unified analytical framework. We used multi-scale molecular correlations and clinical information across omics domains to generate hypotheses and test our causal inference models. Addressing these computational problems improved our understanding of disease and strengthened our capacity to draw sound conclusions about the interactions between variables in complex biological systems (Lyu et al., 2023; Nema & Vachhani, 2023).
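The before-and-after comparison used in the variable-randomization experiments can be sketched in a few lines. Everything here is an assumption made for illustration: the data are synthetic, `predict` is a fixed rule standing in for a fitted disease model, and the helper names `accuracy` and `scramble_columns` are hypothetical. The point is only that permuting a variable the model relies on lowers accuracy, whereas permuting an irrelevant variable does not.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def scramble_columns(rows, columns, rng):
    """Return a copy of rows with the given columns permuted across samples."""
    scrambled = [list(r) for r in rows]
    for c in columns:
        values = [r[c] for r in scrambled]
        rng.shuffle(values)  # breaks the link between this variable and the outcome
        for r, v in zip(scrambled, values):
            r[c] = v
    return scrambled


def predict(row):
    """Hypothetical stand-in for a fitted model: uses feature 0 only."""
    return int(row[0] > 0)


rng = random.Random(0)
# Synthetic data: feature 0 drives the outcome, feature 1 is pure noise.
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
labels = [int(r[0] > 0) for r in rows]

baseline = accuracy(predict, rows, labels)
drop_causal = baseline - accuracy(predict, scramble_columns(rows, [0], rng), labels)
drop_noise = baseline - accuracy(predict, scramble_columns(rows, [1], rng), labels)
```

In a real analysis the model would be retrained or re-evaluated on held-out data, and the permutation would be repeated many times to obtain a distribution of performance drops rather than a single value.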
3.4 Key findings from in-silico adversarial validation
Our in-silico experiments systematically challenged the causal models and yielded several important observations. First, disease risk prediction accuracy dropped from 85% to 60% when 20% of DNA variants and clinical factors were randomly permuted, confirming the causal relevance of these features to disease outcomes. Second, proteomic and transcriptomic layers showed the largest performance degradation when scrambled, highlighting their functional proximity to phenotypic expression and emphasizing their importance in model construction. Third, we identified CTNNB1 and PTEN as hidden confounders, with RNA-seq validation confirming their influence on Wnt and PI3K-Akt signaling pathways. Fourth, violating the temporal assumptions by administering epigenetic treatments simultaneously rather than sequentially produced stronger network alterations than predicted, suggesting that certain epigenetic processes may act concurrently rather than in a strictly linear sequence. Finally, non-linear dose-dependent effects were observed following benzo(a)pyrene exposure, with concentration-dependent hypermethylation and changes in onco-metabolite levels, underscoring the need to incorporate non-linear causal relationships in future models. Collectively, these results demonstrate that adversarial validation not only reveals the fragility of untested causal assumptions but also provides a data-driven guide for prioritizing model refinements and experimental validations.
4. Discussion
Model-based causal inference from integrated omics is a promising approach for advancing precision medicine. Using large-scale multi-omics and clinical data, we constructed disease models and then systematically evaluated causal hypotheses using adversarial approaches to verify the fundamental assumptions underpinning these models. First, we demonstrated reproducibility by finding associations between individual and combined omics features and clinical symptoms in different populations. Combining these linked signals improved disease stratification beyond the capabilities of the separate omics layers (Osuala et al., 2023).
Next, we ran variable-scrambling experiments aimed at identifying the causal role of specific omics layers in disease. Perturbations beyond genomics, such as changes in proteins and DNA methylation, had a significant effect on predictions. The robust identification of disease subgroups demonstrates that our network-based modeling effectively captures the inherent heterogeneity of diseases. The network’s modular structure revealed correlated signals reflecting different molecular endotypes. We also examined temporal causality and found some evidence for the importance of sequential epigenetic-transcriptional processes in phenotypic plasticity. Experiments revealed exceptions that highlighted where temporal causality modeling could be improved. Key confounders and non-linear relationships suggested by the models were also verified empirically. Discrepancies between models and experimental data revealed new biological complexity, prompting further research (Tang et al., 2023).
We showed that clinical data alone are insufficient for sound disease inferences, underscoring the need to include omics context. These results, together with other confirmations, support the use of large data resources for computational causal discovery. Our study introduces a paradigm that merges predictive modeling with systematic adversarial challenges, exemplifying a virtuous cycle of data-driven discovery and experiment-guided refinement. Validating causal assumptions is essential for interpreting results from computational disease models. Future research should foster sustained interaction between computational and wet-lab researchers to overcome present limitations. Expanding experimental workflows to target additional molecular layers such as proteins, to include single-cell profiling, and to record temporal information might improve disease progression models (Zhong et al., 2025).
The precise characterization of molecular endotypes also requires larger disease-specific cohort studies with more extensive baseline sampling. The causal evidence generated from such research might be strengthened by longitudinal designs that capture dynamic changes over time. For the translational implications of our results, it will be important to evaluate the modeled disease subtypes and treatment biomarkers in independent clinical settings (Xia et al., 2022; Yang et al., 2023). Operationalizing validated findings could support the development of new diagnostic and prognostic classifiers and precision therapeutics, leading to better patient outcomes. Our study takes important first steps toward validating model-based causal inference from integrated omics. With continued collaborations that make use of ever-growing multi-dimensional datasets, computational methods show promise in shedding light on disease processes and supporting personalized therapy (Zbrzezny & Grzybowski, 2023; Zhang et al., 2023).
5. Conclusions
This study introduced an adversarial validation framework that identifies which violated causal assumptions undermine disease models and quantifies the impact of those violations. By integrating multi-omics datasets and clinical layers, the study demonstrated how confounding, temporal, and structural attacks reveal model vulnerabilities that would otherwise remain hidden. This approach offers a pre-deployment “stress test” for causal inference pipelines, helping researchers prioritize which assumptions to verify experimentally. Validated causal models strengthen the reliability of biomarker discovery, risk stratification, and treatment selection in precision medicine. The framework depends on the availability of well-annotated multi-omics data and may miss unmeasured confounders or subtle temporal effects in sparse datasets. Expanding this approach to single-cell, spatial, and longitudinal data will improve its ability to model dynamic disease processes. Broader application across diseases could create a library of validated causal networks to guide translational research and drug development.
CRediT authorship contribution statement
Shafiul Haque: Conceptualization; data curation; formal analysis; investigation; methodology; project administration; resources; software; supervision; validation; visualization; writing – original draft; writing – review & editing. Muhammad Sufyan: Conceptualization; data curation; formal analysis; investigation; methodology; resources; validation; visualization; writing – original draft. Darin Mansor Mathkor: Conceptualization; methodology; resources; software; validation; visualization; writing – original draft; writing – review & editing. Mohd Wahid: Conceptualization; methodology; resources; software; validation; visualization; writing – original draft; writing – review & editing. Raju K. Mandal: Conceptualization; methodology; resources; software; validation; visualization; writing – original draft; writing – review & editing. Faraz Ahmad: Conceptualization; data curation; formal analysis; investigation; methodology; project administration; resources; software; supervision; validation; visualization; writing – original draft; writing – review & editing.
Declaration of competing interest
The authors declare that they have no competing financial interests or personal relationships that could have influenced the work presented in this paper.
Data availability
The data underlying this article will be shared on reasonable request to the corresponding author.
Declaration of Generative AI and AI-assisted technologies in the writing process
The authors confirm that there was no use of artificial intelligence (AI)-assisted technology for assisting in the writing or editing of the manuscript and no images were manipulated using AI.
References
- SWATH-MS as a tool for biomarker discovery: From basic research to clinical applications. Proteomics. 2017;17:3-4. https://doi.org/10.1002/pmic.201600278
- Minfi: A flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363-1369. https://doi.org/10.1093/bioinformatics/btu049
- Chobtham, K., Constantinou, A.C., 2020. Bayesian network structure learning with causal effects in the presence of latent variables. Proceedings of the 10th International Conference on Probabilistic Graphical Models, Proceedings of Machine Learning Research. https://proceedings.mlr.press/v138/chobtham20a.html
- STAR: Ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15-21. https://doi.org/10.1093/bioinformatics/bts635
- Multimodal adversarial representation learning for breast cancer prognosis prediction. Comput Biol Med. 2023;157:106765. https://doi.org/10.1016/j.compbiomed.2023.106765
- Synthesized 7T MPRAGE From 3T MPRAGE using generative adversarial network and validation in clinical brain imaging: A feasibility study. J Magn Reson Imaging. 2024;59:1620-1629. https://doi.org/10.1002/jmri.28944
- Multimodal learning with graphs. Nat Mach Intell. 2023;5:340-350. https://doi.org/10.1038/s42256-023-00624-6
- Hierarchical perception adversarial learning framework for compressed sensing MRI. IEEE Trans Med Imaging. 2023;42:1859-1874. https://doi.org/10.1109/TMI.2023.3240862
- Optimizing clinical diabetes diagnosis through generative adversarial networks: Evaluation and validation. Diseases. 2023;11:134. https://doi.org/10.3390/diseases11040134
- Sequential combination of pharmacotherapy and psychotherapy in major depressive disorder: A systematic review and meta-analysis. JAMA Psychiatry. 2021;78:261-269. https://doi.org/10.1001/jamapsychiatry.2020.3650
- Multi-omics approaches to disease. Genome Biol. 2017;18:83. https://doi.org/10.1186/s13059-017-1215-1
- A transformer-based generative adversarial network for brain tumor segmentation. Front Neurosci. 2022;16:1054948. https://doi.org/10.3389/fnins.2022.1054948
- Probabilistic domain-knowledge modeling of disorder pathogenesis for dynamics forecasting of acute onset. Artif Intell Med. 2021;115:102056. https://doi.org/10.1016/j.artmed.2021.102056
- Synthetic tabular data based on generative adversarial networks in health care: Generation and validation using the divide-and-conquer strategy. JMIR Med Inform. 2023;11:e47859. https://doi.org/10.2196/47859
- The genotype-tissue expression (GTEx) project: Linking clinical data with molecular analysis to advance personalized medicine. J Pers Med. 2015;5:22-29. https://doi.org/10.3390/jpm5010022
- C-DARL: Contrastive diffusion adversarial representation learning for label-free blood vessel segmentation. Med Image Anal. 2024;91:103022. https://doi.org/10.1016/j.media.2023.103022
- Nearest neighbor imputation using spatial-temporal correlations in wireless sensor networks. Inf Fusion. 2014;15:64-79. https://doi.org/10.1016/j.inffus.2012.08.007
- Wasserstein Generative adversarial networks based differential privacy metaverse data sharing. IEEE J Biomed Health Inform. 2024;28:6348-6359. https://doi.org/10.1109/JBHI.2023.3287092
- Adversarial style discrepancy minimization for unsupervised domain adaptation. Neural Netw. 2023;157:216-225. https://doi.org/10.1016/j.neunet.2022.10.015
- Generative adversarial network–based noncontrast CT angiography for aorta and carotid arteries. Radiology. 2023;309 https://doi.org/10.1148/radiol.230681
- The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297-1303. https://doi.org/10.1101/gr.107524.110
- Unpaired deep adversarial learning for multi‐class segmentation of instruments in robot‐assisted surgical videos. Robotics Computer Surg. 2023;19 https://doi.org/10.1002/rcs.2514
- Proteome discoverer—A community enhanced data processing suite for protein informatics. Proteomes. 2021;9:15. https://doi.org/10.3390/proteomes9010015
- Data synthesis and adversarial networks: A review and meta-analysis in cancer imaging. Med Image Anal. 2023;84:102704. https://doi.org/10.1016/j.media.2022.102704
- Dose-dependent benzo(a)pyrene [B(a)P]–DNA adduct levels and persistence in f-344 rats following subchronic dietary exposure to b(a)P. Cancer Letters. 2006;240:268-278. https://doi.org/10.1016/j.canlet.2005.09.016
- On Bayesian principal component analysis. Computational Statistics & Data Analysis. 2007;51:4101-4123. https://doi.org/10.1016/j.csda.2007.01.011
- XCMS: Processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem. 2006;78:779-787. https://doi.org/10.1021/ac051437y
- The SEQUEST family tree. J Am Soc Mass Spectrom. 2015;26:1814-1819. https://doi.org/10.1007/s13361-015-1201-3
- Consistency and adversarial semi-supervised learning for medical image segmentation. Comput Biol Med. 2023;161:107018. https://doi.org/10.1016/j.compbiomed.2023.107018
- The cancer genome atlas (TCGA): An immeasurable source of knowledge. Contemp Oncol (Pozn). 2015;19:A68-A77. https://doi.org/10.5114/wo.2014.47136
- Adversarial counterfactual augmentation: Application in Alzheimer’s disease classification. Front Radio. 2022;2 https://doi.org/10.3389/fradi.2022.1039160
- An adversarial training framework for mitigating algorithmic biases in clinical machine learning. NPJ Digit Med. 2023;6:55. https://doi.org/10.1038/s41746-023-00805-y
- Deceptive tricks in artificial intelligence: Adversarial attacks in ophthalmology. J Clin Med. 2023;12:3266. https://doi.org/10.3390/jcm12093266
- Medical applications of generative adversarial network: A visualization analysis. Acta Radiol. 2023;64:2757-2767. https://doi.org/10.1177/02841851231189035
- ComBat-seq: Batch effect adjustment for RNA-seq count data. NAR Genom Bioinform. 2020;2:lqaa078. https://doi.org/10.1093/nargab/lqaa078
- sPGGM: A sample-perturbed Gaussian graphical model for identifying pre-disease stages and signaling molecules of disease progression. Natl Sci Rev. 2025;12:nwaf189. https://doi.org/10.1093/nsr/nwaf189
