- by Zinchenko, V., Schlicker, A., Kurilov, R., Pouplin, A., Horlacher, M. Estimating the response of tumor cells to specific perturbations is crucial for identifying effective treatments that selectively target cancer cells while sparing healthy ones, enabling personalized medicine approaches. Large-scale initiatives, such as DepMap, have profiled cancer cell line responses to various drug treatments and gene knockouts, facilitating the development of computational models that predict the sensitivity of cancer cells to different perturbations. Existing models utilize diverse methods for encoding perturbations, including various chemical fingerprints and types of gene-gene relationships. They also […]
- by Fajardo-Diaz, E., Bignon, E., Dehez, F., Karami, Y., Gonzalez-Aleman, R. Molecular dynamics is a key technique for exploring biomolecular systems at the atomic level. The rapid growth in accessible system sizes and timescales has intensified the need for efficient post-processing methods that extract meaningful insights from the resulting data. Interaction fingerprint (IFP) analyses are a valuable tool for elucidating key atomic interactions within molecular ensembles, yet current specialized software often struggles with extensive trajectories or complex systems. Here, we introduce InterMap, a Python package designed to accelerate IFP detection on […]
- by Pham, H., Lombroso, A. P., Cevik, E. C., Taylor, H. S., O'Donnell, K. J. Motivation: There is a growing need for efficient imputation methods for high-dimensional DNA methylation (DNAm) datasets. Existing microarray imputation approaches, such as k-nearest neighbors (K-NN) or principal component analysis (PCA)-based methods, provide high accuracy but can be computationally intensive, while methods for whole-genome data are not designed for large cohorts. We developed slideimp, an R package that implements sliding window, groupable, parallelized K-NN and optimized PCA imputation to address these limitations. Results: Benchmarks on microarray DNAm datasets demonstrate that slideimp […]
- by Bohnenkaemper, L., Parmigiani, L., Chauve, C., Stoye, J. Genomic rearrangements are major drivers of evolution and genetic disease. However, studying rearrangements requires segmenting the genomes of interest into conserved regions, called synteny blocks, that highlight structural differences between genomes. Synteny blocks are typically defined from annotated genes or derived as a by-product of whole-genome alignments. As these procedures are heuristic and do not explicitly model rearrangements, they can obscure real variation, create false similarities, and affect phylogenetic inference. The importance of synteny block definition has long been recognized, […]
- by Zakharov, G., Eberhardt, R., Frolova, A., Horyslavets, D., Hidari, E., Popov, I., Dennehy, J., Koreshkov, M., Antoniou, P., Kantsypa, V., Ovchinnikov, V., Iyer, V. Whole-exome (WES) and whole-genome (WGS) sequencing are rapidly becoming preferred methods for population-scale analysis of the human genetic landscape. However, there are currently no standardized quality control (QC) pipelines for human WES and WGS datasets. Moreover, there are no open datasets that can be used to test QC pipelines, because most projects (like 1000 Genomes and gnomAD) publish only post-QC results. We present WxS-QC, a powerful, scalable, and convenient pipeline for the QC of human WGS and WES cohorts, developed […]
- by Haddad, Y. H., Rom, Y. A., Green, G. S., Cain, A., Raveh, B., Habib, N. Processes such as Alzheimer's disease and aging are shaped by dynamic cascades of cellular and molecular changes. Mapping how genetic and environmental risk factors alter these cascades is critical for pinpointing disease susceptibility and guiding therapeutic intervention, yet this task requires dense sampling of cell-subtype and -subpopulation changes throughout the full course of disease. Bulk measurements provide scale but obscure cell-type and subpopulation resolution, whereas single-cell assays offer resolution but lack sufficient sample size and temporal coverage. We present DeepDynamics, […]
- by Zhang, S., Luo, S., Luo, Y., Su, S., Liu, L., Li, W., Li, J. Subcellular-level spatial transcriptomics data provide unprecedented context for uncovering finer cellular clusters and their interactions. However, integrative analysis at subcellular resolution faces many challenges owing to the data's ultra-large volume, ultra-high sparsity, and severe susceptibility to technical conditions and batch effects. We introduce STUltra, a scalable and accurate framework for integrating subcellular-level spatial omics data across spatial, temporal, and biomedical dimensions. Built on contrastive learning, STUltra combines a robust graph autoencoder with an interval sampling step to enhance batch-effect correction […]
- by Gupta, S., Romero, S., Cai, J. J. Gene knockout experiments are essential for dissecting gene function, and CRISPR has made targeted gene disruption more accessible than ever. Single-cell CRISPR screening enables the construction of rich genetic perturbation landscapes, facilitating the identification of genes whose perturbation strongly reshapes cellular states. However, due to the nonlinear dependencies within gene networks, identifying the most impactful tangible genes remains challenging. Existing virtual knockout methods estimate downstream effects of single-gene deletions but do not evaluate whether such perturbations disrupt global information flow […]
- by Margelevicius, M. Structural alignment of macromolecular complexes is essential for understanding their function and evolution, yet existing methods often rely on aligning individual chains before inferring complex-level correspondences, leading to inaccuracies and inefficiencies. Here we present GTcomplex, a novel algorithm that employs spatial indexing to perform holistic complex-level alignment, directly deriving chain assignments from optimal global superpositions. Benchmarking on diverse datasets (including protein complexes, viral capsids, and nucleic acid complexes) demonstrates that GTcomplex achieves state-of-the-art accuracy with substantial speed improvements over current methods. These […]
- by Cao, D., Guo, C., Cheng, Y., Zhang, W., Shi, M., Xia, X.-Q. The zebrafish (Danio rerio) is a critical vertebrate model organism in biomedical research. Accurate identification and longitudinal tracking of individual fish within cohorts enable linking genotype to complex phenotypes, which is essential for elucidating the molecular mechanisms underlying disease pathogenesis and trait formation. Nevertheless, reliable long-term individual identification remains a significant challenge, primarily because of the diminutive size of the fish and their minimal interindividual morphological variation. We developed ESC-IDNet, a dual-stage deep learning cascade architecture deployed on the FishIndivID platform (http://bioinfo.ihb.ac.cn/fishindivid), enabling […]
- by Fang, C., Montgomery, K. D., Maguire, S. E., Ramnauth, A. D., Guo, B., Miller, R., Kleinmann, J. E., Hyde, T. M., Martinowich, K., Maynard, K. R., Page, S. C., Hicks, S. C. Recent advances in spatially-resolved transcriptomics have enabled profiling of gene expression in a spatial context, which has led to the generation of large-scale single-cell and spatial atlases with computationally-derived cell type or spatial domain labels. An increasingly important task with these data has become the transfer of cell type or spatial domain annotations from a given reference (or source) atlas into a new target tissue or sample. The reference and target datasets could be at different resolutions or measured on […]
- by Kitani, A., Zhang, B., Himori, K., Matsui, Y. Motivation: Glycans are highly diverse biological sequences, but their functional understanding has lagged behind that of proteins and nucleic acids. Many glycans remain incompletely characterized or ambiguously annotated, limiting computational analyses. Existing computational approaches are primarily graph-based, capturing local structural features but struggling to model global patterns and incomplete sequences. Results: We present GlycanGT, a foundation model for glycans built on a graph transformer architecture. Glycans were represented as graphs with monosaccharides as nodes and glycosidic bonds as edges, and […]
- by Ergin, E. K., Conrrero, A., Ferguson, K. M., Lange, P. F. The human genome contains approximately 20,000 protein-coding genes. However, millions of diverse protein variants, called proteoforms, exist. Despite originating from the same gene, proteoforms often have distinct biological roles. In bottom-up proteomics, the aggregation of peptide measurements into protein-level quantities often obscures this information. Existing methods for proteoform deconvolution are limited by their handling of missing data, which can introduce significant bias. To address this, we developed ProteoForge, which builds on an imputation-aware statistical model to identify and group co-varying […]
- by Kwon, B. C., Mulligan, N., Bettencourt-Silva, J., Li, T.-H., Dandala, B., Lin, F., Tsou, C.-H., Meyer, P. Formulating hypotheses about gene-disease associations requires logical inference from prior data, followed by a laborious literature review. AI models trained on curated datasets (e.g., GWAS Catalog) can suggest SNP-disease links, but validating these predictions still demands manual evidence extraction. To streamline this process, we present GENET (Genomic Evidence Network Exploration Tool), an AI-enhanced, end-to-end visual analytics workflow applied to Age-Related Macular Degeneration (AMD). GENET comprises four sequential steps: (1) biomedical network analysis: a dual-encoder neural model identifies genes or SNPs […]
- by Gupta, S., Kaur, S., Lathwal, S., Agrawal, S., Mahmoudi, T. Global DNA hypomethylation is a defining hallmark of colorectal cancer (CRC) but is poorly captured by existing cell-free DNA (cfDNA) technologies, which typically interrogate only a fraction of CpG sites and are biased toward CpG islands. We developed Asima Rev, an electrical-impedance cfDNA assay that can differentiate healthy individuals from those with cancer by measuring cfDNA aggregation patterns associated with methylation state, enabling functional detection of genome-wide hypomethylation. In a cohort of 46 treatment-naive CRC patients and 33 controls, Asima […]
- by Vaishnavi, S., Rajkumar, N., Venkata Harshit, M., Varun Raju, N., Bhargava Chary, B., Kondaparthi, V. Small interfering RNA (siRNA) therapeutics have extraordinary potential for targeted gene silencing. They mediate post-transcriptional gene regulation by binding to complementary messenger RNA (mRNA) sequences and degrading them, thereby preventing the production of unwanted proteins. Recent machine learning and deep learning frameworks for predicting siRNA efficacy have achieved only moderate success, as these models rely solely on either handcrafted features or sequential relations and therefore cannot capture the full complexity of siRNA-mRNA interactions. In this context, we propose SiaRNA, […]
- by Yousefabadi, H., Mehrmohamadi, M. Motivation: Predicting antibacterial drug synergy remains difficult due to strain variability and the limited scale of experimentally tested combinations. Existing machine-learning approaches often rely on permissive cross-validation schemes that allow drug pairs to appear across folds, inflating performance. A rigorous evaluation framework and scalable feature representation are needed for robust generalization. Results: We assembled a curated dataset of 3,160 drug-pair-strain interactions covering 97 compounds and 10 bacterial strains. We then developed HALO (Held-out Antibiotic interaction Learning from latent bioactivity Observations), […]
- by Tam, C. Allergies affect billions of people worldwide, posing a substantial challenge to global health because of their elusive molecular determinants. Leveraging high-quality protein structures, we revisit why allergenic proteins trigger allergies, whereas some of their structural analogues do not. We categorize allergenic protein folds into similar (SAP) and dissimilar (DAP) folds on the basis of their similarity to human proteins, establishing a dual schema that reveals distinct molecular and evolutionary patterns. Here, we show that compared with structurally similar, nonallergenic protein […]
- by Rybakov, A., Khlebnikov, D. A., Ovchinnikova, D., Nikolskaya, A., Zinkevich, A., Mironov, A. Predicting specific RNA-protein interactions remains a challenging task: despite the existence of numerous methods, a unified approach has yet to emerge. Additional difficulties arise from the properties of in vivo IP experiments and their systematic biases, such as the overrepresentation of highly expressed RNAs. Here, we present the PLERIO machine learning framework, which utilizes eCLIP data for a single protein to reconstruct the full spectrum of its potential interactions with the cellular transcriptome (i.e., both highly expressed and lowly expressed […]
- by Mahapatra, S., Subramanian, N. A., Narayanan, M. Key genes of a complex biological system are often identified by inferring a coexpression network from transcriptomic data and analyzing it using network science measures such as centrality. However, relatively modest sample sizes of transcriptomic data, along with heterogeneity in the human population, can lead to uncertainty about the "true" coexpression network. Earlier studies have quantified this uncertainty using bootstrap resampling or similar analysis, but fewer have investigated the extent to which this uncertainty affects downstream network analyses like centrality. […]
