- by Yang, Y., Zhao, L., Orouji, S., Zhu, Y., Johnson, R. L., Maxwell, D. S., Mica, I., Russell, K. P., Al-lazikani, B. Confirming target engagement in tumor experimental models remains a major challenge in oncology drug development. Pharmacodynamic biomarkers can help address this, but few systematic resources link drug targets to candidate biomarkers. We developed TargetTrace, a comprehensive resource to identify and prioritize pharmacodynamic biomarkers across nine key target classes, including transcription factors/cofactors, kinases, phosphatases, ubiquitin ligases, deubiquitinases, acetyltransferases, deacetylases, methyltransferases, and demethylases. Biomarker candidates were gathered from curated molecular interaction resources and refined using external annotations to improve accuracy. For enzyme […]
- by Shen, J., Chikhi, R., Korobeynikov, A., Babaian, A. Freely available nucleic acid sequencing databases have accumulated to a vast archive of genetic diversity, in excess of 50 petabase-pairs from tens of millions of experiments. Together, these data constitute a digital survey of Earth's genome. However, the richness of biological information contained within these repositories remains largely unexplored, in large part owing to the technical challenges of analyzing petabytes of data. Recently, Logan completed the sequence-assembly and compression of 27 million sequencing libraries from the Sequence Read Archive (SRA), […]
- by Rinon, E. M., Visaya, M. V., Sambayan, R. Kernel methods offer a robust framework for integrating multi-modal datasets into a unified representation, thereby facilitating more comprehensive data interpretation. In the presence of incomplete datasets, multiple kernel learning is employed to enhance the efficiency of data completion and integration. We investigate kernel-based approaches to address the incomplete-data problem with applications to yeast protein data. Biological data such as yeast proteins can be represented through multiple modalities, including gene expression profiles, amino acid sequences, three-dimensional structures, and protein interaction networks. […]
- by Wang, D., Jin, J., Qiao, J., Wei, L., Wu, S., Liu, Q. Experimental and predicted RNA three-dimensional structures are expanding rapidly, but RNA structure search still lacks a compact residue-level representation that supports database-scale comparison. Using family-held-out ablations across the currently available experimental RNA structure collection, we found that spatial-neighbour features are markedly more informative for family-level discrimination than conventional backbone and base descriptors. Building on this result, we developed RiboSeek, a search framework based on a 20-letter geometric alphabet (RS-20) and an 80-letter structure-and-base composite alphabet (RS-80). Across family-level classification and retrieval […]
- by Silva de Almeida, B. L., Bonidia, R., Bole, M., Avila-Santos, A., Stadler, P. F., Nunes da Rocha, U., de Carvalho, A. C. P. L. F. The prediction of biological sequence properties has traditionally relied on alignment-based methods that assume evolutionary homology and depend on curated reference databases. This, in turn, limits scalability and sensitivity for large or heterogeneous datasets, remote homologs, short sequences, and rapidly evolving genomic regions. Although Machine-Learning (ML) approaches offer alignment-free alternatives, their broader adoption is limited by: (i) the lack of standardized, externally validated benchmark models across diverse datasets, and (ii) the technical expertise required for feature engineering, model selection, and […]
- by Bhattiprolu, S. Three-dimensional organoid cultures have emerged as powerful models for studying human tissue biology, disease mechanisms, and drug responses. Fluorescence confocal microscopy of organoids generates complex volumetric image data that is increasingly analyzed using deep learning pipelines for cell segmentation, morphometry, and phenotyping. However, training and benchmarking such pipelines requires large annotated datasets, the manual curation of which is prohibitively expensive and time-consuming. Here we present a parametric, physics-based computational framework for generating synthetic 3D fluorescence organoid images with exact ground-truth […]
- by Roule, T., Akizu, N. Background: Despite their widespread use, quantitative comparison of epigenomic datasets such as ChIP-seq and CUT&RUN remains challenging, particularly due to difficulties in signal normalization across samples and conditions. Normalization based solely on sequencing depth is often insufficient due to the high variability in signal-to-noise ratios across samples, even within the same experiment. While exogenous spike-in normalization can address some issues, robust spike-in controls are not always available, and may introduce additional experimental burden and computational complexity. Furthermore, normalization and differential binding […]
- by Ge, C., Li, H. Single-cell CRISPR perturbation screens offer a powerful framework for causal discovery in gene regulatory networks, but existing methods struggle with high-dimensional count data, unmeasured confounding, and the increasing prevalence of high-multiplicity-of-infection (MOI) designs. We introduce RICE, a scalable framework for causal gene network estimation that integrates a reduced control function to address latent confounding with a constrained generalized linear model accommodating both hard and soft interventions. By enforcing differentiable acyclicity constraints, RICE enables efficient GPU-based optimization for large-scale data. Across […]
- by Razavi, M., Tellapragada, C., Giske, C. G. Cefiderocol uptake in Enterobacterales depends partly on TonB-dependent catecholate transporters, including CirA, yet the functional interpretation of CirA missense variation remains limited by an absence of large experimental phenotype datasets. Here we describe a structure-informed Siamese graph neural network (GNN) framework designed to prioritise CirA missense variants that are likely to impair transporter function and thereby contribute to reduced cefiderocol susceptibility. Because large experimental datasets of CirA missense phenotypes are not available, we trained the model on a synthetic mutant […]
- by Genz, L. R., Topf, M. Biomolecular interactions are central to many essential cellular processes, but RNA-containing complexes remain challenging to resolve structurally, even as experimental methods and AI-based prediction have expanded structural coverage. Tools for the integrated analysis of complex interfaces remain limited. We present ProNA3D, a tool that provides a unified platform for analyzing protein-nucleic acid and nucleic acid-only complexes, bridging the gap between structure prediction and functional interpretation. ProNA3D supports both experimental and computationally predicted structures, incorporating scoring metrics for AlphaFold3 predictions. It […]
- by Yamahata, I., Shimamura, T., Hayashi, S. Cell-penetrating peptides (CPPs) can deliver diverse cargos into cells. However, designing CPPs with receptor-selective interaction profiles remains difficult because interactions with individual cell-surface components cannot be tuned independently. Here, we developed a closed-loop in silico framework for receptor-selective CPP design, in which receptor interactions are formulated as explicit objectives in a multi-objective optimization problem. We first constructed a CPP-like candidate library using a sequence generative model fine-tuned on known CPPs. The framework then evaluated candidate peptides by receptor-wise docking, molecular […]
- by Han, S., Sztanka-Toth, T., Senel, E., Elnaggar, A., Patel, J., Mansi, T., Smirnov, D., Greshock, J., Javidi, A. Single-cell foundation models enable reusable representations and streamlined analysis workflows, yet rigorous evaluation of their performance and robustness in real-world pharmaceutical settings remains underexplored. Here, we benchmarked leading single-cell foundation models (scGPT; scGPT_CP, a continually pretrained checkpoint of scGPT; scFoundation; scMulan; CellFM) against established baseline methods (scVI; Harmony) for data integration using over 1.5 million cells from clinical and preclinical samples. Performance was assessed using well-established and complementary metrics for technical correction and biological structure preservation. We further introduced robustness-oriented […]
- by Zhang, X. Large language model (LLM) agents are increasingly used to synthesize heterogeneous bioinformatics evidence, but their reliability for high-volume biological annotation remains poorly characterized. We evaluated three agent configurations on a controlled protein annotation task: Claude App with Claude Opus 4.7, Claude Code CLI with Claude Opus 4.7 and Claude Scientific Skills, and Codex App with GPT-5.4 and Claude Scientific Skills. Each configuration was run three times on the same verbatim prompt, the same 73 selected orthogroup FASTA files (1,705 protein […]
- by Taguchi, Y.-h., Turki, T. CITE-seq jointly profiles cellular transcripts and surface proteins, but integrating RNA and antibody-derived tags (ADTs) remains challenging because the two modalities differ markedly in dimensionality, sparsity, and noise characteristics. We present a tensor-decomposition-based unsupervised feature extraction framework for the parameter-free integration of CITE-seq data. By constructing a gene × cell × protein tensor and applying HOSVD, this method derives the shared latent representations of genes, cells, and proteins without prior gene filtering or modality-weight tuning. Across five ImmGen T-cell CITE-seq […]
- by Miao, Y., Surguladze, N., Lerner, J., Poysungnoen, K., Ariano, K., Li, Y., Zhu, Y., Van Batavia, K., Jepson, J., Van De Klashorst, J., Ni, B. Y. X., Armstrong, A., Rahman, R., Horstmeyer, R., Hickey, J. W. Accurate cell segmentation is an essential step for quantitative analysis of biological imaging data. Recent advances in deep learning have led to the development of generalist segmentation models that perform robustly across multiple imaging modalities, including label-free phase contrast, fluorescence cell culture, and multiplexed fluorescence tissue imaging. However, systematic comparisons of these models at the level of downstream biological analysis remain limited. To address this gap, we evaluated several recent segmentation models, including Cellpose cyto3, Cellpose-SAM, SAM, and CellSAM, on […]
- by Miao, Z., Qu, Y., Huang, S., Laux, L., Peters, S., Aristel, A., Zhang, Z., Niedernhofer, L. J., McMahon, A., Kim, J., Zhang, N. Spatial transcriptomics enables the study of how cells coordinate their molecular states within tissue, providing insight into both normal function and disease processes. A key challenge is to identify gene expression programs that vary continuously across space and are coordinated between cell types. We present CoPro, a computational framework for detecting the spatially coordinated progression of cellular states. CoPro can operate in both supervised and unsupervised modes to identify gene programs that co-vary within or between cell types, and to […]
- by Nolte, K., Baumbach, J., Kollmannsberger, P., Sauer, F. G., Luehken, R. Diptera represent a diverse insect order, including vectors of human and animal pathogens. Their accurate species identification remains a major bottleneck in ecological and epidemiological studies. Morphological identification requires taxonomic expertise, while molecular methods are costly and not universally reliable. Wing geometric morphometrics offers an alternative, but manual landmark annotation is time-consuming and introduces observer bias. We developed ITHILDIN, an automated pipeline for landmark and semilandmark annotation of Diptera wings, combining UNet++ segmentation and an Hourglass landmark prediction model. Using […]
- by Lynch, A. W., Lee, S. S., Hummel, J. P., Geiger, B., Lawrence, M. S., Jin, H., Gulhan, D. C., Park, P. J. The genome of every cancer cell carries a record of the mutational processes that have acted throughout its history. Mutational signature analysis, which infers the activity of mutagenic processes from their characteristic base-change patterns, has become an indispensable tool for interpreting somatic mutations. However, this framework captures only which types of mutations a process generates and not where in the genome they occur – a distribution influenced by replication timing, chromatin organization, transcription, DNA secondary structure, and other genomic features. […]
- by Ou, Z., James, K., Charnock, S., Wipat, A. Selecting representative subsets from large protein sequence datasets is a common challenge in enzyme discovery and related tasks under limited screening capacity. In practice, candidate panels are often constructed using clustering-based redundancy reduction or manual selection guided by phylogenetic or similarity-network analyses, which do not directly optimise subset diversity and require threshold tuning or expert interpretation. Here, we present a bi-level diversity-optimisation framework for representative protein panel selection implemented using a local search heuristic that iteratively updates panel composition to […]
- by Chen, Y., Xu, Y., Cheng, Y., Qi, X., Bai, T., Yang, J., Luo, H., Du, X., Zhu, L., Yang, L., Shi, M., Wang, D., Li, Z., Shu, Y. Seasonal influenza viruses accumulate antigenic changes, eroding population immunity and necessitating recurrent vaccine updates. Hemagglutination inhibition (HI) assays are the standard for measuring antigenic relationships between circulating and vaccine strains; however, their limited throughput constrains the scale and timeliness of surveillance. Here, we present fluProfiler, a foundation-model-based framework that learns a stable mapping from viral sequences to antigenic space and uses this representation to support influenza antigenic prediction, vaccine strain evaluation, and diversity-driven sampling. fluAgPredictor aligns hemagglutinin (HA) and neuraminidase […]
