- by Ma, M., Grima, R. Transcriptional bursting, characterized by stochastic switching between promoter states, underlies cell-to-cell variability in gene expression. Accurately inferring promoter activity from live-cell imaging data remains challenging because the fluorescence signal at any given point is influenced by the history of promoter states. Here, we present DART (Deep learning for the Analysis and Reconstruction of Transcriptional dynamics), a deep learning framework that infers promoter on- and off-states from fluorescence intensity traces, enabling the estimation of activation and inactivation rates and the selection […]
- by McEwen, B., Bernard, C., Stowell, D. Active learning optimises machine learning model training through the data-efficient selection of informative samples for annotation and training. In the context of biodiversity monitoring using passive acoustic monitoring, active learning offers a promising strategy to reduce the fundamental data and annotation bottleneck and improve global training efficiency. However, the generalisability of model performance across ecologically relevant strata (e.g., sites, seasons) is often overlooked. As passive acoustic monitoring is extended to larger scales and finer resolutions, inter-strata spatiotemporal variability […]
- by Shen, Q., Gai, K., Li, S., Yang, F., Cui, H., Zhang, S., Zhang, S. Spatially variable genes (SVGs) are crucial for understanding spatial heterogeneity in spatial transcriptomics. Yet, SVGs are challenging to align with established biological axes and lack straightforward mechanistic interpretation, thereby limiting their ability to inform downstream experiments or clinical applications. Here, we introduce the concepts of directionally variable genes (DVGs) and temporally variable genes (TVGs). We also propose a unified framework, STAVAG, that models spatial or temporal information to identify DVGs and TVGs. STAVAG effectively identifies biologically meaningful DVGs for uncovering […]
- by Varambally, S., Rubey, D., Shimoga Chandrashekar, D., Shovon, A. R., Puli, G. C., Karthikeyan, S. K., Manne, U., Creighton, C. J., Kumar, S. Cancer is a complex disease affecting various organs and is a major cause of death worldwide. During cancer initiation, disease progression, and tumor metastasis, various genomic and proteomic alterations are observed. Recent technological advances have led to the generation of large amounts of molecular data, including genomics and transcriptomics. These large-scale datasets can be utilized to analyze and identify sub-class-specific cancer biomarkers and targets. However, there is a need for the development of user-friendly tools for large-scale data analysis, disseminating […]
- by Hu, Y., Cao, Z., Liu, Y. Protein-protein interactions (PPIs) are fundamental to cellular function, and structurally characterizing them is a key goal in molecular biology. Computational docking, a crucial tool for predicting the structure of protein complexes, faces significant challenges in search efficiency and scoring accuracy. Here, we introduce LieOTDock, a novel framework that leverages the mathematics of Lie groups and Optimal Transport (OT) for highly efficient protein docking pose generation. The core of our method is a data-driven surface sampling technique that identifies salient geometric […]
- by Jiao, Z., Zhang, Y., Lai, Y., Kang, J., Ma, L., Zhao, W., You, J., Cheng, W., Feng, J. Developments in proteomic platforms have enabled the generation of large-scale high-throughput plasma proteomics data [1-3]. With recent breakthroughs in AI modelling, these data have significantly enhanced our understanding of molecular mechanisms underlying human behaviors and diseases [4-6]. However, the replicability of associations between plasma proteomics and phenotypes remains underexplored. Here, we systematically assessed the replicability of associations with recent plasma proteomics data in the UK Biobank. Over 75% of cognitive function and mental health traits demonstrated high overall (proteomics-wide) replicability […]
- by Rich, J. M., Luebbert, L., Sullivan, D. K., Rosa, R., Pachter, L. Variant detection from sequencing data is fundamental for genomics and is the first step in a wide range of applications, ranging from genome-wide association studies to disease diagnosis. Widely used tools for variant detection utilize a de novo approach that is based on a combination of read mapping algorithms and statistical methods for identifying genetic variation from error-prone sequencing data. This approach has been successful, although the detection of insertion and deletion variants, as well as the detection of variants […]
- by Zeng, W., Zou, H., Li, X., Wang, X., Peng, S. The interactions between proteins and other biomolecules, such as nucleic acids, form a complex system that supports life activities. Designing proteins capable of targeted biomolecular binding is therefore critical for protein engineering and gene therapy. In this study, we propose a new generative model, EiRA, specifically designed for universal biomolecular binding protein design, which undergoes two-stage post-training, i.e., domain-adaptive masking training and binding site-informed preference optimization, based on a general multimodal protein language model. A multidimensional evaluation reveals the SOTA […]
- by Li, J., Chen, K., Chen, X., Zheng, K., Sun, S., Dong, Y., Wang, Z., Xu, Y., McMinn, A., Sung, Y. Y., Mok, W. J., Wong, L. L., Liang, Y., Wang, M. Mollusca is the second-largest animal phylum and an important marine food resource for humans. While DNA viruses that threaten molluscan aquaculture have received much attention, molluscan RNA viromes remain poorly explored. Here, based on 223 molluscan metatranscriptomes covering eight animal classes, we identified 80 species-level RNA viruses spanning three viral phyla and nine viral families. Phylogenetic analysis combined with structural modeling revealed a major increase in the number of Pisuviricota-related lineages. Extensive modular evolution in viral genomes was observed including gene […]
- by Gu, T., Ming, D. Terpenoids constitute nature's most chemically diverse metabolite family with vital pharmaceutical and industrial applications, yet existing databases lack systematic integration of precursor metabolic enzymes (HMGR, DXS) and mechanistic insights into terpene diversification. To bridge this gap, we developed the Terpene Synthase Database (TSDB), which distinguishes itself through three key innovations: (1) comprehensive integration of MVA/MEP pathway enzymes with downstream terpenoid synthases, (2) enhanced functional annotation via InterProScan domain mapping and phylogenetics to decode catalytic plasticity, and (3) unprecedented taxonomic breadth spanning […]
- by Bernard, C., Postic, G., Ghannay, S., Tahi, F. RNA is a molecule that performs critical roles in cellular biology, with its function closely dependent on its three-dimensional conformation. Predicting and evaluating RNA 3D structures remains a significant challenge in the field. Although many metrics and scoring functions have been developed to assess structural quality, each offers a different perspective, and no single method has emerged as a definitive standard. To address this, we previously introduced RNAdvisor, a comprehensive and automated software platform to evaluate 3D RNA structures using […]
- by Darbani, B., Pedersen, O. B. V., Ostrowski, S. R., Tan, Q., Andersen, V. Genome-wide association studies are vulnerable to confounding factors. This study provides evidence-based guidance for minimizing bias associated with genetic relatedness, SNP-specific non-additive allelic interactions, predisposed genotypes among controls, and multi-allelic polymorphism in case-control studies. The analyses demonstrated that genetic similarity within case or control groups introduces experimental bias, whereas genetic relatedness across case-control samples reduces this bias. These findings contribute to establishing a general framework for filtering of genetically related sub-communities or paired samples, while preserving maximal statistical power. Moreover, […]
- by Miller, H. E., Greenig, M., Tenmann, B., Wang, B. Large language model (LLM) agents hold promise for accelerating biomedical research and development (R&D). Several biomedical agents have recently been proposed, but their evaluation has largely been restricted to question answering (e.g., LAB-Bench) or narrow bioinformatics tasks. Presently, there remains a lack of benchmarks evaluating agent capability in multi-step data analysis workflows or in solving the machine learning (ML) challenges central to AI-driven therapeutics development, such as perturbation response modeling or drug toxicity prediction. We introduce BioML-bench, the first benchmarking […]
- by Knowles, J. Fungal electrical activity exhibits spikes and slow oscillatory modulations over seconds to hours. We introduce a √t-warped wave transform that concentrates long-time structure into compact spectral peaks, improving time-frequency localization for sublinear temporal dynamics. On open fungal datasets (fs ≈ 1 Hz) the method yields sharper spectra than STFT, stable τ-band trajectories, and species-specific multi-scale "signatures". Coupled with spike statistics and a lightweight ML pipeline, we obtain reproducible diagnostics under leave-one-file-out validation. All analyses are timestamped, audited, and designed for low-RAM devices.
- by Zhang, C., Zhang, S. Spatial transcriptomics technologies have provided invaluable insights by profiling gene expression alongside precise spatial information. However, they encounter high costs, data sparsity, and limited resolution, hindering their broader adoption and utility. Moreover, the lack of flexible simulators capable of generating high-fidelity simulated data has impeded the development of computational tools for spatial transcriptomic data analysis. To this end, we introduce STADiffuser, a versatile deep generative model that leverages diffusion modeling for accurate simulation of spatial transcriptomic data. STADiffuser employs a […]
- by Wakashima, T., Kume, K., Chiba, Y. Carbon fixation is a fundamental metabolic process that sustains ecosystems, yet its origins and evolutionary history remain largely unresolved. In this study, we focused on the Wood-Ljungdahl (WL) pathway, which is considered one of the most ancient carbon fixation pathways, and the reductive glycine (rGly) pathway, which shares several reactions with the WL pathway. The evolutionary scenario of the two carbon fixation pathways was inferred in the phylum Thermodesulfobacteriota, which includes microorganisms that operate either the WL pathway or the […]
- by Zhang, J., Zhou, Y., Zhu, T., Zhu, Z. Peptide-based drug design targeting undruggable proteins remains one of the most critical challenges in modern drug discovery. Conventional peptide-discovery pipelines rely on low-throughput experimental screening, which is both time-consuming and prohibitively expensive. Moreover, existing computational approaches for designing peptides against target proteins typically depend on the availability of high-quality structural information. Although recent structure-prediction tools such as AlphaFold3 have achieved breakthroughs in protein modeling, their accuracy for functional interfaces remains limited. The acquisition of high-resolution structures is often expensive, time-intensive, […]
- by Mellina-Andreu, J. L., Cisterna-Garcia, A., Botia, J. A. Motivation: Calculation of semantic similarity of Gene Ontology (GO) term subsets is a fundamental task in functional genomics, comparative studies, and biomedical data integration. Existing tools, primarily in Python or R, often face severe limitations in performance when scaling to large annotation datasets. Results: We present go3, a high-performance, Python-compatible library written in Rust that supports multiple semantic similarity metrics for GO terms and genes. go3 supports both pairwise and batch computations, optimized using Rust's parallelism and memory safety. Compared […]
- by Rockenbach, K. C., Zanini, S. F., Morris, R. J., Wells, R. J., Golicz, A. A. Deep neural networks can be trained to predict gene expression directly from genomic sequence, thereby implicitly learning regulatory sequence patterns from scratch, minimizing the bias imposed by prior assumptions. A challenging, yet promising prospect is the extraction of novel insights into gene-regulatory mechanisms, by probing and interpreting such gene expression models. Using a branched convolutional neural network architecture trained on promoter and terminator sequences of allopolyploid Brassica napus and the closely related model organism Arabidopsis thaliana, we show that deep […]
- by Kellermann, G., Croce, O., Mograbi, B., Hofman, P., Brest, P. Shared epitopes create safety and efficacy issues for T-cell immunotherapy. In order to facilitate the monitoring of immune responses and the engineering required to solve this problem, we performed a computational proteome-wide epitope screening to establish the complete atlas of shared epitopes in the human and murine proteomes. Unlike bacterial or viral antigens, self-antigens like tumor-associated antigens (TAAs) frequently contained a high level of shared MHC-II epitopes identical to unintended other self-proteins. Therefore, shared epitopes should be a mandatory and […]