- by Jiang, Y., Yang, Y. T., Sisu, C., He, T., Won, H., Gerstein, M. Although gene expression in the brain has been extensively investigated and compared to other tissues, the activity of pseudogenes has not been comprehensively surveyed. Here, leveraging large-scale RNA-seq data, we construct consistent pseudogene expression profiles in human and macaque brains and compare them to 29 other tissues. We further annotate pseudogenes with potential cellular roles based on co-clustering them with protein-coding genes. Notably, the majority of the expressed pseudogenes show elevated expression in the brain relative to other tissues, and […]
- by McEwen, B., Bernard, C., Stowell, D. Active learning optimises machine learning model training through the data-efficient selection of informative samples for annotation and training. In the context of biodiversity monitoring using passive acoustic monitoring, active learning offers a promising strategy to reduce the fundamental data and annotation bottleneck and improve global training efficiency. However, the generalisability of model performance across ecologically relevant strata (e.g., sites, seasons) is often overlooked. As passive acoustic monitoring is extended to larger scales and finer resolutions, inter-strata spatiotemporal variability […]
- by Ma, M., Grima, R. Transcriptional bursting, characterized by stochastic switching between promoter states, underlies cell-to-cell variability in gene expression. Accurately inferring promoter activity from live-cell imaging data remains challenging because the fluorescence signal at any given point is influenced by the history of promoter states. Here, we present DART (Deep learning for the Analysis and Reconstruction of Transcriptional dynamics), a deep learning framework that infers promoter on- and off-states from fluorescence intensity traces, enabling the estimation of activation and inactivation rates and the selection […]
- by Varambally, S., Rubey, D., Shimoga Chandrashekar, D., Shovon, A. R., Puli, G. C., Karthikeyan, S. K., Manne, U., Creighton, C. J., Kumar, S. Cancer is a complex disease affecting various organs and is a major cause of death worldwide. During cancer initiation, disease progression, and tumor metastasis, various genomic and proteomic alterations are observed. Recent technological advances have led to the generation of large amounts of molecular data, including genomics and transcriptomics. These large-scale datasets can be utilized to analyze and identify sub-class-specific cancer biomarkers and targets. However, there is a need for the development of user-friendly tools for large-scale data analysis, disseminating […]
- by Shen, Q., Gai, K., Li, S., Yang, F., Cui, H., Zhang, S., Zhang, S. Spatially variable genes (SVGs) are crucial for understanding spatial heterogeneity in spatial transcriptomics. Yet SVGs are challenging to align with established biological axes and lack straightforward mechanistic interpretation, which limits their ability to inform downstream experiments or clinical applications. Here, we introduce the concepts of directionally variable genes (DVGs) and temporally variable genes (TVGs). We also propose a unified framework, STAVAG, that models spatial or temporal information to identify DVGs and TVGs. STAVAG effectively identifies biologically meaningful DVGs for uncovering […]
- by Hu, Y., Cao, Z., Liu, Y. Protein-protein interactions (PPIs) are fundamental to cellular function, and structurally characterizing them is a key goal in molecular biology. Computational docking, a crucial tool for predicting the structure of protein complexes, faces significant challenges in search efficiency and scoring accuracy. Here, we introduce LieOTDock, a novel framework that leverages the mathematics of Lie groups and Optimal Transport (OT) for highly efficient protein docking pose generation. The core of our method is a data-driven surface sampling technique that identifies salient geometric […]
- by Jiao, Z., Zhang, Y., Lai, Y., Kang, J., Ma, L., Zhao, W., You, J., Cheng, W., Feng, J. Developments in proteomic platforms have enabled the generation of large-scale high-throughput plasma proteomics data [1-3]. With recent breakthroughs in AI modelling, these data have significantly enhanced our understanding of molecular mechanisms underlying human behaviors and diseases [4-6]. However, the replicability of associations between plasma proteomics and phenotypes remains underexplored. Here, we systematically assessed the replicability of associations with recent plasma proteomics data in the UK Biobank. Over 75% of cognitive function and mental health traits demonstrated high overall (proteomics-wide) replicability […]
- by Rich, J. M., Luebbert, L., Sullivan, D. K., Rosa, R., Pachter, L. Variant detection from sequencing data is fundamental for genomics and is the first step in a wide range of applications, ranging from genome-wide association studies to disease diagnosis. Widely used tools for variant detection utilize a de novo approach that is based on a combination of read mapping algorithms and statistical methods for identifying genetic variation from error-prone sequencing data. This approach has been successful, although the detection of insertion and deletion variants, as well as the detection of variants […]
- by Bernard, C., Postic, G., Ghannay, S., Tahi, F. RNA is a molecule that performs critical roles in cellular biology, with its function closely dependent on its three-dimensional conformation. Predicting and evaluating RNA 3D structures remains a significant challenge in the field. Although many metrics and scoring functions have been developed to assess structural quality, each offers a different perspective, and no single method has emerged as a definitive standard. To address this, we previously introduced RNAdvisor, a comprehensive and automated software platform to evaluate 3D RNA structures using […]
- by Zeng, W., Zou, H., Li, X., Wang, X., Peng, S. The interactions between proteins and other biomolecules, such as nucleic acids, form a complex system that supports life activities. Designing proteins capable of targeted biomolecular binding is therefore critical for protein engineering and gene therapy. In this study, we propose a new generative model, EiRA, specifically designed for universal biomolecular binding protein design. EiRA undergoes two-stage post-training, i.e., domain-adaptive masking training and binding site-informed preference optimization, based on a general multimodal protein language model. A multidimensional evaluation reveals the SOTA […]
- by Li, J., Chen, K., Chen, X., Zheng, K., Sun, S., Dong, Y., Wang, Z., Xu, Y., McMinn, A., Sung, Y. Y., Mok, W. J., Wong, L. L., Liang, Y., Wang, M. Mollusca is the second-largest animal phylum and an important marine food resource for humans. While DNA viruses that threaten molluscan aquaculture have received much attention, molluscan RNA viromes remain poorly explored. Here, based on 223 molluscan metatranscriptomes covering eight animal classes, we identified 80 species-level RNA viruses spanning three viral phyla and nine viral families. Phylogenetic analysis combined with structural modeling revealed a major increase in the number of Pisuviricota-related lineages. Extensive modular evolution in viral genomes was observed including gene […]
- by Gu, T., Ming, D. Terpenoids constitute nature's most chemically diverse metabolite family with vital pharmaceutical and industrial applications, yet existing databases lack systematic integration of precursor metabolic enzymes (HMGR, DXS) and mechanistic insights into terpene diversification. To bridge this gap, we developed the Terpene Synthase Database (TSDB), which distinguishes itself through three key innovations: (1) comprehensive integration of MVA/MEP pathway enzymes with downstream terpenoid synthases, (2) enhanced functional annotation via InterProScan domain mapping and phylogenetics to decode catalytic plasticity, and (3) unprecedented taxonomic breadth spanning […]
- by Kellermann, G., Croce, O., Mograbi, B., Hofman, P., Brest, P. Shared epitopes create safety and efficacy issues for T-cell immunotherapy. To facilitate the monitoring of immune responses and the engineering required to solve this problem, we performed a computational proteome-wide epitope screening to establish the complete atlas of shared epitopes in the human and murine proteomes. Unlike bacterial or viral antigens, self-antigens like tumor-associated antigens (TAAs) frequently contained a high level of shared MHC-II epitopes identical to those of other, unintended self-proteins. Therefore, shared epitopes should be a mandatory and […]
- by Darbani, B., Pedersen, O. B. V., Ostrowski, S. R., Tan, Q., Andersen, V. Genome-wide association studies are vulnerable to confounding factors. This study provides evidence-based guidance for minimizing bias associated with genetic relatedness, SNP-specific non-additive allelic interactions, predisposed genotypes among controls, and multi-allelic polymorphism in case-control studies. The analyses demonstrated that genetic similarity within case or control groups introduces experimental bias, whereas genetic relatedness across case-control samples reduces this bias. These findings contribute to establishing a general framework for filtering of genetically related sub-communities or paired samples, while preserving maximal statistical power. Moreover, […]
- by Miller, H. E., Greenig, M., Tenmann, B., Wang, B. Large language model (LLM) agents hold promise for accelerating biomedical research and development (R&D). Several biomedical agents have recently been proposed, but their evaluation has largely been restricted to question answering (e.g., LAB-Bench) or narrow bioinformatics tasks. Presently, there remains a lack of benchmarks evaluating agent capability in multi-step data analysis workflows or in solving the machine learning (ML) challenges central to AI-driven therapeutics development, such as perturbation response modeling or drug toxicity prediction. We introduce BioML-bench, the first benchmarking […]
- by Knowles, J. Fungal electrical activity exhibits spikes and slow oscillatory modulations over seconds to hours. We introduce a √t-warped wave transform that concentrates long-time structure into compact spectral peaks, improving time-frequency localization for sublinear temporal dynamics. On open fungal datasets (fs ≈ 1 Hz) the method yields sharper spectra than STFT, stable τ-band trajectories, and species-specific multi-scale "signatures". Coupled with spike statistics and a lightweight ML pipeline, we obtain reproducible diagnostics under leave-one-file-out validation. All analyses are timestamped, audited, and designed for low-RAM devices.
- by Zhang, C., Zhang, S. Spatial transcriptomics technologies have provided invaluable insights by profiling gene expression alongside precise spatial information. However, they encounter high costs, data sparsity, and limited resolution, hindering their broader adoption and utility. Moreover, the lack of flexible simulators capable of generating high-fidelity simulated data has impeded the development of computational tools for spatial transcriptomic data analysis. To this end, we introduce STADiffuser, a versatile deep generative model that leverages diffusion modeling for accurate simulation of spatial transcriptomic data. STADiffuser employs a […]
- by Ward, M., Dao, N., Datta, A., Li, Z. Improving the efficiency of data compression remains essential for feature selection and data modelling. Current approaches for compressing epigenomic/genomic data rely heavily on autoencoders, which require substantial computing resources, parameter fine-tuning, training, and time. Here, we developed a training-free, Fast Fourier Transform (FFT)-based method for data compression with high efficiency and full interpretability. Our FFT method compresses epigenomic data of histone modification up to 1,000-fold while still maintaining high reconstruction fidelity (cosine similarity, 99.7%), does not require any training and […]
- by Shinde, T., Kumpitsch, C., Zhang, Y., Franzosa, E. A., Mohammadzadeh, R., Weinberger, V., Eagan, T. M., Huttenhower, C., Foris, V., Moissl-Eichinger, C. The human respiratory tract (RT) harbors complex microbial communities whose functions are critical to health and disease. Yet, current insights remain fragmented across anatomical sites, populations, and clinical states, limiting the field's ability to define common patterns in health and disease. Here, we present the first global respiratory pan-microbiome atlas, a resource integrating over 4,000 metagenomes across upper, intermediate, and lower RT from diverse cohorts encompassing health, pneumonia, COVID-19, and cystic fibrosis. Standardized taxonomic profiling reveals marked biogeographic structure: in […]
- by Jing, B., Sappington, A., Bafna, M., Shah, R., Tang, A., Krishna, R., Klivans, A., Diaz, D. J., Berger, B. Generating proteins with the full diversity and complexity of functions found in nature is a grand challenge in protein design. Here, we present ProDiT, a multimodal diffusion model that unifies sequence and structure modeling paradigms to enable the design of functional proteins at scale. Trained on sequences, 3D structures, and annotations for 214M proteins across the evolutionary landscape, ProDiT generates diverse, novel proteins that preserve known active and binding site motifs and can be successfully conditioned on a wide range […]