- by Rogic, S., Mancarci, B. O., Xu, B., Xiao, A., Yang, C., Pavlidis, P. Accurate, consistent and comprehensive metadata are essential for the reuse of functional genomics data deposited in repositories such as the Gene Expression Omnibus (GEO); however, achieving this often requires careful manual curation that is time-consuming, costly and prone to errors. In this paper, we evaluate the performance of Large Language Models (LLMs), specifically OpenAI's GPT-4o, as an assistive tool for entity-to-ontology annotation of two commonly encountered descriptors in transcriptomic experiments – mouse strains and cell lines. Using over 9,000 manually […]
- by Chen, E., Xu, J., Liu, Y., Li, Y., Feng, Y., Lu, Q., Ding, X., Niu, Z., Qin, S., Niu, S., Luo, Y., Guo, X., Luo, X. Dendrobium officinale is a typical epiphytic orchid. We report the telomere-to-telomere (T2T) genome assembly for D. officinale, representing the first T2T reference genome within the Orchidaceae family. The assembly is anchored to 19 chromosomes and contains 38 complete telomeres and 15 characterized centromeres. We further generated haplotype-resolved assemblies of the autotetraploid genome, identifying 12,761 sets of tetra-allelic genes. Based on synonymous substitution analysis, we inferred that the autotetraploidization event occurred approximately 0.86 million years ago. A systematic analysis of the […]
- by Wang, Q., Gao, A., Li, Y., Khatri, P., Hu, R., Huang, J., Pawitan, Y., Vu, T. N., Dinh, H. Q. Spatial transcriptomics data are largely available with RNA expression alone, limiting the detection of cell states defined by surface protein abundance. The lack of multi-omics spatial data limits the ability to identify immune cells and their signaling in the tumor microenvironment, as most solid tumors are immunologically poor and exhibit protein-RNA abundance discordance in critical immune cell surface markers. Although emerging technologies enable spatial multi-omics profiling, technical and cost constraints remain a hurdle. We introduce SR2P, a stacking-based machine-learning framework […]
- by Zinnah, K. M. A., Nabil, F. A., Darda, A., Islam, E., Hossain, F. M. A. Marburg virus (MARV) is a highly pathogenic filovirus that causes hemorrhagic fever with a high mortality rate, with very limited treatment options. The urgent need for targeted antiviral agents emphasizes the importance of structure-based drug discovery approaches. The present study aimed to evaluate the antiviral potential of Withaferin A (PubChem CID-265237) against three key proteins of MARV: viral protein 35 (VP35), and nucleoproteins (NP). Three-dimensional structures of these proteins were retrieved from RCSB-Protein Data Bank and docked with Withaferin A […]
- by Caskey, M., Rich, J., Weber, R., Mortazavi, A., Pachter, L., Hallgrimsdottir, I. B. Single-cell genomics technologies enable high-throughput cell profiling, but technical contamination remains an obstacle to accurate downstream analysis. Free-floating ambient molecules released from lysed cells and global bulk contamination introduced during library preparation can distort molecular profiles. These artifacts can obscure cellular identities and reduce the reliability of differential analysis or clustering results. We present an efficient and effective approach to removing ambient and bulk contamination that can be applied to data generated from a wide variety of technologies. We show […]
- by Winkelhardt, D., Berres, S., Uszkoreit, J. Peptide-spectrum match (PSM) rescoring has become standard in proteomics workflows, improving peptide identification accuracy across diverse search engines. Despite the availability of multiple rescoring strategies, systematic comparisons spanning several search engines, datasets, and database configurations remain limited. Here, we benchmarked seven publicly available search engines, evaluating standard target-decoy-based false discovery rate (FDR) estimation alongside Percolator, MS2Rescore, and Oktoberfest across four datasets acquired on different mass spectrometry platforms and searched against protein databases of varying size and composition. Rescoring substantially increased […]
- by Raghavan, B., Rogers, D. M. Controllable protein sequence generation remains a central challenge in computational protein design, as most existing approaches rely on retraining, classifier guidance, or architectural modification to impose conditioning. Here we introduce ProtNHF, a generative model that enables continuous, quantitative control over sequence-level properties through analytical bias functions applied exclusively at inference time. ProtNHF builds on neural Hamiltonian flows, where a lightweight Transformer-based potential energy function, inspired by ESM-2, is combined with an explicit kinetic term to define Hamiltonian dynamics in a […]
- by Li, S., Chou, E., Wang, K., Boyle, A. P., Sartor, M. A. Mapping the genomic locations and patterns of transcription factor binding sites (TFBS) is essential for understanding gene regulation and advancing treatments for diseases driven by DNA modifications, including epigenetic changes and sequence variants. Although several TFBS databases exist, no study has systematically benchmarked these databases across different sequencing technologies and computational algorithms. In this study, we addressed this gap by constructing a TFBS database that integrates all available ENCODE cell line ATAC-seq and Cistrome Data Browser ChIP-seq datasets, comprising 11.3 […]
- by Orlov, A. V., Makus, Y. V., Ashniev, G. A., Orlova, N. N., Nikitin, P. I. Foundation models trained on protein and DNA sequences are increasingly deployed for variant interpretation, drug design, and gene regulation prediction, yet their internal representations remain opaque, limiting both biological insight and trust in model-guided decisions. Existing interpretation approaches establish what these models encode but cannot reveal how biological knowledge is internally organized and computed. Sparse autoencoders (SAEs) offer a complementary approach by decomposing model activations into interpretable features, each capturing a distinct biological concept. Over the past year, SAEs have […]
- by Pancsa, R., Ficho, E., Kalman, Z. E., Gerdan, C., Remenyi, I., Zeke, A., Tusnady, G. E., Dobson, L. Short linear motifs (SLiMs) are small, often transient interaction modules within intrinsically disordered regions (IDRs) of proteins that interact with particular domains and thereby regulate numerous biological processes. The limited sequence information within these short peptides leads to frequent false positive hits in both computational and experimental SLiM identification methods. This makes the description of novel SLiMs challenging and has limited the number of known cases to a few thousand, even though SLiMs play widespread roles in cellular functions. We […]
- by Poursina, A., Hajhashemi, S., Mikaeili Namini, A., Saberi, A., Emad, A., Najafabadi, H. S. Inferring the governing dynamics of differentiation that capture cell state evolution remains a central challenge in single-cell biology. We present Latent Space Dynamics (LSD), a thermodynamics-inspired framework that models cell differentiation as evolution on a learned Waddington landscape in latent space. LSD jointly infers a low-dimensional cell state, a differentiable potential function governing developmental flow, and a local entropy term that quantifies cellular plasticity. Using a neural ordinary differential equation, LSD reconstructs continuous differentiation trajectories from time-ordered single-cell data. Across […]
- by Wong, D. R., Piper, M., Qiao, J., Russo, M., Jean, P., Clevert, D.-A., Arroyo, J., Pashos, E. The identification of genetic perturbations that can reverse disease-associated cellular phenotypes toward a healthy state is a central challenge in early drug discovery. We present a proof-of-concept framework leveraging single-cell foundation models (scFMs) and a large-scale Perturb-seq dataset to prioritize targets for phenotypic reversion of cellular inflammation. We incorporated both basal and proinflammatory signaling conditions, specifically stimulation with interleukin-1 beta (IL-1β) and tumor necrosis factor alpha (TNF-α), to assess whether atherosclerotic disease-relevant stimulation improved identification of genes and pathways critical […]
- by Daamen, A., Shrotri, S., Grammer, A., Lipsky, P. E. Atopic dermatitis (AD) is a chronic inflammatory skin disease characterized by immune dysregulation and barrier dysfunction. To define the molecular architecture of AD in greater detail, we integrated lesional (LES) and non-lesional (NLS) transcriptomic data from multiple datasets using gene expression data from normal skin and psoriasis (PSO) and nummular eczema (NME) cohorts as reference. Gene set variation analysis revealed that adult AD exhibits broad immune activation and consistent barrier impairment in both LES and NLS skin, whereas pediatric AD […]
- by Ai, C., Tan, L., Gao, S., Wang, Y. Pseudogenes are recognized as essential components for reconstructing adaptive evolutionary trajectories and understanding genomic remodeling. However, identifying these sequences in large eukaryotic genomes remains technically challenging due to fragmented workflows, complex manual configurations, and the lack of high-performance, parallelized tools capable of processing rapidly growing data volumes. We present EasyPseudogene, an automated and multithreaded pipeline designed for the end-to-end identification of pseudogenes across diverse eukaryotic lineages. Unlike traditional self-mapping tools that often fail to detect unitary pseudogenes when functional counterparts […]
- by Yan, J., Wu, Q., Li, Y., Cai, J., Zhou, M., Campbell-Valois, F.-X., Siu, S. W. Cancer remains a major global health threat, with its incidence and mortality rates consistently rising in recent years. Anticancer peptides (ACPs) are short amino acid chains that can inhibit the growth or spread of cancer cells. Compared to traditional treatments, ACPs are a promising class of potential cancer therapies due to their multiple mechanisms, potential for combination cancer therapy, enhanced immune function, lower toxicity to normal tissues, fewer side effects, and less drug resistance. Although it is necessary to explore […]
- by Stohn, T., van de Brug, N. D., Theodosiadou, A., Thijssen, B., Jastrzebski, K., Wessels, L. F. A., Bosdriesz, E. Single-cell sequencing technologies increasingly rely on complex nucleotide barcoding schemes to encode cellular identities, experimental conditions, and multiple molecular modalities within a single experiment. While demultiplexing, alignment, and UMI-based quantification form the core preprocessing steps that transform raw sequencing reads into analyzable single-cell data, existing pipelines are often tightly coupled to specific experimental designs and typically assume fixed barcode positions and substitution-only error models. As a result, many emerging assays employing combinatorial, variable-length, or multimodal barcoding designs require custom, hard-coded […]
- by Ivan, J., Lanfear, R. Many phylogenomic studies used non-overlapping windows to address gene tree discordance across a set of aligned genomes. Recently, Ivan et al. (2025) proposed an information theoretic approach to choose an optimal window size given the alignment. However, this approach selects only a single fixed window size per chromosome, which is a useful first step but fails to account for variation in the size of non-recombining regions along each chromosome. Such variation is expected to occur due to the stochastic nature […]
- by Melsted, P., Guthnyjarson, E. M., Nordal, J. We present a GPU implementation of kallisto for RNA-seq transcript quantification. By redesigning the core algorithms (pseudoalignment, equivalence class intersection, and the EM algorithm) for massively parallel execution on GPUs, we achieve a 30-50x speedup over multithreaded CPU kallisto. On a benchmark of 100 Geuvadis samples from human cell lines, the GPU version processes paired-end reads at a rate of 3.6 million per second, completing a typical sample in seconds rather than minutes. For a large dataset of 295 million […]
- by Li, K., Wang, W., Jiang, J., Deng, J., Zhang, J., Qiu, S., Zhang, W. Genomic language models (gLMs) hold great promise for deciphering biological sequences, yet their effectiveness is hindered by the limited number of experimentally verified examples available for model training, a ubiquitous bottleneck for supervised machine learning. To overcome this challenge, we developed circFormer, the first gLM-driven approach for circular RNA (circRNA) identification. circFormer integrates curriculum learning with gLM fine-tuning: a Nucleotide Transformer model is first trained on a small set of validated circRNAs, the resulting model is used as […]
- by Rodriguez, S., Alberca, L. N., Gavernet, L., Franchini, G. R., Talevi, A. Echinococcosis is a Neglected Tropical Disease (NTD) caused by Echinococcus granulosus and Echinococcus multilocularis, the etiological agents of cystic and alveolar echinococcosis, respectively. These infections pose a significant public health burden, particularly in endemic regions. Cestodes lack key enzymes involved in lipid metabolism and must acquire lipids from their hosts. Fatty Acid Binding Proteins (FABPs), which mediate lipid trafficking and intracellular transport, have therefore emerged as essential and potentially druggable targets. In this study, we implemented an integrated virtual screening […]
