- by Huber, F., Pollmann, J. Quantifying molecular similarity is a cornerstone of cheminformatics, underpinning applications from virtual screening to chemical space visualization. A wide range of molecular fingerprints and similarity metrics, most notably Tanimoto scores, are employed, but their effectiveness is highly context-dependent. In this study, we systematically evaluate several 2D fingerprint types, including circular, path-based, and distance-encoded variants, using both binary and count representations. We highlight the consequences of fingerprint choice, vector folding, and similarity metric selection, revealing critical issues such as fingerprint duplication, […]
- by Chen, D. G., Su, Y., Heath, J. R. T cells interact with the world through T cell receptors (TCRs). The extent to which TCRs determine T cell behavior has not been comprehensively characterized. Our Tarpon model leverages advances in generative artificial intelligence to synthesize large-scale (>1M sequences) TCR atlases across human development and diseases into actionable insights. Tarpon creates: 1) bespoke sampling functions generating realistic Ag-specific TCRs, 2) embeddings revealing CD4+ and CD8+ single-positive TCR repertoires as distinct with divergent physicochemical properties, and 3) cross-dataset mappings of T […]
- by Kuchi, A. S., Acitores Cortina, J. M., Liu, H., Fatapour, Y., Berkowitz, J., Tatonetti, N. Drug-induced acute kidney injury (AKI) affects about 20% of hospitalized AKI patients and is a significant contributor to morbidity and mortality. The lack of understanding of the kidney system and functioning of nephrotoxic drugs contributes to hospital-acquired AKI cases. AKI is difficult to predict because of its complex injury mechanism and the numerous pathways through which it manifests. Traditional toxicity biomarkers, like elevated creatinine levels, detect AKI only after significant kidney injury has occurred. Concurrently, advancements in single cell RNA sequencing (scRNAseq) […]
- by Ballard, J. L., Dai, Z., Shen, L., Long, Q. Integrative analysis of multi-omics data provides a more comprehensive and nuanced view of a subject's biological state. However, high-dimensionality and ubiquitous modality missingness present significant analytical challenges. Existing methods for incomplete multi-omics data are scarce, do not fully leverage both modality-specific and shared information, and produce task-biased representations. We propose JASMINE, a self-supervised representation learning method for incomplete multi-omics data that preserves both modality-specific and joint information and enhances sample similarity structure. JASMINE produces embeddings that achieve superior performance across […]
- Balancing Speed and Precision in Protein Folding: A Comparison of AlphaFold2, ESMFold, and OmegaFold by Hyskova, A., Marsalkova, E., Simecek, P. We compared the performance of three widely used protein structure prediction tools – AlphaFold2, ESMFold, and OmegaFold – using a dataset of over 1,300 newly created records from the PDB database. These structures, resolved between July 2022 and July 2024, ensure unbiased evaluation, as they were unavailable during the training of these tools. Using metrics such as root mean square deviation (RMSD), template modeling score (TM-score), and predicted local distance difference test (pLDDT), we found that AlphaFold2 consistently achieves the […]
- by Celis Garza, D. C., Ellaway, J. I. J., Appasamy, S. D., Krissinel, E., Velankar, S. We present FunCLAN, a novel method for finding, categorising, and measuring conformational relationships between protein complexes, as well as for finding corresponding chain pairs between them. Its purpose is to provide a general tool, applicable to a broad range of complexes. As such, it only utilises sequence similarity and superposition transformations to compare rigid bodies (chains), and how they transform in relation to each other. This lets us cluster transformations, which lets us identify sets of similar and dissimilar transformations […]
- by Zalewski, M., Badaczewska-Dawid, A. E., Kmiecik, S. Cyclic peptides are promising therapeutics, but their flexible docking remains challenging. We present a protocol based on the well-established CABS-dock method, enhanced with cyclic restraints and Rosetta refinement. The approach was evaluated on 38 benchmark complexes previously used in other docking method studies. While selecting the truly best model remains difficult, near-native solutions are frequently sampled. CABS-dock offers global, unbiased docking without prior binding site knowledge, making it valuable for pose generation, structural ensemble modeling, and integration into AI-driven peptide-protein […]
- by Schnapka, V., Morozova, T., Sen, S., Bonomi, M. Intrinsically disordered proteins are ubiquitous in biological systems and play essential roles in a wide range of biological processes and diseases. Despite recent advances in high-resolution structural biology techniques and breakthroughs in deep learning-based protein structure prediction, accurately determining structural ensembles of IDPs at atomic resolution remains a major challenge. Here we introduce bAIes, a Bayesian framework that integrates AlphaFold2 predictions with physico-chemical molecular mechanics force fields to generate accurate atomic-resolution ensembles of IDPs. We show that bAIes produces structural […]
- by Bibekar, P., Krapp, L. F., Dal Peraro, M. RNA design has come to play a crucial role in synthetic biology and therapeutics. Although tertiary structure-based RNA design methods have been developed recently, they still overlook the broader molecular context, such as interactions with proteins, ligands, DNA, or ions, limiting the accuracy and functionality of designed sequences. To address this challenge, we present RISoTTo (RIbonucleic acid Sequence design from TerTiary structure), a parameter-free geometric deep learning approach that generates RNA sequences conditioned on both their backbone scaffolds and the […]
- by Mukundan, S., Madhusudhan, M. S. Golden gate cloning is a popular method for achieving ordered assembly of DNA fragments using complementary four-base pair overhangs. The success of this method relies on the ability of DNA ligases to ligate the different fragments. This ability is affected by the relative efficiency and accuracy with which the ligase recognizes and joins overhanging regions. In this study, we report a dynamic programming approach, called Overhang Optimizer for Golden Gate Assembly (OOGGA), that optimizes these overhangs for their accuracy and […]
- by Gao, Y., He, D., Patro, R. Motivation: Single-cell sequencing data analysis requires robust quality control (QC) to mitigate technical artifacts and ensure reliable downstream results. While tools like alevin-fry and simpleaf (an augmented execution context for alevin-fry) offer flexibility and computational efficiency to process single-cell data, this ecosystem would further benefit from standardized QC reporting tailored for its outputs. Results: We introduce QCatch, a Python-based command-line tool that generates comprehensive and interactive HTML QC reports designed specifically for single-cell quantification results. Taking the output […]
- by Cossa, A., Dalmasso, A., Campani, G., Bugani, E., Caprioli, C., Bulla, N., Tirelli, A., Zhan, Y., Pelicci, P. G. Mitochondrial single-cell lineage tracing (MT-scLT) has recently emerged as a scalable and non-invasive tool to trace somatic cell lineages. However, the reliability and resolution of MT-scLT remain highly debated. Here, we present MiTo, the first end-to-end framework for robust MT-scLT data analysis. Thanks to highly optimized algorithms and user-friendly interfaces, this modular toolkit offers unprecedented control across the entire MT-scLT workflow. Benchmarked against novel real-world datasets (375-2,757 cells; 8-216 lentiviral clones), MiTo outperformed state-of-the-art methods and baselines in MT-scLT data pre-processing […]
- How F repeats help in peptide binding with VIM-2 metallo-beta-lactamase and destabilizing the enzyme by Anand, A. A., Anwar, S., Mondal, R. K., Samanta, S. K. The global surge in antimicrobial resistance is a major public health concern, largely driven by the dissemination of beta-lactamases. Among them, VIM-2 metallo-beta-lactamase poses a significant therapeutic challenge. In this study, we explore the role of phenylalanine and lysine residues, especially their repeat patterns, in modulating peptide binding affinity and structural interactions with VIM-2. Docking, molecular dynamics simulations, free energy decomposition analysis and residue analysis were conducted on gut metagenome-derived AMPs, PolyF and a control peptide (PolyR). Binding affinity with […]
- by Li, S., Tan, B., Ouyang, S., Ling, Z., Huo, M., Shen, T., Wang, J., Feng, X. In bioinformatics, Sankey diagrams have been widely used to elucidate complex biological insights by visualizing gene expression patterns, microbial community dynamics, and cellular interactions. However, computational scalability remains a challenge for large-scale biological networks. In this work, we present OmicsSankey, a novel formulation of the layout optimization problem for Sankey diagrams that employs eigendecomposition as a closed-form solution, addressing graph disconnection through a teleportation mechanism that enhances connectivity and stabilizes eigenvector solutions. Experimental results on synthetic datasets with varying […]
- by Goutham, B., Narayanan, M. Musification, the process of converting data into music, has found applications in scientific research as a novel medium for data representation and communication. However, in biology, its application has largely been limited to DNA and protein sequences or structural data. Existing approaches that attempt to musify gene expression data often do so at the level of the entire gene set or its low-dimensional representations, lacking finer biological context. In our work, we bridge this gap by representing time-series gene expression […]
- by Rieger, W. J., Boden, M., Arnold, F. H., Mora, A. Enzymes present a sustainable alternative to traditional chemical industries, drug synthesis, and bioremediation applications. Because catalytic residues are the key amino acids that drive enzyme function, their accurate prediction facilitates enzyme function prediction. Sequence similarity-based approaches such as BLAST are fast but require previously annotated homologs. Machine learning (ML) approaches aim to overcome this limitation; however, current gold-standard ML-based methods require high-quality 3D structures, limiting their application to large datasets. To address these challenges, we developed Squidly, a sequence-only […]
- by Gold, M. P., Reyes, M., Diamant, N., Kuo, T., Hajiramezanali, E., Newburger, J. W., Son, M. B. F., Lee, P. Y., Scalia, G., BenTaieb, A., Kapadia, S. B., Tripathi, A., Corrada Bravo, H., Heimberg, G., Biancalani, T. Determining a gene's functional significance within a specific cellular context has long been a challenge. We introduce a framework for quantifying gene importance by leveraging attributions learned by foundation models (FMs) trained on large corpora of single-cell RNA-sequencing (scRNA-seq) datasets. Attribution scores robustly quantify gene importance across datasets, emphasizing key genes in relevant cell populations, while minimizing technical artifacts. Therefore, we developed SIGnature, a tool that enables rapid search of gene signatures across multiple scRNA-seq atlases. We demonstrated its utility […]
- by Anberbir, T., Bankole, F., Mamo, G., Blasch, G., Dabi, A. Conventional plant phenotyping relies on visual scoring and manual measurements, which are labor-intensive, time-consuming, and prone to human error. To address these limitations, Unmanned Aircraft Systems (UASs) are increasingly being applied in breeding trials to capture various phenotypic traits. High Throughput Phenotyping (HTP) offers enhanced speed, accuracy, and efficiency, while potentially reducing costs in plant breeding programs. This study explores UAS-based phenotyping in wheat breeding trials with the aim of integrating HTP platforms across breeding pipelines. UAS images were acquired […]
- by Pan, Y.-F., He, Y., Liu, Y.-Q., Shan, Y.-T., Liu, S.-N., Liu, X., Pan, X., Bai, Y., Xu, Z., Wang, Z., Ye, J., Holmes, E. C., Li, B., Chen, Y.-Q., Li, Z.-R., Shi, M. Predicting the evolution and function of viruses is a fundamental biological challenge, largely due to high levels of sequence divergence and the limited knowledge available in comparison to cellular organisms. To address this, we present LucaVirus, a unified, multi-modal foundation model specifically designed for viruses. Trained on 25.4 billion nucleotide and amino acid tokens encompassing nearly all known viruses, LucaVirus learns biologically meaningful representations that capture the relationships between nucleotide and amino acid sequences, protein/gene homology, and evolutionary divergence. Building […]
- by Kawakami, T., Hosokawa, S., Masamichi, I., Kurozumi, A., Tanaka, R., Minatsuki, S., Ishida, J., Isagawa, T., Kodera, S., Takeda, N. Single-cell RNA sequencing (scRNA-seq) of patient samples holds promise for understanding disease mechanisms, but faces the challenge of excessive cost and effort in acquisition, processing, and data analysis, making it essential to leverage existing data. Pulmonary artery hypertension (PAH) is a refractory disease characterized by pulmonary vascular remodeling, and access to patient specimens is limited due to difficulties in tissue collection. In this study, we employed transfer learning with Geneformer, a deep learning algorithm pre-trained with scRNA-seq datasets and fine-tuned […]
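
For readers unfamiliar with the terms in the Huber and Pollmann entry at the top of this list, the sketch below illustrates Tanimoto similarity on binary and count fingerprints, and how folding a sparse fingerprint into a short vector can make two different molecules look identical. It is not taken from that study: the feature indices, the `fold` helper, the 1024-bit length, and the convention that two empty fingerprints score 1.0 are all assumptions chosen for illustration.

```python
from collections import Counter


def tanimoto_binary(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) score on sets of 'on' bits: |A & B| / |A | B|."""
    union = len(a | b)
    # Assumption: two empty fingerprints are treated as identical.
    return len(a & b) / union if union else 1.0


def tanimoto_count(a: Counter, b: Counter) -> float:
    """Count-based generalisation: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    numerator = sum(min(a[k], b[k]) for k in keys)
    denominator = sum(max(a[k], b[k]) for k in keys)
    return numerator / denominator if denominator else 1.0


def fold(bits: set[int], n_bits: int) -> set[int]:
    """Fold sparse feature indices into a fixed-length bit vector (hypothetical helper)."""
    return {i % n_bits for i in bits}


# Hypothetical feature index -> occurrence count for two toy "molecules".
mol_a = Counter({3: 2, 17: 1, 1041: 1})
mol_b = Counter({3: 1, 17: 1, 2065: 1})

binary_score = tanimoto_binary(set(mol_a), set(mol_b))
count_score = tanimoto_count(mol_a, mol_b)
folded_score = tanimoto_binary(fold(set(mol_a), 1024), fold(set(mol_b), 1024))

print(binary_score, count_score, folded_score)
```

In this toy case the unfolded binary score is 0.5 and the count-based score is lower, because feature 3 occurs a different number of times in each molecule. After folding to 1024 bits, features 1041 and 2065 both collide with bit 17, so the two fingerprints become identical even though the underlying feature sets differ.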