- by Zheng, A., Lee, I., Shivakumar, V. S., Ahmed, O. S., Langmead, B. Minimizer digestion is an increasingly common component of bioinformatics tools, including tools for de Bruijn graph assembly and sequence classification. We describe a new open source tool and library to facilitate efficient digestion of genomic sequences. It can produce digests based on the related ideas of minimizers, modimizers or syncmers. Digest uses efficient data structures, scales well to many threads, and produces digests with expected spacings between digested elements. Digest is implemented in C++17 with a Python API, and is available […]
- by Guan, C., Wan, F., Torres, M. D. T., de la Fuente-Nunez, C. A variety of deep generative models have been adopted to perform de novo functional protein generation. Compared to 3D protein design, sequence-based generation methods, which aim to generate amino acid sequences with desired functions, remain a major approach for functional protein generation due to the abundance and quality of protein sequence data, as well as the relatively low modeling complexity for training. Although these models are typically trained to match protein sequences from the training data, exact matching of every […]
- by Sreekumar, S., J.L, B. J., S.S, L., Pani, S. C., Rajee, N. E., Stephanie, A. P., Raphael, T., Dhanasingh, I. Tuberculosis, caused by Mycobacterium tuberculosis, ranks second globally in terms of infectious disease-related deaths, after HIV. The resistance of M. tuberculosis to existing beta-lactam antibiotics is primarily due to the chromosomally encoded gene blaC, whose product can hydrolyse the predominantly available beta-lactam antibiotics. Although the available beta-lactamase inhibitors with a beta-lactam ring, such as clavulanate, are efficacious, they lead the bacteria to develop an inhibitor escape mechanism. In contrast, the natural product inhibitors without a beta-lactam ring that might resist bacterial escape mechanisms have the […]
- by Wong, D. R., Hill, A., Moccia, R. Modeling genetic perturbations and their effect on the transcriptome is a key area of pharmaceutical research. Due to the complexity of the transcriptome, there has been much excitement and development in deep learning (DL) because of its ability to model complex relationships. In particular, the transformer-based foundation model paradigm has emerged as the gold standard for predicting post-perturbation responses. However, understanding of these increasingly complex models and evaluation of their practical utility are lacking, as are simple but appropriate benchmarks to compare predictive methods. […]
- by Cheng, W., Song, Z., Zhang, Y., Wang, S., Wang, D., Yang, M., Li, L., Ma, J. Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts. However, effectively capturing these extensive dependencies, which may span millions of base pairs in tasks such as three-dimensional (3D) chromatin folding prediction, remains a significant challenge. Furthermore, a comprehensive benchmark suite for evaluating tasks that rely on long-range dependencies is notably absent. To address this gap, we introduce DNALONGBENCH, a benchmark dataset encompassing five important genomics tasks that consider long-range dependencies […]
- by Das, G., Das, T., Ghosh, Z. Long non-coding RNA (lncRNA)-Protein Interaction (LPI) across diverse biological systems directly and indirectly regulates various cellular processes. Experimental assays to recognize the protein binding partners of lncRNAs are highly time-consuming and expensive. In silico predictive approaches involving pattern recognition techniques provide a promising alternative by reducing the search space. Our work identifies such hidden patterns within the Cross-linking immunoprecipitation sequencing (CLIP-Seq) data, which helps to overcome the problem of obtaining a real negative dataset and thus offers a […]
- by Ranek, J. S., Greenwald, N. F., Goldston, M., Fullaway, C. C., Sowers, C., Kong, A., Mouron, S., Quintela-Fandino, M., West, R. B., Angelo, M. While recent innovations in spatial biology have driven new insights into how tissue organization is altered in disease, interpreting these datasets in a generalized and scalable fashion remains a challenge. Computational workflows for discovering condition-specific differences in tissue organization typically rely on pairwise comparisons or unsupervised clustering. In many cases, these approaches are computationally expensive, lack statistical rigor, and are insensitive to low-prevalence cellular niches that are nevertheless highly discriminative and predictive of patient outcomes. Here, we present QUICHE – […]
- by Gamache-Poirier, S., Souvane, A., Leclerc, W., Villeneuve, C., Hardy, S. V. A common depiction for biological signaling networks is the influence graph, in which the activation and inhibition effects between molecular species are shown with vertices and arcs connecting them. Another formalism for reaction-based models is the Petri net, which has a graphical representation and a mathematical notation that enable structural analysis and quantitative simulation. In this paper, we present an algorithm based on Petri net topological features for the transformation of the computational model of a biological signaling network into […]
- by Raciti, D., Van Auken, K., Arnaboldi, V., Tabone, C. J., Muller, H.-M., Sternberg, P. W. Biological knowledgebases are essential resources for biomedical researchers, providing ready access to gene function and genomic data. Professional, manual curation of knowledgebases, however, is labor-intensive, and thus high-performing machine learning methods that improve biocuration efficiency are needed. Here we report on sentence-level classification to identify biocuration-relevant sentences in the full text of published references for two gene function data types: gene expression and protein kinase activity. We performed a detailed characterization of sentences from references in the WormBase bibliography and […]
- by Haddox, H. K., Angehrn, G., Sesta, L., Jennings-Shaffer, C., Temple, S. D., Galloway, J. G., DeWitt, W. S., Bloom, J. D., Matsen, F. A., Neher, R. A. RNA viruses like SARS-CoV-2 have a high mutation rate, which contributes to their rapid evolution. The rate of mutations depends on the mutation type (e.g., A-to-C, A-to-G, etc.) and can vary between sites in the viral genome. Understanding this variation can shed light on the mutational processes at play, and is crucial for quantitative modeling of viral evolution. Using the millions of available SARS-CoV-2 full-genome sequences, we estimate rates of synonymous mutations for all 12 possible nucleotide mutation types and […]
- by Xia, F., Verbiest, M., Lundstrom, O., Sonay, T. B., Baudis, M., Anisimova, M. Short tandem repeats (STRs) have been reported to influence gene expression across various human tissues. While STR variations are enriched in colorectal (CRC), stomach (STAD), and endometrial (UCEC) cancers, particularly in microsatellite instable (MSI) tumors, their functional effects and regulatory mechanisms on gene expression remain poorly understood across these cancer types. Here, we leverage whole-exome sequencing and gene expression data to identify STRs for which repeat lengths are associated with the expression of nearby genes (eSTRs) in CRC, STAD and […]
- by Sun, C., Sun, Y., Xu, K., He, Z., Li, H., Li, Y., Yu, Z., Wang, Y., Lin, X., Xu, X., Hu, P., Bo, X., Liao, M., Chen, H. The development of sequence-based deep learning methods has greatly increased our understanding of how sequence determines function. In parallel, numerous interpretable algorithms have been developed to address complex tasks, such as elucidating sequence regulatory syntax and analyzing non-coding variants from trained models. However, few studies have systematically compared and evaluated the performance and interpretability of these algorithms. Here, we introduce a comprehensive benchmark framework for evaluating sequence-to-function models. We systematically evaluated multiple models and DNA language foundation models using 369 […]
- by Unger, M., Loeffler, C. M. L., Zigutyte, L., Sainath, S., Lenz, T., Vibert, J., Mock, A., Froehling, S., Graham, T. A., Carrero, Z. I., Kather, J. N. Background: Genomic data is essential for clinical decision-making in precision oncology. Bioinformatic algorithms are widely used to analyze next-generation sequencing (NGS) data, but they face two major challenges. First, these pipelines are highly complex, involving multiple steps and the integration of various tools. Second, they generate features that are human-interpretable but often result in information loss by focusing only on predefined genetic properties. This limitation restricts the full potential of NGS data in biomarker extraction and slows the discovery of […]
- by Wang, J., Yang, M., Zong, C., Verkhivker, G., Xiao, F., Hu, G. Motivation: Recent advances in human genomics have revealed that missense mutations in a single protein can lead to distinctly different phenotypes. In particular, some mutations in oncoproteins like Ras, MEK, PI3K, PTEN, and SHP2 are linked to various cancers and neurodevelopmental disorders (NDDs). While numerous tools exist for predicting the pathogenicity of missense mutations, linking these variants to certain phenotypes remains a major challenge, particularly in the context of personalized medicine. Results: To fill this gap, we developed protPheMut (Protein Phenotypic Mutations Analyzer), […]
- by Ghazi, A. R., Thompson, K. N., Bhosle, A., Mei, Z., Yan, Y., Wang, F., Wang, K., Franzosa, E. A., Huttenhower, C. Genetic and genomic variation among microbial strains can dramatically influence their phenotypes and environmental impact, including on human health. However, inferential methods for quantifying these differences have been lacking. Strain-level metagenomic profiling data has several features that make traditional statistical methods challenging to use, including high dimensionality, extreme variation among samples, and complex phylogenetic relatedness. We present Anpan, a set of quantitative methods addressing three key challenges in microbiome strain epidemiology. First, adaptive filtering designed to interrogate microbial strain gene […]
- by Zhou, X., Han, C., Zhang, Y., Su, J., Zhuang, K., Jiang, S., Yuan, Z., Zheng, W., Dai, F., Zhou, Y., Tao, Y., Wu, D., Yuan, F. Proteins, nature's intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language – that is, understanding how protein sequences and structures encode and determine biological functions – remains a cornerstone challenge in modern biology. Here, we introduce Evola, an 80 billion frontier protein-language generative model designed to decode the molecular language of proteins. By integrating information from protein sequences, structures, and user queries, Evola generates precise […]
- by Labun, K., Rio, O., Tjeldnes, H., Swirski, M., Komisarczuk, A. Z., Haapaniemi, E. M., Valen, E. CRISPR/Cas is a revolutionary technology for genome editing. Although hailed as a potential cure for a wide range of genetic disorders, CRISPR/Cas translation faces severe challenges due to unintended off-target editing. Predicting these off-targets is difficult and necessitates trade-offs between speed and sensitivity. Here, we develop the original concept of symbolic alignments to efficiently identify off-targets without sacrificing sensitivity. We also introduce data structures that allow near-instant alignment-free probabilistic ranking of guides based on their off-target counts. Implemented in the […]
- by May, M., Hewitt, T., Mashford, B., Hammill, D., Davies, A., Andrews, T. D. Precision medicine requires a comprehensive mapping of genotype to phenotype to provide patients with individually tailored treatment. However, when using flow cytometry to identify phenotypes, such as the quantity of various immune cell populations in tissue and blood used to identify autoimmune disorders, it is often unclear which cellular phenotypes are from healthy and which are from diseased individuals, especially when including the effects of population diversity, due to the high-dimensional nature of the data. To identify and segregate healthy phenotype from various […]
- by Ding, J., Lin, J., Jiang, S., Wang, Y., Mao, Z., Fang, Z., Tang, J., Li, M., Qiu, X. The ability to pre-train on vast amounts of data to build foundation models (FMs) has achieved remarkable success in numerous domains, including natural language processing, computer vision, and, more recently, single-cell genomics, epitomized by GeneFormer, scGPT, and scFoundation. However, as single-cell FMs begin to train on increasingly large corpora, significant privacy and ethical concerns arise. Moreover, unlike text data, single-cell data is unordered and exhibits a unique tabular structure that most existing single-cell FMs overlook. In this study, we propose Tabula, […]
- by Honcharuk, V., Zainab, A., Horimoto, Y., Takemoto, K., Diez, D., Kawaoka, S., Vandenbon, A. Spatial transcriptomics provides a revolutionary approach to mapping gene expression within tissues, offering critical insights into the spatial organization of cellular and molecular processes. However, generating new spatial transcriptomics data is expensive and technically demanding, and analyzing such data requires advanced bioinformatics expertise. While publicly available datasets are growing rapidly, existing databases offer limited tools for interactive exploration and cross-sample comparisons. Here, we introduce DeepSpaceDB, a next-generation spatial transcriptomics database designed to address these issues. DeepSpaceDB focuses on interactivity and […]