  • by Strauch, J., Asiaee, A.
    The development of models to predict sensitivity to anticancer drugs is an area of significant interest, given the diverse responses to treatment among patients and the considerable expense and time involved in anticancer drug development. Leveraging "omic" data and anticancer response information from the Cancer Cell Line Encyclopedia, we propose a novel approach utilizing multitask learning to enhance prediction accuracy and inference. We extended a multitask learning framework called the Data Shared Lasso to develop the Data Shared Elastic Net. […]
  • by Zhang, L., Xu, K., Xu, Y., Wang, Z.
    Spatial transcriptomics (ST) data provide insights into gene expression patterns within tissue contexts, where identifying spatial domains with similar gene expression is crucial. Traditional clustering methods for spatial domain clustering often neglect spatial continuity, resulting in disjointed domains. Although recent computational approaches have integrated spatial information, they face limitations in recognizing domain boundaries, scalability, and the need for independent clustering steps. Here, we introduce stDyer, a novel end-to-end deep learning framework for spatial domain clustering in ST data. stDyer utilizes a […]
  • by Talevich, E., Tseng, E., Diallo, A., Sellami, N., Elliott, A., Cantarel, B. L., Tonthat, N., Chatterjee, P., Tai, P. W. L., Aldridge, C.
    Despite recombinant adeno-associated viruses (rAAVs) being the leading platform for gene therapy, there is a lack of standardized computational analysis methods and reporting to assess the contents of each capsid through long-read sequencing. PacBio's highly accurate long-read HiFi sequencing enables comprehensive characterization of AAV genomes but requires bioinformatics expertise for analyzing, interpreting and comparing the results. To address this need and improve the understanding of functional viral payloads, our working group established standardized nomenclature and reporting for long-read sequencing data […]
  • by Prakasham, R. S., Anumalla, M., Batchu, U. R., Bhukya, B.
    Delftia tsuruhatensis IICT-RSP4, a uricase-producing bacterium, was isolated from soil using the i-chip method and characterized. Here, we report the draft genome sequence of D. tsuruhatensis IICT-RSP4. The genome comprised 6,627,718 bp (6.6 Mb), with a GC content of 66.6%, 7 protein-encoding genes, 346 subsystems with 6165 coding sequences, and 112 RNAs. The genome revealed five functional secondary metabolite biosynthetic gene clusters, viz. terpene, resorcinol, NRP+PKS, T2PKS, and RiPPS, related to antimicrobial, anticancer and antimalarial functionality. In […]
  • by Farheen, F., Broyles, B. K., Zhang, Y., Ibtehaz, N., Erkine, A. M., Kihara, D.
    Analysis of the factors that lead to the functionality of transcriptional activation domains remains a crucial yet challenging task, owing to the significant diversity in their sequences and their intrinsically disordered nature. Almost all existing methods that have aimed to predict activation domains have involved either traditional machine learning approaches, such as logistic regression, which are unable to capture complex patterns in data, or plain convolutional neural networks, and have been limited in their exploration of structural features. However, there is a […]
  • by Billera, L., Oresten, A., Stalmarck, A., Sato, K., Kaduk, M., Murrell, B.
    Just as language is composed of sublexical tokens that combine to form words, sentences, and paragraphs, protein backbones are composed of sub-structural elements that combine to form helices, sheets, folds, domains, and chains. Autoregressive language models operate on discrete tokens, whereas protein structure is inherently continuous, and generative approaches to protein design have borrowed more from image generation than language modeling. But autoregressive models do not inherently require their inputs and outputs to be discrete. Here we describe a generative […]
  • by Braunger, J. M., Velten, B.
    Pooled single cell CRISPR screens have emerged as a powerful tool in functional genomics to probe the effect of genetic interventions at scale. A crucial step in the analysis of the resulting data is the assignment of cells to gRNAs corresponding to a specific genetic intervention. However, this step is challenging due to a lack of systematic benchmarks and accessible software to apply and compare different guide assignment strategies. To address this, we here propose crispat (CRISPR guide assignment tool), […]
  • by Gui, Y., Li, C., Xu, Y.
    Spatial transcriptomics (ST) technologies have emerged as an effective tool to identify the spatial architecture of the tissue, facilitating a comprehensive understanding of organ function and tissue microenvironment. Spatial domain identification is the first and most critical step in ST data analysis, which requires thoughtful utilization of tissue microenvironment and morphological priors. To this end, we propose a graph contrastive learning framework, GRAS4T, which combines contrastive learning and a subspace module to accurately distinguish different spatial domains by capturing tissue microenvironment […]
  • by Williams, C. M., O'Connell, J., Freyman, W. A., 23andMe Research Team, Gignoux, C. R., Ramachandran, S., Williams, A. L.
    Haplotype phasing, the process of determining which genetic variants are physically located on the same chromosome, is crucial for various genetic analyses. In this study, we first benchmark SHAPEIT and Beagle, two state-of-the-art phasing methods, on two large datasets: >8 million diverse, research-consented 23andMe, Inc. customers and the UK Biobank (UKB). We find that both perform exceptionally well. Beagle's median switch error rate (SER) (after excluding single SNP switches) in white British trios from UKB is 0.026% compared to 0.00% […]
  • by Hong, X., Zhan, J., Zhou, Y.
    Success in protein structure prediction by the deep learning method AlphaFold 2 naturally gives rise to the question of whether we can do the same for RNA structure prediction. One reason for the success in protein structure prediction is that the structural space of proteins at the fragment level has been nearly complete for many years. Here, we examined the completeness of RNA fragment structural space at dimeric, trimeric, tetrameric, and pentameric levels. We showed that the RNA structural space is not […]
  • by Bhowmik, R., Manaithiya, A., Parkkinen, J., Kumar, S., Mathew, B., Parikka, M., Carta, F., Supuran, C. T., Parkkila, S., Aspatwar, A.
    Mycobacterium tuberculosis (Mtb) β-carbonic anhydrases (β-CAs) are crucial enzymes responsible for regulating pH by catalyzing the conversion of CO2 to HCO3-, which is essential for its survival in acidic environments in the host. By inhibiting Mtb β-CAs, we can potentially discover new targets for anti-tuberculosis drugs with a different mechanism of action than existing FDA-approved drugs. This is crucial since Mtb has demonstrated the ability to develop different degrees of resistance to current drugs over time. This study employed machine […]
  • by Xiang, W., Xiong, Z., Huan, C., Xiong, J., Zhang, W., Fu, Z., Zheng, M., Liu, B., Shi, Q.
    Assigning appropriate property labels, such as functional terms and catalytic activity, to proteins remains a significant challenge, particularly for non-homologous ones. In contrast to prior approaches that mostly focused on protein sequence features, we employ a pretrained protein language model to encode the sequence features, and a natural language model for the semantic information of property descriptions. Specifically, we present FAPM, a contrastive model between natural language and protein sequence language, which combines the pretrained protein sequence model with the pretrained […]
  • by Li, Y., Chen, E., Xu, J., Zhang, W., Zeng, X., Liu, Y., Luo, X.
    Error self-correction is a pivotal first step in the analysis of long-read sequencing data. However, most existing methods for this purpose are primarily tailored for noisy sequencing data with error rates exceeding 5%, often collapsing true variants in repeats and haplotypes. Alternatively, some methods are heavily optimized for PacBio HiFi reads, leaving a gap in methods specifically designed for Nanopore R10 reads basecalled with high accuracy or super accuracy models, which typically have error rates below 2%. Here, we introduce […]
  • by Hirota, K., Salim, F., Yamada, T.
    Progress in sequencing technology has led to the determination of large numbers of protein sequences, and large enzyme databases are now available. Although many computational tools for enzyme annotation have been developed, sequence information is unavailable for many enzymes, known as orphan enzymes. These orphan enzymes hinder sequence similarity-based functional annotation, leading to gaps in understanding the association between sequences and enzymatic reactions. Therefore, we developed DeepES, a deep learning-based tool for enzyme screening to identify orphan enzyme genes, focusing on biosynthetic gene […]
  • by Sun, E., Zhao, E., Li, Q., Lu, W., Li, Y., Yang, C., Chen, T., Mou, Z., Zhao, D.
    Orchids are horticultural plants with high ornamental and medicinal value. N-acetylserotonin deacetylase (ASDAC) is the only reverse enzyme of the melatonin biosynthesis pathway and plays an important role in regulating the balance of melatonin. Melatonin, as a multifunctional molecule, is typically involved in the regulation of plant growth and development, as well as in abiotic stress tolerance. Here, we aimed to identify ASDAC genes from the orchid genome to provide valuable information for further study of the role of melatonin […]
  • by Tadros, D. M., Racle, J., Gfeller, D.
    CD8+ T-cell activation is initiated by the recognition of epitopes presented on class I major histocompatibility complex (MHC-I) molecules. Identifying such epitopes is useful for molecular understanding of cellular immune responses and can guide the development of personalized vaccines for various diseases including cancer. Here, we capitalize on high-quality MHC-I peptidomics data available from different species and an expanded architecture of our MHC-I ligand predictor (MixMHCpred) to carefully explore how much predictions can be extrapolated to MHC-I alleles without known […]
  • by McArthur, R. N., Zehmakan, A. N., Charleston, M. A., Huttley, G. A.
    The algorithms for phylogenetic reconstruction are central to computational molecular evolution. The relentless pace of data acquisition has exposed their poor scalability and led to the conclusion that the conventional application of these methods is impractical and not justifiable from an energy usage perspective. Furthermore, the drive to improve the statistical performance of phylogenetic methods produces increasingly parameter-rich models of sequence evolution, which worsens the computational performance. Established theoretical and algorithmic results identify supertree methods as critical to divide-and-conquer strategies for improving […]
  • by Choi, H., Park, J., Kim, S., Kim, J., Lee, D., Bae, S., Shin, H., Lee, D.
    Large-scale single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) have transformed biomedical research into a data-driven field, enabling the creation of comprehensive data atlases. These methodologies facilitate detailed understanding of biology and pathophysiology, aiding in the discovery of new therapeutic targets. However, the complexity and sheer volume of data from these technologies present analytical challenges, particularly in robust cell typing, integration and understanding complex spatial relationships of cells. To address these challenges, we developed CELLama (Cell Embedding Leverage Language Model […]
  • by Ma, X., Jia, R., Wang, Y., Ye, H.
    Background: Gene synthesis sequencing using long-read Oxford Nanopore Technologies (ONT) sequencing provides a cost-effective option for gene synthesis quality control. Despite the advantage of using long reads, however, accurate base calling is influenced by modified bases. Results: We introduce a method for filtering abnormal modified base calling in Oxford Nanopore Technologies sequencing. This method is based on the mapping results and performs an exact binomial test on the proportion of single-base forward and reverse strand depth to determine the presence of […]
  • by Jun, S.-H., McCall, M.
    MicroRNAs play a central role in regulating gene expression and modulating diseases. Despite the importance of microRNAs, statistical methods for analyzing them have received far less attention than those for messenger RNAs. In fact, it is common practice to apply the methods developed for messenger RNA-seq data to analyze microRNA-seq data. This study critically examines and challenges the assumptions of messenger RNA-based methods when applied to microRNAs, highlighting the competitive nature of microRNA expression. We propose a […]
