cubar

Comprehensive Codon Usage Bias Analysis in R

Overview

Codon usage bias refers to the non-uniform usage of synonymous codons (codons that encode the same amino acid) across different organisms, genes, and functional categories. cubar is a comprehensive R package for analyzing codon usage bias in coding sequences. It provides a unified framework for calculating established codon usage metrics, conducting sliding-window analyses or differential usage analyses, and optimizing sequences for heterologous expression.

Features

🧬 Codon-Level Analysis

  • RSCU calculation: Relative synonymous codon usage analysis
  • Amino acid usage: Frequency of each amino acid in sequences
  • Codon weights: Calculate weights based on gene expression, tRNA availability, and mRNA stability
  • Optimal codon inference: Machine learning-based identification of optimal codons
  • Codon-anticodon visualization: Visualization of codon-tRNA pairing relationships
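
As a rough illustration of what RSCU measures, the base-R sketch below computes it by hand for the alanine codon family in a made-up sequence. It does not call cubar: the toy sequence and all variable names are assumptions for illustration, and the package's own RSCU routine may handle details such as pseudocounts differently.

```r
# Hypothetical toy CDS, written codon by codon for readability
cds <- paste0(c("ATG", "GCT", "GCT", "GCC", "GCA", "AAA", "AAG", "TAA"),
              collapse = "")

# Split the sequence into consecutive codons
starts <- seq(1, nchar(cds), by = 3)
codons <- substring(cds, starts, starts + 2)

# Synonymous codons for alanine in the standard genetic code
ala <- c("GCT", "GCC", "GCA", "GCG")

# Count each alanine codon, keeping unobserved codons as zero
ala_counts <- as.vector(table(factor(codons[codons %in% ala], levels = ala)))

# RSCU = observed count / mean count within the synonymous family
rscu_ala <- ala_counts / mean(ala_counts)
names(rscu_ala) <- ala
print(rscu_ala)
```

A value above 1 means the codon is used more often than expected under uniform synonymous usage; a value below 1 means it is used less often.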

📊 Gene-Level Metrics

  • Codon frequency tabulation: Count codon occurrences across sequences
  • CAI (Codon Adaptation Index): Measure similarity to highly expressed genes
  • ENC (Effective Number of Codons): Assess codon usage bias strength
  • Fop (Fraction of Optimal codons): Calculate proportion of optimal codons
  • tAI (tRNA Adaptation Index): Match codon usage to tRNA availability
  • CSCg (Codon Stabilization Coefficients): Quantify mRNA stability effects
  • Dp (Deviation from Proportionality): Analyze virus-host codon usage relationships
  • GC content metrics: Overall GC, GC3s (3rd codon positions), GC4d (4-fold degenerate sites)
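
To make one of these metrics concrete: CAI is conventionally defined as the geometric mean of per-codon relative adaptiveness values. The base-R sketch below applies that definition to a toy gene. The weights and codons are invented for illustration, and the code is not cubar's implementation.

```r
# Hypothetical relative adaptiveness values (w) for a handful of codons;
# in practice these are derived from a reference set of highly expressed genes
w <- c(GCT = 1.0, GCC = 0.5, AAA = 0.25, AAG = 1.0)

# Codons of a made-up gene (only codons with a defined weight)
gene_codons <- c("GCT", "GCT", "GCC", "AAA")

# CAI = geometric mean of w over the gene's codons,
# computed on the log scale for numerical stability
cai <- exp(mean(log(w[gene_codons])))
print(cai)  # equals 0.125^(1/4), roughly 0.59
```

Values close to 1 indicate that the gene mostly uses the codons favored in the reference set.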

🛠️ Utilities & Tools

  • Sliding window analysis: Positional codon usage patterns within genes
  • Sequence optimization: Redesign sequences for optimal expression
  • Differential codon usage: Statistical comparison between sequence sets
  • Quality control: Comprehensive CDS validation and preprocessing
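
The idea behind sliding-window analysis can be sketched in a few lines of base R: compute a statistic over successive windows of codons along a gene. The example below uses plain GC content at third codon positions as the statistic. The codons, window size, and step are arbitrary assumptions, and cubar's own sliding-window functions are not used here.

```r
# Hypothetical codons of a short gene
codons <- c("ATG", "GCT", "GCC", "GCA", "AAA",
            "AAG", "GGC", "GGG", "TTT", "TTC")

# TRUE where the third base of a codon is G or C
third_is_gc <- substr(codons, 3, 3) %in% c("G", "C")

window <- 4  # window width in codons (assumed)
step   <- 2  # distance between window starts in codons (assumed)

# Start index of each full window
starts <- seq(1, length(codons) - window + 1, by = step)

# Fraction of G/C third positions within each window
gc3_window <- vapply(
  starts,
  function(s) mean(third_is_gc[s:(s + window - 1)]),
  numeric(1)
)
print(data.frame(start = starts, gc3 = gc3_window))
```

Any per-codon statistic could replace the GC indicator; the windowing logic stays the same.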

Why Choose cubar?

  • 🚀 High Performance: Process large datasets (>100,000 sequences) efficiently using optimized Biostrings and data.table backends
  • 🧬 Flexible Genetic Codes: Support for all NCBI genetic codes plus custom genetic code tables
  • 🔗 R Ecosystem Integration: Seamlessly integrate with other bioinformatics and data analysis packages
  • 📚 Comprehensive Documentation: Extensive tutorials, examples, and theoretical background
  • 🔬 Research Ready: Implements established metrics with proper citations and validation

Installation

Stable Release (Recommended)

Install the latest stable version from CRAN:

install.packages("cubar")

Development Version

Install the latest development version from GitHub:

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}

# Install cubar from GitHub
devtools::install_github("mt1022/cubar", dependencies = TRUE)

Dependencies

System Requirements:

  • R (≥ 4.1.0)

Required Packages:

  • Biostrings (≥ 2.60.0) - Bioconductor package for sequence manipulation
  • IRanges (≥ 2.34.0) - Bioconductor infrastructure for range operations
  • data.table (≥ 1.14.0) - High-performance data manipulation
  • ggplot2 (≥ 3.3.5) - Data visualization
  • rlang (≥ 0.4.11) - Language tools

Note: Bioconductor packages will be installed automatically, but you may need to update your R installation if you encounter compatibility issues.
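
If the Bioconductor dependencies are not picked up automatically in your setup, one common fallback is to install them explicitly with BiocManager before installing cubar. This is a general Bioconductor pattern, not a step from the cubar documentation:

```r
# Install BiocManager from CRAN if it is missing
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install the Bioconductor packages listed under Required Packages
BiocManager::install(c("Biostrings", "IRanges"))
```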

Documentation & Tutorials

📖 Complete documentation is available within R (?function_name) and on our package website.

🎯 Getting Started

📚 Advanced Topics

Example Workflow

Here's a typical analysis workflow demonstrating key functionality:

library(cubar)
library(ggplot2)

# 1. Load and quality-check sequences
data(yeast_cds)
clean_cds <- check_cds(yeast_cds)

# 2. Calculate codon frequencies
codon_freq <- count_codons(clean_cds)

# 3. Calculate multiple metrics
enc <- get_enc(codon_freq)           # Effective number of codons
gc3s <- get_gc3s(codon_freq)         # GC content at 3rd positions

# 4. Analyze highly expressed genes
data(yeast_exp)
yeast_exp <- yeast_exp[yeast_exp$gene_id %in% rownames(codon_freq), ]
high_expr <- head(yeast_exp[order(-yeast_exp$fpkm), ], 500)
rscu_high <- est_rscu(codon_freq[high_expr$gene_id, ])
cai <- get_cai(codon_freq, rscu_high)

# 5. Visualize results
df <- data.frame(ENC = enc, CAI = cai, GC3s = gc3s)
ggplot(df, aes(color = GC3s, x = ENC, y = CAI)) + 
  geom_point(alpha = 0.6) + 
  scale_color_viridis_c() +
  labs(title = "Codon Usage Bias Relationships",
       x = "Effective Number of Codons", y = "Codon Adaptation Index")

🆘 Getting Help

Related Packages

For complementary analysis, consider these R packages:

  • Biostrings - Sequence input/output and manipulation
  • Peptides - Peptide and protein property calculations

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • GitHub Copilot, which was used to suggest code snippets during development
  • GitHub Education, for providing free access to development tools
  • The R and Bioconductor communities, for excellent foundational packages
  • Contributors and users who have provided feedback and improvements