Comprehensive Codon Usage Bias Analysis in R
- Overview
- Features
- Why Choose cubar?
- Installation
- Documentation & Tutorials
- Example Workflow
- Getting Help
- Related Packages
- License
- Acknowledgments
Codon usage bias refers to the non-uniform usage of synonymous codons (codons that encode the same amino acid) across different organisms, genes, and functional categories. cubar is a comprehensive R package for analyzing codon usage bias in coding sequences. It provides a unified framework for calculating established codon usage metrics, conducting sliding-window analyses or differential usage analyses, and optimizing sequences for heterologous expression.
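To make the idea of synonymous codon usage concrete, here is a minimal base-R sketch that counts the leucine codons in a toy sequence and computes a textbook-style relative synonymous codon usage (observed count divided by the mean count across the synonymous family). The sequence is made up for illustration, and the calculation is not taken from cubar's source; the package's own functions may differ in details such as pseudocounts or the handling of non-standard genetic codes.

```r
# Toy coding sequence (hypothetical, for illustration only)
cds <- "ATGCTGCTGCTGTTATTGCTCTAA"

# Split the sequence into non-overlapping triplets (codons)
starts <- seq(1, nchar(cds), by = 3)
codons <- substring(cds, starts, starts + 2)

# The six synonymous codons for leucine in the standard genetic code
leu_codons <- c("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")

# Count each leucine codon, keeping zero counts for codons that never occur
leu_counts <- as.integer(
  table(factor(codons[codons %in% leu_codons], levels = leu_codons))
)

# Textbook RSCU: observed count / mean count across the synonymous family
rscu <- leu_counts / mean(leu_counts)
names(rscu) <- leu_codons
print(rscu)
```

Under uniform usage every value would equal 1; values above 1 mark codons used more often than expected, and values below 1 mark codons used less often.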
- RSCU calculation: Relative synonymous codon usage analysis
- Amino acid usage: Frequency of each amino acid in sequences
- Codon weights: Calculate weights based on gene expression, tRNA availability, and mRNA stability
- Optimal codon inference: Machine learning-based identification of optimal codons
- Codon-anticodon visualization: Visualization of codon-tRNA pairing relationships
- Codon frequency tabulation: Count codon occurrences across sequences
- CAI (Codon Adaptation Index): Measure similarity to highly expressed genes
- ENC (Effective Number of Codons): Assess codon usage bias strength
- Fop (Fraction of Optimal codons): Calculate proportion of optimal codons
- tAI (tRNA Adaptation Index): Match codon usage to tRNA availability
- CSCg (Codon Stabilization Coefficients): Quantify mRNA stability effects
- Dp (Deviation from Proportionality): Analyze virus-host codon usage relationships
- GC content metrics: Overall GC, GC3s (3rd codon positions), GC4d (4-fold degenerate sites)
- Sliding window analysis: Positional codon usage patterns within genes
- Sequence optimization: Redesign sequences for optimal expression
- Differential codon usage: Statistical comparison between sequence sets
- Quality control: Comprehensive CDS validation and preprocessing
- High Performance: Process large datasets (>100,000 sequences) efficiently using optimized Biostrings and data.table backends
- Flexible Genetic Codes: Support for all NCBI genetic codes plus custom genetic code tables
- R Ecosystem Integration: Seamlessly integrate with other bioinformatics and data analysis packages
- Comprehensive Documentation: Extensive tutorials, examples, and theoretical background
- Research Ready: Implements established metrics with proper citations and validation
Install the latest stable version from CRAN:
install.packages("cubar")
Install the latest development version from GitHub:
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# Install cubar from GitHub
devtools::install_github("mt1022/cubar", dependencies = TRUE)
System Requirements:
- R (≥ 4.1.0)
Required Packages:
- Biostrings (≥ 2.60.0) - Bioconductor package for sequence manipulation
- IRanges (≥ 2.34.0) - Bioconductor infrastructure for range operations
- data.table (≥ 1.14.0) - High-performance data manipulation
- ggplot2 (≥ 3.3.5) - Data visualization
- rlang (≥ 0.4.11) - Language tools
Note: Bioconductor packages will be installed automatically, but you may need to update your R installation if you encounter compatibility issues.
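If you want to check your environment against these requirements before installing, the following sketch uses only base R and utils functions. The `required` vector simply restates the minimum versions listed above; the variable names and the "ok"/"outdated"/"missing" labels are illustrative choices, not part of cubar.

```r
# Minimum versions as listed in the requirements above
required <- c(Biostrings = "2.60.0", IRanges = "2.34.0",
              data.table = "1.14.0", ggplot2 = "3.3.5", rlang = "0.4.11")

# Check that the running R version meets the minimum
r_ok <- getRversion() >= "4.1.0"

# For each package, report whether it is installed and recent enough
status <- vapply(names(required), function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) return("missing")
  if (utils::packageVersion(pkg) >= required[[pkg]]) "ok" else "outdated"
}, character(1))

print(r_ok)
print(status)
```

Packages reported as "missing" or "outdated" can then be installed or updated before cubar itself.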
Complete documentation is available within R (?function_name) and on our package website.
- Introduction to cubar - Basic usage and core functionality
- Non-standard Genetic Codes - Working with alternative genetic codes
- Codon Optimization - Sequence optimization strategies
- Mathematical Foundations - Detailed theory behind the metrics
- Function Reference - Complete function documentation
Here's a typical analysis workflow demonstrating key functionality:
library(cubar)
library(ggplot2)
# 1. Load and quality-check sequences
data(yeast_cds)
clean_cds <- check_cds(yeast_cds)
# 2. Calculate codon frequencies
codon_freq <- count_codons(clean_cds)
# 3. Calculate multiple metrics
enc <- get_enc(codon_freq) # Effective number of codons
gc3s <- get_gc3s(codon_freq) # GC content at 3rd positions
# 4. Analyze highly expressed genes
data(yeast_exp)
yeast_exp <- yeast_exp[yeast_exp$gene_id %in% rownames(codon_freq), ]
high_expr <- head(yeast_exp[order(-yeast_exp$fpkm), ], 500)
rscu_high <- est_rscu(codon_freq[high_expr$gene_id, ])
cai <- get_cai(codon_freq, rscu_high)
# 5. Visualize results
df <- data.frame(ENC = enc, CAI = cai, GC3s = gc3s)
ggplot(df, aes(x = ENC, y = CAI, color = GC3s)) +
  geom_point(alpha = 0.6) +
  scale_color_viridis_c() +
  labs(title = "Codon Usage Bias Relationships",
       x = "Effective Number of Codons", y = "Codon Adaptation Index")
- GitHub Issues: Report bugs, request features, or ask questions
- Documentation: Check function help (?function_name) and online docs
For complementary analysis, consider these R packages:
- Biostrings - Sequence input/output and manipulation
- Peptides - Peptide and protein property calculations
This project is licensed under the MIT License - see the LICENSE file for details.
- GitHub Copilot was used to suggest code snippets during development
- GitHub Education for providing free access to development tools
- The R and Bioconductor communities for excellent foundational packages
- Contributors and users who have provided feedback and improvements