Description of the In silico Parameters
The parameters used to assess the quality of the mRNA sequences are described below.
1. GC % and U %
- Description: Represents the percentage of Guanine (G) + Cytosine (C) and Uracil (U) nucleotides in the sequence, respectively.
- Calculation:
- GC %: Count of (G + C) / Total Length × 100.
- U %: Count of (U) / Total Length × 100.
- Interpretation:
- GC Content: Higher GC content generally correlates with more stable mRNA secondary structures. Extremes (>70% or <30%) can reduce translation efficiency and stability.
- U Content: U-rich sequences are prone to instability via hydrolysis and serve as recognition sites for specific RNA-binding proteins, making them a key driver of innate immunogenicity and reduced translation efficiency.
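The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code; the function name `base_percentages` and the tolerance for DNA-style input (T read as U) are assumptions made for the example.

```python
def base_percentages(seq):
    """Return (GC %, U %) for an RNA sequence."""
    # Assumption: accept lowercase and DNA-style input by mapping T to U.
    seq = seq.upper().replace("T", "U")
    if not seq:
        raise ValueError("sequence is empty")
    gc_percent = (seq.count("G") + seq.count("C")) / len(seq) * 100
    u_percent = seq.count("U") / len(seq) * 100
    return gc_percent, u_percent
```

For the four-nucleotide sequence GCAU this gives 50% GC and 25% U.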
2. Codon Adaptation Index (CAI)
- Description: A measure of how well the codons in the coding sequence (CDS) match the preferred codon usage of a specific host organism.
- Calculation:
- Each codon is assigned a "relative adaptiveness" score (w), calculated as the frequency of that codon divided by the frequency of the most abundant codon for the same amino acid.
- The CAI is the geometric mean of these relative adaptiveness scores across all codons of the CDS.
- Interpretation:
- Range: 0.0 to 1.0.
- High CAI (> 0.8): Indicates the sequence uses codons highly preferred by the host, typically leading to high protein expression efficiency.
- Low CAI (< 0.6): Indicates the presence of rare or non-optimal codons, which may reduce translation speed or accuracy.
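A minimal sketch of the CAI calculation described above. The codon frequencies in `TOY_FREQS` are made-up values for illustration only, not real human codon usage, and the names `relative_adaptiveness` and `cai` are hypothetical. Codons missing from the table (or with zero weight) are skipped here, which is an assumption about how such cases are handled.

```python
import math

# Toy table: {amino acid: {codon: frequency}}. Values are invented.
TOY_FREQS = {
    "K": {"AAA": 0.4, "AAG": 0.6},
    "F": {"UUU": 0.5, "UUC": 0.5},
}

def relative_adaptiveness(freqs_by_aa):
    """w = codon frequency / frequency of the most used synonymous codon."""
    w = {}
    for codon_freqs in freqs_by_aa.values():
        top = max(codon_freqs.values())
        for codon, freq in codon_freqs.items():
            w[codon] = freq / top
    return w

def cai(cds, w):
    """Geometric mean of w over the codons of the CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Assumption: codons without a positive weight are left out.
    scores = [w[c] for c in codons if w.get(c, 0) > 0]
    if not scores:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))
```

With this toy table, a CDS made only of the most frequent codons (for example AAGUUC) scores 1.0, and each non-preferred codon pulls the geometric mean down.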
3. Rare Codons %
- Description: The percentage of codons in the CDS that are considered "rare" or limiting for translation in the host organism (human).
- Calculation:
- The code checks every codon against a specific predefined list of rare codons (e.g., AGA, AGG, CGA, CGG, etc.).
- Formula: (Count of Rare Codons / Total Codons) × 100.
- Interpretation:
- High Percentage: A high frequency of rare codons can cause ribosomal stalling, premature termination, or misfolding of the protein.
- Low Percentage: Desirable for smooth and efficient translation.
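The rare-codon percentage can be sketched as follows. The set `RARE_CODONS` contains only the four examples named above; the full predefined list is not given in this description, so the set is a placeholder. The function name is hypothetical.

```python
# Placeholder: only the example codons named in the text, not the full list.
RARE_CODONS = {"AGA", "AGG", "CGA", "CGG"}

def rare_codon_percent(cds, rare=RARE_CODONS):
    """(count of rare codons / total codons) x 100."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return 0.0
    return sum(codon in rare for codon in codons) / len(codons) * 100
```

For AUGAGACGGUAA (codons AUG, AGA, CGG, UAA), two of four codons are in the placeholder set, giving 50%.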
4. Kozak Strength
- Description: Evaluates the sequence context immediately surrounding the Start Codon (AUG) to predict the efficiency of translation initiation. It specifically looks at the nucleotide at position -3 (in the 5' UTR) and position +4 (in the CDS).
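- Calculation Logic:
- Purines: Adenine (A), Guanine (G).
- Strong: purine at -3 and a favorable nucleotide at +4 (G in the canonical Kozak consensus).
- Adequate 1: purine at -3, suboptimal nucleotide at +4.
- Adequate 2: no purine at -3, favorable nucleotide at +4.
- Weak: neither position is favorable.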
- Interpretation:
- Strong: High probability of the ribosome recognizing the start codon; optimal initiation.
- Adequate 1: Good initiation context driven primarily by the favorable nucleotide at position -3 (the most critical position), though the +4 position is suboptimal.
- Adequate 2: Moderate initiation context; lacks the critical purine at -3 but is partially compensated by a favorable nucleotide at +4.
- Weak: The ribosome may skip the start codon (leaky scanning), resulting in lower protein yield.
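The four categories can be expressed as a small decision function. This is a sketch under two assumptions: the 5' UTR and CDS are passed separately (so position -3 is the third-from-last UTR nucleotide and +4 is the nucleotide right after AUG), and the "favorable" nucleotide at +4 is G, as in the canonical Kozak consensus. The function name `kozak_strength` is hypothetical.

```python
def kozak_strength(utr5, cds):
    """Classify the start-codon context from positions -3 and +4."""
    utr5, cds = utr5.upper(), cds.upper()
    if len(utr5) < 3 or len(cds) < 4:
        raise ValueError("need at least 3 nt of 5' UTR and 4 nt of CDS")
    purine_at_minus3 = utr5[-3] in "AG"
    # Assumption: G is the favorable nucleotide at +4.
    g_at_plus4 = cds[3] == "G"
    if purine_at_minus3 and g_at_plus4:
        return "Strong"
    if purine_at_minus3:
        return "Adequate 1"
    if g_at_plus4:
        return "Adequate 2"
    return "Weak"
```

For example, a UTR ending in ACC followed by a CDS starting AUGG has A at -3 and G at +4, so it is classified as Strong.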
5. Homopolymers Count
- Description: The count of homopolymer runs (repeats of the same nucleotide, e.g., AAAAA, UUUUU) at or above a minimum length (default is 5).
- Calculation:
- The code scans for runs of A, U, G, or C that are 5 nucleotides or longer.
- Interpretation:
- High Count: Long homopolymer runs can cause "ribosomal slippage" or frameshifting during translation, leading to non-functional proteins. They may also create synthesis difficulties during mRNA manufacturing.
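One way to scan for such runs is a regular expression with a backreference. In this sketch each maximal run counts once (a run of ten A's is one homopolymer, not six), which is an assumption about how the count is defined; `count_homopolymers` and `min_len` are hypothetical names.

```python
import re

def count_homopolymers(seq, min_len=5):
    """Count maximal runs of a single nucleotide of at least min_len."""
    # ([ACGU]) captures one base; \1{n,} requires it to repeat n more times.
    pattern = re.compile(r"([ACGU])\1{%d,}" % (min_len - 1))
    return sum(1 for _ in pattern.finditer(seq.upper()))
```

Because the quantifier is greedy, a longer run is consumed as a single match rather than being counted at every offset.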
6. uORF (Upstream Open Reading Frames) Count
- Description: Detects the presence of small coding regions within the 5' Untranslated Region (5' UTR) before the main start codon.
- Calculation:
- Scans the 5' UTR for a Start Codon (AUG) followed by a Stop Codon (UAA, UAG, UGA) downstream.
- It counts a uORF if the distance between Start and Stop is at least 9 nucleotides.
- Interpretation:
- Present: Ribosomes may translate this small upstream region and fall off before reaching the main gene, significantly reducing expression of the target protein.
- Absent: The ribosome can proceed directly to the main start codon (preferred).
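A sketch of the uORF scan, with two assumptions that the description leaves open: the stop codon must be in frame with the AUG, and the "distance" is measured from the first nucleotide of AUG through the last nucleotide of the stop codon. Under that reading, the 9-nucleotide minimum means AUG, at least one sense codon, and a stop. The function name is hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def count_uorfs(utr5, min_len=9):
    """Count AUG...stop reading frames that lie entirely within the 5' UTR."""
    utr5 = utr5.upper()
    count = 0
    for start in range(len(utr5) - 2):
        if utr5[start:start + 3] != "AUG":
            continue
        # Assumption: walk codon by codon, so only in-frame stops terminate.
        for pos in range(start + 3, len(utr5) - 2, 3):
            if utr5[pos:pos + 3] in STOP_CODONS:
                # Assumption: length includes both the AUG and the stop codon.
                if pos + 3 - start >= min_len:
                    count += 1
                break
    return count
```

In GGAUGAAAUAAGG the frame AUG AAA UAA spans exactly 9 nucleotides and is counted; AUG followed directly by UAA spans 6 and is not.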
7. 4-Base G Homopolymers Count
- Description: The number of runs of four or more consecutive guanines in the 5' UTR. Such guanine-rich stretches can form stable structural motifs.
- Calculation:
- The code searches for patterns of four or more consecutive Guanines (GGGG).
- Interpretation:
- Presence: Runs of four or more Gs can form very stable secondary structures that act as physical barriers to the ribosome, inhibiting translation initiation.
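The G-run search reduces to a single regular expression over the 5' UTR. In this sketch each maximal run of G's counts once, which is an assumption; the name `count_g_runs` is hypothetical.

```python
import re

def count_g_runs(utr5, min_len=4):
    """Count maximal runs of at least min_len consecutive G's."""
    return len(re.findall(r"G{%d,}" % min_len, utr5.upper()))
```

A run of six G's is reported as one hit, and two GGG triplets separated by another base are not reported at all.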
8. Toxic Motifs Count
- Description: Specific sequence patterns known to be detrimental to cell health, mRNA stability, or manufacturing.
- Calculation:
- Matches the sequence against a predefined list of toxic patterns (e.g., AAUAAA, UUUUUUU, GGAGGG).
- Interpretation:
- Result: A count of how many times these patterns appear.
- Implication: These motifs should be minimized as they can trigger immune responses, decrease stability, or resemble signal sequences (like poly-A signals) in the wrong place.
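A sketch of the motif count. The tuple `TOXIC_MOTIFS` holds only the three examples named above, since the full predefined list is not given here. Overlapping occurrences are counted separately (a zero-width lookahead lets the regex engine report every start position), which is an assumption about the intended behaviour.

```python
import re

# Placeholder: only the example motifs named in the text.
TOXIC_MOTIFS = ("AAUAAA", "UUUUUUU", "GGAGGG")

def count_toxic_motifs(seq, motifs=TOXIC_MOTIFS):
    """Total number of motif occurrences, overlaps included."""
    seq = seq.upper()
    total = 0
    for motif in motifs:
        # (?=...) matches at every position where the motif starts.
        total += len(re.findall("(?=%s)" % re.escape(motif), seq))
    return total
```

With overlap counting, a run of eight U's contains the seven-U motif twice.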
9. AU Rich Elements (AREs) Count
- Description: Destabilizing elements typically found in the 3' UTR (though scanned here in the whole sequence).
- Calculation:
- Counts occurrences of patterns like AUUUA and UUUA.
- Interpretation:
- High Count: AREs recruit degradation enzymes, leading to a shorter half-life for the mRNA (rapid decay). This is often undesirable if long-term protein expression is the goal.
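The ARE count can be sketched with a per-pattern breakdown. Note that every AUUUA also contains UUUA, so summing the two counts registers such a site twice; whether the original tool de-duplicates these is not stated, and this sketch simply reports each pattern separately. The function name is hypothetical.

```python
def count_ares(seq, patterns=("AUUUA", "UUUA")):
    """Return {pattern: number of occurrences} over the whole sequence."""
    seq = seq.upper()
    counts = {}
    for pattern in patterns:
        last_start = len(seq) - len(pattern)
        counts[pattern] = sum(
            seq.startswith(pattern, i) for i in range(last_start + 1)
        )
    return counts
```

Keeping the counts separate lets the caller decide whether to sum them or to report only the longer pattern.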
10. Slippery Sites Count
- Description: Specific 7-nucleotide sequences prone to causing ribosomal frameshifting.
- Calculation:
- Scans the sequence for exact matches to a predefined list of slippery sites (e.g., UUUUUUA, AAAAAAC).
- Interpretation:
- Presence: Increases the risk that the ribosome "slips" by one base during translation, changing the reading frame and producing a garbled protein sequence downstream.
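Because every slippery site is exactly 7 nucleotides long, the scan can slide a 7-nucleotide window along the sequence and test set membership. The set below contains only the two examples named above, and returning positions alongside the count is an illustrative choice, not something the description specifies.

```python
# Placeholder: only the example sites named in the text.
SLIPPERY_SITES = {"UUUUUUA", "AAAAAAC"}

def find_slippery_sites(seq, sites=SLIPPERY_SITES):
    """Return (0-based position, site) for every exact 7-nt match."""
    seq = seq.upper()
    return [
        (i, seq[i:i + 7])
        for i in range(len(seq) - 6)
        if seq[i:i + 7] in sites
    ]
```

The count reported by the metric is then the length of the returned list, and the positions show where in the sequence a redesign would be needed.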
11. Unwanted Codon Pairs Count
- Description: Certain pairs of adjacent codons that interact poorly within the ribosome, slowing down translation or causing ribosome stalling.
- Calculation:
- The code analyzes the sequence in 6-nucleotide windows (codon pairs) and counts matches against a list of unwanted pairs (e.g., UAUCGC).
- Interpretation:
- High Count: Indicates a sequence that may translate inefficiently ("Codon Pair Bias"). Removing these pairs can improve translation elongation rates.
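A sketch of the codon-pair scan. Since a codon pair is two adjacent codons, the window here advances three nucleotides at a time from the start of the CDS; that in-frame stepping is an assumption, as is the one-entry placeholder set taken from the example above.

```python
# Placeholder: only the example pair named in the text.
UNWANTED_PAIRS = {"UAUCGC"}

def count_unwanted_pairs(cds, pairs=UNWANTED_PAIRS):
    """Count in-frame 6-nt windows (adjacent codon pairs) found in pairs."""
    cds = cds.upper()
    # Step by 3 so each window covers codon k and codon k+1.
    return sum(cds[i:i + 6] in pairs for i in range(0, len(cds) - 5, 3))
```

In AUGUAUCGCUAA the pair UAU-CGC is in frame and counted once; the same six nucleotides shifted out of frame are ignored.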
12. AUG Count in 5' UTR
- Description: The number of AUG codons present in the 5' untranslated region (5' UTR), upstream of the main start codon.
- Calculation:
- The 5' UTR sequence is scanned and every occurrence of AUG is counted.
- Interpretation:
- A higher AUG count in the 5' UTR may indicate the presence of upstream open reading frames (uORFs), which can reduce translation efficiency by diverting ribosomes from the main start codon. Multiple AUGs can also contribute to ribosome stalling or premature initiation events, potentially impacting protein expression levels.
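Counting upstream AUGs is a plain substring count, sketched below. AUG cannot overlap with itself, so Python's non-overlapping `str.count` is sufficient. The count is taken in any reading frame, which is an assumption, and the function name is hypothetical.

```python
def count_upstream_augs(utr5):
    """Count AUG triplets anywhere in the 5' UTR, regardless of frame."""
    return utr5.upper().count("AUG")
```

Unlike the uORF count, this metric does not require a downstream stop codon, so it also flags AUGs whose reading frame runs into the main CDS.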