Description of the In Silico Parameters
The parameters used to assess the quality of the mRNA sequences are described below.
1. GC % and U %
- Description: GC % is the percentage of Guanine (G) plus Cytosine (C) nucleotides in the sequence; U % is the percentage of Uracil (U) nucleotides.
- Calculation:
- GC %: Count of (G + C) / Total Length × 100.
- U %: Count of (U) / Total Length × 100.
- Interpretation:
- GC Content: Higher GC content generally correlates with more stable mRNA secondary structures. Extremes (too high or too low) can reduce translation efficiency and stability.
- U Content: U-rich sequences are prone to instability via hydrolysis and can serve as recognition sites for specific RNA-binding proteins.
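As a rough illustration, the two formulas above could be computed as in the following Python sketch. The function name `nucleotide_percentages` is hypothetical, and converting T to U on input is an assumption made here for convenience, not something the description states.

```python
def nucleotide_percentages(seq: str) -> dict:
    """Return GC % and U % for an RNA sequence (hypothetical helper)."""
    # Assumption: accept DNA-style input by treating T as U.
    seq = seq.upper().replace("T", "U")
    if not seq:
        raise ValueError("sequence is empty")
    gc_count = seq.count("G") + seq.count("C")
    u_count = seq.count("U")
    return {
        "gc_percent": 100.0 * gc_count / len(seq),
        "u_percent": 100.0 * u_count / len(seq),
    }


# 10 nt with 2 G, 2 C and 4 U: both values come out at 40.0
result = nucleotide_percentages("AUGGCCUUUA")
print(result)
```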
2. Codon Adaptation Index (CAI)
- Description: A measure of how well the codons in a coding sequence (CDS) match the preferred codon usage of a specific host organism.
- Calculation:
- Each codon is assigned a "relative adaptiveness" score (w), calculated as the frequency of that codon divided by the frequency of the most abundant codon for the same amino acid.
- The CAI is the geometric mean of these relative adaptiveness scores across the entire sequence.
- Interpretation:
- Range: 0.0 to 1.0.
- High CAI (> 0.8): Indicates the sequence uses codons highly preferred by the host, typically leading to high protein expression efficiency.
- Low CAI (< 0.6): Indicates the presence of rare or non-optimal codons, which may reduce translation speed or accuracy.
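The two calculation steps can be sketched as below. This is a minimal illustration, not the tool's actual code: the function names are hypothetical, the codon frequencies are made-up toy numbers rather than a real host usage table, and skipping codons that have no score is an assumption.

```python
import math


def relative_adaptiveness(codon_freq, synonymous_groups):
    """w = codon frequency / frequency of the most-used synonymous codon."""
    w = {}
    for codons in synonymous_groups:
        top = max(codon_freq[c] for c in codons)
        for c in codons:
            w[c] = codon_freq[c] / top
    return w


def cai(cds, w):
    """Geometric mean of w over the codons of the CDS."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Assumption: codons without a score are skipped; every w must be > 0,
    # otherwise the logarithm is undefined.
    scores = [w[c] for c in codons if c in w]
    if not scores:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Toy frequencies for illustration only (not real human codon usage).
toy_freq = {"GCC": 0.4, "GCU": 0.2, "AAG": 0.6, "AAA": 0.3}
toy_groups = [["GCC", "GCU"], ["AAG", "AAA"]]
w = relative_adaptiveness(toy_freq, toy_groups)
value = cai("GCCGCUAAGAAA", w)  # scores 1.0, 0.5, 1.0, 0.5
print(round(value, 4))
```

Working in log space avoids numerical underflow when many small scores are multiplied over a long sequence.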
3. Rare Codons %
- Description: The percentage of codons in the CDS that are considered "rare" or limiting for translation in the host organism (human).
- Calculation:
- The code checks every codon against a specific predefined list of rare codons (e.g., AGA, AGG, CGA, CGG, etc.).
- Formula: (Count of Rare Codons / Total Codons) × 100.
- Interpretation:
- High Percentage: A high frequency of rare codons can cause ribosomal stalling, premature termination, or misfolding of the protein.
- Low Percentage: Desirable for smooth and efficient translation.
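A minimal sketch of the formula, assuming the CDS is read in frame from its first nucleotide. The set below contains only the four codons named as examples above; the full predefined list is not given here, and the function name is hypothetical.

```python
# Only the example codons listed above; the real list is longer.
RARE_CODONS = {"AGA", "AGG", "CGA", "CGG"}


def rare_codon_percent(cds, rare=RARE_CODONS):
    """(count of rare codons / total codons) x 100."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        raise ValueError("sequence contains no complete codon")
    rare_count = sum(1 for c in codons if c in rare)
    return 100.0 * rare_count / len(codons)


# Codons: AUG AGA CUG CGG UAA, two of five are in the example set.
pct = rare_codon_percent("AUGAGACUGCGGUAA")
print(pct)
```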
4. Kozak Strength
- Description: Evaluates the sequence context immediately surrounding the Start Codon (AUG) to predict the efficiency of translation initiation. It specifically looks at the nucleotide at position -3 (in the 5' UTR) and position +4 (in the CDS).
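- Calculation Logic:
- Purines: Adenine (A), Guanine (G). A purine at position -3 is the favorable context at that position.
- The rating combines the two positions: favorable at both -3 and +4 gives Strong, favorable at -3 only gives Adequate 1, favorable at +4 only gives Adequate 2, and favorable at neither gives Weak.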
- Interpretation:
- Strong: High probability of the ribosome recognizing the start codon; optimal initiation.
- Adequate 1: Good initiation context driven primarily by the favorable nucleotide at position -3 (the most critical position), though the +4 position is suboptimal.
- Adequate 2: Moderate initiation context; lacks the critical purine at -3 but is partially compensated by a favorable nucleotide at +4.
- Weak: The ribosome may skip the start codon (leaky scanning), resulting in lower protein yield.
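The four categories can be expressed as a small decision function. This is a sketch under stated assumptions: the description does not say which nucleotide counts as favorable at +4, so G (the classical Kozak consensus) is assumed here and exposed as a parameter; the function name and the split into `utr5` and `cds` arguments are also hypothetical.

```python
PURINES = {"A", "G"}


def kozak_strength(utr5, cds, plus4_favorable="G"):
    """Classify the start-codon context from positions -3 and +4.

    Position +1 is the A of AUG, so -3 is the third-to-last base of the
    5' UTR and +4 is the base right after AUG.
    """
    if len(utr5) < 3 or len(cds) < 4 or not cds.startswith("AUG"):
        raise ValueError("need >= 3 nt of 5' UTR and a CDS starting with AUG")
    minus3_ok = utr5[-3] in PURINES
    plus4_ok = cds[3] == plus4_favorable  # assumption: G is favorable at +4
    if minus3_ok and plus4_ok:
        return "Strong"
    if minus3_ok:
        return "Adequate 1"
    if plus4_ok:
        return "Adequate 2"
    return "Weak"


print(kozak_strength("GCCACC", "AUGGCU"))  # A at -3, G at +4
```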
5. Homopolymers Count
- Description: The count of homopolymer runs (repeats of the same nucleotide, e.g., AAAAA, UUUUU) that reach a minimum length (default is 5).
- Calculation:
- The code scans for runs of A, U, G, or C that are 5 nucleotides or longer.
- Interpretation:
- High Count: Long homopolymer runs can cause "ribosomal slippage" or frameshifting during translation, leading to non-functional proteins. They may also create synthesis difficulties during mRNA manufacturing.
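One way to implement such a scan is with a regular expression that uses a backreference. In this sketch each maximal run is counted once, however long it is; whether the actual code counts runs this way is an assumption, and the function name is hypothetical.

```python
import re


def count_homopolymers(seq, min_len=5):
    """Count runs of a single nucleotide that are at least min_len long."""
    # ([ACGU]) captures one base; \1{min_len-1,} requires it to repeat.
    pattern = re.compile(r"([ACGU])\1{%d,}" % (min_len - 1))
    return sum(1 for _ in pattern.finditer(seq))


# One run of 5 A and one run of 6 U
print(count_homopolymers("AUGAAAAACGUUUUUUGC"))
```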
6. uORF (Upstream Open Reading Frames) Count
- Description: Detects the presence of small coding regions within the 5' Untranslated Region (5' UTR) before the main start codon.
- Calculation:
- Scans the 5' UTR for a Start Codon (AUG) followed by a Stop Codon (UAA, UAG, UGA) downstream.
- It counts a uORF if the distance between Start and Stop is at least 9 nucleotides.
- Interpretation:
- Present: Ribosomes may translate this small upstream region and fall off before reaching the main gene, significantly reducing expression of the target protein.
- Absent: The ribosome can proceed directly to the main start codon (preferred).
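The scan could look roughly like the sketch below. Two points are assumptions because the description leaves them open: the stop codon is required to be in frame with the upstream AUG, and "distance" is measured from the first base of AUG to the first base of the stop codon. The function name is hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def count_uorfs(utr5, min_distance=9):
    """Count AUG ... stop pairs in the 5' UTR that span >= min_distance nt."""
    count = 0
    for start in range(len(utr5) - 2):
        if utr5[start:start + 3] != "AUG":
            continue
        # Assumption: walk codon by codon so the stop is in frame with AUG.
        for pos in range(start + 3, len(utr5) - 2, 3):
            if utr5[pos:pos + 3] in STOP_CODONS:
                if pos - start >= min_distance:
                    count += 1
                break  # the first in-frame stop ends this reading frame
    return count


print(count_uorfs("GGAUGCCCGGGUAAGG"))  # AUG at index 2, UAA at index 11
```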
7. 4-Base G Homopolymers Count
- Description: Structural motifs formed by guanine-rich nucleic acid sequences in the 5' UTR.
- Calculation:
- The code searches for patterns of four or more consecutive Guanines (GGGG).
- Interpretation:
- Presence: Runs of four or more Guanines can form very stable secondary structures (such as G-quadruplexes) that act as physical barriers to the ribosome, inhibiting translation initiation.
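A short sketch of the search, assuming each maximal run of Gs in the 5' UTR is counted once (a run of six Gs counts as one hit, not three). The function name is hypothetical.

```python
import re


def count_g_runs(utr5, min_len=4):
    """Count runs of min_len or more consecutive G in the 5' UTR."""
    return len(re.findall(r"G{%d,}" % min_len, utr5))


# Runs of 4 G and 6 G are counted; the final run of 3 G is too short.
print(count_g_runs("ACGGGGUCGGGGGGAGGG"))
```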
8. Toxic Motifs Count
- Description: Specific sequence patterns known to be detrimental to cell health, mRNA stability, or manufacturing.
- Calculation:
- Matches the sequence against a predefined list of toxic patterns (e.g., AAUAAA, UUUUUUU, GGAGGG).
- Interpretation:
- Result: A count of how many times these patterns appear.
- Implication: These motifs should be minimized as they can trigger immune responses, decrease stability, or resemble signal sequences (like poly-A signals) in the wrong place.
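The matching step might be sketched as follows. The list holds only the three example patterns named above, and counting overlapping occurrences (via a zero-width lookahead) is an assumption; the actual code may count non-overlapping matches instead. The function name is hypothetical.

```python
import re

# Only the example patterns listed above; the real list may differ.
TOXIC_MOTIFS = ["AAUAAA", "UUUUUUU", "GGAGGG"]


def count_motifs(seq, motifs=TOXIC_MOTIFS):
    """Return a per-motif count, including overlapping occurrences."""
    counts = {}
    for motif in motifs:
        # A lookahead matches at every start position without consuming text.
        counts[motif] = len(re.findall("(?=%s)" % re.escape(motif), seq))
    return counts


counts = count_motifs("GCAAUAAAGCGGAGGGCUUUUUUUU")
print(counts, sum(counts.values()))
```

In this example the run of eight Us contains the 7-nucleotide pattern twice, which is why overlapping and non-overlapping counting give different totals.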
9. AU Rich Elements (AREs) Count
- Description: Destabilizing elements typically found in the 3' UTR (though here the whole sequence is scanned).
- Calculation:
- Counts occurrences of patterns like AUUUA and UUUA.
- Interpretation:
- High Count: AREs recruit degradation enzymes, leading to a shorter half-life for the mRNA (rapid decay). This is often undesirable if long-term protein expression is the goal.
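A minimal counting sketch for the two example patterns. Because every AUUUA contains UUUA, summing the two counts registers each AUUUA site twice; whether the actual code sums them this way is an assumption. The function name is hypothetical.

```python
ARE_PATTERNS = ["AUUUA", "UUUA"]  # the two examples named above


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


seq = "GCAUUUAUUUAGC"
per_pattern = {p: count_overlapping(seq, p) for p in ARE_PATTERNS}
print(per_pattern)  # two AUUUA sites sharing an A, and two UUUA sites
```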
10. Slippery Sites Count
- Description: Specific 7-nucleotide sequences prone to causing ribosomal frameshifting.
- Calculation:
- Scans the sequence for exact matches to a predefined list of slippery sites (e.g., UUUUUUA, AAAAAAC).
- Interpretation:
- Presence: Increases the risk that the ribosome "slips" by one base during translation, changing the reading frame and producing a garbled protein sequence downstream.
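An exact-match scan over 7-nucleotide windows could be sketched like this. The set contains only the two example sites given above, and returning positions alongside the matches is a choice made for this illustration; the function name is hypothetical.

```python
# Only the example sites listed above; the real list may be longer.
SLIPPERY_SITES = {"UUUUUUA", "AAAAAAC"}


def find_slippery_sites(seq, sites=SLIPPERY_SITES):
    """Return (0-based index, site) for every 7-nt window found in sites."""
    hits = []
    for i in range(len(seq) - 6):
        window = seq[i:i + 7]
        if window in sites:
            hits.append((i, window))
    return hits


hits = find_slippery_sites("AUGUUUUUUAGCAAAAAACGG")
print(len(hits), hits)
```

Reporting positions makes it easier to check whether a hit lies inside the CDS, where frameshifting matters.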
11. Unwanted Codon Pairs Count
- Description: Certain pairs of adjacent codons that interact poorly within the ribosome, slowing down translation or causing ribosome stalling.
- Calculation:
- The code analyzes the sequence in 6-nucleotide windows (codon pairs) and counts matches against a list of unwanted pairs (e.g., UAUCGC).
- Interpretation:
- High Count: Indicates a sequence that may translate inefficiently ("Codon Pair Bias"). Removing these pairs can improve translation elongation rates.
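The window analysis can be sketched as below, assuming the windows are read in frame and advance one codon (3 nucleotides) at a time so that each window is a pair of adjacent codons. The set holds only the example pair given above, and the function name is hypothetical.

```python
# Only the example pair named above; the real list is longer.
UNWANTED_PAIRS = {"UAUCGC"}


def count_unwanted_pairs(cds, pairs=UNWANTED_PAIRS):
    """Count adjacent in-frame codon pairs that appear in the unwanted list."""
    count = 0
    # Step by 3 so each 6-nt window starts on a codon boundary.
    for i in range(0, len(cds) - 5, 3):
        if cds[i:i + 6] in pairs:
            count += 1
    return count


# Codons: AUG UAU CGC UAU CGC UAA, so UAU+CGC occurs twice in frame.
print(count_unwanted_pairs("AUGUAUCGCUAUCGCUAA"))
```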