Description of the In Silico Parameters

The parameters below are used to assess the quality of the mRNA sequences.

1. GC % and U %
  • Description: Represents the percentage of Guanine (G) + Cytosine (C) and Uracil (U) nucleotides in the sequence, respectively.
  • Calculation:
    • GC %: Count of (G + C) / Total Length × 100.
    • U %: Count of (U) / Total Length × 100.
  • Interpretation:
    • GC Content: Higher GC content generally correlates with more stable mRNA secondary structures. Extremes (too high or too low) can reduce translation efficiency and stability.
    • U Content: U-rich sequences are more prone to instability via hydrolysis and may serve as recognition sites for specific RNA-binding proteins.
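The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name base_percentages and the choice to treat T as U are assumptions made for the example.

```python
def base_percentages(seq: str) -> dict:
    """Return GC % and U % of a sequence (hypothetical helper)."""
    # Assumption: DNA-style input is accepted and T is read as U.
    s = seq.upper().replace("T", "U")
    if not s:
        raise ValueError("sequence is empty")
    total = len(s)
    gc_percent = 100.0 * (s.count("G") + s.count("C")) / total
    u_percent = 100.0 * s.count("U") / total
    return {"gc_percent": gc_percent, "u_percent": u_percent}


if __name__ == "__main__":
    print(base_percentages("AUGC"))
```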
2. Codon Adaptation Index (CAI)
  • Description: A measure of how well the codons in the coding sequence (CDS) match the preferred codon usage of a specific host organism.
  • Calculation:
    • Each codon is assigned a "relative adaptiveness" score (w), calculated as the frequency of that codon divided by the frequency of the most abundant codon for the same amino acid.
    • The CAI is the geometric mean of these relative adaptiveness scores across the entire sequence.
  • Interpretation:
    • Range: 0.0 to 1.0.
    • High CAI (> 0.8): Indicates the sequence uses codons highly preferred by the host, typically leading to high protein expression efficiency.
    • Low CAI (< 0.6): Indicates the presence of rare or non-optimal codons, which may reduce translation speed or accuracy.
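As a rough sketch of the calculation described above, the relative adaptiveness scores and their geometric mean could be computed as follows. The frequency table TOY_FREQ holds made-up numbers for two amino acids only and is not a real host codon usage table; the function names are likewise hypothetical. Codons missing from the table are skipped, which is a design choice of this example.

```python
import math
from collections import defaultdict

# Assumption: toy frequencies for illustration only, NOT real human codon usage.
TOY_FREQ = {"CUG": 40.0, "CUA": 10.0, "AAG": 30.0, "AAA": 15.0}
TOY_AA = {"CUG": "L", "CUA": "L", "AAG": "K", "AAA": "K"}


def relative_adaptiveness(codon_freq, codon_to_aa):
    """w = codon frequency / frequency of the most used synonymous codon."""
    best = defaultdict(float)
    for codon, freq in codon_freq.items():
        aa = codon_to_aa[codon]
        best[aa] = max(best[aa], freq)
    return {c: f / best[codon_to_aa[c]] for c, f in codon_freq.items()}


def cai(cds, w):
    """Geometric mean of w over the codons of the CDS."""
    s = cds.upper().replace("T", "U")
    codons = [s[i:i + 3] for i in range(0, len(s) - len(s) % 3, 3)]
    # Codons absent from the table, or with w == 0, are skipped here.
    scores = [w[c] for c in codons if w.get(c, 0.0) > 0.0]
    if not scores:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(math.log(x) for x in scores) / len(scores))


if __name__ == "__main__":
    w = relative_adaptiveness(TOY_FREQ, TOY_AA)
    print(cai("CUGAAG", w), cai("CUAAAA", w))
```

Computing the mean in log space avoids underflow on long sequences, where the direct product of many values below 1 would shrink toward zero.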
3. Rare Codons %
  • Description: The percentage of codons in the CDS that are considered "rare" or limiting for translation in the host organism (human).
  • Calculation:
    • The code checks every codon against a predefined list of rare codons (e.g., AGA, AGG, CGA, CGG).
    • Formula: (Count of Rare Codons / Total Codons) × 100.
  • Interpretation:
    • High Percentage: A high frequency of rare codons can cause ribosomal stalling, premature termination, or misfolding of the protein.
    • Low Percentage: Desirable for smooth and efficient translation.
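The formula above might be implemented along these lines. The set RARE_CODONS contains only the four codons named in the text; the full list used by the tool is not given here, so treat the set, and the function name, as placeholders.

```python
# Assumption: only the four example codons from the text; the real list is longer.
RARE_CODONS = frozenset({"AGA", "AGG", "CGA", "CGG"})


def rare_codon_percent(cds, rare=RARE_CODONS):
    """(count of rare codons / total codons) x 100 for an in-frame CDS."""
    s = cds.upper().replace("T", "U")
    # Split into complete codons; a trailing partial codon is ignored.
    codons = [s[i:i + 3] for i in range(0, len(s) - len(s) % 3, 3)]
    if not codons:
        raise ValueError("sequence has no complete codon")
    return 100.0 * sum(c in rare for c in codons) / len(codons)


if __name__ == "__main__":
    print(rare_codon_percent("AUGAGACGGGCC"))
```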
4. Kozak Strength
  • Description: Evaluates the sequence context immediately surrounding the Start Codon (AUG) to predict the efficiency of translation initiation. With the A of AUG numbered +1, it specifically looks at the nucleotide at position -3 (in the 5' UTR) and position +4 (in the CDS).
  • Calculation Logic:
    • Purines (the favorable nucleotides at position -3): Adenine (A), Guanine (G).
  • Interpretation:
    • Strong: High probability of the ribosome recognizing the start codon; optimal initiation.
    • Adequate 1: Good initiation context driven primarily by the favorable nucleotide at position -3 (the most critical position), though the +4 position is suboptimal.
    • Adequate 2: Moderate initiation context; lacks the critical purine at -3 but is partially compensated by a favorable nucleotide at +4.
    • Weak: The ribosome may skip the start codon (leaky scanning), resulting in lower protein yield.
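The four categories above can be expressed as a small decision function. This is a sketch under stated assumptions: the text does not say which nucleotide counts as favorable at +4, so the example assumes G (as in the commonly cited Kozak consensus) and exposes it as a parameter. The function name kozak_strength is hypothetical.

```python
PURINES = frozenset("AG")


def kozak_strength(utr5, cds, plus4_favorable="G"):
    """Classify the start codon context from positions -3 and +4.

    Assumption: the favorable +4 nucleotide is G; the source text does not
    specify it, so it is left configurable.
    """
    utr5 = utr5.upper().replace("T", "U")
    cds = cds.upper().replace("T", "U")
    if len(utr5) < 3 or len(cds) < 4 or not cds.startswith("AUG"):
        raise ValueError("need >= 3 nt of 5' UTR and a CDS starting with AUG")
    minus3_ok = utr5[-3] in PURINES       # -3: third base upstream of the A of AUG
    plus4_ok = cds[3] in plus4_favorable  # +4: base right after AUG
    if minus3_ok and plus4_ok:
        return "Strong"
    if minus3_ok:
        return "Adequate 1"
    if plus4_ok:
        return "Adequate 2"
    return "Weak"


if __name__ == "__main__":
    print(kozak_strength("GCCACC", "AUGGCC"))
```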
5. Homopolymers Count
  • Description: The count of homopolymer runs (repeats of the same nucleotide, e.g., AAAAA, UUUUU) at or above a minimum length (default is 5).
  • Calculation:
    • The code scans for runs of A, U, G, or C that are 5 nucleotides or longer.
  • Interpretation:
    • High Count: Long homopolymer runs can cause "ribosomal slippage" or frameshifting during translation, leading to non-functional proteins. They may also create synthesis difficulties during mRNA manufacturing.
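One way to sketch this scan is with a regular expression that matches each maximal run once. The function name and its parameters are assumptions for illustration; whether the actual tool counts maximal runs or every qualifying window is not stated in the text.

```python
import re


def count_homopolymers(seq, min_len=5, bases="AUGC"):
    """Count maximal runs of a single base that are at least min_len long."""
    s = seq.upper().replace("T", "U")
    # Builds e.g. "A{5,}|U{5,}|G{5,}|C{5,}"; each maximal run matches once.
    pattern = re.compile("|".join(f"{b}{{{min_len},}}" for b in bases))
    return sum(1 for _ in pattern.finditer(s))


if __name__ == "__main__":
    print(count_homopolymers("AAAAAGCUUUUUUGGGG"))
```

Called with min_len=4 and bases="G" on the 5' UTR only, the same helper gives a count of the kind described under parameter 7.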
6. uORF (Upstream Open Reading Frames) Count
  • Description: Detects the presence of small coding regions within the 5' Untranslated Region (5' UTR) before the main start codon.
  • Calculation:
    • Scans the 5' UTR for a Start Codon (AUG) followed by a Stop Codon (UAA, UAG, UGA) downstream.
    • It counts a uORF if the distance between Start and Stop is at least 9 nucleotides.
  • Interpretation:
    • Present: Ribosomes may translate this small upstream region and fall off before reaching the main gene, significantly reducing expression of the target protein.
    • Absent: The ribosome can proceed directly to the main start codon (preferred).
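A possible reading of the scan above is sketched below. Two points are assumptions of this example rather than statements from the text: the stop codon must be in frame with the AUG, and the distance is measured from the first base of AUG to the first base of the stop codon. The function name count_uorfs is hypothetical.

```python
STOP_CODONS = frozenset({"UAA", "UAG", "UGA"})


def count_uorfs(utr5, min_distance=9):
    """Count AUG...stop pairs in the 5' UTR separated by >= min_distance nt."""
    s = utr5.upper().replace("T", "U")
    count = 0
    for start in range(len(s) - 2):
        if s[start:start + 3] != "AUG":
            continue
        # Assumption: walk codon by codon, so the stop is in frame with AUG.
        for pos in range(start + 3, len(s) - 2, 3):
            if s[pos:pos + 3] in STOP_CODONS:
                # Assumption: distance = first base of AUG to first base of stop.
                if pos - start >= min_distance:
                    count += 1
                break  # the first in-frame stop ends this candidate uORF
    return count


if __name__ == "__main__":
    print(count_uorfs("GGAUGAAACCCUAAGG"))
```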
7. 4-Base G Homopolymers Count
  • Description: Runs of four or more consecutive guanines in the 5' UTR. Such guanine-rich sequences can form stable structural motifs.
  • Calculation:
    • The code searches for patterns of four or more consecutive Guanines (GGGG).
  • Interpretation:
    • Presence: Runs of four or more guanines can form very stable secondary structures that act as physical barriers to the ribosome, inhibiting translation initiation.
8. Toxic Motifs Count
  • Description: Specific sequence patterns known to be detrimental to cell health, mRNA stability, or manufacturing.
  • Calculation:
    • Matches the sequence against a predefined list of toxic patterns (e.g., AAUAAA, UUUUUUU, GGAGGG).
  • Interpretation:
    • Result: A count of how many times these patterns appear.
    • Implication: These motifs should be minimized as they can trigger immune responses, decrease stability, or resemble signal sequences (like poly-A signals) in the wrong place.
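Matching against a pattern list can be sketched as below. TOXIC_MOTIFS holds only the three examples named in the text, not the full list. Counting overlapping occurrences (so that a run of eight U counts as two UUUUUUU hits) is an assumption of this example; the text does not say how overlaps are handled.

```python
import re

# Assumption: only the three example patterns from the text.
TOXIC_MOTIFS = ("AAUAAA", "UUUUUUU", "GGAGGG")


def count_motifs(seq, motifs=TOXIC_MOTIFS):
    """Return per-motif counts, including overlapping occurrences."""
    s = seq.upper().replace("T", "U")
    # A zero-width lookahead matches at every start position of the motif.
    return {m: len(re.findall(f"(?={re.escape(m)})", s)) for m in motifs}


if __name__ == "__main__":
    counts = count_motifs("CCAAUAAAGGUUUUUUUUCC")
    print(counts, sum(counts.values()))
```

The same helper could be pointed at any other exact-match list, such as the slippery sites under parameter 10.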
9. AU Rich Elements (AREs) Count
  • Description: Destabilizing elements typically found in the 3' UTR (though scanned here in the whole sequence).
  • Calculation:
    • Counts occurrences of patterns like AUUUA and UUUA.
  • Interpretation:
    • High Count: AREs recruit degradation enzymes, leading to a shorter half-life for the mRNA (rapid decay). This is often undesirable if long-term protein expression is the goal.
10. Slippery Sites Count
  • Description: Specific 7-nucleotide sequences prone to causing ribosomal frameshifting.
  • Calculation:
    • Scans the sequence for exact matches to a predefined list of slippery sites (e.g., UUUUUUA, AAAAAAC).
  • Interpretation:
    • Presence: Increases the risk that the ribosome "slips" by one base during translation, changing the reading frame and producing a garbled protein sequence downstream.
11. Unwanted Codon Pairs Count
  • Description: Certain pairs of adjacent codons that interact poorly within the ribosome, slowing down translation or causing ribosome stalling.
  • Calculation:
    • The code analyzes the sequence in 6-nucleotide windows (codon pairs) and counts matches against a list of unwanted pairs (e.g., UAUCGC).
  • Interpretation:
    • High Count: Indicates a sequence that may translate inefficiently ("Codon Pair Bias"). Removing these pairs can improve translation elongation rates.
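The window scan above could look like the following sketch. It assumes the 6-nucleotide windows are in-frame pairs of adjacent codons, advancing one codon (3 nt) at a time, so an out-of-frame match is not counted. UNWANTED_PAIRS contains only the single example from the text, and the function name is hypothetical.

```python
# Assumption: only the single example pair from the text.
UNWANTED_PAIRS = frozenset({"UAUCGC"})


def count_unwanted_pairs(cds, unwanted=UNWANTED_PAIRS):
    """Count in-frame adjacent codon pairs that appear in the unwanted list."""
    s = cds.upper().replace("T", "U")
    usable = len(s) - len(s) % 3  # ignore a trailing partial codon
    # Step by 3 so every window starts on a codon boundary.
    return sum(s[i:i + 6] in unwanted for i in range(0, usable - 5, 3))


if __name__ == "__main__":
    print(count_unwanted_pairs("AUGUAUCGCGCC"))
```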