Preprocessing

Core preprocessing utilities for isoform-resolution single-cell data.

compute_whole_data_transcript_abundance


def compute_whole_data_transcript_abundance(
    adata, # AnnData object with transcript-level counts. Must have a 'geneId' column in adata.var.
): # A new AnnData object with one observation representing the aggregated data 
where X holds transcript abundance percentages.

Compute transcript abundance percentages for the entire dataset by aggregating transcript counts across all cells into a single composite sample.

This function sums the transcript counts over all cells, constructs a new AnnData with a single observation, and then computes transcript abundance percentages using compute_transcript_abundance_pct.

plot_isoform_overview


def plot_isoform_overview(
    adata, # Transcript-level AnnData with `geneId` in ``adata.var``.
    celltype_col:str='cell_type', # column in adata.obs for panel-3 grouping
    min_counts:int=1, # >0 = gene “expressed”
    figsize:tuple=(18, 6), # Figure size passed to ``plt.subplots``.
    save:Optional=None, # path to PNG/PDF if you want to save
):

Reproduce the three-panel overview:

① Stacked bar: single-isoform vs multi-isoform genes + total transcripts ② Histogram: frequency of n isoforms per gene ③ Box/strip plot: genes detected per cell, grouped by cell type

gene_wise_correlation


def gene_wise_correlation(
    adata_1, adata_2, label_1:str='Short_Reads', label_2:str='Long_Reads', density_hist:bool=True,
    facet_obs:NoneType=None, title:Optional=None, figsize:Tuple=(6, 6), dpi:int=150, save:Optional=None,
    show:bool=True, return_data:bool=False
):

Call self as a function.

cell_umi_violin


def cell_umi_violin(
    adata_1, adata_2, label_1:str='Short_Reads', label_2:str='Long_Reads', facet_obs:NoneType=None,
    title:str='UMI Distribution per Cell', figsize:Tuple=(5, 5), dpi:int=150, save:Optional=None, show:bool=True,
    return_data:bool=False
):

Call self as a function.

cell_umi_correlation


def cell_umi_correlation(
    adata_1, adata_2, label_1:str='Short_Reads', label_2:str='Long_Reads', facet_obs:NoneType=None
):

Call self as a function.

gene_wise_bland_altman


def gene_wise_bland_altman(
    adata_1, adata_2, label_1:str='Short_Reads', label_2:str='Long_Reads', facet_obs:NoneType=None,
    title:Optional=None, figsize:Tuple=(6, 5), dpi:int=150, save:Optional=None, show:bool=True,
    return_data:bool=False
):

Call self as a function.

gene_wise_by_length


def gene_wise_by_length(
    adata_1, adata_2, td, label_1:str='gene matrix', label_2:str='isoform matrix', n_quantiles:int=4,
    length_agg:str='median', gene_key:str='gene_name', plot_type:str='correlation'
):

Call self as a function.

plot_discovery_rarefaction


def plot_discovery_rarefaction(
    adata_1, adata_2, label_1:str='gene matrix', label_2:str='isoform matrix', gene_col:str='geneId',
    fractions:NoneType=None, n_iter:int=5, random_state:int=42
):

Rarefaction curve showing feature discovery as a function of cells subsampled. Three lines: 1. Unique genes detected (gene matrix) 2. Unique isoforms detected (isoform matrix) 3. Unique genes detected via isoform matrix (gene-collapsed)

plot_delta_dominance


def plot_delta_dominance(
    adata, label:str='isoform matrix', gene_col:str='geneId', gene_name_col:str='gene_name',
    cell_type_col:str='cell_type', min_isoforms:int=2, min_counts:int=20, n_label:int=25, n_heatmap:int=12
):

Call self as a function.

cell_umi_capture_ratio


def cell_umi_capture_ratio(
    adata_1, adata_2, label_1:str='gene matrix', label_2:str='isoform matrix', facet_obs:NoneType=None
):

For each shared cell, compute log2(adata_2 UMIs / adata_1 UMIs) and show as violin plot, optionally faceted by a obs column (e.g. cell_type).

Ratio < 0 means adata_2 captures fewer UMIs (expected for isoform vs gene).

plot_length_by_annotation_status


def plot_length_by_annotation_status(
    td, adata, adata_tx_col:Optional=None, layer:Optional=None, log_scale:bool=True,
    title:str='Transcript Length: Annotated vs Novel', figsize:Tuple=(10, 5), dpi:int=150, save:Optional=None,
    show:bool=True, return_data:bool=False
):

Compare length distributions between annotated and potentially novel transcripts.

Transcripts in adata that match GTF annotations are “annotated”. Those without GTF match may be novel isoforms.

plot_length_by_biotype


def plot_length_by_biotype(
    td, adata, adata_tx_col:Optional=None, biotype_col:str='transcript_type', # column in adata.var or from GTF
    top_n_biotypes:int=8, log_scale:bool=True, show_counts:bool=True, title:str='Transcript Length by Biotype',
    figsize:Optional=None, dpi:int=150, save:Optional=None, show:bool=True, return_data:bool=False
):

Plot transcript length distribution grouped by biotype (protein_coding, lncRNA, etc.).

Biotypes are retrieved from TranscriptData (GTF annotation).

plot_coverage_by_transcript_length


def plot_coverage_by_transcript_length(
    bam_path:str, # Path to BAM file.
    td, # TranscriptData object.
    length_bins:List=[(0, 1000), (1000, 2000), (2000, 5000), (5000, 100000)], # Transcript length bins as (min, max) pairs.
    n_percentiles:int=101, max_per_bin:int=500, min_reads:int=10, title:str='Coverage by Transcript Length',
    figsize:Tuple=(10, 5), dpi:int=150, save:Optional=None, show:bool=True, return_data:bool=False
):

Gene body coverage stratified by transcript length.

OPTIMIZED: Uses count_coverage for fast pre-filtering.

LRGASP found: “coverage strongly dropped towards the 5’ end for the longest molecules” This reveals if longer transcripts have worse 5’ capture.

plot_read_length_vs_position


def plot_read_length_vs_position(
    bam_path:str, # Path to BAM file.
    td, # TranscriptData object.
    max_transcripts:int=2000, min_length:int=500, min_reads:int=10, title:str='Read Length vs Transcript Position',
    figsize:Tuple=(10, 5), dpi:int=150, save:Optional=None, show:bool=True, return_data:bool=False
):

Analyze correlation between read length and position along transcript.

Reveals if truncated reads are biased to certain positions: - Short reads at 3’ = RT fell off early (5’ dropout) - Short reads uniform = random degradation - Long reads at 3’ = good full-length capture

plot_full_length_ratio


def plot_full_length_ratio(
    bam_path:str, # Path to BAM file.
    td, # TranscriptData object.
    tss_tolerance:int=50, # How close to annotated TSS to count as "reaching TSS" (default 50bp).
    tes_tolerance:int=50, # How close to annotated TES to count as "reaching TES" (default 50bp).
    max_transcripts:int=5000, min_length:int=500, min_reads:int=5, title:str='Full-Length Transcript Capture',
    figsize:Tuple=(10, 5), dpi:int=150, save:Optional=None, show:bool=True, return_data:bool=False
):

Calculate the percentage of reads that are full-length (TSS to TES).

This is THE key metric for long-read RNA-seq - what fraction of reads span from the annotated TSS to TES?

plot_tes_enrichment


def plot_tes_enrichment(
    bam_path:str, # Path to BAM file.
    td, # TranscriptData object.
    window:int=2000, # Window size around TES (default ±2000bp).
    n_bins:int=100, max_transcripts:int=5000, min_length:int=500, title:str='TES Enrichment (Poly-A Site)',
    figsize:Tuple=(8, 5), dpi:int=150, save:Optional=None, show:bool=True, return_data:bool=False
):

Calculate TES (Transcription End Site / Poly-A site) enrichment.

For poly-A primed single-cell data, reads should be enriched at TES. This measures how well the 3’ end is captured.

plot_tss_enrichment


def plot_tss_enrichment(
    bam_path:str, # Path to BAM file.
    td, # TranscriptData object.
    window:int=2000, # Window size around TSS (default ±2000bp).
    n_bins:int=100, # Number of bins for the profile.
    max_transcripts:int=5000, min_length:int=500, title:str='TSS Enrichment', figsize:Tuple=(8, 5), dpi:int=150,
    save:Optional=None, show:bool=True, return_data:bool=False
):

Calculate TSS (Transcription Start Site) enrichment score.

For long-read data, this measures how well reads capture the true 5’ end. High TSS enrichment = good full-length capture.

TSS enrichment score (ENCODE definition): Signal at TSS / Signal at flanks

plot_read_start_positions


def plot_read_start_positions(
    bam_path:str, td, n_bins:int=100, max_transcripts:int=2000, min_reads:int=10, min_length:int=500,
    title:str="Read Start Position Distribution (5' → 3')", figsize:Tuple=(10, 6), dpi:int=150, save:Optional=None,
    show:bool=True, return_data:bool=False
):

Plot where reads START along the transcript body (5’ to 3’).

Unlike coverage (which is flattened by long reads), read start positions reveal the true 3’ bias from poly-A priming.

For poly-A primed single-cell data, expect reads to START near the 3’ end.

plot_gene_coverage_single


def plot_gene_coverage_single(
    bam_path:str, # Path to BAM file.
    td, # TranscriptData object.
    gene:str, # Gene name or gene ID to plot.
    transcript_id:Optional=None, # Specific transcript ID. If None, uses longest transcript for the gene.
    n_percentiles:int=101, # Number of percentile points (default 101).
    strand_specificity:str='NONE', # "NONE", "FIRST_READ", or "SECOND_READ".
    show_reads:bool=False, # If True, show individual read positions as a heatmap below coverage.
    title:Optional=None, figsize:Tuple=(10, 5), dpi:int=150, save:Optional=None, show:bool=True,
    return_data:bool=False
):

Plot gene body coverage for a SINGLE gene.

Useful for inspecting coverage patterns of specific genes of interest.

plot_gene_coverage


def plot_gene_coverage(
    bam_path:str, # Path to BAM file (must be coordinate-sorted and indexed).
    td, # TranscriptData object with transcript annotations.
    adata:NoneType=None, # If provided, uses adata.var[adata_rank_col] to select top transcripts.
This is the FASTEST option - skips BAM scanning for ranking.
    n_threads:int=4, # Threads for BAM decompression (default 4).
    n_percentiles:int=101, min_length:int=500, end_bias_bases:int=100,
    top_n_transcripts:int=1000, # Number of top transcripts to analyze (default 1000).
    max_depth:int=5000000, # Skip transcripts with depth > this value (default 5M). Avoids hotspots.
    quick_mode:bool=True, # If True and adata not provided, only sample first 100bp for ranking.
    normalize_by:str='max', title:str='Gene Body Coverage', figsize:Tuple=(12, 5), dpi:int=150, save:Optional=None,
    show:bool=True, return_data:bool=False
):

Gene body coverage replicating Picard CollectRnaSeqMetrics algorithm.

FAST VERSION with multiple optimization strategies: - Option 1: Use adata expression to pre-select top transcripts (fastest) - Option 2: Skip hotspots with depth > max_depth - Option 3: Quick mode - sample first 100bp for ranking (default)

plot_gene_body_coverage_comparison


def plot_gene_body_coverage_comparison(
    bam_paths:dict, # Dictionary mapping sample names to BAM file paths.
    td, n_bins:int=100, max_transcripts:int=1500, min_reads:int=10, min_length:int=500,
    stranded:bool=True, # If True, only count reads matching transcript strand (default True).
    title:str='Gene Body Coverage Comparison', figsize:Tuple=(10, 6), dpi:int=150, save:Optional=None,
    show:bool=True, return_data:bool=False
):

Compare gene body coverage profiles across multiple samples.

OPTIMIZED: Uses count_coverage for fast pre-filtering.

plot_gene_body_coverage


def plot_gene_body_coverage(
    bam_path:str, # Path to BAM file (genome-aligned).
    td, # TranscriptData object with transcript annotations.
    n_bins:int=100, # Number of bins along transcript body (default 100).
    max_transcripts:int=2000, # Maximum transcripts to sample for speed.
    min_reads:int=10, # Minimum reads per transcript to include.
    min_length:int=500, # Minimum transcript length (bp) to include.
    stranded:bool=True, # If True, only count reads matching transcript strand (default True).
For single-cell RNA-seq with poly-A priming, should be True.
    title:str="Gene Body Coverage (5' → 3')", figsize:Tuple=(10, 6), dpi:int=150, save:Optional=None, show:bool=True,
    return_data:bool=False
):

Plot gene body coverage profile from 5’ to 3’ end using BAM file.

OPTIMIZED: Uses count_coverage for fast pre-filtering before detailed analysis.

Works with GENOME-aligned BAMs by mapping genomic coordinates to transcript positions.

plot_end_bias_note


def plot_end_bias_note(
    
):

Note: 5’/3’ end bias analysis requires read-level data (BAM files).

To assess whether transcripts are truncated at the 5’ or 3’ end, you need to analyze the coverage profile across transcript bodies from aligned reads.

Recommended tools for this analysis: - RSeQC (geneBody_coverage.py) - Picard CollectRnaSeqMetrics - Custom analysis with pysam

What to look for: - Flat coverage = good full-length capture - 3’ bias (common in poly-A selection) = higher coverage at 3’ end - 5’ dropout (cDNA synthesis issue) = lower coverage at 5’ end

This function is a placeholder - implement with BAM data if available.

plot_length_bias_by_expression


def plot_length_bias_by_expression(
    td, # TranscriptData object.
    adata, # AnnData object with transcript-level data.
    adata_tx_col:Optional=None, layer:Optional=None, n_expr_bins:int=5, # Number of expression level bins.
    title:str='Length Bias by Expression Level', figsize:Tuple=(12, 5), dpi:int=150, save:Optional=None,
    show:bool=True, return_data:bool=False
):

For transcripts at similar expression levels, are longer ones detected in fewer cells?

This controls for expression level to reveal true length bias in detection. If there’s length bias, within each expression bin, longer transcripts should have lower detection rates.

plot_length_bias


def plot_length_bias(
    td, # TranscriptData object (contains all GTF transcripts).
    adata, # AnnData object with detected transcripts.
    adata_tx_col:Optional=None, # Column in adata.var containing transcript IDs.
    n_bins:int=25, # Number of bins for the histogram.
    title:str='Length Bias Assessment', figsize:Tuple=(12, 5), dpi:int=150, save:Optional=None, show:bool=True,
    return_data:bool=False
): # 1. Overlaid histograms (GTF vs Detected)
2. Enrichment ratio (Detected/Expected) by length bin
3. Cumulative distribution comparison

Compare length distribution of ALL transcripts in GTF vs those DETECTED in your data.

This reveals systematic length bias - if longer transcripts are under-represented in your detected set compared to the annotation, you have length bias.

plot_length_expression_correlation


def plot_length_expression_correlation(
    td, adata, adata_tx_col:Optional=None, layer:Optional=None, color_by:str='density', log_expression:bool=True,
    log_length:bool=True, # Changed default to True
    min_expression:float=0, # New: filter low expression
    show_marginals:bool=True, # New: marginal histograms
    title:str='Transcript Length vs Expression', figsize:Tuple=(8, 8), dpi:int=150, save:Optional=None,
    show:bool=True, return_data:bool=False
):

Scatter plot of transcript length vs mean expression with marginal histograms.

Useful for identifying length biases in detection.

plot_biotype_by_group


def plot_biotype_by_group(
    td, # TranscriptData object.
    adata, # AnnData object with transcript-level data.
    groupby:str, # Column in adata.obs to group by.
    adata_tx_col:Optional=None, layer:Optional=None, min_total:int=1,
    top_n_biotypes:int=6, # Number of biotypes to show (rest grouped as "other").
    normalize:bool=True, # If True, show percentages (stacked to 100%). If False, show counts.
    title:Optional=None, palette:Optional=None, figsize:Optional=None, dpi:int=150, save:Optional=None,
    show:bool=True, return_data:bool=False
):

Stacked bar chart showing biotype composition across groups (e.g., cell types).

plot_biotype_breakdown


def plot_biotype_breakdown(
    td, # TranscriptData object.
    adata, # AnnData object with transcript-level data.
    adata_tx_col:Optional=None, # Column in adata.var containing transcript IDs.
    top_n:int=8, # Number of top biotypes to show (rest grouped as "other").
    title:str='Transcript Biotype Breakdown', colors:Optional=None, figsize:Tuple=(7, 5), dpi:int=150,
    save:Optional=None, show:bool=True, return_data:bool=False
):

Bar chart showing breakdown of transcripts by biotype (protein_coding, lncRNA, etc.).

Similar style to the isoforms-per-gene plot.

plot_detected_transcript_lengths


def plot_detected_transcript_lengths(
    td, # TranscriptData object.
    adata, # AnnData object with transcript-level data.
    adata_tx_col:Optional=None, # Column in adata.var containing transcript IDs.
    log_scale:bool=True, # Whether to log10-transform lengths.
    show_stats:bool=True, # Whether to show summary statistics.
    bins:int=50, # Number of histogram bins.
    title:str='Detected Transcript Lengths', color:Optional=None, figsize:Tuple=(8, 5), dpi:int=150,
    save:Optional=None, show:bool=True, return_data:bool=False
):

Simple histogram of annotated lengths for all detected transcripts.

Shows the length distribution of transcripts present in your dataset based on GTF annotations.

plot_transcript_length_distribution


def plot_transcript_length_distribution(
    td, # TranscriptData object.
    adata, # AnnData object with transcript-level data.
    groupby:Optional=None, # Column in adata.obs to group by.
    adata_tx_col:Optional=None, layer:Optional=None, log_scale:bool=True, # Changed default to True
    plot_type:str='violin', # "violin", "box", or "ridge".
    show_stats:bool=True, # New: show statistical annotations
    title:Optional=None, palette:Optional=None, figsize:Optional=None, dpi:int=150, save:Optional=None,
    show:bool=True, return_data:bool=False
):

Plot distribution of transcript lengths, optionally grouped by cell type.

plot_isoforms_per_gene_by_group


def plot_isoforms_per_gene_by_group(
    td, # TranscriptData object from allos.transcript_data.
    adata, # AnnData object with transcript-level data.
    groupby, # Column(s) in adata.obs to group by.
    layer:Optional=None, # Layer to use for counts. If None, uses 'counts' layer or X.
    min_total:int=1, # Minimum total count for a transcript to be considered present.
    adata_tx_col:Optional=None, # Column in adata.var containing transcript IDs.
    title:Optional=None, # Plot title. If None, auto-generated.
    palette:Union=None, # Colors for groups. Pass a list (ordered by group) or a dict
mapping {group_name: color}. If None, uses ALLOS palette.
    figsize:Optional=None, # Figure size. If None, auto-calculated.
    dpi:int=150, # Figure DPI.
    bar_mode:str='grouped', # 'grouped' for side-by-side bars, 'faceted' for separate subplots.
    show_legend:bool=True, # Whether to show legend (grouped mode only).
    save:Optional=None, # Path to save figure.
    show:bool=True, # Whether to display the plot.
    return_data:bool=False, # If True, return (dict, fig). Otherwise return None.
): # If return_data=True: (dict with counts/percent/totals, figure)

Plot isoforms per gene by group using TranscriptData object.

plot_isoforms_per_gene


def plot_isoforms_per_gene(
    td, # TranscriptData object from allos.transcript_data.
    adata:NoneType=None, # If provided, filter to transcripts present in adata.
    adata_tx_col:NoneType=None, # Column in adata.var containing transcript IDs. If None, uses var.index.
    title:str='Isoforms per Gene', # Plot title.
    colors:Optional=None, # Colors for bars. If None, uses ALLOS palette.
    show_percentages:bool=True, # Whether to show percentage labels on bars.
    figsize:Tuple=(5, 4), # Figure size.
    dpi:int=150, # Figure DPI.
    save:Optional=None, # Path to save figure.
    show:bool=True, # Whether to display the plot.
    return_data:bool=False, # If True, return (DataFrame, fig, ax). Otherwise return None.
): # If return_data=True: (DataFrame with bin counts, figure, axes)

Plot isoform distribution using TranscriptData object.

compute_gene_body_coverage


def compute_gene_body_coverage(
    bam_path:str, td, n_bins:int=101, # Picard uses 0-100 inclusive = 101 values
    min_length:int=500, top_n:int=1000, n_workers:int=20, verbose:bool=True
)->CoverageResult:

Gene body coverage matching Picard’s algorithm exactly.

CoverageResult


def CoverageResult(
    coverage:ndarray, five_prime_bias:float, three_prime_bias:float, ratio:float, n_transcripts:int, n_bins:int,
    total_records:int, mapped_reads:int, elapsed_seconds:float, bam_path:str
)->None:

Transcript Length Analysis

plot_transcript_length_abundance


def plot_transcript_length_abundance(
    td, adata, adata_tx_col:Optional=None, bins:int=60, y_axis:str='percent', # "percent", "count", or "density"
    title:str='Transcript Length vs Abundance', figsize:Tuple=(8, 5), dpi:int=150, save:Optional=None
):

Call self as a function.

API reference

plot_isoform_complexity


def plot_isoform_complexity(
    adata, # transcript-level AnnData
    label:str='isoform matrix', gene_col:str='geneId',
    facet_obs:NoneType=None, # e.g. "cell_type" — if None, uses pseudobulk across all cells
    min_isoforms:int=2, # only analyse genes with >= N isoforms
    min_counts:int=10, # minimum total counts per gene to include
):

For each gene with >= min_isoforms, compute across cells (or per cell type): - Dominant isoform fraction: counts of most expressed isoform / total gene counts - Shannon entropy: -sum(p * log2(p)) across isoforms (0 = one isoform dominates, log2(n_isoforms) = perfectly even) - Normalised entropy: entropy / log2(n_isoforms) — 0 to 1 scale

Shows distribution of these metrics across genes.

filter_transcripts_by_abundance


def filter_transcripts_by_abundance(
    adata, # AnnData object with transcript-level counts. Must have a 'geneId' column in adata.var.
    threshold_pct, # Minimum overall transcript abundance percentage required to keep a transcript.
    verbose:bool=False, # If True, prints the number of transcripts kept and the threshold used. Default is False.
): # A new AnnData object with the filtered transcript matrix.

Filter transcripts from an AnnData object based on their overall transcript abundance percentage, computed by aggregating transcript counts across all cells and leveraging overall gene counts.

The overall abundance percentage for each transcript is calculated as:

overall_pct = (total transcript count) / (total gene count for its gene) * 100

This function uses the compute_whole_data_transcript_abundance function internally to leverage the gene mapping information (via the ‘geneId’ in adata.var). Transcripts with an overall abundance percentage below the specified threshold (threshold_pct) are filtered out.

compute_transcript_abundance_pct


def compute_transcript_abundance_pct(
    adata, # AnnData object with transcript-level counts. Must have a column 'geneId' in adata.var.
): # A new AnnData object with the same obs and var as the input, where X holds 
transcript abundance percentages.

Compute transcript abundance percentages (“percent spliced in”) for each transcript within its gene. For each cell and transcript, this is computed as:

transcript_pct = (transcript count / total gene count) * 100

get_sot_gene_matrix


def get_sot_gene_matrix(
    adata, # Input AnnData with transcript-level counts. Must have a column `geneId`
in `adata.var` containing the gene ID for each transcript.
): # A new AnnData object where columns (var) represent unique genes, 
and values are aggregated transcript counts.

Construct a gene-level count matrix from transcript-level data.

subset_common_cells


def subset_common_cells(
    dataset1:AnnData, # First dataset to be subsetted.
    dataset2:AnnData, # Second dataset to compare with.
)->AnnData: # Subset of `dataset1` containing only cells also found in `dataset2`.

Subset dataset1 to only include cells that are also present in dataset2.

transfer_obs


def transfer_obs(
    dataset1:AnnData, # Source AnnData object with .obs metadata to transfer.
    dataset2:AnnData, # Target AnnData object to receive .obs metadata.
)->AnnData: # The modified `dataset2` with .obs from `dataset1` transferred.

Transfer .obs metadata from dataset1 to dataset2 one by one, while preserving the .var DataFrame of dataset2.