Transcript Data

Load, index, and query transcript annotations from GTF files for long-read single-cell analysis.

ClassifyBubble


def ClassifyBubble(
    bubble, attributes
):



FindBubbles


def FindBubbles(
    spliceGraph
):

Same as Wang’s FindBubbles – single pass from root to sink.
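The root-to-sink pass can be illustrated on the simplest case, two isoforms. The sketch below is a simplified stand-in, not Wang's implementation or this module's `FindBubbles`: the function name `find_bubbles`, the `ROOT`/`SINK` labels, and the use of `(start, end)` exon tuples as graph nodes are all assumptions made for illustration.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # assumed node type: an exon as (start, end)
ROOT, SINK = "root", "sink"  # hypothetical labels for the artificial end nodes


def find_bubbles(path_a: List[Node], path_b: List[Node]):
    """Walk both paths once from root to sink and report divergent regions.

    A bubble is returned as (source, target, branch_a, branch_b), where source
    and target are shared nodes and at least one branch between them is
    non-empty. Nodes are assumed unique and in genomic order on both paths.
    """
    a = [ROOT, *path_a, SINK]
    b = [ROOT, *path_b, SINK]
    pos_b = {node: i for i, node in enumerate(b)}

    bubbles = []
    i = j = 0  # both walkers start on the shared root
    while i < len(a) - 1:
        # Advance along path A to the next node that path B also visits later.
        k = i + 1
        while a[k] not in pos_b or pos_b[a[k]] <= j:
            k += 1
        m = pos_b[a[k]]
        branch_a, branch_b = a[i + 1:k], b[j + 1:m]
        if branch_a or branch_b:
            bubbles.append((a[i], a[k], branch_a, branch_b))
        i, j = k, m
    return bubbles


# A skipped middle exon yields a single bubble between the flanking exons.
full = [(100, 200), (300, 400), (500, 600)]
skipped = [(100, 200), (500, 600)]
print(find_bubbles(full, skipped))
```

Each bubble would then be handed to a classifier (the role `ClassifyBubble` plays here) to decide which kind of event the two branches represent.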


BuildSpliceGraph


def BuildSpliceGraph(
    inputDF
):



SmoothEnds


def SmoothEnds(
    inputDF
):

Same as Wang’s SmoothEnds, but operates on our small 2-row DataFrame with columns ['transcript_ID', 'chrom', 'strand', 'exonStart', 'exonEnd'].


CheckOverlap


def CheckOverlap(
    tuple1, tuple2
):



TranscriptIndex


def TranscriptIndex(
    tx_by_id:Dict, gene_to_tx:Dict, genename_to_tx:Dict, cache_key:str
):

Fast, read-only index built from a GTF/GFF. O(1) lookups for blocks & metadata.
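The constant-time lookups can be pictured with plain dictionaries keyed by transcript ID, gene ID, and gene name. The sketch below is illustrative only: `MiniRecord`, `build_index`, and the sample IDs are hypothetical stand-ins, not the library's `TxRecord` or `TranscriptIndex`, and the real index may be organized differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class MiniRecord:
    """Hypothetical, trimmed-down stand-in for a transcript record."""
    transcript_id: str
    gene_id: Optional[str]
    gene_name: Optional[str]
    exons: Tuple[Tuple[int, int], ...]  # (start, end) blocks


def build_index(records: List[MiniRecord]):
    """Build three dicts named after the tx_by_id / gene_to_tx / genename_to_tx arguments."""
    tx_by_id: Dict[str, MiniRecord] = {}
    gene_to_tx: Dict[str, List[str]] = defaultdict(list)
    genename_to_tx: Dict[str, List[str]] = defaultdict(list)
    for rec in records:
        tx_by_id[rec.transcript_id] = rec
        if rec.gene_id is not None:
            gene_to_tx[rec.gene_id].append(rec.transcript_id)
        if rec.gene_name is not None:
            genename_to_tx[rec.gene_name].append(rec.transcript_id)
    # Convert to plain dicts so a missing key raises instead of creating an entry.
    return tx_by_id, dict(gene_to_tx), dict(genename_to_tx)


records = [
    MiniRecord("tx1", "g1", "GENE1", ((100, 200), (300, 400))),
    MiniRecord("tx2", "g1", "GENE1", ((100, 200), (500, 600))),
]
tx_by_id, gene_to_tx, genename_to_tx = build_index(records)
print(tx_by_id["tx2"].exons)    # one dict lookup per transcript
print(genename_to_tx["GENE1"])  # one dict lookup per gene name
```

Building the dictionaries costs one pass over the annotation; every later query is a single hash lookup, which is what makes caching the built index worthwhile.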


TxRecord


def TxRecord(
    transcript_id:str, gene_id:Optional, gene_name:Optional, transcript_name:Optional, transcript_type:Optional,
    chrom:str, strand:int, span:Tuple, exons:ndarray, cds:ndarray, utr5:ndarray, utr3:ndarray
)->None:

classify_transcript_pair_from_index


def classify_transcript_pair_from_index(
    idx:TranscriptIndex, tid1:str, tid2:str
)->DataFrame:

Enumerate transcript-structure differences between two isoforms using Wang’s bubble algorithm, backed by our TranscriptIndex.

Returns a DataFrame with columns: ['transcript1', 'transcript2', 'event', 'coordinates']

TranscriptData


TranscriptData


def TranscriptData(
    gtf_file:str, # Path to GTF/GFF file.
    reference_fasta:Optional=None, # Path to reference genome FASTA for sequence retrieval.
    cache_dir:Optional=None, # Directory to cache the built index.
    gene_name_map:Union=None, # Mapping from transcript_id to gene_name. Useful for GFF files
    # (like PacBio) that lack gene_name attributes. Can be:
    # - Path to TSV file with columns [transcript_id, gene_name]
    # - Dict mapping transcript_id -> gene_name
    # - DataFrame (use transcript_id_col and gene_name_col to specify columns)
    # For PacBio format (e.g., 'PB.10.82:DDX11L17'), the base
    # transcript_id will be extracted for matching.
    transcript_id_col:Optional=None, # Column name for transcript_id when gene_name_map is a DataFrame.
    gene_name_col:Optional=None, # Column name for gene_name when gene_name_map is a DataFrame.
):

Fast, drop-in replacement for the previous TranscriptData implementation, backed by TranscriptIndex.
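The `gene_name_map` matching for PacBio-style IDs can be sketched as follows. This is a guess at the behaviour, not the library's code: it assumes the base transcript_id is the part before the first `:`, and the helper names `base_transcript_id` and `lookup_gene_name` are hypothetical. Only the dict form of the map is shown.

```python
from typing import Dict, Optional


def base_transcript_id(transcript_id: str) -> str:
    """Assumption: the base ID is everything before the first ':'."""
    return transcript_id.split(":", 1)[0]


def lookup_gene_name(transcript_id: str, gene_name_map: Dict[str, str]) -> Optional[str]:
    """Resolve a gene name, trying the ID as given and then its base form."""
    if transcript_id in gene_name_map:
        return gene_name_map[transcript_id]
    return gene_name_map.get(base_transcript_id(transcript_id))


# Illustrative mapping; the pairing below is sample data, not taken from a real file.
gene_name_map = {"PB.10.82": "DDX11L17"}
print(lookup_gene_name("PB.10.82:DDX11L17", gene_name_map))
print(lookup_gene_name("PB.99.1", gene_name_map))  # unmapped IDs resolve to None
```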


TranscriptData.get_transcripts_by_gene_name


def get_transcripts_by_gene_name(
    gnm:str
)->List:



TranscriptData.get_transcript_info


def get_transcript_info(
    transcript_id:str
)->dict:


Storing TranscriptData in AnnData

These helper functions allow you to store transcript annotations directly in your AnnData object. This is useful because:

  1. Convenience - No need to pass transcript_data to every plotting function
  2. Persistence - The GTF/FASTA paths are stored in adata.uns and survive write_h5ad/read_h5ad
  3. Automatic reconstruction - After loading, get_transcript_data() rebuilds the object from stored paths

get_transcript_data


def get_transcript_data(
    adata, # Annotated data object with registered transcript data
): # Transcript data object

Get TranscriptData from an AnnData object.

If the TranscriptData object is cached in adata.uns, returns it directly. Otherwise, reconstructs it from stored GTF/FASTA paths (useful after loading from h5ad).


register_transcripts


def register_transcripts(
    adata, # Annotated data object to register transcripts with
    td_or_path, # Either a TranscriptData object or path to GTF file
    reference_fasta:NoneType=None, # Path to reference FASTA file (only used if td_or_path is a path)
):

Register transcript annotations with an AnnData object.

This stores transcript data in adata.uns so it persists with the object and can be reconstructed after saving/loading from h5ad.
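The register, save, load, rebuild cycle can be sketched without AnnData by letting a plain dict stand in for `adata.uns`. Everything named here is an assumption for illustration: the keys `PATHS_KEY` and `CACHE_KEY`, the helper functions, and the `loader` callable that takes the place of the `TranscriptData` constructor. The actual keys and storage layout used by `register_transcripts` are not documented on this page.

```python
from typing import Any, Callable, Dict, Optional

PATHS_KEY = "transcript_paths"   # hypothetical key for the serializable paths
CACHE_KEY = "transcript_object"  # hypothetical key for the in-memory object

Loader = Callable[[str, Optional[str]], Any]


def register_paths(uns: Dict[str, Any], gtf_file: str,
                   reference_fasta: Optional[str], loader: Loader) -> None:
    """Store the paths (which survive saving) and cache the built object."""
    uns[PATHS_KEY] = {"gtf_file": gtf_file, "reference_fasta": reference_fasta}
    uns[CACHE_KEY] = loader(gtf_file, reference_fasta)


def get_cached_or_rebuild(uns: Dict[str, Any], loader: Loader) -> Any:
    """Return the cached object, or rebuild it from the stored paths."""
    if CACHE_KEY in uns:
        return uns[CACHE_KEY]
    paths = uns[PATHS_KEY]
    uns[CACHE_KEY] = loader(paths["gtf_file"], paths["reference_fasta"])
    return uns[CACHE_KEY]


def simulate_roundtrip(uns: Dict[str, Any]) -> Dict[str, Any]:
    """Mimic a save/load cycle in which only the plain path strings persist."""
    return {PATHS_KEY: dict(uns[PATHS_KEY])}


def toy_loader(gtf_file: str, reference_fasta: Optional[str]) -> Dict[str, Any]:
    # Stand-in for an expensive parse of the annotation file.
    return {"gtf": gtf_file, "fasta": reference_fasta}


uns: Dict[str, Any] = {}
register_paths(uns, "annotation.gtf", "genome.fa", toy_loader)
reloaded = simulate_roundtrip(uns)
print(get_cached_or_rebuild(reloaded, toy_loader))  # rebuilt from stored paths
```

One consequence of storing paths rather than the parsed annotation is that the GTF and FASTA files must still exist at the recorded locations when the object is rebuilt.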