Transcript Data
ClassifyBubble
def ClassifyBubble(
bubble, attributes
):
FindBubbles
def FindBubbles(
spliceGraph
):
Same as Wang’s FindBubbles – single pass from root to sink.
BuildSpliceGraph
def BuildSpliceGraph(
inputDF
):
SmoothEnds
def SmoothEnds(
inputDF
):
Same as Wang’s SmoothEnds, but operates on our small 2-row DataFrame with columns [‘transcript_ID’, ‘chrom’, ‘strand’, ‘exonStart’, ‘exonEnd’].
CheckOverlap
def CheckOverlap(
tuple1, tuple2
):
TranscriptIndex
def TranscriptIndex(
tx_by_id:Dict, gene_to_tx:Dict, genename_to_tx:Dict, cache_key:str
):
Fast, read-only index built from a GTF/GFF. O(1) lookups for blocks & metadata.
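The O(1) lookups mentioned above can be pictured as plain dictionary access. The following is a minimal sketch under stated assumptions, not the library's implementation: `MiniTxRecord` and `MiniIndex` are hypothetical names, the fields are a small subset loosely modelled on the `TxRecord` signature, and exons are stored as tuples rather than arrays to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MiniTxRecord:
    """Hypothetical, trimmed-down record for one transcript."""
    transcript_id: str
    gene_id: str
    chrom: str
    strand: int
    exons: Tuple[Tuple[int, int], ...]  # (start, end) pairs


class MiniIndex:
    """Hypothetical read-only index: two dicts built once, then only read."""

    def __init__(self, records: List[MiniTxRecord]) -> None:
        # transcript_id -> record, for constant-time metadata and block lookups
        self.tx_by_id: Dict[str, MiniTxRecord] = {r.transcript_id: r for r in records}
        # gene_id -> list of transcript_ids, in input order
        self.gene_to_tx: Dict[str, List[str]] = {}
        for r in records:
            self.gene_to_tx.setdefault(r.gene_id, []).append(r.transcript_id)

    def exons(self, transcript_id: str) -> Tuple[Tuple[int, int], ...]:
        return self.tx_by_id[transcript_id].exons

    def transcripts_for_gene(self, gene_id: str) -> List[str]:
        # Return a copy so callers cannot mutate the index
        return list(self.gene_to_tx.get(gene_id, []))


records = [
    MiniTxRecord("tx1", "geneA", "chr1", 1, ((100, 200), (300, 400))),
    MiniTxRecord("tx2", "geneA", "chr1", 1, ((100, 200), (350, 400))),
]
idx = MiniIndex(records)
print(idx.exons("tx1"))
print(idx.transcripts_for_gene("geneA"))
```

Both lookups are single dictionary reads, which is where the constant-time behaviour comes from; the cost of parsing the annotation is paid once, when the index is built.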
TxRecord
def TxRecord(
transcript_id:str, gene_id:Optional, gene_name:Optional, transcript_name:Optional, transcript_type:Optional,
chrom:str, strand:int, span:Tuple, exons:ndarray, cds:ndarray, utr5:ndarray, utr3:ndarray
)->None:
classify_transcript_pair_from_index
def classify_transcript_pair_from_index(
idx:TranscriptIndex, tid1:str, tid2:str
)->DataFrame:
Enumerate transcript-structure differences between two isoforms using Wang’s bubble algorithm, backed by our TranscriptIndex.
Returns a DataFrame with columns: [‘transcript1’, ‘transcript2’, ‘event’, ‘coordinates’]
TranscriptData
def TranscriptData(
gtf_file:str, # Path to GTF/GFF file.
reference_fasta:Optional=None, # Path to reference genome FASTA for sequence retrieval.
cache_dir:Optional=None, # Directory to cache the built index.
gene_name_map:Union=None, # Mapping from transcript_id to gene_name. Useful for GFF files
# (like PacBio) that lack gene_name attributes. Can be:
# - Path to TSV file with columns [transcript_id, gene_name]
# - Dict mapping transcript_id -> gene_name
# - DataFrame (use transcript_id_col and gene_name_col to specify columns)
# For PacBio format (e.g., 'PB.10.82:DDX11L17'), the base
# transcript_id will be extracted for matching.
transcript_id_col:Optional=None, # Column name for transcript_id when gene_name_map is a DataFrame.
gene_name_col:Optional=None, # Column name for gene_name when gene_name_map is a DataFrame.
):
Fast, drop-in replacement for your old TranscriptData, backed by TranscriptIndex.
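The `gene_name_map` note about PacBio identifiers can be illustrated with a short sketch. It assumes that the "base transcript_id" is the part of the identifier before the first colon, which matches the documented example 'PB.10.82:DDX11L17' but is not confirmed by the source. `base_transcript_id` and `lookup_gene_name` are hypothetical helpers, not functions of this library.

```python
from typing import Dict, Optional


def base_transcript_id(transcript_id: str) -> str:
    """Assumption: the base ID is everything before the first ':'."""
    return transcript_id.split(":", 1)[0]


def lookup_gene_name(transcript_id: str, gene_name_map: Dict[str, str]) -> Optional[str]:
    """Try the full ID first, then fall back to the base ID."""
    if transcript_id in gene_name_map:
        return gene_name_map[transcript_id]
    return gene_name_map.get(base_transcript_id(transcript_id))


# Dict form of gene_name_map: transcript_id -> gene_name
gene_name_map = {"PB.10.82": "DDX11L17"}

print(base_transcript_id("PB.10.82:DDX11L17"))
print(lookup_gene_name("PB.10.82:DDX11L17", gene_name_map))
print(lookup_gene_name("PB.99.1", gene_name_map))
```

Identifiers without a colon pass through unchanged, so the same lookup would also work for annotations that already use plain transcript IDs.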
TranscriptData.get_transcripts_by_gene_name
def get_transcripts_by_gene_name(
gnm:str
)->List:
TranscriptData.get_transcript_info
def get_transcript_info(
transcript_id:str
)->dict:
Storing TranscriptData in AnnData
These helper functions allow you to store transcript annotations directly in your AnnData object. This is useful because:
- Convenience - No need to pass transcript_data to every plotting function
- Persistence - The GTF/FASTA paths are stored in adata.uns and survive write_h5ad/read_h5ad
- Automatic reconstruction - After loading, get_transcript_data() rebuilds the object from stored paths
get_transcript_data
def get_transcript_data(
adata, # Annotated data object with registered transcript data
): # Transcript data object
Get TranscriptData from an AnnData object.
If the TranscriptData object is cached in adata.uns, returns it directly. Otherwise, reconstructs it from stored GTF/FASTA paths (useful after loading from h5ad).
register_transcripts
def register_transcripts(
adata, # Annotated data object to register transcripts with
td_or_path, # Either a TranscriptData object or path to GTF file
reference_fasta:NoneType=None, # Path to reference FASTA file (only used if td_or_path is a path)
):
Register transcript annotations with an AnnData object.
This stores transcript data in adata.uns so it persists with the object and can be reconstructed after saving/loading from h5ad.
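The register/get behaviour described in this section follows a cache-or-rebuild pattern, which can be sketched without AnnData at all. In the sketch below a plain dict stands in for adata.uns, `StubTranscriptData` stands in for TranscriptData, and the key names `transcript_paths` and `transcript_object` are invented for illustration; the library's actual keys and storage layout are not stated in the source.

```python
from typing import Any, Dict, Optional


class StubTranscriptData:
    """Hypothetical stand-in for TranscriptData; it only remembers its paths."""

    def __init__(self, gtf_file: str, reference_fasta: Optional[str] = None) -> None:
        self.gtf_file = gtf_file
        self.reference_fasta = reference_fasta


PATHS_KEY = "transcript_paths"    # hypothetical key: serializable paths
OBJECT_KEY = "transcript_object"  # hypothetical key: live in-memory object


def register(uns: Dict[str, Any], gtf_file: str, reference_fasta: Optional[str] = None):
    """Store both the paths (which persist) and the live object (which does not)."""
    td = StubTranscriptData(gtf_file, reference_fasta)
    uns[PATHS_KEY] = {"gtf_file": gtf_file, "reference_fasta": reference_fasta}
    uns[OBJECT_KEY] = td
    return td


def get(uns: Dict[str, Any]) -> StubTranscriptData:
    """Return the cached object if present, otherwise rebuild it from the paths."""
    td = uns.get(OBJECT_KEY)
    if td is None:
        paths = uns[PATHS_KEY]
        td = StubTranscriptData(paths["gtf_file"], paths["reference_fasta"])
        uns[OBJECT_KEY] = td  # cache for later calls
    return td


uns: Dict[str, Any] = {}
original = register(uns, "annotation.gtf", "genome.fa")
same = get(uns)  # cached object is returned directly

# Simulate a save/load round trip: only the serializable paths come back
reloaded_uns = {PATHS_KEY: dict(uns[PATHS_KEY])}
rebuilt = get(reloaded_uns)  # reconstructed from the stored paths
```

Keeping the paths separate from the live object is what makes the round trip work in this sketch: the paths are simple strings that can be written to disk, while the object is rebuilt on demand after loading.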