Allos

An Isoform resoloution single cell RNA-seq toolkit.

Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular diversity by allowing the study of gene expression at the individual cell level. However, traditional methods of quantification obscure the most fine-grained transcriptional layer—the individual transcripts at sequence-level resolution—in favor of gene-level binning. This strategy is useful when the sequencing read length is short, as it avoids the difficult task of transcript assembly, which is made even more challenging by the nature of single-cell data.

Traditional scRNA-seq methods often rely on short-read sequencing technologies, which can limit the ability to resolve full-length transcripts and isoforms. Long-read sequencing technologies, such as those provided by Oxford Nanopore and PacBio, overcome this limitation by producing reads that span entire transcripts. This capability is crucial for accurately identifying and characterizing novel isoforms and understanding the complexity of transcriptomes in single cells. By integrating long-read sequencing with single-cell approaches, researchers can gain deeper insights into cellular heterogeneity and the functional implications of transcriptomic variations.

Allos is designed to give users familiar with single-cell analysis, and indeed object-oriented programming in general, an extended toolkit wrapping around Scanpy and the wider scverse to facilitate analysis workflows of this type of data. Allos is the culmination of many smaller individualized scripts packaged together into a common framework centered on anndata objects. It also contains many recreations (of varying faithfulness to their originals*) of functions proposed by others in other libraries, languages, or contexts—we by no means intend to plagiarize these methods but only expand them to a wider audience.

Allos intends to offer a full suite of modules for every step of single-cell isoform resolution data, from preprocessing, working with annotations, plotting, identifying differential features, isoform switches, and more. We hope to expand Allos with the input of our user base as the field of long-read single-cell further matures.

Install

pip install allos

Basic workflow

The first thing we need to work with Allos is data. Our goal is for Allos to handle many, if not all, types of single-cell isoform resolution data, making it platform and protocol agnostic. While it is important to consider the biases, strengths, limitations, and caveats of each approach, all methods utilize a transcript matrix from which a gene matrix can be derived. Allos also allows users to supply their own custom annotations in the form of a GTF or use a reference annotation. If a reference annotation is provided, all plotting functions will use the underlying GTF to retrieve the transcript information.

To help you follow the basic workflow, we provide an easy way to download one of our test datasets. The Sicelore dataset comprises 1,121 single-cell transcriptomes from two technical replicates derived from an embryonic day 18 (E18) mouse brain, consisting of 951 cells and 190 cells. The dataset was generated using the 10x Genomics Chromium system and sequenced with both Oxford Nanopore and Illumina platforms, producing 322 million Nanopore reads and 70 million Illumina reads for the 951-cell replicate, and 32 million Nanopore reads with 43 million Illumina reads for the 190-cell replicate. The dataset captured a median of 2,427 genes and 6,047 UMIs per cell, enabling the identification of 33,002 annotated transcript isoforms and 4,388 novel isoforms. Major cell types in the dataset include radial glia, cycling radial glia, intermediate progenitors, Cajal-Retzius cells, and maturing GABAergic and glutamatergic neurons, providing a detailed resource for studying transcriptome-wide alternative splicing, isoform expression, and sequence diversity during mouse brain development.

import allos.preprocessing as pp
sicelore_mouse_data = pp.process_mouse_data()


🔎 Looking for file at: /data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/allos/resources/e18.mouse.clusters.csv
✅ File found at: /data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/allos/resources/e18.mouse.clusters.csv
✅ File already exists at: /data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/allos/resources/data/mouse_1.txt.gz

🔄 Decompressing /data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/allos/resources/data/mouse_1.txt.gz to /data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/allos/resources/data/mouse_1.txt...
✅ Decompression complete.
Test data (mouse_1) downloaded successfully
✅ File already exists at: /data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/allos/resources/data/mouse_2.txt.gz

🔄 Decompressing /data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/allos/resources/data/mouse_2.txt.gz to /data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/allos/resources/data/mouse_2.txt...
✅ Decompression complete.
Test data (mouse_2) downloaded successfully

/data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1756: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")

In this dataset, we observe the mouse data organized in a cell-by-transcript matrix format. This structure allows us to examine the expression levels of individual transcripts across different cells. Additionally, each transcript is associated with a specific gene, providing a hierarchical view of gene expression. This organization is crucial for understanding the relationship between transcripts and their corresponding genes.

sicelore_mouse_data

View of AnnData object with n_obs × n_vars = 1109 × 31986
    obs: 'batch', 'cell_type'
    var: 'geneId'

print(sicelore_mouse_data.var.head(10))

                       geneId
transcriptId                 
ENSMUST00000156717.1     Klc2
ENSMUST00000212520.1   Capn15
ENSMUST00000025798.12    Klc2
ENSMUST00000231280.1    Eva1c
ENSMUST00000039286.4     Atg5
ENSMUST00000144552.7   Znhit3
ENSMUST00000112304.8    Ppm1b
ENSMUST00000162041.7     Gcc2
ENSMUST00000053506.6     Bbs1
ENSMUST00000028207.12    Crat

Lets take a look at our data. An AnnData object is a core data structure used in Scanpy to store single-cell data, including gene or isoform expression counts. It organizes the expression matrix alongside rich metadata: cell-level annotations (.obs), feature-level annotations (.var), and optional layers like normalized data, embeddings (.obsm), and graphs (.obsp). It supports both in-memory and on-disk (HDF5-backed) storage, making it efficient for large-scale single-cell transcriptomics datasets.

sicelore_mouse_data.obs.cell_type.unique()

array(['mature Glutamatergic', 'Cajal-Retzius', 'imature Glutamatergic',
       'intermediate progenitor', 'radial glia', 'cycling radial glia',
       'imature GABAergic', 'mature GABAergic'], dtype=object)

IsoAdata objects are fully compatible with standard AnnData objects. We can use them just like any conventional AnnData instance. Let’s take a closer look at how they function.

The .var of an anndata object is a Pandas dataframe. Pandas is a Python library for data manipulation and analysis, built around two core structures: DataFrame (2D tables) and Series (1D arrays). It provides fast, intuitive tools for filtering, transforming, summarizing, and joining structured data. This makes it really easy to preform any manipulations we may need. Lets get our .var in a more usable format.

Lets take a quick look at an individual gene CLTA and examine some of its transcripts.

# sanity check
print("\nAll CLTA isoforms still present:")
display(sicelore_mouse_data.var.query("geneId == 'Clta'").head())


All CLTA isoforms still present:

	geneId
transcriptId
ENSMUST00000107851.9	Clta
ENSMUST00000107845.3	Clta
ENSMUST00000107846.9	Clta
ENSMUST00000170241.7	Clta
ENSMUST00000107849.9	Clta

import scanpy as sc

# Make a copy of the original data to preserve it
original_sicelore_mouse_data = sicelore_mouse_data.copy()

# Perform the operations
sc.pp.normalize_total(sicelore_mouse_data, target_sum=1e6)
sc.pp.log1p(sicelore_mouse_data)

# Select the top 5000 highly variable genes
sc.pp.highly_variable_genes(sicelore_mouse_data, n_top_genes=5000)
sicelore_mouse_data = sicelore_mouse_data[:, sicelore_mouse_data.var.highly_variable]

sc.pp.neighbors(sicelore_mouse_data)
sc.tl.umap(sicelore_mouse_data)
sc.pl.umap(sicelore_mouse_data, color='cell_type')

# Restore the original data
sicelore_mouse_data = original_sicelore_mouse_data

/home/mcandrew/.local/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:207: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)
/home/mcandrew/.local/lib/python3.10/site-packages/scanpy/tools/_utils.py:41: UserWarning: You’re trying to run this on 5000 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
  warnings.warn(

We can easily collapse the transcript matrix to a gene matrix and see how the dimensionality reduction plot differs.

import allos.preprocessing as pp

gene_anndata = pp.get_sot_gene_matrix(sicelore_mouse_data)

# Make a copy of the gene matrix data to preserve it
original_gene_anndata = gene_anndata.copy()

# Perform the operations on the gene matrix
sc.pp.normalize_total(gene_anndata, target_sum=1e6)
sc.pp.log1p(gene_anndata)

# Select the top 5000 highly variable genes for gene_anndata
sc.pp.highly_variable_genes(gene_anndata, n_top_genes=2000)
gene_anndata = gene_anndata[:, gene_anndata.var.highly_variable]

sc.pp.neighbors(gene_anndata)
sc.tl.umap(gene_anndata)
sc.pl.umap(gene_anndata, color='cell_type')

# Restore the original gene matrix data
gene_anndata = original_gene_anndata

/home/mcandrew/.local/lib/python3.10/site-packages/scanpy/tools/_utils.py:41: UserWarning: You’re trying to run this on 2014 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
  warnings.warn(

Let’s examine the most differentially expressed genes across the pre-annotated cell types.

sc.pp.normalize_total(gene_anndata, target_sum=1e6)
sc.pp.log1p(gene_anndata)
# Perform differential expression analysis to rank genes on the gene matrix
sc.tl.rank_genes_groups(gene_anndata, 'cell_type', method='t-test')

# Plot the ranked gene groups with adjusted figure size
sc.pl.rank_genes_groups(gene_anndata, n_genes=20, sharey=False, figsize=(12, 10))

sc.pl.rank_genes_groups_heatmap(gene_anndata, show_gene_labels=True)

WARNING: dendrogram data not found (using key=dendrogram_cell_type). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

/home/mcandrew/.local/lib/python3.10/site-packages/scanpy/tools/_utils.py:41: UserWarning: You’re trying to run this on 12561 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
  warnings.warn(

Now, let’s examine the differentially expressed transcripts. Unlike our previous focus on genes, we will shift our attention to the transcript level. The finest resoloution transcriptional layer. We should still see considerable gene wise overlap in the two rankings as we would expect in many cases the higher gene count to be driven by a single isoform.

# Concatenate the gene name before the transcript ID in the transcript matrix
transcript_matrix = sicelore_mouse_data.copy()

# Convert 'geneId' to string to avoid TypeError when concatenating with index
transcript_matrix.var['geneId'] = transcript_matrix.var['geneId'].astype(str)

# Assuming 'gene_name' and 'transcript_id' are columns in the var DataFrame of the AnnData object
transcript_matrix.var['gene_transcript_id'] = transcript_matrix.var['geneId'] + '_' + transcript_matrix.var.index.astype(str)

# Truncate the gene_transcript_id to fit better in plots
transcript_matrix.var['gene_transcript_id'] = transcript_matrix.var['gene_transcript_id'].str.slice(0, 20)

# Update the var index to use the new truncated gene_transcript_id
transcript_matrix.var.index = transcript_matrix.var['gene_transcript_id']

transcript_matrix.var_names_make_unique()

# Log transform and CPM normalize the transcript matrix
sc.pp.normalize_total(transcript_matrix, target_sum=1e6)
sc.pp.log1p(transcript_matrix)

# Perform differential expression analysis to rank genes on the transcript matrix
sc.tl.rank_genes_groups(transcript_matrix, 'cell_type', method='t-test')

# Plot the ranked gene groups with adjusted figure size
sc.pl.rank_genes_groups(transcript_matrix, n_genes=20, sharey=False, figsize=(12, 10))

sc.pl.rank_genes_groups_heatmap(transcript_matrix, show_gene_labels=True)

WARNING: dendrogram data not found (using key=dendrogram_cell_type). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

/home/mcandrew/.local/lib/python3.10/site-packages/scanpy/tools/_utils.py:41: UserWarning: You’re trying to run this on 31986 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
  warnings.warn(

Lets visualise the isoforms of a specific gene Nnat based on multiple transcripts from this gene in the top rankings of imature GABAergic vs rest.

# Perform the operations on a copy of the sicelore_mouse_data
sicelore_mouse_data_copy = sicelore_mouse_data.copy()

sc.pp.normalize_total(sicelore_mouse_data_copy, target_sum=1e6)
sc.pp.log1p(sicelore_mouse_data_copy)

sc.pp.neighbors(sicelore_mouse_data_copy)
sc.tl.umap(sicelore_mouse_data_copy)

/home/mcandrew/.local/lib/python3.10/site-packages/scanpy/tools/_utils.py:41: UserWarning: You’re trying to run this on 31986 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
  warnings.warn(

import allos.visuals as vs
vs.plot_transcripts(sicelore_mouse_data_copy, gene_id='Nnat')

import allos.visuals as vs
vs.plot_transcripts(sicelore_mouse_data_copy, gene_id='Pkm')

from allos.switch_search import SwitchSearch

ss = SwitchSearch(sicelore_mouse_data, n_jobs=20)

DIU_results = ss.find_switches_chi2(
            primary_col="cell_type",
            within="primary",
            return_transcript_metrics=False,
            calc_effect_size=True,
            effect_size_mode="signed")

DIU_results.sort_values(by='adj_pval').head(20)

	gene_id	group_1	group_2	chi2_stat	p_value	dof	adj_pval	effect_size
132	Pkm	cycling radial glia	mature Glutamatergic	1143.749269	1.024230e-250	1	3.196623e-247	0.660202
376	Pkm	intermediate progenitor	mature Glutamatergic	806.015722	2.655468e-177	1	4.143857e-174	0.645525
10	Rps24	Cajal-Retzius	mature Glutamatergic	588.590625	5.075377e-130	1	5.280084e-127	0.266358
51	Pkm	cycling radial glia	imature Glutamatergic	561.650730	3.677313e-124	1	2.869223e-121	0.518047
107	Clta	cycling radial glia	mature Glutamatergic	493.165339	2.917931e-109	1	1.821372e-106	-0.551298
6	Rps24	Cajal-Retzius	imature Glutamatergic	410.352750	3.071439e-91	1	1.597660e-88	-0.265810
91	Pkm	cycling radial glia	mature GABAergic	364.909877	2.401553e-81	1	1.070750e-78	0.536311
257	Pkm	imature Glutamatergic	intermediate progenitor	347.045272	1.864580e-77	1	7.274195e-75	-0.503371
221	Pkm	imature GABAergic	mature Glutamatergic	298.938009	5.612351e-67	1	1.946239e-64	-0.335503
440	Myl6	mature Glutamatergic	radial glia	278.406919	1.670166e-62	1	5.212587e-60	-0.457042
33	Clta	cycling radial glia	imature Glutamatergic	271.440157	5.508539e-61	1	1.562923e-58	0.488458
78	Clta	cycling radial glia	mature GABAergic	255.383636	1.741035e-57	1	4.528141e-55	0.535687
364	Clta	intermediate progenitor	mature Glutamatergic	248.347904	5.951361e-56	1	1.428784e-53	0.392436
357	Pkm	intermediate progenitor	mature GABAergic	224.882542	7.788047e-51	1	1.736178e-48	0.521634
455	Tecr	mature Glutamatergic	radial glia	210.371235	1.139917e-47	1	2.371787e-45	-0.440902
8	Nnat	Cajal-Retzius	mature GABAergic	214.270403	2.963283e-47	2	5.780253e-45	-0.224859
106	Cdc42	cycling radial glia	mature Glutamatergic	196.089166	1.490406e-44	1	2.736211e-42	-0.406290
372	Myl6	intermediate progenitor	mature Glutamatergic	192.934738	7.273883e-44	1	1.261210e-41	-0.387228
9	Nnat	Cajal-Retzius	mature Glutamatergic	194.473950	5.895470e-43	2	9.684085e-41	-0.173092
5	Nnat	Cajal-Retzius	imature Glutamatergic	184.439226	8.902890e-41	2	1.389296e-38	-0.164863

vs.plot_transcript_exspression_dotplot(sicelore_mouse_data_copy, gene_id='Clta')

/data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/allos/visuals.py:474: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_xticklabels(new_labels, rotation=45, ha='right', fontsize=11)

vs.plot_transcript_expression_violin(sicelore_mouse_data_copy, gene_id='Clta')

vs.plot_transcript_exspression_dotplot(sicelore_mouse_data_copy, gene_id='Akap9')

/data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/allos/visuals.py:474: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_xticklabels(new_labels, rotation=45, ha='right', fontsize=11)

Let’s explore these transcripts of interest by visualizing them using the method described below. We will instantiate our transcript plot objects with the appropriate GTF file. For demonstration purposes, we provide the code to download the GTF file used in this dataset. This process will help us better understand the structure of the transcripts involved in this Isoform switch. Please ensure that you have a stable internet connection to download the GTF file if it’s not already available locally.

from allos.transcript_plots import TranscriptPlots
import os
import urllib.request
from pathlib import Path

# Example Ensembl URLs for mouse GRCm39 (release 109)
gtf_url = "ftp://ftp.ensembl.org/pub/release-109/gtf/mus_musculus/Mus_musculus.GRCm39.109.gtf.gz"

# Store data one directory back
data_dir = Path("..") / "data"
data_dir.mkdir(parents=True, exist_ok=True)

gtf_file_local = data_dir / "Mus_musculus.GRCm39.109.gtf.gz"

# Download if not already present
if not gtf_file_local.is_file():
    print(f"Downloading {gtf_url}...")
    urllib.request.urlretrieve(gtf_url, gtf_file_local)

from allos.transcript_data import TranscriptData
# Instantiate your TranscriptData
td = TranscriptData(
    gtf_file=gtf_file_local
)

from allos.transcript_plots import TranscriptPlots

tp = TranscriptPlots(transcript_data=td, intron_scale=0.15, intron_scale_mode="relative")

from allos.switch_search import get_top_n_isoforms

from allos.switch_search import get_top_n_isoforms
top_n = get_top_n_isoforms(sicelore_mouse_data_copy, gene_id='Snap25', strip=True)
tp.draw_transcripts_list(top_n, draw_cds=True, show_ticks=True)

To focus on the top three most expressed isoforms in the dataset, we can filter and visualize them using the following approach. This allows us to concentrate on the most significant isoforms for further analysis or presentation, useful as some genes have a lot of isoforms and many are very lowly exspressed.

from allos.switch_search import get_top_n_isoforms
top_n = get_top_n_isoforms(sicelore_mouse_data_copy, gene_id='Clta', strip=True, top_n=3)
tp.draw_transcripts_list(top_n, draw_cds=True, show_ticks=True)

from allos.switch_search import get_top_n_isoforms
import matplotlib.pyplot as plt
tids = get_top_n_isoforms(sicelore_mouse_data_copy, gene_id='Pkm', strip=True, top_n=3)
PALETTES = {
    "tab10" : [plt.get_cmap("tab10")(i) for i in range(10)],
    "tab20" : [plt.get_cmap("tab20")(i) for i in range(20)],
    "pastel": [plt.get_cmap("Pastel1")(i % 9) for i in range(9)],
    "dark"  : [plt.get_cmap("Dark2")(i % 8) for i in range(8)],
    "bold"  : ["#1f77b4","#ff7f0e","#2ca02c","#d62728",
               "#9467bd","#8c564b","#e377c2","#7f7f7f",
               "#bcbd22","#17becf"],
}

def pick_colors(name, n):
    base = PALETTES.get(name, PALETTES["tab10"])
    return [base[i % len(base)] for i in range(n)]

# %%  # Rps24 examples
tp = TranscriptPlots(transcript_data=td, intron_scale=0.15, intron_scale_mode="relative")

tp.colors = pick_colors("tab10", len(tids))
tp.draw_transcripts_list(tids, draw_cds=True, show_ticks=True)

tp.colors = pick_colors("pastel", len(tids))
tp.draw_transcripts_list(tids, draw_cds=True, show_ticks=True)

Now, let’s visualize the top isoforms using a dimensionality reduction technique. In this case, we will utilize UMAP for the plot.

top_n = get_top_n_isoforms(sicelore_mouse_data_copy, gene_id='Clta', strip=False, top_n=3)


vs.plot_transcripts(sicelore_mouse_data_copy, transcripts=top_n)

To gain a deeper understanding of how isoforms structurally differ across various cell types, we can analyze the inclusion of different exons in the final transcript structure. A useful metric for this analysis is the Percent Spliced In (PSI) value, which quantifies the proportion of transcripts that include a particular exon. By calculating the PSI for each exon, we can identify and compare the structural variations of isoforms between cell types, providing insights into their functional implications and regulatory mechanisms. This approach allows us to pinpoint specific exons that contribute to the diversity of isoform expression and understand their role in cellular differentiation and function.

Let’s explore another potential isoform switch event.

vs.plot_transcript_exspression_dotplot(sicelore_mouse_data_copy, gene_id='Akap9')

/data/analysis/data_mcandrew/Allos_new/allos_env/lib/python3.10/site-packages/allos/visuals.py:474: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_xticklabels(new_labels, rotation=45, ha='right', fontsize=11)

top_n = get_top_n_isoforms(sicelore_mouse_data_copy, gene_id='Clta', strip=True, top_n=3)
tp.draw_transcripts_list(top_n, draw_cds=True)

top_n = get_top_n_isoforms(sicelore_mouse_data_copy, gene_id='Akap9', strip=False, top_n=2)

vs.plot_transcripts(sicelore_mouse_data_copy, transcripts=top_n)

The visualization of cell features, such as genes and peaks, is often challenged by sparsity and technical dropout. These factors can obscure the clarity of visualizations, particularly when combined with clustering techniques used to annotate cell types. To enhance the visibility of patterns within the data, kernel density estimation (KDE) can be employed. KDE helps to highlight underlying trends by smoothing the data, making patterns more pronounced. However, it is crucial to use KDE alongside the actual count data to avoid potential misinterpretations. Relying solely on KDE without considering the raw counts can lead to misleading conclusions, as it may exaggerate or obscure the true distribution of features. Therefore, a balanced approach that integrates both KDE and raw data visualization is recommended for accurate interpretation.

vs.plot_density_multi(sicelore_mouse_data_copy, features=top_n)