NSForest Modules

Nextflow modules for the NSForest marker-gene discovery branch of the sc-nsforest-qc-nf workflow. These modules run inside the ghcr.io/nih-nlm/sc-nsforest-qc-nf/nsforest:latest container and are orchestrated by main.nf.

The nsforest-cli package bundled in this container wraps the NSForest algorithm from the J. Craig Venter Institute. For algorithm details and citation information see the NSForest documentation.

Execution order:

1. download_h5ad_process: download h5ad from CellxGene (https) or S3
2. filter_adata_process: tissue / disease / development stage ontology filter + min cluster size
3. dendrogram_process: cluster dendrogram + cluster order CSV
4. prep_medians_process: median expression per cluster (runs once on full filtered h5ad)
5. prep_binary_scores_process: binary scores per cluster (runs once on full filtered h5ad)
6. plot_histograms_process: non-zero median / binary score histograms
7. run_nsforest_process: NSForest scatter/gather by cluster batch
8. merge_nsforest_results_process: gather partial NSForest results
9. plots_process: boxplots, scatter, expression plots

Cluster Stats Process

cluster_stats_process

Source: modules/nsforest/cluster_stats.nf

Cluster Statistics Module

Computes basic cluster statistics (cell counts, percentages).
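As a rough illustration of the statistics named above, the following minimal sketch counts cells per cluster and converts the counts to percentages. The function name and the example labels are hypothetical; the module itself reads the labels from adata_filtered.h5ad, which is not reproduced here.

```python
from collections import Counter


def cluster_stats(labels):
    """Return {cluster: (cell_count, percent_of_all_cells)} for per-cell labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cluster: (n, 100.0 * n / total) for cluster, n in counts.items()}


# Hypothetical per-cell cluster labels, standing in for a column of adata.obs.
labels = ["T cell"] * 6 + ["B cell"] * 3 + ["NK cell"]
stats = cluster_stats(labels)

for cluster, (n, pct) in sorted(stats.items()):
    print(f"{cluster}\t{n}\t{pct:.1f}%")
```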

Input:

h5ad: Path to adata_filtered.h5ad

Output:

Params referenced:

params.outdir
params.publish_mode

Dendrogram Process

dendrogram_process

Source: modules/nsforest/dendrogram.nf

Dendrogram Module

Computes a hierarchical dendrogram over cluster medians and writes the cluster traversal order used to scatter run_nsforest_process.
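To make the idea of a cluster traversal order concrete, here is a small pure-Python sketch of agglomerative clustering over cluster medians that returns the leaf order. The linkage (average) and the distance (Euclidean) are assumptions chosen for illustration; this page does not state which the module uses, and the module may well rely on a library routine instead.

```python
import math
from itertools import combinations


def leaf_order(medians):
    """medians: {cluster: [median per gene]}. Returns clusters in merge (leaf) order.

    Assumption: average linkage on Euclidean distance, for illustration only.
    """
    # Each group is a tuple of cluster names in left-to-right leaf order.
    groups = [(name,) for name in sorted(medians)]

    def distance(a, b):
        # Average linkage: mean distance over all cross-group member pairs.
        pairs = [(x, y) for x in a for y in b]
        return sum(math.dist(medians[x], medians[y]) for x, y in pairs) / len(pairs)

    while len(groups) > 1:
        # Merge the two closest groups and keep their leaves adjacent.
        a, b = min(combinations(groups, 2), key=lambda pair: distance(*pair))
        groups = [g for g in groups if g not in (a, b)] + [a + b]
    return list(groups[0])


# Hypothetical medians for four clusters over two genes.
medians = {"A": [0, 0], "B": [0, 1], "C": [10, 10], "D": [10, 11]}
order = leaf_order(medians)
print(order)
```

Similar clusters end up next to each other in the returned order, which is the property that makes such an order useful for splitting clusters into batches.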

Input:

h5ad: Path to adata_filtered.h5ad

Output:

Flat filenames: {organ}_{first_author}_{journal}_{year}_{vid}_{cluster_header_safe}_*.{csv,svg}

Params referenced:

params.outdir
params.publish_mode

Download H5AD Process

download_h5ad_process

Source: modules/nsforest/download_h5ad.nf

Download H5AD Module

Downloads an h5ad file from CellxGene (https://) or AWS S3 (s3://). Retries up to 3 times on failure.
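The scheme dispatch and retry behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: pick_backend, download_with_retries and the injected fetch callable are hypothetical names, and "up to 3 retries" is read here as one initial attempt plus at most three further attempts.

```python
import time
from urllib.parse import urlparse


def pick_backend(url):
    """Choose a download backend from the URL scheme (https:// or s3://)."""
    scheme = urlparse(url).scheme
    if scheme in ("https", "s3"):
        return scheme
    raise ValueError(f"unsupported URL scheme: {url!r}")


def download_with_retries(url, fetch, max_retries=3, wait_seconds=0.0):
    """Call fetch(backend, url), retrying on OSError up to max_retries times."""
    backend = pick_backend(url)
    last_error = None
    for _ in range(1 + max_retries):  # assumption: first attempt + retries
        try:
            return fetch(backend, url)
        except OSError as err:
            last_error = err
            time.sleep(wait_seconds)
    raise RuntimeError(f"giving up on {url}") from last_error


# Simulated fetch that fails twice, then succeeds (no network access needed).
calls = []


def flaky_fetch(backend, url):
    calls.append(backend)
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return f"downloaded via {backend}"


result = download_with_retries("s3://example-bucket/data.h5ad", flaky_fetch)
print(result, "after", len(calls), "attempts")
```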

Input:

url: https:// or s3:// URL to the h5ad file

Output:

Filter AnnData Process

filter_adata_process

Source: modules/nsforest/filter_adata.nf

Filter AnnData Module

Three-stage filtering using ontology IDs, with logic identical to cellxgene-harvester:

Tissue: tissue_ontology_term_id.isin(obo_ids), obo_ids from uberon_{organ}.json
Disease: disease_ontology_term_id.isin(obo_ids), obo_ids from disease_normal.json
Age: development_stage_ontology_term_id.isin(obo_ids), obo_ids from hsapdv_adult_N.json

Then:

Min cluster size: drops clusters with fewer than min_cluster_size cells
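The filtering steps above can be sketched on plain Python records. This is a simplified stand-in for the membership tests on the AnnData obs table: the function name, the cluster_key argument, the "cell_type" column and all ontology IDs in the example are illustrative assumptions, and the layout of the JSON files is not modelled.

```python
from collections import Counter


def filter_cells(cells, tissue_ids, disease_ids, stage_ids,
                 cluster_key, min_cluster_size):
    """Apply the three ontology filters, then drop clusters that are too small."""
    # Stages 1-3: keep a cell only if every ontology term ID is in its allowed set.
    kept = [
        c for c in cells
        if c["tissue_ontology_term_id"] in tissue_ids
        and c["disease_ontology_term_id"] in disease_ids
        and c["development_stage_ontology_term_id"] in stage_ids
    ]
    # Then: count cells per cluster after filtering and drop small clusters.
    sizes = Counter(c[cluster_key] for c in kept)
    return [c for c in kept if sizes[c[cluster_key]] >= min_cluster_size]


def cell(tissue, disease, stage, cell_type):
    return {
        "tissue_ontology_term_id": tissue,
        "disease_ontology_term_id": disease,
        "development_stage_ontology_term_id": stage,
        "cell_type": cell_type,  # hypothetical cluster column
    }


# Illustrative IDs only; real allowed sets come from the resolved JSON files.
LUNG, LIVER = "UBERON:0002048", "UBERON:0002107"
NORMAL, OTHER = "PATO:0000461", "MONDO:0000001"
ADULT = "HsapDv:0000087"

cells = (
    [cell(LUNG, NORMAL, ADULT, "AT1")] * 3
    + [cell(LUNG, NORMAL, ADULT, "AT2")]
    + [cell(LUNG, OTHER, ADULT, "AT2")] * 2
    + [cell(LIVER, NORMAL, ADULT, "AT1")]
)
kept = filter_cells(cells, {LUNG}, {NORMAL}, {ADULT}, "cell_type", 2)
print(len(kept), sorted({c["cell_type"] for c in kept}))
```

Note that cluster sizes are counted after the ontology filters, so a cluster that is large in the raw file can still be dropped if few of its cells pass the first three stages.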

Input:

h5ad: Path to input h5ad (downloaded from CellxGene)

uberon_json: Path to uberon_{organ}.json from cellxgene-harvester resolve-uberon

disease_json: Path to disease_normal.json from cellxgene-harvester resolve-disease

hsapdv_json: Path to hsapdv_adult_N.json from cellxgene-harvester resolve-hsapdv

Output:

{organ}_{first_author}_{journal}_{year}_{cluster_header_safe}_{embedding_safe}_{vid}*.{csv,svg}

Params referenced:

params.min_cluster_size
params.outdir
params.publish_mode

Merge NSForest Results Process

merge_nsforest_results_process

Source: modules/nsforest/merge_nsforest_results.nf

Params referenced:

params.outdir
params.publish_mode
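The execution order describes this process as gathering the partial NSForest results. A minimal sketch of such a gather step, assuming each batch produces a CSV with an identical header, is shown below. The column names and the function name are hypothetical, and the actual merge may differ.

```python
import csv
import io


def merge_partial_results(partial_csv_texts):
    """Concatenate CSV texts that share one header into a single CSV text."""
    header, rows = None, []
    for text in partial_csv_texts:
        reader = csv.reader(io.StringIO(text))
        part_header = next(reader, None)
        if part_header is None:
            continue  # skip an empty partial file
        if header is None:
            header = part_header
        elif part_header != header:
            raise ValueError("partial results have different columns")
        rows.extend(reader)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical partial results from two cluster batches.
part_a = "clusterName,f_score\nAT1,0.91\n"
part_b = "clusterName,f_score\nAT2,0.84\nClub,0.77\n"
merged = merge_partial_results([part_a, part_b])
print(merged)
```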

Plot Histograms Process

plot_histograms_process

Source: modules/nsforest/plot_histograms.nf

Plot Histograms Module

Creates histograms of non-zero median and binary score distributions to help assess marker gene signal quality before running NSForest.

Input:

medians_csv: {prefix}_medians.csv

binary_scores_csv: {prefix}_binary_scores.csv

Output:

Params referenced:

params.outdir
params.publish_mode

Plots Process

plots_process

Source: modules/nsforest/plots.nf

Plots Module

Creates boxplots, scatter plots, and expression dotplots for NSForest marker genes. Maps Ensembl IDs to gene symbols using cell-kn gene mapping.

Input:

h5ad: Path to adata_filtered.h5ad

results_csv: {prefix}_results.csv from merge_nsforest_results_process

Output:

Params referenced:

params.outdir
params.publish_mode

Prep Binary Scores Process

prep_binary_scores_process

Source: modules/nsforest/prep_binary_scores.nf

Params referenced:

params.outdir
params.publish_mode

Prep Medians Process

prep_medians_process

Source: modules/nsforest/prep_medians.nf

Params referenced:

params.outdir
params.publish_mode
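The execution order describes this process as computing median expression per cluster. The following sketch shows that computation on a toy input; the data layout (one dict of gene values per cell) and all names are assumptions for illustration, whereas the module operates on the full filtered h5ad.

```python
from collections import defaultdict
from statistics import median


def cluster_medians(expression, labels):
    """expression: per-cell {gene: value}; labels: per-cell cluster labels.

    Returns {cluster: {gene: median expression across that cluster's cells}}.
    """
    values = defaultdict(lambda: defaultdict(list))
    for cell, label in zip(expression, labels):
        for gene, value in cell.items():
            values[label][gene].append(value)
    return {
        label: {gene: median(vals) for gene, vals in genes.items()}
        for label, genes in values.items()
    }


# Hypothetical expression values for five cells in two clusters.
labels = ["A", "A", "A", "B", "B"]
expression = [
    {"GeneX": 0, "GeneY": 1},
    {"GeneX": 2, "GeneY": 1},
    {"GeneX": 4, "GeneY": 3},
    {"GeneX": 5, "GeneY": 0},
    {"GeneX": 7, "GeneY": 0},
]
medians = cluster_medians(expression, labels)
print(medians)
```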

Run NSForest Process

run_nsforest_process

Source: modules/nsforest/run_nsforest.nf

Run NSForest Module (Parallelized by Cluster Batch)

Runs the NSForest algorithm to identify marker gene combinations. Each cluster batch is processed independently (one vs. all).
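The scatter step and the one-vs-all framing can be sketched as follows. The batch_size parameter and both function names are hypothetical; how the workflow sizes its batches is not stated on this page.

```python
def make_batches(cluster_order, batch_size):
    """Split the ordered cluster list into consecutive batches."""
    return [
        cluster_order[i:i + batch_size]
        for i in range(0, len(cluster_order), batch_size)
    ]


def one_vs_all(labels, target):
    """Binarize per-cell labels: 1 for the target cluster, 0 for all others."""
    return [1 if label == target else 0 for label in labels]


# Hypothetical cluster order, e.g. as written by the dendrogram step.
cluster_order = ["A", "B", "C", "D", "E"]
batches = make_batches(cluster_order, batch_size=2)
print(batches)

# Each cluster in a batch is compared against all remaining cells.
cell_labels = ["A", "B", "A", "C"]
print(one_vs_all(cell_labels, "A"))
```

Because every cluster is scored against all other cells, each batch still needs the full filtered dataset as input; only the list of target clusters differs between batches.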