NSForest Modules

Nextflow modules for the NSForest marker-gene discovery branch of the sc-nsforest-qc-nf workflow. These modules run inside the ghcr.io/nih-nlm/sc-nsforest-qc-nf/nsforest:latest container and are orchestrated by main.nf.

The nsforest-cli package bundled in this container wraps the NSForest algorithm from the J. Craig Venter Institute. For algorithm details and citation information see the NSForest documentation.

Execution order:

  1. download_h5ad_process — download h5ad from CellxGene (https) or S3

  2. filter_adata_process — tissue / disease / development stage ontology filter + min cluster size

  3. dendrogram_process — cluster dendrogram + cluster order CSV

  4. prep_medians_process — median expression per cluster (runs once on full filtered h5ad)

  5. prep_binary_scores_process — binary scores per cluster (runs once on full filtered h5ad)

  6. plot_histograms_process — non-zero median / binary score histograms

  7. run_nsforest_process — NSForest scatter/gather by cluster batch

  8. merge_nsforest_results_process — gather partial NSForest results

  9. plots_process — boxplots, scatter, expression plots

Cluster Stats Process

cluster_stats_process

Source: modules/nsforest/cluster_stats.nf

Cluster Statistics Module

Computes basic cluster statistics (cell counts, percentages).

Input:

  • h5ad: Path to adata_filtered.h5ad

Output:

Params referenced:

  • params.outdir

  • params.publish_mode
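The cell counts and percentages that cluster_stats_process reports can be illustrated with a small standalone sketch. This is not the module's code: the function name, the input shape (one cluster label per cell), and the two-decimal rounding are assumptions made for the example.

```python
from collections import Counter


def cluster_stats(labels):
    """Return (cluster, n_cells, percent) rows, largest cluster first."""
    counts = Counter(labels)
    total = sum(counts.values())
    rows = []
    for cluster, n_cells in counts.most_common():
        percent = 100.0 * n_cells / total
        rows.append((cluster, n_cells, round(percent, 2)))
    return rows


# Hypothetical per-cell cluster labels (one entry per cell).
labels = ["T cell"] * 6 + ["B cell"] * 3 + ["NK cell"]
stats = cluster_stats(labels)
for cluster, n_cells, percent in stats:
    print(f"{cluster}\t{n_cells}\t{percent}")
```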

Dendrogram Process

dendrogram_process

Source: modules/nsforest/dendrogram.nf

Dendrogram Module

Computes a hierarchical dendrogram over cluster medians and writes the cluster traversal order used to scatter run_nsforest_process.

Input:

  • h5ad: Path to adata_filtered.h5ad

Output:

Flat filenames: {organ}_{first_author}_{journal}_{year}_{vid}_{cluster_header_safe}_*.{csv,svg}

Params referenced:

  • params.outdir

  • params.publish_mode
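To make the idea of a dendrogram-derived cluster order concrete, here is a minimal pure-Python sketch of agglomerative clustering over cluster medians. The linkage method (average) and distance (Euclidean) are assumptions chosen for illustration; the source does not state which ones dendrogram_process uses, and the function name is hypothetical.

```python
import math


def leaf_order(medians):
    """medians: dict cluster -> list of per-gene medians.

    Returns cluster names in the left-to-right order of the merged tree.
    """
    groups = [[name] for name in medians]

    def dist(a, b):
        # Average linkage: mean Euclidean distance over all leaf pairs.
        pairs = [(x, y) for x in a for y in b]
        return sum(math.dist(medians[x], medians[y]) for x, y in pairs) / len(pairs)

    while len(groups) > 1:
        # Find the closest pair of groups and merge them.
        i, j = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: dist(groups[ij[0]], groups[ij[1]]),
        )
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups[0]


# Hypothetical medians for four clusters over two genes.
medians = {"A": [0, 0], "C": [10, 10], "B": [0, 1], "D": [10, 11]}
order = leaf_order(medians)
print(order)
```

Similar clusters end up adjacent in the returned order, which is the property a downstream scatter step can rely on.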

Download H5AD Process

download_h5ad_process

Source: modules/nsforest/download_h5ad.nf

Download H5AD Module

Downloads an h5ad file from CellxGene (https://) or AWS S3 (s3://). Retries up to 3 times on failure.

Input:

  • url: https:// or s3:// URL to the h5ad file
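The scheme dispatch and retry behaviour described above can be sketched as follows. This is an illustration only: the helper names are hypothetical, the actual transfer is left to a caller-supplied function, and "up to 3 retries" is interpreted here as one initial attempt plus at most three further attempts, which the source does not spell out.

```python
import time
from urllib.parse import urlparse


def pick_backend(url):
    """Return "https" or "s3" depending on the URL scheme."""
    scheme = urlparse(url).scheme
    if scheme in ("https", "s3"):
        return scheme
    raise ValueError(f"unsupported URL scheme: {scheme!r}")


def with_retries(fetch, url, max_retries=3, delay_s=0.0):
    """Call fetch(url); on OSError retry up to max_retries more times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return fetch(url)
        except OSError:
            if attempts > max_retries:
                raise
            time.sleep(delay_s)


# Hypothetical fetcher that fails twice before succeeding.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise OSError("transient failure")
    return f"downloaded via {pick_backend(url)}"


result = with_retries(flaky_fetch, "s3://example-bucket/data.h5ad")
print(result, "after", len(calls), "attempts")
```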

Output:

Filter Adata Process

filter_adata_process

Source: modules/nsforest/filter_adata.nf

Filter AnnData Module

Three-stage filtering using ontology IDs, with logic identical to cellxgene-harvester:

  1. Tissue: tissue_ontology_term_id.isin(obo_ids), obo_ids from uberon_{organ}.json

  2. Disease: disease_ontology_term_id.isin(obo_ids), obo_ids from disease_normal.json

  3. Age: development_stage_ontology_term_id.isin(obo_ids), obo_ids from hsapdv_adult_N.json

Then:

  4. Min cluster size: drops clusters with fewer than min_cluster_size cells
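The ontology filters and the minimum cluster size step can be sketched in plain Python, with set membership standing in for the pandas `isin` calls. The function name, the per-cell dict layout, the "cluster" key, and the ontology IDs in the sample data are illustrative assumptions; the sketch also assumes cluster sizes are counted after the ontology filters have been applied.

```python
from collections import Counter


def filter_cells(cells, tissue_ids, disease_ids, stage_ids,
                 cluster_key, min_cluster_size):
    # Stages 1-3: keep a cell only if every ontology term ID is allowed.
    kept = [
        c for c in cells
        if c["tissue_ontology_term_id"] in tissue_ids
        and c["disease_ontology_term_id"] in disease_ids
        and c["development_stage_ontology_term_id"] in stage_ids
    ]
    # Final step: drop clusters left with fewer than min_cluster_size cells.
    sizes = Counter(c[cluster_key] for c in kept)
    return [c for c in kept if sizes[c[cluster_key]] >= min_cluster_size]


def make_cell(cluster, tissue="UBERON:0002048"):
    # Illustrative IDs only; real values come from the resolved JSON files.
    return {
        "cluster": cluster,
        "tissue_ontology_term_id": tissue,
        "disease_ontology_term_id": "PATO:0000461",
        "development_stage_ontology_term_id": "HsapDv:0000087",
    }


cells = [
    make_cell("c1"),
    make_cell("c1"),
    make_cell("c2"),                           # cluster too small
    make_cell("c1", tissue="UBERON:0000955"),  # tissue not in allowed set
]
filtered = filter_cells(
    cells,
    tissue_ids={"UBERON:0002048"},
    disease_ids={"PATO:0000461"},
    stage_ids={"HsapDv:0000087"},
    cluster_key="cluster",
    min_cluster_size=2,
)
print(len(filtered), "cells kept")
```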

Input:

  • h5ad: Path to input h5ad (downloaded from CellxGene)

  • uberon_json: Path to uberon_{organ}.json from cellxgene-harvester resolve-uberon

  • disease_json: Path to disease_normal.json from cellxgene-harvester resolve-disease

  • hsapdv_json: Path to hsapdv_adult_N.json from cellxgene-harvester resolve-hsapdv

Output:

{organ}_{first_author}_{journal}_{year}_{cluster_header_safe}_{embedding_safe}_{vid}*.{csv,svg}

Params referenced:

  • params.min_cluster_size

  • params.outdir

  • params.publish_mode

Merge NSForest Results Process

merge_nsforest_results_process

Source: modules/nsforest/merge_nsforest_results.nf

Params referenced:

  • params.outdir

  • params.publish_mode
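Gathering partial results amounts to concatenating per-batch tables that share a header. The sketch below shows one way to do that with the standard csv module; the function name and the column names in the sample data are hypothetical, and the real module may merge differently.

```python
import csv
import io


def merge_partial_csvs(parts):
    """parts: iterable of CSV texts sharing one header. Returns merged text."""
    out = io.StringIO()
    writer = None
    for text in parts:
        reader = csv.DictReader(io.StringIO(text))
        for row in reader:
            if writer is None:
                # Write the header once, taken from the first non-empty part.
                writer = csv.DictWriter(
                    out, fieldnames=reader.fieldnames, lineterminator="\n"
                )
                writer.writeheader()
            writer.writerow(row)
    return out.getvalue()


# Hypothetical partial results from two cluster batches.
part_1 = "cluster,markers\nA,G1\nB,G2\n"
part_2 = "cluster,markers\nC,G3\n"
merged = merge_partial_csvs([part_1, part_2])
print(merged)
```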

Plot Histograms Process

plot_histograms_process

Source: modules/nsforest/plot_histograms.nf

Plot Histograms Module

Creates histograms of non-zero median and binary score distributions to help assess marker gene signal quality before running NSForest.

Input:

  • medians_csv: {prefix}_medians.csv

  • binary_scores_csv: {prefix}_binary_scores.csv

Output:

Params referenced:

  • params.outdir

  • params.publish_mode
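The "non-zero" part of these histograms means zeros are discarded before binning. A minimal sketch of that idea, using equal-width bins and no plotting library, is shown below; the function name and bin count are assumptions, and the module's actual binning is not described in the source.

```python
def nonzero_histogram(values, n_bins=5):
    """Return (left_edge, count) pairs for the strictly positive values."""
    nonzero = [v for v in values if v > 0]
    if not nonzero:
        return []
    lo, hi = min(nonzero), max(nonzero)
    # Fall back to a unit width when all non-zero values are identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in nonzero:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, counts[i]) for i in range(n_bins)]


# Hypothetical median values for one cluster; zeros are ignored.
values = [0, 0, 1, 2, 2, 3, 5, 0, 4, 5]
hist = nonzero_histogram(values, n_bins=4)
for left_edge, count in hist:
    print(f"{left_edge:.1f}\t{count}")
```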

Plots Process

plots_process

Source: modules/nsforest/plots.nf

Plots Module

Creates boxplots, scatter plots, and expression dotplots for NSForest marker genes. Maps Ensembl IDs to gene symbols using the cell-kn gene mapping.

Input:

  • h5ad: Path to adata_filtered.h5ad

  • results_csv: {prefix}_results.csv from merge_nsforest_results_process

Output:

Params referenced:

  • params.outdir

  • params.publish_mode

Prep Binary Scores Process

prep_binary_scores_process

Source: modules/nsforest/prep_binary_scores.nf

Params referenced:

  • params.outdir

  • params.publish_mode
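As a rough illustration of what a per-cluster binary score measures, the sketch below follows the binary expression score as published for NSForest: for a target cluster, average the clipped quantity 1 - (other median / target median) over all other clusters. Whether nsforest-cli computes it exactly this way is not stated here; the function name, the nested-dict layout, and returning 0.0 when the target median is zero are assumptions.

```python
def binary_scores(medians):
    """medians: dict cluster -> dict gene -> median expression.

    Returns a dict of the same shape holding binary scores in [0, 1].
    """
    clusters = list(medians)
    scores = {}
    for target in clusters:
        scores[target] = {}
        for gene, m_target in medians[target].items():
            if m_target <= 0 or len(clusters) < 2:
                # Assumption: an unexpressed gene cannot be a binary marker.
                scores[target][gene] = 0.0
                continue
            total = sum(
                max(0.0, 1.0 - medians[other][gene] / m_target)
                for other in clusters
                if other != target
            )
            scores[target][gene] = total / (len(clusters) - 1)
    return scores


# Hypothetical medians for one gene across three clusters.
medians = {"a": {"G1": 4.0}, "b": {"G1": 1.0}, "c": {"G1": 0.0}}
scores = binary_scores(medians)
print(scores)
```

A score near 1 means the gene is expressed in the target cluster and close to absent everywhere else.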

Prep Medians Process

prep_medians_process

Source: modules/nsforest/prep_medians.nf

Params referenced:

  • params.outdir

  • params.publish_mode
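Median expression per cluster can be pictured as grouping cells by cluster label and taking the median of each gene within each group. The sketch below does this with the standard library; the function name and the input layout (one gene-to-value dict per cell) are assumptions for illustration, not the module's data model.

```python
from collections import defaultdict
from statistics import median


def cluster_medians(expression, labels):
    """expression: per-cell dicts gene -> value; labels: per-cell cluster."""
    by_cluster = defaultdict(lambda: defaultdict(list))
    for cell, cluster in zip(expression, labels):
        for gene, value in cell.items():
            by_cluster[cluster][gene].append(value)
    return {
        cluster: {gene: median(vals) for gene, vals in genes.items()}
        for cluster, genes in by_cluster.items()
    }


# Hypothetical expression values for four cells and two genes.
expression = [
    {"G1": 0, "G2": 4},
    {"G1": 2, "G2": 6},
    {"G1": 4, "G2": 8},
    {"G1": 9, "G2": 0},
]
labels = ["a", "a", "a", "b"]
result = cluster_medians(expression, labels)
print(result)
```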

Run NSForest Process

run_nsforest_process

Source: modules/nsforest/run_nsforest.nf

Run NSForest Module (Parallelized by Cluster Batch)

Runs the NSForest algorithm to identify marker gene combinations. Each cluster batch is processed independently (one vs. all).
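The scatter side of this step can be sketched as splitting an ordered cluster list into fixed-size batches, each of which would then be handed to an independent task. The function name and the batch size are hypothetical; the source does not name the setting that controls batch size.

```python
def batch_clusters(ordered_clusters, batch_size):
    """Split an ordered list of cluster names into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        ordered_clusters[i:i + batch_size]
        for i in range(0, len(ordered_clusters), batch_size)
    ]


# Hypothetical cluster order, e.g. as read from a cluster order CSV.
ordered = ["A", "B", "C", "D", "E"]
batches = batch_clusters(ordered, batch_size=2)
for index, batch in enumerate(batches):
    print(index, batch)
```

Because each batch is evaluated against all remaining cells, batches do not depend on each other and can run in parallel.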