NSForest Modules
Nextflow modules for the NSForest marker-gene discovery branch of the
sc-nsforest-qc-nf workflow. These modules run inside the
ghcr.io/nih-nlm/sc-nsforest-qc-nf/nsforest:latest container and are
orchestrated by main.nf.
The nsforest-cli package bundled in this container wraps the NSForest algorithm from the J. Craig Venter Institute. For algorithm details and citation information, see the NSForest documentation.
Execution order:
1. download_h5ad_process: download h5ad from CellxGene (https) or S3
2. filter_adata_process: tissue / disease / development stage ontology filter + min cluster size
3. dendrogram_process: cluster dendrogram + cluster order CSV
4. prep_medians_process: median expression per cluster (runs once on full filtered h5ad)
5. prep_binary_scores_process: binary scores per cluster (runs once on full filtered h5ad)
6. plot_histograms_process: non-zero median / binary score histograms
7. run_nsforest_process: NSForest scatter/gather by cluster batch
8. merge_nsforest_results_process: gather partial NSForest results
9. plots_process: boxplots, scatter, expression plots
Cluster Stats Process
cluster_stats_process
Source: modules/nsforest/cluster_stats.nf
Cluster Statistics Module
Computes basic cluster statistics (cell counts, percentages).
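As a rough illustration of what "cell counts, percentages" means here, the following sketch tallies cells per cluster from a plain list of labels. It is not the module's code: the function name `cluster_stats` and the example labels are hypothetical, and in the module the labels would presumably come from a cluster column of adata_filtered.h5ad.

```python
from collections import Counter


def cluster_stats(labels):
    """Return {cluster: (cell_count, percent_of_total)} for per-cell labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        cluster: (n, 100.0 * n / total)
        for cluster, n in counts.items()
    }


# Hypothetical per-cell cluster labels, one entry per cell.
labels = ["T cell", "T cell", "B cell", "T cell"]
stats = cluster_stats(labels)

for cluster, (n, pct) in sorted(stats.items()):
    print(f"{cluster}\t{n}\t{pct:.1f}%")
```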
Input:
h5ad: Path to adata_filtered.h5ad
Output:
Params referenced:
params.outdir
params.publish_mode
Dendrogram Process
dendrogram_process
Source: modules/nsforest/dendrogram.nf
Dendrogram Module
Computes a hierarchical dendrogram over cluster medians and writes the cluster traversal order used to scatter run_nsforest_process.
Input:
h5ad: Path to adata_filtered.h5ad
Output:
Flat filenames: {organ}_{first_author}_{journal}_{year}_{vid}_{cluster_header_safe}_*.{csv,svg}
Params referenced:
params.outdir
params.publish_mode
Download H5AD Process
download_h5ad_process
Source: modules/nsforest/download_h5ad.nf
Download H5AD Module
Downloads an h5ad file from CellxGene (https://) or AWS S3 (s3://). Retries up to 3 times on failure.
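The retry behaviour can be sketched in a few lines. This is only an illustration under stated assumptions: it reads "retries up to 3 times" as one initial attempt plus at most three retries, the helper `download_with_retries` and the injected `fetch` callable are hypothetical, and the module itself may rely on Nextflow's own retry handling rather than a loop like this.

```python
import time


def download_with_retries(fetch, url, max_retries=3, delay_s=0.0):
    """Call fetch(url); on OSError, retry up to max_retries more times."""
    if not url.startswith(("https://", "s3://")):
        raise ValueError(f"unsupported URL scheme: {url}")
    last_error = None
    for attempt in range(max_retries + 1):  # 1 initial attempt + retries
        try:
            return fetch(url)
        except OSError as err:
            last_error = err
            if attempt < max_retries:
                time.sleep(delay_s)
    raise RuntimeError(f"download failed after {max_retries} retries") from last_error


# Simulated transport that fails twice, then succeeds (no network needed).
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise OSError("simulated network error")
    return b"h5ad-bytes"


data = download_with_retries(flaky_fetch, "https://example.org/dataset.h5ad")
```

Passing the transport in as a callable keeps the scheme check and the retry policy independent of whether the file comes from https:// or s3://.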
Input:
url: https:// or s3:// URL to the h5ad file
Output:
Filter Adata Process
filter_adata_process
Source: modules/nsforest/filter_adata.nf
Filter AnnData Module
Three-stage filtering using ontology IDs, with logic identical to cellxgene-harvester:
Tissue: tissue_ontology_term_id.isin(obo_ids) from uberon_{organ}.json
Disease: disease_ontology_term_id.isin(obo_ids) from disease_normal.json
Age: development_stage_ontology_term_id.isin(obo_ids) from hsapdv_adult_N.json
Then:
Min cluster size: drops clusters with fewer than min_cluster_size cells
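The filtering steps above can be sketched without AnnData by treating each cell as a dictionary. This is a simplified stand-in, not the module's implementation: set membership replaces the pandas `.isin` checks, the ontology IDs and the `cluster` column name are illustrative values, and the sketch assumes cluster sizes are counted after the three ontology filters have been applied.

```python
from collections import Counter


def filter_cells(cells, tissue_ids, disease_ids, stage_ids,
                 cluster_key, min_cluster_size):
    """Apply the three ontology filters, then the minimum cluster size."""
    # Stages 1-3: keep a cell only if all three ontology IDs are allowed.
    kept = [
        c for c in cells
        if c["tissue_ontology_term_id"] in tissue_ids
        and c["disease_ontology_term_id"] in disease_ids
        and c["development_stage_ontology_term_id"] in stage_ids
    ]
    # Final step: drop clusters left with fewer than min_cluster_size cells.
    sizes = Counter(c[cluster_key] for c in kept)
    return [c for c in kept if sizes[c[cluster_key]] >= min_cluster_size]


def cell(tissue, disease, stage, cluster):
    """Build one illustrative cell record."""
    return {
        "tissue_ontology_term_id": tissue,
        "disease_ontology_term_id": disease,
        "development_stage_ontology_term_id": stage,
        "cluster": cluster,
    }


# Illustrative ontology IDs; real runs take them from the resolved JSON files.
cells = [
    cell("UBERON:0002048", "PATO:0000461", "HsapDv:0000087", "AT1"),
    cell("UBERON:0002048", "PATO:0000461", "HsapDv:0000087", "AT1"),
    cell("UBERON:0002048", "PATO:0000461", "HsapDv:0000087", "AT2"),  # cluster too small
    cell("UBERON:0002048", "MONDO:0005002", "HsapDv:0000087", "AT1"),  # disease not allowed
]

filtered = filter_cells(
    cells,
    tissue_ids={"UBERON:0002048"},
    disease_ids={"PATO:0000461"},
    stage_ids={"HsapDv:0000087"},
    cluster_key="cluster",
    min_cluster_size=2,
)
```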
Input:
h5ad: Path to input h5ad (downloaded from CellxGene)
uberon_json: Path to uberon_{organ}.json from cellxgene-harvester resolve-uberon
disease_json: Path to disease_normal.json from cellxgene-harvester resolve-disease
hsapdv_json: Path to hsapdv_adult_N.json from cellxgene-harvester resolve-hsapdv
Output:
{organ}_{first_author}_{journal}_{year}_{cluster_header_safe}_{embedding_safe}_{vid}*.{csv,svg}
Params referenced:
params.min_cluster_size
params.outdir
params.publish_mode
Merge NSForest Results Process
merge_nsforest_results_process
Source: modules/nsforest/merge_nsforest_results.nf
Params referenced:
params.outdir
params.publish_mode
Plot Histograms Process
plot_histograms_process
Source: modules/nsforest/plot_histograms.nf
Plot Histograms Module
Creates histograms of non-zero median and binary score distributions to help assess marker gene signal quality before running NSForest.
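To make "non-zero" concrete, the sketch below discards zero values before binning, so that the many zero medians typical of sparse expression data do not swamp the first bin. The function `nonzero_histogram`, the bin count, the range, and the sample values are all hypothetical; the module's actual binning and plotting choices are not described here.

```python
def nonzero_histogram(values, n_bins, lo, hi):
    """Count non-zero values into n_bins equal-width bins over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if v == 0 or v < lo or v > hi:
            continue  # skip zeros and out-of-range values
        idx = min(int((v - lo) / width), n_bins - 1)  # v == hi goes in last bin
        counts[idx] += 1
    return counts


# Hypothetical median expression values for one cluster.
medians = [0.0, 0.0, 0.5, 1.5, 1.7, 3.9, 4.0]
counts = nonzero_histogram(medians, n_bins=4, lo=0.0, hi=4.0)
```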
Input:
medians_csv: {prefix}_medians.csv
binary_scores_csv: {prefix}_binary_scores.csv
Output:
Params referenced:
params.outdir
params.publish_mode
Plots Process
plots_process
Source: modules/nsforest/plots.nf
Plots Module
Creates boxplots, scatter plots, and expression dotplots for NSForest marker genes. Maps Ensembl IDs to gene symbols using the cell-kn gene mapping.
Input:
h5ad: Path to adata_filtered.h5ad
results_csv: {prefix}_results.csv from merge_nsforest_results_process
Output:
Params referenced:
params.outdir
params.publish_mode
Prep Binary Scores Process
prep_binary_scores_process
Source: modules/nsforest/prep_binary_scores.nf
Params referenced:
params.outdir
params.publish_mode
Prep Medians Process
prep_medians_process
Source: modules/nsforest/prep_medians.nf
Params referenced:
params.outdir
params.publish_mode
Run NSForest Process
run_nsforest_process
Source: modules/nsforest/run_nsforest.nf
Run NSForest Module (Parallelized by Cluster Batch)
Runs the NSForest algorithm to identify marker gene combinations. Each cluster batch is processed independently (one vs. all).
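The scatter/gather pattern can be illustrated with a small sketch: split the ordered cluster list into batches, process each batch on its own, and concatenate the partial results. Everything named here is hypothetical, including `make_batches`, `run_batch`, `gather`, the batch size, and the result row layout; `run_batch` is a placeholder that does not call NSForest.

```python
def make_batches(cluster_order, batch_size):
    """Scatter: split the ordered cluster list into consecutive batches."""
    return [
        cluster_order[i:i + batch_size]
        for i in range(0, len(cluster_order), batch_size)
    ]


def run_batch(batch):
    """Placeholder for one scattered task: one result row per cluster."""
    return [{"cluster": c, "markers": []} for c in batch]


def gather(partials):
    """Gather: concatenate partial results, preserving batch order."""
    return [row for part in partials for row in part]


# Hypothetical cluster order, e.g. as written by the dendrogram step.
order = ["c1", "c2", "c3", "c4", "c5"]
batches = make_batches(order, batch_size=2)
merged = gather(run_batch(b) for b in batches)
```

Because each batch depends only on its own cluster names and the shared input data, the batches can run in parallel and be merged in any order that is later re-sorted.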