NSForest Modules
================

Nextflow modules for the NSForest marker-gene discovery branch of the ``sc-nsforest-qc-nf`` workflow. These modules run inside the ``ghcr.io/nih-nlm/sc-nsforest-qc-nf/nsforest:latest`` container and are orchestrated by ``main.nf``.

The ``nsforest-cli`` package bundled in this container wraps the NSForest algorithm from the J. Craig Venter Institute. For algorithm details and citation information, see the NSForest documentation.

Execution order:

1. ``download_h5ad_process``: download h5ad from CellxGene (https) or S3
2. ``filter_adata_process``: tissue / disease / development stage ontology filter + min cluster size
3. ``dendrogram_process``: cluster dendrogram + cluster order CSV
4. ``prep_medians_process``: median expression per cluster (runs once on full filtered h5ad)
5. ``prep_binary_scores_process``: binary scores per cluster (runs once on full filtered h5ad)
6. ``plot_histograms_process``: non-zero median / binary score histograms
7. ``run_nsforest_process``: NSForest scatter/gather by cluster batch
8. ``merge_nsforest_results_process``: gather partial NSForest results
9. ``plots_process``: boxplots, scatter, expression plots

Cluster Stats Process
^^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``cluster_stats_process``

*Source:* ``modules/nsforest/cluster_stats.nf``

Cluster Statistics Module

Computes basic cluster statistics (cell counts, percentages).

Input:
~~~~~~

``@param tuple``:

- ``meta``: Map with organ, first_author, year, author_cell_type
- ``h5ad``: Path to ``adata_filtered.h5ad``

Output:
~~~~~~~

- ``@emit results``: ``tuple(meta, [cluster_statistics CSV])``

Flat filenames::

    {organ}_{first_author}_{journal}_{year}_{cluster_header}_{vid}_cluster_statistics.csv

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``

Dendrogram Process
^^^^^^^^^^^^^^^^^^

.. rubric:: ``dendrogram_process``

*Source:* ``modules/nsforest/dendrogram.nf``

Dendrogram Module

Computes a hierarchical dendrogram over cluster medians and writes the cluster traversal order used to scatter ``run_nsforest_process``.

Input:
~~~~~~

``@param tuple``:

- ``meta``: Map with organ, first_author, journal, year, author_cell_type, embedding, vid
- ``h5ad``: Path to ``adata_filtered.h5ad``

Output:
~~~~~~~

- ``@emit stats``: ``tuple(meta, h5ad, cluster_order.csv)`` (drives the scatter)
- ``@emit results``: ``tuple(meta, [dendrogram SVG + cluster_order CSV])``

Flat filenames::

    {organ}_{first_author}_{journal}_{year}_{vid}_{cluster_header_safe}_*.{csv,svg}

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``

Download H5AD Process
^^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``download_h5ad_process``

*Source:* ``modules/nsforest/download_h5ad.nf``

Download H5AD Module

Downloads an h5ad file from CellxGene (``https://``) or AWS S3 (``s3://``). Retries up to 3 times on failure.

Input:
~~~~~~

``@param tuple``:

- ``meta``: Map with first_author, year (and all other dataset metadata)
- ``url``: ``https://`` or ``s3://`` URL to the h5ad file

Output:
~~~~~~~

- ``@emit h5ad``: ``tuple(meta, {first_author}_{year}.h5ad)``

Filter Adata Process
^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``filter_adata_process``

*Source:* ``modules/nsforest/filter_adata.nf``

Filter AnnData Module

Three-stage filtering using ontology IDs, with logic identical to cellxgene-harvester:

1. Tissue: ``tissue_ontology_term_id.isin(obo_ids)``, with IDs from ``uberon_{organ}.json``
2. Disease: ``disease_ontology_term_id.isin(obo_ids)``, with IDs from ``disease_normal.json``
3. Age: ``development_stage_ontology_term_id.isin(obo_ids)``, with IDs from ``hsapdv_adult_N.json``

Then:

4. Min cluster size: drops clusters with fewer than ``min_cluster_size`` cells

Input:
~~~~~~

``@param tuple``:

- ``meta``: Map with organ, first_author, journal, year, author_cell_type, embedding, vid, filter
- ``h5ad``: Path to input h5ad (downloaded from CellxGene)
- ``uberon_json``: Path to ``uberon_{organ}.json`` from ``cellxgene-harvester resolve-uberon``
- ``disease_json``: Path to ``disease_normal.json`` from ``cellxgene-harvester resolve-disease``
- ``hsapdv_json``: Path to ``hsapdv_adult_N.json`` from ``cellxgene-harvester resolve-hsapdv``

Output:
~~~~~~~

- ``@emit results``: ``tuple(meta, adata_filtered.h5ad, [stats CSVs and SVGs])``

Flat filenames::

    {organ}_{first_author}_{journal}_{year}_{cluster_header_safe}_{embedding_safe}_{vid}_adata_filtered.h5ad
    {organ}_{first_author}_{journal}_{year}_{cluster_header_safe}_{embedding_safe}_{vid}*.{csv,svg}

**Params referenced:**

- ``params.min_cluster_size``
- ``params.outdir``
- ``params.publish_mode``

Merge NSForest Results Process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``merge_nsforest_results_process``

*Source:* ``modules/nsforest/merge_nsforest_results.nf``

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``

Plot Histograms Process
^^^^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``plot_histograms_process``

*Source:* ``modules/nsforest/plot_histograms.nf``

Plot Histograms Module

Creates histograms of the non-zero median and binary score distributions to help assess marker gene signal quality before running NSForest.

Input:
~~~~~~

``@param tuple``:

- ``meta``: Map with organ, first_author, journal, year, author_cell_type, embedding, vid
- ``medians_csv``: ``{prefix}_medians.csv``
- ``binary_scores_csv``: ``{prefix}_binary_scores.csv``

Output:
~~~~~~~

- ``@emit histograms``: ``tuple(meta, [hist_nonzero_*.svg])``

Flat filenames::

    {organ}_{first_author}_{journal}_{year}_{cluster_header_safe}_{embedding}_{vid}_hist_nonzero_*.svg

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``

Plots Process
^^^^^^^^^^^^^

.. rubric:: ``plots_process``

*Source:* ``modules/nsforest/plots.nf``

Plots Module

Creates boxplots, scatter plots, and expression dotplots for NSForest marker genes. Maps Ensembl IDs to gene symbols using the cell-kn gene mapping.

Input:
~~~~~~

``@param tuple``:

- ``meta``: Map with organ, first_author, journal, year, author_cell_type, embedding, vid
- ``h5ad``: Path to ``adata_filtered.h5ad``
- ``results_csv``: ``{prefix}_results.csv`` from ``merge_nsforest_results_process``

Output:
~~~~~~~

- ``@emit plots``: ``tuple(meta, [boxplots, scatter plots, dotplots as SVG/HTML])``

Flat filenames::

    {organ}_{first_author}_{journal}_{year}_{cluster_header_safe}_{embedding_safe}_{vid}*.{svg,html}

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``

Prep Binary Scores Process
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``prep_binary_scores_process``

*Source:* ``modules/nsforest/prep_binary_scores.nf``

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``

Prep Medians Process
^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``prep_medians_process``

*Source:* ``modules/nsforest/prep_medians.nf``

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``

Run NSForest Process
^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``run_nsforest_process``

*Source:* ``modules/nsforest/run_nsforest.nf``

Run NSForest Module (Parallelized by Cluster Batch)

Runs the NSForest algorithm to identify marker gene combinations. Each cluster batch is processed independently (one vs. all).
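
The scatter/gather pattern described for ``run_nsforest_process`` and ``merge_nsforest_results_process`` can be illustrated with a small Python sketch. This is not the module's actual code. It assumes the cluster order is a plain list of cluster names and that batches are fixed-size contiguous slices of that list. The names ``make_cluster_batches``, ``scatter_gather``, ``batch_size``, and ``run_batch`` are hypothetical and do not come from the workflow.

```python
from typing import Callable, Dict, List


def make_cluster_batches(cluster_order: List[str], batch_size: int) -> List[List[str]]:
    """Split an ordered cluster list into contiguous batches (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        cluster_order[i:i + batch_size]
        for i in range(0, len(cluster_order), batch_size)
    ]


def scatter_gather(
    cluster_order: List[str],
    batch_size: int,
    run_batch: Callable[[List[str]], List[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Run run_batch on each batch independently, then concatenate the partial results."""
    merged: List[Dict[str, str]] = []
    for batch in make_cluster_batches(cluster_order, batch_size):
        # Each batch is handled on its own; only the final merge sees all results.
        merged.extend(run_batch(batch))
    return merged


if __name__ == "__main__":
    # Illustrative cluster names only, not taken from any dataset.
    order = ["cluster_a", "cluster_b", "cluster_c", "cluster_d", "cluster_e"]
    for batch in make_cluster_batches(order, 2):
        print(batch)
```

In the real workflow each batch would be a separate Nextflow task rather than a loop iteration. The loop here only shows that per-batch results can be computed independently and concatenated afterwards.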