NSForest Modules
================

Nextflow modules for the NSForest marker-gene discovery branch of the
``sc-nsforest-qc-nf`` workflow. These modules run inside the
``ghcr.io/nih-nlm/sc-nsforest-qc-nf/nsforest:latest`` container and are
orchestrated by ``main.nf``.

The ``nsforest-cli`` package bundled in this container wraps the
`NSForest <https://github.com/JCVenterInstitute/NSForest>`_ algorithm
from the J. Craig Venter Institute. For algorithm details and citation
information see the
`NSForest documentation <https://nsforest.readthedocs.io>`_.

Execution order:

1. ``download_h5ad_process`` — download h5ad from CellxGene (https) or S3
2. ``filter_adata_process`` — tissue / disease / development stage ontology filter + min cluster size
3. ``dendrogram_process`` — cluster dendrogram + cluster order CSV
4. ``prep_medians_process`` — median expression per cluster (runs once on full filtered h5ad)
5. ``prep_binary_scores_process`` — binary scores per cluster (runs once on full filtered h5ad)
6. ``plot_histograms_process`` — non-zero median / binary score histograms
7. ``run_nsforest_process`` — NSForest scatter/gather by cluster batch
8. ``merge_nsforest_results_process`` — gather partial NSForest results
9. ``plots_process`` — boxplots, scatter, expression plots


Cluster Stats Process
^^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``cluster_stats_process``

*Source:* ``modules/nsforest/cluster_stats.nf``

Cluster Statistics Module

Computes basic cluster statistics (cell counts, percentages).
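As a rough illustration of the statistics described above, the sketch below counts cells per cluster and converts the counts to percentages. It is not the module's implementation: the function name ``cluster_statistics`` and the example labels are hypothetical, and the labels are assumed to come from the obs column that holds the cluster assignment.

```python
from collections import Counter


def cluster_statistics(labels):
    """Return one row per cluster with its cell count and share of all cells.

    `labels` is assumed to hold one cluster label per cell.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    rows = []
    for cluster, n_cells in sorted(counts.items()):
        rows.append({
            "cluster": cluster,
            "n_cells": n_cells,
            "percent": 100.0 * n_cells / total,
        })
    return rows


# Hypothetical labels for eight cells.
example = ["T cell"] * 4 + ["B cell"] * 3 + ["NK cell"]
stats = cluster_statistics(example)
```

Each row could then be written to the ``cluster_statistics`` CSV with the standard ``csv`` module.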


**Input (tuple):**

- ``meta``: Map with ``organ``, ``first_author``, ``year``, ``author_cell_type``
- ``h5ad``: Path to ``adata_filtered.h5ad``


**Output:**

- ``results``: ``tuple(meta, [cluster_statistics CSV])``

  Flat filenames: ``{organ}_{first_author}_{journal}_{year}_{cluster_header}_{vid}_cluster_statistics.csv``

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``


Dendrogram Process
^^^^^^^^^^^^^^^^^^

.. rubric:: ``dendrogram_process``

*Source:* ``modules/nsforest/dendrogram.nf``

Dendrogram Module

Computes a hierarchical dendrogram over cluster medians and writes the
cluster traversal order used to scatter ``run_nsforest_process``.


**Input (tuple):**

- ``meta``: Map with ``organ``, ``first_author``, ``journal``, ``year``, ``author_cell_type``, ``embedding``, ``vid``
- ``h5ad``: Path to ``adata_filtered.h5ad``


**Output:**

- ``stats``: ``tuple(meta, h5ad, cluster_order.csv)`` (drives the scatter)
- ``results``: ``tuple(meta, [dendrogram SVG + cluster_order CSV])``

  Flat filenames: ``{organ}_{first_author}_{journal}_{year}_{vid}_{cluster_header_safe}_*.{csv,svg}``

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``


Download H5AD Process
^^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``download_h5ad_process``

*Source:* ``modules/nsforest/download_h5ad.nf``

Download H5AD Module

Downloads an h5ad file from CellxGene (https://) or AWS S3 (s3://).
Retries up to 3 times on failure.
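The scheme dispatch and retry behaviour described above can be sketched in plain Python. This is an illustration only, not the module's code: ``pick_backend``, ``download_with_retries`` and the ``fetch`` callable are hypothetical names, and the sketch assumes that "retries up to 3 times" means one initial attempt followed by at most three retries.

```python
import time
from urllib.parse import urlparse


def pick_backend(url):
    """Return 'https' or 's3' according to the URL scheme; reject anything else."""
    scheme = urlparse(url).scheme
    if scheme not in ("https", "s3"):
        raise ValueError(f"unsupported URL scheme: {scheme!r}")
    return scheme


def download_with_retries(url, fetch, max_retries=3, wait_seconds=0.0):
    """Call fetch(backend, url), retrying up to max_retries times after a failure.

    `fetch` is a hypothetical callable that performs the actual transfer.
    """
    backend = pick_backend(url)
    for attempt in range(max_retries + 1):  # first try plus the retries
        try:
            return fetch(backend, url)
        except OSError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            time.sleep(wait_seconds)


# Simulated transfer that fails twice before succeeding.
calls = []


def flaky_fetch(backend, url):
    calls.append(backend)
    if len(calls) < 3:
        raise OSError("simulated network failure")
    return "Smith_2020.h5ad"


result = download_with_retries("s3://example-bucket/data.h5ad", flaky_fetch)
```

In the workflow itself, retrying is more likely handled by the Nextflow process configuration than by code of this kind; the sketch only shows the intended behaviour.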


**Input (tuple):**

- ``meta``: Map with ``first_author``, ``year`` (and all other dataset metadata)
- ``url``: ``https://`` or ``s3://`` URL to the h5ad file


**Output:**

- ``h5ad``: ``tuple(meta, {first_author}_{year}.h5ad)``


Filter Adata Process
^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``filter_adata_process``

*Source:* ``modules/nsforest/filter_adata.nf``

Filter AnnData Module

Three-stage filtering using ontology IDs, with logic identical to
cellxgene-harvester, followed by a minimum cluster size filter:

1. Tissue: ``tissue_ontology_term_id.isin(obo_ids)`` from ``uberon_{organ}.json``
2. Disease: ``disease_ontology_term_id.isin(obo_ids)`` from ``disease_normal.json``
3. Age: ``development_stage_ontology_term_id.isin(obo_ids)`` from ``hsapdv_adult_N.json``
4. Min cluster size: drops clusters with fewer than ``min_cluster_size`` cells
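The filtering steps above can be sketched without AnnData by treating each cell as a dictionary of its obs values. This is a minimal sketch under stated assumptions, not the module's implementation: ``filter_cells`` and ``make_cell`` are hypothetical helpers, the ontology identifiers are illustrative placeholders rather than values read from the harvester JSON files, and the three ontology checks are combined into one condition, which is equivalent to applying them in sequence.

```python
from collections import Counter


def filter_cells(cells, tissue_ids, disease_ids, stage_ids, cluster_key, min_cluster_size):
    """Keep cells that pass all three ontology filters, then drop small clusters."""
    kept = [
        c for c in cells
        if c["tissue_ontology_term_id"] in tissue_ids
        and c["disease_ontology_term_id"] in disease_ids
        and c["development_stage_ontology_term_id"] in stage_ids
    ]
    # Cluster sizes are counted after the ontology filters have been applied.
    sizes = Counter(c[cluster_key] for c in kept)
    return [c for c in kept if sizes[c[cluster_key]] >= min_cluster_size]


def make_cell(cluster, tissue="UBERON:0002107", disease="PATO:0000461", stage="HsapDv:0000087"):
    """Build one hypothetical cell record (placeholder ontology IDs)."""
    return {
        "author_cell_type": cluster,
        "tissue_ontology_term_id": tissue,
        "disease_ontology_term_id": disease,
        "development_stage_ontology_term_id": stage,
    }


cells = [
    make_cell("A"), make_cell("A"), make_cell("A"),
    make_cell("B"),
    make_cell("A", disease="MONDO:0000001"),   # fails the disease filter
    make_cell("B", tissue="UBERON:0002048"),   # fails the tissue filter
]
filtered = filter_cells(
    cells,
    tissue_ids={"UBERON:0002107"},
    disease_ids={"PATO:0000461"},
    stage_ids={"HsapDv:0000087"},
    cluster_key="author_cell_type",
    min_cluster_size=2,
)
```

Here cluster ``B`` is left with a single cell after the ontology filters and is therefore dropped by the minimum cluster size step.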


**Input (tuple):**

- ``meta``: Map with ``organ``, ``first_author``, ``journal``, ``year``, ``author_cell_type``, ``embedding``, ``vid``, ``filter``
- ``h5ad``: Path to input h5ad (downloaded from CellxGene)
- ``uberon_json``: Path to ``uberon_{organ}.json`` from ``cellxgene-harvester resolve-uberon``
- ``disease_json``: Path to ``disease_normal.json`` from ``cellxgene-harvester resolve-disease``
- ``hsapdv_json``: Path to ``hsapdv_adult_N.json`` from ``cellxgene-harvester resolve-hsapdv``


**Output:**

- ``results``: ``tuple(meta, adata_filtered.h5ad, [stats CSVs and SVGs])``

  Flat filenames:

  - ``{organ}_{first_author}_{journal}_{year}_{cluster_header_safe}_{embedding_safe}_{vid}_adata_filtered.h5ad``
  - ``{organ}_{first_author}_{journal}_{year}_{cluster_header_safe}_{embedding_safe}_{vid}*.{csv,svg}``

**Params referenced:**

- ``params.min_cluster_size``
- ``params.outdir``
- ``params.publish_mode``


Merge NSForest Results Process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``merge_nsforest_results_process``

*Source:* ``modules/nsforest/merge_nsforest_results.nf``

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``


Plot Histograms Process
^^^^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``plot_histograms_process``

*Source:* ``modules/nsforest/plot_histograms.nf``

Plot Histograms Module

Creates histograms of non-zero median and binary score distributions
to help assess marker gene signal quality before running NSForest.
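To make "non-zero" concrete, the sketch below drops zero values before binning, so the large mass of zeros does not dominate the histogram. It is an illustration under assumptions, not the module's plotting code: ``nonzero_histogram``, the equal-width binning and the example values are all hypothetical.

```python
def nonzero_histogram(values, n_bins=10):
    """Bin the non-zero values into n_bins equal-width bins.

    Returns (bin_edges, counts); both are empty if every value is zero.
    """
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        return [], []
    lo, hi = min(nonzero), max(nonzero)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in nonzero:
        index = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return edges, counts


# Hypothetical median expression values for one cluster.
medians = [0, 0, 0.5, 1.0, 1.5, 2.0, 0, 2.5]
edges, counts = nonzero_histogram(medians, n_bins=4)
```

The same helper applies to binary scores, since both inputs are tables of per-cluster numeric values.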


**Input (tuple):**

- ``meta``: Map with ``organ``, ``first_author``, ``journal``, ``year``, ``author_cell_type``, ``embedding``, ``vid``
- ``medians_csv``: ``{prefix}_medians.csv``
- ``binary_scores_csv``: ``{prefix}_binary_scores.csv``


**Output:**

- ``histograms``: ``tuple(meta, [hist_nonzero_*.svg])``

  Flat filenames: ``{organ}_{first_author}_{journal}_{year}_{cluster_header_safe}_{embedding}_{vid}_hist_nonzero_*.svg``

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``


Plots Process
^^^^^^^^^^^^^

.. rubric:: ``plots_process``

*Source:* ``modules/nsforest/plots.nf``

Plots Module

Creates boxplots, scatter plots, and expression dotplots for NSForest
marker genes. Maps Ensembl IDs to gene symbols using the cell-kn gene mapping.


**Input (tuple):**

- ``meta``: Map with ``organ``, ``first_author``, ``journal``, ``year``, ``author_cell_type``, ``embedding``, ``vid``
- ``h5ad``: Path to ``adata_filtered.h5ad``
- ``results_csv``: ``{prefix}_results.csv`` from ``merge_nsforest_results_process``


**Output:**

- ``plots``: ``tuple(meta, [boxplots, scatter plots, dotplots as SVG/HTML])``

  Flat filenames: ``{organ}_{first_author}_{journal}_{year}_{cluster_header_safe}_{embedding_safe}_{vid}*.{svg,html}``

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``


Prep Binary Scores Process
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``prep_binary_scores_process``

*Source:* ``modules/nsforest/prep_binary_scores.nf``

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``


Prep Medians Process
^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``prep_medians_process``

*Source:* ``modules/nsforest/prep_medians.nf``

**Params referenced:**

- ``params.outdir``
- ``params.publish_mode``


Run NSForest Process
^^^^^^^^^^^^^^^^^^^^

.. rubric:: ``run_nsforest_process``

*Source:* ``modules/nsforest/run_nsforest.nf``

Run NSForest Module (Parallelized by Cluster Batch)

Runs the NSForest algorithm to identify marker gene combinations.
Each cluster batch is processed independently (one vs. all).
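The scatter/gather pattern can be sketched as follows: the ordered cluster list is split into batches, each batch is processed on its own, and the partial results are concatenated afterwards. This is an illustration only. ``batch_clusters``, ``gather`` and the batch size are hypothetical, and the per-batch "result" is a placeholder row rather than real NSForest output.

```python
def batch_clusters(cluster_order, batch_size):
    """Split the ordered cluster list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        cluster_order[i:i + batch_size]
        for i in range(0, len(cluster_order), batch_size)
    ]


def gather(partial_results):
    """Concatenate per-batch result rows, standing in for the merge step."""
    return [row for batch in partial_results for row in batch]


# Hypothetical cluster order, e.g. as read from cluster_order.csv.
order = ["c1", "c2", "c3", "c4", "c5"]
batches = batch_clusters(order, batch_size=2)

# Placeholder for the per-batch work; each batch yields one row per cluster.
partials = [[{"cluster": name} for name in batch] for batch in batches]
merged = gather(partials)
```

Because every batch still has access to the full filtered h5ad, each cluster in a batch can be compared against all other clusters, which is what makes the batches independent of one another.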