scsilhouette.compute module

Core computation functions for silhouette analysis.

This module contains the main functions for computing silhouette scores from single-cell data, together with helpers for ontology-based filtering and dataset-level summaries.

scsilhouette.compute.load_obo_ids(json_path: str, label: str) → set

Load obo_ids from a resolve_uberon / resolve_disease JSON file.

scsilhouette.compute.load_obo_labels(json_path: str) → dict

Load obo_id -> label mapping from any resolve JSON file.

scsilhouette.compute.filter_by_age(adata, hsapdv_json: str)

Filter cells using development_stage_ontology_term_id matched against HsapDv obo_ids.

The age threshold is encoded in the JSON at resolve time (cellxgene-harvester resolve-hsapdv --min-age N), so no numeric comparison is performed here. This follows the same pattern as the tissue and disease filters.
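The membership-based filtering described here can be sketched in plain Python. This is only an illustration of the idea, not the module's implementation: the `filter_by_term` helper, the list-of-dicts record layout, and the IDs in `keep_ids` are hypothetical placeholders. Only the column name `development_stage_ontology_term_id` is taken from the description.

```python
def filter_by_term(cells, keep_ids, column="development_stage_ontology_term_id"):
    """Keep cells whose ontology term ID is in keep_ids (set membership only)."""
    return [cell for cell in cells if cell.get(column) in keep_ids]


# Placeholder IDs standing in for the obo_ids read from an HsapDv JSON file.
keep_ids = {"HsapDv:0000001", "HsapDv:0000002"}

# Hypothetical per-cell records; real data would live in an AnnData object.
cells = [
    {"cell": "c1", "development_stage_ontology_term_id": "HsapDv:0000001"},
    {"cell": "c2", "development_stage_ontology_term_id": "HsapDv:0000009"},
    {"cell": "c3", "development_stage_ontology_term_id": "HsapDv:0000002"},
]

kept = filter_by_term(cells, keep_ids)
kept_names = [cell["cell"] for cell in kept]
```

Because the allowed IDs are fixed when the JSON is produced, the filter never parses or compares ages. Changing the threshold means regenerating the JSON, not changing the filter.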

scsilhouette.compute.run_silhouette(h5ad_path: str, cluster_header: str, embedding_key: str, organ: str, first_author: str, journal: str, year: str, dataset_version_id: str, organism: str = 'human', disease: str = 'normal', use_binary_genes: bool = False, gene_list_path: str | None = None, metric: str = 'euclidean', pca_components: int | None = None, filter_normal: bool = True, uberon_json: str | None = None, disease_json: str | None = None, hsapdv_json: str | None = None, save_scores: bool = True, save_cluster_summary: bool = True, save_annotation: bool = True)

Compute silhouette scores for single-cell clusters.

Parameters:
  • h5ad_path (str) – Path to input h5ad file

  • cluster_header (str) – Column name for cell type clusters

  • embedding_key (str) – Embedding key (e.g., X_umap)

  • organ (str) – Organ/tissue (e.g., kidney)

  • first_author (str) – First author (e.g., Lake)

  • journal (str) – Journal (e.g., Cell)

  • year (str) – Publication year (e.g., 2023)

  • organism (str) – Organism (default: human)

  • disease (str) – Disease state label for annotation output (default: normal)

  • use_binary_genes (bool) – Use binary genes from NSForest

  • gene_list_path (str) – Path to gene list file

  • metric (str) – Distance metric for silhouette (default: euclidean)

  • pca_components (int) – Number of PCA components (optional)

  • filter_normal (bool) – If True, apply the tissue, disease, and age filters (default: True)

  • uberon_json (str) – Path to UBERON JSON from cellxgene-harvester resolve-uberon. Required when filter_normal=True.

  • disease_json (str) – Path to disease JSON from cellxgene-harvester resolve-disease. Required when filter_normal=True.

  • hsapdv_json (str) – Path to HsapDv JSON from cellxgene-harvester resolve-hsapdv --min-age N. Age threshold is encoded in the JSON at resolve time. Required when filter_normal=True.

  • save_scores (bool) – Save per-cell silhouette scores

  • save_cluster_summary (bool) – Save cluster summary statistics

  • save_annotation (bool) – Save annotation metadata

Returns:

adata – Filtered, annotated data with silhouette scores

Return type:

AnnData
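For readers unfamiliar with the metric, the per-cell silhouette score can be sketched in plain Python with Euclidean distance. This is a minimal illustration of the standard definition s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to any other cluster. It is not how `run_silhouette` is implemented; the `silhouette_scores` helper and the toy coordinates are hypothetical.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette scores using Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy 2-D embedding with two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = ["A", "A", "B", "B"]
scores = silhouette_scores(points, labels)
```

Scores near 1 indicate that a cell sits much closer to its own cluster than to the nearest other cluster. Scores near 0 or below suggest overlapping or misassigned clusters.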

scsilhouette.compute.compute_summary_stats(cluster_summary_path: str, nsforest_results_path: str | None = None, metadata_path: str | None = None, cluster_header: str = '', organ: str = '', first_author: str = '', journal: str = '', year: str = '', embedding: str = '')

Compute dataset-level summary statistics from cluster summaries.

Computes median-of-medians and other aggregate metrics across all clusters. Optionally includes NSForest F-score statistics if results are available.

When metadata_path is provided, the function reads a JSON file containing all dataset fields from the cellxgene-harvester CSV. All metadata fields are included as columns in the output summary CSV, and fields in the JSON override the corresponding individual parameters.
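The median-of-medians aggregate and the override rule can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the layout of the cluster summary and metadata files is not specified here, so the in-memory dictionaries, the cluster names, and the score values are hypothetical.

```python
from statistics import median

# Hypothetical per-cluster silhouette scores.
cluster_scores = {
    "cluster_a": [0.8, 0.6, 0.7],
    "cluster_b": [0.2, 0.4],
    "cluster_c": [-0.1, 0.1, 0.3, 0.5],
}

# Step 1: one median per cluster.
cluster_medians = {name: median(vals) for name, vals in cluster_scores.items()}

# Step 2: the median of those per-cluster medians.
median_of_medians = median(cluster_medians.values())

# Individually passed parameters (empty strings mirror the documented defaults).
params = {"organ": "kidney", "first_author": "", "year": ""}

# Hypothetical metadata fields, standing in for the parsed metadata JSON.
metadata = {"first_author": "Lake", "year": "2023"}

# Later keys win in a dict merge, so metadata overrides params.
summary_row = {**params, **metadata, "median_of_medians": median_of_medians}
```

Taking the median per cluster first gives each cluster equal weight in the dataset-level figure, so a very large cluster does not dominate the summary.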