scsilhouette.compute module
Core computation functions for silhouette analysis.
This module contains the main function for computing silhouette scores from single-cell data.
- scsilhouette.compute.load_obo_ids(json_path: str, label: str) → set
Load obo_ids from a resolve_uberon / resolve_disease JSON file.
- scsilhouette.compute.load_obo_labels(json_path: str) → dict
Load obo_id -> label mapping from any resolve JSON file.
- scsilhouette.compute.filter_by_age(adata, hsapdv_json: str)
Filter cells using development_stage_ontology_term_id matched against HsapDv obo_ids.
The age threshold is encoded in the JSON at resolve time (cellxgene-harvester resolve-hsapdv --min-age N). No numeric comparison is performed here; the filter follows the same pattern as the tissue and disease filters.
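The ID-matching pattern described above can be sketched with plain Python. This is only an illustration of set-membership filtering, not the package's implementation: the term IDs are made-up placeholders, the helper name filter_by_allowed_ids is hypothetical, and cells are modelled as dictionaries rather than as an AnnData object.

```python
# Placeholder term IDs for illustration only; these are not real HsapDv entries.
allowed_ids = {"HsapDv:EX1", "HsapDv:EX2"}

# Stand-in for per-cell metadata (in practice this would live in adata.obs).
cells = [
    {"cell": "c1", "development_stage_ontology_term_id": "HsapDv:EX1"},
    {"cell": "c2", "development_stage_ontology_term_id": "HsapDv:EX9"},
    {"cell": "c3", "development_stage_ontology_term_id": "HsapDv:EX2"},
]


def filter_by_allowed_ids(rows, column, ids):
    """Keep rows whose value in `column` is a member of `ids` (hypothetical helper)."""
    return [row for row in rows if row.get(column) in ids]


kept = filter_by_allowed_ids(
    cells, "development_stage_ontology_term_id", allowed_ids
)
kept_names = [row["cell"] for row in kept]
print(kept_names)
```

Because the allowed set is fixed when the JSON is resolved, changing the age cut-off means regenerating the JSON rather than changing the filtering code.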
- scsilhouette.compute.run_silhouette(h5ad_path: str, cluster_header: str, embedding_key: str, organ: str, first_author: str, journal: str, year: str, dataset_version_id: str, organism: str = 'human', disease: str = 'normal', use_binary_genes: bool = False, gene_list_path: str | None = None, metric: str = 'euclidean', pca_components: int | None = None, filter_normal: bool = True, uberon_json: str | None = None, disease_json: str | None = None, hsapdv_json: str | None = None, save_scores: bool = True, save_cluster_summary: bool = True, save_annotation: bool = True)
Compute silhouette scores for single-cell clusters.
- Parameters:
h5ad_path (str) – Path to input h5ad file
cluster_header (str) – Column name for cell type clusters
embedding_key (str) – Embedding key (e.g., X_umap)
organ (str) – Organ/tissue (e.g., kidney)
first_author (str) – First author (e.g., Lake)
journal (str) – Journal (e.g., Cell)
year (str) – Publication year (e.g., 2023)
organism (str) – Organism (default: human)
disease (str) – Disease state label for annotation output (default: normal)
use_binary_genes (bool) – Use binary genes from NSForest
gene_list_path (str) – Path to gene list file
metric (str) – Distance metric for silhouette (default: euclidean)
pca_components (int) – Number of PCA components (optional)
filter_normal (bool) – If True, apply tissue + disease + age filters
uberon_json (str) – Path to UBERON JSON from cellxgene-harvester resolve-uberon. Required when filter_normal=True.
disease_json (str) – Path to disease JSON from cellxgene-harvester resolve-disease. Required when filter_normal=True.
hsapdv_json (str) – Path to HsapDv JSON from cellxgene-harvester resolve-hsapdv --min-age N. The age threshold is encoded in the JSON at resolve time. Required when filter_normal=True.
save_scores (bool) – Save per-cell silhouette scores
save_cluster_summary (bool) – Save cluster summary statistics
save_annotation (bool) – Save annotation metadata
- Returns:
adata – Filtered, annotated data with silhouette scores
- Return type:
AnnData
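For readers unfamiliar with the quantity being computed, the following is a minimal pure-Python sketch of the standard per-point silhouette coefficient, s = (b - a) / max(a, b), with Euclidean distance. It is not the code run_silhouette uses internally; the function name silhouette_samples_sketch and the toy coordinates are assumptions made for illustration.

```python
import math


def silhouette_samples_sketch(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b), Euclidean distance.

    a: mean distance to the other points in the same cluster.
    b: smallest mean distance to the points of any other cluster.
    Requires at least two distinct labels.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy 2-D embedding with two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = ["A", "A", "B", "B"]
scores = silhouette_samples_sketch(points, labels)
print([round(s, 3) for s in scores])
```

Scores near 1 indicate a cell sits well inside its own cluster, scores near 0 indicate it lies between clusters, and negative scores suggest it is closer to another cluster than to its own.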
- scsilhouette.compute.compute_summary_stats(cluster_summary_path: str, nsforest_results_path: str | None = None, metadata_path: str | None = None, cluster_header: str = '', organ: str = '', first_author: str = '', journal: str = '', year: str = '', embedding: str = '')
Compute dataset-level summary statistics from cluster summaries.
Computes median-of-medians and other aggregate metrics across all clusters. Optionally includes NSForest F-score statistics if results are available.
When metadata_path is provided, the function reads a JSON file containing all dataset fields from the cellxgene-harvester CSV. All metadata fields are included as columns in the output summary CSV. Where a field appears both in the JSON and as an individual parameter, the JSON value takes precedence.
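The two ideas above, a median-of-medians aggregate and JSON metadata taking precedence over individual parameters, can be sketched as follows. This is an illustration under stated assumptions rather than the function's actual code: the column name median_silhouette, the example values, and the output key median_of_medians are all hypothetical.

```python
import statistics

# Hypothetical per-cluster rows, as might be read from a cluster summary CSV.
cluster_rows = [
    {"cluster": "A", "median_silhouette": 0.62},
    {"cluster": "B", "median_silhouette": 0.35},
    {"cluster": "C", "median_silhouette": 0.48},
]

# Median-of-medians: the median of the per-cluster median scores.
median_of_medians = statistics.median(
    row["median_silhouette"] for row in cluster_rows
)

# Individual parameters (some left at their empty defaults) ...
params = {"organ": "kidney", "first_author": "", "year": "2023"}
# ... and fields that would come from the metadata JSON.
metadata = {"organ": "kidney", "first_author": "Lake", "journal": "Cell"}

# Later entries win in a dict merge, so metadata overrides params.
summary = {**params, **metadata, "median_of_medians": median_of_medians}
print(summary)
```

Using the median at both levels keeps the dataset-level figure from being dominated by a few very large or very poorly separated clusters.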