# Prune clustering tree using random forest classifiers

To identify a final set of clusters, this function iteratively prunes the provided hierarchical clustering tree from the bottom up, using a framework of random forest classifiers and permutation tests.

## Usage

```
pruneTree(
  object,
  key = "CHOIR",
  alpha = NULL,
  p_adjust = NULL,
  feature_set = NULL,
  exclude_features = NULL,
  n_iterations = NULL,
  n_trees = NULL,
  use_variance = NULL,
  min_accuracy = NULL,
  min_connections = NULL,
  max_repeat_errors = NULL,
  distance_approx = NULL,
  distance_awareness = 2,
  collect_all_metrics = FALSE,
  sample_max = NULL,
  downsampling_rate = NULL,
  normalization_method = NULL,
  batch_correction_method = NULL,
  batch_labels = NULL,
  cluster_params = NULL,
  use_assay = NULL,
  cluster_tree = NULL,
  input_matrix = NULL,
  nn_matrix = NULL,
  dist_matrix = NULL,
  reduction = NULL,
  n_cores = NULL,
  random_seed = NULL,
  verbose = TRUE
)
```

## Arguments

- object
An object of class 'Seurat', 'SingleCellExperiment', or 'ArchRProject'.

- key
The name under which CHOIR-related data for this run is stored in the object. Defaults to 'CHOIR'.

- alpha
A numeric value indicating the significance level used for permutation test comparisons of cluster prediction accuracies. Defaults to 0.05.

- p_adjust
A string indicating which multiple comparison adjustment to use. Permitted values are 'bonferroni', 'fdr', and 'none'. Defaults to 'bonferroni'.

- feature_set
A string indicating whether to train random forest classifiers on 'all' features or only variable ('var') features. Defaults to 'var'.

- exclude_features
A character vector indicating features that should be excluded from input to the random forest classifier. Default = `NULL` will not exclude any features.

- n_iterations
A numeric value indicating the number of iterations run for each permutation test comparison. Defaults to 100.

- n_trees
A numeric value indicating the number of trees in each random forest. Defaults to 50.

- use_variance
A boolean value indicating whether to use the variance of the random forest accuracy scores as part of the permutation test threshold. Defaults to `TRUE`.

- min_accuracy
A numeric value indicating the minimum accuracy required of the random forest classifier, below which clusters will be automatically merged. Defaults to 0.5 (chance).

- min_connections
A numeric value indicating the minimum number of nearest neighbors between two clusters for them to be considered 'adjacent'. Non-adjacent clusters will not be merged. Defaults to 1.

- max_repeat_errors
Used to account for situations in which random forest classifier errors are concentrated among a few cells that are repeatedly misassigned. A numeric value indicating the maximum number of such 'repeat errors' that will be taken into account. If set to 0, 'repeat errors' will not be evaluated. Defaults to 20.

- distance_approx
A boolean value indicating whether or not to use approximate distance calculations. Default = `TRUE` will use centroid-based distances.

- distance_awareness
A numeric value representing the distance threshold above which a cluster will not merge with another cluster. Specifically, this value is multiplied by the distance between a cluster and its closest distinguishable neighbor to set the threshold. Default = 2 sets this threshold at a 2-fold increase in distance. Alternately, to omit all distance calculations, set to `FALSE`.

- collect_all_metrics
A boolean value indicating whether to collect and save additional metrics from the random forest classifier comparisons, including feature importances and tree depth. Defaults to `FALSE`.

- sample_max
A numeric value indicating the maximum number of cells used per cluster to train/test each random forest classifier. Default = `Inf` does not cap the number of cells used.

- downsampling_rate
A numeric value indicating the proportion of cells used per cluster to train/test each random forest classifier. Default = "auto" sets the downsampling rate according to the dataset size, for efficiency.

- normalization_method
A character string or vector indicating which normalization method to use. In general, input data should be supplied to CHOIR after normalization, except in cases when the user wishes to use `Seurat::SCTransform()` normalization. Permitted values are 'none' or 'SCTransform'. Defaults to 'none'.

- batch_correction_method
A character string or vector indicating which batch correction method to use. Permitted values are 'Harmony' and 'none'. Defaults to 'none'.

- batch_labels
If applying batch correction, a character string or vector indicating the name of the column containing the batch labels. Defaults to `NULL`.

- cluster_params
A list of additional parameters to be passed to `Seurat::FindClusters()` for clustering at each level of the tree. Note that if `group.singletons` is set to `TRUE`, `CHOIR` relabels initial clusters such that each singleton constitutes its own cluster.

- use_assay
For Seurat or SingleCellExperiment objects, a character string or vector indicating the assay(s) to use in the provided object. Default = `NULL` will choose the current active assay for Seurat objects and the `logcounts` assay for SingleCellExperiment objects.

- cluster_tree
An optional dataframe containing the cluster IDs of each cell across the levels of a hierarchical clustering tree. Default = `NULL` will use the hierarchical clustering tree generated by function `buildTree()`.

- input_matrix
An optional matrix containing the feature x cell data on which to train the random forest classifiers. Default = `NULL` will use the feature x cell matrix or matrices indicated by function `buildTree()`.

- nn_matrix
An optional matrix containing the nearest neighbor adjacency of the cells. Default = `NULL` will look for the adjacency matrix or matrices generated by function `buildTree()`.

- dist_matrix
An optional distance matrix of cell to cell distances (based on dimensionality reduction cell embeddings). Default = `NULL` will look for the distance matrix or matrices generated by function `buildTree()`.

- reduction
An optional matrix of dimensionality reduction cell embeddings to be used for distance calculations. Default = `NULL` will look for the dimensionality reductions generated by function `buildTree()`.

- n_cores
A numeric value indicating the number of cores to use for parallelization. Default = `NULL` will use the number of available cores minus 2.

- random_seed
A numeric value indicating the random seed to be used.

- verbose
A boolean value indicating whether to use verbose output during the execution of this function. Can be set to `FALSE` for a cleaner output.
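
To make the `alpha` and `p_adjust` arguments more concrete, the sketch below uses base R's `stats::p.adjust()` to show how the three permitted adjustment values change which comparisons fall below a 0.05 significance level. The p-values are invented for illustration, and the sketch does not reproduce how `pruneTree()` applies the adjustment internally.

```r
# Hypothetical p-values from four permutation test comparisons (illustrative only).
p_values <- c(0.001, 0.012, 0.02, 0.20)
alpha <- 0.05

# Apply each adjustment method permitted by the p_adjust argument.
# The result is a matrix with one row per comparison and one column per method.
adjusted <- sapply(
  c("bonferroni", "fdr", "none"),
  function(method) p.adjust(p_values, method = method)
)

# Logical matrix: TRUE where the adjusted p-value is below alpha.
significant <- adjusted < alpha

print(round(adjusted, 3))
print(significant)
```

With these example values, the Bonferroni adjustment leaves two of the four comparisons significant, while 'fdr' and 'none' each leave three, which illustrates why the stricter default is more conservative.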

## Value

Returns the object with the following added data stored under the provided key:

- clusters
Final clusters and stepwise cluster results for each progressive pruning step

- parameters
Record of parameter values used

- records
Metadata for all recorded permutation test comparisons and feature importance scores from all comparisons

## Details

If `CHOIR::buildTree()` was run prior to this function, most parameters will be retrieved from the object. Alternately, parameter values can be supplied. For multi-modal data, optionally supply parameter inputs as vectors/lists that sequentially specify the value for each modality.
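
A typical call sequence might look like the sketch below. It assumes the CHOIR package is installed and that `object` is an already-normalized Seurat, SingleCellExperiment, or ArchRProject object. The wrapper name `run_choir_pruning` is hypothetical and exists only for illustration. Calling `buildTree()` with default arguments is likewise an assumption, not a recommendation.

```r
# Hypothetical wrapper: build the clustering tree, then prune it.
# Defining this function does not require CHOIR; calling it does.
run_choir_pruning <- function(object) {
  # buildTree() is run first so that pruneTree() can retrieve most
  # parameters from the object (assumes default buildTree() settings).
  object <- CHOIR::buildTree(object)

  # Only arguments documented above are passed; all others are left
  # at NULL so they are taken from the values stored in the object.
  CHOIR::pruneTree(
    object,
    key = "CHOIR",
    verbose = FALSE
  )
}
```

The returned object would then hold the `clusters`, `parameters`, and `records` entries under the 'CHOIR' key, as described in the Value section.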