Bank Customer Churn Prediction with Tidymodels - Part 2: Decision Threshold Analysis
Continuing the exploration of the Bank Customer Churn problem, this article explains, in terms accessible to non-technical audiences, how decision threshold analysis with the 'probably' package in R can be used to optimize model performance.
Decision threshold analysis involves systematically evaluating model performance across a range of thresholds, with the aim of minimizing the costs associated with false positives (unnecessary retention offers) and false negatives (lost customers).
To perform this analysis, we first obtain predicted probabilities for churn from our classification model. Then, we define a decision threshold that converts these probabilities into binary churn predictions. Next, we use the 'probably' package to systematically vary this threshold, evaluate model performance, and calculate a cost function.
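As a minimal sketch of those first two steps, the following assumes a fitted tidymodels workflow (`churn_fit`), a held-out test set (`churn_test`), and a factor outcome `churn` with levels "yes"/"no"; the object and column names are illustrative rather than taken from the original analysis.

```r
library(tidymodels)
library(probably)

# Assumed objects: `churn_fit` is a fitted workflow, `churn_test` the test set,
# and `churn` the factor outcome with levels "yes"/"no" (names are illustrative).
churn_preds <- augment(churn_fit, churn_test)  # adds .pred_class, .pred_yes, .pred_no

# Convert probabilities to hard classes at a chosen threshold instead of the default 0.5
churn_preds <- churn_preds %>%
  mutate(
    .pred_custom = make_two_class_pred(
      estimate  = .pred_yes,   # probability of the "yes" (churn) class
      levels    = levels(churn),
      threshold = 0.35         # example threshold, not the tuned value
    )
  )
```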
The 'probably' package in R facilitates decision threshold optimization and cost-based evaluation by working with predicted class probabilities. A typical workflow involves using the model to get predicted class probabilities on test data, creating a cost matrix reflecting business costs, and using 'probably' functions to evaluate metrics and choose an optimal threshold minimizing expected cost.
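A simple way to turn that workflow into numbers is to sweep a grid of thresholds and compute the expected cost from the confusion counts at each one. The sketch below builds on the `churn_preds` frame from above and uses placeholder costs; it is not the article's exact cost function.

```r
library(dplyr)
library(purrr)

# Placeholder costs: a false positive costs one retention offer, a false negative
# roughly one customer's annualized CLV (replaced by the CLV-based estimates later).
cost_fp <- 99
cost_fn <- 1500

thresholds <- seq(0.05, 0.95, by = 0.05)

cost_by_threshold <- map_dfr(thresholds, function(thr) {
  churn_preds %>%
    mutate(pred = factor(if_else(.pred_yes >= thr, "yes", "no"),
                         levels = levels(churn))) %>%
    summarise(
      .threshold = thr,
      fp         = sum(pred == "yes" & churn == "no"),
      fn         = sum(pred == "no"  & churn == "yes"),
      total_cost = fp * cost_fp + fn * cost_fn
    )
})

# Threshold with the lowest expected cost
cost_by_threshold %>% arrange(total_cost) %>% slice_head(n = 1)
```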
In our hypothetical scenario analysis, we approximated the total cost of false negatives and false positives using Customer Lifetime Value (CLV) and a cost of intervention, assumed to equal the $99 annual fee for a standard account. Annualized CLV was calculated as the sum of account and credit card fees, with each product carrying a $99 annual fee except credit cards, which carry $149; the median CLV was then taken as the approximate CLV per customer.
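As a worked example of those fee assumptions (the column names `num_products` and `has_credit_card` are guesses at the Kaggle fields, not confirmed), a customer holding three products including a credit card would have an annualized CLV of 2 × $99 + $149 = $347.

```r
# Annualized CLV per customer: $99 per product, except credit cards at $149.
# `num_products` counts products held and `has_credit_card` is 0/1; both
# column names are assumptions about the Kaggle data.
churn_clv <- churn_test %>%
  mutate(
    annual_clv = (num_products - has_credit_card) * 99 + has_credit_card * 149
  )

cost_fn <- median(churn_clv$annual_clv)  # approximate cost of a lost customer (FN)
cost_fp <- 99                            # cost of an intervention (standard account fee)
```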
The scenario analysis identified a decision threshold that minimizes cost. However, the lowest-cost threshold also reduces classification performance, presenting a trade-off between a model that differentiates the classes moderately well and a cheaper configuration that triggers more interventions and produces more false positive predictions.
The 'probably::threshold_perf()' function was used to carry out threshold analysis, identifying an optimal threshold based on the J-index. Additionally, the 'workflowsets::extract_workflow_set_result' function was used to generate a tibble of all trialled hyperparameter combinations, and the best was selected based on a specified metric.
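A hedged sketch of both calls, assuming the prediction frame from earlier and a workflow set named `churn_wf_set` with a workflow id such as "recipe_xgboost" (both names are illustrative):

```r
# Sweep thresholds and keep the one that maximizes the J-index
# (sensitivity + specificity - 1); assumes "yes" is the event level.
threshold_results <- churn_preds %>%
  threshold_perf(
    truth      = churn,
    estimate   = .pred_yes,
    thresholds = seq(0.05, 0.95, by = 0.01)
  )

best_j <- threshold_results %>%
  filter(.metric == "j_index") %>%
  slice_max(.estimate, n = 1)

# Pull the tuning results for one workflow out of a workflow set and
# select the best hyperparameter combination by a chosen metric.
best_params <- churn_wf_set %>%
  workflowsets::extract_workflow_set_result(id = "recipe_xgboost") %>%
  tune::select_best(metric = "roc_auc")
```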
When applying this analysis to your own dataset or model output, the costs should be adjusted to your exact business context, such as the "cost of retaining a non-churner" versus the "cost of missing a churner".
This approach makes the threshold decision explicit and data-driven rather than fixed at 0.5, providing a more informed and effective way to manage bank customer churn. The analysis was carried out using the 'probably' package from tidymodels, and the dataset used in the study was obtained from Kaggle.
More broadly, in data and cloud computing settings within the finance sector, the 'probably' package enables decision threshold optimization for business problems such as bank customer churn: by evaluating a range of thresholds against the costs of false positives and false negatives, churn prediction models can be tuned toward their most cost-effective operating point.