Calculate the weighted Gini coefficient or AUC in R

statistics

Author

Vinh Nguyen

Published

September 25, 2015

This post on Kaggle provides R code for calculating the Gini for assessing a prediction rule, and this post provides R code for the weighted version (think exposure for frequency and claim count for severity in non-life insurance modeling). Note that the weighted version is not well-defined when there are ties in the predictions and the corresponding weights vary, because different Lorenz curves (gains charts) could be drawn for different orderings of the tied observations; see this post for an explanation and some examples.
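To make the tie problem concrete, here is a minimal sketch in R. The helper `gini_for_order` and the toy data are hypothetical and are not taken from the Kaggle posts; the helper simply computes the trapezoidal area between the gains chart and the diagonal for a row ordering that is passed in explicitly. Rows 2 and 3 share the same prediction but have different weights, so both orderings below are valid descending sorts.

```r
# Hypothetical helper: area between the gains chart and the diagonal
# (trapezoidal rule) for an explicitly supplied row ordering `ord`.
gini_for_order <- function(actual, weights, ord) {
  w <- weights[ord]
  y <- actual[ord]
  x <- c(0, cumsum(w) / sum(w))          # cumulative share of weights, starting at 0
  L <- c(0, cumsum(y * w) / sum(y * w))  # cumulative share of weighted response
  sum(diff(x) * (head(L, -1) + tail(L, -1)) / 2) - 1 / 2
}

actual  <- c(1, 0, 1)
weights <- c(1, 3, 1)
pred    <- c(0.9, 0.5, 0.5)  # rows 2 and 3 are tied but carry different weights

g_a <- gini_for_order(actual, weights, ord = c(1, 2, 3))
g_b <- gini_for_order(actual, weights, ord = c(1, 3, 2))
print(c(order_123 = g_a, order_132 = g_b))
```

The two orderings give different areas even though the predictions are identical, which is why the weighted Gini needs a tie-breaking convention before it can be compared across models.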

Now, to explain the code. The calculation of the x values (variable random, the cumulative proportion of observations or weights) and the y values (variable Lorentz, the cumulative proportion of the response, i.e., the goods/1's or positive values) is straightforward. To calculate the area between the Lorenz curve and the diagonal line, one could use the trapezoidal rule to calculate the area between the Lorenz curve and the x-axis and then subtract the area of the lower triangle (1/2):

\[ \begin{align}
Gini &= \sum_{i=1}^{n} (x_{i} - x_{i-1}) \left[\frac{L(x_{i}) + L(x_{i-1})}{2}\right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} + L(x_{i-1})x_{i} - L(x_{i})x_{i-1} - L(x_{i-1})x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i})x_{i} - L(x_{i-1})x_{i-1} \right] + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i})x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} L(x_{n})x_{n} + \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i}) x_{i-1} \right] - \frac{1}{2} \\
&= \frac{1}{2} \sum_{i=1}^{n} \left[ L(x_{i-1})x_{i} - L(x_{i}) x_{i-1} \right]
\end{align} \]
where the fourth line follows because the first sum telescopes to $L(x_{n})x_{n} - L(x_{0})x_{0} = L(x_{n})x_{n}$, and the last equality comes from the fact that $L(x_{n}) = x_{n} = 1$ for the Lorenz curve/gains chart. The $i=1$ term in the remaining summation is 0 (since $x_{0}=0$ and $L(x_{0})=0$ for the Lorenz curve), leaving $n-1$ terms. These correspond to sum(df$Lorentz[-1]*df$random[-n]) - sum(df$Lorentz[-n]*df$random[-1]) inside the WeightedGini function. Note that the first sum in this expression pairs $L(x_{i})$ with $x_{i-1}$, so as written it equals the negative of the summation above; the sign, like the factor of $1/2$, cancels in the normalized version.
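The identity can be checked numerically with a short R sketch. The function name `weighted_gini_sketch`, the toy data, the descending sort by prediction, and the use of response times weight for the y values are all assumptions made for illustration; this is not the Kaggle WeightedGini function. It evaluates the first line of the derivation (trapezoidal area minus 1/2) and the final cross-product sum side by side.

```r
# Hypothetical sketch: compare the trapezoidal area minus 1/2 with the
# closed-form cross-product sum from the last line of the derivation.
weighted_gini_sketch <- function(actual, weights, pred) {
  ord <- order(pred, decreasing = TRUE)   # assumed ordering: highest prediction first
  w <- weights[ord]
  y <- actual[ord]
  x <- c(0, cumsum(w) / sum(w))           # x_0, ..., x_n
  L <- c(0, cumsum(y * w) / sum(y * w))   # L(x_0), ..., L(x_n)
  n <- length(w)
  i <- 2:(n + 1)                          # R positions holding x_1, ..., x_n
  trapezoid   <- sum((x[i] - x[i - 1]) * (L[i] + L[i - 1]) / 2) - 1 / 2
  closed_form <- sum(L[i - 1] * x[i] - L[i] * x[i - 1]) / 2
  c(trapezoid = trapezoid, closed_form = closed_form)
}

res <- weighted_gini_sketch(actual  = c(1, 0, 1, 0, 0),
                            weights = c(2, 1, 3, 1, 3),
                            pred    = c(0.9, 0.8, 0.6, 0.3, 0.1))
print(res)
```

Both entries agree (0.19 for this toy data), which is the equality the derivation establishes.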

For the unweighted case, applying the trapezoidal rule to the area between the Lorenz curve and the diagonal line yields:

\[ \begin{align}
Gini &= \sum_{i=1}^{n} \frac{1}{n} \frac{\left[ L(x_{i}) - x_{i} \right] + \left[ L(x_{i-1}) - x_{i-1} \right] }{2} \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i-1}) - x_{i-1} \right] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} [L(x_{0}) - x_{0}] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) - x_{i} \right] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} [L(x_{0}) - x_{0}] + \frac{1}{2n} \sum_{i=1}^{n-1} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} [L(x_{n}) - x_{n}] \\
&= \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] + \frac{1}{2n} [L(x_{0}) - x_{0}] + \frac{1}{2n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[ L(x_{i}) - x_{i} \right]
\end{align} \]
where we repeatedly used the fact that $L(x_{0}) = x_{0} = 0$ and $L(x_{n}) = x_{n} = 1$ for a Lorenz curve and that $1/n$ is the width between points (change in cdf of the observations). The summation is what is returned by SumModelGini.
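As a numerical check of the unweighted result, the sketch below compares the trapezoidal sum (average of adjacent heights times the width $1/n$) with the final expression. The function name `unweighted_gini_sketch`, the toy data, and the descending sort by prediction are assumptions for illustration, not the Kaggle SumModelGini code.

```r
# Hypothetical sketch: trapezoidal area between the Lorenz curve and the
# diagonal versus the simplified sum (1/n) * sum(L(x_i) - x_i).
unweighted_gini_sketch <- function(actual, pred) {
  ord <- order(pred, decreasing = TRUE)  # assumed ordering: highest prediction first
  y <- actual[ord]
  n <- length(y)
  x <- seq_len(n) / n                    # x_1, ..., x_n
  L <- cumsum(y) / sum(y)                # L(x_1), ..., L(x_n)
  d <- c(0, L - x)                       # L(x_i) - x_i for i = 0, ..., n
  trapezoid   <- sum((1 / n) * (d[-1] + d[-(n + 1)]) / 2)
  closed_form <- sum(L - x) / n
  c(trapezoid = trapezoid, closed_form = closed_form)
}

res_unw <- unweighted_gini_sketch(actual = c(1, 0, 1, 0, 0),
                                  pred   = c(0.9, 0.8, 0.6, 0.3, 0.1))
print(res_unw)
```

Both entries agree (0.2 for this toy data), as the derivation predicts.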

Note that neither $1/2$ nor $1/n$ is applied to the sums in the weighted and unweighted functions, since most people will use the normalized versions, in which case these factors just cancel.
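A brief sketch of why the constant can be dropped, using the unweighted case. The helper `raw_sum` and the toy data are hypothetical, and normalizing by the same sum computed with the actual values as the predictions (a perfect ordering) is an assumption about how the normalized version is formed.

```r
# Hypothetical helper: the unscaled sum of L(x_i) - x_i, without the 1/n factor.
raw_sum <- function(actual, pred) {
  y <- actual[order(pred, decreasing = TRUE)]
  n <- length(y)
  sum(cumsum(y) / sum(y) - seq_len(n) / n)
}

actual <- c(1, 0, 1, 0, 0)
pred   <- c(0.9, 0.8, 0.6, 0.3, 0.1)
n      <- length(actual)

# Normalized Gini: model sum divided by the sum under a perfect ordering.
unscaled <- raw_sum(actual, pred) / raw_sum(actual, actual)
scaled   <- (raw_sum(actual, pred) / n) / (raw_sum(actual, actual) / n)
print(c(unscaled = unscaled, scaled = scaled))
```

The ratio is the same with or without the $1/n$ factor; the same argument applies to the $1/2$ in the weighted case.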