Iterative homogenization by elimination
Problem
Let’s say we have \(S\) objects and each object is associated with one of the \(N\) classes. There are \(c_i\) objects associated with the \(i\)th class. We want to eliminate only \(b\) objects in a way that forms distribution of classes as close as possible to the uniformal distribution.
Intuition
Intuition suggests that we can eliminate one object at a time by selecting a class with the largest number of objects in it. It’s not quite obvious whether this is the best strategy to follow. For example, we can start with the following distribution of classes  \(c = [12, 6, 7, 13]\). Next, we can eliminate 2 objects from it (\(b = 2\)). After elimination we can end up with 10 different outcomes. We can follow our initial strategy and end up with the following distribution: \([12, 6, 7, 11]\) (or \([11, 6, 7, 12]\)). Using some other strategy we can get a different outcome, for example, \(c = [12, 6, 6, 12]\). How do we know which one is closer to the uniformal distribution.
The word “closer” impliest that we need to have a certain measure that allows us to measure homogenization of the distribution. In order to get the solution we need to better define our objective.
Objective
We can use Kullback–Leibler divergence (KL divergence) in order to measure homogenization of the distribution. We can normalize \(c\) in order to convert it to a probability distribution \(p\)
\[p = [p_1, p_2, ..., p_N] \\ p_i = \frac{c_i}{\sum_{j=1}^{N}{c_j}} \\\]And we can define desirable uniformal distribution
\[u = [u_1, u_2, ..., u_N] \\ u_i = \frac{1}{N}\]KL divergence is not symmetric. Comparing \(u\) to \(p\) is not the same as comparing \(p\) to \(u\). We can show that for this problem both versions will lead to the same solution and order in which distributions are specified is not important for our purpose.
We can start with the following definition
\[D_{KL}(u \,  \, p) = \sum_{i=1}^N{u_i \log \frac{u_i}{p_i} } \\\]For our problem we want to subtract \(b\) objects and after this we will end up with a distribution \(q\)
\[q = [q_1, q_2, ..., q_N] \\ q_i = \frac{c_i  x_i}{\sum_{j=1}^{N}{c_j  x_j}} \text{, where} \sum_{i=1}^{N}{x_i}=b\]Optimization could be sumarized in the following way
\[\begin{align} \mathbf{\min_x} \, & D_{KL}(u \,  \, q) & \\ \text{subject to: } & \sum_{i=1}^{N}{x_i}=b & \\ & x_i \geq 0 & \forall i=1,...,N \end{align}\]Solution
First, we can notice that original function that we want to opimize can be simplified
\[\underset{x}{\operatorname{arg\,min}} D_{KL}(u \,  \, q) = \underset{x}{\operatorname{arg\,max}} \sum_{i=1}^{N} \log q_i\]We can use Karush–Kuhn–Tucker (KKT) conditions in order to solve this problem. Objective could be defined in the following way
\[L(x, \lambda, \{\eta_j\}) = \sum_{j=1}^{N} \log q_j  \lambda \left(\sum_{j=1}^{N}{x_j}  b \right) + \sum_{j=1}^{N} \eta_j x_j\]and we need to solve the following equation in order to find optimal solution
\[\frac{\partial}{\partial x_i}\left(\sum_{j=1}^{N} \log q_j\right)  \lambda + \eta_j = 0\]we can rewrite this equation in order to have \(x_i\) specified explicitly
\[\frac{1}{c_i  x_i} + \frac{N}{\sum_{j=1}^{N}{c_j  x_j}} = \lambda  \eta_i\]From the equation above we can notice that
\[\eta_i  \frac{1}{c_i  x_i} = \eta_j  \frac{1}{c_j  x_j} \text{, where} \, i \ne j\]From the Complementary slackness in the KKT conditions we know that \(\eta_i x_i = 0\). This condition implies that \(\eta_i\) and \(x_i\) cannot be nonzero values at the same time. We can consider two cases:

There exists at least one more nonzero \(x_i\). From this assumption the following statement is true (since \(\eta_i = \eta_j = 0\))
\[\frac{1}{c_i  x_i} = \frac{1}{c_j  x_j} \text{, where} \, i \ne j\]This equation says that \(c_i  x_i = c_j  x_j\). This means that after elimination we should end up with exactly the same values in each category.

There exists at least on \(x_i=0\). From this assumption the following statement is true:
\[\eta_i  \frac{1}{c_i} = \frac{1}{c_j  x_j}\]and since \(\eta_i \ge 0\) we get that
\[c_j \gt c_j  x_j \ge c_i\]Since we assumed that \(x_i=0\) and \(x_j \ne 0\) the equation above implies that every zero is associated with the smallest \(c_i\) value and every nonzero is associated with the largest \(c_j\) value. This observation matches our initial intuition. Equation implies that we should remove objects from the most common classes first.
Algorithm
There are multiple different algorithms that could be derived from the previous equations. We can find one by using the following definitions
 \(A = \{ i \,  \, x_i=0 \} \)
 \(\overset{\_}{A} = \{ i \,  \, x_i\ne0 \}\)
 \(\overset{\_}{A} = m \) and \(A = n m \)
 \(\sum_{i=1}^N{c_i} = C\)
 \( k = c_i  x_i, \forall i \in \overset{}{A}\)
 \(c_i \le c_j, \forall i \le j \)
If we assume that there is at least one \(x_i = 0\) then the following should be true
\[\begin{align} c_{nm} &\le \frac{\sum_{j \in \overset{\_}{A}}{c_j}  b}{m} \\ b &\le \sum_{j \in \overset{\_}{A}}{c_j}  m \, c_{nm} \end{align}\]and this equation will work if \(c_0 = 0\) (which implies that we have N+1 category, but one additional category doesn’t have objects in it).
In addition, after some rearrangement of terms we can create the following equation.
\[\begin{align} b &\le \sum_{a = 1}^{m} a \, (c_{n  a + 1}  c_{n  a}) \end{align}\]Reversing order of distributions in the KL divergence function
\[\begin{align} \mathbf{\min_x} \, & D_{KL}(q \,  \, u) & \\ \text{subject to: } & \sum_{i=1}^{N}{x_i}=b & \\ & x_i \geq 0 & \forall i=1,...,N \end{align}\]As before, main equation could be simplified
\[\underset{x}{\operatorname{arg\,min}} \, D_{KL}(q \,  \, u) = \underset{x}{\operatorname{arg\,min}} \sum_{i=1}^{N} q_i \log q_i\]On the right side, we have a negative entropy, and minimum will mean that we just need to maximize entropy of the distribution \(q\)
\[\underset{x}{\operatorname{arg\,min}} \, D_{KL}(q \,  \, u) = \underset{x}{\operatorname{arg\,max}} \, H(q)\] \[L(x, \lambda, \{\eta_j\}) = \sum_{j=1}^{N} q_j \log q_j + \lambda \left(\sum_{j=1}^{N}{x_j}  b \right)  \sum_{j=1}^{N} \eta_j x_j\]from which follows that
\[\sum_{j=1}^{N} q_j \log q_j  \frac{\log q_i}{S} + \lambda  \eta_i = 0 \\ \eta_i + \frac{\log q_i}{S} = \lambda + \frac{1}{S} \sum_{j=1}^{N} q_j \log q_j\]where \(S = \sum_{j=1}^{N} c_j  x_j \)
and based on the same logic as before we can show that
\[\eta_i + \frac{\log (c_i  x_i)}{S} = \eta_j + \frac{\log (c_j  x_j)}{S}, \forall i,j = 1...N\]And again, we can go through two cases:
 if we assume that \(x_i=0\) and \(x_j\ne0\) then \(c_j \gt c_j  x_j \ge c_i\)
 if we assume that \(x_i\ne0\) and \(x_j\ne0\) then \(c_j  x_j = c_i  x_i\)
The same conclusion could be drawn from these equations which shows that order of the distribution doesn’t make a difference for this problem
Handling discrete cases
Developed equations can produce noninteger solutions. These results won’t be suitable for our initial problem. Solution could be easily modified in order to work with integers. First, we can notice that noninteger values will be produced only for cases when \( \sum_{j \in \overset{\_}{A}}{c_j}  b \) is not divisible by \(m\) without the remainder. Instead, we can use the following equation
\[k = \left\lceil\frac{\sum_{j \in \overset{\_}{A}}{c_j}  b}{m}\right\rceil  s_j\]where
\[s_j = \begin{cases} 1 &\quad j \gt nr \\ 0 &\quad\text{otherwise.} \end{cases}\]and
\[\\ r = \left(\sum_{j \in \overset{}{A}}{c_j}  b\right) \bmod {m}\]This is just one way of defining \(s_j\) function, but any strategy will work. It’s aesy to see why this solution is the best that we can get from the previous equations and given that \(m\) categories have exactly the same value it doesn’t really matter from which of the categories we subtract remaining ones.