Agglomerative Information Bottleneck (AIB)

aib.h implemens the Agglomerative Information Bottleneck (AIB) algorithm as first described in  .

AIB takes a discrete valued feature $x$ and a label $c$ and gradually compresses $x$ by iteratively merging values which minimize the loss in mutual information $I(x,c)$.

While the algorithm is equivalent to the one described in  , it has some speedups that enable handling much larger datasets. Let N be the number of feature values and C the number of labels. The algorithm of  is $O(N^2)$ in space and $O(C N^3)$ in time. This algorithm is $O(N)$ space and $O(C N^2)$ time in common cases ( $O(C N^3)$ in the worst case).

# Overview

Given a discrete feature $x \in \mathcal{X} = \{x_1,\dots,x_N\}$ and a category label $c = 1,\dots,C$ with joint probability $p(x,c)$, AIB computes a compressed feature $[x]_{ij}$ by merging two values $x_i$ and $x_j$. Among all the pairs $ij$, AIB chooses the one that yields the smallest loss in the mutual information

$D_{ij} = I(x,c) - I([x]_{ij},c) = \sum_c p(x_i) \log \frac{p(x_i,c)}{p(x_i)p(c)} + \sum_c p(x_i) \log \frac{p(x_i,c)}{p(x_i)p(c)} - \sum_c (p(x_i)+p(x_j)) \log \frac {p(x_i,c)+p(x_i,c)}{(p(x_i)+p(x_j))p(c)}$

AIB iterates this procedure until the desired level of compression is achieved.

# Algorithm details

Computing $D_{ij}$ requires $O(C)$ operations. For example, in standard AIB we need to calculate

$D_{ij} = I(x,c) - I([x]_{ij},c) = \sum_c p(x_i) \log \frac{p(x_i,c)}{p(x_i)p(c)} + \sum_c p(x_i) \log \frac{p(x_i,c)}{p(x_i)p(c)} - \sum_c (p(x_i)+p(x_j)) \log \frac {p(x_i,c)+p(x_i,c)}{(p(x_i)+p(x_j))p(c)}$

Thus in a basic implementation of AIB, finding the optimal pair $ij$ of feature values requires $O(CN^2)$ operations in total. In order to join all the $N$ values, we repeat this procedure $O(N)$ times, yielding $O(N^3 C)$ time and $O(1)$ space complexity (this does not account for the space need to store the input).

The complexity can be improved by reusing computations. For instance, we can store the matrix $D = [ D_{ij} ]$ (which requires $O(N^2)$ space). Then, after joining $ij$, all of the matrix D except the rows and columns (the matrix is symmetric) of indexes i and j is unchanged. These two rows and columns are deleted and a new row and column, whose computation requires $O(NC)$ operations, are added for the merged value $x_{ij}$. Finding the minimal element of the matrix still requires $O(N^2)$ operations, so the complexity of this algorithm is $O(N^2C + N^3)$ time and $O(N^2)$ space.

We can obtain a much better expected complexity as follows. First, instead of storing the whole matrix D, we store the smallest element (index and value) of each row as $(q_i, D_i)$ (notice that this is also the best element of each column since D is symmetric). This requires $O(N)$ space and finding the minimal element of the matrix requires $O(N)$ operations. After joining $ij$, we have to efficiently update this representation. This is done as follows:

• The entries $(q_i,D_i)$ and $(q_j,D_j)$ are deleted.
• A new entry $(q_{ij},D_{ij})$ for the joint value $x_{ij}$ is added. This requires $O(CN)$ operations.
• We test which other entries $(q_{k},D_{k})$ need to be updated. Recall that $(q_{k},D_{k})$ means that, before the merge, the value closest to $x_k$ was $x_{q_k}$ at a distance $D_k$. Then
• If $q_k \not = i$, $q_k \not = j$ and $D_{k,ij} \geq D_k$, then $q_k$ is still the closest element and we do not do anything.
• If $q_k \not = i$, $q_k \not = j$ and $D_{k,ij} < D_k$, then the closest element is $ij$ and we update the entry in constant time.
• If $q_k = i$ or $q_k = j$, then we need to re-compute the closest element in $O(CN)$ operations.

This algorithm requires only $O(N)$ space and $O(\gamma(N) C N^2)$ time, where $\gamma(N)$ is the expected number of times we fall in the last case. In common cases one has $\gamma(N) \approx \mathrm{const.}$, so the time saving is significant.