Author: J. Wang
Summary:
Two catergories of hashing methods:
1. Unsupervised hashing methods:
pros- good performance with metric distances
cons- in image search, semantic similarity is usually given in terms of labeled pairs of images; slow to train
2. Supervised hashing methods:
pros- can handle such semantic similarity
cons- are prone to overfitting when labeled data is small or noisy; slow to train(much slower in comparison to the unsupervised methods)
Semi-Supervised Hashing (SSH)
The technique can leverage semantic similarity using labeled data while remaining robust to overfitting. SSH is also much faster than existing supervised hashing methods and can be easily scaled to large datasets.
Given a set of n points, X = {xi}, in which a fraction of pairs are associated with two categories of label information, M and C. A pair (xi, xj) ∈ M is denoted as a neighbor-pair in which xi and xj are either neighbors in a metric space or share common class labels. Similarly, (xi, xj) ∈ C is called a nonneighbor-pair if two samples are far away in metric space or have different class labels. The matrix Xl is formed by L columns of X. The goal of SSH is to learn hash functions that minimize the error on the labeled training data Xl, while maximally satisfying the desirable properties of hashing.
Formulation
Hashing aims to map the data X to a Hamming space to obtain its compact representation. In this work, we use linear projection coupled with mean thresholding as a hash function. Let H = [h1, . . . , hK] be a sequence of K hash functions and W = [w1, . . . ,wK]. We want to learn a W that gives the same bits for (xi, xj) ∈M and different bits for (xi, xj) ∈ C.
Objective function
The above objective function in a compact matrix form by first defining a matrix S.
Suppose H(Xl) maps the points in Xl to their K-bit hash codes. Then, the objective function J(H) can be represented as,
Since the above function measures only the empirical accuracy, it is prone to overfitting especially when the size of labeled set is small compared to the entire dataset (i.e. L ≪ n). To get better generalization ability, one needs to add “regularization” by incorporating conditions that lead to desirable properties of hash codes, independent of the performance on the labeled training data. The balancing property specifies that each hash function hk(·) should partition the entire dataset X into two sets of equal size.
Relaxing Objective Function
Replace the sign of projection with its signed magnitude. The relaxation not only desires similar points to have the same sign but also large projection magnitudes, meanwhile projecting dissimilar points not only with different signs but also as far as possible.
The new objective function
To avoid dealing with non-differentiable sgn(·) function, we maximize directly the variance of the projected data. The overall objective function:
The learning of optimal projectionsWbecomes a typical eigenproblem, which can be easily solved by doing an eigenvalue decomposition on matrix M:
where λ1 > λ2 > · · · > λK are the top eigenvalues of M and ek, k = 1, · · · ,K are the corresponding eigenvectors.
The first (supervised) term is the empirical accuracy of the learned hash functions on the pairwise labeled data, while the second (unsupervised) term is a regularizer that prefers those directions that maximize the variance of the projections subject to orthogonality constraints.
Relaxing Orthogonality Constraints
For most real-world datasets, most of the variance is contained in top few projections. The orthogonality constraints force one to progressively pick those directions that have very low variance, substantially reducing the quality of lower bits, and hence the whole embedding.
Instead of imposing hard orthogonality constraints, we convert them into a penalty term added to the objective function.
However, the above objective function is non-convex and there is no easy way to find the global solution unlike the previous case.
The solution for non-orthogonal W is also obtained by direct eigendecomposition of M similar to the orthogonal version in (17) resulting in negligible computational overhead.
Comment: