2013年5月18日 星期六

[Summary] A Global Geometric Framework for Nonlinear Dimensionality Reduction

Topic: A Global Geometric Framework for Nonlinear Dimensionality Reduction

Author: J.B. Tenenbaum

Summary:

Scientists working with large volumes of high-dimensional data confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations.

The challenge of non-linearity with data lying on a two-dimensional “Swiss roll”: points
far apart on the underlying manifold, as measured by their geodesic, or shortest path,  distances, may appear deceptively close in the high-dimensional input space, as measured by their straight-line Euclidean distance.



The complete isometric feature mapping, or Isomap, algorithm has three steps:

1. Construct neighborhood graph: Determine which points are neighbors on the manifold M, based on the distances dX (i, j) between pairs of points i, j in the input space X. These neighborhood relations are represented as a weighted graph G over the data points, with edges of weight dX(i, j) between neighboring points.

2. Compute shortest graphs: Estimate the geodesic distances dM(i, j) between all pairs
of points on the manifold M by computing their shortest path distances dG(i, j) in the graph G.

3. Construct d-dimensional embedding: Applies classical MDS to the matrix of graph distances DG={dG(i, j)}, constructing an embedding of the data in a d-dimensional Euclidean space Y that best preserves the manifold’s estimated intrinsic geometry. The coordinate vectors yi for points in Y are chosen to minimize the cost function:





As with PCA or MDS, the true dimensionality of the data can be estimated from the decrease in error as the dimensionality of Y is increased. Isomap is guaranteed asymptotically to recover the true dimensionality and geometric structure of a strictly larger class of nonlinear manifolds.

The guarantees of asymptotic convergence rest on a proof that as the number of data points increases, the graph distances dG(i, j) provide increasingly better approximations to the intrinsic geodesic distances dM(i, j), becoming arbitrarily accurate in the limit of infinite data. How quickly dG(i, j) converges to dM(i, j) depends on certain parameters of the manifold as it lies within the high-dimensional space (radius of curvature and branch separation) and on the density of points.

沒有留言:

張貼留言