arXiv:1507.08818v1 [cs.CV] 31 Jul 2015

Extracting Visual Patterns from Deep Learning Representations

Garcia-Gasulla, D. and Ayguadé, E. and Labarta, J.

Béjar, J. and Cortés, U.

Barcelona Supercomputing Center, Barcelona, SPAIN [email protected]

Universitat Politècnica de Catalunya - BarcelonaTECH, SPAIN

Abstract

Vector-space word representations based on neural network models can include linguistic regularities, enabling semantic operations based on vector arithmetic. In this paper, we explore an analogous approach applied to images. We define a methodology to obtain large and sparse vectors from individual images and image classes, by using a pre-trained model of the GoogLeNet architecture. We evaluate the vector-space after processing 20,000 ImageNet images, and find it to be highly correlated with WordNet lexical distances. Further exploration of image representations shows how semantically similar elements are clustered in that space, regardless of large visual variances (e.g., 118 kinds of dogs), and how the space distinguishes abstract classes of objects without supervision (e.g., living things from non-living things). Finally, we consider vector arithmetic, and find it to be related to image concatenation (e.g., “horse cart − horse ≈ rickshaw”), image overlap (“Panda − Brown bear ≈ Skunk”) and regularities (“Panda is to Brown bear as Skunk is to Badger”). All these results indicate that visual semantics contain a large amount of general information, and that those semantics can be extracted as vector representations from neural network models, making them available for further learning and reasoning.

Introduction

Deep learning networks are representation-learning methods that process raw, high-dimensional inputs for detection or classification (LeCun, Bengio, and Hinton 2015). The representation built by deep networks is distributed through multiple features in multiple layers, with each feature providing a different portion of information. The features composing the representation are non-linearly related to one another, making their conjunction a remarkably rich knowledge representation language. So far, deep learning networks have mostly been used to exploit the language of their learnt representations straightforwardly, through tasks like image classification. However, the abundant information coded within deep networks can be exploited for other, more abstract problems, such as reasoning. For that purpose, one first needs to extract the information coded within the black box of each feature, so that other algorithms and methodologies can be applied to it. One way of doing so is through vector-space representations obtained from neural network language models: a vector-space that is then explored for regularities through vector arithmetic. Such is the approach taken by (Mikolov, Yih, and Zweig 2013), where the authors find both syntactic (e.g., singular/plural) and semantic (e.g., male/female) regularities in vector representations of words. In this paper we also try to extract information from within neural network models, but in this case for the more complex domain of images. This will lead us to work with deep networks, capable of capturing the complexity and variety of information found in the visual domain. Using a previously trained network and its internal features as descriptors, we build vectors of features for a set of images as these are processed by the trained network. Once the vector-space has been defined, we analyze which information is coded in it, and how we can exploit it for learning or reasoning purposes. Results provide insight into the contents of deep network models, and open up a new set of applications exploiting deep learning network representations.

Motivation

Linguistic vector-space representations of words obtained from neural network language models have been analyzed for syntactic and semantic regularities (Mikolov, Yih, and Zweig 2013). Regularities have been shown to emerge from vector arithmetic applied to the continuous space defined by these vectors, and are useful for tasks like machine translation (Mikolov, Le, and Sutskever 2013). Although the original representation proposed by Mikolov et al. was dense and low-dimensional, further research showed that sparse high-dimensional representations could also encode similar information (Levy, Goldberg, and Ramat-Gan 2014). Inspired by that work, the original motivation of this research was to build an image2vec system, to build and exploit vector-space representations of images¹. Given one or more visual representations of an object processed through a neural network, our goal was to obtain a distributed, comparable vector representation which could be used to perform visual reasoning and be applied to visual learning tasks. However, the differences between the linguistic domain and the visual domain brought us to tackle the problem differently. word2vec uses an embedding model to build its vector representation, defining each word through the words surrounding it. Since the number of possible contexts for each word is relatively small, word2vec can operate using shallow neural networks, thus producing dense and low-dimensional representations. In the visual domain things are more complex, as objects can undergo changes in brightness, background, obstruction, rotation, scale, context, etc. Hence, such a low-dimensional representation would be unable to capture the necessarily rich domain of visual patterns needed to characterize objects abstractly. For that purpose we had to go much deeper, extracting large amounts of features from a deep network. Discriminant systems running on a deep network focus on the output of top-layer features because these hold the highest discriminative power. Our purpose, however, is representativeness, which is why we focus on all features of all layers in a deep network.

¹ Although this term clarifies the initial purpose of our research, we discourage its use to refer to our contribution due to the huge differences between word2vec and our work.

Methodology

A convolutional neural network (CNN) trained with labeled images learns visual patterns useful for discriminating those labels. In a deep network, millions of such patterns are learnt, implemented as activation functions (e.g., ReLU) within the network features. The hierarchical nature of CNN layers defines non-linear relations among its features, as some become the input of others. Regardless, each of the learnt patterns remains a relevant unit of information for the discrimination task, thanks to the unique combination of parameters it defines. Analogously, each feature can be considered to provide a significant piece of visual knowledge for the description of an image. By considering all that knowledge together for a given image, one is in fact considering everything the network knows about the image, as taught by its previous training. And that is what we expect an image vector representation based on feature activations to capture. The precision and specificity of the resultant vector representation will be bounded by the variety and quality of the patterns found in the deep network; patterns capable of discriminating many different image classes with precision focus on many details, and thus provide richer image descriptions with a larger representative power.

To maximize both descriptive accuracy and descriptive detail we used the GoogLeNet deep convolutional neural network architecture (Szegedy et al. 2014), a very deep network (22 layers) which obtained the best results at the ILSVRC14 visual recognition challenge (Russakovsky et al. 2015). We used the GoogLeNet pre-trained model available in the Caffe deep learning framework (Jia et al. 2014), which was trained using the 1.2 million images of the ImageNet training set. This model was trained to discriminate among the 1,000 leaf-node categories of the ImageNet hierarchy. The GoogLeNet architecture is based on Inception modules, and the available model contains 9 of those. To build the vector representations we use the output of those modules, particularly the three 1x1, 3x3 and 5x5 convolution layers from which the output of the module is produced, skipping the parallel pooling path. When an image is run through the trained network for its classification, these 27 layers combined produce over 1 million activations, expressing the presence and relevance of 1 million different visual patterns in the input image. We treat each of those activations as an independent variable describing distinct visual information of the image. Thus, our high-dimensional, sparse image vector representation is composed of over 1 million continuous variables².

The experiments described in the Evaluation, Clusters of Image Classes and Exploring Class Relations sections were run on the MareNostrum supercomputer, using two Intel SandyBridge-EP E5-2670/1600 20M 8-core processors at 2.6 GHz and 64 GB of RAM. We used 20,000 images of the ImageNet validation set. All code used to process the activation files is available at https://github.com/dariogarcia/tiramisu, including the Python scripts used to build all figures and graphs.

² The feature extraction code is documented in Caffe. Our code to store it is available at https://github.com/dariogarcia/caffe
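The activations can be read out with pycaffe. The following is a minimal sketch, not the authors' extraction code: it assumes the blob names of the public BVLC GoogLeNet deploy prototxt, and the file paths and simplified preprocessing (mean subtraction omitted) are illustrative assumptions.

```python
# Sketch: concatenate the 1x1/3x3/5x5 convolution outputs of every Inception
# module into one sparse, high-dimensional vector per image.
import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'bvlc_googlenet.caffemodel', caffe.TEST)

MODULES = ['inception_3a', 'inception_3b', 'inception_4a', 'inception_4b',
           'inception_4c', 'inception_4d', 'inception_4e',
           'inception_5a', 'inception_5b']
BRANCHES = ['1x1', '3x3', '5x5']          # the parallel pooling path is skipped

def image_vector(image_path):
    """Return the ~1M-dimensional activation vector of one image."""
    im = caffe.io.load_image(image_path)                 # HxWx3, RGB, in [0, 1]
    im = caffe.io.resize_image(im, (224, 224))
    # Caffe expects CxHxW, BGR, 0-255 (mean subtraction omitted in this sketch)
    net.blobs['data'].data[0] = im.transpose(2, 0, 1)[::-1] * 255.0
    net.forward()
    parts = []
    for mod in MODULES:
        for br in BRANCHES:
            blob = net.blobs['%s/%s' % (mod, br)].data[0]  # post-ReLU activations
            parts.append(blob.ravel().copy())
    return np.concatenate(parts)          # mostly zeros: a sparse description
```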

Image Classes

One of the main challenges of any image classification system, such as GoogLeNet, is to be robust to changes in brightness, perspective, scale, context, etc. For that purpose, deep networks are trained with very large numbers of images, developing a huge variety of visual patterns and building rich and thorough object representations. We are interested in exploring and exploiting the visual semantics, not of specific images, but of abstract visual entities. For that reason we combine the information provided by each specific image of the same class, its vector representation, into an image class vector representation. By doing so we intend to capture all the visual aspects of each entity and to obtain representative values of each variable with respect to the image class. An aggregated image class vector has the same size as an image vector (roughly 1 million continuous variables), and is computed as the arithmetic mean of all images available for that class. At the end of this aggregation process we obtain 1,000 vectors, corresponding to the representations of each of the 1,000 leaf-node categories in the ImageNet hierarchy. Given the 20,000 images processed, the number of images aggregated per class ranges from 11 to 32. Alternative aggregation methodologies were evaluated, as discussed in the Parametrization section.

The 1,000 categories of the ImageNet hierarchy correspond to very diverse entities and objects. Some of them are simple and plain objects, producing few and weak activations. Others are richer and more complex, involving more and stronger activations. To provide each image class with an analogous amount of information, we apply a normalization to the image class vectors. Instead of normalizing each image class vector as a whole, we perform this process layer by layer, with the goal of making the information available at each visual resolution equally relevant for its representation. Alternative normalization methods, including no normalization, were evaluated, as discussed in the Parametrization section.
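A NumPy sketch of this aggregation and per-layer normalization step is given below. It assumes the image vectors of one class are stacked as rows of a matrix and that the feature ranges of the 27 layers are known; the use of the L2 norm is an assumption, as the paper only states that normalization is applied layer by layer.

```python
import numpy as np

def class_vector(image_vectors, layer_slices):
    """Aggregate the image vectors of one class and normalize each layer's sub-vector.

    image_vectors: (n_images, n_features) array of per-image activations.
    layer_slices:  list of slice objects, one per convolution layer (27 in total),
                   partitioning the feature dimension. Both names are hypothetical.
    """
    agg = image_vectors.mean(axis=0)          # arithmetic mean over the class images
    out = np.zeros_like(agg)
    for sl in layer_slices:
        norm = np.linalg.norm(agg[sl])        # L2 norm is an assumption of this sketch
        if norm > 0:
            out[sl] = agg[sl] / norm
    return out
```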

Vector Operations

Image class vector representations are built by aggregating and normalizing the activations of several images within a single vector. The whole process is depicted in Figure 1.

Figure 1: Feature extraction, aggregation and normalization process followed to build image class vector representations.

To study the information contained within the vector-space, we explored image class similarities through vector distance measures. In detail, we use the cosine similarity to compute the distance matrix between the 1,000 image classes, and use that matrix to explore the vector-space. We compare those distances with several distances based on WordNet in the Evaluation section. Other similarity measures were also evaluated, as discussed in the Parametrization section.

Besides vector similarities, we also explore vector arithmetic. In the past, regularities of the form “a is to a* as b is to b*” have been explored in the linguistic context (Mikolov, Yih, and Zweig 2013). Given the more isolated nature of image semantics, we will initially consider simpler relations of the form “a − b ≈ c”. For that purpose we use the subtraction operator on image class vectors, and combine it with the cosine similarity measure in the Exploring Class Relations section. Given two images i1 and i2 and a feature f, the subtraction of i2 from i1 is defined as

f(i_1 - i_2) = \begin{cases} f(i_1) - f(i_2), & \text{if } f(i_1) \geq f(i_2) \\ 0, & \text{otherwise} \end{cases}
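Both operations are elementwise and translate directly into NumPy. The sketch below reflects the definition above; the nearest_classes helper and its arguments are illustrative, not part of the authors' code.

```python
import numpy as np

def subtract(v1, v2):
    """Feature-wise subtraction clipped at zero, as defined above."""
    return np.maximum(v1 - v2, 0.0)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def nearest_classes(query, class_vectors, k=5):
    """Rank the class vectors (a dict: name -> vector) by similarity to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in class_vectors.items()]
    return sorted(scored, key=lambda x: -x[1])[:k]

# e.g., nearest_classes(subtract(cls['horse cart'], cls['sorrel']), cls)
# would be expected to rank 'rickshaw' highly (see Exploring Class Relations).
```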

Evaluation

Analyzing the amount and variety of information captured in the vector-space defined in the Methodology section is not straightforward. Besides the pictures, the only source of information available for each of the 1,000 contained classes is their label, which is mapped to a WordNet concept. WordNet, a lexical database composed of groups of synonymous words named synsets, associates terms through semantic relations such as hypernym/hyponym. Thus, WordNet captures part of the semantics of the classes represented as vectors. Although the vector-space and WordNet contain different semantics (visual semantics are very different from lexical semantics), we use WordNet as a baseline.

Given the vector-space representation, we extract the distance between every pair of image classes captured. We do so using the cosine similarity, and obtain, for every image class, a sorted list of the rest of the classes based on their distance: a ranking. An analogous process is done through WordNet, using the image class labels to find their closest classes in the lexical database. Thus we have, for each image class, two rankings: one according to the vector-space and one according to WordNet. For each pair of rankings we compute Spearman's ρ, which provides a measure of ranking correlation. This measure is bounded between -1 and 1, with values close to either -1 or 1 indicating a strong correlation. We perform this study for the rankings of the 1,000 image classes, resulting in a distribution of correlations. This distribution tells us how close the two semantic spaces are. For WordNet we consider six different similarity measures: three based on path length between concepts in the hypernym/hyponym taxonomy (Path, LCh and WuP) and three based on information content, a corpus-based measure of the specificity of a concept (Res, JCn and Lin). All six measures are described in (Pedersen, Patwardhan, and Michelizzi 2004). Additionally, for the three corpus-based measures we use two different corpora, the Brown Corpus and the British National Corpus (bnc). Given the essential differences between WordNet and the vector-space, we expect the nine measures combined to produce a consistent baseline for the evaluation of vector-space distances.

Figure 2 shows the distribution of values for each of the nine WordNet measures. The distribution of ρ values is mostly found between 0.4 and 0.6, indicating a strong correlation between distances in the vector-space and WordNet distances. These results are consistent across all nine WordNet measures used; the average ρ for the 1,000 rankings is 0.44 in the worst case (JCn bnc) and 0.49 in the best case (Res Brown & Res bnc). Thus, vector representations contain a large amount of semantic information also captured by WordNet. This is a particularly interesting result, considering that WordNet is an imperfect evaluation measure, incapable of capturing visual semantics such as color patterns or proportions.

Figure 2: Histograms of Spearman's ρ correlation between nine WordNet similarity measures and image class vector similarity.
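The ranking-correlation computation itself is compact. The sketch below uses NLTK's WordNet interface and SciPy rather than the WordNet::Similarity package cited above, shows only the Path measure, and assumes hypothetical class_vectors and synset_of mappings.

```python
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr
from nltk.corpus import wordnet as wn  # noqa: F401  (synset_of is built on top of it)

def spearman_for_class(label, class_vectors, synset_of):
    """Correlation between the vector-space and WordNet orderings for one class.

    class_vectors: dict label -> activation vector; synset_of: label -> WordNet synset.
    Both mappings are assumed to exist in the surrounding pipeline.
    """
    others = [l for l in class_vectors if l != label]
    vec_sims = [1.0 - cosine(class_vectors[label], class_vectors[l]) for l in others]
    wn_sims = [synset_of(label).path_similarity(synset_of(l)) or 0.0 for l in others]
    rho, _ = spearmanr(vec_sims, wn_sims)   # rho near 1: the two orderings agree
    return rho
```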

Parametrization

In the Methodology section we described the process followed to build and compare vector representations. For each of the steps in this process we considered the various options available regarding specific methods and algorithms. To guide our choices and determine which specific solution was most appropriate, we used the distribution of Spearman's ρ coefficients between distances in the vector-space and WordNet distances, as detailed in the Evaluation section. The parameters we considered were the following:

• Aggregation: To compute the image class vector representation from a set of image vectors we used a mean. We tested the three Pythagorean means: the arithmetic mean, the geometric mean, and the harmonic mean (a sketch of these three means is given at the end of this section).

• Normalization: To normalize vectors we tested overall vector normalization and 27 sub-vector normalizations, based on the 27 layers of features found in each vector. In addition, normalization can be applied to images, before these are aggregated into image classes, or to image classes after aggregation. We also considered the case where no normalization was performed.

• Distance: To compute distances between vectors we tested both the cosine distance and the Euclidean distance.

• Threshold: To decrease variability we considered adding an activation threshold which would disregard feature values between 0 and 1 when building image vector representations.

We tested the combinations of all those parameters, and found the best setting to be the arithmetic mean, image class normalization by layer, cosine distance and no threshold. According to the method used for evaluation, the aggregation and normalization parameters had the largest impact on performance. The distance measure, and particularly the use of an activation threshold, had much smaller effects.
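For reference, the three aggregation candidates can be written as a single NumPy/SciPy helper; this is a sketch of the options compared above, not the authors' implementation, and the small offset added for the geometric and harmonic means is an assumption made to handle zero activations.

```python
import numpy as np
from scipy.stats import gmean, hmean

def aggregate(image_vectors, method='arithmetic'):
    """Aggregate per-image activation vectors with one of the three Pythagorean means."""
    if method == 'arithmetic':
        return image_vectors.mean(axis=0)
    if method == 'geometric':
        return gmean(image_vectors + 1e-12, axis=0)   # offset avoids log(0) on sparse vectors
    if method == 'harmonic':
        return hmean(image_vectors + 1e-12, axis=0)   # hmean needs strictly positive inputs
    raise ValueError('unknown aggregation method: %s' % method)
```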

Clusters of Image Classes

To further explore the information captured in the image vector-space representations, we analyze how clusters of image classes settle within the space embedding. Clusters are obtained from WordNet hierarchy concepts, which for example specify which image classes are mammals and which are not. To display a 1,000 × 1,000 distance matrix in a single plot we apply metric multi-dimensional scaling (Borg and Groenen 2005) with two dimensions, a method that builds a two-dimensional mapping of the vector distances which respects the pairwise original similarities (see Figures 3-4).

Figure 3: Scatter plot of image class vector similarities built through metric multi-dimensional scaling. Black circles belong to images of types of dog. Dark grey circles belong to images of types of wheeled vehicle.

We started by considering two WordNet categories with many synsets within the 1,000 leaf-node categories of ImageNet: according to WordNet there are 118 specializations of dog in the ImageNet data, and 44 specializations of wheeled vehicle³. Given the two-dimensional similarity mapping, we highlight the location of the image classes belonging to one of these two sets in Figure 3. Image classes belonging to synsets of dog are highlighted in black, while image classes belonging to synsets of wheeled vehicle are painted in dark grey. The first conclusion drawn from looking at Figure 3 is that the two sets of highlighted images compose definable clusters. Although precision is not perfect, image classes belonging to the same WordNet category are clearly assembled together in the vector-space representation. That is particularly relevant considering the wide variety of dogs computed (e.g., Chihuahua, Husky, Poodle, Great Dane), which apparently have few visual features in common. According to these results, the visual features common to all dogs are more relevant than variable features such as size or proportion; according to the latter, a Chihuahua could be found to be more similar to rabbits or cats than to dogs. The cluster defined by the wheeled vehicle image classes has a lower precision than that of dogs, because wheeled vehicles are even more varied than dogs (e.g., monocycle, tank, train). Nevertheless, all but one wheeled vehicle are located in the same quadrant of the graph, indicating that there is a large and reliable set of features in the vector representation identifying images of this type. The one wheeled vehicle located outside of the middle-left quadrant, in the low-right part of Figure 3, corresponds to snowmobile, a unique type of wheeled vehicle.

³ To these 44 classes we added the school bus, minibus and trolleybus image classes, which we consider to be wheeled transports.

Figure 3 shows a separation in the two-dimensional representation, splitting the two-dimensional mapping of image classes into two sets. This separation is the only consistently sparse area visible at first sight in the graph, which means it may be of relevance for the vector-space. To explain this phenomenon we explored one of the most basic categorizations in WordNet, which separates ImageNet classes into two types of entities: living things, defined by WordNet as a living (or once living) entity, and the rest. By painting the images belonging to living things we obtain the graph of Figure 4. This graph shows how the separation found in the vector-space corresponds to this most simple categorization, clustering images depending on whether they depict living things or not. Remarkably, the precision of this categorization is almost perfect. The few mistakes made either correspond to organisms with unique shapes and textures (e.g., lobster, baseball player, dragonfly) or to things which are often depicted around living things (e.g., snorkel, dog sled). In the case of coral reef, according to WordNet it is not a living organism, but in the vector-space it is.

Figure 4: Scatter plot of image class vector similarities built through metric multi-dimensional scaling. Black circles belong to images of types of living things.

Encouraged by these results, we tried to obtain a representation of the vector-space which showed the separation between living organisms and the rest with more clarity. For that purpose we tested a non-linear mapping of the distances to three dimensions using the ISOMAP algorithm (Tenenbaum, de Silva, and Langford 2000). The features extracted from the network are combined non-linearly to obtain the class of an image, so this kind of transformation should highlight the inherent non-linearity of the vector-space. Figure 5 shows a more evident separation between these two synsets and also a more complex structure within the class images (a sketch of how both mappings can be computed is given at the end of this section).

Results obtained in the analysis of clusters in the vector-space indicate that these representations can be used to find groups of classes sharing broad semantics. The limitations of these semantics are hard to define at this point, if possible at all, as it would require us to define what can and what cannot be learnt only from images. Since the original deep network was not provided with semantic information on the 1,000 categories it discriminated, all semantics captured had to be extracted from visual features. The distinction between living things and the rest is a paradigmatic case given the variety of organisms computed. Horses, salmon, eagles, lizards and mushrooms seem to have little in common visually, and yet they are clustered together in this representation. Hence, it seems that textures, shapes or structural patterns of living things are distinct enough from the rest as to define a significant difference between them. Further exploring and explaining this behavior is an issue we intend to address as follow-up work.
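Both low-dimensional mappings can be obtained from the precomputed cosine-distance matrix with scikit-learn; the library choice and parameters below are assumptions (the paper cites Borg & Groenen for metric MDS and Tenenbaum et al. for ISOMAP, not a particular implementation), and precomputed distances for Isomap require a reasonably recent scikit-learn (≥ 0.22).

```python
import numpy as np
from sklearn.manifold import MDS, Isomap

def embed(distance_matrix):
    """distance_matrix: (1000, 1000) symmetric cosine-distance matrix between classes."""
    # Metric MDS on precomputed dissimilarities -> 2-D map (as in Figures 3-4).
    mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
    xy = mds.fit_transform(distance_matrix)

    # Non-linear 3-D mapping (as in Figure 5); n_neighbors is an assumed value.
    iso = Isomap(n_components=3, n_neighbors=10, metric='precomputed')
    xyz = iso.fit_transform(distance_matrix)
    return xy, xyz
```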

Exploring Class Relations

Results so far have shown that vector-space representations of image classes contain a large amount of information on those classes. To further analyze which type of visual semantics they contain, we now consider vector arithmetic, subtracting image class representations from one another. Given the vector resulting from subtracting two image classes, as defined in the Vector Operations subsection, we then look for the closest image class through cosine similarity. We start by considering image classes which can be understood as the concatenation (not overlapping) of two other classes. An example of that could be chair plus wheel, which could produce office chair or wheelchair.

Figure 5: Scatter plot built through ISOMAP. Black circles belong to images of types of living things.

#   first class    −   second class      closest class
1   church         −   mosque            bell cote
2   mosque         −   church            stupa
3   horse cart     −   sorrel (horse)    rickshaw
4   ice bear       −   brown bear        kuvasz
5   giant panda    −   brown bear        skunk

Table 1: Equations used to explore vector-space regularities.

Afterwards we consider overlapped relations, where two classes are strongly intertwined to produce a third. An example of that could be wolf plus man, which could produce werewolf. Table 1 shows the equations from which we extract our preliminary conclusions.

We first consider concatenated images through church and mosque, two image classes located closest to one another in the vector-space. We performed subtraction operations in both directions (Table 1, lines 1-2), and found that the closest class to church − mosque was bell cote, an architectural element used to shelter bells, typical of Christian churches and not found on mosques. In the opposite direction, we found the closest class to mosque − church was stupa, a hemispherical structure, often with a thin tower on top, typical of Buddhism. Although stupas do not belong to mosques, mosques are often hemispherical and include thin towers, features not found on Christian churches. Another example of concatenated images is found in the equation horse cart minus sorrel, the closest result being rickshaw (Table 1, line 3). Sorrel is the only type of horse in the ImageNet classes, while a rickshaw is a two-wheeled passenger cart pulled by one person, typical of Asia. Visually speaking, the result of removing the visual pattern of a horse from a horse cart (i.e., a cart) is almost indistinguishable from a rickshaw, as rickshaw pictures do not always include a person. This relation is consistent in the vector-space and can be reproduced through the subtraction operator. According to these results, the horse cart vector representation can be decomposed into two independent vectors, corresponding to its two main composing entities: a horse and a wagon. These results indicate that the subtraction operator could be applied analogously to any concatenated class to obtain isolated representations of the elements composing it.

Overlapped image classes can be considered to be an amalgam of more than one visual entity. To explore this case we used the examples shown in lines 4-5 of Table 1. For our first example, we computed the difference between ice bear and brown bear. The closest classes were kuvasz, Maltese dog, Sealyham terrier, white wolf, Old English sheepdog and Arctic fox, among others, all white-coated mammals of varying size and proportion. These results indicate that the vector obtained from the ice bear minus brown bear subtraction could be labeled as “white fur entity”, as being white is the main difference separating an ice bear from a brown bear, and also the main common feature found in the classes closest to the subtraction. To further support this hypothesis, we subtracted brown bear from giant panda. In this case the closest classes to the result were skunk, Angora rabbit, soccer ball and indri (a lemur). Remarkably, all four classes are characterized by having white and black color patterns, while being diverse in many other aspects (e.g., size, shape, texture). Thus, the vector resulting from giant panda − brown bear seems to represent an image class similar to “black and white spotted entity”.

We go one step further, and explore the relations defined by the “black and white spotted entity” image class. We subtract the vector representation corresponding to this visual pattern from skunk, Angora rabbit, soccer ball and indri, and obtain the following regularities: Panda is to Brown Bear as Skunk is to Badger, as Angora Rabbit is to Persian Cat, as Indri is to Howler Monkey, and as Soccer Ball is to Crash Helmet. Consistently, when the vector of “black and white spotted entity” is removed from black and white spotted entities, we obtain elements which are close to it in many other aspects (e.g., shape, proportion, texture) but which have a different coloring pattern. This shows the existence of regularities within the vector-space, which can be used to perform multiple arithmetic and reasoning operations based on visual semantics.

Figure 6: Regularity among pairs of images based on the vector “black and white spotted entity”, depicted as an arrow.
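This kind of regularity query can be written on top of the clipped subtraction and cosine similarity defined earlier. The sketch below is an illustration of the procedure described above, not the authors' code; the dictionary of class vectors and the class names used in the example are assumed.

```python
import numpy as np

def clipped_subtract(a, b):
    return np.maximum(a - b, 0.0)

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def analogy(a, a_star, b, class_vectors):
    """'a is to a_star as b is to ?': remove the (a - a_star) pattern from b and rank."""
    pattern = clipped_subtract(class_vectors[a], class_vectors[a_star])  # e.g. "black and white"
    query = clipped_subtract(class_vectors[b], pattern)
    ranked = sorted(class_vectors, key=lambda name: -cos(query, class_vectors[name]))
    return [n for n in ranked if n != b][:5]

# e.g., analogy('giant panda', 'brown bear', 'skunk', cls)  # expected to rank 'badger' highly
```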

Conclusions

In this paper we build a vector-space representation based on the features originally learnt by a deep learning network, to capture visual semantics. Our goal was to explore the information contained within that varied and complex set of patterns, as it could be used by other learning and reasoning methods. Our first test shows how different elements with common basic visual semantics (e.g., dogs, wheeled vehicles) are found close to one another in the vector-space. These semantics seem to be well balanced in terms of significance, as variations in proportion, size and color do not overcome the more essential similarities (e.g., those shared by 118 kinds of dogs). Our second test shows the vector-space to hold rich information on textures, shapes and proportions of images; information detailed enough to capture untaught distinctions such as the one between living organisms and non-living things. These results make us wonder what the limit is of what can be learnt through visual information. Finally, we explore vector regularities through arithmetic operations, finding vector representations of composed images to be decomposable. Additionally, abstract image classes can be obtained (e.g., “black and white spotted entity”) and operated with, showing the existence of regularities among images with shared visual semantics.

These results show how to extract and exploit the internal representations of images learnt by deep networks. The methodology presented extracts the semantics contained within a network, making them available to other learning and reasoning methodologies. This approach can empower multiple innovative solutions, mostly in the fields of visual learning (e.g., through clustering) and visual reasoning (e.g., through vector arithmetic). The main follow-up work of this research goes in that direction, studying possible applications and exploring the limits of the knowledge coded within vector-space representations.

Acknowledgments

This research was supported by an IBM-BSC Joint Study Agreement, No. W1564631.

References

[Borg and Groenen 2005] Borg, I., and Groenen, P. J. 2005. Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media.

[Jia et al. 2014] Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093.

[LeCun, Bengio, and Hinton 2015] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.

[Levy, Goldberg, and Ramat-Gan 2014] Levy, O.; Goldberg, Y.; and Ramat-Gan, I. 2014. Linguistic regularities in sparse and explicit word representations. CoNLL-2014, 171.

[Mikolov, Le, and Sutskever 2013] Mikolov, T.; Le, Q. V.; and Sutskever, I. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

[Mikolov, Yih, and Zweig 2013] Mikolov, T.; Yih, W.-t.; and Zweig, G. 2013. Linguistic Regularities in Continuous Space Word Representations. In HLT-NAACL, 746–751.

[Pedersen, Patwardhan, and Michelizzi 2004] Pedersen, T.; Patwardhan, S.; and Michelizzi, J. 2004. WordNet::Similarity: Measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004, 38–41. Association for Computational Linguistics.

[Russakovsky et al. 2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 1–42.

[Szegedy et al. 2014] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2014. Going deeper with convolutions. arXiv preprint arXiv:1409.4842.

[Tenenbaum, de Silva, and Langford 2000] Tenenbaum, J. B.; de Silva, V.; and Langford, J. C. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323.