Camera-ready version being prepared.
Relationships among objects play a crucial role in image understanding. Despite the great success of deep learning techniques in recognizing individual objects, reasoning about the relationships among objects remains a challenging task. Previous methods often treat this as a classification problem, considering each type of relationship (e.g. “ride”) or each distinct visual phrase (e.g. “person-ride-horse”) as a category. Such approaches are faced with significant difficulties caused by the high diversity of visual appearance for each kind of relationship or the large number of distinct visual phrases. We propose an integrated framework to tackle this problem. At the heart of this framework is the Deep Relational Network, a novel formulation designed specifically for exploiting the statistical dependencies between objects and their relationships. On two large data sets, the proposed method achieves substantial improvement over the state of the art.
A number of studies have shown that increasing the depth or width of convolutional networks is a rewarding approach to improve the performance of image recognition. In our study, however, we observed difficulties along both directions. On one hand, the pursuit of very deep networks is met with diminishing returns and increased training difficulty; on the other hand, widening a network would result in a quadratic growth in both computational cost and memory demand. These difficulties motivate us to explore structural diversity in designing deep networks, a new dimension beyond just depth and width. Specifically, we present a new family of modules, namely the PolyInception, which can be flexibly inserted in isolation or in a composition as replacements of different parts of a network. Choosing PolyInception modules with the guidance of architectural efficiency can improve the expressive power while preserving comparable computational cost. The Very Deep PolyNet, designed following this direction, demonstrates substantial improvements over the state-of-the-art on the ILSVRC 2012 benchmark. Compared to Inception-ResNet-v2, it reduces the top-5 validation error on single crops from 4.9% to 4.25%, and that on multi-crops from 3.7% to 3.45%.
Full paper will be made publicly available soon.
Despite the remarkable progress in recent years, detecting objects in a new context remains a challenging task. Detectors learned from a public dataset can only work with a fixed list of categories, while training from scratch usually requires a large amount of training data with detailed annotations. This work aims to explore a novel approach – learning object detectors from documentary films in a weakly supervised manner. This is inspired by the observation that documentaries often provide dedicated exposition of certain object categories, where visual presentations are aligned with subtitles. We believe that object detectors can be learned from such a rich source of information. Towards this goal, we develop a joint probabilistic framework, where individual pieces of information, including video frames and subtitles, are brought together via both visual and linguistic links. On top of this formulation, we further derive a weakly supervised learning algorithm, where object model learning and training set mining are unified in an optimization procedure. Experimental results on a real world dataset demonstrate that this is an effective approach to learning new object detectors.
Current action recognition methods heavily rely on trimmed videos for model training. However, it is expensive and time-consuming to acquire a large-scale trimmed video dataset. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and reason about the temporal duration of action instances, respectively. These two components are implemented with feed-forward networks, and UntrimmedNet is therefore an end-to-end trainable architecture. We exploit the learned models for weakly supervised action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet. Although our UntrimmedNet only employs weak supervision, our method achieves performance superior or comparable to that of strongly supervised approaches on these two datasets.
Markov Random Fields (MRFs), a formulation widely used in generative image modeling, have long been plagued by the lack of expressive power. This issue is primarily due to the fact that conventional MRF formulations tend to use simplistic factors to capture local patterns. In this paper, we move beyond such limitations, and propose a novel MRF model that uses fully-connected neurons to express the complex interactions among pixels. Through theoretical analysis, we reveal an inherent connection between this model and recurrent neural networks, and thereon derive an approximated feed-forward network that couples multiple RNNs along opposite directions. This formulation combines the expressive power of deep neural networks and the cyclic dependency structure of MRFs in a unified model, bringing the modeling capability to a new level. The feed-forward approximation also allows it to be efficiently learned from data. Experimental results on a variety of low-level vision tasks show notable improvement over the state of the art.
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of the temporal segment network. Our approach obtains state-of-the-art performance on the datasets of HMDB51 (69.4%) and UCF101 (94.2%). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of the temporal segment network and the proposed good practices.
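As a rough illustration of the sparse sampling and segmental consensus idea, the sketch below divides a video into equal segments, samples one snippet per segment, and averages per-snippet class scores into a video-level prediction; the segment count, indexing scheme, and averaging consensus are illustrative simplifications, not TSN's exact implementation.

```python
import numpy as np

def sample_segments(num_frames, k, rng=None):
    """Sparse temporal sampling: split the video into k equal segments
    and pick one snippet index uniformly at random from each."""
    rng = rng or np.random.default_rng(0)
    edges = np.linspace(0, num_frames, k + 1, dtype=int)
    return np.array([rng.integers(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])])

def video_consensus(snippet_scores):
    """Segmental consensus: average per-snippet class scores into a
    single video-level prediction."""
    return snippet_scores.mean(axis=0)

idx = sample_segments(300, k=3)          # one snippet index per segment
scores = np.array([[0.2, 0.8],           # per-snippet class scores
                   [0.4, 0.6],
                   [0.3, 0.7]])
video_pred = video_consensus(scores)
```

Because only k snippets are processed regardless of video length, the whole video contributes supervision at roughly constant cost.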
We won the first place in the classification task at the ActivityNet 2016 challenge.
This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016. We follow the basic pipeline of temporal segment networks and further raise the performance via a number of other techniques. Specifically, we use the latest deep model architecture, e.g., ResNet and Inception V3, and introduce new aggregation schemes (top-k and attention-weighted pooling). Additionally, we incorporate the audio as a complementary channel, extracting relevant information via a CNN applied to the spectrograms. With these techniques, we derive an ensemble of deep models, which, together, attains a high classification accuracy (mAP 93.23%) on the testing set and secured the first place in the challenge.
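The aggregation schemes mentioned above can be sketched as follows; treat this as an assumed minimal form of top-k and attention-weighted pooling over per-snippet class scores, not the submission's exact implementation.

```python
import numpy as np

def top_k_pooling(snippet_scores, k):
    """Average, for each class, the k highest-scoring snippets; this
    damps the influence of background snippets in an untrimmed video."""
    top = np.sort(snippet_scores, axis=0)[-k:]
    return top.mean(axis=0)

def attention_pooling(snippet_scores, attn_logits):
    """Weight snippets by a softmax over (learned) attention logits."""
    w = np.exp(attn_logits - np.max(attn_logits))
    w = w / w.sum()
    return w @ snippet_scores

scores = np.array([[0.1, 0.9],
                   [0.8, 0.2],
                   [0.5, 0.5]])
topk = top_k_pooling(scores, k=2)
attn = attention_pooling(scores, np.zeros(3))  # uniform logits = plain mean
```

With uniform attention logits the attention pooling reduces to plain averaging, which makes the two schemes easy to compare on the same scores.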
The rapid growth of web images presents new challenges as well as opportunities to the task of image understanding. Conventional approaches rely heavily on fine-grained annotations, such as bounding boxes and semantic segmentations, which are not available for web-scale images. In general, images over the Internet are accompanied with descriptive texts, which are relevant to their contents. To bridge the gap between textual and visual analysis for image understanding, this paper presents an algorithm to learn the relations between scenes, objects, and texts with the help of image-level annotations. In particular, the relation between the texts and objects is modeled as the matching probability between the nouns and the object classes, which can be solved via a constrained bipartite matching problem. On the other hand, the relations between the scenes and objects/texts are modeled as the conditional distributions of their co-occurrence. Built upon the learned cross-domain relations, an integrated model brings together scenes, objects, and texts for joint image understanding, including scene classification, object classification and localization, and the prediction of object cardinalities. The proposed cross-domain learning algorithm and the integrated model elevate the performance of image understanding for web images in the context of textual descriptions. Experimental results show that the proposed algorithm significantly outperforms conventional methods in various computer vision tasks.
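A minimal sketch of the noun-to-class matching step: given a matrix of matching probabilities, find the one-to-one assignment maximizing the total score. The brute-force search below stands in for the paper's constrained bipartite matching solver and is only practical for a handful of nouns; the example nouns and classes are hypothetical.

```python
from itertools import permutations

def best_noun_matching(match_prob):
    """Exhaustively search the noun-to-object-class assignment that
    maximizes the total matching probability (a brute-force stand-in
    for a bipartite matching solver; fine for a few nouns)."""
    n_nouns, n_classes = len(match_prob), len(match_prob[0])
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n_classes), n_nouns):
        score = sum(match_prob[i][c] for i, c in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score

# Rows: nouns ("dog", "cat"); columns: object classes (dog, cat, car).
probs = [[0.9, 0.1, 0.0],
         [0.2, 0.7, 0.1]]
assignment, score = best_noun_matching(probs)  # assignment: [0, 1]
```

For realistic vocabulary sizes one would swap the permutation loop for the Hungarian algorithm, which solves the same assignment problem in polynomial time.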
Binary representation is desirable for its memory efficiency, computation speed and robustness. In this paper, we propose adjustable bounded rectifiers to learn binary representations for deep neural networks. While hard-constraining representations across layers to be binary makes training unreasonably difficult, we softly encourage activations to diverge from real values to binary by approximating step functions. Our final representation is completely binary. We test our approach on the MNIST, CIFAR10, and ILSVRC2012 datasets, and systematically study the training dynamics of the binarization process. Our approach can binarize the last layer representation without loss of performance and binarize all the layers with reasonably small degradations. The memory space that it saves may allow more sophisticated models to be deployed, thus compensating for the loss. To the best of our knowledge, this is the first work to report results on current deep network architectures using complete binary middle representations. Given the learned representations, we find that the firing or inhibition of a binary neuron is usually associated with a meaningful interpretation across different classes. This suggests that the semantic structure of a neural network may be manifested through a guided binarization process.
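A minimal sketch of how a bounded rectifier with an adjustable slope can softly push activations toward binary values; the exact parameterization and annealing schedule here are assumptions, not the paper's definition.

```python
import numpy as np

def adjustable_bounded_rectifier(x, slope):
    """Clip slope*x into [0, 1]. With a small slope this behaves like a
    soft, real-valued activation; raising the slope during training
    pushes activations toward a hard 0/1 step, yielding binary codes."""
    return np.clip(slope * x, 0.0, 1.0)

x = np.array([-0.3, 0.02, 0.3])
soft = adjustable_bounded_rectifier(x, slope=1.0)    # still real-valued
hard = adjustable_bounded_rectifier(x, slope=100.0)  # effectively binary
```

Annealing the slope gradually, rather than imposing a hard step from the start, is what keeps gradients informative early in training.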
This paper proposes a novel framework for generating lingual descriptions of indoor scenes. Whereas substantial efforts have been made to tackle this problem, previous approaches focus primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this, by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account coherence among sentences. Experiments on the NYU-v2 dataset show that our framework is able to generate natural multi-sentence descriptions, outperforming those produced by a baseline.
A considerable portion of web images capture events that occur in our personal lives or social activities. In this paper, we aim to develop an effective method for recognizing events from such images. Despite the sheer amount of study on event recognition, most existing methods rely on videos and are not directly applicable to this task. Generally, events are complex phenomena that involve interactions among people and objects, and therefore analysis of event photos requires techniques that can go beyond recognizing individual objects and carry out joint reasoning based on evidences of multiple aspects. Inspired by the recent success of deep learning, we formulate a multi-layer framework to tackle this problem, which takes into account both visual appearance and the interactions among humans and objects, and combines them via semantic fusion. An important issue arising here is that humans and objects discovered by detectors are in the form of bounding boxes, and there is no straightforward way to represent their interactions and incorporate them with a deep network. We address this using a novel strategy that projects the detected instances onto multi-scale spatial maps. On a large dataset with 60,000 images, the proposed method achieved substantial improvement over the state-of-the-art, raising the accuracy of event recognition by over 10%.
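The projection of detected instances onto spatial maps can be sketched as follows, assuming boxes in (x0, y0, x1, y1) grid coordinates; rasterizing at several grid resolutions would give the multi-scale maps described above.

```python
import numpy as np

def boxes_to_score_map(boxes, scores, map_size):
    """Project detected instances onto a spatial map: each bounding box
    paints its detection score into the grid cells it covers, keeping
    the maximum where boxes overlap."""
    h, w = map_size
    m = np.zeros((h, w), dtype=float)
    for (x0, y0, x1, y1), s in zip(boxes, scores):
        m[y0:y1, x0:x1] = np.maximum(m[y0:y1, x0:x1], s)
    return m

# Two hypothetical detections on a 4x4 grid, with scores 0.8 and 0.5.
m = boxes_to_score_map([(0, 0, 2, 2), (1, 1, 3, 3)], [0.8, 0.5], (4, 4))
```

The resulting maps have a fixed shape regardless of how many detections fired, which is what makes them easy to feed into a convolutional network.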
Images are often used to convey many different concepts or illustrate many different stories. We propose an algorithm to mine multiple diverse, relevant, and interesting text snippets for images on the web. Our algorithm scales to all images on the web. For each image, all webpages that contain it are considered. The top-K text snippet selection problem is posed as combinatorial subset selection with the goal of choosing an optimal set of snippets that maximizes a combination of relevancy, interestingness, and diversity. The relevancy and interestingness are scored by machine learned models. Our algorithm is run at scale on the entire image index of a major search engine resulting in the construction of a database of images with their corresponding text snippets. We validate the quality of the database through a large-scale comparative study. We showcase the utility of the database through two web-scale applications: (a) augmentation of images on the web as webpages are browsed and (b) an image browsing experience (similar in spirit to web browsing) that is enabled by interconnecting semantically related images (which may not be visually related) through shared concepts in their corresponding text snippets.
In this paper, we tackle the problem of retrieving videos using complex natural language queries. Towards this goal, we first parse the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm. Our approach exploits object appearance, motion and spatial relations, and learns the importance of each term using structure prediction. We demonstrate the effectiveness of our approach on a new dataset designed for semantic search in the context of autonomous driving, which exhibits complex and highly dynamic scenes with many objects. We show that our approach is able to locate a major portion of the objects described in the query with high accuracy, and improve the relevance in video retrieval.
In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun is referring to in the image. This allows us to utilize visual information in order to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural lingual descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system.
Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm - random initialization plus iterative update, we take a progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement in efficiency - orders of magnitude speed-up compared to the state-of-the-art.
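The on-the-fly creation of components in a single pass can be illustrated with a DP-means-style heuristic (shown below for 1-D data); the paper's actual algorithm is a sequential variational approximation to the DP posterior, not this simplification.

```python
def one_pass_cluster(data, threshold):
    """One-pass clustering in the spirit of the progressive approach:
    each point joins the nearest existing component, or spawns a new
    component when no component mean is within `threshold`."""
    means, counts, labels = [], [], []
    for x in data:
        j = min(range(len(means)), key=lambda i: abs(x - means[i]), default=None)
        if j is None or abs(x - means[j]) > threshold:
            means.append(float(x))            # create a new component
            counts.append(1)
            labels.append(len(means) - 1)
        else:
            counts[j] += 1
            means[j] += (x - means[j]) / counts[j]  # running mean update
            labels.append(j)
    return means, labels

means, labels = one_pass_cluster([0.0, 0.1, 5.0, 5.2], threshold=1.0)
# two components emerge: one near 0.05, one near 5.1
```

Each observation is touched exactly once, which is the property that makes the one-pass regime attractive for massive data.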
There has been growing interest in indoor scene understanding recently. Conventional approaches mainly rely on 2D images and have been faced with various challenges. In this paper, we tackle this problem using RGBD data. Towards this goal, we propose a holistic approach that exploits 2D segmentation, 3D geometry, as well as contextual relations between scenes and objects. Specifically, we extend the CPMC framework to 3D in order to generate candidate cuboids, and develop a conditional random field to integrate information from different sources to classify the cuboids. With this formulation, scene classification and 3D object recognition are coupled and can be jointly solved through probabilistic inference. We test the effectiveness of our approach on the challenging NYU v2 dataset. The experimental results demonstrate that through effective evidence integration and holistic reasoning, this approach achieves substantial improvement over the state-of-the-art.
In this paper, we develop a generative model to describe the layouts of outdoor scenes - the spatial configuration of regions. Specifically, the layout of an image is represented as a composite of regions, each associated with a semantic topic. At the heart of this model is a novel stochastic process called Spatial Topic Process, which generates a spatial map of topics from a set of coupled Gaussian processes, thus allowing the distributions of topics to vary continuously across the image plane. A key aspect that distinguishes this model from previous ones consists in its capability of capturing dependencies across both locations and topics while allowing substantial variations in the layouts. We demonstrate the practical utility of the proposed model by testing it on scene classification, semantic segmentation, and layout hallucination.
Age invariant face recognition has received increasing attention due to its great potential in real world applications. In spite of the great progress in face recognition techniques, reliably recognizing faces across ages remains a difficult task. The facial appearance of a person changes substantially over time, resulting in significant intra-class variations. Hence, the key to tackle this problem is to separate the variation caused by aging from the person-specific features that are stable. Specifically, we propose a new method, called Hidden Factor Analysis (HFA). This method captures the intuition above through a probabilistic model with two latent factors: an identity factor that is age-invariant and an age factor affected by the aging process. Then, the observed appearance can be modeled as a combination of the components generated based on these factors. We also develop a learning algorithm that jointly estimates the latent factors and the model parameters using an EM procedure. Extensive experiments on two well-known public domain face aging datasets: MORPH (the largest public face aging database) and FGNET, clearly show that the proposed method achieves notable improvement over state-of-the-art algorithms.
Mixture distributions are often used to model complex data. In this paper, we develop a new method that jointly estimates mixture models over multiple data sets by exploiting the statistical dependencies between them. Specifically, we introduce a set of latent Dirichlet processes as sources of component models (atoms), and for each data set, we construct a nonparametric mixture model by combining sub-sampled versions of the latent DPs. Each mixture model may acquire atoms from different latent DPs, while each atom may be shared by multiple mixtures. This multi-to-multi association distinguishes the proposed method from previous ones that require the model structure to be a tree or a chain, allowing more flexible designs. We also derive a sampling algorithm that jointly infers the model parameters and present experiments on both document analysis and image modeling.
Dirichlet process mixture models (DPMMs) have become an important tool to describe complex data in the past decade. However, learning multiple DPMMs over related sets of observations remains an open question. A popular approach to this problem is the Hierarchical Dirichlet Process (HDP), which is limited in that models are required to be organized as a tree. In this paper, we present a generic framework to construct dependent DP mixtures, drawing on a new formulation that we proposed recently. This framework breaks the limitations of HDP, allowing each mixture model to inherit atoms from multiple sources with covariate-dependent probabilities. We show, through experiments on real data, that the proposed framework allows one to devise models that capture the dependency between different data sets more accurately - this capability is important in a context with multiple data sources, such as autonomous planning and distributed sensing.
Many vision problems, such as object recognition and image synthesis, are greatly impacted by deformation of objects. In this paper, we develop a deformation model based on Lie algebraic analysis. This work aims to provide a generative model that explicitly decouples deformation from appearance, which is fundamentally different from the prior work that focuses on deformation-resilient features or metrics. Specifically, the deformation group for each object can be characterized by a Lie algebraic basis. Such bases for different objects are related via parallel transport. Exploiting the parallel transport relations, we formulate an optimization problem, and derive an algorithm that jointly estimates the deformation basis for a class of objects, given a set of images resulting from the action of the deformations. We test the proposed model empirically on both character recognition and face synthesis.
Face recognition in the wild has been one of the longest-standing computer vision challenges. While there has been constant improvement over the years, the variations in appearance, illumination, pose, etc. still make it one of the hardest tasks to do well. In this paper we summarize two techniques that leverage context and show significant improvement over vision-only methods. At the heart of the approach is a probabilistic model of context that captures dependencies induced via a set of contextual relations. The model allows the application of standard variational inference procedures to infer labels that are consistent with contextual constraints.
We present a new generative image model, integrating techniques arising from two different domains: manifold modeling and Markov random fields. First, we develop a probabilistic model with a mixture of hyperplanes to approximate the manifold of orientable image patches, and demonstrate that it is more effective than the field of experts in expressing local texture patterns. Next, we develop a construction that yields an MRF for coherent image generation, given a configuration of local patch models, and thereby establish a prior distribution over an MRF space. Taking advantage of the model structure, we derive a variational inference algorithm, and apply it to low-level vision. In contrast to previous methods that rely on a single MRF, the method infers an approximate posterior distribution of MRFs, and recovers the underlying images by combining the predictions in a Bayesian fashion. Experiments quantitatively demonstrate superior performance as compared to state-of-the-art methods on image denoising and inpainting.
Markov random fields play a central role in solving a variety of low level vision problems, including denoising, inpainting, segmentation, and motion estimation. Much previous work was based on MRFs with hand-crafted networks, yet the underlying graphical structure is rarely explored. In this paper, we show that if appropriately estimated, the MRF's graphical structure, which captures significant information about appearance and motion, can provide crucial guidance to low level vision tasks. Motivated by this observation, we propose a principled framework to solve low level vision tasks via an exponential family of MRFs with variable structures, which we call Switchable MRFs. The approach explicitly seeks a structure that optimally adapts to the image or video along the pursuit of task-specific goals. Through theoretical analysis and experimental study, we demonstrate that the proposed method addresses a number of drawbacks suffered by previous methods, including failure to capture heavy-tail statistics, computational difficulties, and lack of generality.
MCMC sampling has been extensively studied and used in probabilistic inference. Many algorithms rely on local updates to explore the space, often resulting in slow convergence or failure to mix when there is no path from one set of states to another via local changes. We propose an efficient method for sampling from combinatorial spaces that addresses these issues via bridging states that facilitate the communication between different parts of the space. Such states can be created dynamically, providing more flexibility than methods relying on specific space structures to design jump proposals. Theoretical analysis of the approach yields bounds on mixing times. Empirical analysis demonstrates the practical utility on two problems: constrained map labeling and inferring partial order of object layers in a video.
I helped build the data analysis algorithm for this Chem. Eng. paper and was therefore listed as one of the authors.
We report the selective detection of single nitric oxide (NO) molecules using a specific DNA sequence of d(AT)15 oligonucleotides, adsorbed to an array of near-infrared fluorescent semiconducting single-walled carbon nanotubes (AT15−SWNT). While SWNT suspended with eight other variant DNA sequences show fluorescence quenching or enhancement from analytes such as dopamine, NADH, l-ascorbic acid, and riboflavin, d(AT)15 imparts SWNT with a distinct selectivity toward NO. In contrast, the electrostatically neutral polyvinyl alcohol enables no response to nitric oxide, but exhibits fluorescent enhancement to other molecules in the tested library. For AT15−SWNT, a stepwise fluorescence decrease is observed when the nanotubes are exposed to NO, reporting the dynamics of single-molecule NO adsorption via SWNT exciton quenching. We describe these quenching traces using a birth-and-death Markov model, and the maximum likelihood estimator of adsorption and desorption rates of NO is derived. Applying the method to simulated traces indicates that the resulting error in the estimated rate constants is less than 5% under our experimental conditions, allowing for calibration using a series of NO concentrations. As expected, the adsorption rate is found to be linearly proportional to NO concentration, and the intrinsic single-site NO adsorption rate constant is 0.001 s−1 μM NO−1. The ability to detect nitric oxide quantitatively at the single-molecule level may find applications in new cellular assays for the study of nitric oxide carcinogenesis and chemical signaling, as well as medical diagnostics for inflammation.
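The rate estimation from stepwise traces can be illustrated with a simple counting estimator; the paper derives a proper maximum likelihood estimator for the birth-and-death model, so this moment-style sketch only conveys the idea.

```python
def estimate_rates(trace, dt):
    """Crude rate estimates from a stepwise quenching trace: count
    downward unit steps (adsorption events quench one exciton channel)
    and upward steps (desorption events), each divided by the total
    observation time."""
    downs = sum(1 for a, b in zip(trace, trace[1:]) if b < a)
    ups = sum(1 for a, b in zip(trace, trace[1:]) if b > a)
    total_time = dt * (len(trace) - 1)
    return downs / total_time, ups / total_time

# Hypothetical fluorescence level trace sampled at 1 s intervals.
k_ads, k_des = estimate_rates([5, 4, 4, 3, 4, 3], dt=1.0)
```

Since the adsorption rate scales linearly with NO concentration, repeating this estimate across a concentration series yields the calibration curve described above.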
Best student paper award (only two awarded out of about 1600 submissions).
We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models.
We present a framework for vision-assisted tagging of personal photo collections using context. Whereas previous efforts mainly focus on tagging people, we develop a unified approach to jointly tag across multiple domains (specifically people, events, and locations). The heart of our approach is a generic probabilistic model of context that couples the domains through a set of cross-domain relations. Each relation models how likely the instances in two domains are to co-occur. Based on this model, we derive an algorithm that simultaneously estimates the cross-domain relations and infers the unknown tags in a semi-supervised manner. We conducted experiments on two well-known datasets and obtained significant performance improvements in both people and location recognition. We also demonstrated the ability to infer event labels with missing timestamps (i.e. with no event features).
We propose a principled framework to model persistent motion in dynamic scenes. In contrast to previous efforts on object tracking and optical flow estimation that focus on local motion, we primarily aim at inferring a global model of persistent and collective dynamics. With this in mind, we first introduce the concept of geometric flow that describes motion simultaneously over space and time, and derive a vector space representation based on Lie algebra. We then extend it to model complex motion by combining multiple flows in a geometrically consistent manner. Taking advantage of the linear nature of this representation, we formulate a stochastic flow model, and incorporate a Gaussian process to capture the spatial coherence more effectively. This model leads to an efficient and robust algorithm that can integrate both point pairs and frame differences in motion estimation. We conducted experiments on different types of videos. The results clearly demonstrate that the proposed approach is effective in modeling persistent motion.
We present a novel method for modeling dynamic visual phenomena, which consists of two key aspects. First, the integral motion of constituent elements in a dynamic scene is captured by a common underlying geometric transform process. Second, a Lie algebraic representation of the transform process is introduced, which maps the transformation group to a vector space, and thus overcomes the difficulties due to the group structure. Consequently, the statistical learning techniques based on vector spaces can be readily applied. Moreover, we discuss the intrinsic connections between the Lie algebra and linear dynamical processes, showing that our model induces spatially varying fields that can be estimated from local motions without continuous tracking. Following this, we further develop a statistical framework to robustly learn the flow models from noisy and partially corrupted observations. The proposed methodology is demonstrated on real world phenomena, inferring common motion patterns from surveillance videos of crowded scenes and satellite data of weather evolution.
In this paper, we develop a new framework for face recognition based on nonparametric discriminant analysis (NDA) and multiclassifier integration. Traditional LDA-based methods suffer from a fundamental limitation originating from the parametric nature of scatter matrices, which are based on the Gaussian distribution assumption. The performance of these methods notably degrades when the actual distribution is non-Gaussian. To address this problem, we propose a new formulation of scatter matrices to extend the two-class NDA to multiclass cases. Then, in order to exploit the discriminant information in both the principal space and the null space of the intraclass scatter matrix, we develop two improved multiclass NDA-based algorithms (NSA and NFA), each having two complementary methods based on the principal space and the null space of the intraclass scatter matrix, respectively. Compared to NSA, NFA is more effective in the utilization of the classification boundary information. In order to exploit the complementary nature of the two kinds of NFA (PNFA and NNFA), we finally develop a dual NFA-based multiclassifier fusion framework by employing the overcomplete Gabor representation for face images to boost the recognition performance. We show the improvements of the developed new algorithms over the traditional subspace methods through comparative experiments on two challenging face databases, the Purdue AR database and the XM2VTS database.
This paper presents a framework to automatically detect and recover occluded facial regions. We first derive a Bayesian formulation unifying the occlusion detection and recovery stages. Then a quality assessment model, which captures the face priors in both global correlation and local patterns, is developed to drive both the detection and recovery processes. Based on this formulation, we further propose GraphCut-based Detection and Confidence-Oriented Sampling to attain optimal detection and recovery, respectively. Compared to traditional image-repair methods, our approach is distinct in three aspects: (1) it frees the user from marking the occlusion area by incorporating an automatic occlusion detector; (2) it learns a face quality model as a criterion to guide the whole procedure; (3) it couples the detection and recovery stages to simultaneously achieve two goals: accurate occlusion detection and high-quality recovery. Comparative experiments show that our method can recover occluded faces with both global coherence and local details well preserved.
Outdoor face recognition is among the most challenging problems in face recognition. In this paper, we develop a discriminant mutual subspace learning (DMSL) algorithm for indoor and outdoor face recognition. Unlike traditional algorithms that use one subspace to model both indoor and outdoor face images, our algorithm simultaneously learns two related subspaces, one for indoor and one for outdoor images, and can thus model both better. To further improve the recognition performance, we develop a DMSL-based multi-classifier fusion framework on Gabor images using a new fusion method called the adaptive informative fusion scheme. Experimental results clearly show that this framework can greatly enhance the recognition performance.
Human faces manifest distinct structures and characteristics when observed at different scales. Traditional face recognition techniques mainly rely on low-resolution face images, leading to the loss of significant information contained in microscopic traits. In this paper, we introduce a multilayer framework for high-resolution face recognition that exploits features at multiple scales. Each face image is factorized into four layers: global appearance, facial organs, skin, and irregular details. We employ Multilevel PCA followed by Regularized LDA to model global appearance and facial organs. However, the description of skin texture and irregular details, for which conventional vector representations are not suitable, brings forth the need for novel representations. To address this issue, Discriminative Multiscale Texton Features and the SIFT-Activated Pictorial Structure are proposed to describe the skin and the subtle details, respectively. To effectively combine the information conveyed by all layers, we further design a metric fusion algorithm that adaptively places emphasis on the highly confident layers. Through systematic experiments, we identify the different roles played by the layers and convincingly show that by utilizing their complementarity, our framework achieves remarkable performance improvement.
Inspired by the underlying relationship between classification capability and mutual information, we first establish a quantitative model describing the information transmission process from feature extraction to final classification and identify the critical channel along this propagation path. We then propose the Maximum Effective Information Criteria for pursuing the optimal subspace, in the sense of preserving the maximum information that can be conveyed to the final decision. Considering the orthogonality and rotation-invariance properties of the solution space, we present a Conjugate Gradient method constrained on a Grassmann manifold, which exploits the geometric traits of the solution space to enhance the efficiency of optimization. Comprehensive experiments demonstrate that the framework integrating the Maximum Effective Information Criteria and the Grassmann-manifold-based optimization significantly improves classification performance.
The paper introduces a new framework for feature learning in classification motivated by information theory. We first systematically study the information structure and present a novel perspective revealing the two key factors in information utilization: class relevance and redundancy. We derive a new information decomposition model in which a novel concept called class-relevant redundancy is introduced. Subsequently, a new algorithm called Conditional Informative Feature Extraction is formulated, which maximizes the joint class-relevant information by explicitly reducing the class-relevant redundancies among features. To address the computational difficulties in information-based optimization, we incorporate Parzen window estimation into the discrete approximation of the objective function and propose a Local Active Region method that substantially increases the optimization efficiency. To effectively utilize the extracted feature set, we propose a Bayesian MAP formulation for feature fusion, which unifies a Laplacian Sparse Prior and Multivariate Logistic Regression to learn a fusion rule with good generalization capability. Recognizing the inefficiency caused by the separate treatment of the extraction stage and the fusion stage, we further develop an improved design of the framework that coordinates the two stages by introducing a feedback from the fusion stage to the extraction stage, which significantly enhances the learning efficiency. Comparative experiments show remarkable improvements achieved by our framework.
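Parzen window estimation is what makes information quantities computable directly from samples. The sketch below estimates the mutual information between a 1-D feature and a class label as I(F;C) = H(F) − H(F|C) using Gaussian-window density estimates; it is a generic illustration of the technique with our own names, not the paper's objective or its Local Active Region method:

```python
import numpy as np

def parzen_pdf(x, samples, h):
    # Parzen (kernel) density estimate at x with a Gaussian window of width h
    return np.mean(np.exp(-(x - samples) ** 2 / (2 * h * h))) / (h * np.sqrt(2 * np.pi))

def mutual_information(f, c, h=0.5):
    """Toy Parzen estimate of I(F;C) = H(F) - H(F|C) for a 1-D feature f
    and discrete labels c, using resubstitution entropy estimates."""
    Hf = -np.mean([np.log(parzen_pdf(x, f, h)) for x in f])
    Hfc = 0.0
    for label in np.unique(c):
        fc = f[c == label]
        Hc = -np.mean([np.log(parzen_pdf(x, fc, h)) for x in fc])
        Hfc += (len(fc) / len(f)) * Hc
    return Hf - Hfc

# A feature that separates the classes carries more class-relevant information
f_sep = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])   # classes well separated
f_mix = np.array([0.0, 0.1, 0.2, 0.05, 0.15, 0.25])  # classes overlapping
labels = np.array([0, 0, 0, 1, 1, 1])
mi_sep = mutual_information(f_sep, labels)
mi_mix = mutual_information(f_mix, labels)
```

An information-based feature extractor would score `f_sep` far above `f_mix`, which is precisely the kind of criterion the framework optimizes at scale.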
Recently, the wide deployment of practical face recognition systems has given rise to the inter-modality face recognition problem, where the face images in the database and the query images captured on the spot are acquired under quite different conditions or even with different equipment. Conventional approaches either treat the samples with a uniform model or introduce an intermediate conversion stage, both of which lead to severe performance degradation due to the great discrepancies between modalities. In this paper, we propose a novel algorithm called Common Discriminant Feature Extraction, specially tailored to the inter-modality problem. In the algorithm, two transforms are simultaneously learned to map the samples of the two modalities into a common feature space. We formulate the learning objective by incorporating both the empirical discriminative power and the local smoothness of the feature transformation. By explicitly controlling the model complexity through the smoothness constraint, we can effectively reduce the risk of overfitting and enhance the generalization capability. Furthermore, to cope with the non-Gaussian distribution and diverse variations in the sample space, we develop two nonlinear extensions of the algorithm: one based on kernelization, the other a multi-mode framework. These extensions substantially improve the recognition performance in complex situations. Extensive experiments in two application scenarios, optical image-infrared image recognition and photo-sketch recognition, show the excellent performance of our algorithms.
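The core idea of learning two transforms into one common space can be caricatured in a single dimension: find directions wx and wy such that paired samples from the two modalities project to similar values. The snippet below does this via the smallest eigenvector of a stacked covariance; it omits the discriminative and smoothness terms of the actual objective and is purely our illustrative sketch:

```python
import numpy as np

def common_space_1d(X, Y):
    """Toy inter-modality alignment: find wx, wy minimizing
    sum_i |wx . x_i - wy . y_i|^2 over paired samples (|w| = 1),
    via the smallest eigenvector of D^T D with D = [X, -Y]."""
    D = np.hstack([X, -Y])                  # row i is [x_i, -y_i]
    w = np.linalg.eigh(D.T @ D)[1][:, 0]    # smallest-eigenvalue direction
    return w[:X.shape[1]], w[X.shape[1]:]

# Paired samples whose second modality is a linear distortion of the first
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
Y = X @ np.array([[1.0, 2.0], [3.0, -1.0]]).T
wx, wy = common_space_1d(X, Y)
```

After the mapping, `X @ wx` and `Y @ wy` agree on the paired samples even though the raw modalities look very different, which is the effect the common feature space is meant to achieve.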
In this paper, we present a new learning framework for image style transforms. Considering that images in different style representations constitute different vector spaces, we propose a novel framework called Coupled Space Learning to learn the relations between these spaces and use them to infer images from one style to another. Observing that, for each style, only the components correlated with the space of the target style are useful for inference, we first develop Correlative Component Analysis to pursue the embedded hidden subspaces that best preserve the inter-space correlation information. We then develop the Coupled Bidirectional Transform algorithm to estimate the transforms between the two embedded spaces, where the coupling between the forward transform and the backward transform is explicitly taken into account. To enhance the capability of modeling complex data, we further develop the Coupled Gaussian Mixture Model to generalize our framework to a mixture-model architecture. The effectiveness of the framework is demonstrated in applications including face super-resolution and bidirectional portrait style transforms.
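Pursuing paired subspaces that preserve inter-space correlation is the role classical canonical correlation analysis plays; the sketch below, a plain CCA in NumPy (whiten each space, then SVD of the cross-covariance), conveys the flavor. It is a generic stand-in for, not a reproduction of, Correlative Component Analysis:

```python
import numpy as np

def cca(X, Y, d=1, eps=1e-8):
    """Minimal CCA sketch: paired projections of two vector spaces
    that maximize cross-space correlation."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Cxx = X.T @ X / len(X) + eps * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    # Whitening transforms for each space (inverse Cholesky factors)
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    # SVD of the whitened cross-covariance gives canonical pairs
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, :d], Wy @ Vt.T[:, :d], s[:d]

# Two "styles" sharing one hidden component t: correlation is recovered
t = np.linspace(0.1, 1.0, 50)
X = np.c_[t, t ** 2]
Y = np.c_[3 * t, t ** 3]
wx, wy, corr = cca(X, Y, d=1)
```

Here both spaces contain the common component t, so the leading canonical correlation is essentially 1; the correlated directions are exactly the "useful for inference" components the abstract refers to.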
In this paper, we propose a new face hallucination framework based on image patches, which integrates two novel statistical super-resolution models. Considering that image patches reflect the combined effect of personal characteristics and patch location, we first formulate a TensorPatch model based on multilinear analysis to explicitly model the interaction between the multiple constituent factors. Motivated by Locally Linear Embedding, we develop an enhanced multilinear patch hallucination algorithm, which efficiently exploits the local distribution structure in the sample space. To better preserve subtle facial details, we derive the Coupled PCA algorithm to learn the relation between the high-resolution residue and the low-resolution residue, which is utilized to compensate for the error residue in hallucinated images. Experiments demonstrate that our framework both maintains the global facial structure well and recovers the detailed facial traits in high quality.
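The Locally Linear Embedding idea the algorithm builds on can be sketched as classic neighbor-embedding super-resolution: reconstruct a low-resolution patch from its nearest low-resolution training patches, then transfer the same combination weights to the paired high-resolution patches. This is a generic illustration of that principle, not the TensorPatch model itself:

```python
import numpy as np

def hallucinate_patch(lr_patch, lr_train, hr_train, k=3):
    """LLE-style neighbor-embedding sketch: solve for weights w with
    sum(w) = 1 that best reconstruct lr_patch from its k nearest LR
    training patches, then apply w to the paired HR patches."""
    d = np.linalg.norm(lr_train - lr_patch, axis=1)
    idx = np.argsort(d)[:k]
    N = lr_train[idx]                          # k nearest LR neighbors
    G = (N - lr_patch) @ (N - lr_patch).T      # local Gram matrix
    G += 1e-8 * np.trace(G) * np.eye(k)        # regularize for stability
    w = np.linalg.solve(G, np.ones(k))         # constrained LS solution ...
    w /= w.sum()                               # ... normalized to sum to 1
    return w @ hr_train[idx]                   # same weights in HR space

# Toy dictionary where each HR patch is exactly twice its LR patch
lr_train = np.array([[0., 0.], [1., 0.], [0., 1.]])
hr_train = 2.0 * lr_train
out = hallucinate_patch(np.array([0.5, 0.25]), lr_train, hr_train, k=3)
```

Because the LR-to-HR relation is locally linear in this toy dictionary, the transferred weights hallucinate the HR patch exactly; real patch manifolds are only approximately locally linear, which is where the enhanced multilinear modeling comes in.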
Linear discriminant analysis (LDA) is a popular face recognition technique. However, an inherent problem with this technique stems from the parametric nature of the scatter matrix, in which the sample distribution of each class is assumed to be normal, so LDA tends to suffer when the actual distribution is non-normal. In this paper, a nonparametric scatter matrix is defined to replace the traditional parametric scatter matrix and overcome this problem. Two kinds of nonparametric subspace analysis (NSA), PNSA and NNSA, are proposed for face recognition. The former is based on the principal space of the intra-personal scatter matrix, while the latter is based on its null space. In addition, based on the complementary nature of PNSA and NNSA, we further develop a dual NSA-based classifier framework using Gabor images to further improve the recognition performance. Experiments achieve near-perfect recognition accuracy (99.7%) on the XM2VTS database.
Linear Discriminant Analysis (LDA) is widely used in face recognition systems. However, with the traditional formulation, the information available in the training samples is not sufficiently utilized. In this paper, we present a new formulation, called Generalized LDA, in which the scatter matrices are defined in a more flexible manner by identifying the fundamental principles of scatter matrix construction. We further propose a novel framework called Feedback-based Dynamic Generalized LDA, which integrates Generalized LDA with a dynamic feedback strategy for subspace analysis: the subspace is iteratively optimized using the feedback from the previous step. Comparative experiments demonstrate that the new framework achieves encouraging improvements on both face identification and face verification.
Recently, many Automatic Face Recognition (AFR) systems have been developed for applications involving unspecified persons, which differ from conventional pattern recognition problems where all classes are known at the training stage. In this paper, we present a systematic and comprehensive study of linear subspace methods for face recognition on unspecified persons. Over 6700 experiments using different algorithms with different training parameters and testing conditions are conducted on a large-scale database (4550 samples) to investigate the compound effect of various influential factors. The observations based on these experiments are expected to provide widely applicable guidelines for designing practical AFR systems.
In this paper, we propose a novel framework for face super-resolution based on a layered predictor network. In the first layer, multiple predictors are trained online with a dynamically constructed training set, which is adaptively selected to tailor the trained model to the testing face. Once the dynamic training set is obtained, the optimal predictor can be learned based on the Resampling-Maximum Likelihood Model. To further enhance the robustness of the prediction and the smoothness of the hallucinated image, additional layers are designed to fuse the multiple predictors, with the fusion rule learned from the training set. Experiments demonstrate the effectiveness of the framework.
Lighting condition is an important factor in face analysis and synthesis, and has received extensive study in both computer vision and computer graphics. Motivated by work on multilinear models, we propose a learning-based algorithm for face relighting within a tensor framework, which explicitly accounts for the interaction between the identity factor and the lighting factor. The major contribution of our work is a novel algorithm based on a two-stage decomposition scheme that simultaneously and robustly solves for the identity parameter and the lighting parameter when both are unknown. Equipped with the decomposition algorithm, the capability of the tensor model is significantly extended. Experimental results illustrate the effectiveness of our algorithm.
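Solving for two unknown factor vectors of a multilinear model is commonly approached by alternating least squares: fixing one factor makes the observation linear in the other. The sketch below applies this to a generic bilinear identity-lighting model; it is only loosely analogous to the paper's two-stage decomposition scheme, and all names and dimensions are our own assumptions:

```python
import numpy as np

def fit_bilinear(z, W, n_iter=20):
    """Alternating-least-squares sketch for a bilinear model
    z = W x_id x_light with both coefficient vectors unknown."""
    d, I, L = W.shape
    a = np.ones(I) / I                         # identity coefficients (init)
    b = np.ones(L) / L                         # lighting coefficients (init)
    for _ in range(n_iter):
        # Fix lighting, solve the linear system for identity: z = (W . b) a
        Ma = np.einsum('dil,l->di', W, b)
        a = np.linalg.lstsq(Ma, z, rcond=None)[0]
        # Fix identity, solve the linear system for lighting: z = (W . a) b
        Mb = np.einsum('dil,i->dl', W, a)
        b = np.linalg.lstsq(Mb, z, rcond=None)[0]
    return a, b

# Synthetic observation generated by the bilinear model
rng = np.random.default_rng(1)
W = rng.standard_normal((12, 3, 2))
z = np.einsum('dil,i,l->d', W, np.array([1.0, -0.5, 2.0]), np.array([0.7, 1.3]))
a, b = fit_bilinear(z, W)
err = np.linalg.norm(z - np.einsum('dil,i,l->d', W, a, b))
```

Each half-step is an exact least-squares solve, so the reconstruction residual is non-increasing across iterations; note that only the product of the two factors is identifiable, since scaling a up and b down leaves z unchanged.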
In this paper, we propose a novel patch-based face hallucination framework, which employs a dual model to hallucinate the different components associated with one facial image. Our model is based on a statistical learning approach, Associative Learning, which learns the dependencies between low-resolution image patches and their high-resolution counterparts, using a new concept, the Hidden Parameter Space, as a bridge connecting patches of different resolutions. To compensate for the high-frequency information of images, we present a dual associative learning algorithm that infers, in order, the main components and the high-frequency components of faces. The patches are finally integrated to form a whole high-resolution image. Experiments demonstrate that our approach renders high-quality super-resolution faces.
In this paper, we propose a novel face hallucination framework based on image patches, which exploits the local geometric structures of overlapping patches to hallucinate the different components associated with one facial image. To achieve local fidelity while preserving smoothness in the target high-resolution image, we develop a neighbor combination super-resolution model for high-resolution patch synthesis. To further enhance the detailed information, we propose a second model, which effectively learns neighbor transformations between low- and high-resolution image patch residuals to compensate for the modeling errors of the first model. Experiments demonstrate that our approach can hallucinate high-quality super-resolution faces.