Field of Research & Interests

I am developing new machine learning and computer vision algorithms (on-line boosting, active boosting-based learning, autonomous on-line learning) for the detection, tracking and recognition of many different object categories. I am interested in supervised, unsupervised and semi-supervised scenarios.



Human-Machine Communication

Modern communication and information processing systems enable us to interact with all kinds of computers and computer-controlled machines, e.g. to make a phone call, to access the internet, to operate entertainment electronics, to use information services, to operate household appliances, or even to navigate cars. These systems have already become an inherent part of our everyday environment (buzz phrase "pervasive computing"). With ongoing technological progress, these systems not only become more capable and efficient, but their handling can also become rather complex. For this reason, adequate user interfaces are a major goal of research and development, enabling everyone to participate effortlessly in a modern computing infrastructure.

Research at the Institute for Human-Machine Communication focuses on the fundamentals of a widely intuitive, natural, and therefore multimodal interaction between humans and information processing systems. All forms of interaction, i.e. modalities, that are available to humans are to be investigated for this purpose. Both the machine's representation of information and the interaction techniques are to be considered in this context, including text and speech, sound and music, haptics, graphics and vision, gestures and facial expressions, and emotions.


Media Communications

In the area of media communications, research at the Institute for Human-Machine Communication focuses on human interaction with digital media technologies. We therefore investigate both the semantic analysis of multimedia data (text, documents, handwriting, audio, graphics, video), and techniques for information indexing and data base retrieval. For this complex mixture of data and content, intelligent pattern processing and recognition methods are explored, and new interaction concepts are developed.


Pattern Recognition

Pattern recognition is the research area that studies the design and operation of systems that recognize patterns in data. There are many different kinds of patterns, e.g. visual patterns, temporal patterns, logical patterns, spectral patterns, etc. Pattern recognition is an inherent part of every intelligent activity or system. There are different approaches to pattern recognition, including:

  • On-line Boosting
  • Conditional Random Fields
  • Statistical or fuzzy pattern recognition
  • Syntactic or structural pattern recognition
  • Knowledge-based pattern recognition

The statistical approach views pattern recognition as a classification task, i.e. assigning an input to a category based on statistical criteria. It encompasses subdisciplines like discriminant analysis, feature extraction, error estimation, cluster analysis, grammatical inference and parsing. Important application areas are speech and image analysis, character recognition, human and machine diagnostics, person identification, industrial inspection, and of course, human-machine interaction. Consequently, statistical pattern recognition is a fundamental scientific discipline and area of research at the Institute for Human-Machine Communication.
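As a minimal illustration of this classification view, consider a nearest-class-mean classifier, which assigns an input to the category whose estimated mean is closest. This is a toy sketch in Python, not code from any of the systems described here:

```python
def train_means(samples):
    """Estimate one mean vector per class from labeled samples."""
    sums, counts = {}, {}
    for x, label in samples:
        s = sums.setdefault(label, [0.0] * len(x))
        for i, xi in enumerate(x):
            s[i] += xi
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def classify(means, x):
    """Minimum-distance rule: assign x to the class with the nearest mean."""
    return min(means, key=lambda c: sum((m - xi) ** 2 for m, xi in zip(means[c], x)))
```

Given two well-separated training clusters, a new point is simply assigned to whichever cluster center lies closer; real statistical classifiers refine this rule with class-conditional densities and priors.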


Signal Processing

Signal processing encompasses the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals by digital or analog devices or techniques. The term signal includes audio, video, speech, image, communication, medical, musical, and other signals in continuous or discrete (i.e. sampled) form. Competence in signal processing is vital for the development of new techniques in Human-Machine Communication.


Statistical Classifiers

Statistical classifiers like Hidden Markov Models (HMMs) have emerged during the last five years as probably the most powerful paradigm for processing dynamic patterns, such as time series, speech signals, and other pattern sequences. Especially in speech recognition, HMMs have become the dominant technology. In multimedia signal processing applications, however, which mostly involve image processing and computer vision problems with dynamic and static patterns, HMMs are still used far less often. Yet this area has become more and more important in recent years, especially in Human-Machine Communication. We therefore investigate the suitability of HMMs for various pattern recognition tasks in multimedia information processing, such as:

  • HMMs in speech recognition
  • HMMs for character, handwriting and formula recognition
  • Image sequence processing with HMMs
  • HMMs for gesture recognition
  • Video indexing with HMMs and stochastic video models
  • HMM-based audio-visual topic recognition
  • Circular 1D- and 2D-HMMs for rotation-invariant recognition of symbols
  • Recognition of deformed and occluded objects
  • HMMs in image databases and image retrieval
  • Pseudo-2D-HMMs for face recognition
  • Pseudo-2D-HMMs for pictogram recognition and spotting
  • HMM applications for person detection and object tracking
  • Gesture and facial expression recognition with 1D- and Pseudo-3D-HMMs
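Common to all of these tasks is evaluating how well an observation sequence fits a trained model. The core of this computation, the forward algorithm for a discrete HMM, can be sketched as follows; pi, A and B stand for an assumed start distribution, transition matrix and emission matrix of a toy two-state model, not for any model used in our systems:

```python
def forward(pi, A, B, obs):
    """P(obs | model): pi = start probs, A = transitions, B = emissions."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]  # initialization
    for o in obs[1:]:  # induction: one step per observation
        alpha = [B[t][o] * sum(alpha[s] * A[s][t] for s in range(n))
                 for t in range(n)]
    return sum(alpha)  # termination: total likelihood
```

In recognition, this likelihood is computed for every candidate model (e.g. every phoneme or symbol class) and the best-scoring model wins.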


Neural Networks

A Neural Network (NN) is an information-processing structure inspired by the interconnected, parallel topology of the mammalian brain. NNs use a collection of mathematical models to emulate some of the observed properties of biological nervous systems and draw on the analogies of adaptive biological learning. The key element of the NN paradigm is its structure composed of a large number of interconnected processing elements that are analogous to neurons and that are tied together with weighted connections that are analogous to synapses.

Learning in NNs involves adjustments to the connections that exist between the neurons. Learning typically occurs by example through training, or exposure to a set of verified input/output data where the training algorithm iteratively adjusts the connection weights (synapses). These connection weights store the knowledge necessary to solve specific problems.

NNs are used for pattern recognition and classification tasks, with the ability to robustly classify imprecise input data, such as in character, speech and image recognition. The advantage of NNs lies in their resilience against distortions in the input data and their capability of learning. NNs can be implemented in software or in specialized hardware.
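The learning-by-weight-adjustment described above can be illustrated with the simplest possible network, a single perceptron trained on the logical AND function. This is a toy sketch, not one of the networks investigated at the institute:

```python
def predict(w, b, x):
    """Threshold neuron: fires iff the weighted input sum exceeds zero."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

def train_perceptron(samples, lr=0.1, epochs=50):
    """Adjust the connection weights whenever the output is wrong."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            err = target - predict(w, b, x)  # 0 if correct, else +/-1
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b
```

After training on the four input/output pairs of AND, the learned weights store the solution, exactly in the sense of the paragraph above: the knowledge sits in the connection strengths, not in an explicit rule.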


Hybrid Systems

Hybrid systems used for pattern recognition are an effective combination of neural networks and statistical classifiers, in particular Hidden Markov Models (HMMs). Special training procedures are required for neural networks in order to combine them efficiently with HMMs. In many cases, the structure of the underlying HMMs has to be modified for the combination with neural networks. We designed a large variety of different combination possibilities, including Maximum Mutual Information Neural Networks, Discriminant Feature Transformation Hybrids and Tied-Posterior-HMMs.


Speech Processing

Our research in speech processing aims to develop algorithms and systems which are able to automatically recognize continuous speech under real-world conditions. For that purpose, statistical classifiers as well as hybrid systems are being investigated. Most methods are based on stochastic Hidden Markov Models (HMMs), which are utilized as reference models for speech sounds (phonemes). Words and complete sentences can be built up from the phoneme models. The sentences are analysed by a speech understanding module, which gives an interpretation of their meaning. Special problems have to be solved due to the great variability in pronunciation as well as the strong dependence on the speaker. Here, we successfully apply pronunciation variants and adaptive classifiers.

Building on these recognition capabilities, statistical methods are used to interpret natural speech by means of stochastic grammars, ranging from semantic decoding to automatic translation based on it. In particular, expectation-based approaches are examined which exploit all participating knowledge bases in parallel, from the acoustic to the semantic level.


Gestures, Action and Emotion


Multimodal Fusion

The combination of several different modalities for input and output, such as haptics, speech, and gesture, provides for efficient, intuitive and error-robust human-machine communication. Merging different modalities to obtain multimodal information exchange is one of the most important topics of contemporary research on human-machine communication, straightforwardly extending the idea of an enhanced human-machine dialog by introducing natural input and output channels. Some of the most interesting problems are:

  • How can information transfer be distributed appropriately, in temporal order and in content, over two or more available modalities?
  • To what extent does the concurrent and semantically coupled use of several modalities improve, depending on the application, robustness on the one hand and the efficiency and acceptance of an information processing system on the other?
  • Can certain approaches or formalisms be transferred from one modality to another, and which statistical and possibly rule-based methods can be applied to perform successive data fusion at the different abstraction levels?
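For the last question, the simplest statistical fusion method is late fusion at the decision level: each modality delivers class posteriors, which are combined by a weighted sum. A hypothetical sketch; the modalities, class names and weights below are made up for illustration:

```python
def late_fusion(posteriors, weights):
    """Combine per-modality class posteriors by a weighted sum, then renormalize."""
    classes = posteriors[0].keys()
    fused = {c: sum(w * p[c] for p, w in zip(posteriors, weights))
             for c in classes}
    total = sum(fused.values())
    return {c: v / total for c, v in fused.items()}

# Example: a speech recognizer and a gesture recognizer vote on "yes" vs "no".
speech = {"yes": 0.8, "no": 0.2}
gesture = {"yes": 0.4, "no": 0.6}
fused = late_fusion([speech, gesture], weights=[0.7, 0.3])
```

More sophisticated schemes fuse earlier, at the feature level, or weight the modalities adaptively by their estimated reliability.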


Interactive Graphics

Techniques based on image processing render new ways of natural human-machine interaction possible. These include gesture recognition for visual command input, object tracking for locating people and identifying their actions, and face recognition to personalize interactive environments. New dimensions for interaction open up by combining these methods with immersive technologies like Augmented or Virtual Reality.


Face Recognition

Human beings are extremely good at classifying faces, even under adverse conditions such as partial occlusion, rotation or visual distortion. Thus most people can easily spot known individuals in larger groups, even under disadvantageous conditions.

Today, all known technical systems fall far short of these enormous, evolutionarily grown recognition capabilities. Despite the resulting difficulty of matching human performance with a technical system, automated face recognition remains an active field of research. In addition, finding faces in arbitrary images, as well as recognizing facial expressions, is the focus of several activities within our institute. Modeling and classification are done using the wide range of signal processing methods mentioned above.

Automated face recognition enables a wide spectrum of technical applications. For example, automated entrance systems for companies have nearly reached the stage of maturity for serial products.

Real-Time Detection, Tracking and Recognition

The key idea of our approach is to formulate detection, recognition and tracking as classification problems. By doing so we can apply the same techniques for all three tasks. The major advantage is that low-level computations can be shared and need to be done only once.


For each frame, the integral representation needs to be computed only once; it is then used by all three modules for feature computation. Note that although each unit selects features appropriate for its specific task, the computation time of the features is negligible.
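The shared integral representation can be sketched as follows: the integral image is built once per frame, after which the sum over any rectangle costs only four lookups, regardless of the rectangle's size. This is a generic sketch of the standard technique, not our production code:

```python
def integral_image(img):
    """Build a (h+1) x (w+1) table of cumulative sums, once per frame."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]  # running sum of the current row
            ii[y + 1][x + 1] = ii[y][x + 1] + row
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum over img[top:bottom, left:right] in O(1): four table lookups."""
    return (ii[bottom][right] - ii[top][right]
            - ii[bottom][left] + ii[top][left])
```

Each module can then evaluate its own rectangle features on the same table, which is exactly why the low-level work is shared across detection, recognition and tracking.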




Person and Object Tracking


Information Indexing and Retrieval

With man-machine interaction in mind, the basic goal of research in this field is to make queries in multimedia databases as intuitive and efficient as possible. The motivation behind this discipline is the increasing size of such digital databases, driven by optimized compression algorithms, growing storage space and a growing internet community. While its roots lie in the textual interpretation of documents, today a number of diverse applications and further digital data forms such as images, video, audio, and video games can be observed. These call for advanced methods of pattern recognition and artificial intelligence for their interpretation.

In the field of Information Retrieval (IR), the main focus lies on enabling intuitive access to such data for the user. Information Indexing (II), on the other hand, is concerned with processing the data streams for efficient later queries; in this way, data is categorized, subdivided or even sorted and summarized.

At our institute, a demonstrator in the field of IR exists for recognizing tools drawn with the mouse as queries in a tool database. Furthermore, we investigate hummed or sung queries for easy access to large music archives: previously unknown polyphonic audio tracks are preprocessed to enable dynamic matching against monophonic samples. Finally, in the field of II, we strive for the multimodal recognition of action units in recorded meetings.
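The dynamic matching of a monophonic query against a preprocessed track is typically based on dynamic time warping (DTW), which aligns two sequences of different lengths, so that a melody hummed too slowly or too quickly can still match. A minimal sketch; the pitch sequences in the test are made up:

```python
def dtw(a, b):
    """Dynamic time warping distance between two pitch sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # local pitch difference
            # best of: skip in a, skip in b, or advance both
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

In a retrieval setting, the query would be scored against every indexed melody and the candidates returned in order of increasing DTW distance.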


User Interfaces and Modeling


Usability Engineering

The thorough design and testing of novel dialogue concepts, together with usability and acceptance tests, yield man-machine interfaces that users really enjoy.


Handwriting Recognition

The goal of automatic handwriting recognition is to enhance user-friendliness through pen-based input devices and to increase automation for fast and efficient processing of large amounts of documents. Automatic handwriting recognition can either be done at the time of input ("on-line") or later, when processing documents ("off-line"). On-line means in this context that time information, i.e. the trajectory of the strokes, is processed as well. In contrast, off-line recognition uses only a static image.
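The extra information available on-line can be illustrated with a classic on-line feature: quantizing the writing direction along the pen trajectory into a small set of direction codes, which can then feed a sequence classifier such as an HMM. This is a hypothetical sketch, not the feature set of any particular recognizer:

```python
import math

def direction_features(points):
    """Map an on-line pen trajectory to quantized 8-direction codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angle = math.atan2(y1 - y0, x1 - x0)        # segment direction
        codes.append(int(round(4 * angle / math.pi)) % 8)  # 8 sectors of 45 degrees
    return codes
```

An off-line recognizer never sees this sequence, since the stroke order and timing are lost in the scanned image.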

Besides the well-known OCR (optical character recognition) of machine-printed and digitized characters and the recognition of single handwritten characters, the recognition of cursive longhand plays a growing role for the input of text in mobile devices.

Example applications are:

  • On-line handwriting recognition
    Personal Digital Assistant (PDA), Pocket PC, digitizer tablet, Notebook, Webpad, Tablet PC
  • Off-line handwriting recognition
    handwritten notes, address recognition (mail), form processing
  • Document recognition (OCR)
    archiving (newspapers, bills), indexing and retrieval in data bases, form processing, address recognition

Depending on the application, different questions prevail:

  • localization, preprocessing and feature extraction of the script
  • recognition of single characters, words or sentences
  • segmentation properties (block letters, longhand, connected or divided characters because of low quality or resolution)
  • number of different fonts or writers (writer independent or not, adaptation)
  • choice of a codebook (size) or language model, grammar

Recognizing continuous cursive longhand, which cannot easily be segmented into single characters, is quite similar to a speech recognition task. For this task, and for handwriting recognition in general, statistical methods for pattern recognition (e.g. Hidden Markov Models) are the most common technique for modeling and recognition.



Technical Acoustics and Noise Abatement

Physical and hearing-related methods for the evaluation of noise are developed and implemented in measuring systems. Sound Quality Design refers to creating the desired sound characteristics of industrial products using psychophysical methods.


The properties of the human auditory system are being investigated and considered in practical applications, e.g. in the context of source coding of audio signals, audiology, audio engineering technology or room acoustics.



My research interests are in computer vision. I research algorithms for automatically interpreting images and videos, particularly those containing people, gestures, faces, pedestrians and cars. I am currently working on:

  • Online boosting learning

  • Active boosting-based learning

  • Semi-supervised learning

  • Combining discriminative and generative methods

  • Human body pose estimation

  • Pedestrian detection

  • Activity recognition

  • Object recognition

  • Motion synthesis

My current projects include basic research and system development in computer vision (motion, stereo and object recognition): a real-time gesture tracking and recognition system, pedestrian detection, recognition of facial expressions, virtual(ized) reality...

Please see the ICG for a list of research projects, or my list of publications for more details.


Characterizing myself

I enjoy working hard: If it wasn’t painfully difficult, you did it wrong. (Dan Brown, Angels & Demons)

That's typical of me:

Uli Stein



Research Interest Keywords 

Computational sensors, computer vision, human-computer interaction, medical applications, mobile robots, quality-of-life technology, and stereo vision

Biomedical Image Analysis - Developing algorithms for biomedical image analysis under an image feature-based machine learning framework.
Computational Sensor Laboratory - Developing specialty imaging sensors for improving robustness and capabilities of robot vision systems.
Face Group - Robust detection, recognition, and tracking of human faces with automated analysis of expressions
Helicopter Lab - A vision-guided robot helicopter which can function in all weather conditions using only on-board intelligence and computing power.
Human Identification at a Distance - Developing and evaluating human identification technologies as part of the Defense Advanced Research Projects Agency (DARPA) sponsored program in Human Identification at a Distance (HumanID).
Medical Robotics and Computer Assisted Surgery - Researching planning (medical image computing, simulation) and execution (intraoperative sensing and actuation) technologies for computer-assisted surgery.
People Image Analysis Consortium - The People Image Analysis (PIA) Consortium develops and distributes technologies that process images and videos to detect, track, and understand peoples' faces, bodies, and activities.
Virtualized Reality™ - Constructing views of real events from nearly any viewpoint.
Vision for Safe Driving - Computer vision algorithms and systems for automotive safe driving applications.
2D->3D Face Model Construction - To develop a linear algorithm that uniquely recovers the 3D non-rigid shapes and poses of a human face from a 2D monocular video.
3D Head Motion Recovery in Real Time - Developing a cylindrical model-based algorithm for recovering the full motion (3D rotations and 3D translations) of the head in real time.
3D Image Overlay - X-ray vision has always been the dream of surgeons; Image Overlay is the next best thing.
3D Video Reconstruction of Skeletal Anatomy - This project aims to reconstruct from video sequences the 3D shape of skeleton of mouse fetuses (for pharmaceutical research) stored inside vials with glycerol.
A Statistical Quantification of Human Brain Asymmetry - Constructing image index features to retrieve medically similar cases from a multimedia medical database.
Autonomous Helicopter - Develop a vision-guided robot helicopter
Camera Assisted Meeting Event Observer - Developing the Camera Assisted Meeting Event Observer (CAMEO) - a sensory system designed to provide an electronic agent with physical awareness of the real world.
Car Tracking - Algorithms for tracking cars and generating "bird's eye views" of the surrounding road scene.
Cohn-Kanade AU-Coded Facial Expression Database - An AU-coded database of over 2000 video sequences of over 200 subjects displaying various facial expressions.
Dynamic Conformal Radiotherapy
Face and Facial Feature Tracking - Rigid Tracking of Faces and Non-Rigid Tracking of Facial Features
Face Databases - Miscellaneous face databases collected at CMU.
Face Detection - We are developing computer methods to automatically locate human faces in photos and video.
Face Detection Databases - A collection of databases for training and testing face detectors.
Face Model Building and Fitting - Techniques for building and fitting 2D and 3D models of human faces and heads.
Face Recognition - Recognizing people from images and videos of their faces.
Face Recognition Across Illumination - Recognizing people from faces: video and still images.
Face Video Hallucination - A learning-based approach to super-resolve human face videos.
Facial Expression Analysis - Automatic facial expression encoding, extraction and recognition, and expression intensity estimation for the applications of MPEG4 application: teleconferencing, human-computer interaction/interface.
Gaze Estimation - Algorithms for estimating where someone is looking
Hallucinating Faces - A super-resolution algorithm with a strong face-specific prior.
Human Kinematic Modeling and Motion Capture - Developing a system for building 3D kinematic models of humans and then using the models to track the person in new video sequences.
Human Motion Transfer - Developing a system for capturing the motion of one person and rendering a different person performing the same motion.
Image Enhancement for Faces - Video enhancement techniques, specifically tailored for human faces.
Informedia Digital Video Library - Informedia is pioneering new approaches for automated video and audio indexing, navigation, visualization, summarization, search, and retrieval, and embedding them in systems for use in education, health care, defense intelligence and the understanding of human activity.
Knee Surgery Simulation - Haptic interface for simulated knee surgery and interaction with volumetric data.
Light-fields - A variety of uses of light-fields in computer vision.
Modeling by Videotape - Factorization method of solving the structure-from-motion problem
Non-Invasive Optical Imaging in vivo for Early Detection and Advanced Diagnosis of Cancer
Object Recognition Using Statistical Modeling - Automobile and human face detection via statistical modeling.
Photometric Limits on Computer Vision - An investigation into the fundamental limits imposed on computer vision algorithms by imperfect or incomplete photometric information.
Precision Freehand Sculpting - Developing a handheld tool to accurately cut bone for joint replacement surgery.
Prediction & Planning - This project analyses the safety and interaction of moving objects in complex road scenes.
Reconfigurable Vision Machine - Developing new hardware and software for high performance computer vision.
Scene Flow - Methods of computing dense, non-rigid motion of 3D scenes.
Setting Low-Level Vision Parameters - Techniques for feeding back information from high-level vision modules to low-level modules to improve the performance of the overall system.
Soft Tissue Simulation for Plastic Surgery
Spatio-Temporal View Interpolation - An image-based rendering algorithm for view interpolation across both space and time.
Super-Resolution Optical Flow - A super-resolution algorithm for complex non-rigid scenes.
Temporal Shape-From-Silhouette - Developing algorithms for the computation of 3D shape from multiple silhouette images captured across time.
Textureless Layers - Techniques for the 3D reconstruction of scenes consisting of constant intensity piecewise planar regions (layers).

Karhunen-Loeve Transform

The Karhunen-Loeve Transform, also known as Principal Component Analysis (PCA), is used to analyse data in terms of its principal components.
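A minimal sketch of the idea: center the data, form the covariance matrix, and extract the first principal component, here by power iteration for simplicity (illustrative only; real implementations use a full eigendecomposition):

```python
def principal_component(data, iters=200):
    """First principal component via power iteration on the covariance matrix."""
    n, d = len(data), len(data[0])
    mean = [sum(col) / n for col in zip(*data)]
    centered = [[x - m for x, m in zip(row, mean)] for row in data]
    # d x d covariance matrix of the centered data
    cov = [[sum(centered[k][i] * centered[k][j] for k in range(n)) / n
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # renormalize; converges to top eigenvector
    return v
```

Projecting the data onto the leading components yields the compact, decorrelated representation that makes the transform useful for compression and for features such as eigenfaces.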

Karhunen-Loeve Transform by Tomasz Wegrzanowski

Karhunen-Loeve Transform by Kristina Scherbaum

Karhunen-Loeve Transform by Peter Schaefer