On computer vision
applied to humans:
open problems and applications.
Jordi Vitrià
BCN Perceptual Computing Lab
Departament de Matemàtica Aplicada i Anàlisi, Facultat de Matemàtiques, Universitat de Barcelona,
Gran Via de les Corts Catalanes, 585, 08007 Barcelona
&
Centre de Visió per Computador
Edifici O, Campus de la UAB, Bellaterra, 08193 Barcelona
bcnpcl.wordpress.com
Human-robot interaction is not possible without rich, robust models for the
perception (in the broadest sense) of humans.
13/09/2010 Jordi Vitrià | Septiembre 2010 3
Humans are not a common object, such as cars,
trees or buildings:
Humans display rich behaviors with rich
information that is useful for predicting actions
and decisions.
Humans communicate by perceiving and
producing visual signals.
From David Marr's book: Vision, 1982.
Definition:
As a scientific discipline, computer vision is concerned with the theory and technology for building artificial systems that obtain information from images. The image data can take many forms, such as a video sequence, views from multiple cameras, or multi-dimensional data from a medical scanner.
obtain information from images =
physical world description
Object detection, recognition and tracking...
But, what about understanding people?
THE CANONICAL VIEW
1. There is a great need for computer programs that can
describe and predict people's activities from video.
2. This is difficult to do, because it is hard to detect,
identify and track people in video sequences, because
we have no common vocabulary for describing what
people are doing, and because the interpretation of
what people are doing depends very strongly on
context.
That’s true, but this is not the whole truth: there is
also a lack of appropriate models for understanding
people and their social world.
Human sensing =
«bounding box» problem + pose problem + attributes problem +
interaction problem + gestures + social signals +…
The «bounding box» problem.
Face detection, full body detection, upper body detection
The «bounding box» problem: face detection
Basic idea: slide a (multiscale) window across image and
evaluate a face model at every location.
The «bounding box» problem: face detection
Templates: 20, 30, 40, 50, 60 px
Image: 640x480 px
Translation: 5 px
Speed: 10 fps
------------------------------------------
Total: 62135 searches -> 1.6 μs/search
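The time budget on this slide is simple arithmetic and can be checked in a few lines. What follows is a minimal sketch, not the author's computation: the function names are illustrative, and the window count assumes that every window must lie fully inside the image. Under that assumption the count comes out lower than the slide's total of 62135, so the slide presumably used a different border convention. The per-search time is computed from the slide's own total.

```python
def count_windows(width, height, sizes, step):
    """Count square windows placed every `step` pixels, fully inside the image.

    Assumption (not from the slide): windows may not cross the image border.
    """
    total = 0
    for s in sizes:
        nx = (width - s) // step + 1   # horizontal positions for this size
        ny = (height - s) // step + 1  # vertical positions for this size
        total += nx * ny
    return total


def seconds_per_search(n_searches, fps):
    """Time available for one window evaluation at the given frame rate."""
    return 1.0 / (fps * n_searches)


windows = count_windows(640, 480, [20, 30, 40, 50, 60], 5)
print("windows under the fully-inside assumption:", windows)

slide_total = 62135  # figure quoted on the slide
print("microseconds per search:", seconds_per_search(slide_total, 10) * 1e6)
```

Either way, the order of magnitude is the point: tens of thousands of evaluations per frame leave only a microsecond or two for each one.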
The «bounding box» problem: face detection
Fast Feature Computation: Integral Image
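The integral image makes the sum of any rectangle of pixels available in four lookups, which is what keeps box features affordable at every window position. Below is a minimal pure-Python sketch; the function names and the two-rectangle feature are illustrative assumptions, not the detector shown in the talk.

```python
def integral_image(img):
    """Return ii, padded with a zero row and column.

    ii[y][x] holds the sum of img over rows 0..y-1 and columns 0..x-1.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left corner (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Haar-like feature: left half minus right half (w should be even)."""
    half = w // 2
    return box_sum(ii, x, y, half, h) - box_sum(ii, x + half, y, half, h)


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
print(box_sum(ii, 1, 1, 2, 2))           # sum of 5, 6, 8, 9
print(two_rect_feature(ii, 0, 0, 2, 2))  # (1 + 4) - (2 + 5)
```

The cost of `box_sum` does not depend on the rectangle size, so the same feature can be evaluated at any scale for the same price.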
[Figure: panels labeled Smallest Scale and Larger Scale]
The «bounding box» problem: face detection
Face detection solution: efficient features +
machine learning on very large datasets of
examples.
State of the art: 89%
The «bounding box» problem: face detection
“Large-scale Privacy Protection in Google Street View”, Andrea Frome, German Cheung, Ahmad Abdulkader, Marco Zennaro, Bo Wu, Alessandro Bissacco, Hartwig Adam, Hartmut Neven, Luc Vincent, IEEE International Conference on Computer Vision, 2009.
The «bounding box» problem: body detection
The «bounding box» problem: full body detection
Pedestrian detection using histograms of oriented gradients (Dalal and Triggs 2005)
The «bounding box» problem: upper body detection
Upper-body detector by Manuel J. Marín-Jiménez, Vittorio Ferrari and Andrew Zisserman
The «bounding box» problem: person detection
Part-based object detection (Felzenszwalb et al. 2008)
The «bounding box» problem: person detection
Lubomir Bourdev, Jitendra Malik, Poselets: Body Part Detectors Trained
Using 3D Human Pose Annotations, ICCV 2009
The «bounding box» problem: person detection
• Detect poselets (SVM)
• Hough-vote for each torso location
• Score each cluster:
a_i(x): score of poselet i at location x
w_i: weight of poselet i, learned via M2HT [Maji/Malik CVPR09]
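The scoring formula itself did not survive on this slide. One plausible reading, stated here as an assumption, is a weighted sum of the poselet scores in a cluster, S = Σ_i w_i · a_i(x). The sketch below implements only that assumed form; the function names and the numbers are hypothetical.

```python
def cluster_score(activations, weights):
    """Weighted sum of poselet scores in one cluster (assumed form).

    activations: list of (poselet_id, a_i) pairs, a_i being the detector score
    weights:     dict mapping poselet_id to its learned weight w_i
    """
    return sum(weights[i] * a for i, a in activations)


def best_cluster(clusters, weights):
    """Index of the highest-scoring cluster."""
    scores = [cluster_score(c, weights) for c in clusters]
    return scores.index(max(scores))


# Hypothetical weights and activations, for illustration only.
w = {0: 2.0, 1: 0.5, 2: 4.0}
clusters = [
    [(0, 1.0), (2, 0.5)],  # 2.0 * 1.0 + 4.0 * 0.5
    [(1, 3.0)],            # 0.5 * 3.0
]
print(cluster_score(clusters[0], w))
print(best_cluster(clusters, w))
```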
The «bounding box» problem: person detection
The «bounding box» problem: human layout
The PASCAL Visual Object Classes Challenge 2010
The «bounding box» problem: human layout
The head is detected by integrating several state-of-the-art part detectors:
• Face (frontal + lateral) detection
• Person detection using poselets
• Person detection using Pictorial Model
• Person detection using Discriminatively Trained Part-Based Models
The «bounding box» problem: human layout
EXAMPLE: PASCAL Human Layout Challenge 2010
Faces were detected with OpenCV 2.1.
Details of the implementation:
• We use the following cascades:
• Frontal face (default, alt, alt2, alt_tree).
• Lateral face (profile).
• Each cascade returns several (from 0 up to N) hypotheses
about the head position.
• To integrate the results we use hierarchical clustering.
• The final head box is the one with the maximum score
given by hierarchical clustering.
References: Viola, Jones: Robust Real-time Object Detection, IJCV 2001
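The integration step described above can be sketched as follows. The slides do not say which linkage, distance or score was used, so this is an illustration under stated assumptions: single-linkage clustering on box centers, a cluster's score taken as its number of hypotheses, and the final box taken as the mean of the winning cluster. All function names and the `max_dist` parameter are hypothetical.

```python
def center(box):
    """Center of a box given as (x, y, w, h)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def center_distance(a, b):
    (ax, ay), (bx, by) = center(a), center(b)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def cluster_boxes(boxes, max_dist):
    """Agglomerative single-linkage clustering of boxes.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than max_dist.
    """
    clusters = [[b] for b in boxes]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(center_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_dist:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def best_head_box(boxes, max_dist):
    """Mean box of the largest cluster (assumed score: cluster size)."""
    if not boxes:
        return None
    winner = max(cluster_boxes(boxes, max_dist), key=len)
    n = float(len(winner))
    return tuple(sum(b[k] for b in winner) / n for k in range(4))


# Hypothetical hypotheses from several cascades: three agree, one is an outlier.
hypotheses = [(100, 100, 40, 40), (104, 98, 40, 40),
              (102, 102, 44, 44), (300, 50, 30, 30)]
print(best_head_box(hypotheses, max_dist=20))
```

Counting hypotheses rewards agreement between cascades; weighting each hypothesis by a per-cascade confidence would be a natural alternative.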
[Figure: precision-recall curve; subset: val, part: head, AP = 0.530]
The «bounding box» problem: human layout
We use a person detection system proposed by
Felzenszwalb et al. to detect the body.
Details of the implementation:
• Software version: Discriminatively Trained
Deformable Part Models Version 4.
• Based on model aspect analysis we choose the 4 models
that best detect the head position.
• For each model we choose the component related to the
head position in order to fix the box.
References:
• P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object Detection with
Discriminatively Trained Part Based Models, PAMI 2009
[Figure: precision-recall curve; subset: val, part: head, AP = 0.459]
The «bounding box» problem: human layout
We use the body detection system proposed by Bourdev
et al.
• Initially, we used the set of 1138 poselets trained from the H3D
database.
• The poselets were trained to vote for position and size of the
head.
• In order to improve results, a hierarchical clustering per poselet
was introduced.
• From the original poselet set, we selected the 239 poselets that
give the most reliable votes for the head position.
The selection criterion was the standard deviation (std) of the
votes for the head.
• If the std was smaller than a defined threshold, the poselet was
considered reliable.
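The selection rule above amounts to thresholding the spread of each poselet's head votes. Here is a minimal sketch, assuming each vote is a 2D head-position offset and that the spread is the mean of the per-coordinate standard deviations. The slides do not specify how the std was computed over 2D votes, and the names and threshold are hypothetical.

```python
from statistics import pstdev


def vote_spread(votes):
    """Mean of the per-coordinate population std of (dx, dy) head votes."""
    xs = [v[0] for v in votes]
    ys = [v[1] for v in votes]
    return (pstdev(xs) + pstdev(ys)) / 2.0


def select_reliable(votes_by_poselet, threshold):
    """Ids of poselets whose head votes are tightly concentrated."""
    return sorted(
        pid for pid, votes in votes_by_poselet.items()
        if len(votes) > 1 and vote_spread(votes) < threshold
    )


# Hypothetical votes: poselet "a" is consistent, poselet "b" is not.
votes = {
    "a": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    "b": [(0.0, 0.0), (5.0, 5.0), (-5.0, -5.0)],
}
print(select_reliable(votes, threshold=1.0))
```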
Reference:
• Lubomir Bourdev, Jitendra Malik, Poselets: Body Part Detectors Trained Using 3D Human Pose
Annotations, ICCV 2009.
[Figure: precision-recall curve; subset: val, part: head, AP = 0.425]
The «bounding box» problem: human layout
Confidence 0.5 Confidence 0.8 Confidence 1.6 Confidence 2.25
[Figure: precision-recall curves; subset: val, part: head, AP = 0.753; subset: val, part: hand, AP = 0.000]
The «bounding box» problem: human layout
[Figure: body-part boxes labeled Hand and Foot]
The «bounding box» problem: human layout
The «bounding box» problem: human layout
Hand detection is a SEARCH problem.
Leonid Karlinsky, Michael Dinerstein, Daniel Harari, and Shimon Ullman.
The chains model for detecting parts by their context, CVPR 2010.
[Figure: chains model]
The attributes problem.
Gender, ethnicity, age, facial attributes, hair, glasses, facial traits, aggressiveness, identity, facial expressions, affect, emblems, head pose
Automatic Point-based Facial
Trait Judgments Evaluation
• People are extremely efficient
at making trait judgments (e.g.,
competent, trustworthy) from
faces.
• Rapid, unreflective judgments
of competence based solely on
facial appearance predict
election outcomes.
Physiognomy
Darwin was almost denied the
chance to take the historic
Beagle voyage on account
of his nose.
Apparently, the Captain [a fan of
Lavater] did not believe that a
person with such a nose would
“possess sufficient energy and
determination.”
Evaluating faces = Judging the book by its cover.
• 100 ms exposure is sufficient for a variety of person
judgments:
– Competence
– Trustworthiness
– Aggressiveness
– Likeability
• Additional exposure time increases confidence in judgments
• Single glance impressions
Predicting Senate Elections
From: A. Vinciarelli, M. Pantic, H. Bourlard, Social signal processing: Survey of an emerging domain, Image and Vision Computing, Volume 27, Issue 12, November 2009, Pages 1743-1759
Body Pose/Postures
The pose problem.
http://www.vision.ee.ethz.ch/~hpedemo/
The interaction problem.
• Human2Human: proxemics
• Human2Object: manipulation
B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in
Human-Object Interaction Activities. IEEE Computer Vision and Pattern Recognition
(CVPR). 2010.
The interaction problem.
B. Yao and L. Fei-Fei. Grouplet: a Structured Image Representation for Recognizing
Human and Object Interactions. IEEE Computer Vision and Pattern Recognition
(CVPR). 2010.
The interaction problem.
Context
We can use context! (from Andrew C. Gallagher, A Framework for Using Context to Understand Images of People, PhD Thesis, Carnegie Mellon University, 2009)
Pixel Level
Clothing, other people, relative pose, posture, ...
Capture Content
Time, location, calibration, flash, ...
Social Context
First name, age, gender, social relationship,
anthropometric data, personal calendar, ...
Context
Contextual features that capture the structure of a group
of people, and the position of individuals within the
group.
[Figure panels: Minimum Spanning Tree; Nearest Neighbors]
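Group-structure features of this kind can be derived from face positions alone. The sketch below assumes each person is reduced to a 2D face center and computes a minimum spanning tree (Prim's algorithm) plus nearest-neighbor distances. Which quantities are then used as features (total tree length, each person's degree in the tree, distance to the closest face) is an illustrative choice, not necessarily the feature set of the cited thesis.

```python
import math


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete graph; returns (i, j, length) edges."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                d = dist(points[i], points[j])
                if best is None or d < best[2]:
                    best = (i, j, d)
        edges.append(best)
        in_tree.append(best[1])
    return edges


def group_features(points):
    """Per-group and per-person structure features (needs at least 2 points)."""
    edges = minimum_spanning_tree(points)
    degree = [0] * len(points)
    for i, j, _ in edges:
        degree[i] += 1
        degree[j] += 1
    nearest = [
        min(dist(p, q) for k, q in enumerate(points) if k != idx)
        for idx, p in enumerate(points)
    ]
    return {
        "mst_length": sum(d for _, _, d in edges),
        "mst_degree": degree,         # 1 = edge of the group, higher = central
        "nearest_neighbor": nearest,  # distance to the closest other face
    }


# Hypothetical face centers: three people close together, one standing apart.
faces = [(0, 0), (1, 0), (2, 0), (10, 0)]
print(group_features(faces))
```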
And all this knowledge can be used in real applications…
Agitation in ICU
Conclusion
Building “people perception models” is an Internet vision
problem (= visual feature extraction + machine learning + large
databases) that is still in its infancy.