On computer vision applied to humans:

open problems and applications.

Jordi Vitrià

BCN Perceptual Computing Lab

Departament de Matemàtica Aplicada i Anàlisi, Facultat de Matemàtiques, Universitat de Barcelona,

Gran Via de les Corts Catalanes, 585, 08007 Barcelona

&

Centre de Visió per Computador

Edifici O, Campus de la UAB, Bellaterra, 08193 Barcelona

[email protected]

bcnpcl.wordpress.com

Human-robot interaction is not possible without rich, robust models for the

perception (in the broadest sense) of humans.

13/09/2010 Jordi Vitrià | September 2010


Humans are not ordinary objects like cars, trees or buildings:

Humans display rich behaviors with rich

information that is useful for predicting actions

and decisions.


Humans communicate by perceiving and

producing visual signals.


From David Marr's book: Vision, 1982.

Definition:

As a scientific discipline, computer vision is concerned with the theory and technology for building artificial systems that obtain information from images. The image data can take many forms, such as a video sequence, views from multiple cameras, or multi-dimensional data from a medical scanner.

obtain information from images =

physical world description

Object detection, recognition and tracking...


But, what about understanding people?

THE CANONICAL VIEW

1. There is a great need for computer programs that can

describe and predict people's activities from video.

2. This is difficult to do, because it is hard to detect,

identify and track people in video sequences, because

we have no common vocabulary for describing what


people are doing, and because the interpretation of

what people are doing depends very strongly on

context.

That’s true, but this is not the whole truth: there is

also a lack of appropriate models for understanding

people and their social world.


Human sensing =

«bounding box» problem + pose problem + attributes problem +

interaction problem + gestures + social signals +…

The «bounding box» problem.

Face detection, full body detection, upper body detection


The «bounding box» problem: face detection


Basic idea: slide a (multiscale) window across image and

evaluate a face model at every location.


Templates: 20, 30, 40, 50, 60 px

Image: 640x480 px

Translation: 5 px

Speed: 10fps

------------------------------------------

Total: 62135 searches -> 1.6 μs/search
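The budget above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming square templates that must fit entirely inside the image and the same 5 px step in both directions; the function names are hypothetical. Under that convention the window count comes out in the same range as the slide's total but not identical to it, since the slide does not state how image borders are handled. The time per search is computed from the slide's own total.

```python
def count_windows(image_w, image_h, template_sizes, stride):
    """Count square sliding-window positions that fit fully inside the image."""
    total = 0
    for size in template_sizes:
        nx = (image_w - size) // stride + 1  # horizontal positions
        ny = (image_h - size) // stride + 1  # vertical positions
        total += nx * ny
    return total


def time_budget_per_search(searches_per_frame, fps):
    """Seconds available for one window evaluation at the given frame rate."""
    return 1.0 / (fps * searches_per_frame)


# Assumed border convention: only windows fully inside the 640x480 image.
windows = count_windows(640, 480, [20, 30, 40, 50, 60], stride=5)

# Budget derived from the total quoted on the slide (62135 searches, 10 fps).
budget = time_budget_per_search(62135, fps=10)

print(windows, f"{budget * 1e6:.2f} microseconds per search")
```

Whatever the exact count, the point stands: each window evaluation has a budget of a couple of microseconds, which rules out any expensive per-window feature computation.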



Fast Feature Computation: Integral Image


[Figure panels: smallest scale, larger scale]
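The integral image (summed-area table) is what makes each window evaluation cheap: after one pass over the image, the sum of any rectangle costs four lookups, regardless of its size. The following is a minimal pure-Python sketch; the function names and the two-rectangle feature layout are illustrative assumptions, not the exact features of any particular detector.

```python
def integral_image(img):
    """Return the summed-area table with an extra zero row and column.

    ii[y][x] holds the sum of all pixels img[r][c] with r < y and c < x,
    so ii has (h + 1) rows and (w + 1) columns.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Haar-like feature: left half minus right half of a window (w even)."""
    half = w // 2
    return box_sum(ii, x, y, half, h) - box_sum(ii, x + half, y, half, h)


img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2))           # 6 + 7 + 10 + 11
print(two_rect_feature(ii, 0, 0, 4, 3))  # left columns minus right columns
```

Evaluating the same feature at a larger scale only changes the rectangle coordinates, so the cost per feature stays constant across template sizes.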


Face detection solution: efficient features +

machine learning on very large datasets of

examples.


State of the art: 89%



“Large-scale Privacy Protection in Google Street View”, Andrea Frome, German Cheung, Ahmad Abdulkader, Marco Zennaro, Bo Wu, Alessandro Bissacco, Hartwig Adam, Hartmut Neven, Luc Vincent, IEEE International Conference on Computer Vision, 2009.

The «bounding box» problem: body detection


The «bounding box» problem: full body detection


Pedestrian detection using histograms of oriented gradients (Dalal and Triggs 2005)

The «bounding box» problem: upper body detection


Upper-body detector by Manuel J. Marín-Jiménez, Vittorio Ferrari and Andrew Zisserman

The «bounding box» problem: person detection


Part-based object detection (Felzenszwalb et al. 2008)





Lubomir Bourdev, Jitendra Malik, Poselets: Body Part Detectors Trained

Using 3D Human Pose Annotations, ICCV 2009


• Detect poselets

(SVM)

• Hough-vote for each

torso location

• Score each cluster:


$a_i(x)$: score of poselet $i$ at location $x$

$w_i$: weight of poselet $i$, learned via M2HT [Maji/Malik CVPR09]
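The three steps above can be sketched as follows. The scoring formula itself did not survive as text on the slide, so the weighted sum of poselet scores (each detection's score multiplied by its poselet's weight) is an assumption based on the two definitions given. The greedy clustering, the dictionary layout of a detection, and the per-poselet torso offsets are hypothetical choices made only to keep the example self-contained.

```python
from math import hypot


def vote_torso(det, offsets):
    """Cast one Hough vote: detection centre plus the poselet's torso offset."""
    dx, dy = offsets[det["poselet"]]
    return (det["x"] + dx, det["y"] + dy)


def cluster_votes(dets, offsets, radius):
    """Greedy clustering: a vote joins the first cluster whose seed is near."""
    clusters = []  # each cluster: {"seed": (x, y), "members": [det, ...]}
    for det in sorted(dets, key=lambda d: -d["score"]):
        vx, vy = vote_torso(det, offsets)
        for c in clusters:
            if hypot(vx - c["seed"][0], vy - c["seed"][1]) <= radius:
                c["members"].append(det)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append({"seed": (vx, vy), "members": [det]})
    return clusters


def cluster_score(cluster, weights):
    """Assumed score: sum over member detections of weight times score."""
    return sum(weights[d["poselet"]] * d["score"] for d in cluster["members"])


offsets = {0: (0, 10), 1: (5, 0)}   # hypothetical poselet-to-torso offsets
weights = {0: 0.5, 1: 2.0}          # hypothetical learned weights
dets = [
    {"poselet": 0, "x": 100, "y": 90, "score": 1.0},
    {"poselet": 1, "x": 96, "y": 101, "score": 0.5},
    {"poselet": 0, "x": 300, "y": 50, "score": 0.2},
]
clusters = cluster_votes(dets, offsets, radius=10)
print([round(cluster_score(c, weights), 3) for c in clusters])
```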




The «bounding box» problem: human layout


The PASCAL Visual Object Classes Challenge 2010


The head is detected by integrating several state-of-the-art part detectors:


• Face (frontal + lateral) detection

• Person detection using poselets

• Person detection using Pictorial Model

• Person detection using Discriminatively Trained Part-Based Models


EXAMPLE: PASCAL Human Layout Challenge 2010

Faces were detected with OpenCV 2.1.

Details of the implementation:

• We use the following cascades:

• Frontal face (default, alt, alt2, alt_tree).

• Lateral face (profile).

• Each cascade returns several (from 0 up to N) hypotheses about the head position.

• To integrate the results we use hierarchical clustering.



• The final head box is the one with the maximum score

given by hierarchical clustering.
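One way to integrate the hypotheses from several cascades is sketched below. The slides do not specify the linkage, the distance, or how a cluster is scored, so everything here is an assumption: boxes are `(x1, y1, x2, y2)` tuples, similarity is intersection over union, clusters are built by single-linkage agglomeration cut at a hypothetical `min_iou` threshold, and the cluster score is simply its number of members.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def single_linkage_clusters(boxes, min_iou):
    """Merge clusters while any cross-cluster pair overlaps by min_iou or more."""
    clusters = [[i] for i in range(len(boxes))]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(iou(boxes[p], boxes[q]) >= min_iou
                       for p in clusters[i] for q in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters


def best_head_box(boxes, min_iou=0.5):
    """Return (mean box of the largest cluster, score = number of members)."""
    if not boxes:
        return None, 0
    best = max(single_linkage_clusters(boxes, min_iou), key=len)
    mean_box = tuple(sum(boxes[i][k] for i in best) / len(best)
                     for k in range(4))
    return mean_box, len(best)


# Hypothetical outputs of several cascades on one image.
hypotheses = [(10, 10, 50, 50), (12, 12, 52, 52), (11, 9, 49, 51),
              (200, 200, 240, 240)]
print(best_head_box(hypotheses))
```

Counting members rewards agreement between cascades: an isolated detection from a single cascade loses to a location that several cascades confirm.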

References: Viola, Jones: Robust Real-time Object Detection, IJCV 2001

[Figure: precision-recall curve. subset: val, part: head, AP = 0.530]
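For readers unfamiliar with the AP figure, the sketch below shows one common way to compute average precision from ranked detections: the area under the precision-recall curve after replacing each precision value by the best precision reached at any higher recall. It assumes this is the variant used by the challenge evaluation; matching detections to ground-truth boxes is outside the sketch, and the function name and inputs are hypothetical.

```python
def average_precision(scores, is_true_positive, num_positives):
    """Area under the precision-recall curve with a monotone precision envelope.

    scores: detection confidences.
    is_true_positive: parallel list of booleans (detection matched an object).
    num_positives: number of ground-truth objects.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_positives)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


print(average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2))
```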


We use a person detection system proposed by

Felzenszwalb et al. to detect the body.

Details of the implementation:

• Software version: Discriminatively Trained

Deformable Part Models Version 4.

• Based on an analysis of the model aspects, we chose the 4 models that best detect the head position.

• For each model, we chose the component related to the head position in order to fix the box.

References:

• P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object Detection with

Discriminatively Trained Part Based Models, PAMI 2009

[Figure: precision-recall curve, person detection. subset: val, part: head, AP = 0.459]


We use the body detection system proposed by Bourdev

et al.

• Initially, we used the set of 1138 poselets trained from the H3D

database.

• The poselets were trained to vote for position and size of the

head.

• In order to improve results a hierarchical clustering per poselet

was introduced.

• From the original poselet set, we selected the 239 poselets that give the most reliable votes for the head position. The selection criterion was the standard deviation (std) of the votes for the head.

• If the std was smaller than a defined threshold, the poselet was considered reliable.
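The selection rule can be sketched as below. The slides give neither the threshold nor the exact form of a vote, so this assumes each vote is a `(dx, dy)` offset from the poselet to the head, normalised by poselet size, and that a poselet is kept when the spread along both axes stays under a hypothetical `max_std`.

```python
from statistics import pstdev


def reliable_poselets(head_votes, max_std):
    """Keep poselets whose head-position votes have a low spread.

    head_votes: dict mapping poselet id to a list of (dx, dy) head offsets.
    max_std: assumed reliability threshold on the standard deviation.
    """
    selected = []
    for pid, votes in head_votes.items():
        if len(votes) < 2:
            continue  # too few votes to judge reliability
        std_x = pstdev(v[0] for v in votes)
        std_y = pstdev(v[1] for v in votes)
        if max(std_x, std_y) < max_std:
            selected.append(pid)
    return sorted(selected)


votes = {
    1: [(0.0, -1.0), (0.1, -1.1), (-0.1, -0.9)],  # consistent head votes
    2: [(0.0, -1.0), (2.0, 1.0), (-2.0, -3.0)],   # scattered head votes
}
print(reliable_poselets(votes, max_std=0.5))
```

A poselet covering a leg, for instance, would be expected to produce scattered head votes and be dropped, while a face-and-shoulders poselet would pass.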

Reference:

• Lubomir Bourdev, Jitendra Malik, Poselets: Body Part Detectors Trained Using 3D Human Pose

Annotations, ICCV 2009.

[Figure: precision-recall curve]



[Figure panels: Confidence 0.5, Confidence 0.8, Confidence 1.6, Confidence 2.25]

[Figure: precision-recall curve. subset: val, part: hand, AP = 0.000]





[Figure labels: Hand, Foot]


Hand detection is a SEARCH problem.


Hands

Leonid Karlinsky, Michael Dinerstein, Daniel Harari, and Shimon Ullman.

The chains model for detecting parts by their context, CVPR 2010.








[Figure: Chains model. Node labels: $L_f$, $L_h$, $F^{T(1)}$, $F^{T(2)}$, $F^{T(3)}$, $F_1$ to $F_7$, and $M, T$]






The attributes problem.

Gender, ethnicity, age, facial attributes, hair, glasses, facial traits, aggressiveness, identity, facial expressions, affect, emblems, head pose.

Automatic Point-based Facial

Trait Judgments Evaluation




• People are extremely efficient

at making trait judgments (e.g.,

competent, trustworthy) from

faces.


• Rapid, unreflective judgments

of competence based solely on

facial appearance predict

election outcomes.

Physiognomy



Darwin was almost denied the

chance to take the historic

Beagle voyage on account

of his nose.


Apparently, the Captain [a fan of

Lavater] did not believe that a

person with such a nose would

“possess sufficient energy and

determination.”



Evaluating faces = Judging the book by its cover.

• 100 ms exposure is sufficient for a variety of person

judgments


– Competence

– Trustworthiness

– Aggressiveness

– Likeability

• Additional exposure time increases confidence in judgments

• Single glance impressions



Predicting Senate Elections









From: A. Vinciarelli, M. Pantic, H. Bourlard, Social signal processing: Survey of an emerging domain, Image and Vision Computing, Volume 27, Issue 12, November 2009, Pages 1743-1759

Body Pose/Postures

The pose problem.




http://www.vision.ee.ethz.ch/~hpedemo/

The interaction problem.

Human2Human: proxemics. Human2Object: manipulation.


B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in

Human-Object Interaction Activities. IEEE Computer Vision and Pattern Recognition

(CVPR). 2010.




B. Yao and L. Fei-Fei. Grouplet: a Structured Image Representation for Recognizing

Human and Object Interactions. IEEE Computer Vision and Pattern Recognition

(CVPR). 2010.


Context

We can use context! (from Andrew C. Gallagher, A Framework for Using Context to Understand Images of People, PhD Thesis, Carnegie Mellon University, 2009)

Pixel Level

Clothing, other people, relative pose, posture, ...

Capture Context

Time, location, calibration, flash, ...


Social Context

First name, age, gender, social relationship,

anthropometric data, personal calendar, ...

Context

Contextual features that capture the structure of a group

of people, and the position of individuals within the

group.


[Figure panels: Minimum Spanning Tree, Nearest Neighbors]
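Both structures named above can be computed directly from the detected face positions. The sketch below assumes each person is represented by a 2D face centre and uses Euclidean distance; the node degree in the tree is shown as one hypothetical example of a "position within the group" feature, and the slides do not say which features the cited thesis actually derives.

```python
from math import dist


def nearest_neighbor(points):
    """Index of the closest other point for each point (needs >= 2 points)."""
    return [min((j for j in range(len(points)) if j != i),
                key=lambda j: dist(points[i], points[j]))
            for i in range(len(points))]


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns the edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        i, j = min(((a, b) for a in in_tree
                    for b in range(n) if b not in in_tree),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
        edges.append((i, j))
        in_tree.add(j)
    return edges


def node_degrees(edges, n):
    """Number of tree edges touching each person."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg


faces = [(0, 0), (1, 0), (2, 0), (10, 0)]  # hypothetical face centres
tree = minimum_spanning_tree(faces)
print(tree, node_degrees(tree, len(faces)), nearest_neighbor(faces))
```

In a group photo, a person with a high degree in the tree sits in the middle of the group, while a long edge to the nearest neighbor suggests someone standing apart.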

And all this knowledge can be used in real applications…

Agitation in ICU

Conclusion

Building “people perception models” is an Internet vision problem (= visual feature extraction + machine learning + large databases) that is still in its infancy.
