Date posted: 13-Apr-2017
Category: Technology
Uploaded by: wongun-choi
A Unified Framework for Multi-Target Tracking and Collective Activity Recognition
Wongun Choi and Silvio Savarese
University of Michigan, Ann Arbor
VisionLab
Good afternoon. I'm Wongun Choi from the University of Michigan. This is joint work with my advisor, Silvio Savarese.
Consider a video sequence with multiple people.
Our Goal
Our goal is to understand the behavior of all individuals in the scene.
Our Goal
Multiple target tracking
First, we want to estimate the trajectories of all individuals.
Our Goal
Multiple target tracking
Recognize activities at different levels of granularity
Atomic activity & pose
[Figure: people labeled Walking / Facing-front, Walking / Facing-front, Walking / Facing-back]
We also want to recognize the semantic activities of the people at different levels of granularity. Such activities include: [show1] single-person activities in isolation, which we call atomic activities, [show2] such as walking or standing, [show3] and the pose of individuals, [show4] such as facing front or facing back.
Our Goal
Multiple target tracking
Recognize activities at different levels of granularity: atomic activity & pose, pairwise interaction
Walking-Side-by-Side / Moving-to-Opposite
We also want to recognize the interplay between pairs of people, which we call interactions, for instance [show1] walking side by side or [show2] moving in opposite directions.
Our Goal
Multiple target tracking
Recognize activities at different levels of granularity: atomic activity & pose, pairwise interaction, collective activity
Crossing
And finally, we want to identify the overall behavior of all people in the scene, the collective activity, [show1] for example crossing.
Our Goal
Multiple target tracking
Recognize activities at different levels of granularity: atomic activity & pose, pairwise interaction, collective activity
Solve all problems jointly!!
Crossing
Most importantly, we address all of these problems in a unified framework.
Background
Target Tracking: Wu et al., 2007; Avidan, 2007; Zhang et al., 2008; Breitenstein et al., 2009; Ess et al., 2009; Wojek et al., 2009; Geiger et al., 2011; Brendel et al., 2011; Pirsiavash et al., 2011
Atomic Activity: Bobick & Davis, 2001; Efros et al., 2003; Schuldt et al., 2004; Dollar et al., 2005; Niebles et al., 2006; Laptev et al., 2008; Rodriguez et al., 2008; Wang & Mori, 2009; Gupta et al., 2009; Liu et al., 2009; Marszalek et al., 2009; Liu et al., 2011
Pairwise Interaction: Zhou et al., 2008; Ryoo & Aggarwal, 2009; Yao et al., 2010; Choi et al., 2010; Patron-Perez et al., 2010
Collective Activity: Choi et al., 2009; Li et al., 2009; Lan et al., 2010; Ryoo & Aggarwal, 2010; Choi et al., 2011; Khamis et al., 2011; Lan et al., 2012; Khamis et al., 2012; Amer et al., 2012
Investigated in isolation. Hierarchy of activities: Lan et al., 2010; Amer et al., 2012; Khamis et al., 2012
So far, a large literature has proposed methods for [show1] tracking multiple targets and recognizing [show2] atomic activities, [show3] pairwise interactions, [show4] and collective activities, [show5] but most of the time these problems are addressed in isolation.
[show6] Some exceptions are shown here, but they only focus on jointly solving atomic and collective activity recognition.
Background: Atomic Activity, Pairwise Interaction, Collective Activity, Target Tracking
As opposed to previous work, we propose to solve all four of these problems jointly. Let me explain this concept in a bit more detail.
Contributions
Bottom-up activity understanding.
[Figure: hierarchy Trajectories -> Atomic activity -> Interaction -> Collective activity (bottom-up), with example labels Walking, Approaching, Gathering]
Our model is able to transfer information in a bottom-up fashion. [show1] From the estimated trajectories of individuals and their atomic activities, we can obtain robust characterizations of interactions and collective activities.
Contributions
Bottom-up activity understanding.
Meeting and Leaving
Crossing
For instance, [show1] if we observe these trajectories, [show2] we can easily infer that the activity is meeting and leaving, [show3] but if we see these trajectories, [show4] we will say the activity is crossing.
Contributions
Bottom-up activity understanding.
Contextual information propagates top-down.
[Figure: hierarchy Collective activity -> Interaction -> Atomic activity -> Trajectories (top-down), with example labels Walking, Approaching, Gathering]
At the same time, information flows from the top down so as to provide critical contextual information to the lower levels of the hierarchy: collective activities help understand interactions; interactions help understand atomic activities and track associations.
Contributions
Bottom-up activity understanding.
Contextual information propagates top-down.
Meeting and Leaving
Let me give you an example of how activity understanding helps trajectory estimation. [show1] For example, if we are given this set of broken trajectories and [show2] we know that the underlying activity is meeting and leaving, we can interpret these trajectories as follows. [show3]
Contributions
Bottom-up activity understanding.
Contextual information propagates top-down.
Crossing
Simple Social Force Model: Pellegrini et al., 2009; Choi et al., 2010; Leal-Taixe et al., 2012; etc.
Repulsion & Attraction
Now, given the same broken trajectories, if we instead know that the activity is crossing, [show1] we can interpret the trajectories as follows.
[show2] A similar concept was also introduced as the social force model in previous works; however, these works only considered a few hand-designed types of interactions, such as [show3] repulsion and attraction. We generalize this concept and use high-level activity understanding to guide the process of associating tracks.
Outline
Joint Model
Inference/Training method
Experimental evaluation
Conclusion
This is an outline of today's talk.
Outline
Joint Model
Inference/Training method
Experimental evaluation
Conclusion
Let's begin with our joint model.
Hierarchical Activity Model
Input: video with tracklets
Given a video with a set of short trajectory fragments, [show1] called tracklets, the activities of all individuals are encoded in a hierarchical graphical model using a factor graph. The components of our factor graph model are shown on the left.
Hierarchical Activity Model
[Figure: factor graph with per-person atomic activity variables A1, A2, A3; pairwise interaction variables I12, I13, I23; a collective activity variable C; and observations O1, O2, O3 and OC; unrolled over frames t-1, t, t+1]
[show1] From each of the tracklets, we obtain an observation cue for each individual. [show2] Grounded on each observation, we model the atomic activity with a variable A. [show3] We encode the pairwise interaction between individuals with a variable I. [show4] Finally, we assign one collective activity variable, C, to characterize the overall behavior of the individuals. [show5] We also provide a top-down observation cue for the variable C.
[show6] By also considering the temporal relationships among the variables, our full model can be compactly represented as the graph shown here.
Hierarchical Activity Model
The model can be written as an energy function, as shown on the right. [show1] The energy factorizes into multiple local potentials, each of which encodes a relationship among variables. Let's see the details of what each factor represents.
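To make the factorization concrete, here is a toy sketch (not the authors' code) of an energy that is a sum of local potentials, one per factor in the graph. All function names, label strings, and numeric values below are invented for illustration:

```python
# Toy sketch of a factorized energy function: the total energy of a joint
# assignment is the sum of local potentials, each touching a few variables.

def psi_atomic_obs(a, o):               # atomic activity vs. its observation
    return 1.0 if a == o else -1.0

def psi_interaction_atomic(i, a1, a2):  # interaction vs. pair of atomic labels
    return 1.0 if (i == "side-by-side" and a1 == a2 == "walking") else -0.5

def psi_collective_interaction(c, interactions):  # collective vs. interactions
    if c != "crossing":
        return 0.0
    return sum(1.0 if i == "side-by-side" else -0.5 for i in interactions)

def energy(c, interactions, atomics, observations):
    e = sum(psi_atomic_obs(a, o) for a, o in zip(atomics, observations))
    e += sum(psi_interaction_atomic(i, atomics[p], atomics[q])
             for (p, q), i in interactions.items())
    e += psi_collective_interaction(c, list(interactions.values()))
    return e

atomics = ["walking", "walking"]
obs = ["walking", "walking"]
inter = {(0, 1): "side-by-side"}
print(energy("crossing", inter, atomics, obs))  # compatible labels -> high energy
```

Inference then amounts to searching for the assignment of C, I, and A that maximizes this sum.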
Atomic-Observation Potential
Atomic Activity Models
Action: BoW with STIP (Dollar et al., 06; Niebles et al., 07)
Pose: HoG (Dalal and Triggs, 05)
The highlighted potential encodes the compatibility between the atomic activity models and the corresponding observations. [show1] Atomic activities are modeled by a bag-of-words (BoW) representation built on spatio-temporal interest point (STIP) features, [show2] and pose is described using HOG.
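A bag-of-words representation of the kind mentioned above can be sketched as follows; the codebook and descriptors are toy 2-D stand-ins, not actual STIP features:

```python
import numpy as np

# Minimal bag-of-words sketch (illustrative, not the authors' pipeline):
# local descriptors (e.g. STIP descriptors) are quantized to their nearest
# codebook word, and the clip is represented by a normalized word histogram.

def bow_histogram(descriptors, codebook):
    # descriptors: (n, d) array; codebook: (k, d) array of visual words
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)                 # nearest word per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                     # L1-normalized histogram

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])    # toy 2-word codebook
desc = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 1.0], [0.0, 0.2]])
print(bow_histogram(desc, codebook))             # -> [0.5 0.5]
```

A classifier over such histograms then supplies the bottom-up observation score for each atomic activity label.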
Interaction-Atomic Potential
I: Standing-in-a-line
Model
A: Standing, Facing-left / A: Standing, Facing-left
The second potential, psi(I, A, f), captures the compatibility between the interaction models and the observations of atomic activities. For instance, [show1] here we illustrate a visualization of a possible learnt model for the standing-in-a-line interaction. This model captures the property that two people standing in a line tend to be located nearby and face the same direction. [show2] Thus, if we are given these observations of atomic activities, standing and facing left, which are compatible with the learnt standing-in-a-line interaction, [show3] the potential is high.
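One simple way to realize such a compatibility term, sketched below with invented weights, is a learnt score for each (interaction, atomic-label pair) combination, looked up symmetrically:

```python
# Illustrative sketch of a pairwise interaction potential: a learnt weight
# for each (interaction, atomic-label pair) combination scores how compatible
# a pair of atomic activity observations is with a hypothesized interaction.
# The weight values below are invented for illustration.

WEIGHTS = {
    ("standing-in-a-line", ("standing/left", "standing/left")):  2.0,
    ("standing-in-a-line", ("standing/left", "standing/right")): -1.5,
}

def psi_interaction(interaction, a1, a2, default=-0.5):
    # Symmetric lookup: the order of the pair should not matter.
    key1 = (interaction, (a1, a2))
    key2 = (interaction, (a2, a1))
    return WEIGHTS.get(key1, WEIGHTS.get(key2, default))

print(psi_interaction("standing-in-a-line", "standing/left", "standing/left"))   # 2.0
print(psi_interaction("standing-in-a-line", "standing/left", "standing/right"))  # -1.5
```

Compatible atomic labels (both standing and facing left) score high; mismatched facing directions score low, matching the two cases shown on this slide and the next.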
Interaction-Atomic Potential
I: Standing-in-a-line
Model
A: Standing, Facing-right / A: Standing, Facing-left
On the other hand, if we are given these observations of atomic activities, standing, facing left, and facing right, the compatibility with the model is weak, and thus [show1] the potential is low. [show2] This relationship can be compactly represented by the equation below.
Collective-Interaction Potential
C: Queuing
I: standing-side-by-side
I: one-after-the-other
I: one-after-the-other
I: one-after-the-other
Model: one-after-the-other, standing-side-by-side, facing-each-other
Similarly, the potential psi(C, I) encodes the compatibility between the collective activity models and a set of observations of pairwise interactions. [show1] For example, here we illustrate a visualization of a possible learnt model for the collective activity queuing. This model captures the probability of occurrence of interaction labels such as one-after-the-other and facing-each-other. For the collective activity queuing, the interaction one-after-the-other is highly probable; the interaction facing-each-other is much less so. [show2] Thus, if we observe this set of interactions, [show3] standing-side-by-side, [show4] one-after-the-other, [show5] and so on, together with queuing, [show3] the potential psi(C, I) is high.
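A sketch of this idea, with made-up occurrence probabilities standing in for a learnt model, scores a set of observed interactions by its total log-probability under the hypothesized collective activity:

```python
import math

# Illustrative sketch of the collective-interaction potential: each collective
# activity stores (invented) occurrence probabilities for interaction labels,
# and a set of observed interactions is scored by its total log-probability.

OCCURRENCE = {
    "queuing": {"one-after-the-other": 0.7,
                "standing-side-by-side": 0.2,
                "facing-each-other": 0.05},
}

def psi_collective(c, interactions, floor=0.01):
    probs = OCCURRENCE[c]
    return sum(math.log(probs.get(i, floor)) for i in interactions)

compatible   = ["one-after-the-other"] * 3 + ["standing-side-by-side"]
incompatible = ["facing-each-other"] * 2 + ["facing-opposite-side"] * 2

print(psi_collective("queuing", compatible) >
      psi_collective("queuing", incompatible))   # True
```

Interactions typical of queuing score far higher than interactions that rarely co-occur with it, which is exactly the contrast drawn on this slide and the next.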
Collective-Interaction Potential
C: Queuing
I: facing-each-other
I: facing-opposite-side
I: facing-each-other
Model: one-after-the-other, standing-side-by-side, facing-each-other
On the other hand, [show1] if we observe these interactions, facing-each-other and facing-opposite-direction, [show2] the potential is low, since these are not compatible with the learnt model for queuing. [show3] This relationship is encoded by this equation.
Collective-Observation Potential
Collective Activity: STL of all targets (Choi et al., 09)
Similarly to the bottom-up cues for atomic activities, the highlighted potential encodes the compatibility between the collective activity models and the corresponding observations. [show1] These are obtained using the crowd-context descriptor introduced by Choi et al. in 2009.
Activity Transition Potential
Smooth activity transition
We also encode temporal smoothness between [show1] collective activities, [show2] interactions, [show3] and atomic activities in adjacent time frames.
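The temporal smoothness terms above can be sketched as a Potts-style potential that rewards a label staying the same across adjacent frames and penalizes switches; the reward and penalty values are invented for illustration:

```python
# Illustrative Potts-style temporal smoothness sketch: adjacent frames are
# rewarded for keeping the same label and penalized for switching.

def smoothness(labels, same=1.0, switch=-1.0):
    return sum(same if a == b else switch
               for a, b in zip(labels, labels[1:]))

steady  = ["crossing", "crossing", "crossing", "crossing"]
flicker = ["crossing", "waiting", "crossing", "waiting"]
print(smoothness(steady))    # 3.0
print(smoothness(flicker))   # -3.0
```

The same form applies at every level of the hierarchy: collective activity, interaction, and atomic activity sequences.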
Trajectory Estimation
The last term captures the potential related to trajectory estimation. Before we talk about it, let me briefly discuss how we define the tracking problem.
Tracklet Association Problem
Suppose we have the activity people crossing; [show1] it is likely that our low-level observations won't be a clean pair of tracks such as the red and black ones.
Tracklet Association Problem
Input: fragmented trajectories (tracklets), caused by detector failures, occlusion between targets, scene clutter, etc.
Output: a set of trajectories with correct IDs (color)
Rather, we observe a fragmented set of trajectories, which we call tracklets. This is because of [show1] detection failures, occlusions, scene clutter, and so on. Given such a set of initial inputs, our goal is [show2] to obtain trajectories with consistent IDs by associating the tracklets.
Tracklet Association Model
Simple match costs, c: location affinity, appearance/color, ...
As in traditional track association problems, we introduce [show1] a variable f that captures association hypotheses; a simple solution is to have the association cost vector c encode properties such as [show2] location affinity, color similarity, and so on.
As this example shows, these properties don't always work: [show3] location affinity becomes ambiguous at the point of crossing, [show4] and color or appearance similarity is not always reliable when people wear similar clothing.
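A simple match cost of the kind described above can be sketched as a weighted sum of a spatial gap and a color-histogram gap between the end of one tracklet and the start of another; the weights and features are invented for illustration:

```python
import math

# Illustrative sketch of a simple tracklet match cost combining location
# affinity and appearance (color histogram) similarity; lower cost = better.

def match_cost(end_pos, start_pos, end_hist, start_hist,
               w_loc=1.0, w_app=1.0):
    loc = math.dist(end_pos, start_pos)                          # spatial gap
    app = sum(abs(a - b) for a, b in zip(end_hist, start_hist))  # L1 color gap
    return w_loc * loc + w_app * app

# Two candidate continuations of the same tracklet:
near_same = match_cost((10, 5), (11, 5), [0.5, 0.5], [0.5, 0.5])
far_other = match_cost((10, 5), (20, 9), [0.5, 0.5], [0.9, 0.1])
print(near_same < far_other)   # True
```

As the talk notes, such costs break down exactly when trajectories cross or people dress alike, which is what motivates adding the interaction potential as an extra term.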
Tracklet Association Model
Crossing
In addition to the traditional cues, we advocate that interaction labels provide critical contextual information to guide the process of associating tracklets. [show1] Such information is encoded by the interaction potential that we discussed earlier. [show2] For instance, if we know the interaction is crossing, this type of association [next]
Tracklet Association Model
Crossing
[cont.] will give rise to high energy, [show1] since this association is compatible with the learnt model for crossing.
Tracklet Association Model
Crossing
whereas this type of association [show1] will give rise to low energy. In the interest of time, we skip the mathematical details here.
Outline
Joint Model
Inference/Training method
Experimental evaluation
Conclusion
Let's see how we solve the inference problem and train the model.
Inference
Non-convex problem!!
The inference problem can be represented as finding the configuration of all variables that maximizes the joint energy function. [show1] The optimization, however, is computationally very demanding.
Inference
Activity recognition given f: iterative belief propagation
Tracklet association given C, I, A: novel branch and bound
Thus, we introduce a new iterative method to solve the problem. [show1] Given an initial tracklet association, we obtain activity labels using [show2] iterative belief propagation, [show3] and given activity labels, [show4] we obtain the tracklet association using our novel branch-and-bound method. In the interest of time, we skip the details of each inference step.
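The alternating scheme described above can be sketched as a block-coordinate loop: fix the association and infer activity labels, then fix the labels and re-solve the association, until a fixed point. The two inner solvers below are trivial stand-ins, not the paper's belief propagation or branch-and-bound:

```python
# Skeleton of an alternating (block-coordinate) inference loop: fix the
# association, infer activity labels; fix the labels, re-solve the
# association; repeat until the joint solution stops changing.

def alternate(init_assoc, infer_activities, solve_association, max_iters=10):
    assoc = init_assoc
    labels = None
    for _ in range(max_iters):
        new_labels = infer_activities(assoc)       # e.g. belief propagation
        new_assoc = solve_association(new_labels)  # e.g. branch and bound
        if new_labels == labels and new_assoc == assoc:
            break                                  # converged: fixed point
        labels, assoc = new_labels, new_assoc
    return labels, assoc

# Toy stand-in solvers that converge immediately:
labels, assoc = alternate(
    init_assoc="f0",
    infer_activities=lambda f: "crossing",
    solve_association=lambda c: "f1",
)
print(labels, assoc)   # crossing f1
```

Each half-step can only improve (or preserve) the joint energy, so the loop terminates at a local optimum of the non-convex objective.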
Training
Model weights are learned in a max-margin framework using a Structural SVM (Tsochantaridis et al., 2004).
Finally, we obtain the model parameters from a set of training data using a structural SVM framework.
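As a rough illustration of max-margin structured learning (the paper uses a structural SVM; the sketch below shows a single subgradient step on the structured hinge loss, with an invented feature map, label set, and learning rate):

```python
import numpy as np

# One subgradient step on the structured hinge loss: find the most violated
# labeling under the current weights (loss-augmented inference), then move
# the weights toward the true labeling's features and away from the violator.

def feat(x, y):                       # toy joint feature map phi(x, y)
    return x * (1.0 if y == "crossing" else -1.0)

def loss(y_true, y):                  # 0/1 task loss
    return 0.0 if y == y_true else 1.0

def step(w, x, y_true, labels, lr=0.1):
    # Loss-augmented inference: argmax_y  w . phi(x, y) + loss(y_true, y)
    y_hat = max(labels, key=lambda y: w @ feat(x, y) + loss(y_true, y))
    if y_hat != y_true:               # violated constraint -> update
        w = w + lr * (feat(x, y_true) - feat(x, y_hat))
    return w

w = np.zeros(2)
x = np.array([1.0, 2.0])
for _ in range(5):
    w = step(w, x, "crossing", ["crossing", "walking"])
print(w @ feat(x, "crossing") > w @ feat(x, "walking"))   # True
```

In the full framework, the loss-augmented inference step reuses the same joint inference machinery as test-time prediction.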
Outline
Joint Model
Inference/Training method
Experimental evaluation
Conclusion
Let's discuss our experimental evaluation.
Experiments
Collective Activity Dataset (Choi et al., 2009): 44 videos with multiple people; Crossing, Waiting, Queuing, Walking, Talking
Target identities
Interactions: approaching, leaving, passing-by, facing-each-other, etc.
Atomic activities: facing-right, facing-left, walking, standing
For the evaluation, we use the Collective Activity Dataset, proposed by us in 2009. In addition to the collective activity labels, we provide annotations for [show1] target identities, [show2] interactions between pairs of people, [show3] and atomic properties of individuals.
Experiments
New Dataset: 32 videos with multiple people; Gathering, Talking, Dismissal, Walking-together, Chasing, Queuing
Target identities
Interactions: approaching, walking-in-oppos.., facing-each-other, standing-in-a-row, etc.
Atomic activities: facing-right, facing-left, walking, standing, running
We also collected an additional dataset to test our framework, composed of 32 videos with 6 collective activities. We similarly provide labels for target identities, interactions, and atomic activities.
Classification Results
Collective Activity Dataset, 2009: +6.6% over Choi et al., 2009
New Dataset: +4.9% over Choi et al., 2009
First, we compare the collective activity classification accuracy of a baseline using the crowd-context descriptor we introduced in 2009 and 2011 against our full hierarchical representation.
We obtain a good improvement in overall classification, about 6.6%, on the Collective Activity Dataset by utilizing the hierarchical structure in our model.
We observe a similar improvement, about 4.9%, on the new dataset.
Target Association - results on dataset VSWS09
                            Tracklet
# of errors                 1556
Improvement over tracklet   0%
Now, we analyze the tracklet association results. The first row shows the number of errors in tracklet matching, and the second row shows the % improvement over the input tracklets.
Target Association - results on dataset VSWS09
                            Tracklet   No Interaction
# of errors                 1556       1109
Improvement over tracklet   0%         28.73%
By solving the tracklet association without interaction cues, we obtain about 29% improvement over the tracklets, i.e. about 450 fewer matching errors.
Target Association - results on dataset VSWS09
                            Tracklet   No Interaction   With Interaction
# of errors                 1556       1109             894
Improvement over tracklet   0%         28.73%           42.54%
By incorporating our interaction model with the estimated activity labels, we obtain about 14% further improvement, i.e. about 200 fewer matching errors than the baseline without interactions.
Target Association - results on dataset VSWS09
                            Tracklet   No Interaction   With Interaction   With GT Activities
# of errors                 1556       1109             894                736
Improvement over tracklet   0%         28.73%           42.54%             52.76%
Notice that if ground-truth activity labels are given, we obtain an upper-bound target association error of 736, corresponding to about 53% improvement over the input. More quantitative evaluation can be found in the paper.
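The "improvement over tracklet" row is just the reduction in matching errors relative to the 1556 raw tracklet errors, which can be checked directly (the last entry comes out as about 52.70% with these counts; the slide reports 52.76%, presumably from slightly different rounding or error counts):

```python
# Quick check of the "improvement over tracklet" row in the table above:
# percentage reduction in matching errors relative to the raw tracklets.
errors = {"tracklet": 1556, "no_interaction": 1109,
          "with_interaction": 894, "gt_activities": 736}

def improvement(n, base=errors["tracklet"]):
    return 100.0 * (base - n) / base

for name, n in errors.items():
    print(f"{name}: {improvement(n):.2f}%")
# tracklet: 0.00%, no_interaction: 28.73%, with_interaction: 42.54%,
# gt_activities: 52.70% (vs. 52.76% reported on the slide)
```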
Example Classification Result
Interaction labels: AP: approaching; FE: facing-each-other; SR: standing-in-a-row; ...
Here we show exemplar results obtained on the newly proposed dataset. The estimated collective activity label is overlaid on top, and interaction labels are displayed between pairs of people.
Example Classification Result
Atomic activities - Action: W: walking; S: standing
Pose (8 directions): L: left; LF: left/front; F: front; RF: right/front; etc.
Now let me show results from a more complex sequence. Here, we show the estimated atomic activity label for each individual, overlaid below each bounding box.
Example Classification Result
Pair interactions: AP: approaching; ...; FE: facing-each-other; SS: standing-side-by-side; SQ: standing-in-a-queue
...and the interactions between pairs of people. SQ represents the interaction standing-in-a-queue.
Example Classification Result
Finally, we show the estimated collective activity variable on top.
Example Classification Result
Tracklet association - Color/number: ID; solid boxes: tracklets; dashed boxes: match hypotheses
Here, the video shows the tracklet association result obtained with our unified framework. The color and number on top of each bounding box show the identity of the target, solid boxes represent tracklets, and dashed boxes show the smoothed paths that associate two tracklets.
As you can see, our model keeps the targets' identities consistent even after the occlusion, by associating the tracklets correctly.
Association Example
With Interaction / No Interaction - Wrong IDs! vs. Correct IDs! (over time)
Finally, we show an exemplar comparison between the tracklet association results obtained without and with the interaction model.
[show1] Due to the severe occlusion caused by the car's motion, [show2] the traditional method without interaction context loses the targets' identities and assigns new IDs to everyone after the occlusion; [show3] on the other hand, when we include the interaction in the association model, [show4] we keep the identities of the targets even in such a challenging scenario.
Conclusion
Proposed a novel model for joint activity recognition and target association.
[Figure: factor graph with atomic activities A1, A2, A3; interactions I12, I13, I23; collective activity C; and observations O1, O2, O3, OC]
In this paper, we proposed a novel model that seamlessly relates target tracking with atomic activity, interaction, and collective activity understanding. We also show that, by solving everything together, we can achieve a better understanding of high-level activities. Most interestingly, we show that high-level activities can help improve trajectory estimation.
Conclusion
Proposed a novel model for joint activity recognition and target association.
High-level contextual information helps improve target association accuracy significantly.
Crossing
Conclusion
Proposed a novel model for joint activity recognition and target association.
High-level contextual information helps improve target association accuracy significantly.
Best classification results on collective activity to date.
Thanks to
Yingze Bao, Byungsoo Kim, Min Sun, Yu Xiang, ONR, and the anonymous reviewers.
Branch-and-Bound Association
Not PSD => non-convex
General search algorithm; guarantees an exact solution. Requires: a branch operation and a bound operation.
Branch-and-Bound Illustration
[Figure: a node Q is branched into subproblems Q0 and Q1; lower bounds L(Q) and upper bounds U(Q) are maintained, and Q1 is pruned when U(Q1) < L(Q0)]
Branch / Bound
Branch-and-Bound Association
Bound: a lower bound is computed per interaction variable (e.g. I12, I34), with only one non-zero activation per line in Hi; the lower bound is solved as a binary integer program, and an upper bound is derived as well.
Branch: divide the problem into two disjoint sub-problems by fixing the most ambiguous variable, e.g. Q0 => f = [1, x, x, x, x, x, ...], Q1 => f = [0, x, x, x, x, x, ...].
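The generic branch-and-bound recipe on these backup slides (branch on one undecided binary variable, bound optimistically, prune against the incumbent) can be sketched as follows; the toy linear objective is invented for illustration and is not the paper's association energy:

```python
import heapq

# Generic branch-and-bound sketch for maximizing a score over binary
# assignment vectors: branch by fixing the next undecided variable to 0 or 1,
# bound with an optimistic upper bound, and prune any subproblem whose upper
# bound cannot beat the best complete solution found so far.

def solve(scores):                        # maximize sum of chosen scores
    n = len(scores)
    best_val, best_f = float("-inf"), None

    def upper(prefix):
        # Fixed part exactly, plus every remaining positive score (optimistic).
        return sum(prefix[i] * scores[i] for i in range(len(prefix))) + \
               sum(max(s, 0.0) for s in scores[len(prefix):])

    heap = [(-upper(()), ())]             # max-heap via negated bounds
    while heap:
        neg_ub, prefix = heapq.heappop(heap)
        if -neg_ub <= best_val:
            continue                      # pruned: cannot beat the incumbent
        if len(prefix) == n:              # complete assignment: new incumbent
            best_val, best_f = -neg_ub, prefix
            continue
        for bit in (0, 1):                # branch on the next variable
            child = prefix + (bit,)
            ub = upper(child)
            if ub > best_val:
                heapq.heappush(heap, (-ub, child))
    return best_val, best_f

print(solve([2.0, -1.0, 3.0]))            # (5.0, (1, 0, 1))
```

Because the bound is a true upper bound, pruning never discards the optimum, so the search returns the exact solution while typically exploring far fewer nodes than exhaustive enumeration.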