Human Action Recognition Solution


This assignment uses the UCF-101 human action recognition dataset. The dataset consists of 13,320 videos, each roughly 2 to 10 seconds long, of humans performing one of 101 possible actions. Each frame is 320 by 240 pixels.

This homework compares a single-frame model (spatial information only) with a 3D convolution-based model (2 spatial dimensions + 1 temporal dimension).

  • Part one: Fine-tune a 50-layer ResNet model (pretrained on ImageNet) on single UCF-101 video frames

  • Part two: Fine-tune a 50-layer 3D ResNet model (pretrained on Kinetics) on UCF-101 video sequences
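
The two parts consume different inputs: one frame per video for the single-frame model, and a run of consecutive frames for the 3D model. The sketch below shows one possible way to pick frame indices for each. The function names, the clip length of 16, and the repeat-last-frame padding for short videos are all assumptions made for illustration; the assignment does not specify them.

```python
import random


def sample_frame_index(num_frames, rng=random):
    """Pick one frame index uniformly at random (single-frame model input)."""
    return rng.randrange(num_frames)


def sample_clip_indices(num_frames, clip_len=16, rng=random):
    """Pick clip_len consecutive frame indices (3D model input).

    clip_len=16 is an assumed value. Videos shorter than clip_len
    repeat their last frame so every clip has the same length.
    """
    start = rng.randrange(max(1, num_frames - clip_len + 1))
    return [min(start + i, num_frames - 1) for i in range(clip_len)]


rng = random.Random(0)  # fixed seed so the example is repeatable
frame = sample_frame_index(120, rng)
clip = sample_clip_indices(120, clip_len=16, rng=rng)
short_clip = sample_clip_indices(10, clip_len=16, rng=rng)
```

The returned indices would then be used to load and preprocess the actual frames, which is left out here.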

There are three sets of results for comparison: (1) the single-frame model, (2) the 3D model, and (3) the combined output of the two models. For each of these three, report the following:

  • (top1_accuracy, top5_accuracy, top10_accuracy): Did the results improve after combining the outputs?

  • Use the confusion matrices to get the 10 classes with the highest performance and the 10 classes with the lowest performance: Are there differences/similarities? Can anything be said about whether particular action classes are discriminated more by spatial information versus temporal information?

  • Use the confusion matrices to get the 10 most confused class pairs, that is, the largest off-diagonal elements of the confusion matrix: Are there any notable examples?
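
The three reporting items above can be sketched with a few small helpers. This is a minimal illustration, not a prescribed implementation: it uses plain Python lists so it runs without dependencies, all function names are hypothetical, and averaging the softmax probabilities is only one assumed way to combine the two models' outputs (the assignment does not say which fusion rule to use). The toy scores are made up and are not UCF-101 results.

```python
import math


def softmax(scores):
    """Convert one row of raw scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(scores_a, scores_b):
    """Average the per-class probabilities of two models (assumed fusion rule)."""
    return [
        [(pa + pb) / 2 for pa, pb in zip(softmax(a), softmax(b))]
        for a, b in zip(scores_a, scores_b)
    ]


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def confusion_matrix(scores, labels, num_classes):
    """conf[t][p] counts samples of true class t predicted as class p."""
    conf = [[0] * num_classes for _ in range(num_classes)]
    for row, label in zip(scores, labels):
        pred = max(range(len(row)), key=lambda c: row[c])
        conf[label][pred] += 1
    return conf


def per_class_accuracy(conf):
    """Diagonal count over row sum, as (class, accuracy) pairs, best first."""
    acc = [(c, row[c] / sum(row)) for c, row in enumerate(conf) if sum(row) > 0]
    return sorted(acc, key=lambda item: item[1], reverse=True)


def most_confused(conf, n):
    """The n largest off-diagonal entries as (count, true_class, predicted_class)."""
    pairs = [
        (conf[t][p], t, p)
        for t in range(len(conf))
        for p in range(len(conf))
        if t != p
    ]
    return sorted(pairs, reverse=True)[:n]


# Toy example: 3 classes, 4 samples (made-up scores).
frame_scores = [[2.0, 1.0, 0.0], [0.0, 2.0, 1.0], [1.0, 2.0, 0.0], [0.0, 1.0, 2.0]]
clip_scores = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
labels = [0, 1, 0, 2]

fused = combine(frame_scores, clip_scores)
conf = confusion_matrix(fused, labels, num_classes=3)
```

For the report, `top_k_accuracy` would be called with k = 1, 5, and 10 on each of the three score sets. The first and last 10 entries of `per_class_accuracy(conf)` give the best and worst classes, and `most_confused(conf, 10)` gives the largest off-diagonal entries. In this toy data each model alone gets 3 of 4 samples right at top-1, while the averaged output gets all 4.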

Put all of the above into a report and submit it as a PDF. Also zip all of the code (not the models, predictions, or dataset) and submit it.