A Unified Framework for Interaction Recognition

Poster for the REU team for Human Object Interaction Recognition Summer 2013

REU Poster PDF File

Tech Report

This research project was funded by the NSF and took place in the summer of 2013 at the Florida Institute of Technology. The goal was to devise a technique capable of classifying the interaction an individual performs in a video. The team consisted of a faculty advisor, a graduate mentor, and two undergraduate students: Dr. Eraldo Ribeiro (faculty advisor), Ivan Bogun (graduate mentor), Haidar Khan (undergrad), and Jacob Chen (undergrad).

The project's goal was to accurately classify an interaction taking place in a video. To do this, we gathered and combined two pieces of information to make a probabilistic decision: 1) trajectories of the hand and body, and 2) appearance-based information about the object. Haidar Khan was responsible for gathering and processing the trajectory-based information. I was responsible for gathering and processing the appearance-based information. Ivan was responsible for combining the two pieces of information and making a decision from them.

To extract appearance information, I used two image features: 1) edges and 2) doublets. To classify an object, I used a bag-of-words approach[1] combined with pLSA, which is commonly used for text analysis.[2] First, I assumed that the object in question could be extracted from each individual frame of the video. After extracting the object image from each frame, I converted each object image into a Canny edge map and into doublets. A Canny edge transformation highlights the edges inside an image, as can be seen in the above poster.[3] I chose edges because each image’s background was consistent throughout the video and similar objects would be described by similar edges. Doublets are pairs of consecutive patches of n by n pixels and can be used to provide spatial information.[4] The following steps are the same for both types of features.

  1. Break the transformed images into patches of pixels
  2. Combine all the patches into a large matrix
  3. Run k-means clustering algorithm on the data matrix to obtain clusters
  4. Construct the video using the clusters obtained
  5. Build a probabilistic model for each class of object, using the Expectation-Maximization algorithm to find the parameters of the distribution (we assumed a Gaussian distribution because a video contains a large number of frames, and for practicality's sake)
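Steps 1 through 4 can be sketched roughly as follows. This is a minimal illustration in plain Python and not the code used in the project: the function names (`extract_patches`, `kmeans`, `word_histogram`), the non-overlapping patch layout, and the tiny binary "edge map" are all assumptions made for the example.

```python
import random

def extract_patches(image, n):
    """Step 1: split a 2-D grid (list of rows) into non-overlapping n-by-n
    patches, each flattened into a tuple."""
    rows, cols = len(image), len(image[0])
    return [
        tuple(image[r + i][c + j] for i in range(n) for j in range(n))
        for r in range(0, rows - n + 1, n)
        for c in range(0, cols - n + 1, n)
    ]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centers, point):
    """Index of the cluster center closest to the given patch."""
    return min(range(len(centers)), key=lambda k: sq_dist(centers[k], point))

def kmeans(points, k, iters=20, seed=0):
    """Step 3: plain k-means (Lloyd's algorithm) over the stacked patches."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def word_histogram(patches, centers):
    """Step 4: describe a frame (or video) by how often each cluster occurs."""
    counts = [0] * len(centers)
    for p in patches:
        counts[nearest(centers, p)] += 1
    return [c / len(patches) for c in counts]

# Hypothetical 4x4 binary edge map standing in for one transformed frame.
edge_map = [[0, 1, 0, 1],
            [1, 0, 1, 0],
            [0, 0, 1, 1],
            [0, 0, 1, 1]]
patches = extract_patches(edge_map, 2)   # step 2 would stack patches from all frames
centers = kmeans(patches, k=2)
histogram = word_histogram(patches, centers)
```

In a real setting the patches from every frame of every training video would be pooled before clustering, and each video would then be summarized by its histogram over the resulting clusters.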

Next, I tested the approach using a probability matrix obtained from the model. Given a new video, I was able to “guess” which class it belonged to based on its probability of belonging to each of the classes. By performing this guess on both the edge and the doublet information, I was able to improve my classification. The results can be seen in the confusion matrix. Lastly, I constructed a kernel representing the appearance-based information for later use in Multiple Kernel Learning.[5]
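The "guess" described above can be illustrated with a small sketch. It is a simplification under stated assumptions, not the project's model: each class gets a diagonal Gaussian whose mean and variance are computed in closed form (standing in for the EM-fitted model), and the edge and doublet scores are combined by adding log-likelihoods, which assumes the two features are independent. The class names and numbers are made up.

```python
import math

def fit_gaussian(samples, eps=1e-6):
    """Per-dimension mean and variance (a diagonal Gaussian) of feature vectors."""
    n = len(samples)
    columns = list(zip(*samples))
    mean = [sum(col) / n for col in columns]
    # eps keeps the variance strictly positive for constant dimensions.
    var = [sum((v - m) ** 2 for v in col) / n + eps for col, m in zip(columns, mean)]
    return mean, var

def log_likelihood(x, model):
    """Log-density of vector x under a diagonal Gaussian (mean, var)."""
    mean, var = model
    return sum(
        -0.5 * (math.log(2 * math.pi * s) + (v - m) ** 2 / s)
        for v, m, s in zip(x, mean, var)
    )

def classify(features, models):
    """features: {feature_name: vector}
    models:   {class_name: {feature_name: (mean, var)}}
    Returns the most likely class and the per-class scores."""
    scores = {
        cls: sum(log_likelihood(features[name], per_feature[name]) for name in features)
        for cls, per_feature in models.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical training histograms for two made-up object classes.
models = {
    "cup": {
        "edges": fit_gaussian([[0.8, 0.2], [0.7, 0.3]]),
        "doublets": fit_gaussian([[0.6, 0.4], [0.5, 0.5]]),
    },
    "phone": {
        "edges": fit_gaussian([[0.2, 0.8], [0.3, 0.7]]),
        "doublets": fit_gaussian([[0.1, 0.9], [0.2, 0.8]]),
    },
}
label, scores = classify({"edges": [0.75, 0.25], "doublets": [0.55, 0.45]}, models)
```

Summing the two log-likelihoods is only one way to let both features contribute to the decision; any other fusion rule could be substituted in `classify`.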

The paper was peer reviewed, accepted for publication, and presented at the ICPR 2014 conference. The schedule is found here and the topic is Interaction Recognition using Sparse portraits.


  1. Teng Li; Tao Mei; In-So Kweon; Xian-Sheng Hua, “Contextual Bag-of-Words for Visual Categorization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 4, pp. 381–392, April 2011
    doi: 10.1109/TCSVT.2010.2041828
  2. Sheng-Yi Kong; Lin-shan Lee, “Improved Spoken Document Summarization Using Probabilistic Latent Semantic Analysis (PLSA),” 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006) Proceedings, vol. 1, pp. I–I, 14–19 May 2006
    doi: 10.1109/ICASSP.2006.1660177
  3. Canny, John, “A Computational Approach to Edge Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986
    doi: 10.1109/TPAMI.1986.4767851
  4. Zhang, E.; Mayo, M., “Improving Bag-of-Words model with spatial information,” 2010 25th International Conference of Image and Vision Computing New Zealand (IVCNZ), pp. 1–8, 8–9 Nov. 2010
    doi: 10.1109/IVCNZ.2010.6148795
  5. Mehmet Gönen and Ethem Alpaydın, “Multiple Kernel Learning Algorithms,” Journal of Machine Learning Research, 12:2211–2268, 2011

Touch-based User Authentication

Poster for the UMD URF Symposium 2nd Place.

Touch-Based Poster in PDF format

iOS 7 Data Collection Application

This funded research project took place during the 2013-2014 school year at the University of Maryland, College Park. The goal was to determine whether touch-based information on a mobile device could successfully identify the user of the device. The team consisted of two faculty advisors, a graduate mentor, and one undergraduate student: Dr. Vishal Patel (faculty advisor), Dr. Rama Chellappa (faculty advisor), Heng Zhao (graduate mentor), and Jacob Chen (undergrad).

The project goal was to correctly identify the user of a mobile device in order to authenticate them with a dynamic security algorithm that would not intrude upon the user experience of the device. To do this, we set up a fixed environment in which different users could work through different scenarios. Each scenario was performed in a brightly lit room, a dimly lit room, and a naturally lit setting. The scenarios are:

  1. Swipe to browse through images
  2. Drag images to center
  3. Find an object inside of a large image
  4. Skim through a PDF for requested information

Once the data had been gathered, it was transformed into the features outlined in this paper.[1] After performing PCA (Principal Component Analysis), we saw that the data was hard to separate, as is evident in the poster above. We found that a classification technique called KSRC (Kernel Sparse Representation-based Classification) provided the best results.[2]
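For readers unfamiliar with PCA, the projection used for this kind of separability check can be sketched as below. This is an illustrative stand-in only: it finds the leading components by power iteration with deflation in plain Python, the function name `pca_project` is hypothetical, and the sample points are invented rather than taken from the touch data.

```python
def pca_project(rows, n_components=2, iters=200):
    """Project feature vectors onto their leading principal components.

    Uses power iteration on the covariance matrix, deflating after each
    component. Assumes the start vector is not orthogonal to the component
    being sought, which holds for generic data."""
    n, d = len(rows), len(rows[0])
    mean = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(r, mean)] for r in rows]
    cov = [[sum(x[i] * x[j] for x in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    def times(matrix, vec):
        return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

    components = []
    for _ in range(n_components):
        v = [1.0 / (k + 1) for k in range(d)]      # deterministic, asymmetric start
        scale = sum(a * a for a in v) ** 0.5
        v = [a / scale for a in v]
        for _ in range(iters):
            w = times(cov, v)
            norm = sum(a * a for a in w) ** 0.5
            if norm < 1e-12:                       # no variance left to explain
                break
            v = [a / norm for a in w]
        eigval = sum(a * b for a, b in zip(v, times(cov, v)))
        components.append(v)
        # Deflate: remove the variance explained by this component.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(a * b for a, b in zip(x, c)) for c in components] for x in centered]

# Invented 2-D points; real touch features would have many more dimensions.
points = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
projected = pca_project(points, n_components=2)
```

Plotting the first two projected coordinates per user is one common way to see whether the classes overlap, which is the kind of picture the poster shows.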

The idea behind SRC (KSRC is SRC applied in a higher-dimensional kernel feature space) is to collect the same number of training samples for each person, each in the form of a feature vector, and stack them as the columns of a large matrix, call it Y. Next, we would like to determine which person a new feature vector, call it Yt, belongs to. We conjecture that this new feature vector is a linear combination of some of the training vectors, and we collect the coefficients of this linear combination in a vector X. Thus we have the linear equation Yt = YX. We minimize the norm of the coefficients (in SRC, typically the ℓ1 norm) so that only the important coefficients remain nonzero. We then reconstruct Yt from each class in turn, using only that class's coefficients and training vectors; the class that gives the smallest reconstruction error is the class the new feature vector is assigned to. If we apply this technique to many feature vectors from the same person, we can have more confidence when we finally identify the person. This technique showed promising results under an evaluation metric called the F1 score, the harmonic mean of precision and recall. A higher F1 score indicates better classification performance; the scores can be seen in the above poster.
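The SRC decision rule can be sketched as follows. This is a hedged illustration rather than the project's implementation: it covers only linear SRC (no kernel), it replaces the constrained problem Yt = YX with the common relaxation of minimizing 0.5·||Yt − YX||² + λ·||X||₁, and it solves that with a basic ISTA loop. The function names, `lam`, and the toy data are assumptions.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def soft_threshold(v, t):
    """Shrink v toward zero by t; this is what produces sparse coefficients."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def sparse_code(Y, y, lam=0.01, iters=500):
    """ISTA for min_x 0.5*||y - Y x||^2 + lam*||x||_1.
    Y is a list of rows; its columns are the training feature vectors."""
    Yt = [list(col) for col in zip(*Y)]
    # 1/||Y||_F^2 never exceeds 1/lambda_max(Y^T Y), so this step size is safe.
    step = 1.0 / sum(v * v for row in Y for v in row)
    x = [0.0] * len(Yt)
    for _ in range(iters):
        residual = [a - b for a, b in zip(y, matvec(Y, x))]
        grad_step = matvec(Yt, residual)
        x = [soft_threshold(xi + step * gi, step * lam) for xi, gi in zip(x, grad_step)]
    return x

def src_classify(Y, labels, y, lam=0.01):
    """labels[j] is the class of column j of Y. Returns the class whose
    coefficients reconstruct y with the smallest error, plus all errors."""
    x = sparse_code(Y, y, lam)
    errors = {}
    for cls in sorted(set(labels)):
        x_cls = [xi if lab == cls else 0.0 for xi, lab in zip(x, labels)]
        recon = matvec(Y, x_cls)
        errors[cls] = sum((a - b) ** 2 for a, b in zip(y, recon)) ** 0.5
    return min(errors, key=errors.get), errors

# Toy data: two training vectors (columns) for each of two hypothetical users.
Y = [[1.0, 0.9, 0.0, 0.0],
     [0.0, 0.1, 1.0, 0.9],
     [0.0, 0.0, 0.0, 0.1]]
labels = ["A", "A", "B", "B"]
predicted, errors = src_classify(Y, labels, [0.95, 0.05, 0.0])
```

To pool several feature vectors from the same session, one simple option (again an assumption, not necessarily what was done here) is to classify each vector separately and take a majority vote over the predicted labels.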

The paper is currently under review.


  1. M. Frank, R. Biedert, E. Ma, I. Martinovic, and D. Song. Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication. IEEE Transactions on Information Forensics and Security, 8(1):136–148, Jan. 2013
  2. S. Gao, I. W.-H. Tsang, and L.-T. Chia. Sparse representation with kernels. IEEE Transactions on Image Processing, 22(2):423–434, Feb. 2013

Mobile Mesh Networks

Currently, this project team consists of a faculty advisor, Dr. Ryan Integlia, and two graduate students, Jacob Chen and John McCormack. We are using a routing protocol called B.A.T.M.A.N. Advanced that runs on OpenWrt. We hope to develop a mesh platform on which future applications can be deployed. We would like to be able to pass virtual machines between nodes to enable distributable applications. Current goals include geo-spatial/environmental and contextual analysis, sentiment, security, data visualization/interaction, swarm interaction, and swarm intelligence. Check out the Research Progress page for updates. We believe that this mesh platform will be able to support numerous novel applications that can be rapidly deployed, while offering robustness and flexibility that cannot be found in a standard network infrastructure.