Online face and facial landmark detection, pose normalization and face clustering without tracking

This project presents a complete machine learning and computer vision system that performs the following:

  1. Detect faces in a given video or image using Histograms of Oriented Gradients (HOG) [1] and a combination of linear classifiers as trained in [3].
  2. Detect 68 facial landmark points on the detected faces using an ensemble of regression trees algorithm described in [2].
  3. Normalize the face poses by finding a transformation from the detected face pose to the canonical pose via scaling, rotation and translation.
  4. Normalize the face appearance (in the image pixel space) using the obtained transformation to get the normalized/frontalized face images.
  5. Extract high-dimensional overlapping LBP features around the detected landmark points, concatenate them into a single high-dimensional feature vector for each face, and perform online clustering (with cluster updates) of the faces. A sketch of this pipeline is given after this list.
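
The sketch below shows how steps 1–4 and the feature extraction of step 5 might look when implemented with the dlib C++ library [7], which provides the HOG face detector with max-margin-trained linear classifiers [1][3], the 68-point regression-tree landmark model [2] and high-dimensional LBP descriptor extraction. The model file name, chip size and padding shown here are illustrative assumptions, not necessarily the values used in the deployed system.

    // Minimal sketch; assumes the dlib library and a 68-point shape
    // predictor model file on disk. Values such as the 150x150 chip size
    // and 0.25 padding are examples only.
    #include <dlib/image_processing/frontal_face_detector.h>
    #include <dlib/image_processing.h>
    #include <dlib/image_transforms.h>
    #include <dlib/image_io.h>
    #include <iostream>
    #include <vector>

    int main()
    {
        using namespace dlib;

        // Step 1: HOG + linear-classifier face detector [1][3].
        frontal_face_detector detector = get_frontal_face_detector();

        // Step 2: 68-point landmark model (ensemble of regression trees [2]).
        shape_predictor sp;
        deserialize("shape_predictor_68_face_landmarks.dat") >> sp;

        array2d<rgb_pixel> img;
        load_image(img, "input.jpg");

        // Grayscale copy used for LBP feature extraction below.
        array2d<unsigned char> gray;
        assign_image(gray, img);

        for (const rectangle& box : detector(img))
        {
            // 68 landmark points inside the detected face box.
            full_object_detection shape = sp(img, box);

            // Steps 3-4: similarity transform (scale, rotation, translation)
            // to a canonical pose, then warp the pixels into a normalized chip.
            chip_details chip = get_face_chip_details(shape, 150, 0.25);
            array2d<rgb_pixel> normalized_face;
            extract_image_chip(img, chip, normalized_face);

            // Step 5 (features): high-dimensional LBP descriptors sampled at
            // overlapping locations around the landmarks, concatenated into
            // one long feature vector per face.
            std::vector<float> feats;
            extract_highdim_face_lbp_descriptors(gray, shape, feats);
            std::cout << "feature dimension: " << feats.size() << "\n";
        }
        return 0;
    }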

The system does not require or perform any tracking. Yet, it is able to tag the people/faces appearing in the video with cluster IDs that roughly correspond to the identities of the people.

It uses the C/C++ programming language for the main processing and logic, and the Qt C++ library for the GUI and user interaction.

The system can take input from 4 main data sources:

  1. Webcam
  2. Video file on disk
  3. Single image file on disk
  4. A directory of images

The following are the outputs for each of the inputs mentioned above:

Webcam:
  • face detection visualization
  • facial landmark detection visualization
  • normalized faces visualization
  • visualization of landmarks on normalized faces
  • visualization of means of face clusters in pixel space
Video file:
  • face detection visualization
  • facial landmark detection visualization
  • normalized faces visualization
  • visualization of landmarks on normalized faces
  • visualization of means of face clusters in pixel space
Image file:
  • face detection visualization
  • facial landmark detection visualization
  • normalized faces visualization
  • visualization of landmarks on normalized faces
Directory of images:
  • An XML file that records the following for each image in the specified directory (a sketch of writing such a record is given after this list):
    • the relative path of the image file;
    • the face detection bounding box rectangle in the form [x, y, w, h], where x and y are the coordinates of the rectangle's top-left corner, and w and h are its width and height respectively;
    • the confidence (score) of the detection;
    • the 68 landmark point positions in the form {[x1,y1], [x2,y2], …, [x68,y68]}.
  • For each image:
    • face detection visualization
    • facial landmark detection visualization
    • normalized faces visualization
    • visualization of landmarks on normalized faces
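
The sketch below shows how such a per-image XML record might be written using Qt's QXmlStreamWriter. All element and attribute names here are hypothetical illustrations of the record layout described above; the exact XML schema of the deployed system is not reproduced here.

    // Hypothetical record-writer sketch (Qt). Element/attribute names are
    // illustrative only and are not taken from the deployed system.
    #include <QFile>
    #include <QPointF>
    #include <QString>
    #include <QXmlStreamWriter>
    #include <vector>

    struct FaceRecord
    {
        QString relativePath;            // relative path of the image file
        int x, y, w, h;                  // detection box [x, y, w, h]
        double score;                    // detection confidence/score
        std::vector<QPointF> landmarks;  // the 68 landmark points
    };

    void writeRecords(const QString& outPath, const std::vector<FaceRecord>& records)
    {
        QFile file(outPath);
        if (!file.open(QIODevice::WriteOnly | QIODevice::Text))
            return;

        QXmlStreamWriter xml(&file);
        xml.setAutoFormatting(true);
        xml.writeStartDocument();
        xml.writeStartElement("images");                 // hypothetical root element

        for (const FaceRecord& r : records)
        {
            xml.writeStartElement("image");
            xml.writeAttribute("file", r.relativePath);  // relative image path

            xml.writeStartElement("box");                // [x, y, w, h] + score
            xml.writeAttribute("x", QString::number(r.x));
            xml.writeAttribute("y", QString::number(r.y));
            xml.writeAttribute("w", QString::number(r.w));
            xml.writeAttribute("h", QString::number(r.h));
            xml.writeAttribute("score", QString::number(r.score));

            for (const QPointF& p : r.landmarks)
            {
                xml.writeStartElement("part");           // one landmark point
                xml.writeAttribute("x", QString::number(p.x()));
                xml.writeAttribute("y", QString::number(p.y()));
                xml.writeEndElement();                   // </part>
            }
            xml.writeEndElement();                       // </box>
            xml.writeEndElement();                       // </image>
        }

        xml.writeEndElement();                           // </images>
        xml.writeEndDocument();
    }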

The facial landmarks used have been annotated as follows:

Facial landmark annotation map from [10].
Some screenshots from the deployed system are shown below with relevant explanations. Note: All photos or videos that serve as the input data to my deployed system have been gathered from public domains and datasets, and as such, the copyrights of those photos or videos belong to the original authors who own them.

The main page of the deployed face analytics system. It has 4 different modes of data input: (1) webcam, (2) video file on disk, (3) a single image file on disk, (4) a directory of images (which can have any arbitrary level of sub-directories, etc.). It also has a number of options for turning the visualizations of the different outputs of the system on or off.
The “About” menu is for documentation, which includes material such as references and an explanation of the system.
This is the reference section of the documentation menu.
Zooming in on the top part of the GUI. Before beginning any analytics or processing, the trained machine learning models need to be loaded by clicking on the appropriate button. Until the models have been loaded, no analytics can be performed, so all the other buttons are disabled.
After loading the models, the system is ready to perform analytics and the appropriate buttons (for the 4 modes of data input) are now enabled.
After clicking on the “Load single image” button, the system will give the user a chance to browse and locate any image file on the computer.
After the user has selected an input image, the system obtains its file path and enables the “Start single image” button so the user can start the analytics on the chosen image. The “Load single image” and all other buttons are disabled to reflect this.
After the user clicks “Start single image”, the system shows the outputs according to the options selected by the user in the checkboxes under “Show visualizations for”. In this case, the user has selected all the visualizations except the last one (“face cluster means”). Even if the user tried to select this checkbox, it would not be allowed, since the “face cluster means” visualization is only valid for the webcam and video file modes of operation. A sketch of this kind of GUI wiring is shown below.
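
The sketch below illustrates the kind of Qt wiring implied by these screenshots: checkboxes driving visualization flags, the mode-dependent “face cluster means” option, and buttons staying disabled until the models are loaded. Every widget, flag and function name here is a hypothetical example, not the actual GUI code of the system.

    // Hypothetical Qt wiring sketch; all names are illustrative only.
    #include <QCheckBox>
    #include <QPushButton>

    struct VisualizationOptions
    {
        bool showDetections      = true;
        bool showLandmarks       = true;
        bool showNormalizedFaces = true;
        bool showNormalizedMarks = true;
        bool showClusterMeans    = false;  // only meaningful for webcam/video modes
    };

    void wireOptions(VisualizationOptions* opts,
                     QCheckBox* detectionsBox, QCheckBox* clusterMeansBox,
                     QPushButton* startButton, bool modelsLoaded, bool isVideoMode)
    {
        // Checkboxes simply toggle the corresponding visualization flags.
        QObject::connect(detectionsBox, &QCheckBox::toggled,
                         [opts](bool on) { opts->showDetections = on; });
        QObject::connect(clusterMeansBox, &QCheckBox::toggled,
                         [opts](bool on) { opts->showClusterMeans = on; });

        // "Face cluster means" is only valid for the webcam/video input modes.
        clusterMeansBox->setEnabled(isVideoMode);

        // No analytics can be started before the trained models are loaded.
        startButton->setEnabled(modelsLoaded);
    }
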
Here the user has selected another image (after pressing “Load single image”). This time, the user opts to visualize only the face detections by selecting only the first checkbox.
This time, the user has opted for visualizing face detections and facial landmark detections. All other (checkboxes for) visualizations are turned off.
Also visualizing the normalized faces.
In addition to visualizing the normalized faces, the user has selected to show landmarks on the normalized face images.
Another test image for the single image input mode.
Another test image for the single image input mode.
Another test image for the single image input mode.
Another test image for the single image input mode.
Another test image for the single image input mode.
Another test image for the single image input mode.
Another test image for the single image input mode.
Another test image for the single image input mode.
Another test image for the single image input mode.
This time, the user has selected the video file input mode and the system lets them browse to a video file on the computer.
After the user has made a selection, the system loads the video and is ready to perform analytics on it.
In the first frame of the loaded video, we see the first person. The system correctly assigns the face to “cluster 1”. It works by first detecting the face and then computing the facial landmark points. This is followed by finding an affine transformation (in particular rotation, scaling and translation) that maps the relative positions of the detected landmark points to a canonical 2D pose. Once the translation, rotation and scaling parameters have been obtained, the image pixels are warped using the obtained transformation to generate the normalized face image. This normalized face image serves as input to the feature extraction, which extracts Local Binary Pattern (LBP) features at overlapping locations around the pose-normalized landmark points. All these LBP feature vectors are concatenated to form the single, very high-dimensional feature vector that represents the normalized face image. All the subsequent online clustering, including cluster assignments and updates, is done in this feature space; the visualization of the face cluster means in the appearance (pixel) space is just for human consumption. A sketch of such an online clustering step is shown below.
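
The following is a minimal sketch of the kind of online nearest-mean clustering step described above, operating on the concatenated LBP feature vectors. The Euclidean distance, the fixed new-cluster threshold and the running-mean update are illustrative assumptions; the metric, threshold and update rule actually used by the deployed system may differ.

    // Online clustering sketch over high-dimensional LBP feature vectors.
    // Distance metric, threshold and update rule are illustrative assumptions.
    #include <cmath>
    #include <cstddef>
    #include <limits>
    #include <vector>

    struct Cluster
    {
        std::vector<float> mean;  // cluster mean in LBP feature space
        std::size_t count = 0;    // number of faces assigned so far
    };

    static float euclidean(const std::vector<float>& a, const std::vector<float>& b)
    {
        float d = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i)
        {
            const float diff = a[i] - b[i];
            d += diff * diff;
        }
        return std::sqrt(d);
    }

    // Assign a face's feature vector to the nearest cluster if it is close
    // enough, otherwise create a new cluster; update the matched cluster's
    // mean online. Returns the cluster ID (index into `clusters`).
    std::size_t assignOnline(std::vector<Cluster>& clusters,
                             const std::vector<float>& feat,
                             float newClusterThreshold)
    {
        std::size_t best = 0;
        float bestDist = std::numeric_limits<float>::max();
        for (std::size_t i = 0; i < clusters.size(); ++i)
        {
            const float d = euclidean(clusters[i].mean, feat);
            if (d < bestDist) { bestDist = d; best = i; }
        }

        if (clusters.empty() || bestDist > newClusterThreshold)
        {
            Cluster fresh;                 // start a new cluster for this face
            fresh.mean = feat;
            fresh.count = 1;
            clusters.push_back(fresh);
            return clusters.size() - 1;
        }

        Cluster& c = clusters[best];       // running-mean update of the match
        ++c.count;
        for (std::size_t i = 0; i < c.mean.size(); ++i)
            c.mean[i] += (feat[i] - c.mean[i]) / static_cast<float>(c.count);
        return best;
    }
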
The video has been playing and soon another person appears. The system again correctly assigns the new face to a new cluster, i.e. “cluster 2”. The effect of this can also be seen in the visualization.
Even as the person moves about, talks, turns his face in various ways, etc., the system still assigns the face to the correct cluster. Note that all the clustering takes place online in real time. No offline processing, data collection or batch clustering is used. Moreover, no tracking, temporal smoothing or any other type of spatio-temporal cue is required or used by the system.
The user is free to pause, resume (start) or stop the video analytics at any point in time.
The video has been playing for a while, and by this time new people have appeared in the scene. As can be seen, the system is robust to various real-world conditions such as scale, translation, 2D rotation, lighting and resolution. The number of clusters and the individual cluster models themselves are automatically updated by the system on a frame-by-frame basis.
Yet another person has appeared in the group and a new cluster has been correctly created, updated and assigned.
Two new clusters are created. Cluster 6 actually belongs to the person who was in cluster 4; however, since the face appearance is sufficiently different from that in cluster 4, it was assigned to a new cluster (“cluster 6”). This could be improved by performing more expensive 3D normalization, etc., but those expensive operations are not performed in order to keep the system running in real time. Another possible improvement would be to use additional cues such as tracking, spatio-temporal information and additional context (such as clothing) to connect and merge the different clusters which belong to the same person and to learn a higher-level model online.


A new video file input.
From the beginning of this video, there are two persons, and the system correctly assigns them to “cluster 1” and “cluster 2”. The visualizations for various aspects of the output can also be seen such as the face detections, facial landmarks, normalized face images and normalized landmark points, and cluster means in pixel space.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
Demonstration of turning on or off certain visualizations even as the video is playing and analytics are performed.
Demonstration of turning on or off certain visualizations even as the video is playing and analytics are performed.
Demonstration of turning on or off certain visualizations even as the video is playing and analytics are performed.
Demonstration of turning on or off certain visualizations even as the video is playing and analytics are performed.
Demonstration of turning on or off certain visualizations even as the video is playing and analytics are performed.
Demonstration of turning on or off certain visualizations even as the video is playing and analytics are performed.
Demonstration of turning on or off certain visualizations even as the video is playing and analytics are performed.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
A new video file input.
At the beginning of this video there are two people, and the system correctly creates two new clusters and assigns the faces to the respective clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
A new video input file.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
This face was deemed to be sufficiently different and, as such, a new cluster was created. See the previous comments on how to extend the system to tackle this problem (with possible negative consequences for the speed or performance of the system).




A new video input file.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
As the video has been playing and despite the various changes that have taken place (such as the people in the scene talking and moving their heads, lips, etc. in various ways), the system still correctly assigns them to the correct clusters and does not make any new clusters.
Webcam input mode.
The user can select a webcam ID in case there are multiple webcams; if the user has only one webcam, it is most likely ID 0. Further details of the webcam mode and the other modes are not shown here. Even better results than with in-the-wild videos (such as YouTube videos) can be expected, since webcams normally operate in more controlled conditions in terms of lighting, face size, resolution, etc.

 

 

References

[1] Dalal, Navneet, and Bill Triggs. “Histograms of oriented gradients for human detection.” Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE, 2005.
[2] Kazemi, Vahid, and Josephine Sullivan. “One millisecond face alignment with an ensemble of regression trees.” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, United States, 23–28 June 2014. IEEE Computer Society, 2014.
[3] King, Davis E. “Max-margin object detection.” arXiv preprint arXiv:1502.00046 (2015).
[4] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, M. Pantic. “300 Faces In-the-Wild Challenge: Database and results.” Image and Vision Computing (IMAVIS), Special Issue on Facial Landmark Localisation “In-The-Wild”. 2016.
[5] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, M. Pantic. “300 Faces in-the-Wild Challenge: The first facial landmark localization challenge.” Proceedings of IEEE Int’l Conf. on Computer Vision Workshops (ICCV-W), 300 Faces in-the-Wild Challenge (300-W). Sydney, Australia, December 2013.
[6] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, M. Pantic. A semi-automatic methodology for facial landmark annotation. Proceedings of IEEE Int’l Conf. Computer Vision and Pattern Recognition (CVPR-W), 5th Workshop on Analysis and Modeling of Faces and Gestures (AMFG 2013). Oregon, USA, June 2013.
[7] King, Davis E. “Dlib-ml: A machine learning toolkit.” Journal of Machine Learning Research 10.Jul (2009): 1755-1758.
[8] Belhumeur, P., Jacobs, D., Kriegman, D., Kumar, N. “Localizing parts of faces using a consensus of exemplars.” In Computer Vision and Pattern Recognition (CVPR), 2011.
[9] Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, Thomas S. Huang. “Interactive Facial Feature Localization.” ECCV 2012.
[10] https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/