How to Detect Lies with a Machine and Microexpressions

Strad Slater
Feb 3, 2021

Let's try a challenge:

Below are a few statements. The challenge for you is to say all these statements out loud. The catch is that you must do it with a stone face. So no emotions. Are you ready? Here we go.

“I think puppies are ugly, horrible creatures.”

“Chocolate is the grossest food on the planet.”

“The beach is the worst place to relax and chill.”

Alright, now you can show your emotions. Let your anger and disgust show as you think about the lies you just told. Are you worried someone saw you? With such a serious face, they probably think you really meant those truly unpopular opinions. But don't worry. There’s a way to prove your innocence.

Emotions Deeply Hidden

While you were actively trying to keep a straight face, your facial muscles were working hard to show your true emotions through the use of micro-expressions.

Micro-expressions are tiny facial expressions that represent a person's true emotions. These differ from macro expressions in both duration and intensity. While a macro expression lasts for 1/2 to 4 seconds, a micro-expression lasts only 1/15th to 1/25th of a second.

Micro-expressions on the left and their corresponding macro expressions on the right (https://www.researchgate.net/figure/Difference-of-motion-intensity-between-micro-and-macro-expression-happiness-line-1_fig1_336815025)

Micro-expressions occur as a result of emotional suppression. When you try to suppress your macro expressions, micro-expressions form that can reveal your true feelings.

So if someone happened to see you chanting these socially unpopular opinions, don't worry. They likely detected and classified the micro-expressions you were showing, allowing them to see you were clearly lying.

Except that's almost certainly not the case. Humans are very bad at detecting micro-expressions, let alone correctly labeling them. Untrained observers have been found to be accurate only about 56% of the time, with trained observers scoring not much higher.

Yes, that's right. You can actually get training in micro-expression detection and classification. Dr. Ekman, one of the early micro-expression researchers, has this course online if you'd like to check it out. Maybe then, you can catch your friend lying if they ever try the challenge above.

But not everyone has the money or time to learn how to detect micro-expressions. Because of their uncontrollable nature, micro-expressions seem like a promising tool in fields where being able to detect deception is critical, such as Psychiatry or Law. So finding a way to accurately detect them would be beneficial.

If humans have such poor detection accuracy, then it makes sense to try the next best thing: Artificial Intelligence.

AI Microexpression Recognition

Artificial Intelligence is becoming increasingly capable of performing human tasks better than humans can, and micro-expression detection is no exception.

Research into the applications of AI/ML with micro-expressions has been sparse but increasing. Several studies have produced models whose accuracy exceeds human performance, making the vision of using micro-expressions to detect deception more realistic.

So how do we build a model that can detect micro-expressions? There are three main steps: data collection, feature extraction, and feature identification.

Data Collection

Data collection is the hardest step of the process. The subtlety of micro-expressions makes it hard to gather data without a very controlled environment.

One of the early datasets, named USF-HD, consisted of 100 data points from participants who were shown examples of micro-expressions and then asked to replicate them.

Another came from Polikovsky, who asked 10 university students to make low-intensity facial expressions and then return to neutral as fast as possible.

You can already see that these sets aren't ideal, since posed expressions don't represent natural micro-expressions. This makes it hard to accurately identify which emotion is being shown.

The SMIC dataset does a better job by using an interrogation setting to cause high stress in the subjects. Since micro-expressions often show in high-stress situations where emotions are being suppressed, the interrogation environment allowed for natural and spontaneous micro-expressions to occur.

A scene representing what it probably felt like for the subjects of this dataset

Despite this, the dataset is quite small as it only captured 77 data points that could be used.

More recent datasets have fixed some of these issues by using a more controlled environment.

Xiaobai Li from the University of Oulu in Finland collected a dataset by having 20 participants sit in front of a camera and watch a series of videos meant to elicit a range of positive and negative emotions. Like the challenge at the beginning of this article, the participants were told not to show any emotion.

Each video was then split into frames, and a neutral frame (neutral face, no micro-expression) was picked as a base. Other frames were then compared to this base to detect subtle changes in the face (micro-expressions). Then, based on the clip being shown at the time of each micro-expression, along with the participant's own account of how they felt watching that clip, the researchers were able to label the emotion as positive, negative, or surprise. This resulted in a total of 164 micro-expression data points.
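
To make the frame-comparison idea concrete, here is a minimal sketch of flagging candidate micro-expression frames against a neutral baseline, assuming the frames are already loaded as grayscale NumPy arrays. The threshold and function name are illustrative, not taken from Li's paper.

```python
import numpy as np

def find_candidate_frames(frames, neutral_idx=0, threshold=8.0):
    """Flag frames whose mean absolute pixel difference from a chosen
    neutral (baseline) frame exceeds a threshold.

    frames: list of grayscale frames (H x W uint8 arrays).
    neutral_idx, threshold: illustrative values, not from the original study.
    """
    neutral = frames[neutral_idx].astype(np.float32)
    candidates = []
    for i, frame in enumerate(frames):
        diff = np.abs(frame.astype(np.float32) - neutral).mean()
        if diff > threshold:
            candidates.append((i, diff))
    return candidates
```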

The same technique was used for the popular CASME dataset, which used 35 participants for a total of 195 micro-expressions. The number was later increased to 247 data points in the updated set, CASME II.

An example of a participant from the CASME II dataset. You can see the slight difference in her eyebrows, which represents a micro-expression, and just how short these are: this one lasted only 245 milliseconds.

Overall, while the datasets are getting better, the necessity for ideal data points makes it challenging to create algorithms that are generalizable with high accuracy.

For example, in Li’s dataset, the participants were required to face the camera head-on, with no large movements of the head. Lighting had to be perfect and the camera had to be still.

This makes it hard to train algorithms that can recognize faces from a moving or shaking camera, a camera of bad quality, or a camera at different angles.

Not only that, but these datasets, although growing, still contain relatively few data points to work with.

One possible suggestion from this article on gaining more data points involves using the vast number of videos on YouTube and social media as a way to find and label micro-expressions. With many people recording videos in which they face the camera and talk (usually with good lighting), these platforms could serve as a source of micro-expressions that could be labeled based on what the speaker is saying.

Despite the challenges in gathering quality datasets, algorithms trained on sets such as CASME II perform at state-of-the-art accuracies, averaging 80–90%. After picking a dataset, it's time to extract micro-expression features from the faces.

Feature Extraction

Before feature extraction, the data must be pre-processed. This includes splitting the video up into individual frames, removing frames that don't include micro-expressions, and converting the frames from RGB color to grayscale.
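
As a rough idea of what this pre-processing looks like in practice, here is a hedged sketch using OpenCV to split a clip into grayscale frames. The filename is hypothetical, and real pipelines would also crop and align the face.

```python
import cv2

def video_to_gray_frames(path):
    """Split a video into individual grayscale frames.

    path: path to a clip (the filename below is hypothetical).
    Returns a list of 2D uint8 arrays.
    """
    capture = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = capture.read()  # OpenCV reads frames in BGR color
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    capture.release()
    return frames

frames = video_to_gray_frames("subject01_clip.avi")  # hypothetical file
```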

So how do we determine which frames should be removed? First, it's important to figure out whether we want to use spatial or spatiotemporal data. Should we use static frames, or use a sequence of frames that would allow us to account for time?

Using spatial frames requires the researcher to find the apex frame of the emotion, which is the frame in which the micro-expression is most intense. Spatial frames allow for simpler algorithms and shorter training times but result in less accurate models.

Using spatiotemporal data requires the use of all frames from the onset of the micro-expression to the offset. While leading to much more accurate algorithms, it can have much longer training times.
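
If you go the spatial route, one simple (and admittedly crude) way to guess the apex frame is to pick the frame that differs most from the onset frame. Datasets like CASME II actually provide hand-annotated apex indices, so treat this purely as an illustration.

```python
import numpy as np

def pick_apex_frame(frames):
    """Pick the frame that differs most from the onset (first) frame.

    A simple stand-in for apex selection; published datasets usually
    ship hand-annotated apex indices instead.
    """
    onset = frames[0].astype(np.float32)
    diffs = [np.abs(f.astype(np.float32) - onset).mean() for f in frames]
    return int(np.argmax(diffs))
```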

A popular method of extracting these features is to use Local Binary Pattern (LBP). Let's take our static frame.

So the point of LBP is to help us find edges in a picture, in this case, allowing us to distinguish the mouth from the cheeks and the eyes from the forehead, etc. First, we divide the 2D image of our face into multiple blocks. Then for each pixel inside the blocks, we measure the 8 surrounding pixels.

Diagram representing how LBP works (https://link.springer.com/chapter/10.1007/978-3-030-01449-0_24)

We compare the grayscale value of each neighboring pixel with the grayscale value of the center pixel. For each neighbor, if its value is higher than or equal to the value of the center pixel, we record a 1; otherwise we record a 0. With 8 neighbors, this forms an 8-bit binary number, which is then converted to a decimal value so it can be used for training.
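
Here is a minimal NumPy sketch of that 8-neighbor comparison. In practice you would probably reach for something like scikit-image's local_binary_pattern, but spelling it out makes the bit-packing easier to see.

```python
import numpy as np

def lbp_image(gray):
    """Basic 3x3 Local Binary Pattern: compare each pixel's 8 neighbors
    to the center pixel and pack the resulting bits into a 0-255 code."""
    gray = gray.astype(np.int32)
    h, w = gray.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = gray[1:-1, 1:-1]
    # Neighbor offsets, clockwise starting from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= ((neighbor >= center).astype(np.uint8) << bit)
    return codes
```

The histogram of these codes over each block (for example, np.bincount(codes.ravel(), minlength=256)) is what typically gets fed to the classifier.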

If you want further explanation, this video from Computerphile does a good job explaining the concept of LBP.

How do these values allow us to find edges? For each pixel, we can draw lines (edges) dividing the neighbors with a value of 0 from the neighbors with a value of 1, distinguishing light spots from dark spots. By doing this for every pixel and abstracting up to the level of the aforementioned blocks, the computer can predict where in the picture there are edges and where there are not. If two pixels are consistently divided by a line, they likely sit on a prominent edge in the picture, such as the boundary between the lips and the face, or the nostrils and the nose.

Now, this is good for helping the computer detect edges on the face, but in order to tell that a micro-expression occurred, LBP must be applied using three dimensions: x, y, and time. By adding time as a third dimension, we are able to see how the edges of one frame compare to the edges of the next. This is closely related to optical flow, in which we calculate the movement of an object by seeing where a “pixel” moves from one frame to the next. The pixel itself is not actually moving; rather, the pattern that was at one pixel shows up at a neighboring pixel in the following frame.
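
For a sense of what that frame-to-frame movement looks like as dense optical flow (a related technique, separate from LBP-TOP itself), here is a sketch using OpenCV's Farneback method on two consecutive grayscale frames.

```python
import cv2
import numpy as np

def dense_flow(prev_gray, next_gray):
    """Dense optical flow between two consecutive grayscale frames.

    Returns the per-pixel (dx, dy) motion field and its magnitude,
    i.e. how far each pixel's pattern appears to have moved.
    """
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    magnitude = np.linalg.norm(flow, axis=2)
    return flow, magnitude
```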

A way to calculate this is through Local Binary Patterns on Three Orthogonal Planes (LBP-TOP). It's the same as LBP, but instead of just an xy plane, we now have xy, xt, and yt planes.

In other words, the xt plane takes the center pixel and compares it to its left and right neighbors, as well as to the values of the left, right, and center pixels from the preceding and succeeding frames, resulting in an 8-bit code. The same goes for the yt plane, but with the top and bottom neighbors.

xy, xt, and yt planes
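
Here is a rough sketch of the LBP-TOP idea, reusing the lbp_image helper from the earlier sketch. It treats the clip as a (time, height, width) volume, slices it along the three planes, and concatenates the code histograms. Real implementations use circular neighborhoods and per-block histograms, so this is only the skeleton.

```python
import numpy as np

def lbp_top_histograms(clip):
    """clip: (T, H, W) grayscale volume covering onset to offset.

    Slice the volume along the XY, XT, and YT planes, run the same
    8-neighbor LBP on each slice, and concatenate the code histograms.
    """
    T, H, W = clip.shape
    planes = {
        "xy": [clip[t, :, :] for t in range(T)],   # spatial texture
        "xt": [clip[:, y, :] for y in range(H)],   # horizontal motion over time
        "yt": [clip[:, :, x] for x in range(W)],   # vertical motion over time
    }
    hists = []
    for slices in planes.values():
        hist = np.zeros(256)
        for s in slices:
            if min(s.shape) >= 3:                  # need at least a 3x3 window
                codes = lbp_image(s)               # helper from the LBP sketch above
                hist += np.bincount(codes.ravel(), minlength=256)
        hists.append(hist / max(hist.sum(), 1))    # normalize each plane's histogram
    return np.concatenate(hists)                   # 768-dimensional feature vector
```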

So now that we have extracted our features, it's time to feed them into an algorithm for training.

Feature Identification

There are many ways to go about feature classification, ranging from handcrafted algorithms to deep learning networks. This section will go over the methods and results of the two papers mentioned below:

“Spontaneous Facial Micro-Expression Recognition using 3D Spatiotemporal Convolutional Neural Networks”

“Facial micro-expression recognition: A machine learning approach”

Both studies use the CASME II dataset which makes them good for comparison.

MicroExpFuseNet vs MicroExpSTCNN

In the first paper, by the Computer Vision Group at the Indian Institute of Information Technology, the authors compared two methods: the MicroExpSTCNN model, which analyzed the whole face, and the MicroExpFuseNet model, which extracted and used just the mouth and eye regions.

Diagram of how the MicroExpSTCNN model was structured

With the MicroExpSTCNN model, the images are fed into a 3D convolutional layer, which is used to distinguish the edges and features of the face. The input has dimensions of w x h x d, with w and h being the width and height of the frame while d is the number of frames being analyzed at once (the temporal dimension). This is what makes the network 3D and what allows for temporal as well as spatial analysis.

The data is then passed through 3D pooling layers which are used to pick the features of the face that will be most influential in the classification of the emotion. This occurs through a process called Uniform Binning, in which pixels that have very similar values are put together in a “bin.” This helps reduce redundancy and increase training efficiency.

This data is then sent through a dropout layer. When a neural network trains on a dataset, it tends to get really good at classifying the frames it has been training on but fails when introduced to new frames (overfitting). It's like memorizing the notes for a test instead of grasping the underlying concepts, which prevents you from answering questions that weren't in the notes.

The dropout layer reduces overfitting by turning off random nodes in the network each iteration. This forces the model to work harder on finding patterns that are generalizable in microexpression classification.

The flattening layer is then used to turn the 3D data into a one-dimensional input. The data is then sent through a dense layer. (Dense layers are layers in which every node is connected to every node in the next layer.) Then it's sent through another dropout layer and another dense layer. This portion is where the model learns which emotions correlate with which expressions.

Finally, the data is passed through a softmax function, which turns the input values into numbers between 0 and 1 that sum to 1, in other words, probabilities. A probability is then assigned to each emotion (Anger, Disgust, or Happy), and the one with the highest probability is the model's prediction.
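
To make the layer order concrete, here is a small Keras sketch in the spirit of MicroExpSTCNN. The filter counts, input size, and class count are illustrative placeholders, not the exact values from the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_stcnn(height=64, width=64, depth=18, num_classes=3):
    """3D spatiotemporal CNN in the spirit of MicroExpSTCNN.

    Input: a stack of `depth` grayscale frames of size height x width.
    All hyperparameters here are illustrative placeholders.
    """
    inputs = keras.Input(shape=(height, width, depth, 1))
    x = layers.Conv3D(32, kernel_size=3, activation="relu")(inputs)  # spatiotemporal features
    x = layers.MaxPooling3D(pool_size=3)(x)                          # keep the strongest responses
    x = layers.Dropout(0.5)(x)                                       # fight overfitting
    x = layers.Flatten()(x)                                          # 3D data -> 1D vector
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)     # class probabilities
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```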

Diagram of how the MicroExpFuseNet model is structured

The MicroExpFuseNet model follows the same order of layers as the MicroExpSTCNN. The main difference is that the beginning stage consists of two separate networks in which one extracts the eye regions and the other extracts the mouth regions. The data from both networks are then combined into one input after the flattening stage. The other difference with this model is that the dense and dropout pattern is repeated three times before hitting the softmax layer.
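
A sketch of that two-stream fusion idea might look like the following, with one branch per region and the flattened outputs concatenated before the repeated dense-and-dropout stack. Again, all sizes are placeholders rather than the paper's actual configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def branch(inputs):
    """One 3D-CNN branch (same structure for the eye and mouth streams)."""
    x = layers.Conv3D(16, kernel_size=3, activation="relu")(inputs)
    x = layers.MaxPooling3D(pool_size=2)(x)
    x = layers.Dropout(0.5)(x)
    return layers.Flatten()(x)

def build_fusenet(region_shape=(32, 32, 18, 1), num_classes=3):
    eye_in = keras.Input(shape=region_shape)     # cropped eye-region clip
    mouth_in = keras.Input(shape=region_shape)   # cropped mouth-region clip
    x = layers.Concatenate()([branch(eye_in), branch(mouth_in)])  # fuse after flattening
    for _ in range(3):                           # dense + dropout repeated three times
        x = layers.Dense(128, activation="relu")(x)
        x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model([eye_in, mouth_in], outputs)
```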

Support Vector Machine vs Extreme Learning Machine

In the second study, by Iyanu Adegun and Hima Vadapalli, they compare the accuracy of the Support Vector Machine (SVM) and the Extreme Learning Machine (ELM). (We'll go over these shortly.)

First off, for feature extraction, both methods use the LBP and LBP-TOP methods.

The extracted features are then treated as points on a graph where an SVM can be used. The SVM tries to find the line that best divides the data into two predetermined classes. Without getting too complicated, it does this by drawing a line, calculating the distance of each point from the line, and then evaluating how likely each point is to be in the right class based on its distance from the line.

With each iteration, the model uses this information, along with the correct labels from the training set, to adjust the line so that it divides the data into the two classes as accurately as possible while leaving the widest possible margin between the line and the closest points on each side.

That's a bit of a mouthful but hopefully, this picture can help you visualize it.

SVM tries to find the line that maximizes the distance between the red line and the dotted lines on both sides while accurately dividing the data points into two classes.

In this study, there were five possible outputs (Surprise, Disgust, Happiness, Repression, and Other). Since an SVM is a binary classifier, the model was used five times with the two classes being “the emotion” or “not the emotion” (e.g. Surprise or Not Surprise). The average of all five accuracies was then taken to get the total accuracy of the model.
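
A hedged sketch of that one-vs-rest setup with scikit-learn might look like this, where X stands in for the LBP-TOP feature vectors and y for the emotion labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# X: one LBP-TOP histogram vector per clip, y: emotion label strings.
# Both are placeholders standing in for features extracted from CASME II.
EMOTIONS = ["surprise", "disgust", "happiness", "repression", "other"]

def one_vs_rest_accuracy(X, y):
    """Train one binary SVM per emotion ("this emotion" vs "not this emotion")
    and average the cross-validated accuracies, mirroring the setup above."""
    scores = []
    for emotion in EMOTIONS:
        binary_labels = (np.asarray(y) == emotion).astype(int)
        clf = SVC(kernel="rbf")  # RBF kernel: a hyperplane in a lifted space does the split
        scores.append(cross_val_score(clf, X, binary_labels, cv=5).mean())
    return float(np.mean(scores))
```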

Sometimes with an SVM, the data cannot be cleanly divided by a straight line. In this case, the data is mapped into a higher-dimensional space and a hyperplane is used to divide it, which was the strategy used in Adegun and Vadapalli's study.

SVM in a three-dimensional graph, using a hyperplane to separate the data.

The other model Adegun and Vadapalli's study tests is an Extreme Learning Machine. Without getting into all the math, an Extreme Learning Machine is similar to a normal neural network in that the data gets fed forward. It differs in that it doesn't use backpropagation but rather something called a matrix inverse. Don't worry if you don't understand these terms. Just understand that the main advantage of an ELM is that it doesn't have to feed the data through multiple times: with the use of the matrix inverse, the weights and biases are set after feeding the data forward just one time.

This results in a significantly faster learning speed than traditional neural networks which is what encouraged this study to test it in the first place.
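
Here is a minimal, illustrative ELM in NumPy to show why training is so fast: the hidden weights are random and fixed, and the output weights come from a single pseudoinverse solve rather than many passes of backpropagation.

```python
import numpy as np

class ExtremeLearningMachine:
    """Minimal ELM sketch: random, fixed hidden layer; output weights solved
    in one shot with the Moore-Penrose pseudoinverse instead of backprop."""

    def __init__(self, n_hidden=500, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)  # fixed random projection

    def fit(self, X, Y):
        # X: (n_samples, n_features) LBP/LBP-TOP vectors, Y: one-hot labels
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ Y    # the single "matrix inverse" step
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)
```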

The ELM was trained similarly to the SVM in that it used LBP and LBP-TOP to extract the features. The model is also a binary classifier, so the data was split into five two-class sets, just like with the SVM.

Results

In order of increasing accuracy, the four methods ranked:

4. MicroExpFuseNet (83.25%)

3. MicroExpSTCNN (87.80%)

2. Support Vector Machine (96.26%)

1. Extreme Learning Machine (97.65%)

The highest accuracies (1 and 2) were both achieved when the algorithms extracted features using LBP-TOP rather than LBP, which shows that temporal data increases accuracy by a significant margin (3–6%).

It's also important to note that 3 and 4 used three output classes while 1 and 2 used five. It's possible that the use of an “Other” category for 1 and 2 made it easier for the machine to predict correctly by letting it lump the more ambiguous micro-expressions into one category. This means the data points in the other four categories were more clearly defined, making them easier for the model to classify.

It's also worth noting that the MicroExpSTCNN, which looked at the whole face, did better than the MicroExpFuseNet, which only looked at the eye and mouth regions. From this, we can conclude that the nose, forehead, cheeks, and other parts of the face besides the mouth and eyes are significant in classifying the emotion behind a micro-expression.

It's clear that these algorithms can reach very high accuracies, much higher than humans can. This gives hope for the use of this technology in society.

Applications of a Micro-Expression Detector

What could this research be used for? Well, the fundamental use is that microexpressions can help detect deception, especially when the subject is consciously trying to deceive. Some interesting applications of this tool can be found in fields such as Psychiatry and Law.

Mental Health

Patients with depression often tell their doctor they feel better than they do (because they dislike the treatment, feel pressure to improve, etc.), which causes them to leave without getting the proper help they need. This can have dramatic consequences, the most serious being suicide.

There is also a type of depression called “Smiling Depression,” in which a person has depression but suppresses it either consciously or not, with positive emotions such as a smile, hence the name.

These people are usually successful in many aspects of their lives and often don't even realize they are depressed. This type of depression is especially dangerous. With typical depression, patients are prone to low energy, while people with smiling depression are more likely to experience bursts of energy, which makes them more likely to follow through with a suicide attempt.

Representation of what people with smiling depression do: they mask their feelings with a smile

Micro-expression detection could be used during patient interviews to look for traces of repressed sadness or anger. This could help clinicians make more informed decisions about next steps, resulting in better suicide prevention.

Interrogations

Interrogations of criminal suspects are a perfect place for micro-expression detection to be used. They create a high-stress environment and are filled with people who are trying to cover something up. Remember, the SMIC dataset used this kind of interrogation setting to elicit spontaneous micro-expressions.

Micro-expression detection could help us see how a suspect truly feels while saying certain statements. If you murdered someone and then tried to lie and say you had no knowledge of who the victim was, you'd likely show traces of anger or guilt that could be detected by a micro-expression trained model. Hopefully, you are never in that situation though.

It’s important to note that micro-expression detection algorithms should be used as one of many tools to assess a patient or suspect. While micro-expressions can be a good indicator of a person's true emotions, it's important to have other sources of evidence (e.g. polygraph, heart rate monitor, patient history) when making such life-altering decisions.

Quick Recap

To recap, micro-expressions are expressions that form on the face, usually in high-stress situations where deception is taking place. These expressions are shorter and less intense than macro expressions and are hard to fake, making them good for revealing a person's true emotions.

Through the collection of photos and videos showing micro-expressions, an AI model can be trained to learn how to detect and classify micro-expressions itself.

This would help in fields where detecting lies or suppressed emotions would be beneficial such as Psychiatry and Law.

New Challenge

With all this new knowledge you've gained, let's try one last challenge:

Same as before, say the following statements without showing any emotion. Ready?

“This article was absolutely amazing!”

“This article taught me so many interesting things about microexpression detection!”

“Microexpressions and models that can detect them are the coolest things ever!”

Man, if only I had a model myself to detect all the micro smiles on your faces right now. :)

TL;DR

  • Microexpressions are short and subtle expressions on the face that usually form in high-stress situations where a person is trying to lie, making them good for revealing a person's true emotions
  • Using AI we can train models to detect and classify microexpressions
  • A popular dataset right now is the CASME II dataset with 247 data points. Current datasets are sparse and were made under controlled environments.
  • To extract the microexpressions from a face, the machine uses Local Binary Patterns, which compare the grayscale values of the surrounding pixels with that of the center pixel. This process helps determine edges in a photo. LBP-TOP is used to incorporate time into the model, which allows the model to see changes in the face, more specifically, microexpressions.
  • Two studies created models of their own. The MicroExpSTCNN, which trained on the whole face, and the MicroExpFuseNet, which trained on the eye and mouth regions, were tested in one study, while a Support Vector Machine and an Extreme Learning Machine were tested in the other.
  • The ELM was shown to have the highest accuracy of the four at 97.65%. It likely did better than the MicroExpSTCNN and MicroExpFuseNet partly due to its inclusion of an “Other” category, which absorbed the more ambiguous microexpressions.
  • Microexpression detection could be used in fields such as Law and Psychiatry where being able to detect deception is crucial.


Strad Slater

I am an undergraduate and TKS innovator in Las Vegas. I am interested in Nanotechnology, Philosophy, and Physics.