

 Pose-estimation using OpenCV

written by: Dinesh Gitae, Rohini Shinde, Shyambhu Mukherjee

Pose-estimation is a classic computer vision task where we detect the pose of a human face or body and then estimate different parameters or detect certain characteristics from it. In a mentoring effort from mentorbruh, Dinesh and Rohini took an OpenCV pose-estimation example, understood the code completely, and modified it to detect their own face pose via OpenCV. While Dinesh ran it on his local machine, Rohini used a Colab version to do the task. In this post, we will go through the code and understand it down to the tiniest details, so that you can recreate the task or at least make use of such pose-estimation models.

Summary of the article:

In this article, we first introduce the reader to what a pose-estimation task is and which models are currently available to do the task. Then we quickly move to the example and thoroughly discuss each and every part of it and how each part helps in the pose estimation. We provide some code snippets from Dinesh's repository, and finally we end with some further reading links about how to use pose-estimation results and the exciting things people have been building with pose-estimation.

What is a pose-estimation task?

Pose estimation refers to inferring the pose of a body from an image: first detecting the human bodies in the image, then detecting the separate pose-denoting body parts, and finally assembling them to estimate the pose of each body. Here is an interesting KDnuggets article on the ground-breaking research done up to 2019 in this field. Once you read that article and do a bit of research, you will see that there are different variations of the pose-estimation task, and there are multiple papers which have achieved great feats in each of them. Here we will mention some of them:

(a) Convolutional Pose Machines: this paper estimates the pose of human bodies and remains one of the best-known works to date. It was published in 2016 and is available along with its code.

(b) PIFuHD, Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization: 

This paper came out at CVPR 2020 and ranks first in 3D human reconstruction from a single image on the RenderPeople benchmark. You can check out and try the Colab implementation of this paper from Facebook Research here.

(c) DeepCut, Joint Subset Partition and Labeling for Multi Person Pose Estimation:

This paper also came out in 2016, and it approaches a slightly different problem than the previous two: it focuses on detecting multiple persons in one single image and then determining their poses simultaneously. See the image below to get a better understanding.

[Image: DeepCut multi-person pose estimation example]

 

Now, we can mention only so many great works without going beyond the scope of the current article. While these may have raised your hopes, in our current project we will use the TensorFlow MobileNet model, as we run it on local machines and have no guarantee that heavier models will be supported. The term MobileNet raises a question, which we will quickly resolve in the next paragraph.

Why MobileNet?

Well, MobileNet models are basically meant to be light-weight, yet not much less accurate, versions of generally heavier models such as deep CNNs like VGG16 and others. A number of techniques such as compression, singular value decomposition, pruning and others go into creating such models, so that they end up light-weight yet similarly performing. You can read about one such effort in this paper, which talks at length about these techniques. In the current project, we use a TensorFlow MobileNet model which is only about 7MB, and therefore we can run it safely without our computers getting frozen ( 😉 ).

Into the code we go:

Now that we are down from our high horses, let's quickly start with the code.

First of all we do a couple of imports. Note that OpenCV's Python package is named cv2; if you face any problem installing OpenCV, then follow this to resolve it. We import cv2 under the alias cv, and we also import matplotlib.pyplot for plotting images in our code.

import cv2 as cv
import matplotlib.pyplot as plt


Now, we will quickly load the MobileNet model, which sits in our git repo in the same directory under the name graph_opt.pb, using the following line and the readNetFromTensorflow functionality:

# Load the frozen model using TensorFlow:
net = cv.dnn.readNetFromTensorflow('graph_opt.pb')

After this, we will quickly note the 19 points we are going to mark via pose-estimation, and we also provide the pairing of these points as a list of lists, so as to join our pose estimate into a single skeleton-like structure.

The 19 body-part tags that the TensorFlow MobileNet model predicts, and the pose pairs that connect them, are stored as a BODY_PARTS dictionary and a POSE_PAIRS list of pairs.
 
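For reference, here is a minimal sketch of these definitions, following the standard OpenCV OpenPose sample for the MobileNet model; the exact dictionary in Dinesh's repository may differ slightly. We also define the network input size and confidence threshold that the function below relies on (inWidth, inHeight and thr are assumed here to take the usual sample values):

BODY_PARTS = { "Nose": 0, "Neck": 1, "RShoulder": 2, "RElbow": 3, "RWrist": 4,
               "LShoulder": 5, "LElbow": 6, "LWrist": 7, "RHip": 8, "RKnee": 9,
               "RAnkle": 10, "LHip": 11, "LKnee": 12, "LAnkle": 13, "REye": 14,
               "LEye": 15, "REar": 16, "LEar": 17, "Background": 18 }

POSE_PAIRS = [ ["Neck", "RShoulder"], ["Neck", "LShoulder"], ["RShoulder", "RElbow"],
               ["RElbow", "RWrist"], ["LShoulder", "LElbow"], ["LElbow", "LWrist"],
               ["Neck", "RHip"], ["RHip", "RKnee"], ["RKnee", "RAnkle"], ["Neck", "LHip"],
               ["LHip", "LKnee"], ["LKnee", "LAnkle"], ["Neck", "Nose"], ["Nose", "REye"],
               ["REye", "REar"], ["Nose", "LEye"], ["LEye", "LEar"] ]

# Network input size and confidence threshold used by pose_estimation() below (assumed defaults)
inWidth = 368
inHeight = 368
thr = 0.2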

We have an image available that shows all the pairs connected. We quickly load this image using the PIL library's Image.open() functionality. Look at the image to get a better idea of the body parts and pose-pairs:
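A minimal sketch of that loading step (the filename here is hypothetical; use whichever reference image sits in the repo):

from PIL import Image

# Open the reference image showing all pose pairs connected and display it
pairs_img = Image.open('pose_pairs_reference.png')  # hypothetical filename
plt.imshow(pairs_img)
plt.axis('off')
plt.show()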


Now we reach the most important part of our code, where we write a function to run the model on an image, estimate the positions, and finally draw them as green lines over the image. We will first provide the whole function for you to read, and then cite each part and explain it.

The function is below:

def pose_estimation(frame):
    
    frameWidth = frame.shape[1]
    frameHeight = frame.shape[0]
    
    ## Grabbing the image:
    net.setInput(cv.dnn.blobFromImage(frame, 1.0, (inWidth, inHeight), (127.5, 127.5, 127.5), swapRB = True, crop = False))
    
    out = net.forward()
    out = out[:, :19, :, :]  # MobileNet output [1, 57, -1, -1], we only need the first 19 elements

    assert(len(BODY_PARTS) == out.shape[1]) 

    points = []
    for i in range(len(BODY_PARTS)):
        # Slice heatmap of corresponding body part.
        heatMap = out[0, i, :, :]

        # Originally, we try to find all the local maximums. To simplify a sample
        # we just find a global one. However only a single pose at the same time
        # could be detected this way.
        _, conf, _, point = cv.minMaxLoc(heatMap)
        x = (frameWidth * point[0]) / out.shape[3]
        y = (frameHeight * point[1]) / out.shape[2]
        # Add a point if its confidence is higher than the threshold.
        points.append((int(x), int(y)) if conf > thr else None)
        

    for pair in POSE_PAIRS:
        partFrom = pair[0]
        partTo = pair[1]
        assert(partFrom in BODY_PARTS)
        assert(partTo in BODY_PARTS)

        idFrom = BODY_PARTS[partFrom]
        idTo = BODY_PARTS[partTo]

        if points[idFrom] and points[idTo]:
            cv.line(frame, points[idFrom], points[idTo], (0, 255, 0), 10)
            cv.ellipse(frame, points[idFrom], (3, 3), 0, 0, 360, (0, 0, 255), cv.FILLED)
            cv.ellipse(frame, points[idTo], (3, 3), 0, 0, 360, (0, 0, 255), cv.FILLED)

    t, _ = net.getPerfProfile()
    freq = cv.getTickFrequency() / 1000
    cv.putText(frame, '%.2fms' % (t / freq), (10, 20), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0))
    return frame

Now, first things first. The first important thing the function does is take a frame as input. To the uninitiated, a frame is basically an image; to be more technical, it is one slice of a moving picture. If you are reading from a video or taking live input, then depending on how many frames per second are being recorded, you will be capturing a series of images, each of which we call a 'frame'. The concept of a frame arises in computer vision just as it does in any program where the screen or input changes over time (e.g. in games).

Now, inside the program, we do the following things:

  1. To get the shape of the image:

To find the actual dimensions (height and width) of the image we use the following lines.
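These are the first two lines inside pose_estimation(); for an OpenCV image, frame.shape is (height, width, channels):

frameWidth = frame.shape[1]   # number of columns
frameHeight = frame.shape[0]  # number of rows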


 

  2. Pre-processing the Image:

To obtain correct predictions from deep neural networks we first need to pre-process our image. In the context of deep learning and image classification, these pre-processing tasks normally involve:

a. Mean subtraction

b. Scaling the images

OpenCV provides two functions to facilitate image preprocessing for deep learning classification:

a. cv2.dnn.blobFromImage()

b. cv2.dnn.blobFromImages()

Both of these functions perform mean subtraction, scaling, and channel swapping.

Mean subtraction:

In mean subtraction, the R, G and B channel means are calculated (either from the image itself or from the training set) and subtracted from the original image, producing the mean-normalized image that is then fed to the network.

blobFromImage() creates a 4-dimensional blob from an image. It optionally resizes and center-crops the image, subtracts the mean values, scales the pixel values by the scale factor, and can swap the Red and Blue channels.

Basically, a blob is a collection of images with the same dimensions, i.e. the same height, width and number of channels.

Discussion about parameters:

cv.dnn.blobFromImage(image, scalefactor=1.0, size, mean, swapRB=False, crop=False)

  1. image: this is the input image we want to preprocess before feeding it to the neural network.

  2. scalefactor: After we perform mean subtraction we can optionally scale our images by some factor. This value defaults to `1.0` (i.e., no scaling) but we can supply another value as well.

  3. size:  Here we supply the spatial size that the Convolutional Neural Network expects. 

  4. mean: These are our mean subtraction values. They can be a 3-tuple of the RGB means or they can be a single value in which case the supplied value is subtracted from every channel of the image.

  5. swapRB: OpenCV assumes images are in BGR channel order. However, the ‘mean’ value assumes we are using RGB order. To resolve this we can swap the R and B channels in ‘image’ by setting this value to ‘True’.

So this is where our pre-processing completes. Then, using the setInput() method of our network object, we provide the 4-d blob as input.
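In our function this is a single line; the scale factor is 1.0, the mean is (127.5, 127.5, 127.5), swapRB is True, and the resulting blob has shape (1, 3, inHeight, inWidth):

net.setInput(cv.dnn.blobFromImage(frame, 1.0, (inWidth, inHeight),
                                  (127.5, 127.5, 127.5), swapRB = True, crop = False))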

Now, we use the net.forward() method to feed the input forward through the model and obtain the output. This output is then sliced to keep only the heatmaps of the 19 body parts we are currently tracking.
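The two lines in question, reproduced from pose_estimation() above:

out = net.forward()
out = out[:, :19, :, :]  # MobileNet output is [1, 57, H, W]; we only need the first 19 heatmaps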

 

Now you have to understand what the output is. At each index i, out[0, i, :, :] is a heatmap of the i-th body part. This heatmap represents a spatial probability, i.e. its value at (x, y) roughly corresponds to the probability of (x, y) being that body point. So what we logically need to do is find the point which is hottest, i.e. which has the highest probability of being the body part according to the model's output.

For this, we use the function cv.minMaxLoc(). Let's look at the parameter structure and output syntax of cv.minMaxLoc().

Discussion about parameters:

C++:

void cv::minMaxLoc(InputArray src,
                   double* minVal,
                   double* maxVal = 0,
                   Point* minLoc = 0,
                   Point* maxLoc = 0,
                   InputArray mask = noArray())

Python:

minVal, maxVal, minLoc, maxLoc = cv.minMaxLoc(src[, mask])

This function takes src, a 2-d array, as input, and returns the global minimum and maximum values, as well as the two-dimensional locations in the array where those minimum and maximum values occur.
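A tiny sanity check (a sketch with a toy 2-d array) shows the ordering of the returned values; note that the locations come back as (x, y), i.e. (column, row):

import numpy as np

heat = np.array([[0.1, 0.9],
                 [0.3, 0.2]], dtype=np.float32)
minVal, maxVal, minLoc, maxLoc = cv.minMaxLoc(heat)
# minVal = 0.1, maxVal = 0.9, minLoc = (0, 0), maxLoc = (1, 0)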

We use this function in a loop, once for each body part, and capture the location of the maximum value, i.e. the predicted body point, and store it in a list.
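The relevant loop, reproduced from the pose_estimation() function above (it uses the BODY_PARTS, thr, frameWidth and frameHeight values defined earlier):

points = []
for i in range(len(BODY_PARTS)):
    # Slice the heatmap of the corresponding body part.
    heatMap = out[0, i, :, :]

    # Find the global maximum of the heatmap; only a single pose can be detected this way.
    _, conf, _, point = cv.minMaxLoc(heatMap)

    # Rescale the heatmap coordinates back to the original frame size.
    x = (frameWidth * point[0]) / out.shape[3]
    y = (frameHeight * point[1]) / out.shape[2]

    # Keep the point only if its confidence is higher than the threshold.
    points.append((int(x), int(y)) if conf > thr else None)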

Note also the important point that we do not actually keep a body position unless its confidence is higher than the threshold. Therefore, raising or lowering the threshold thr is another knob for tuning the quality of the detections.

Finally, we connect and draw the lines of the pose skeleton using the following code loop, where we go through each of the pose pairs, collect their positions, and draw lines between them. We use

cv.line(image, start_point, end_point, color, thickness)
and cv.ellipse(image, centerCoordinates, axesLength, angle, startAngle, endAngle, color[, thickness[, lineType[, shift]]])

to draw the respective body points and join them.
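The drawing loop, again reproduced from pose_estimation() above:

for pair in POSE_PAIRS:
    partFrom = pair[0]
    partTo = pair[1]

    idFrom = BODY_PARTS[partFrom]
    idTo = BODY_PARTS[partTo]

    # Draw a green line and two small red filled ellipses only when both end points were detected.
    if points[idFrom] and points[idTo]:
        cv.line(frame, points[idFrom], points[idTo], (0, 255, 0), 10)
        cv.ellipse(frame, points[idFrom], (3, 3), 0, 0, 360, (0, 0, 255), cv.FILLED)
        cv.ellipse(frame, points[idTo], (3, 3), 0, 0, 360, (0, 0, 255), cv.FILLED)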

The same function is then used on different images in the repo, as well as on the images we captured ourselves in practice.
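For instance, a minimal sketch of applying it to a single image from the repo and displaying the result with matplotlib (the filename here is hypothetical):

img = cv.imread('sample_image.jpg')  # hypothetical filename
estimated = pose_estimation(img)
plt.imshow(cv.cvtColor(estimated, cv.COLOR_BGR2RGB))  # OpenCV uses BGR, matplotlib expects RGB
plt.axis('off')
plt.show()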

The last part of the code shows how the video is captured and how we automatically run pose-estimation on each frame.

cap = cv.VideoCapture(0)
cap.set(cv.CAP_PROP_FPS, 10)

if not cap.isOpened():
    cap = cv.VideoCapture(0)
if not cap.isOpened():
    raise IOError("cannot open webcam")

while cv.waitKey(1) < 0:
    hasframe, frame = cap.read()
    if not hasframe:
        cv.waitKey(0)
        break
    pose_estimation(frame)
    cv.imshow('Pose estimation', frame)

A few notes: cv.VideoCapture(0) is used to start the webcam capture. cap.isOpened() returns a boolean telling us whether the webcam was opened or not, which in turn helps us raise an error if it was not. And last but not least, we use the cap.read() function to read each frame from the webcam capture object. This part of the code can also serve as a template for anything you may want to do by running a webcam and manipulating the incoming frames, i.e. it will look something like this:

cap = cv.VideoCapture(0)
cap.set(cv.CAP_PROP_FPS, 10)

if not cap.isOpened():
    cap = cv.VideoCapture(0)
if not cap.isOpened():
    raise IOError("cannot open webcam")

while cv.waitKey(1) < 0:
    hasframe, frame = cap.read()
    if not hasframe:
        cv.waitKey(0)
        break
    # perform your task on each frame
    output = function_to_run(frame)
    cv.imshow("the output:", output)

In conclusion: 

We learned quite a good deal about the literature on current pose-estimation tasks, and we also completely analyzed and performed the pose-estimation task ourselves. We hope this was a good read for you. We are just starting with OpenCV on our blog and would love to collaborate with our readers. So do comment if you want to collaborate and write with us.

Thanks for reading! Special thanks to Quan Hua, whose code repo we used and modified further according to our use cases.
