Custom Gesture Recognition Model for Video Playback using MediaPipe
Introduction
What is MediaPipe?
MediaPipe is Google's open-source framework for building multimodal (e.g., video, audio) machine learning pipelines. It is efficient and versatile, which makes it well suited to tasks like gesture recognition.
This tutorial shows how to build a custom model for gesture recognition tasks with the Google MediaPipe API. It focuses on video playback (running the recognizer in video mode), though the approach can be generalized to still-image and live-video-feed recognition.
from google.colab import files
import os
import tensorflow as tf
assert tf.__version__.startswith('2')
from mediapipe_model_maker import gesture_recognizer
import matplotlib.pyplot as plt
Upload the dataset zip file to Google Colab and unzip it:
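A minimal sketch of this step, assuming the archive is named dataset.zip and unpacks into one sub-folder per gesture label (both dataset.zip and the dataset folder name are placeholders), using the imports above:

# Upload the archive through the Colab file picker (assumes it is named dataset.zip)
uploaded = files.upload()
# Unzip into a local folder; one sub-folder per gesture label is expected
!unzip -q dataset.zip -d dataset
dataset_path = "dataset"
print(os.listdir(dataset_path))  # should list your gesture label folders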
Note that training can be fine-tuned using hyperparameters.
Customizable Parameters for ModelOptions (affecting the model architecture):
dropout_rate: The fraction of input units to drop in the dropout layer. Default is 0.05 (5%).
layer_widths: A list specifying the number of units in each hidden layer for the gesture model. Each value creates a new hidden layer with that number of units. These hidden layers include BatchNorm, Dropout, and ReLU. Default is an empty list (no hidden layers).
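As a quick, purely illustrative sketch (the values here are arbitrary), a ModelOptions object with a smaller dropout rate and one extra hidden layer of 64 units would be constructed like this:

# Illustrative values only: 2% dropout and a single extra hidden layer of 64 units
model_options = gesture_recognizer.ModelOptions(dropout_rate=0.02, layer_widths=[64])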
Customizable Parameters for HParams (affecting model accuracy):
learning_rate: The speed at which the model learns during training. Default is 0.001.
batch_size: The number of samples processed before the model updates. Default is 2.
epochs: The number of times the model will see the entire dataset during training. Default is 10.
steps_per_epoch: (Optional) The number of steps (batches) to run per epoch. If not set, the default is the size of the training dataset divided by the batch size.
shuffle: Whether the dataset is shuffled before training. Default is False.
lr_decay: The rate at which the learning rate decreases over time. Default is 0.99.
gamma: A parameter used for focal loss. Default is 2.
For example, the following trains a model with a dropout rate of 0.02 and a learning rate of 0.003:
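The original code for this step is not shown here, so the following is a sketch based on the MediaPipe Model Maker API; dataset_path and the "exported_model" directory name are assumptions carried over from the upload step above:

# Load the unzipped dataset; each sub-folder name becomes a gesture label
data = gesture_recognizer.Dataset.from_folder(
    dirname=dataset_path,
    hparams=gesture_recognizer.HandDataPreprocessingParams())
train_data, rest_data = data.split(0.8)
validation_data, test_data = rest_data.split(0.5)

# Fine-tuned hyperparameters: 2% dropout and a 0.003 learning rate
hparams = gesture_recognizer.HParams(learning_rate=0.003, export_dir="exported_model")
model_options = gesture_recognizer.ModelOptions(dropout_rate=0.02)
options = gesture_recognizer.GestureRecognizerOptions(
    model_options=model_options, hparams=hparams)

# Train the custom gesture recognizer and export the .task file used later
model = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options)
model.export_model()

export_model() writes a gesture_recognizer.task file into the export directory; that file is what the playback script below loads.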
Modify the previous hand tracking script to recognize gestures and control video playback.
import cv2
import mediapipe as mp
import numpy as np
from mediapipe.tasks.python.components.containers.landmark import NormalizedLandmark
from mediapipe.framework.formats import landmark_pb2

# Modify these paths with your custom model and video file
video_file_path = 'your_video.mp4'
gesture_model = 'gesture_recognizer.task'

# Small helper to resize a frame for display while preserving its aspect ratio
def ResizeWithAspectRatio(image, width=None, height=None, inter=cv2.INTER_AREA):
    h, w = image.shape[:2]
    if width is None and height is None:
        return image
    if width is None:
        r = height / float(h)
        dim = (int(w * r), height)
    else:
        r = width / float(w)
        dim = (width, int(h * r))
    return cv2.resize(image, dim, interpolation=inter)

# Aliases for the MediaPipe Tasks API used to build the gesture recognizer
BaseOptions = mp.tasks.BaseOptions
GestureRecognizer = mp.tasks.vision.GestureRecognizer
GestureRecognizerOptions = mp.tasks.vision.GestureRecognizerOptions
VisionRunningMode = mp.tasks.vision.RunningMode

# Hand tracker carried over from the previous hand tracking script
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(min_detection_confidence=0.5, min_tracking_confidence=0.5)

# Using your custom model to create the options for the gesture recognizer
gesture_options = GestureRecognizerOptions(
    base_options=BaseOptions(model_asset_buffer=open(gesture_model, "rb").read()),
    running_mode=VisionRunningMode.VIDEO)

# Open the input video and create an instance of the gesture recognizer
cap = cv2.VideoCapture(video_file_path)
frame_count = 0
with GestureRecognizer.create_from_options(gesture_options) as recognizer:
    writer = cv2.VideoWriter("demo.avi", cv2.VideoWriter_fourcc(*"MJPG"), 12.5, (640, 480))  # algo makes a frame every ~80 ms = 12.5 fps
    while cap.isOpened():
        success, image = cap.read()
        # Stop when a frame cannot be read (end of video or unreadable file)
        if not success:
            break

        # To improve performance, optionally mark the image as not writeable to
        # pass by reference.
        image.flags.writeable = False
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        results = hands.process(image)

        # Gesture classification data arrays
        current_gestures = []
        current_handedness = []
        current_score = []

        # Recognize gestures; in VIDEO mode the timestamp must increase
        # monotonically, so the frame index is used here
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=image)
        gesture_recognition_result = recognizer.recognize_for_video(mp_image, frame_count)
        frame_count += 1

        # Collect the necessary data into arrays for display (arrays because there can be two hands)
        if gesture_recognition_result is not None and any(gesture_recognition_result.gestures):
            print("Recognized gestures:")
            for single_hand_gesture_data in gesture_recognition_result.gestures:
                gesture_name = single_hand_gesture_data[0].category_name
                current_gestures.append(gesture_name)
            for single_hand_handedness_data in gesture_recognition_result.handedness:
                hand_name = single_hand_handedness_data[0].category_name
                current_handedness.append(hand_name)
            for single_hand_score_data in gesture_recognition_result.gestures:
                score = single_hand_score_data[0].score
                current_score.append(round(score, 2))

        # Display classified gesture data on the frame (first non-left hand only)
        y_pos = image.shape[0] - 70
        for x in range(len(current_gestures)):
            if current_handedness[x] != "Left":
                txt = current_handedness[x] + ": " + current_gestures[x] + " " + str(current_score[x])
                if current_gestures[x] == "supination":
                    cv2.putText(image, txt, (image.shape[1] - 400, y_pos), cv2.FONT_HERSHEY_SIMPLEX, 1, (218, 10, 3), 2, cv2.LINE_AA)
                    print(txt)
                    break
                else:
                    cv2.putText(image, txt, (image.shape[1] - 400, y_pos), cv2.FONT_HERSHEY_SIMPLEX, 1, (37, 245, 252), 2, cv2.LINE_AA)
                    print(txt)
                    break

        # Display the frame number
        image = ResizeWithAspectRatio(image, height=800)
        image = cv2.putText(
            image,
            "Frame {}".format(frame_count),
            (10, 50),
            cv2.QT_FONT_NORMAL,
            1,
            (0, 0, 255),
            1,
            cv2.LINE_AA)

        # Resize to the writer's frame size (640x480) before writing
        resized_frame = cv2.resize(image, (640, 480))
        writer.write(resized_frame)
        cv2.imshow('MediaPipe Hands', image)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

cap.release()
writer.release()
cv2.destroyAllWindows()
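The loop above only annotates and records frames. As one possible extension (not part of the original script), a recognized gesture could be mapped to a playback action, for example pausing whenever the custom "supination" class is detected. This hypothetical snippet would sit inside the while loop, after current_gestures has been filled:

# Hypothetical extension: pause on the current frame when the chosen gesture is seen
if "supination" in current_gestures:
    cv2.putText(image, "PAUSED", (10, 100), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2, cv2.LINE_AA)
    cv2.imshow('MediaPipe Hands', image)
    cv2.waitKey(0)  # blocks until any key is pressed, effectively pausing playback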
Conclusion
In this tutorial, we covered the steps to capture gesture data, train a custom gesture recognition model using MediaPipe, and integrate it for video playback control. This can be expanded with additional gestures and more advanced models for better accuracy.
We hope you enjoyed this tutorial. Happy coding with MediaPipe!
Remember to replace placeholders such as `'your_video.mp4'` and `'gesture_recognizer.task'` with actual paths relevant to your environment. This tutorial assumes a basic understanding of Python and familiarity with machine learning concepts.