
Custom Gesture Recognition Model for Video Playback using MediaPipe

Introduction

What is MediaPipe?

MediaPipe is Google's open-source framework for building multimodal (e.g., video, audio, etc.) machine learning pipelines. It is highly efficient and versatile, making it perfect for tasks like gesture recognition.

This tutorial shows how to build a custom model for gesture recognition tasks using the Google MediaPipe API. It focuses on video playback, though the approach generalizes to single-image and live-video recognition.

For more information, visit https://ai.google.dev/edge/mediapipe/solutions/vision/gesture_recognizer

Prerequisites

  • Basic Python programming skills

  • Familiarity with machine learning concepts

  • A Google account to use Colab

  • Basic knowledge of OpenCV

Chapter 1: Setup Environment

Install Dependencies

To get started, ensure you have the required libraries installed. We'll be using MediaPipe and OpenCV.

Open your command line terminal or Colab notebook and run:

!pip install mediapipe opencv-python
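As an optional sanity check that both packages are importable in the same environment, you can print their versions:

import mediapipe as mp
import cv2

print("mediapipe", mp.__version__)
print("opencv", cv2.__version__)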

Chapter 2: Capturing Gesture Data

Configure MediaPipe Hands

Create a Python script to detect hands using MediaPipe and OpenCV.

import cv2 
import mediapipe as mp

mp_hands = mp.solutions.hands
hands = mp_hands.Hands()
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # MediaPipe expects RGB input, so convert the BGR frame from OpenCV
    image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = hands.process(image)

    # draw the detected hand landmarks on the original frame
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            mp_drawing.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

    cv2.imshow('Hand Tracking', frame)
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Chapter 3: Train a Custom Gesture Recognition Model

Step 1: Set Up the Dataset for Training

We will train a custom model using Google Colab.

First, install the required packages:

!pip install --upgrade pip
!pip install mediapipe-model-maker

Import the required libraries:

from google.colab import files
import os
import tensorflow as tf
assert tf.__version__.startswith('2')

from mediapipe_model_maker import gesture_recognizer

import matplotlib.pyplot as plt

Upload the dataset ZIP file to Google Colab and unzip it:

!unzip rps_data_sample.zip
dataset_path = "rps_data_sample"
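Model Maker expects one subfolder per gesture label, plus a "none" folder containing non-gesture examples. A quick check (assuming the rock-paper-scissors sample dataset) that lists the labels it will pick up:

labels = sorted(d for d in os.listdir(dataset_path)
                if os.path.isdir(os.path.join(dataset_path, d)))
print(labels)  # e.g. ['none', 'paper', 'rock', 'scissors'] for the sample dataset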

Load the dataset:

data = gesture_recognizer.Dataset.from_folder(
    dirname=dataset_path,
    hparams=gesture_recognizer.HandDataPreprocessingParams()
)
train_data, rest_data = data.split(0.8)
validation_data, test_data = rest_data.split(0.5)

We split the dataset into 80% for training, 10% for validation, and 10% for testing: split(0.8) reserves 80% for training, and splitting the remaining 20% in half yields 10% each for validation and testing.
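To sanity-check the split sizes, the Model Maker Dataset reports its number of examples (via len(); its size property serves the same purpose if len() is unavailable in your version):

print("train:", len(train_data), "validation:", len(validation_data), "test:", len(test_data))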

Step 2: Train the Model

Train the model:

hparams = gesture_recognizer.HParams(export_dir="exported_model")
options = gesture_recognizer.GestureRecognizerOptions(hparams=hparams)
model = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options
)

Note that training can be fine-tuned with additional hyperparameters. Customizable parameters for ModelOptions:

  1. dropout_rate: The percentage of input units to ignore in a dropout layer. Default is 5%.

  2. layer_widths: A list specifying the number of units in each hidden layer for the gesture model. Each value creates a new hidden layer with that number of units. These hidden layers include BatchNorm, Dropout, and ReLU. Default is an empty list (no hidden layers).

Customizable Parameters for HParams (affecting model accuracy):

  1. learning_rate: The speed at which the model learns during training. Default is 0.001.

  2. batch_size: The number of samples processed before the model updates. Default is 2.

  3. epochs: The number of times the model will see the entire dataset during training. Default is 10.

  4. steps_per_epoch: (Optional) The number of steps (batches) to run per epoch. If not set, the default is the size of the training dataset divided by the batch size.

  5. shuffle: Whether the dataset is shuffled before training. Default is False.

  6. lr_decay: The rate at which the learning rate decreases over time. Default is 0.99.

  7. gamma: A parameter used for focal loss. Default is 2.

For example, the following trains a model with a dropout rate of 0.2 and a learning rate of 0.003:

hparams = gesture_recognizer.HParams(learning_rate=0.003, export_dir="exported_model_2")
model_options = gesture_recognizer.ModelOptions(dropout_rate=0.2)
options = gesture_recognizer.GestureRecognizerOptions(model_options=model_options, hparams=hparams)
model_2 = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options
)

Evaluate model accuracy and retrain if necessary:

loss, accuracy = model_2.evaluate(test_data)
print(f"Test loss:{loss}, Test accuracy:{accuracy}")

Export the model:

model.export_model()
!ls exported_model

Download the model:

files.download('exported_model/gesture_recognizer.task')

Chapter 4: Integrate with Video Playback

Step 1: Implement Gesture Control

Modify the previous hand tracking script to recognize gestures and control video playback.

import cv2
import mediapipe as mp
import numpy as np
from mediapipe.tasks.python.components.containers.landmark import NormalizedLandmark
from mediapipe.framework.formats import landmark_pb2

# modify these paths to point at your custom model and your video file
video_file_path = 'your_video.mp4'
gesture_model = 'gesture_recognizer.task'

# helper used below: resize a frame to a given height, preserving aspect ratio
def ResizeWithAspectRatio(frame, height):
    scale = height / frame.shape[0]
    return cv2.resize(frame, (int(frame.shape[1] * scale), height))

# MediaPipe Hands for landmark tracking (same as the previous script)
mp_hands = mp.solutions.hands
hands = mp_hands.Hands()

# aliases for the MediaPipe Tasks gesture recognizer API
BaseOptions = mp.tasks.BaseOptions
GestureRecognizer = mp.tasks.vision.GestureRecognizer
GestureRecognizerOptions = mp.tasks.vision.GestureRecognizerOptions
VisionRunningMode = mp.tasks.vision.RunningMode

# use your custom model to create the options for the gesture recognizer
gesture_options = GestureRecognizerOptions(
    base_options=BaseOptions(model_asset_buffer=open(gesture_model, "rb").read()),
    running_mode=VisionRunningMode.VIDEO)

# create an instance of the gesture recognizer and open the video file
with GestureRecognizer.create_from_options(gesture_options) as recognizer:

    cap = cv2.VideoCapture(video_file_path)
    frame_count = 0
    writer = cv2.VideoWriter("demo.avi", cv2.VideoWriter_fourcc(*"MJPG"), 12.5, (640, 480))  # algo makes a frame every ~80ms = 12.5 fps
    while cap.isOpened():
        success, image = cap.read()
        # if cannot open video file
        if not success:
            break
  
        # To improve performance, optionally mark the image as not writeable to
        # pass by reference.
        image.flags.writeable = False
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        results = hands.process(image)  # hand landmarks, available here if you want to draw them

        # gesture classification data arrays
        current_gestures = []
        current_handedness = []
        current_score = []

        # recognize gestures; the second argument to recognize_for_video is a
        # timestamp that must increase monotonically, so the frame index is used here
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=image)
        gesture_recognition_result = recognizer.recognize_for_video(mp_image, frame_count)
        frame_count += 1

        # convert back to BGR so drawing, display, and the video writer use correct colors
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

        # collect the necessary data into arrays for display (arrays because there can be two hands)
        if gesture_recognition_result is not None and any(gesture_recognition_result.gestures):
            print("Recognized gestures:")
            for single_hand_gesture_data in gesture_recognition_result.gestures:
                gesture_name = single_hand_gesture_data[0].category_name
                current_gestures.append(gesture_name)

            for single_hand_handedness_data in gesture_recognition_result.handedness:
                hand_name = single_hand_handedness_data[0].category_name
                current_handedness.append(hand_name)

            for single_hand_score_data in gesture_recognition_result.gestures:
                score = single_hand_score_data[0].score
                current_score.append(round(score, 2))


        # display classified gesture data on frames
        y_pos = image.shape[0] - 70
        for x in range(len(current_gestures)):
            if current_handedness[x] != "Left":
                txt = current_handedness[x] + ": " + current_gestures[x] + " " + str(current_score[x])
                if current_gestures[x] == "supination":
                    cv2.putText(image, txt, (image.shape[1] - 400, y_pos), cv2.FONT_HERSHEY_SIMPLEX, 1, (218,10,3), 2, cv2.LINE_AA)
                    print(txt)
                    break
                else:
                    cv2.putText(image, txt, (image.shape[1] - 400, y_pos), cv2.FONT_HERSHEY_SIMPLEX, 1, (37,245,252), 2, cv2.LINE_AA)
                    print(txt)
                    break
                    
        # displaying frame data
        image = ResizeWithAspectRatio(image, height=800)
        image = cv2.putText(
            image,
            "Frame {}".format(frame_count),
            (10, 50),
            cv2.QT_FONT_NORMAL,
            1,
            (0, 0, 255),
            1,
            cv2.LINE_AA
        )
      
        # Resize to original dimension before writing
        resized_frame = cv2.resize(image, (640, 480))

        writer.write(resized_frame)
        cv2.imshow('MediaPipe Hands', image)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    writer.release()
    cv2.destroyAllWindows()
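The script above annotates each frame with the recognized gesture but does not yet change playback. Below is a minimal sketch of one way to wire a gesture to pause/resume. Here recognize_frame is a hypothetical helper that wraps the recognizer calls shown above and returns the gesture names for a frame, and the 'supination' label is reused from the example; adapt both to your own model.

import cv2

def run_with_gesture_pause(cap, recognize_frame):
    """Play a cv2.VideoCapture and pause when the 'supination' gesture is seen.

    recognize_frame(frame) -> list of gesture names (assumed helper wrapping the
    recognizer logic shown earlier).
    """
    paused = False
    frame = None
    while cap.isOpened():
        if not paused:
            ok, frame = cap.read()
            if not ok:
                break
            if "supination" in recognize_frame(frame):
                paused = True              # freeze on the current frame
        if frame is not None:
            cv2.imshow('Playback', frame)
        key = cv2.waitKey(30) & 0xFF
        if key == ord(' '):                # space toggles pause manually
            paused = not paused
        elif key == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()

Pausing here simply stops reading new frames from the capture, which freezes file playback on the current frame until the gesture or keypress resumes it.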

Conclusion

In this tutorial, we covered the steps to capture gesture data, train a custom gesture recognition model using MediaPipe, and integrate it for video playback control. This can be expanded with additional gestures and more advanced models for better accuracy.

We hope you enjoyed this tutorial. Happy coding with MediaPipe!

Remember to replace placeholders such as `'your_video.mp4'` and `'gesture_recognizer.task'` with actual paths relevant to your environment. This tutorial assumes a basic understanding of Python and familiarity with machine learning concepts.
